R News
The Newsletter of the R Project
Volume 3/1, June 2003
Editorial
by Friedrich Leisch
Welcome to the first issue of R News in 2003, which is also the first issue where I serve as Editor-in-Chief. This year has already seen a lot of exciting activities in the R project; one of them was the Distributed Statistical Computing Workshop that took place here in Vienna. DSC 2003 is also partly responsible for the rather late release of this newsletter: both members of the editorial board and authors of regular columns were also involved in conference organization.
R 1.7.0 was released on 2003-04-16; see "Changes in R" for detailed release information. Two major new features of this release are support for name spaces in packages and stronger support of S4-style classes and methods. While the methods package by John Chambers has provided R with S4 classes for some time now, R 1.7.0 is the first version of R where the methods package is attached by default every time R is started. I know of several users who were rather anxious about the possible side effects of R going further in the direction of S4, but the transition seems to have gone very well (and we were not flooded by bug reports). Two articles by Luke Tierney and Doug Bates discuss name spaces and converting packages to S4, respectively.
In another article, Jun Yan and Tony Rossini explain how Microsoft Windows versions of R and R packages can be cross-compiled on Linux machines.
Changing my hat from R News editor to CRAN maintainer, I want to take this opportunity to announce Uwe Ligges as the new maintainer of the Windows packages on CRAN. Until now these have been built by Brian Ripley, who has spent a lot of time and effort on the task. Thanks a lot, Brian!
Also in this issue, Greg Warnes introduces the genetics package, Jürgen Groß discusses the computation of variance inflation factors, Thomas Lumley shows how to analyze survey data in R, and Carson, Murison and Mason use parallel computing on a Beowulf cluster to speed up gene-shaving.
We also have a new column with book reviews. 2002 has seen the publication of several books explicitly written for R (or R and S-Plus) users. We hope to make book reviews a regular column of this newsletter. If you know of a book that should be reviewed, please tell the publisher to send a free review copy to a member of the editorial board (who will forward it to a referee).
Last but not least, we feature a recreational page in this newsletter: Barry Rowlingson has written a crossword on R, and even offers a prize for the correct solution. In summary: we hope you will enjoy reading R News 3/1.
Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Contents of this issue:

Editorial ... 1
Name Space Management for R ... 2
Converting Packages to S4 ... 6
The genetics Package ... 9
Variance Inflation Factors ... 13
Building Microsoft Windows Versions of R and R packages under Intel Linux ... 15
Analysing Survey Data in R ... 17
Computational Gains Using RPVM on a Beowulf Cluster ... 21
R Help Desk ... 26
Book Reviews ... 28
Changes in R 1.7.0 ... 31
Changes on CRAN ... 36
Crossword ... 39
Recent Events ... 40
Vol. 3/1, June 2003
Name Space Management for R
by Luke Tierney
Introduction
In recent years R has become a major platform for the development of new data analysis software. As the complexity of software developed in R and the number of available packages that might be used in conjunction increase, the potential for conflicts among definitions increases as well. The name space mechanism added in R 1.7.0 provides a framework for addressing this issue.
Name spaces allow package authors to control which definitions provided by a package are visible to a package user and which ones are private and only available for internal use. By default definitions are private; a definition is made public by exporting the name of the defined variable. This allows package authors to create their main functions as compositions of smaller utility functions that can be independently developed and tested. Only the main functions are exported; the utility functions used in the implementation remain private. An incremental development strategy like this is often recommended for high level languages like R: a guideline often used is that a function that is too large to fit into a single editor window can probably be broken down into smaller units. This strategy has been hard to follow in developing R packages, since it would have led to packages containing many utility functions that complicate the user's view of the main functionality of a package and that may conflict with definitions in other packages.
Name spaces also give the package author explicit control over the definitions of global variables used in package code. For example, a package that defines a function mydnorm as
mydnorm <- function(z)
    1/sqrt(2 * pi) * exp(- z^2 / 2)
most likely intends exp, sqrt and pi to refer to the definitions provided by the base package. However, standard packages define their functions in the global environment, shown in Figure 1.
[Figure 1: Dynamic global environment. The search path runs from .GlobalEnv through attached packages such as package:ctest to package:base.]
The global environment is dynamic in the sense that top level assignments and calls to library and attach can insert shadowing definitions ahead of the definitions in base. If the assignment pi <- 3 has been made at top level or in an attached package, then the value returned by mydnorm(0) is not likely to be the value the author intended. Name spaces ensure that references to global variables in the base package cannot be shadowed by definitions in the dynamic global environment.
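The shadowing problem can be reproduced directly at top level; a minimal sketch (without name spaces):

```r
# A top level assignment shadows base's pi on the search path:
mydnorm <- function(z)
    1/sqrt(2 * pi) * exp(- z^2 / 2)
mydnorm(0)     # uses pi from the base package
pi <- 3        # shadowing definition in the global environment
mydnorm(0)     # now silently uses pi == 3, a different (wrong) value
rm(pi)         # remove the shadowing definition to restore base's pi
```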
Some packages also build on functionality provided by other packages. For example, the package modreg uses the function as.stepfun from the stepfun package. The traditional approach for dealing with this is to use calls to require or library to add the additional packages to the search path. This has two disadvantages. First, it is again subject to shadowing. Second, the fact that the implementation of a package A uses package B does not mean that users who add A to their search path are interested in having B on their search path as well. Name spaces provide a mechanism for specifying package dependencies by importing exported variables from other packages. Formally importing variables from other packages again ensures that these cannot be shadowed by definitions in the dynamic global environment.
From a user's point of view, packages with a name space are much like any other package. A call to library is used to load the package and place its exported variables in the search path. If the package imports definitions from other packages, then these packages will be loaded but they will not be placed on the search path as a result of this load.
Adding a name space to a package
A package is given a name space by placing a 'NAMESPACE' file at the top level of the package source directory. This file contains name space directives that define the name space. This declarative approach to specifying name space information allows tools that collect package information to extract the package interface and dependencies without processing the package source code. Using a separate file also has the benefit of making it easy to add a name space to, and remove a name space from, a package.
The main name space directives are export and import directives. The export directive specifies names of variables to export. For example, the directive
export(as.stepfun, ecdf, is.stepfun, stepfun)
exports four variables. Quoting is optional for standard names such as these, but non-standard names need to be quoted, as in
R News ISSN 1609-3631
export("[<-.tracker")
As a convenience, exports can also be specified with a pattern. The directive
exportPattern("\\.test$")
exports all variables ending in .test.

Import directives are used to import definitions from other packages with name spaces. The directive
import(mva)
imports all exported variables from the package mva. The directive
importFrom(stepfun, as.stepfun)
imports only the as.stepfun function from the stepfun package.
Two additional directives are available. The S3method directive is used to register methods for S3 generic functions; this directive is discussed further in the following section. The final directive is useDynLib. This directive allows a package to declaratively specify that a DLL is needed and should be loaded when the package is loaded.
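Putting the directives discussed above together, a complete 'NAMESPACE' file might look like the following sketch (the package name mypkg, the exported function names, and the DLL name are illustrative only):

```
useDynLib(mypkg)
import(mva)
importFrom(stepfun, as.stepfun)
export(myfit, my.predict)
exportPattern("\\.test$")
S3method(print, myfit)
```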
For a package with a name space, the two operations of loading the package and attaching the package to the search path are logically distinct. A package loaded by a direct call to library will first be loaded, if it is not already loaded, and then be attached to the search path. A package loaded to satisfy an import dependency will be loaded but will not be attached to the search path. As a result, the single initialization hook function .First.lib is no longer adequate and is not used by packages with name spaces. Instead, a package with a name space can provide two hook functions, .onLoad and .onAttach. These functions, if defined, should not be exported. Many packages will not need either hook function, since import directives take the place of require calls and useDynLib directives can replace direct calls to library.dynam.
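The two hook functions can be sketched as follows (the comments describing typical uses are illustrative; both functions receive the library and package names as arguments):

```r
# Hook functions for a package with a name space;
# neither function should be exported.
.onLoad <- function(libname, pkgname) {
    # run when the name space is loaded,
    # e.g. to initialize internal state
}
.onAttach <- function(libname, pkgname) {
    # run only when the package is attached to the search path,
    # e.g. to print a startup message
}
```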
Name spaces are sealed. This means that once a package with a name space is loaded, it is not possible to add or remove variables, or to change the values of variables defined in a name space. Sealing serves two purposes: it simplifies the implementation, and it ensures that the locations of variable bindings cannot be changed at run time. This will facilitate the development of a byte code compiler for R (Tierney, 2001).
Adding a name space to a package may complicate debugging package code. The function fix, for example, is no longer useful for modifying an internal package function, since it places the new definition in the global environment, and that new definition remains shadowed within the package by the original definition. As a result, it is a good idea not to add a name space to a package until it is completely debugged, and to remove the name space if further debugging is needed; this can be done by temporarily renaming the 'NAMESPACE' file. Other alternatives are being explored, and R 1.7.0 contains some experimental functions, such as fixInNamespace, that may help.
Name spaces and method dispatch
R supports two styles of object-oriented programming: the S3 style based on UseMethod dispatch, and the S4 style provided by the methods package. S3 style dispatch is still used the most, and some support is provided in the name space implementation released in R 1.7.0.
S3 dispatch is based on concatenating the generic function name and the name of a class to form the name of a method. The environment in which the generic function is called is then searched for a method definition. With name spaces this creates a problem: if a package is imported but not attached, then the method definitions it provides may not be visible at the call site. The solution provided by the name space system is a mechanism for registering S3 methods with the generic function. A directive of the form
S3method(print, dendrogram)
in the 'NAMESPACE' file of a package registers the function print.dendrogram defined in the package as the print method for class dendrogram. The variable print.dendrogram does not need to be exported. Keeping the method definition private ensures that users will not inadvertently call the method directly.
S4 classes and generic functions are by design more formally structured than their S3 counterparts, and should therefore be conceptually easier to integrate with name spaces than S3 generic functions and classes. For example, it should be fairly easy to allow for both exported and private S4 classes; the concatenation-based approach of S3 dispatch does not seem to be compatible with having private classes, except by some sort of naming convention. However, the integration of name spaces with S4 dispatch could not be completed in time for the release of R 1.7.0. It is quite likely that the integration can be accomplished by adding only two new directives, exportClass and importClassFrom.
A simple example
A simple artificial example may help to illustrate how the import and export directives work. Consider two packages named foo and bar. The R code for package foo in file 'foo.R' is
x <- 1
f <- function(y) c(x, y)
The 'NAMESPACE' file contains the single directive
export(f)
Thus the variable x is private to the package and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.
The second package bar has code file 'bar.R' containing the definitions
c <- function(...) sum(...)
g <- function(y) f(c(y, 7))
and 'NAMESPACE' file
import(foo)
export(g)
The function c is private to the package. The function g calls a global function c. Definitions provided in the package take precedence over imports and definitions in the base package; therefore the definition of c used by g is the one provided in bar.
Calling library(bar) loads bar and attaches its exports to the search path. Package foo is also loaded, but not attached to the search path. A call to g produces
> g(6)
[1] 1 13
This is consistent with the definitions of c in the two settings: in bar the function c is defined to be equivalent to sum, but in foo the variable c refers to the standard function c in base. Thus g(6) evaluates f(sum(6, 7)), i.e. f(13), and f in turn evaluates c(x, 13) with the standard c, producing the vector 1 13.
Some details
This section provides some details on the current implementation of name spaces. These details may change as name space support evolves.
Name spaces use R environments to provide static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name space environment, as shown in Figure 2. This environment consists of a set of three static frames, followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains imported definitions, and the third frame is the base package.

[Figure 2: Name space environment. The static frames (internal defs, imports, base) are followed by the dynamic global environment (.GlobalEnv, package:ctest, ..., package:base).]

Suppose the function mydnorm shown in the introduction is defined in a package with a name space that does not explicitly import any definitions. Then a call to mydnorm will cause R to evaluate the variable pi by searching for a binding in the name space environment. Unless the package itself defines a variable pi, the first binding found will be the one in the third frame, the definition provided in the base package. Definitions in the dynamic global environment cannot shadow the definition in base.
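The static frames can be inspected interactively by walking the environment chain behind a function from a package with a name space; a small sketch, using the standard stats package for illustration:

```r
# Walk the static environment chain of a function from a package
# that has a name space:
f <- stats::dnorm
e <- environment(f)         # frame 1: the package's internal definitions
e                           # prints as <environment: namespace:stats>
parent.env(e)               # frame 2: the imports frame
parent.env(parent.env(e))   # frame 3: the base name space
```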
When library is called to load a package with a name space, library first calls loadNamespace and then attachNamespace. loadNamespace first checks whether the name space is already loaded and registered with the internal registry of loaded name spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all imported name spaces, and definitions of exported variables of these packages are copied to the imports frame, the second frame of the new name space. Then the package code is loaded and run, and exports, S3 methods, and shared libraries are registered. Finally, the .onLoad hook is run if the package defines one, and the name space is sealed. Figure 3 illustrates the dynamic global environment and the internal data base of loaded name spaces after two packages, A and B, have been loaded by explicit calls to library, and package C has been loaded to satisfy import directives in package B.
The serialization mechanism used to save R workspaces handles name spaces by writing out a reference for a name space consisting of the package name and version. When the workspace is loaded, the reference is used to construct a call to loadNamespace. The current implementation ignores the version information; in the future this may be used to signal compatibility warnings or select among several available versions.
The registry of loaded name spaces can be examined using loadedNamespaces. In the current implementation, loaded name spaces are not unloaded automatically. Detaching a package with a name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.

[Figure 3: Internal implementation. The global search path contains .GlobalEnv, package:B, package:A, and package:base; the internal registry of loaded name spaces holds A, B, and C, each consisting of its internal defs, its imports frame (B's imports include the C exports and A exports), and base.]
Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
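Using the example packages from the previous section, a fully qualified reference would look like this sketch (it assumes the foo package from the example is installed):

```r
foo::f(2)   # calls the exported function f from foo's name space;
            # foo is loaded first if necessary, but it is not
            # attached to the search path
```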
Discussion
Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.
The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.
Some languages, including Ada, ML (Ullman, 1997), and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R, this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful and the associated additional complexity is warranted in R is not yet clear.
Bibliography
John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.
Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu
Converting Packages to S4
by Douglas Bates
Introduction
R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.
'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.
Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class, either for their own generic functions or for generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete because recent versions of some popular packages use name spaces to hide some of the S3 methods defined in the package.)
The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package, which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.
There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.
Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.
S3 versus S4 classes and methods
S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system; at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.
The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class foo. This can lead to surprising and unpleasant errors for the unwary.
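The danger can be illustrated with a small sketch: nothing stops an arbitrary object from claiming a class it does not conform to.

```r
# S3 classes are not validated, so any object can claim any class:
x <- list(a = 1)
class(x) <- "lm"    # x now claims to be a fitted linear model
# Calling summary(x) would dispatch to summary.lm, which fails in
# confusing ways because x lacks the components a real "lm" object has
```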
Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.
There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system this inheritance is described by assigning the class c("aov", "lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed, there were examples in some packages where the class inheritance was not consistently assigned.
By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.
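An explicit S4 class definition with validation might be sketched as follows (the class name, slots, and validity rule are illustrative, not taken from the article):

```r
# An S4 class with named, typed slots and a validity check:
setClass("position",
         representation(x = "numeric", y = "numeric"),
         validity = function(object) {
             if (length(object@x) == length(object@y)) TRUE
             else "x and y must have the same length"
         })
# objects are validated against the definition when created:
p <- new("position", x = c(1, 2, 3), y = c(4, 5, 6))
```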
S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.
S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names and order of its arguments are matched against those of the generic. (If the generic has a ... argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
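A minimal sketch of declaring a generic explicitly and then a method for it (the generic and class names are illustrative):

```r
setClass("circle", representation(r = "numeric"))
# declare a new generic explicitly ...
setGeneric("area", function(shape, ...) standardGeneric("area"))
# ... and a method whose signature names the class of 'shape'
setMethod("area", signature(shape = "circle"),
          function(shape, ...) pi * shape@r^2)
area(new("circle", r = 1))   # the area of a unit circle
```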
In summary, the S4 system is much more formal regarding classes, generics and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization and, in my opinion, a cleaner structure to the package.
Package conversion
Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.
S4 requires formal definitions of generics, classes and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.
Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.
Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.
Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.
We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.
We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language, and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and to use these classes in both the interpreted language and the compiled language.
The definition of S4 methods is more formal than in S3, but also more flexible because of the ability to match an argument signature, rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
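The collector pattern described above might be sketched like this (the generic name, classes, and argument rearrangement are illustrative only):

```r
setGeneric("fitModel",
           function(formula, data, ...) standardGeneric("fitModel"))
# the "collector" method matches the canonical, wordy argument form
# and actually fits the model
setMethod("fitModel", signature(formula = "formula", data = "data.frame"),
          function(formula, data, ...) {
              # ... do the actual model fitting here ...
          })
# a convenience method rearranges its arguments into the canonical
# form and passes them on with callGeneric
setMethod("fitModel", signature(formula = "formula", data = "matrix"),
          function(formula, data, ...)
              callGeneric(formula, data = as.data.frame(data), ...))
```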
Pitfalls
S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics and methods.
We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.
Any of these places could make sense for somemethods All of them have disadvantages If allmethods are documented separately there will bean explosion of the number of documentation filesto create and maintain That is an unwelcome bur-den Documenting methods in conjunction with thegeneric can work for internally defined generics butnot for those defined externally to the package Doc-umenting methods in conjunction with a class onlyworks well if the method signature consists of one ar-gument only but part of the power of S4 is the abilityto dispatch on the signature of multiple arguments
Others working with S4 packages including theBioconductor contributors are faced with this prob-lem of organizing documentation Gordon Smythand Vince Carey have suggested some strategies fororganizing documentation but more experience isdefinitely needed
One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call, so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric(), reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
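The pattern, and the copy it forces, can be sketched as follows (the classes and the generic are invented for illustration):

```r
library(methods)

setClass("fitted", representation(call = "call", coef = "numeric"))

setGeneric("fitIt", function(x, ...) standardGeneric("fitIt"))

## Collector method: does the fitting and records a call.
setMethod("fitIt", "data.frame", function(x, ...)
  new("fitted", call = match.call(), coef = coef(lm(y ~ z, x))))

## Small convenience method: rearranges its argument, then must
## patch the call slot of the value returned by callGeneric().
## The slot assignment copies the whole object as the nested
## calls unwind.
setMethod("fitIt", "matrix", function(x, ...) {
  thisCall <- match.call()
  x <- as.data.frame(x)
  ans <- callGeneric()
  ans@call <- thisCall   # forces a copy of 'ans'
  ans
})
```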
Conclusions
Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.
Douglas Bates
University of Wisconsin-Madison, USA
bates@stat.wisc.edu
The genetics Package: Utilities for handling genetic data

by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side-effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are single nucleotide polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'AT' and 'TA' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of copies of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles imparts a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist with the sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.

My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F)

• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )

• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )

• A dataframe or matrix with two columns:

  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )

For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)
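For instance (column names and values invented for this sketch):

```r
library(genetics)

## A toy data frame with two genotype columns stored as text;
## the column names and values are invented for this example.
df <- data.frame(subject = 1:4,
                 snp1 = c("A/A", "A/C", "C/C", "A/C"),
                 snp2 = c("G/G", "G/T", "T/T", "G/G"))

## Convert every column that looks like a genotype in one pass;
## non-genotype columns (here, 'subject') are left alone.
df <- makeGenotypes(df)

is.genotype(df$snp1)
is.genotype(df$subject)
```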
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical: Each allele combination acts differently.
This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )

additive: The effect depends on the number of copies of a specific allele (0, 1, or 2).
The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )

dominant/recessive: The effect depends only on the presence or absence of a specific allele.
The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles, when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.

After considering several potential implementations, we chose to implement the genotype class as an extension to the built-in factor variable type with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:

  alleles.x <- attr(x,"allele.map")
  alleles.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.

This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ([], [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:
allele Extracts the individual alleles as a character matrix.

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Converts appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).
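A few of these helpers in action, on a small invented genotype vector:

```r
library(genetics)

g <- genotype(c("A/A", "A/C", "C/C", "A/C"))

homozygote(g)         # TRUE for A/A and C/C
carrier(g, "A")       # TRUE wherever at least one A is present
allele.count(g, "A")  # 2, 1, 0, 1 copies of A
allele(g, 1)          # the first allele of each observation
```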
The genetics package provides four functions related to Hardy-Weinberg equilibrium:

diseq Estimates or computes confidence intervals for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Computes pairwise linkage disequilibrium between genetic markers.

LDtable Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color-coded by significance.

LDplot Plots linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Computes the probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)
Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

       C     T
C           0.12
T   0.12

Scaled Disequilibrium for each allele pair (D')

       C     T
C           0.56
T   0.56

Correlation coefficient for each allele pair (r)

       C      T
C   1.00  -0.56
T  -0.56   1.00
Overall Values

    Value
D    0.12
D'   0.56
r   -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed  95% CI             NA's
Overall D      0.121  ( 0.073,  0.159)      0
Overall D'     0.560  ( 0.373,  0.714)      0
Overall r     -0.560  (-0.714, -0.373)      0

            Contains Zero?
Overall D   NO
Overall D'  NO
Overall r   NO
Significance Test:

	Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137,
N2 = 63, p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                 a1691g  c2249t
c104t  D          -0.01   -0.03
c104t  D'          0.05    1.00
c104t  Corr.      -0.03   -0.21
c104t  X^2         0.16    8.51
c104t  P-value     0.69  0.0035
c104t  n            100     100

a1691g D                  -0.01
a1691g D'                  0.31
a1691g Corr.              -0.08
a1691g X^2                 1.30
a1691g P-value             0.25
a1691g n                    100
>
> LDtable(ld)   # graphical display

[Figure: "Linkage Disequilibrium" plot produced by LDtable: a color-coded table of (-)D' (N) and p-value for each pair of the markers c104t, a1691g, and c2249t.]
> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                           Estimate Std. Error
(Intercept)                 -0.1807     0.5996
homozygote(c104t, "C")TRUE   1.0203     0.2290
allele.count(a1691g, "G")   -0.0905     0.1175
c2249tT/C                    0.4291     0.6873
c2249tT/T                    0.3476     0.5848

                           t value Pr(>|t|)
(Intercept)                  -0.30     0.76
homozygote(c104t, "C")TRUE    4.46  2.3e-05
allele.count(a1691g, "G")    -0.77     0.44
c2249tT/C                     0.62     0.53
c2249tT/T                     0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future, I intend to add functions to perform haplotype imputation and to generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model

$$ y = \underset{n \times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1} X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.
This variance is given by
$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2} \, \mathrm{var}(\hat\beta_j) \, x_j' C x_j, \qquad j = 2, \ldots, p. $$
It can also be written in the form
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$

where

$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
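The identity $\mathrm{VIF}_j = 1/(1 - R_j^2)$ is easy to verify numerically in base R; the following sketch (variable names invented) computes the centered VIF of one regressor both from $R_j^2$ and directly from $(X'X)^{-1}$:

```r
## Verify VIF_j = 1/(1 - R_j^2) for the regressor x2.
set.seed(1)
n  <- 50
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, 0, 0.5)   # correlated with x2
x4 <- rnorm(n)
y  <- 1 + x2 + x3 + x4 + rnorm(n)

## Via the centered R_j^2 of x2 on the other regressors:
r2   <- summary(lm(x2 ~ x3 + x4))$r.squared
vif2 <- 1 / (1 - r2)

## Directly: VIF_j = [(X'X)^{-1}]_{jj} * x_j' C x_j
fit  <- lm(y ~ x2 + x3 + x4)
V    <- summary(fit)$cov.unscaled
vif2.direct <- V["x2", "x2"] * sum((x2 - mean(x2))^2)

all.equal(vif2, vif2.direct)   # TRUE
```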
The centered variance inflation factor (as well as a generalized variance inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:

> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors

As pointed out by Belsley (1991, p. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, even though the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$ \tilde R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2       x3       x4       x5
   2540.378  2510.568 2.792701 2.489071 4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \tilde R_j^2$. Hence, the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF $> 10$) for harmful collinearity should be greater when the uncentered VIFs are regarded.
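A minimal sketch of this computation in base R, using a regressor that is deliberately near-collinear with the intercept (made-up data):

```r
## Uncentered VIFs as diag((X'X)^{-1}) * diag(X'X).
set.seed(2)
n   <- 50
x2  <- 5 + rnorm(n, 0, 0.1)  # nearly 5 times the constant column
y   <- 10 + x2 + rnorm(n)
fit <- lm(y ~ x2)

V    <- summary(fit)$cov.unscaled     # (X'X)^{-1}
Vi   <- crossprod(model.matrix(fit))  # X'X
uvif <- diag(V) * diag(Vi)
uvif   # large for both the intercept and x2
```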
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(Centered.VIF = c.struct,
                  Uncentered.VIF = uc.struct))
    } else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(Uncentered.VIF = uc.struct))
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF, Variance Inflation Factor. URL http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux
A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, to automate the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the Makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the Makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.

Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source into a separate directory, RCrossBuild/downloads (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools and unpacks the cross-compiler tools into it.

Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR, in which to carry out the cross-building of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.

Build Linux R if needed

make LinuxR

This step is required, as of R-1.7.0, if you don't have a current version of Linux R on your system and you don't have permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, run

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in addition to the variables above.
Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. It may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do

make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, with a name ending in ".tar.gz", as usually created by the R CMD build command; the file name includes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show exactly what each step does.
The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu
A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R

by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed, and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?
If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

$$\sum_i \frac{1}{\pi_i} Y_i$$
regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
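As a toy illustration (my own sketch, not an example from the article), the Horvitz–Thompson estimate of a population total can be computed directly from the sampled values and their inclusion probabilities:

```r
# Hypothetical example: inverse-probability weighting recovers a
# population total under unequal-probability Bernoulli sampling.
set.seed(42)
pop <- rnorm(10000, mean = 50, sd = 10)   # made-up population values
pi <- ifelse(pop > 50, 0.10, 0.02)        # unequal inclusion probabilities
take <- runif(10000) < pi                 # who ends up in the sample
Y <- pop[take]
ht.total <- sum(Y / pi[take])             # Horvitz-Thompson estimate
```

Averaging ht.total over repeated samples recovers sum(pop), which is the sense in which the estimator is unbiased.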
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance: the former decreasing it, and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2$$

where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
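A literal R transcription of the variance formula above might look as follows (my own sketch, not code from the survey package; svyCprod's actual computations are more general):

```r
# Sketch: variance of a weighted total under stratified cluster
# sampling, transcribing the displayed formula term by term.
# Y, pi, strata, and cluster are vectors with one entry per observation.
var.wtotal <- function(Y, pi, strata, cluster) {
  v <- 0
  for (s in unique(strata)) {
    i <- strata == s
    ns <- length(unique(cluster[i]))         # clusters in stratum s
    ybar <- weighted.mean(Y[i], 1 / pi[i])   # weighted stratum mean
    v <- v + ns / (ns - 1) * sum(((Y[i] - ybar) / pi[i])^2)
  }
  v
}
```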
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is:
imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGEP + POLCTC, design = imdsgn)
produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
     pmin(3, POLCTC)
AGEP       0      1      2       3
   0  473079 411177 655211  276013
   1  162985  72498 365445  863515
   2  144000  84519 126126 1057654
   3   89108  68925 110523 1123405
   4  133902 111098 128061 1069026
   5  137122  31668  19027 1121521
   6  127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
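Proportions such as these can be read off by rescaling the rows of the weighted table. Since svytable returns a table-like object, base R's prop.table should apply to it directly (an assumption on my part, not something stated in the article):

```r
# Row proportions within each age, rounded for display.
tab <- svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
round(prop.table(tab, margin = 1), 2)
```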
To examine whether this varies by city size, we could tabulate
svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse; a regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, with 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP), design = imdsgn, family = binomial)
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP, design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
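For instance (my own suggestion, not from the article), starting values for the Figure 2 model could come from an ordinary unweighted fit that ignores the design; df here is a hypothetical data frame holding y, x, and z:

```r
# Unweighted fit ignoring the survey design, used only to seed svymle.
fit0 <- lm(y ~ x + z, data = df)
start.mean <- coef(fit0)          # starting values for the mean formula
start.sd <- summary(fit0)$sigma   # starting value for sd
```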
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I^{-1} U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods, for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G., and Thompson, D.J. (1951). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene, and the columns contain samples. N is usually of size 10^3, and p usually no greater than 10^2.

2. The leading principal component of X is calculated, and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, ..., and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap, and calculate G_k = R²_k − R̄*²_k, where R̄*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:

$$\mathrm{newX}[i,] = X[i,] \underbrace{\left( I - \bar{x}_{S_k} (\bar{x}_{S_k}^T \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^T \right)}_{IPx}$$
6. The next optimal cluster is found by repeating the process, using newX.
R News ISSN 1609-3631
Vol 31 June 2003 22
Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
$$\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}$$
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
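Amdahl's law is easy to check numerically. With the parallel proportion of 0.78 reported later in this article, a small R helper gives the speedup bounds quoted there:

```r
# Amdahl's law: serial proportion rs = 1 - rp, n processors.
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(0.78, 2)   # about 1.64
amdahl(0.78, 10)  # about 3.36
```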
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.
• Orthogonalization children: each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication (a master task connected to slave tasks 1, 2, 3, ..., n)
Bootstrapping
Bootstrapping is used to find R̄*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
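In code, the serial version described here amounts to a single apply call (a sketch of the idea, not the authors' code; apply traverses rows and returns them as columns, so the result must be transposed back):

```r
# One bootstrap replicate: permute the entries within each row of X.
Xb <- t(apply(X, 1, sample))
```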
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of the R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program; these steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  ## send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  ## get the R squared values back and
  ## take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean of all R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, an R² value is calculated, and the result is returned to the master. The following code performs these operations in the children tasks.
## receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
## perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
## send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows, and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process.
par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  ## initialize the send buffer
  .PVM.initsend()
  ## pack the IPx matrix and send it to
  ## all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  ## divide the X matrix into blocks and
  ## send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    X.part <- X[start:end, ]
    .PVM.pkdblmat(X.part)
    .PVM.send(orth.children[i], start)
  }
  ## wait for the blocks to be returned
  ## and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master.
## wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
X.part <- .PVM.upkdblmat()
Z <- X.part
## orthogonalize X
for (i in 1:dim(X.part)[1]) {
  Z[i, ] <- X.part[i, ] %*% IPx
}
## send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
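Since each slave simply multiplies its rows by IPx, the whole orthogonalization collapses, in serial form, to a single matrix product (a sketch of the equivalence, not code from the article):

```r
# Serial equivalent of the parallel orthogonalization step:
# applying IPx to every row of X at once.
Z <- X %*% IPx
dimnames(Z) <- dimnames(X)
```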
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400, and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results: computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization
Clearly the bootstrapping step is quite time-consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving. The orthogonalization equates to no more than 1%, so any speedup through its parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases n = 2 and n = 10 gives speedups of 1.63 and 3.35, respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the bootstrapping, orthogonalization, and overall gene-shaving computations, under the serial, 2-node, and 16-node environments
The observed speedup for the two runs of parallel gene-shaving, using the N = 12800 expression matrix, is as follows:

$$S = \frac{T_{serial}}{T_{parallel}}$$

$$S_{2\,nodes} = \frac{1301.18}{825.34} = 1.6 \qquad S_{16\,nodes} = \frac{1301.18}{549.59} = 2.4$$
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for the bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different levels of R-related experience require different levels of help. Therefore, a description of the available manuals, the help functions within R, and the mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and of compiling and linking this code into R.

¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
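For example, the overview for an installed package can be displayed as follows (a minimal sketch, using the always-available base package purely for illustration):

```r
# Show the description and index of an installed package;
# any package-specific documents are listed here as well.
library(help = "base")
```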
Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by help(). The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
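For instance, the calls below open the documentation for mean() and then switch the current session to HTML help (all standard R calls; the effect of the last one lasts for the rest of the session):

```r
# Two equivalent ways to open the documentation for mean():
help("mean")
?mean

# Use HTML help (with links between related topics) for this session:
options(htmlhelp = TRUE)
```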
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
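A brief illustration (the exact matches depend on which packages are installed):

```r
# Full-text search over the installed documentation:
help.search("regression")

# Objects on the search path whose names match a regular expression:
apropos("^help")   # e.g. "help", "help.search", "help.start"
```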
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names, and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or for documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, and hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers on r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³ http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴ Accessible via http://www.R-project.org/mail.html
⁵ The archives are searchable; see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th edition. New York: Springer-Verlag.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
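The clutter the reviewer warns about is easy to demonstrate; the toy data frames below are our own, not taken from the book's scripts:

```r
# Two toy data frames sharing the column name 'x':
d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)

attach(d1)
attach(d2)    # R reports that 'x' from d1 is now masked
x             # refers to d2's x, i.e. 4 5 6

# Detaching (or quitting the session, as suggested above) cleans up:
detach(d2)
detach(d1)
```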
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book; many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially in self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences between the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate the many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.), as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as for paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the coverage of variable selection in multiple linear regression to be omitted entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both the language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression, contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both the book and the help page.

Chapter 5 reads as a natural extension of linear models and concentrates on the error distributions provided by the exponential family and the corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, with methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, the detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and on S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from its target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform, and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.

• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.
bull The postscript output makes use of relativemoves and so is somewhat more compact
bull and crossprod() for complex argumentsmake use of BLAS routines and so may bemuch faster on some platforms
bull arima() has coef() logLik() (and hence AIC)and vcov() methods
bull New function asdifftime() for time-intervaldata
bull basename() and dirname() are now vector-ized
bull biplotdefault() mva allows lsquoxlabrsquo andlsquoylabrsquo parameters to be set (without partiallymatching to lsquoxlabsrsquo and lsquoylabsrsquo) (Thanks toUwe Ligges)
bull New function captureoutput() to sendprinted output from an expression to a connec-tion or a text string
bull ccf() (pckage ts) now coerces its x and y argu-ments to class ts
bull chol() and chol2inv() now use LAPACK rou-tines by default
bull asdist() is now idempotent ie works fordist objects
bull Generic function confint() and lsquolmrsquo method(formerly in package MASS which has lsquoglmrsquoand lsquonlsrsquo methods)
bull New function constrOptim() for optimisationunder linear inequality constraints
bull Add lsquodifftimersquo subscript method and meth-ods for the group generics (Thereby fixingPR2345)
bull downloadfile() can now use HTTP proxieswhich require lsquobasicrsquo usernamepassword au-thentication
bull dump() has a new argument lsquoenvirrsquo The searchfor named objects now starts by default in theenvironment from which dump() is called
bull The editmatrix() and editdataframe()editors can now handle logical data
bull New argument lsquolocalrsquo for example() (sug-gested by Andy Liaw)
bull New function filesymlink() to create sym-bolic file links where supported by the OS
bull New generic function flush() with a methodto flush connections
bull New function force() to force evaluation of aformal argument
bull New functions getFromNamespace()fixInNamespace() and getS3method() to fa-cilitate developing code in packages withnamespaces
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.
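A minimal round trip through a gzip-compressed connection (a sketch; serialize()/unserialize() are used here just to have something to write and read):

```r
# Write an R object through a gzcon()-wrapped file connection, then
# read it back through another wrapped connection.
f <- tempfile()
con <- gzcon(file(f, "wb"))
serialize(letters, con)
close(con)

con <- gzcon(file(f, "rb"))
restored <- unserialize(con)
close(con)
stopifnot(identical(restored, letters))
```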
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
R News ISSN 1609-3631
Vol. 3/1, June 2003
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing the casewise changed coefficients not to be computed. This makes plot.lm() much quicker for large data sets.

• load() now returns, invisibly, a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
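For instance, mapply() applies a function elementwise over several vectors at once:

```r
# Each call receives one element from each vector; results are
# simplified to a vector when possible.
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
stopifnot(identical(sums, c(5L, 7L, 9L)))
```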
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units; use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the generated documentation shell is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
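A short illustration of r2dtable(): every sampled table has exactly the requested margins.

```r
# Draw two random 2x2 tables with row sums c(3, 2) and column sums c(2, 3).
set.seed(1)
tabs <- r2dtable(2, r = c(3, 2), c = c(2, 3))
stopifnot(length(tabs) == 2)
stopifnot(all(sapply(tabs, function(tb) all(rowSums(tb) == c(3, 2)))))
stopifnot(all(sapply(tabs, function(tb) all(colSums(tb) == c(2, 3)))))
```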
• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
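A brief sketch of the pair: rmultinom() draws whole count vectors, dmultinom() evaluates the probability of one particular outcome.

```r
# Each column of the result is one multinomial draw; the counts in
# a draw sum to 'size'.
set.seed(42)
x <- rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))
stopifnot(all(colSums(x) == 10))

# Probability of observing counts (2, 3, 5) in a single draw of size 10.
p <- dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))
stopifnot(p > 0, p < 1)
```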
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to two decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for "terms" objects.

• New argument 'silent' to try().
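With silent = TRUE, try() suppresses the error message but still returns an object of class "try-error" on failure:

```r
# The error is caught quietly; the result can be tested afterwards.
res <- try(stop("boom"), silent = TRUE)
stopifnot(inherits(res, "try-error"))
```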
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently it is little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed' by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility of ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class "slides".

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

plspcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid omitted; only the cell numbers (1-27) survived extraction.]
Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Name Space Management for R
by Luke Tierney
Introduction

In recent years R has become a major platform for the development of new data analysis software. As the complexity of software developed in R and the number of available packages that might be used in conjunction increase, the potential for conflicts among definitions increases as well. The name space mechanism added in R 1.7.0 provides a framework for addressing this issue.

Name spaces allow package authors to control which definitions provided by a package are visible to a package user and which ones are private and only available for internal use. By default definitions are private; a definition is made public by exporting the name of the defined variable. This allows package authors to create their main functions as compositions of smaller utility functions that can be independently developed and tested. Only the main functions are exported; the utility functions used in the implementation remain private. An incremental development strategy like this is often recommended for high level languages like R: a guideline often used is that a function that is too large to fit into a single editor window can probably be broken down into smaller units. This strategy has been hard to follow in developing R packages, since it would have led to packages containing many utility functions that complicate the user's view of the main functionality of a package and that may conflict with definitions in other packages.

Name spaces also give the package author explicit control over the definitions of global variables used in package code. For example, a package that defines a function mydnorm as
mydnorm <- function(z)
    1/sqrt(2 * pi) * exp(-z^2/2)

most likely intends exp, sqrt and pi to refer to the definitions provided by the base package. However, standard packages define their functions in the global environment, shown in Figure 1.
[Figure 1: a search path diagram showing .GlobalEnv, package:ctest, and package:base.]

Figure 1: Dynamic global environment
The global environment is dynamic in the sense that top level assignments and calls to library and attach can insert shadowing definitions ahead of the definitions in base. If the assignment pi <- 3 has been made at top level or in an attached package, then the value returned by mydnorm(0) is not likely to be the value the author intended. Name spaces ensure that references to global variables in the base package cannot be shadowed by definitions in the dynamic global environment.
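The shadowing problem is easy to demonstrate at top level, where a function defined in the global environment finds pi there before it finds base's definition:

```r
# mydnorm's enclosing environment is the global environment, so a
# top-level assignment to 'pi' changes its result.
mydnorm <- function(z) 1/sqrt(2 * pi) * exp(-z^2/2)
stopifnot(all.equal(mydnorm(0), dnorm(0)))

pi <- 3  # shadows base's pi for mydnorm()
stopifnot(!isTRUE(all.equal(mydnorm(0), dnorm(0))))
```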
Some packages also build on functionality provided by other packages. For example, the package modreg uses the function as.stepfun from the stepfun package. The traditional approach for dealing with this is to use calls to require or library to add the additional packages to the search path. This has two disadvantages. First, it is again subject to shadowing. Second, the fact that the implementation of a package A uses package B does not mean that users who add A to their search path are interested in having B on their search path as well. Name spaces provide a mechanism for specifying package dependencies by importing exported variables from other packages. Formally importing variables from other packages again ensures that these cannot be shadowed by definitions in the dynamic global environment.

From a user's point of view, packages with a name space are much like any other package. A call to library is used to load the package and place its exported variables in the search path. If the package imports definitions from other packages, then these packages will be loaded, but they will not be placed on the search path as a result of this load.
Adding a name space to a package

A package is given a name space by placing a 'NAMESPACE' file at the top level of the package source directory. This file contains name space directives that define the name space. This declarative approach to specifying name space information allows tools that collect package information to extract the package interface and dependencies without processing the package source code. Using a separate file also has the benefit of making it easy to add a name space to, and remove a name space from, a package.

The main name space directives are export and import directives. The export directive specifies names of variables to export. For example, the directive

export(as.stepfun, ecdf, is.stepfun, stepfun)

exports four variables. Quoting is optional for standard names such as these, but non-standard names need to be quoted, as in
R News ISSN 1609-3631
Vol 31 June 2003 3
export("[<-.tracker")

As a convenience, exports can also be specified with a pattern. The directive

exportPattern("test$")

exports all variables ending with test.

Import directives are used to import definitions from other packages with name spaces. The directive

import(mva)

imports all exported variables from the package mva. The directive

importFrom(stepfun, as.stepfun)

imports only the as.stepfun function from the stepfun package.
Two additional directives are available. The S3method directive is used to register methods for S3 generic functions; this directive is discussed further in the following section. The final directive is useDynLib. This directive allows a package to declaratively specify that a DLL is needed and should be loaded when the package is loaded.
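Taken together, a hypothetical package's 'NAMESPACE' file might combine these directives as follows (the package, function, and class names here are invented for illustration, not taken from a real package):

```
useDynLib(mypkg)
import(mva)
importFrom(stepfun, as.stepfun)
export(doit)
S3method(print, myfit)
```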
For a package with a name space, the two operations of loading the package and attaching the package to the search path are logically distinct. A package loaded by a direct call to library will first be loaded, if it is not already loaded, and then be attached to the search path. A package loaded to satisfy an import dependency will be loaded but will not be attached to the search path. As a result, the single initialization hook function .First.lib is no longer adequate and is not used by packages with name spaces. Instead, a package with a name space can provide two hook functions, .onLoad and .onAttach. These functions, if defined, should not be exported. Many packages will not need either hook function, since import directives take the place of require calls, and useDynLib directives can replace direct calls to library.dynam.
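As a sketch, the hook functions for such a package might look as follows (the bodies are invented; R supplies the libname and pkgname arguments when it runs the hooks):

```r
.onLoad <- function(libname, pkgname) {
    ## run whenever the name space is loaded, whether by
    ## library() or to satisfy an import; one-time setup goes here
}

.onAttach <- function(libname, pkgname) {
    ## run only when the package is attached to the search path
}
```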
Name spaces are sealed. This means that once a package with a name space is loaded, it is not possible to add or remove variables or to change the values of variables defined in a name space. Sealing serves two purposes: it simplifies the implementation, and it ensures that the locations of variable bindings cannot be changed at run time. This will facilitate the development of a byte code compiler for R (Tierney, 2001).
Adding a name space to a package may complicate debugging package code. The function fix, for example, is no longer useful for modifying an internal package function, since it places the new definition in the global environment, and that new definition remains shadowed within the package by the original definition. As a result, it is a good idea not to add a name space to a package until it is completely debugged, and to remove the name space if further debugging is needed; this can be done by temporarily renaming the 'NAMESPACE' file. Other alternatives are being explored, and R 1.7.0 contains some experimental functions, such as fixInNamespace, that may help.
Name spaces and method dispatch
R supports two styles of object-oriented programming: the S3 style based on UseMethod dispatch, and the S4 style provided by the methods package. S3-style dispatch is still used the most, and some support is provided in the name space implementation released in R 1.7.0.
S3 dispatch is based on concatenating the generic function name and the name of a class to form the name of a method. The environment in which the generic function is called is then searched for a method definition. With name spaces this creates a problem: if a package is imported but not attached, then the method definitions it provides may not be visible at the call site. The solution provided by the name space system is a mechanism for registering S3 methods with the generic function. A directive of the form
S3method(print, dendrogram)
in the 'NAMESPACE' file of a package registers the function print.dendrogram defined in the package as the print method for class dendrogram. The variable print.dendrogram does not need to be exported. Keeping the method definition private ensures that users will not inadvertently call the method directly.
S4 classes and generic functions are by design more formally structured than their S3 counterparts and should therefore be conceptually easier to integrate with name spaces than S3 generic functions and classes. For example, it should be fairly easy to allow for both exported and private S4 classes; the concatenation-based approach of S3 dispatch does not seem to be compatible with having private classes, except by some sort of naming convention. However, the integration of name spaces with S4 dispatch could not be completed in time for the release of R 1.7.0. It is quite likely that the integration can be accomplished by adding only two new directives, exportClass and importClassFrom.
A simple example
A simple artificial example may help to illustrate how the import and export directives work. Consider two packages named foo and bar. The R code for package foo in file 'foo.R' is
x <- 1
f <- function(y) c(x, y)
The 'NAMESPACE' file contains the single directive
export(f)
Thus the variable x is private to the package and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.
The second package, bar, has code file 'bar.R' containing the definitions
c <- function(...) sum(...)
g <- function(y) f(c(y, 7))
and 'NAMESPACE' file
import(foo)
export(g)
The function c is private to the package. The function g calls a global function c. Definitions provided in the package take precedence over imports and definitions in the base package; therefore the definition of c used by g is the one provided in bar.
Calling library(bar) loads bar and attaches its exports to the search path. Package foo is also loaded but not attached to the search path. A call to g produces
> g(6)
[1] 1 13
This is consistent with the definitions of c in the two settings: in bar the function c is defined to be equivalent to sum, but in foo the variable c refers to the standard function c in base.
Some details
This section provides some details on the current implementation of name spaces. These details may change as name space support evolves.
Name spaces use R environments to provide static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name space environment, as shown in Figure 2. This environment consists of a set of three static frames followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains
[Figure 2: Name space environment. Three static frames (the package's internal definitions, its imports, and base) are followed by the dynamic global environment (.GlobalEnv, attached packages such as package:ctest, ..., package:base).]
imported definitions, and the third frame is the base package. Suppose the function mydnorm shown in the introduction is defined in a package with a name space that does not explicitly import any definitions. Then a call to mydnorm will cause R to evaluate the variable pi by searching for a binding in the name space environment. Unless the package itself defines a variable pi, the first binding found will be the one in the third frame, the definition provided in the base package. Definitions in the dynamic global environment cannot shadow the definition in base.
When library is called to load a package with a name space, library first calls loadNamespace and then attachNamespace. loadNamespace first checks whether the name space is already loaded and registered with the internal registry of loaded name spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all imported name spaces, and definitions of exported variables of these packages are copied to the imports frame, the second frame of the new name space. Then the package code is loaded and run, and exports, S3 methods, and shared libraries are registered. Finally, the .onLoad hook is run if the package defines one, and the name space is sealed. Figure 3 illustrates the dynamic global environment and the internal data base of loaded name spaces after two packages, A and B, have been loaded by explicit calls to library and package C has been loaded to satisfy import directives in package B.
The serialization mechanism used to save R workspaces handles name spaces by writing out a reference for a name space consisting of the package name and version. When the workspace is loaded, the reference is used to construct a call to loadNamespace. The current implementation ignores the version information; in the future this may be used to signal compatibility warnings or to select among several available versions.
The registry of loaded name spaces can be examined using loadedNamespaces. In the current implementation, loaded name spaces are not unloaded automatically. Detaching a package with a
[Figure 3: Internal implementation. The global search path (.GlobalEnv, the exports of packages B and A, ..., package:base) is shown alongside the internal registry of loaded name spaces for A, B, and C, each name space consisting of its internal definitions, its imports frame, and base.]
name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.
Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
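Continuing the foo example above, a minimal illustration:

```r
## Loads the foo name space if necessary and calls its exported
## function f, without attaching foo to the search path.
foo::f(2)
```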
Discussion
Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.
The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.
Some languages, including Ada, ML (Ullman, 1997), and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R, this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful and the associated additional complexity is warranted in R is not yet clear.
Bibliography
John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.
Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu
Converting Packages to S4

by Douglas Bates
Introduction
R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.
'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.
Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class and either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete, because recent versions of some popular packages use name spaces to hide some of the S3 methods defined in the package.)
The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies, John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package, which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.
There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.
Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.
S3 versus S4 classes and methods
S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system; at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.
The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class foo. This can lead to surprising and unpleasant errors for the unwary.
Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.
There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system, this inheritance is described by assigning the class c("aov","lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed there were examples in some packages where the class inheritance was not consistently assigned.
By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class
is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.
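A minimal sketch of an explicit class definition (the class and slot names are invented and not taken from the nlme redesign):

```r
## Slot names and classes are fixed when the class is defined.
setClass("lmFit",
         representation(coef = "numeric",
                        fitted = "numeric"))

## Inheritance is declared in the class definition itself,
## not attached to individual objects as in S3.
setClass("aovFit", representation("lmFit"))
```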
S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.
S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names and order of its arguments are matched against those of the generic. (If the generic has a "..." argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
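For example, a generic and a method with a two-argument signature might be declared as follows (all names are invented for illustration):

```r
setGeneric("fitModel",
           function(formula, data, ...)
               standardGeneric("fitModel"))

## Dispatch considers the classes of both named arguments,
## not just the first.
setMethod("fitModel",
          signature(formula = "formula", data = "data.frame"),
          function(formula, data, ...) {
              ## fitting code would go here
          })
```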
In summary, the S4 system is much more formal regarding classes, generics and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization and, in my opinion, a cleaner structure to the package.
Package conversion
Chambers (1998, ch. 7 and 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.
S4 requires formal definitions of generics, classes and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.
Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.
Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.
Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.
We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.
We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and to use these classes in both the interpreted language and the compiled language.
The definition of S4 methods is more formal than in S3 but also more flexible, because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We
would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
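A sketch of this pattern, with invented names: assuming a "collector" method for a generic fitModel has been defined for the canonical signature (formula = "formula", data = "data.frame"), a convenience method can simply rearrange its arguments and re-dispatch:

```r
## Supplies a default for a missing data argument, then
## dispatches again to reach the collector method.
setMethod("fitModel",
          signature(formula = "formula", data = "missing"),
          function(formula, data, ...)
              callGeneric(formula, data.frame(), ...))
```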
Pitfalls
S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics and methods.
We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how and where to document methods. In the S4 system, methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately or in conjunction with the generic function or with a class definition.
Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.
Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.
One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
Conclusions
Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.
S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.
Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu
The genetics Package

Utilities for handling genetic data
by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side-effect profile.
Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.
Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').
A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'AT' and 'TA' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of copies of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.
The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.
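As a minimal sketch of this workflow (the genotype values are invented):

```r
library(genetics)
g <- genotype(c('A/A','A/C','C/C','C/A','A/A','A/C','A/C'))
summary(g)     # genotype and allele frequencies
HWE.test(g)    # test for departure from Hardy-Weinberg equilibrium
```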
My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:
g1 <- genotype( c('A/A','A/C','C/C','C/A',
                  NA,'A/A','A/C','A/C') )

g3 <- genotype( c('A A','A C','C C','C A',
                  '','A A','A C','A C'),
                sep=' ', remove.spaces=FALSE)
• A single vector with a positional separator:
g2 <- genotype( c('AA','AC','CC','CA','',
                  'AA','AC','AC'), sep=1 )
• Two separate vectors:
g4 <- genotype(
       c('A','A','C','C','','A','A','A'),
       c('A','C','C','A','','A','C','C')
      )
• A dataframe or matrix with two columns:
R News ISSN 1609-3631, Vol. 3/1, June 2003
gm <- cbind(
  c('A','A','C','C','','A','A','A'),
  c('A','C','C','A','','A','C','C') )
g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:
categorical Each allele combination acts differently.
This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:
lm( outcome ~ genotype.var + confounder )
additive The effect depends on the number of copies of a specific allele (0, 1, or 2).
The function allele.count( gene, allele ) returns the number of copies of the specified allele:
lm( outcome ~ allele.count(genotype.var,'A')
    + confounder )
dominant/recessive The effect depends only on the presence or absence of a specific allele.
The function carrier( gene, allele ) returns a boolean flag indicating whether the specified allele is present:
lm( outcome ~ carrier(genotype.var,'A')
    + confounder )
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order, controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.
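To make the factor-plus-attributes idea concrete, here is a minimal sketch (illustrative only, not the package's actual constructor) of a genotype-like object built from a plain factor:

```r
## Sketch of the design: a factor whose levels are the allele
## pairs, with the set of allele names stored in an attribute.
## (The real genotype() also reorders alleles, handles custom
## separators, and builds an allele.map translation table.)
raw <- c("A/A", "A/C", "C/C", "A/C")
g <- factor(raw)
class(g) <- c("genotype", "factor")
attr(g, "allele.names") <- unique(unlist(strsplit(raw, "/")))

levels(g)                # "A/A" "A/C" "C/C"
attr(g, "allele.names")  # "A" "C"
```

Because the object is still a factor underneath, ordinary factor operations (tabulation, use in model formulas) continue to work even without the package loaded.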
The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level; the columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:
allele.x <- attr(x, "allele.map")
alleles.x <- allele.x[genotype.var, which]
where which is 1, 2, or c(1,2), as appropriate.
Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information about genetic loci, genes, and markers, respectively. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
The genotype class is accompanied by a full complement of helper methods for standard R operators ( [], [<-, ==, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:
allele Extracts individual alleles as a character matrix.
allele.names Extracts the set of allele names.
homozygote Creates a logical vector indicating whether both alleles of each observation are the same.
heterozygote Creates a logical vector indicating whether the alleles of each observation differ.
carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.
getlocus Extracts locus, gene, or marker information.
makeGenotypes Converts appropriate columns in a dataframe to genotypes or haplotypes.
write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.
write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).
write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).
The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq Estimates or computes confidence intervals for the single-marker Hardy-Weinberg disequilibrium.
HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.
HWE.exact Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.
HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Computes pairwise linkage disequilibrium between genetic markers.
LDtable Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color-coded by significance.
LDplot Plots linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Computes the probability of observing all alleles with a given true frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)
Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

      C     T
C          0.12
T  0.12

Scaled Disequilibrium for each allele pair (D')

      C     T
C          0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values:

     Value
D     0.12
D'    0.56
r    -0.56
Confidence intervals computed via bootstrap
using 1000 samples

           Observed 95% CI             NA's
Overall D     0.121 ( 0.073,  0.159)      0
Overall D'    0.560 ( 0.373,  0.714)      0
Overall r    -0.560 (-0.714, -0.373)      0

           Contains Zero?
Overall D              NO
Overall D'             NO
Overall r              NO
Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2
= 63, p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld  # text display
Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr      -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100
a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr              -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100
>
> LDtable(ld)  # graphical display
[Figure: "Linkage Disequilibrium" table produced by LDtable, showing -D' and the number of observations (N), with the p-value, for each pair of the markers c104t, a1691g, and c2249t (Marker 1 vs. Marker 2), color-coded by significance.]
> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g,'G') +
+             c2249t, data=data))
Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101
Coefficients:
                         Estimate Std. Error
(Intercept)               -0.1807     0.5996
homozygote(c104t, C)TRUE   1.0203     0.2290
allele.count(a1691g, G)   -0.0905     0.1175
c2249tT/C                  0.4291     0.6873
c2249tT/T                  0.3476     0.5848
                         t value Pr(>|t|)
(Intercept)                -0.30     0.76
homozygote(c104t, C)TRUE    4.46  2.3e-05
allele.count(a1691g, G)    -0.77     0.44
c2249tT/C                   0.62     0.53
c2249tT/T                   0.59     0.55
---
Signif. codes: 0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg equilibrium and for linkage disequilibrium, using a variety of estimators.
As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.
In the future I intend to add functions to perform haplotype imputation and to generate standard genetics plots.
I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.
I welcome comments and contributions.
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors
by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model
\[
  y = X\beta + \varepsilon, \qquad
  X = (1_n, x_2, \ldots, x_p), \qquad
  \varepsilon \sim (0, \sigma^2 I_n),
\]
to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when
(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and
(b) the centered columns had been orthogonal to each other.
This variance is given by
\[
  v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad
  C = I_n - \frac{1}{n} 1_n 1_n',
\]
where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
\[
  \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j}
  = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j,
  \qquad j = 2, \ldots, p.
\]
It can also be written in the form
\[
  \mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\]
where
\[
  R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad
  M_j = I_n - X_j (X_j' X_j)^{-1} X_j',
\]
is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
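The equivalence of the two representations of the centered VIF is easy to check numerically; the following sketch uses made-up data and base R only:

```r
## Check VIF_j = 1/(1 - R_j^2) against
## sigma^-2 * var(beta_j) * x_j' C x_j  (made-up data).
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, sd = 0.5)   # x3 correlated with x2
y  <- 1 + x2 + x3 + rnorm(n)

## 1/(1 - R_j^2): regress x2 on the other regressors
## (with intercept) and invert 1 - R^2
r2   <- summary(lm(x2 ~ x3))$r.squared
vif1 <- 1 / (1 - r2)

## sigma^-2 var(beta_j) x_j' C x_j: the unscaled covariance
## (X'X)^{-1} times the centered sum of squares of x2
fit  <- lm(y ~ x2 + x3)
cu   <- vcov(fit) / summary(fit)$sigma^2
vif2 <- cu["x2", "x2"] * sum((x2 - mean(x2))^2)

all.equal(vif1, vif2)  # TRUE
```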
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
    UseMethod("vif")
> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors
As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899
Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
\[
  \widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}
\]
instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain
(Intercept)       x2       x3       x4       x5
   2540.378 2510.568 27.92701 24.89071 4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde{R}_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to a centered VIF $> 10$) for harmful collinearity should be greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = c.struct,
                  UncenteredVIF = uc.struct))
    }
    else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = uc.struct))
    }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux
A Package Developer's Tool

by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting: if a current version of Linux R is in the search path, then
make CrossCompileBuild
will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) that you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:
• downloads is the location to store all the sources needed for cross-building.
• cross-tools is the location to unpack the cross-building tools.
• WinR is the location to unpack the R source and cross-build R.
• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to be cross-built.
• WinRlibs is the location to put the resulting Windows binaries of the packages.
These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area and unpacks the cross-compiler tools into it.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR to carry out the cross-building of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them separately.
Build Linux R if needed
make LinuxR
This step is required, as of R-1.7.0, if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR as at the end of the last section, run:
make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in addition to the variables above.
Compiling
Now we can cross-compile R:
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. It may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, that this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.
Now create the pkgsrc subdirectory, to put the package sources into, and the WinRlibs subdirectory, to put the Windows binaries into.
We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do:
make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.
We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do:

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped source for an R package, ending with ".tar.gz", as is usually created through the R CMD build command. The make target name takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.
The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries, such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed, and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$, and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is
\[
  \sum_i \frac{1}{\pi_i} Y_i,
\]
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis were done to the whole population.
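In R, the Horvitz-Thompson estimate of a total is just a weighted sum; here is a toy sketch with made-up numbers (not NHIS data):

```r
## Horvitz-Thompson estimator: weight each sampled value by the
## inverse of its inclusion probability (toy illustration).
y  <- c(10, 20, 30)     # sampled values Y_i
pi <- c(0.1, 0.5, 0.5)  # inclusion probabilities pi_i
sum(y / pi)             # estimated population total: 200
```

The first observation, sampled with probability 0.1, stands in for ten population members, which is why it dominates the total.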
It is important to note that estimates obtained bymaximising an inverse-probability weighted loglike-lihood are not in general maximum likelihood esti-mates under the model whose loglikelihood is beingused In some cases semiparametric maximum like-lihood estimators are known that are substantiallymore efficient than the survey-weighted estimatorsif the model is correctly specified An important ex-ample is a two-phase study where a large sample istaken at the first stage and then further variables aremeasured on a subsample
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting and, in any case, are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance: the former decreases it and the latter (typically) increases it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent, unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used for longitudinal and clustered data (but not quite the same).
If we index strata by $s$, clusters by $c$, and observations by $i$, then
\[
  \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1}
  \sum_{c,i} \left( \frac{1}{\pi_{sci}}
  \left( Y_{sci} - \bar{Y}_s \right) \right)^2,
\]
where $S$ is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
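A direct transcription of the variance formula above might look as follows; this is an illustrative sketch only, not the actual svyCprod code:

```r
## Sketch of the stratified, weighted variance formula for a
## weighted sum S. 'cluster' is used only to count the number
## of PSUs n_s per stratum (illustration, not svyCprod).
svy.var <- function(y, pi, strata, cluster) {
  v <- 0
  for (s in unique(strata)) {
    in.s <- strata == s
    w    <- 1 / pi[in.s]                  # inverse-probability weights
    ybar <- weighted.mean(y[in.s], w)     # weighted stratum mean
    ns   <- length(unique(cluster[in.s])) # clusters in stratum s
    v    <- v + ns / (ns - 1) * sum((w * (y[in.s] - ybar))^2)
  }
  v
}

svy.var(y = c(1, 2, 3, 4), pi = rep(0.5, 4),
        strata = c(1, 1, 2, 2), cluster = 1:4)  # 8
```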
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Survey (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used the NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is
R News ISSN 1609-3631
Vol. 3/1, June 2003
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering and itwould be possible to write
imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGEP+POLCTC, design=imdsgn)
produces a table of number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
        pmin(3, POLCTC)
AGEP         0      1      2       3
   0    473079 411177 655211  276013
   1    162985  72498 365445  863515
   2    144000  84519 126126 1057654
   3     89108  68925 110523 1123405
   4    133902 111098 128061 1069026
   5    137122  31668  19027 1121521
   6    127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate
svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a log-likelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
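One way to gain confidence in a hand-written gradient before passing it to svymle is to compare it with a finite-difference derivative of the log-likelihood. The sketch below does this for the mean-derivative of the normal log-density; the names (mu, s, eps) and values are ad hoc.

```r
# Check an analytic score against a central finite difference of
# dnorm(..., log = TRUE). Illustrative values only.
x   <- 1.3
mu  <- 0.5
s   <- 2
eps <- 1e-6

d.analytic <- (x - mu) / s^2      # d log f / d mean for the normal density
d.numeric  <- (dnorm(x, mu + eps, s, log = TRUE) -
               dnorm(x, mu - eps, s, log = TRUE)) / (2 * eps)
abs(d.analytic - d.numeric)       # essentially zero
```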
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/\pi_i (or a rescaled version).
2. Evaluate the call.
3. Extract the one-step delta-betas \Delta_i, or compute them from the information matrix I and score contributions U_i as \Delta_i = I^{-1} U_i.
4. Pass \Delta_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.
5. Add a svywhatever class to the object, and a copy of the design object.
6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).
7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D. G. and Thompson, D. J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N x p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.
2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from X*. The selection process requires that we estimate the distribution of R^2 by bootstrap, and calculate G_k = R^2_k - R^{*2}_k, where R^{*2}_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, \bar{x}_{S_k}, from each row of X:

  newX[i,] = X[i,] \underbrace{(I - \bar{x}_{S_k} (\bar{x}_{S_k}^T \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^T)}_{IP_x}
6. The next optimal cluster is found by repeating the process using newX.
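The sweeping operation in step 5 is an orthogonal projection, so every row of newX is orthogonal to the swept mean gene. A small sketch with a toy matrix and hypothetical names; in R the row-by-row product in the equation amounts to X %*% IPx.

```r
# Sweep a (toy) mean gene xS out of X via the projection IPx from step 5.
set.seed(1)
X  <- matrix(rnorm(20), nrow = 4, ncol = 5)
xS <- colMeans(X[1:2, , drop = FALSE])       # mean gene of a toy "cluster"

IPx  <- diag(5) - outer(xS, xS) / sum(xS^2)  # I - xS (xS' xS)^{-1} xS'
newX <- X %*% IPx                            # same as applying IPx row by row

max(abs(newX %*% xS))                        # each row orthogonal to xS: ~0
```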
Figure 1 Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to \bar{x}_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
  Speedup = \frac{1}{r_s + r_p / n}
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
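Amdahl's Law is easy to encode directly; with the serial and parallel fractions measured later in the article (r_s = 0.22, r_p = 0.78) it reproduces the speedup estimates used below:

```r
# Amdahl's Law: speedup for serial fraction rs, parallel fraction rp, n CPUs.
speedup <- function(rs, rp, n) 1 / (rs + rp / n)

speedup(0.22, 0.78, 2)   # about 1.64
speedup(0.22, 0.78, 10)  # about 3.36
```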
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children: each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm, without slave-slave communication
Bootstrapping
Bootstrapping is used to find R^{*2}, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
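For reference, one serial bootstrap replicate, permuting within each row with apply as described above (toy matrix; t() is needed because apply returns its per-row results as columns):

```r
# One bootstrap permutation of X, shuffling elements within each row.
set.seed(2)
X  <- matrix(1:12, nrow = 3, ncol = 4)
Xb <- t(apply(X, 1, sample))  # sample() permutes each row vector

dim(Xb)                       # unchanged: 3 4
```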
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R^{*2}, and returns this to the master process. The master averages R^{*2}_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, and a bootstrap permutation is performed, an R^2 value calculated and returned to the master. The following code performs these operations of the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(D.funct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of D.funct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows, and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. Of interest is the proportion of the total computation time used by the bootstrapping and the orthogonalization: if these proportions are significant, the steps are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for the overall gene-shaving run, the total bootstrapping time, and the total orthogonalization time.
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the bootstrapping, the orthogonalization, and the overall algorithm, under the serial, 2 node and 16 node environments.
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

  S = \frac{T_{serial}}{T_{parallel}}

  S_{2 nodes} = 1.6        S_{16 nodes} = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. IBM, Sunnyvale, California.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals

by Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this Section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.
I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003), and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹ http://CRAN.R-project.org/
² http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++, or Fortran code, and compiling and linking this code into R.
Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way for getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
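A short sketch of both search paths (the exact output depends on the packages installed):

```r
# Search the installed documentation for a topic, matching name,
# title, and keyword entries (fuzzy matching by default):
help.search("median")

# Find objects in the search path whose names match a pattern;
# regular expressions are allowed:
apropos("^med")
```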
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search3 provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists4: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list5. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, and hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading the manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers on r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
3 http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
4 accessible via http://www.R-project.org/mail.html
5 archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002.
770 pages, ISBN 0-471-56040-5.
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. Traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques, such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics, are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
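The masking problem described here can also be avoided without restarting the session, by pairing each attach() with a detach(); a minimal sketch, using two hypothetical data frames that share a column name:

```r
# Two data frames with a common column name 'x' -- the situation that
# causes trouble when both are attached at once:
d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)

attach(d1)
m1 <- mean(x)   # refers to d1$x
detach(d1)      # clean up before attaching the next data frame

attach(d2)
m2 <- mean(x)   # now unambiguously refers to d2$x
detach(d2)
```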
SC contains a very broad selection of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially in self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences between the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002.
282 pages, ISBN 0-387-95475-9.
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R was itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, the book itself may be sufficient for self-study.
The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author, though, that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute(...)) as a default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summaries in plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as for paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, ranging from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred that the coverage of variable selection in multiple linear regression be omitted entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002.
328 pages, ISBN 0-7619-2280-6.
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both the language and the contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression, contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car concerns regression diagnostics, with methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 and 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book as well as functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences in S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from its target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as do the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.
• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).
• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform, and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).
• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.
bull Output text connections no longer have a line-length limit
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.
• The postscript output makes use of relative moves and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC), and vcov() methods.
• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., it works for "dist" objects.
• Generic function confint() with an 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example(). (Suggested by Andy Liaw.)
• New function file.symlink() to create symbolic file links where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
• New functions getFromNamespace(), fixInNamespace(), and getS3method() to facilitate developing code in packages with namespaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues(), and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance(), and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)
• library() was changed so that, when the methods package is attached, it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument, allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now invisibly returns a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)
• New function mapply(), a multivariate lapply().
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)
• An attempt to open() an already open connection will be detected and ignored, with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.
• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n × m inputs with m >> n.
• princomp() no longer allows the use of more variables than units; use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)
• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the generated documentation shell is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy(), and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for that class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().
• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().
• require() has a new argument named character.only, to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NAs, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added a "[" method for "terms" objects.
• New argument 'silent' to try().
• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently it is little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
bull Changes to support name spaces
ndash Placing base in a name spacecan no longer be disabled bydefining the environment variableR_NO_BASE_NAMESPACE
ndash New function topenv() to determine thenearest top level environment (usuallyGlobalEnv or a name space environ-ment)
ndash Added name space support for packagesthat do not use methods
R News ISSN 1609-3631
Vol. 3/1, June 2003
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines were moved to R_ext/BLAS.h, and are included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.
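As a sketch of the PKG_LIBS advice in the LAPACK item above (the package itself is hypothetical; LAPACK_LIBS and BLAS_LIBS are make variables supplied by R at install time), a package's 'src/Makevars' would need only:

```make
## src/Makevars for a hypothetical package linking against R's LAPACK library
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS)
```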
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A maximum likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer; see http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with entries numbered 1-27]
Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (13,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
export("[<-.tracker")
As a convenience, exports can also be specified with a pattern. The directive
exportPattern("\\.test$")
exports all variables ending in '.test'.

Import directives are used to import definitions from other packages with name spaces. The directive
import(mva)
imports all exported variables from the package mva. The directive

importFrom(stepfun, as.stepfun)

imports only the as.stepfun function from the stepfun package.

Two additional directives are available. The S3method directive is used to register methods for S3 generic functions; this directive is discussed further in the following section. The final directive is useDynLib. This directive allows a package to declaratively specify that a DLL is needed and should be loaded when the package is loaded.
For a package with a name space, the two operations of loading the package and attaching the package to the search path are logically distinct. A package loaded by a direct call to library will first be loaded, if it is not already loaded, and then be attached to the search path. A package loaded to satisfy an import dependency will be loaded but will not be attached to the search path. As a result, the single initialization hook function .First.lib is no longer adequate and is not used by packages with name spaces. Instead, a package with a name space can provide two hook functions, .onLoad and .onAttach. These functions, if defined, should not be exported. Many packages will not need either hook function, since import directives take the place of require calls and useDynLib directives can replace direct calls to library.dynam.
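As a sketch of how these pieces fit together (the package name 'mypkg' and the functions mentioned in the directives are hypothetical, not part of R), a 'NAMESPACE' file and the corresponding hooks in the package's R code might look like this:

```r
## Hypothetical NAMESPACE file for a package 'mypkg':
##   export(myfun)              # public interface
##   import(mva)                # use all exported variables of mva
##   S3method(print, myclass)   # register print.myclass for S3 dispatch
##   useDynLib(mypkg)           # load mypkg's DLL when the name space loads

## In the package's R code; these hooks are not exported:
.onLoad <- function(libname, pkgname) {
    ## run once, when the name space is loaded
}

.onAttach <- function(libname, pkgname) {
    ## run additionally, when the package is attached to the search path
}
```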
Name spaces are sealed. This means that, once a package with a name space is loaded, it is not possible to add or remove variables or to change the values of variables defined in a name space. Sealing serves two purposes: it simplifies the implementation, and it ensures that the locations of variable bindings cannot be changed at run time. This will facilitate the development of a byte code compiler for R (Tierney, 2001).
Adding a name space to a package may complicate debugging package code. The function fix, for example, is no longer useful for modifying an internal package function, since it places the new definition in the global environment, and that new definition remains shadowed within the package by the original definition. As a result, it is a good idea not to add a name space to a package until it is completely debugged, and to remove the name space if further debugging is needed; this can be done by temporarily renaming the 'NAMESPACE' file. Other alternatives are being explored, and R 1.7.0 contains some experimental functions, such as fixInNamespace, that may help.
Name spaces and method dispatch
R supports two styles of object-oriented programming: the S3 style based on UseMethod dispatch, and the S4 style provided by the methods package. S3 style dispatch is still used the most, and some support is provided in the name space implementation released in R 1.7.0.

S3 dispatch is based on concatenating the generic function name and the name of a class to form the name of a method. The environment in which the generic function is called is then searched for a method definition. With name spaces this creates a problem: if a package is imported but not attached, then the method definitions it provides may not be visible at the call site. The solution provided by the name space system is a mechanism for registering S3 methods with the generic function. A directive of the form

S3method(print, dendrogram)

in the 'NAMESPACE' file of a package registers the function print.dendrogram defined in the package as the print method for class dendrogram. The variable print.dendrogram does not need to be exported. Keeping the method definition private ensures that users will not inadvertently call the method directly.

S4 classes and generic functions are by design more formally structured than their S3 counterparts, and should therefore be conceptually easier to integrate with name spaces than S3 generic functions and classes. For example, it should be fairly easy to allow for both exported and private S4 classes; the concatenation-based approach of S3 dispatch does not seem to be compatible with having private classes, except by some sort of naming convention. However, the integration of name spaces with S4 dispatch could not be completed in time for the release of R 1.7.0. It is quite likely that the integration can be accomplished by adding only two new directives, exportClass and importClassFrom.
A simple example
A simple artificial example may help to illustrate how the import and export directives work. Consider two packages named foo and bar. The R code for package foo in file 'foo.R' is

x <- 1
f <- function(y) c(x, y)
The 'NAMESPACE' file contains the single directive
export(f)
Thus the variable x is private to the package and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.

The second package bar has code file 'bar.R' containing the definitions

c <- function(...) sum(...)
g <- function(y) f(c(y, 7))

and 'NAMESPACE' file
import(foo)
export(g)
The function c is private to the package. The function g calls a global function c. Definitions provided in the package take precedence over imports and definitions in the base package; therefore, the definition of c used by g is the one provided in bar.

Calling library(bar) loads bar and attaches its exports to the search path. Package foo is also loaded, but not attached to the search path. A call to g produces

> g(6)
[1] 1 13
This is consistent with the definitions of c in the two settings: in bar, the function c is defined to be equivalent to sum, but in foo the variable c refers to the standard function c in base.
Some details
This section provides some details on the current implementation of name spaces. These details may change as name space support evolves.

Name spaces use R environments to provide static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name space environment, as shown in Figure 2. This environment consists of a set of three static frames followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains
[Figure 2: Name space environment]
imported definitions, and the third frame is the base package. Suppose the function mydnorm shown in the introduction is defined in a package with a name space that does not explicitly import any definitions. Then a call to mydnorm will cause R to evaluate the variable pi by searching for a binding in the name space environment. Unless the package itself defines a variable pi, the first binding found will be the one in the third frame, the definition provided in the base package. Definitions in the dynamic global environment cannot shadow the definition in base.

When library is called to load a package with a name space, library first calls loadNamespace and then attachNamespace. loadNamespace first checks whether the name space is already loaded and registered with the internal registry of loaded name spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all imported name spaces, and definitions of exported variables of these packages are copied to the imports frame, the second frame of the new name space. Then the package code is loaded and run, and exports, S3 methods and shared libraries are registered. Finally, the .onLoad hook is run, if the package defines one, and the name space is sealed. Figure 3 illustrates the dynamic global environment and the internal data base of loaded name spaces after two packages, A and B, have been loaded by explicit calls to library, and package C has been loaded to satisfy import directives in package B.

The serialization mechanism used to save R workspaces handles name spaces by writing out a reference for a name space consisting of the package name and version. When the workspace is loaded, the reference is used to construct a call to loadNamespace. The current implementation ignores the version information; in the future, this may be used to signal compatibility warnings or select among several available versions.

The registry of loaded name spaces can be examined using loadedNamespaces. In the current implementation, loaded name spaces are not unloaded automatically. Detaching a package with a
[Figure 3: Internal implementation]
name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.
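For instance (a sketch, using the hypothetical package name foo from the example in this article), the registry can be inspected and a name space forced out during development with:

```r
loadedNamespaces()       # character vector naming the loaded name spaces
unloadNamespace("foo")   # force unloading, e.g. before loading a new version
```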
Variables exported by a package with a name space can also be referenced using a fully qualified variable reference, consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
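Continuing the earlier foo/bar example, the exported function f of foo could be called this way without attaching foo at all:

```r
foo::f(2)   # loads foo's name space if necessary, then calls f;
            # with the definitions above this evaluates to c(1, 2)
```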
Discussion
Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.

The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.

Some languages, including Ada, ML (Ullman, 1997), and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R, this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful, and the associated additional complexity is warranted in R, is not yet clear.
Bibliography
John Barnes Programming In Ada 95 Addison-Wesley 2nd edition 1998 5
Samuel P. Harbison. Modula-3. Prentice Hall, 1992.
PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.
Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.
Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.
Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.
Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu
R News ISSN 1609-3631
Vol. 3/1, June 2003
Converting Packages to S4

by Douglas Bates
Introduction
R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.
'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot, and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.
Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class, and either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods("print"). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete, because recent versions of some popular packages use name spaces to hide some of the S3 methods defined in the package.)
The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies, John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.
There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.
Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.
S3 versus S4 classes and methods
S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system; at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.
The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class "foo". This can lead to surprising and unpleasant errors for the unwary.
Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often, when creating a plot or when fitting a statistical model, we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.
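The two weaknesses just described can be seen in a small sketch (the class name "foo" and its component are invented for illustration):

```r
# S3 finds a method purely by its name: generic name, dot, class name.
# Dispatch consults only the class of the first argument.
print.foo <- function(x, ...) cat("foo object, value:", x$value, "\n")

# Any object can claim class "foo"; nothing validates its contents.
obj <- structure(list(value = 42), class = "foo")
print(obj)   # dispatches to print.foo based on class(obj) alone
```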
There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class "lm"). We say that class "aov" inherits from class "lm". In the informal S3 system, this inheritance is described by assigning the class c("aov", "lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed, there were examples in some packages where the class inheritance was not consistently assigned.
By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.
S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.
S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a ... argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
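The pieces described above fit together as in this sketch (the class "interval" and the generic width are invented for illustration):

```r
library(methods)

# An explicit class definition: slot names, slot classes, and a
# validity check are all fixed when the class is defined.
setClass("interval",
         representation(lo = "numeric", hi = "numeric"),
         validity = function(object)
           if (object@lo <= object@hi) TRUE
           else "lo must not exceed hi")

# An explicit generic, and a method declared for a signature.
setGeneric("width", function(x, ...) standardGeneric("width"))
setMethod("width", signature(x = "interval"),
          function(x, ...) x@hi - x@lo)

i <- new("interval", lo = 1, hi = 4)   # validated on creation
width(i)   # 3
```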
In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization, and, in my opinion, a cleaner structure to the package.
Package conversion
Chambers (1998, ch. 7 and 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.
S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.
Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.
Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.
Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS, and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.
We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.
We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language, and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and to use these classes in both the interpreted language and the compiled language.
The definition of S4 methods is more formal than in S3, but also more flexible, because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
Pitfalls
S4 classes and methods are powerful tools. With these tools, a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.
We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system, methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.
Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.
Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.
One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered, to our chagrin, that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
Conclusions
Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.
S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.
J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.
R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.
W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.
Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu
The genetics Package
Utilities for handling genetic data

by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.
Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.
Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').
A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'A/T' and 'T/A' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.
The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and testing for it is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in considering the sample sizes needed when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.
My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )
  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F)
• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )
• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )
• A dataframe or matrix with two columns:

  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:
categorical Each allele combination acts differently.

This situation is handled by entering the genotype variable without modification into a model; in this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )
additive The effect depends on the number of copies of a specific allele (0, 1, or 2).

The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )
dominant/recessive The effect depends only on the presence or absence of a specific allele.

The function carrier( gene, allele ) returns a boolean flag indicating whether the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype", "factor"). The names of the factor levels are constructed as paste(allele1, allele2, sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.
The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:
  allele.x <- attr(x, "allele.map")
  allele.x[genotype.var, which]

where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts, factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
The genotype class is accompanied by a full complement of helper methods for standard R operators ( [], [<-, ==, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:
allele Extracts the individual alleles (as a matrix).
allele.names Extracts the set of allele names.
homozygote Creates a logical vector indicating whether both alleles of each observation are the same.
heterozygote Creates a logical vector indicating whether the alleles of each observation differ.
carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.
getlocus Extracts locus, gene, or marker information.
makeGenotypes Converts appropriate columns in a dataframe to genotypes or haplotypes.
write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.
write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).
write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).
The genetics package provides four functions related to Hardy-Weinberg equilibrium:
diseq Estimates or computes a confidence interval for the single-marker Hardy-Weinberg disequilibrium.
HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.
HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.
HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.
as well as three related to linkage disequilibrium (LD):
LD Computes pairwise linkage disequilibrium between genetic markers.
LDtable Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color-coded by significance.
LDplot Plots linkage disequilibrium as a function of marker location.
and one function for sample size calculation:
gregorius Computes the probability of observing all alleles with a given frequency in a sample of a specified size.
The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages and the corresponding package documentation.
Example
Here is a partial session using tools from the genotype package to examine the features of 3 simulated markers and their relationships with a continuous outcome.
> library(genetics)

[...]

> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]

>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

      C     T
C          0.12
T  0.12

Scaled Disequilibrium for each allele pair (D')

      C     T
C          0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values

    Value
D    0.12
D'   0.56
r   -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed        95% CI         NA's
Overall D      0.121 ( 0.073,  0.159)         0
Overall D'     0.560 ( 0.373,  0.714)         0
Overall r     -0.560 (-0.714, -0.373)         0

            Contains Zero?
Overall D               NO
Overall D'              NO
Overall r               NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137,
N2 = 63, p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables
with more or less than two alleles detected.
These variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld # text display

Pairwise LD
-----------
                  a1691g  c2249t
c104t   D          -0.01   -0.03
c104t   D'          0.05    1.00
c104t   Corr       -0.03   -0.21
c104t   X^2         0.16    8.51
c104t   P-value     0.69  0.0035
c104t   n            100     100

a1691g  D                  -0.01
a1691g  D'                  0.31
a1691g  Corr               -0.08
a1691g  X^2                 1.30
a1691g  P-value             0.25
a1691g  n                    100
> LDtable(ld) # graphical display
[Figure: LDtable(ld) produces a color-coded table of pairwise D' values, sample sizes, and p-values for the markers c104t, a1691g, and c2249t.]
> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101
Coefficients:
                         Estimate Std. Error
(Intercept)               -0.1807     0.5996
homozygote(c104t, C)TRUE   1.0203     0.2290
allele.count(a1691g, G)   -0.0905     0.1175
c2249tT/C                  0.4291     0.6873
c2249tT/T                  0.3476     0.5848

                         t value Pr(>|t|)
(Intercept)                -0.30     0.76
homozygote(c104t, C)TRUE    4.46  2.3e-05
allele.count(a1691g, G)    -0.77     0.44
c2249tT/C                   0.62     0.53
c2249tT/T                   0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.1 on 95 degrees
of freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg equilibrium and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.
In the future, I intend to add functions to perform haplotype imputation and to generate standard genetics plots.
I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors

by Jürgen Groß
A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model
\[
y = \underset{n\times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n),
\]
to diagnose collinearity is to measure how much the variance var(β̂_j) of the j-th element β̂_j of the ordinary least squares estimator β̂ = (X'X)⁻¹X'y is inflated by near linear dependencies of the columns of X. This is done by putting var(β̂_j) in relation to the variance of β̂_j as it would have been when
(a) the nonconstant columns x_2, ..., x_p of X had been centered, and

(b) the centered columns had been orthogonal to each other.
This variance is given by
\[
v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n}\, 1_n 1_n',
\]
where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is
\[
\mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \ldots, p.
\]
It can also be written in the form
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\]
where
\[
R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j',
\]
is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
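As a quick numerical sanity check, the identity VIF_j = 1/(1 - R_j^2) can be verified directly in R. The following sketch is not from the article; the data are simulated and the variable names are invented for illustration.

```r
## Check VIF_j = 1/(1 - R_j^2) for one regressor; invented data.
set.seed(42)
n  <- 30
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, 0, 0.5)   # induce some collinearity with x2
y  <- 1 + x2 + x3 + rnorm(n)

## R_j^2: regress x2 on the remaining independent variables
r2 <- summary(lm(x2 ~ x3))$r.squared
vif.from.r2 <- 1 / (1 - r2)

## The same quantity from the fitted model: the diagonal element of
## (X'X)^{-1} for x2, rescaled by x_j' C x_j = sum((x2 - mean(x2))^2)
fit <- lm(y ~ x2 + x3)
V   <- summary(fit)$cov.unscaled
vif.direct <- V["x2", "x2"] * sum((x2 - mean(x2))^2)

all.equal(vif.from.r2, vif.direct)   # TRUE
```

Both routes give the same number, which is the sense in which the VIF summarizes the auxiliary regression of x_j on the other columns.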
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which was communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
    UseMethod("vif")
> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors
As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
So, no harmful collinearity in the model, even though the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
\[
\widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}
\]
instead of the centered R_j². This can also be done for the intercept as an independent 'variable'. For the above example we obtain
(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  45.18498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that x_j' C x_j ≤ x_j' x_j, so that always R_j² ≤ R̃_j². Hence the usual critical value R_j² > 0.9 (corresponding to a centered VIF > 10) for harmful collinearity should be chosen greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression of the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = c.struct,
                  UncenteredVIF = uc.struct))
    }
    else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = uc.struct))
    }
  }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF, Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux
A Package Developer's Tool

by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes (1) that you have the Makefile in a directory RCrossBuild (empty except for the Makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the Makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.
These directories can be changed by modifying the Makefile. One may find out exactly what each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory, cross-tools, in the build area and unpacks the cross-compiler tools into it.
Prepare source code
make prepsrc
This rule creates a separate subdirectory, WinR, to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, run

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R Makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show exactly what each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is
If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is
\[
\sum_i \frac{1}{\pi_i} Y_i
\]
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
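As a toy illustration (mine, not from the article), the Horvitz-Thompson estimate of a total is computed by weighting each sampled value by the inverse of its inclusion probability:

```r
## Invented toy data: five sampled units, where the last two came
## from a group sampled with much lower probability.
y  <- c(10, 12, 9, 30, 28)          # observed values Y_i
pi <- c(0.5, 0.5, 0.5, 0.25, 0.25)  # inclusion probabilities pi_i

## Horvitz-Thompson estimate of the population total
ht.total <- sum(y / pi)
ht.total   # 294
```

Each unit "stands in" for 1/π_i population members, so rarely-sampled units contribute more to the estimated total.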
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not, in general, maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then
\[
\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \bigl( Y_{sci} - \bar Y_s \bigr) \right)^2
\]
where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
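A stripped-down version of this computation can be sketched in a few lines. The function below is my own rough illustration using weighted cluster totals; it is not the package's svyCprod, which additionally handles finite population corrections and subsetting.

```r
## Rough sketch of the standard error of a weighted total under
## stratified cluster sampling, via weighted cluster totals
## z_sc = sum_i Y_sci / pi_sci (no finite population correction).
ht.se <- function(y, pi, strata, cluster) {
  v <- 0
  for (s in unique(strata)) {
    in.s <- strata == s
    ## weighted total for each sampled cluster in stratum s
    z  <- tapply(y[in.s] / pi[in.s], cluster[in.s], sum)
    ns <- length(z)                      # clusters in this stratum
    v  <- v + ns / (ns - 1) * sum((z - mean(z))^2)
  }
  sqrt(v)
}

## Invented toy data: one stratum with two clusters of two units
ht.se(y = c(1, 2, 3, 4), pi = rep(0.5, 4),
      strata = c("a", "a", "a", "a"),
      cluster = c(1, 1, 2, 2))   # 8
```

The between-cluster spread of the weighted totals drives the variance, which is why clustering typically inflates standard errors relative to independent sampling.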
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is
imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFAIM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering and itwould be possible to write
imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFAIM,
                    data = immunize,
                    nest = TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP + POLCTC, design = imdsgn)
produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
     pmin(3, POLCTC)
AGEP      0      1      2       3
   0 473079 411177 655211  276013
   1 162985  72498 365445  863515
   2 144000  84519 126126 1057654
   3  89108  68925 110523 1123405
   4 133902 111098 128061 1069026
   5 137122  31668  19027 1121521
   6 127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate

svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP),
       design = imdsgn, family = binomial)
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP,
       design = imdsgn, family = binomial)
Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 -
        sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)
Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignores the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References

Horvitz, D.G., Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000a,b), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000a,b) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³, and p usually no greater than 10².

2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*_k. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R̄*²_k, where R̄*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene x̄_{S_k} of the optimal cluster from each row of X:
\[
\mathrm{newX}[i,] = X[i,]\;\underbrace{\left( I - \bar x_{S_k} (\bar x_{S_k}^T \bar x_{S_k})^{-1} \bar x_{S_k}^T \right)}_{IP_x}
\]
6. The next optimal cluster is found by repeating the process using newX.
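One shaving pass (steps 2-3) can be sketched in a few lines of R. This is an illustrative sketch, not the authors' implementation: shave.once is a hypothetical helper that approximates the eigen-gene by the first right-singular vector of X.

```r
# Hypothetical sketch of one shaving step (steps 2-3 above).
# The "eigen-gene" is approximated by the first right-singular vector of X.
shave.once <- function(X) {
  eigen.gene <- svd(X)$v[, 1]
  # squared correlation of each gene (row) with the eigen-gene
  r2 <- apply(X, 1, function(g) cor(g, eigen.gene)^2)
  # shave the rows in the lowest 10% of R^2
  X[r2 > quantile(r2, 0.10), , drop = FALSE]
}
```

Iterating shave.once produces the nested sequence X*_1, X*_2, ... described in step 3.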
R News ISSN 1609-3631
Vol. 3/1, June 2003
Figure 1: Clustering of genes.
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_Sk. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
    Speedup = 1 / (r_s + r_p / n)
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
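As a quick illustration, Amdahl's law is easy to evaluate in R. The function name amdahl is our own; 0.78 is the parallel proportion reported later in this article.

```r
# Amdahl's law: estimated speedup for parallel proportion rp on n processors
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(0.78, 2)   # approximately 1.63
amdahl(0.78, 10)  # approximately 3.35
```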
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
- Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

- Orthogonalization children: each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication (a master process connected to slaves 1 through n).
Bootstrapping
Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
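The serial version described above amounts to something like the following sketch, where X is a toy stand-in for the expression matrix:

```r
# Serial within-row permutation, one bootstrap sample (a sketch):
# sample() permutes each row; apply() returns the rows as columns,
# so the result is transposed back to the original shape.
X <- matrix(1:12, nrow = 3)   # toy stand-in for the expression matrix
Xb <- t(apply(X, 1, sample))
```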
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to each of its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions equals the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program; these steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2_i and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process; it includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
  par.bootstrap <- function(X) {
    brsq <- 0
    tasks <- length(boot.children)
    .PVM.initsend()
    # send a copy of X to everyone
    .PVM.pkdblmat(X)
    .PVM.mcast(boot.children, 0)
    # get the R squared values back and
    # take the mean
    for (i in 1:tasks) {
      .PVM.recv(boot.children[i], 0)
      rsq <- .PVM.upkdouble(1, 1)
      brsq <- brsq + (rsq / tasks)
    }
    return(brsq)
  }
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
  rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean of all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations in the children tasks:
  # receive the matrix
  .PVM.recv(.PVM.parent(), 0)
  X <- .PVM.upkdblmat()
  Xb <- X
  dd <- dim(X)
  # perform the permutation
  for (i in 1:dd[1]) {
    permute <- sample(1:dd[2], replace = FALSE)
    Xb[i, ] <- X[i, permute]
  }
  # send it back to the master
  .PVM.initsend()
  .PVM.pkdouble(Dfunct(Xb))
  .PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. The order of the rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
  par.orth <- function(IPx, X) {
    rows <- dim(X)[1]
    tasks <- length(orth.children)
    # initialize the send buffer
    .PVM.initsend()
    # pack the IPx matrix and send it to
    # all children
    .PVM.pkdblmat(IPx)
    .PVM.mcast(orth.children, 0)
    # divide the X matrix into blocks and
    # send a block to each child
    nrows <- as.integer(rows / tasks)
    for (i in 1:tasks) {
      start <- (nrows * (i - 1)) + 1
      end <- nrows * i
      if (end > rows)
        end <- rows
      Xpart <- X[start:end, ]
      .PVM.pkdblmat(Xpart)
      .PVM.send(orth.children[i], start)
    }
    # wait for the blocks to be returned
    # and re-construct the matrix
    for (i in 1:tasks) {
      .PVM.recv(orth.children[i], 0)
      rZ <- .PVM.upkdblmat()
      if (i == 1)
        Z <- rZ
      else
        Z <- rbind(Z, rZ)
    }
    dimnames(Z) <- dimnames(X)
    Z
  }
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends the result back to the master:
  # wait for the IPx and X matrices to arrive
  .PVM.recv(.PVM.parent(), 0)
  IPx <- .PVM.upkdblmat()
  .PVM.recv(.PVM.parent(), -1)
  Xpart <- .PVM.upkdblmat()
  Z <- Xpart
  # orthogonalize X
  for (i in 1:dim(Xpart)[1]) {
    Z[i, ] <- Xpart[i, ] %*% IPx
  }
  # send back the new matrix Z
  .PVM.initsend()
  .PVM.pkdblmat(Z)
  .PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results: computation time (seconds) versus size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) for variable-sized expression matrices (rows), for the overall algorithm, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes).
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
    S = T_serial / T_parallel

    S_2nodes  = 1301.18 / 825.34 = 1.6

    S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing the bootstrap than expected. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the one which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm, which can fully utilise each cluster node, will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
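For a taste of "working with the language", quoting and evaluating an expression takes only two lines; this small example is mine, not from the manual:

```r
# An unevaluated language object, then evaluated with supplied bindings
e <- quote(x + y)
eval(e, list(x = 1, y = 2))  # evaluates to 3
```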
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Besides the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
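For example (the search strings here are arbitrary):

```r
# Fuzzy/keyword search across the help pages of installed packages
help.search("principal component")
# Find objects whose names partially match a pattern
apropos("help")
```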
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages in the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of
keywords is useful when searching with the other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or for documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is a need for discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, and hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. Traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed-effects models and spatial statistics are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
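The masking problem described above is easy to avoid; mydata is a made-up example, and pairing every attach() with a detach() is the usual remedy:

```r
mydata <- data.frame(x = 1:3, y = c(2, 4, 6))
attach(mydata)   # columns become visible by name
m <- mean(y)
detach(mydata)   # remove again, so later frames with a 'y' cannot clash
```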
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
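A few one-liners suggest why those reference functions are worth knowing (the examples are mine, not from the book):

```r
sapply(1:3, function(i) i^2)        # element-wise squares: 1 4 9
grep("ab", c("abc", "xyz", "tab"))  # positions of matches: 1 3
solve(diag(2, 2))                   # matrix inverse of diag(2, 2)
```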
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially in self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R was itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters; I don't think most people would expect those topics to be covered in an introductory text. I do agree with the author, though, that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
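The deparse(substitute()) idiom mentioned above can be sketched as follows (the wrapper name is hypothetical, not taken from the book):

```r
# A minimal sketch of the idiom: the default value of 'main' is evaluated
# lazily, so substitute(x) recovers the expression the caller passed in.
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)  # the plot is titled with the caller's expression
}

vals <- rnorm(10)
myplot(vals)  # title reads "vals"
```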
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.), as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.

R News ISSN 1609-3631
Vol. 3/1, June 2003

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
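A brief sketch of how RNGversion() might be used to reproduce simulations run under an older R:

```r
# Restore the generators that were the default before R 1.7.0,
# so that seeded simulations from the older version can be reproduced.
RNGversion("1.6.2")     # generators as in R 1.6.2
set.seed(123)
old.draws <- runif(3)   # matches output from the older version

RNGversion("1.7.0")     # back to Mersenne-Twister / Inversion
set.seed(123)
new.draws <- runif(3)   # will generally differ from old.draws
```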
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)
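A minimal illustration of the new padding behaviour:

```r
x <- 1:4
names(x) <- c("a", "b")  # value too short: remaining names padded with NA
names(x)                 # "a" "b" NA NA
```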
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4, 2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
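A small illustration of the recursive indexing shorthand:

```r
x <- list(1, list("a", "b"))
x[[c(2, 1)]]   # same as x[[2]][[1]], i.e. "a"
```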
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC), and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
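For instance, printed output can be captured into a character vector, one element per printed line:

```r
# capture.output() diverts what would have been printed at the console
out <- capture.output(summary(1:10))
is.character(out)   # TRUE
```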
• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for dist objects.

• Generic function confint() and an 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace(), and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.
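A sketch of reading a compressed save back through a wrapped connection (using a temporary file; this mirrors what load() now does internally):

```r
x <- 1:10
f <- tempfile()
save(x, file = f, compress = TRUE)  # gzip-compressed save

rm(x)
con <- gzcon(file(f, "rb"))  # wrap a binary connection for decompression
load(con)                    # restores 'x'
close(con)
```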
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues(), and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance(), and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now invisibly returns a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)
• New function mapply(), a multivariate lapply().
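mapply() applies a function elementwise over several argument vectors in parallel, e.g.:

```r
# Add corresponding elements of two vectors
mapply(function(a, b) a + b, 1:3, 4:6)  # 5 7 9
```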
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy(), and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to two decimal places.
• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib', and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages, using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo, and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages, using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine(), and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran(), and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++, written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour, and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis, and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, and J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary, and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996), and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood, and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods, and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass, and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
R News ISSN 1609-3631
Vol. 3/1, June 2003, page 38
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini, and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk's event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid omitted: a grid with numbered lights 1-27]
Across

9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down

1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
The 'NAMESPACE' file contains the single directive
export(f)
Thus the variable x is private to the package and the function f is public. The body of f references a global variable x and a global function c. The global reference x corresponds to the definition in the package. Since the package does not provide a definition for c and does not import any definitions, the variable c refers to the definition provided by the base package.
The second package, bar, has code file 'bar.R' containing the definitions
c <- function(...) sum(...)

g <- function(y) f(c(y, 7))
and 'NAMESPACE' file
import(foo)
export(g)
The function c is private to the package. The function g calls a global function c. Definitions provided in the package take precedence over imports and definitions in the base package; therefore the definition of c used by g is the one provided in bar.
Calling library(bar) loads bar and attaches its exports to the search path. Package foo is also loaded but not attached to the search path. A call to g produces
> g(6)
[1] 1 13
This is consistent with the definitions of c in the two settings: in bar the function c is defined to be equivalent to sum, but in foo the variable c refers to the standard function c in base.
Some details
This section provides some details on the current implementation of name spaces. These details may change as name space support evolves.
Name spaces use R environments to provide static binding of global variables for function definitions. In a package with a name space, functions defined in the package are defined in a name space environment, as shown in Figure 2. This environment consists of a set of three static frames followed by the usual dynamic global environment. The first static frame contains the local definitions of the package. The second frame contains
[Figure 2: Name space environment. The static part consists of the package's internal definitions, its imports, and base; the dynamic part is the usual global search path.]
imported definitions, and the third frame is the base package. Suppose the function mydnorm shown in the introduction is defined in a package with a name space that does not explicitly import any definitions. Then a call to mydnorm will cause R to evaluate the variable pi by searching for a binding in the name space environment. Unless the package itself defines a variable pi, the first binding found will be the one in the third frame, the definition provided in the base package. Definitions in the dynamic global environment cannot shadow the definition in base.
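As a reminder, the mydnorm function from the article's introduction is essentially the following (a sketch; the exact body given there may differ slightly):

```r
## A standard normal density written in R. In a package with a
## name space, the global references `exp`, `sqrt` and `pi` are
## resolved through the static frames, ending at the base package.
mydnorm <- function(z) exp(-z^2 / 2) / sqrt(2 * pi)

mydnorm(0)  # same value as dnorm(0)
```

Assigning, say, pi <- 3 at top level would affect a plain interactively-defined function, but not one bound in a sealed name space.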
When library is called to load a package with a name space, library first calls loadNamespace and then attachNamespace. loadNamespace first checks whether the name space is already loaded and registered with the internal registry of loaded name spaces. If so, the already loaded name space is returned. If not, then loadNamespace is called on all imported name spaces, and definitions of exported variables of these packages are copied to the imports frame, the second frame of the new name space. Then the package code is loaded and run, and exports, S3 methods, and shared libraries are registered. Finally, the .onLoad hook is run if the package defines one, and the name space is sealed. Figure 3 illustrates the dynamic global environment and the internal data base of loaded name spaces after two packages A and B have been loaded by explicit calls to library, and package C has been loaded to satisfy import directives in package B.
The serialization mechanism used to save R workspaces handles name spaces by writing out a reference for a name space consisting of the package name and version. When the workspace is loaded, the reference is used to construct a call to loadNamespace. The current implementation ignores the version information; in the future this may be used to signal compatibility warnings or select among several available versions.
The registry of loaded name spaces can be examined using loadedNamespaces. In the current implementation, loaded name spaces are not unloaded automatically. Detaching a package with a
R News ISSN 1609-3631
Vol 31 June 2003 5
[Figure 3: Internal implementation. The global search path holds .GlobalEnv, the attached packages A and B, and base; the registry of loaded name spaces holds A, B and C, each with its internal definitions, imports, and base frames.]
name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.
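For example, using the splines package (chosen purely for illustration, since it has a name space and nothing in a default session imports it):

```r
## Load a name space without attaching it to the search path,
## inspect the registry, then force unloading.
loadNamespace("splines")
"splines" %in% loadedNamespaces()   # TRUE

unloadNamespace("splines")
"splines" %in% loadedNamespaces()   # FALSE
```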
Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
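For instance, a function in a standard package can be called through its name space without attaching anything (stats::median here is just an illustration):

```r
## Fully qualified reference: package name, double colon, variable.
## Evaluating it loads the stats name space if necessary.
stats::median(c(1, 3, 5))  # 3
```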
Discussion
Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches
for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.
The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.
Some languages, including Ada, ML (Ullman, 1997), and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R, this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful and the associated additional complexity is warranted in R is not yet clear.
Bibliography
John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.
Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu
Converting Packages to S4

by Douglas Bates
Introduction
R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.
'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.
Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class and either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens if not hundreds of methods. (Even that list may be incomplete, because recent versions of some popular packages use name spaces to hide some of the S3 methods defined in the package.)
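The suggested experiment can be run directly; the exact output naturally depends on which packages happen to be attached:

```r
## List all visible S3 methods for the print generic.
m <- methods(print)
length(m)  # typically dozens to hundreds of print.* methods
```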
The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies, John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.
There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.
Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.
S3 versus S4 classes and methods
S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system; at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.
The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class "foo". This can lead to surprising and unpleasant errors for the unwary.
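These dangers are easy to demonstrate (the class name "foo" is, of course, hypothetical):

```r
## Any object can claim class "foo"; nothing validates it.
x <- structure(list(a = 1), class = "foo")

## A method is found purely by its name:
print.foo <- function(x, ...) cat("foo with a =", x$a, "\n")
print(x)  # dispatches to print.foo

## Nothing stops an incompatible object from claiming the class;
## print(y) would then misbehave, with no warning at assignment time.
y <- structure(1:10, class = "foo")
```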
Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.
There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system this inheritance is described by assigning the class c("aov", "lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed there were examples in some packages where the class inheritance was not consistently assigned.
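The class vector on an aov fit shows this object-level inheritance (the iris data set is used here only for illustration):

```r
## Inheritance in S3 is stored on each object, not on the class.
fit <- aov(Sepal.Length ~ Species, data = iris)
class(fit)  # "aov" "lm"
```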
By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class
is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.
S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.
S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a ... argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
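A minimal sketch of these declarations (the class and generic names here are invented for illustration):

```r
## An explicit S4 class definition: slot names and classes are fixed.
setClass("interval", representation(lo = "numeric", hi = "numeric"))

## An explicit generic, and a method tied to an argument signature.
setGeneric("width", function(object, ...) standardGeneric("width"))
setMethod("width", signature(object = "interval"),
          function(object, ...) object@hi - object@lo)

width(new("interval", lo = 1, hi = 3))  # 2
```

Constructing new("interval", lo = "a", hi = 3) would fail validation, unlike assigning an S3 class attribute.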
In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization and, in my opinion, a cleaner structure to the package.
Package conversion
Chambers (1998, ch. 7-8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.
S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.
Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.
Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.
Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is because we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.
We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.
We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and use these classes in both the interpreted language and the compiled language.
The definition of S4 methods is more formal than in S3, but also more flexible because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We
would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
Pitfalls
S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.
We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.
Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.
Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.
One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
Conclusions
Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.
S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.
Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu
The genetics Package

Utilities for handling genetic data
by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side-effect profile.
Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the four DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases, where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.
Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'); sometimes they are simply concatenated (e.g., 'AT').
A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'AT' and 'TA' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of copies of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.
The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample-size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.
My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:
  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=FALSE)
• A single vector with a positional separator:
  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )
• Two separate vectors:
  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )
• A data frame or matrix with two columns:
  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a data frame in a single pass. (See the help page for details.)
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:
categorical: Each allele combination acts differently. This situation is handled by entering the genotype variable without modification into a model, in which case it will be treated as a factor:
  lm( outcome ~ genotype.var + confounder )
additive: The effect depends on the number of copies of a specific allele (0, 1, or 2). The function allele.count(gene, allele) returns the number of copies of the specified allele:
  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )
dominant/recessive: The effect depends only on the presence or absence of a specific allele. The function carrier(gene, allele) returns a boolean flag indicating whether the specified allele is present:
  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in data frames as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the built-in factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class attribute c("genotype", "factor"). The names of the factor levels are constructed as paste(allele1, allele2, sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order, controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:
  allele.x <- attr(x, "allele.map")
  alleles.x <- allele.x[x, which]
where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
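The factor-plus-attributes idea is easy to sketch in a few lines of base R. The following is an illustrative toy version only; the constructor name toy.genotype and its details are invented here, not the package's actual implementation:

```r
# Toy sketch of the storage scheme: a factor whose levels are
# "allele1/allele2" strings, with an allele lookup table in attributes.
toy.genotype <- function(a1, a2) {
  # Place alleles in a consistent order, since most assays cannot
  # tell which chromosome an allele came from.
  pairs <- ifelse(a1 <= a2,
                  paste(a1, a2, sep = "/"),
                  paste(a2, a1, sep = "/"))
  g <- factor(pairs)
  # Two-column translation table: one row per factor level.
  map <- do.call(rbind, strsplit(levels(g), "/"))
  attr(g, "allele.map")   <- map
  attr(g, "allele.names") <- sort(unique(c(a1, a2)))
  class(g) <- c("genotype", "factor")
  g
}

g <- toy.genotype(c("A", "C", "C"), c("C", "A", "C"))
levels(g)                      # "A/C" "C/C"
attr(g, "allele.map")[g, 1]    # first allele of each observation
```

Indexing the allele.map matrix by the factor itself works because a factor used as a subscript is coerced to its integer codes, which is exactly the table lookup described above.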
The genotype class is accompanied by a full complement of helper methods for standard R operators ([, [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:
allele: Extracts individual alleles, as a character matrix.

allele.names: Extracts the set of allele names.

homozygote: Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote: Creates a logical vector indicating whether the alleles of each observation differ.

carrier: Creates a logical vector indicating whether the specified alleles are present.
allele.count: Returns the number of copies of the specified alleles carried by each observation.

getlocus: Extracts locus, gene, or marker information.

makeGenotypes: Converts appropriate columns in a data frame to genotypes or haplotypes.

write.pop.file: Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file: Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

write.marker.file: Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).
The genetics package provides four functions related to Hardy-Weinberg equilibrium:

diseq: Estimates or computes a confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq: Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact: Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test: Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD: Computes pairwise linkage disequilibrium between genetic markers.

LDtable: Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color coded by significance.

LDplot: Plots linkage disequilibrium as a function of marker location.

and one function for sample-size calculation:

gregorius: Computes the probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages and the corresponding package documentation.
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
>
> # Check Hardy-Weinberg equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

       C     T
C           0.12
T   0.12

Scaled Disequlibrium for each allele pair (D')

       C     T
C           0.56
T   0.56

Correlation coefficient for each allele pair (r)

       C      T
C   1.00  -0.56
T  -0.56   1.00

Overall Values:

     Value
D     0.12
D'    0.56
r    -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI           NA's
Overall D      0.121 ( 0.073, 0.159)     0
Overall D'     0.560 ( 0.373, 0.714)     0
Overall r     -0.560 (-0.714,-0.373)     0

           Contains Zero?
Overall D              NO
Overall D'             NO
Overall r              NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137,
N2 = 63, p-value = 3.463e-08
>
> # Check linkage disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables
with more or less than two alleles detected.
These variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t   D        -0.01   -0.03
c104t   D'        0.05    1.00
c104t   Corr     -0.03   -0.21
c104t   X^2       0.16    8.51
c104t   P-value   0.69  0.0035
c104t   n          100     100

a1691g  D                -0.01
a1691g  D'                0.31
a1691g  Corr             -0.08
a1691g  X^2               1.30
a1691g  P-value           0.25
a1691g  n                  100
>
> LDtable(ld)   # graphical display
[Figure: "Linkage Disequilibrium" table plot produced by LDtable, showing -D' (N) and the P-value for each pair of markers (Marker 1 by Marker 2), color coded by significance.]
> # fit a model
> summary(lm( DELTA.BMI ~
+    homozygote(c104t,'C') +
+    allele.count(a1691g, 'G') +
+    c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                         Estimate Std. Error
(Intercept)               -0.1807     0.5996
homozygote(c104t, C)TRUE   1.0203     0.2290
allele.count(a1691g, G)   -0.0905     0.1175
c2249tT/C                  0.4291     0.6873
c2249tT/T                  0.3476     0.5848
                         t value Pr(>|t|)
(Intercept)                -0.30     0.76
homozygote(c104t, C)TRUE    4.46  2.3e-05
allele.count(a1691g, G)    -0.77     0.44
c2249tT/C                   0.62     0.53
c2249tT/T                   0.59     0.55
---
Signif. codes: 0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.
In the future, I intend to add functions to perform haplotype imputation and to generate standard genetics plots.
I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors
by Jürgen Groß
A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model

    y = X β + ε,   X = (1_n, x_2, ..., x_p) of order n × p,   ε ~ (0, σ² I_n),

to diagnose collinearity is to measure how much the variance var(β̂_j) of the j-th element β̂_j of the ordinary least squares estimator β̂ = (X′X)⁻¹X′y is inflated by near linear dependencies of the columns of X. This is done by putting var(β̂_j) in relation to the variance of β̂_j as it would have been when
(a) the nonconstant columns x_2, ..., x_p of X had been centered, and
(b) the centered columns had been orthogonal to each other.
This variance is given by

    v_j = σ² / (x_j′ C x_j),   C = I_n − (1/n) 1_n 1_n′,
where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is

    VIF_j = var(β̂_j) / v_j = σ⁻² var(β̂_j) x_j′ C x_j,   j = 2, ..., p.
It can also be written in the form

    VIF_j = 1 / (1 − R_j²),
where

    R_j² = 1 − (x_j′ M_j x_j) / (x_j′ C x_j),   M_j = I_n − X_j (X_j′ X_j)⁻¹ X_j′,

is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
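The relation VIF_j = 1/(1 − R_j²) can be checked numerically with a small base-R sketch (illustrative code, not from the article; the simulated variables are invented):

```r
# Verify VIF_j = 1/(1 - R_j^2) for the regressor x2 in a toy model.
set.seed(42)
n  <- 100
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, sd = 0.5)     # deliberately correlated with x2
y  <- 1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x2 + x3)

# Left-hand side: sigma^-2 * var(beta_2.hat) * x2' C x2
lhs <- diag(vcov(fit))["x2"] *
       sum((x2 - mean(x2))^2) / summary(fit)$sigma^2
# Right-hand side: 1/(1 - R_2^2), R_2^2 from the auxiliary
# regression of x2 on the remaining regressors (here just x3)
rhs <- 1 / (1 - summary(lm(x2 ~ x3))$r.squared)
all.equal(unname(lhs), rhs)       # TRUE
```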
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors
As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
+      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

       x2       x3       x4       x5
 1.017126 6.774374 4.329862 3.122899
Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

    R̃_j² = 1 − (x_j′ M_j x_j) / (x_j′ x_j)

instead of the centered R_j². This can also be done for the intercept as an independent 'variable'. For the above example we obtain
(Intercept)        x2       x3       x4       x5
   2540.378  2510.568 2.792701 2.489071 4.518498
as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that x_j′ C x_j ≤ x_j′ x_j, so that always R_j² ≤ R̃_j². Hence, the usual critical value R_j² > 0.9 (corresponding to a centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression of the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = c.struct,
                  UncenteredVIF = uc.struct))
    }
    else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = uc.struct))
    }
  }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R Packages under Intel Linux
A Package Developer's Tool
by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, to automate the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting: if a current version of Linux R is in the search path, then

  make CrossCompileBuild

will run all Makefile targets described up to the section Building R packages and bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the Makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the Makefile will still help with many important steps. We describe this in detail in the section on Building R packages and bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources
  make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
  make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools, and unpacks the cross-compiler tools into it.
Prepare source code
  make prepsrc

This rule creates a separate subdirectory WinR, to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
  make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:

  make mkrules

This rule modifies the file src/gnuwin32/Mkrules from the R source tree, to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR as at the end of the last section, run:

  make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in addition to the variables above.
Compiling
Now we can cross-compile R:

  make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.
Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
  make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.
We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

  make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz". This is usually created through the R CMD build command. The name carries the version number together with the package name.
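As a quick illustration of that naming convention (a shell sketch, not code from the Makefile): a pkg- target name maps back to the source tarball and package name like this:

```shell
# A "pkg-" target such as pkg-geepack_0.2-4 corresponds to a source
# tarball pkgsrc/geepack_0.2-4.tar.gz created by R CMD build.
target=pkg-geepack_0.2-4
src=${target#pkg-}              # strip the "pkg-" prefix
tarball=pkgsrc/${src}.tar.gz    # where the Makefile expects the source
pkgname=${src%%_*}              # package name, before the underscore
echo "$tarball"                 # pkgsrc/geepack_0.2-4.tar.gz
echo "$pkgname"                 # geepack
```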
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is
If we did exactly this analysis on the whole population, what result would we get?
If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

    ∑_i (1/π_i) Y_i,
regardless of clustering or stratification This iscalled the HorvitzndashThompson estimator (Horvitz ampThompson 1951) Inverse-probability weighting canbe used in an obvious way to compute populationmeans and higher moments Less obviously it canbe used to estimate the population version of log-likelihoods and estimating functions thus allowingalmost any standard models to be fitted The result-ing estimates answer the basic survey question theyare consistent estimators of the result that would beobtained if the same analysis was done to the wholepopulation
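As a quick illustration (with made-up values, not NHIS data), the Horvitz-Thompson estimator is a one-line computation in R:

```r
## Horvitz-Thompson estimate of a population total:
## weight each observation by the inverse of its
## sampling probability (illustrative values only)
y  <- c(2, 4)       # observed values Y_i
pi <- c(0.5, 0.25)  # sampling probabilities pi_i
sum(y / pi)         # 2/0.5 + 4/0.25 = 20
```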
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

\[ \widehat{\operatorname{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 \]

where S is the weighted sum, the \bar{Y}_s are weighted stratum means, and n_s is the number of clusters in stratum s.
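The formula can be transcribed literally into R as a check on the notation. This helper is purely illustrative (the survey package performs this computation in svyCprod, with further adjustments); it takes n_s to be the number of clusters in stratum s and the weighted stratum mean as above:

```r
## Illustrative, literal transcription of the variance formula
## above; not the survey package's implementation
svy.var <- function(Y, pi, strata, cluster) {
  v <- 0
  for (s in unique(strata)) {
    in.s <- strata == s
    ns   <- length(unique(cluster[in.s]))        # clusters in stratum s
    Ybar <- weighted.mean(Y[in.s], 1 / pi[in.s]) # weighted stratum mean
    v    <- v + ns / (ns - 1) *
              sum(((Y[in.s] - Ybar) / pi[in.s])^2)
  }
  v
}
```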
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
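The zero-weight idea is easy to see for a point estimate: with invented data, zeroing the weights outside the subset gives the same weighted mean as dropping those rows outright; the two approaches differ only in the variance calculation, which is why the original PSU counts must be retained:

```r
## Toy demonstration (invented data): zero weights outside the
## subset give the same point estimate as dropping the rows
y    <- c(5, 7, 9, 11)
w    <- c(2, 2, 3, 3)               # weights 1/pi
keep <- c(TRUE, TRUE, FALSE, TRUE)  # subset of interest
w0   <- ifelse(keep, w, 0)          # zero-weight approach
sum(w0 * y) / sum(w0)               # same as the subset-only mean
sum(w[keep] * y[keep]) / sum(w[keep])
```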
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is:
imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP + POLCTC, design = imdsgn)

produces a table of number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
    pmin(3, POLCTC)
AGEP      0      1      2       3
   0 473079 411177 655211  276013
   1 162985  72498 365445  863515
   2 144000  84519 126126 1057654
   3  89108  68925 110523 1123405
   4 133902 111098 128061 1069026
   5 137122  31668  19027 1121521
   6 127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
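Those percentages can be recovered directly from the printed table; for example, the AGEP = 3 row:

```r
## Row percentages for age 3, using the counts from the table above
age3 <- c(89108, 68925, 110523, 1123405)  # doses 0, 1, 2, 3+
round(100 * age3 / sum(age3))             # 6 5 8 81: about 81% have 3+ doses
```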
To examine whether this varies by city size, we could tabulate

svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP), design = imdsgn, family = binomial)
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP, design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I^{-1} U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames = FALSE, design.summaries = FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D. G., Thompson, D. J. (1951). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated, and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k - R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, \bar{x}_{S_k}, from each row of X:

   \[ \text{new}X[i,] = X[i,] \underbrace{\left( I - \bar{x}_{S_k} (\bar{x}_{S_k}^T \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^T \right)}_{IPx} \]

6. The next optimal cluster is found by repeating the process using newX.
Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to \bar{x}_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

\[ \text{Speedup} = \frac{1}{r_s + \frac{r_p}{n}} \]

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
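Amdahl's law is a one-liner in R. Using the parallel proportion r_p = 0.78 reported for the bootstrap in the results section, the predicted speedups for n = 2 and n = 10 processors agree (up to rounding) with the 1.63 and 3.35 quoted there:

```r
## Expected speedup under Amdahl's law:
## rp = parallel proportion, n = number of processors
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
round(amdahl(0.78, c(2, 10)), 2)  # 1.64 3.36
```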
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

- Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

- Orthogonalization children: each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication
Bootstrapping
Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R^2_b is derived, and the mean of the R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
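For reference, the serial permutation step can be written in a couple of lines (random data for illustration; Dfunct, the authors' R^2 calculation, is not reproduced here):

```r
## Serial within-row permutation, as described above
set.seed(42)  # for reproducibility
X  <- matrix(rnorm(6 * 4), nrow = 6)
## sample() permutes each row; apply() returns each processed row
## as a column, so transpose to restore the original orientation
Xb <- t(apply(X, 1, sample))
```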
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
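A small hypothetical helper shows the first scheme, splitting the row indices into contiguous, nearly even blocks, one per slave (the article's parorth function does this arithmetic inline):

```r
## Hypothetical helper: split row indices 1..nrows into
## ntasks contiguous blocks, one block per slave
row.blocks <- function(nrows, ntasks) {
  split(1:nrows, cut(1:nrows, ntasks, labels = FALSE))
}
row.blocks(10, 2)  # block 1: rows 1-5, block 2: rows 6-10
```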
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2, and returns this to the master process. The master averages the R*^2, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  ## send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  ## get the R squared values back and
  ## take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked and a bootstrap permutation is performed, an R^2 value calculated and returned to the master. The following code performs these operations in the children tasks:
## receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
## perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
## send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:
parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  ## initialize the send buffer
  .PVM.initsend()
  ## pack the IPx matrix and send it to
  ## all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  ## divide the X matrix into blocks and
  ## send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  ## wait for the blocks to be returned
  ## and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends the result back to the master:
## wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
## orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
## send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. Of interest is the proportion of the total computation time used by the bootstrapping and orthogonalization: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results: computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows), for the bootstrapping, orthogonalization, and overall computations, under the different environments (serial, 2 nodes, 16 nodes).
The observed speedup for the two runs of parallel gene-shaving, using the N = 12800 expression matrix, is as follows:

\[ S = \frac{T_{serial}}{T_{parallel}}, \qquad S_{2\,nodes} = \frac{1301.18}{825.34} = 1.6, \qquad S_{16\,nodes} = \frac{1301.18}{549.59} = 2.4 \]

The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help - R's Help Facilities and Manuals

by Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R, and different R related experiences, require different levels of help. Therefore a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or when the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and of compiling and linking this code into R.

Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply them in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
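For example (a minimal sketch, not code from the original column), all of the following display the documentation for mean():

```r
?mean                          # shortcut for help("mean")
help("mean")                   # the same, as an explicit function call
help("mean", htmlhelp = TRUE)  # HTML version, if available
options(htmlhelp = TRUE)       # make HTML help the default for this session
```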
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
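For instance (a minimal sketch):

```r
help.search("linear models")  # fuzzy search over installed documentation
apropos("^read\\.")           # objects whose names start with "read."
```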
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable; see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. Springer-Verlag, New York.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer-Verlag, New York, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics, such as linear models, regression, and analysis of variance, are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
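The clutter the review warns about, and two ways around it, can be sketched in a few lines (hypothetical data frames, not taken from the book):

```r
d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)

attach(d1)
mean(x)            # x is found in d1, i.e. 2
detach(d1)         # clean up before attaching another frame with a column 'x'

with(d2, mean(x))  # scoped alternative that avoids attach() altogether, i.e. 5
```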
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html

This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, the book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of the deparse(substitute()) idiom as a default argument for the title of a plot) is also described.
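The deparse(substitute()) idiom mentioned above can be sketched as follows (an illustration, not code from the book):

```r
# Use the caller's unevaluated argument expression as the default plot title
myplot <- function(y, main = deparse(substitute(y))) {
  plot(y, main = main)
}
myplot(rnorm(10))  # the title reads "rnorm(10)"
```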
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summaries such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material: from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
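As a sketch of the Anova() usage described above (assuming package car is installed; the model is an invented example, not one from the book):

```r
library(car)                      # provides Anova()
fit <- lm(mpg ~ wt * factor(cyl), data = mtcars)
Anova(fit)                        # "Type II" tests (the default)
Anova(fit, type = "III")          # available, but discouraged in book and help page
```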
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, with methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
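In use, the new function looks like this (a sketch; the exact stream reproduced depends on the version string supplied):

```r
set.seed(123)        # uses the new 'Mersenne-Twister' / 'Inversion' defaults
new <- runif(2)

RNGversion("1.6.2")  # restore the generators of an earlier R version
set.seed(123)
old <- runif(2)      # reproduces the stream an R 1.6.2 session would give
```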
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform, and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588)
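A quick illustration of the recursive indexing item above (a sketch, not from the original announcement):

```r
x <- list(a = 1, b = list(p = 2, q = 3))
x[[c(2, 1)]]      # same as x[[2]][[1]], i.e. 2
x[[c("b", "q")]]  # a character vector indexes by name, i.e. 3
```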
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines, and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC), and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
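For instance (a minimal sketch):

```r
out <- capture.output(summary(1:10))
out                   # a character vector, one element per printed line
cat(out, sep = "\n")  # replay the captured output
```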
• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.
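A small invented example of constrOptim() (not from the original announcement): minimize (x − 2)² + (y − 2)² subject to x + y ≤ 1, written in the ui %*% theta >= ci form the function expects.

```r
f <- function(p) sum((p - 2)^2)

# Constraint x + y <= 1 expressed as  -x - y >= -1,  i.e. ui %*% theta >= ci;
# the starting value c(0, 0) lies strictly inside the feasible region
res <- constrOptim(theta = c(0, 0), f = f, grad = NULL,
                   ui = matrix(c(-1, -1), nrow = 1), ci = -1)
res$par  # approximately c(0.5, 0.5), on the boundary of the feasible region
```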
• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace(), and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues(), and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance(), and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing the casewise changed coefficients not to be computed. This makes plot.lm() much quicker for large data sets.

• load() now invisibly returns a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().
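For example (a minimal sketch):

```r
mapply(function(x, y) x + y, 1:3, 4:6)  # 5 7 9
mapply(rep, 1:3, 3:1)                   # a list: rep(1,3), rep(2,2), rep(3,1)
```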
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored, with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems, the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the "wrong" tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
R News ISSN 1609-3631
Vol. 3/1, June 2003
• If a prompt() method is called with 'filename' as NA, a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
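For illustration (not part of the original announcement), r2dtable() takes the number of tables to generate plus the row and column marginals:

```r
## Two random 2 x 2 tables with row sums c(3, 2) and column sums
## c(2, 3); every generated table honours both sets of marginals.
set.seed(1)  # for reproducibility
r2dtable(2, c(3, 2), c(2, 3))
```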
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq.

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
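For illustration (not part of the original announcement):

```r
set.seed(1)  # for reproducibility
## Three draws from a multinomial with 10 trials over 3 cells; each
## column of the result is one draw and sums to 10.
x <- rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))
colSums(x)   # 10 10 10
## Density of one particular outcome:
dmultinom(c(1, 2, 7), prob = c(0.2, 0.3, 0.5))
```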
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
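For illustration (not part of the original announcement), topenv() reports the nearest top-level environment of its caller:

```r
## Called from the command line, the nearest top-level environment
## is the global environment itself.
topenv()
## Inside a function belonging to a package with a name space, it
## would instead return that package's name space environment.
```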
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl/) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis, Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regressions analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw <http://www.fftw.org> and libjpeg <http://www.ijg.org>. By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid not reproduced here.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (13,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
[Figure 3: Internal implementation. The diagram (not reproduced here) shows the global search path (.GlobalEnv, package:A, package:B, ..., package:base) alongside the loaded name spaces for A, B and C, each consisting of internal definitions, imports and exports.]
name space loaded by a call to library removes the frame containing the exported variables of the package from the global search path, but it does not unload the name space. Future implementations may automatically unload name spaces that are no longer needed to satisfy import dependencies. The function unloadNamespace can be used to force unloading of a name space; this can be useful if a new version of the package is to be loaded during development.

Variables exported by a package with a name space can also be referenced using a fully qualified variable reference consisting of the package name and the variable name separated by a double colon, such as foo::f. This can be useful in debugging at the top level. It can also occasionally be useful in source code, for example in cases where functionality is only to be used if the package providing it is available. However, dependencies resulting from use of fully qualified references are not visible from the name space directives and therefore may cause some confusion, so explicit importing is usually preferable. Computing the value of a fully qualified variable reference will cause the specified package to be loaded if it is not loaded already.
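As a small illustration (not from the original article; base is used here simply because it always has a name space), a fully qualified reference names both the package and the variable:

```r
## Fully qualified reference to a variable exported by a package
## with a name space.  If the package's name space were not already
## loaded, evaluating this reference would load it.
base::sum(1:10)   # the sum() exported by the base name space
```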
Discussion
Name spaces provide a means of making R packages more modular. Many languages and systems originally developed without name space or module systems have had some form of name space management added in order to support the development of larger software projects; C++ and Tcl (Welch, 1999) are two examples. Other languages, such as Ada (Barnes, 1998) and Modula-3 (Harbison, 1992), were designed from the start to include powerful module systems. These languages use a range of approaches for specifying name space or module information. The declarative approach, with separation of implementation code and name space information, used in the system included in R 1.7.0 is closest in spirit to the approaches of Modula-3 and Ada. In contrast, Tcl and C++ interleave name space specification and source code.

The current R name space implementation, while already very functional and useful, is still experimental and incomplete. Support for S4 methods still needs to be included. More exhaustive error checking is needed, along with code analysis tools to aid in constructing name space specifications. To aid in package development and debugging, it would also be useful to explore whether strict sealing can be relaxed somewhat without complicating the implementation. Early experiments suggest that this may be possible.

Some languages, including Ada, ML (Ullman, 1997) and the MzScheme Scheme implementation (PLT), provide a richer module structure in which parameterized partial modules can be assembled into complete modules. In the context of R, this might correspond to having a parameterized generic data base access package that is completed by providing a specific data base interface. Whether this additional power is useful, and the associated additional complexity is warranted, in R is not yet clear.
Bibliography
John Barnes. Programming in Ada 95. Addison-Wesley, 2nd edition, 1998.

Samuel P. Harbison. Modula-3. Prentice Hall, 1992.

PLT. PLT MzScheme. World Wide Web, 2003. URL http://www.plt-scheme.org/software/mzscheme/.

Luke Tierney. Compiling R: A preliminary report. In Kurt Hornik and Friedrich Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2001, Technische Universität Wien, Vienna, Austria, 2001. URL http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/. ISSN 1609-395X.

Jeffrey Ullman. Elements of ML Programming. Prentice Hall, 1997.

Brent Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1999.
Luke Tierney
Department of Statistics & Actuarial Science
University of Iowa, Iowa City, Iowa, USA
luke@stat.uiowa.edu
Converting Packages to S4
by Douglas Bates
Introduction
R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information, and methods organize the actions that are applied to these representations.

'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.

Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class and either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete because recent versions of some popular packages use namespaces to hide some of the S3 methods defined in the package.)

The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies, John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.

There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.

Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.
S3 versus S4 classes and methods
S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system; at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.

The informality of S3 classes and methods is convenient, but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class "foo". This can lead to surprising and unpleasant errors for the unwary.
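A small sketch of the danger (this example is mine, not the article's):

```r
## A genuine "lm" object, produced by the fitting function:
fit <- lm(dist ~ speed, data = cars)

## Nothing stops an arbitrary object from claiming the same class:
x <- list(a = 1)
class(x) <- "lm"
## S3 dispatch now sends x to print.lm(), summary.lm(), etc., which
## expect the components of a real fit and will fail or misbehave.
```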
Another disadvantage of using function names toidentify methods is that the class of only one argu-ment the first argument is used to determine themethod that the generic will use Often when creat-ing a plot or when fitting a statistical model we wantto examine the class of more than one argument todetermine the appropriate method but S3 does notallow this
There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class "lm"). We say that class "aov" inherits from class "lm". In the informal S3 system this inheritance is described by assigning the class c("aov","lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed, there were examples in some packages where the class inheritance was not consistently assigned.
By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class
R News ISSN 1609-3631
Vol. 3/1, June 2003
is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.
S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.
S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a "..." argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
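For instance, a class, generic, and method can be declared as in the following sketch (the class and function names here are hypothetical, not taken from nlme):

```r
library(methods)

# An explicit S4 class definition
setClass("temperature", representation(degrees = "numeric"))

# An explicit generic declaration
setGeneric("describe",
           function(object, ...) standardGeneric("describe"))

# A method declared for the signature object = "temperature"
setMethod("describe", signature(object = "temperature"),
          function(object, ...) {
            cat("Temperature of", object@degrees, "degrees\n")
          })

describe(new("temperature", degrees = 21.5))
```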
In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization, and, in my opinion, a cleaner structure to the package.
Package conversion
Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.
S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the methods are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.
Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.
Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.
Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS, and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.
We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.
We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and using these classes in both the interpreted language and the compiled language.
The definition of S4 methods is more formal than in S3, but also more flexible because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We
would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
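A sketch of this "collector" pattern, with hypothetical names (this is not the actual nlme code):

```r
library(methods)

setGeneric("fitModel",
           function(formula, data, ...) standardGeneric("fitModel"))

# The "collector" method: receives the canonical argument form and
# actually does the work (here a stand-in call to lm)
setMethod("fitModel",
          signature(formula = "formula", data = "data.frame"),
          function(formula, data, ...) lm(formula, data))

# A convenience method: accepts a matrix, rearranges its arguments
# into the canonical form, and re-dispatches on the new signature
setMethod("fitModel",
          signature(formula = "formula", data = "matrix"),
          function(formula, data, ...)
            callGeneric(formula, as.data.frame(data), ...))
```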
Pitfalls
S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.
We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.
Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.
Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.
One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call, so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered, to our chagrin, that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
Conclusions
Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.
S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.
Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu
The genetics Package
Utilities for handling genetic data
by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side-effect profile.
Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.
Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').
A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'AT' and 'TA' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.
The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.
My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F)
• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )
• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )
• A data frame or matrix with two columns:
  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a data frame in a single pass. (See the help page for details.)
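For instance, a single-pass conversion might look like the following sketch (the column names are made up for the example):

```r
library(genetics)

df <- data.frame(PID  = 1:4,
                 snp1 = c("A/A", "A/C", "C/C", "A/C"),
                 snp2 = c("T/T", "T/G", "G/G", "T/T"))

# Convert every column that looks like a genotype in one pass
df <- makeGenotypes(df)

class(df$snp1)   # now a genotype variable
```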
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:
categorical Each allele combination acts differently.

This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )
additive The effect depends on the number of copies of a specific allele (0, 1, or 2).

The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )
dominant/recessive The effect depends only on the presence or absence of a specific allele.

The function carrier( gene, allele ) returns a boolean flag indicating whether the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in data frames as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the built-in factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order, controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.
The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:
  alleles.x <- attr(x, "allele.map")
  alleles.x[genotype.var, which]
where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts, factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
The genotype class is accompanied by a full complement of helper methods for standard R operators ([], [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:
allele Extracts individual alleles (as a matrix).

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Converts appropriate columns in a data frame to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).
The genetics package provides four functions related to Hardy-Weinberg equilibrium:
diseq Estimates or computes a confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.
as well as three related to linkage disequilibrium (LD):
LD Computes pairwise linkage disequilibrium between genetic markers.

LDtable Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color coded by significance.

LDplot Plots linkage disequilibrium as a function of marker location.
and one function for sample size calculation:
gregorius Computes the probability of observing all alleles with a given frequency in a sample of a specified size.
The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
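As a quick illustration of the HWE functions, a small sketch with made-up genotype data (assumes the genetics package is installed):

```r
library(genetics)

# A small made-up sample of two-allele genotypes
g <- genotype(c("C/C", "C/T", "T/T", "C/C",
                "C/T", "C/C", "T/T", "C/T"))

# Fisher's exact test for departure from Hardy-Weinberg equilibrium
HWE.exact(g)
```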
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome.
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name = "A1691G",
+          type = "SNP",
+          locus.name = "MBP2",
+          chromosome = 9,
+          arm = "q",
+          index.start = 35,
+          bp.start = 1691,
+          relative.to = "intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

        -----------------------------------
        Test for Hardy-Weinberg-Equilibrium
        -----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

     C    T
C       0.12
T  0.12

Scaled Disequilibrium for each allele pair (D')

     C    T
C       0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap
using 1000 samples

           Observed        95% CI       NA's
Overall D     0.121 ( 0.073, 0.159)        0
Overall D'    0.560 ( 0.373, 0.714)        0
Overall r    -0.560 (-0.714,-0.373)        0

           Contains Zero?
Overall D              NO
Overall D'             NO
Overall r              NO

Significance Test:

        Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137,
N2 = 63, p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables
with more or less than two alleles detected.
These variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr      -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr              -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100
>
> LDtable(ld)   # graphical display
[Figure: "Linkage Disequilibrium" table produced by LDtable(ld), plotting Marker 1 against Marker 2 with -D' (N) and the P-value for each marker pair, color coded by significance: c104t vs. a1691g: -0.0487 (100), P = 0.68780; c104t vs. c2249t: -0.9978 (100), P = 0.00354; a1691g vs. c2249t: -0.3076 (100), P = 0.25439.]
> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                          Estimate Std. Error
(Intercept)                -0.1807     0.5996
homozygote(c104t, C)TRUE    1.0203     0.2290
allele.count(a1691g, G)    -0.0905     0.1175
c2249tT/C                   0.4291     0.6873
c2249tT/T                   0.3476     0.5848
                          t value Pr(>|t|)
(Intercept)                 -0.30     0.76
homozygote(c104t, C)TRUE     4.46  2.3e-05
allele.count(a1691g, G)     -0.77     0.44
c2249tT/C                    0.62     0.53
c2249tT/T                    0.59     0.55
---
Signif. codes: 0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.
In the future, I intend to add functions to perform haplotype imputation and to generate standard genetics plots.
I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.
I welcome comments and contributions
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors

by Jürgen Groß
A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model

    y = X_{n×p} β + ε,   X = (1_n, x_2, ..., x_p),   ε ~ (0, σ² I_n),
to diagnose collinearity is to measure how much the variance var(β̂_j) of the j-th element β̂_j of the ordinary least squares estimator β̂ = (X'X)^(-1) X'y is inflated by near linear dependencies of the columns of X. This is done by putting var(β̂_j) in relation to the variance of β̂_j as it would have been when
(a) the nonconstant columns x_2, ..., x_p of X had been centered, and
(b) the centered columns had been orthogonal to each other.
This variance is given by

    v_j = σ² / (x_j' C x_j),   C = I_n − (1/n) 1_n 1_n',
where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is

    VIF_j = var(β̂_j) / v_j = σ^(−2) var(β̂_j) x_j' C x_j,   j = 2, ..., p.
It can also be written in the form

    VIF_j = 1 / (1 − R_j²),
where

    R_j² = 1 − (x_j' M_j x_j) / (x_j' C x_j),   M_j = I_n − X_j (X_j' X_j)^(−1) X_j',
is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
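The centered VIF can thus be computed directly from the R² of an auxiliary regression. A small sketch with made-up data:

```r
# Made-up data: x4 is nearly a linear combination of x2 and x3
set.seed(42)
x2 <- rnorm(30)
x3 <- rnorm(30)
x4 <- x2 + x3 + rnorm(30, sd = 0.5)

# Auxiliary regression of x4 on the remaining regressors (with intercept)
r2 <- summary(lm(x4 ~ x2 + x3))$r.squared

# Centered VIF for x4: 1 / (1 - R_j^2)
vif.x4 <- 1 / (1 - r2)
vif.x4
```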
The centered variance inflation factor (as well as a generalized variance inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list.
> vif <- function(object, ...)
+     UseMethod("vif")
> vif.default <- function(object, ...)
+     stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
+     V <- summary(object)$cov.unscaled
+     Vi <- crossprod(model.matrix(object))
+     nam <- names(coef(object))
+     k <- match("(Intercept)", nam,
+                nomatch = FALSE)
+     if (k) {
+         v1 <- diag(V)[-k]
+         v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
+         nam <- nam[-k]
+     } else {
+         v1 <- diag(V)
+         v2 <- diag(Vi)
+         warning(paste("No intercept term",
+                       "detected. Results may surprise."))
+     }
+     structure(v1 * v2, names = nam)
+ }
Uncentered variance inflation factors
As pointed out by Belsley (1991, p. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
+     rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are:
      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899
Several runs produce similar results. Usually values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653!
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

    R̃_j² = 1 − (x_j' M_j x_j) / (x_j' x_j)
instead of the centered R_j². This can also be done for the intercept as an independent 'variable'. For the above example we obtain
(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  45.18498
as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that x_j' C x_j ≤ x_j' x_j, so that always R_j² ≤ R̃_j². Hence the usual critical value R_j² > 0.9 (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression of the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model.
> vif.lm <- function(object, ...) {
+     V <- summary(object)$cov.unscaled
+     Vi <- crossprod(model.matrix(object))
+     nam <- names(coef(object))
+     k <- match("(Intercept)", nam,
+                nomatch = FALSE)
+     v1 <- diag(V)
+     v2 <- diag(Vi)
+     uc.struct <- structure(v1 * v2, names = nam)
+     if (k) {
+         v1 <- diag(V)[-k]
+         v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
+         nam <- nam[-k]
+         c.struct <- structure(v1 * v2, names = nam)
+         return(list(CenteredVIF = c.struct,
+                     UncenteredVIF = uc.struct))
+     } else {
+         warning(paste("No intercept term",
+                       "detected. Uncentered VIFs computed."))
+         return(list(UncenteredVIF = uc.struct))
+     }
+ }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux: A Package Developer's Tool
by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting, if a current version of Linux R is in the search path, then
make CrossCompileBuild
will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the Makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the Makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.
These directories can be changed by modifying the Makefile. One may find what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, run:
make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R:
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.
Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.
We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.
We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do
make bundle-VR_7.0-11
The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.
The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R

by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?
If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included: for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is
\[
  \sum_i \frac{1}{\pi_i} Y_i
\]
regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done on the whole population.
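As a concrete sketch of the estimator above (with simulated data, not a real survey), the following draws an independent sample with known unequal inclusion probabilities and applies inverse-probability weighting; all names and numbers here are illustrative.

```r
# Simulated population of 1000 values; units are sampled independently
# with known inclusion probabilities 0.05 or 0.15 (Poisson sampling).
set.seed(42)
N <- 1000
y <- rgamma(N, shape = 2)
p.incl <- rep(c(0.05, 0.15), each = N/2)
s <- rbinom(N, 1, p.incl) == 1           # indicator of sampled units
total.ht <- sum(y[s] / p.incl[s])        # Horvitz-Thompson estimate
c(estimate = total.ht, truth = sum(y))   # the two should be close
```

Repeating the sampling step many times and averaging total.ht would recover sum(y), which is the sense in which the estimator is unbiased.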
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then
\[
  \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i}
    \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2
\]
where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
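The stratum-and-cluster sum in the variance formula can be sketched directly. This is an illustrative implementation under simplifying assumptions (the weighted values z are already formed, and cluster labels are unique across strata); it is not the svyCprod code itself.

```r
# var-hat = sum over strata of n_s/(n_s - 1) times the sum of squared
# deviations of weighted cluster totals from their stratum mean.
clust.var <- function(z, cluster, strata) {
  out <- 0
  for (s in unique(strata)) {
    zs <- tapply(z[strata == s], cluster[strata == s], sum)
    ns <- length(zs)                  # clusters in stratum s
    out <- out + ns / (ns - 1) * sum((zs - mean(zs))^2)
  }
  out
}
# two strata with two clusters each:
v <- clust.var(z = 1:8,
               cluster = c(1, 1, 2, 2, 3, 3, 4, 4),
               strata  = c(1, 1, 1, 1, 2, 2, 2, 2))
v  # 32: each stratum contributes 2/(2-1) * 8
```

Working through the toy numbers by hand (cluster totals 3, 7 and 11, 15) gives the same 32, which is a useful sanity check on the formula.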
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is:
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGEP+POLCTC, design=imdsgn)
produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example, the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
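Step 3 is a small linear-algebra operation; the following is a minimal numeric sketch with made-up values for the information matrix and score contributions (nothing here comes from svycoxph itself):

```r
# Hypothetical information matrix for 2 parameters, and score
# contributions U_i, one row per observation.
Imat <- matrix(c(4, 1,
                 1, 2), nrow = 2, byrow = TRUE)
U <- rbind(c( 0.5, -0.2),
           c(-0.1,  0.3))
# Delta_i = Imat^{-1} U_i, computed for all observations at once
Delta <- t(solve(Imat, t(U)))
Delta %*% Imat   # recovers U, confirming the definition
```

The rows of Delta are the one-step delta-betas that would then be passed to svyCprod along with the design information.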
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G., and Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison, and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³, and p usually no greater than 10².
2. The leading principal component of X is calculated, and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap, and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:
\[
  X_{new}[i,] = X[i,]\,
    \underbrace{\left( I - \bar{x}_{S_k} (\bar{x}_{S_k}' \bar{x}_{S_k})^{-1} \bar{x}_{S_k}' \right)}_{IPx}
\]
6. The next optimal cluster is found by repeating the process using X_new.
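Steps 2-3 can be sketched serially in a few lines of R. This is a hedged illustration of one shaving pass, not the S-PLUS code of Do and Wen (2002): the eigen-gene is taken as the first principal component over samples, and the 10% of rows least correlated with it are dropped.

```r
shave.once <- function(X) {
  eg <- prcomp(t(X))$x[, 1]                    # "eigen-gene": leading PC
  r2 <- apply(X, 1, function(g) cor(g, eg)^2)  # R^2 of each gene with it
  X[r2 > quantile(r2, 0.1), , drop = FALSE]    # shave the lowest 10%
}
set.seed(7)
X <- matrix(rnorm(100 * 8), 100, 8)  # 100 "genes" by 8 samples
nrow(shave.once(X))                  # about [0.9 * 100] = 90 rows remain
```

Iterating shave.once produces the nested sequence X*_1, X*_2, ... described in step 3.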
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
\[
  \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}
\]
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
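As a quick sketch, Amdahl's Law can be coded and evaluated directly; the 80% figure below is just an assumed parallel fraction for illustration, not a measurement from the gene-shaving runs.

```r
# speedup = 1 / (r_s + r_p / n), with r_s + r_p = 1
amdahl <- function(rs, n) 1 / (rs + (1 - rs) / n)
amdahl(rs = 0.2, n = 16)   # a job that is 80% parallel on 16 nodes: 4x
amdahl(rs = 0.2, n = Inf)  # even with unlimited nodes, at most 5x
```

This makes the point in the text concrete: a serial fraction of only 20% already caps a 16-node cluster at a four-fold speedup.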
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.
• Orthogonalization children: each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version, this would be done using R's apply function, whereby each row of X has the sample function applied to it.
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small, and the number of columns is very large.
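The first strategy (even blocks) can be sketched as follows; this is an illustrative helper, not code from the accompanying package, and the function name split.rows is our own.

```r
# Split the rows of X into nslaves approximately even blocks,
# one block per orthogonalization slave.
split.rows <- function(X, nslaves) {
  grp <- cut(seq_len(nrow(X)), nslaves, labels = FALSE)
  lapply(split(seq_len(nrow(X)), grp),
         function(i) X[i, , drop = FALSE])
}
blocks <- split.rows(matrix(1:40, nrow = 10), nslaves = 5)
sapply(blocks, nrow)   # 2 rows per slave
```

Each block would then be packed and sent to one slave, and the orthogonalized blocks reassembled by the master in the same order.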
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R2 values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R2 value is calculated and returned to the master. The following code performs these operations of the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends the result back to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
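These estimates can be checked directly by evaluating Amdahl's law for a parallel fraction p = 0.78 (a sketch; the helper name amdahl is ours):

```r
# Amdahl's law: maximum speedup when a fraction p of the work
# runs in parallel over n processors.
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(0.78, 2)   # close to the 1.63 quoted in the text
amdahl(0.78, 10)  # close to the 3.35 quoted in the text
```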
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for bootstrapping, orthogonalization, and overall gene-shaving, under the different environments (serial, 2 nodes, 16 nodes).
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node, other than the node which runs the master process, will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help - R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with probably its most fundamental topic: getting help.
Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and compiling and linking this code into R.
Occasionally, library(help=PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice - it also accepts regular expressions.
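For example, a few typical lookups (a sketch; the output depends on the packages installed):

```r
# Ways of finding documentation from the R prompt:
help("mean")          # the help page for mean(); same as ?mean
help.search("tests")  # fuzzy search over names, titles and keywords
apropos("mean")       # objects whose names partially match "mean"
```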
Documentation and help in HTML format can also be accessed by help.start(). This function starts a web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked (and answered) several times, focussing on slightly different aspects and hence illuminating a bigger context.
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the volunteer help providers of r-help will probably solve the asker's problem at hand, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
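A small illustration of the problem described here (the data frames are made up):

```r
# Two attached data frames with a common column name: the second
# attach() masks the first, which can silently change results.
d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)
attach(d1)
attach(d2)   # reports that x is masked from d1
x            # this is now d2's x, i.e. 4 5 6
detach(d2)
detach(d1)   # detaching restores a clean search path
```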
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R
Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in it are mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
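The deparse(substitute()) idiom mentioned here can be sketched as follows (the function name myplot is ours, not the book's):

```r
# A default plot title taken from the expression the caller passed in:
# substitute(x) captures the unevaluated argument, deparse() turns it
# into a character string.
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
}
heights <- rnorm(20)
myplot(heights)  # the title reads "heights"
```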
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred that the coverage of variable selection in multiple linear regression be omitted entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics so briefly without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
R News ISSN 1609-3631
Vol. 3/1, June 2003
The only drawback of the book follows directly from its target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new function RNGversion() allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
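The change of default generators and the new RNGversion() function can be exercised in a couple of lines; a small sketch (the exact warning text and the third element reported by RNGkind() depend on the running R version):

```r
# The new defaults
RNGkind()                 # starts with "Mersenne-Twister" "Inversion"
set.seed(42)
runif(3)

# Roll back to the pre-1.7.0 generators for reproducibility
# (warns, because the old normal generator was the buggy Kinderman-Ramage)
RNGversion("1.6.2")
RNGkind()                 # now starts with "Marsaglia-Multicarry"

# Return to the current defaults
RNGversion(as.character(getRversion()))
```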
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4, 2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() in package mva allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method, and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting (suggested by U. Ligges).

• library() was changed so that, when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument, allowing one not to compute the casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns, invisibly, a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as NA, a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NAs, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
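Two of the smaller additions above, recursive list indexing and mapply(), can be tried in a few lines:

```r
# Recursive indexing: x[[c(i, j)]] is shorthand for x[[i]][[j]]
x <- list(a = list(b = 1:3, c = "hello"))
x[[c(1, 1)]]        # same as x[[1]][[1]], i.e. 1:3
x[[c("a", "c")]]    # names work too: "hello"

# mapply(), a multivariate lapply(): rep(1, 3), rep(2, 2), rep(3, 1)
mapply(rep, 1:3, 3:1)
```

Because the results of rep() have different lengths here, mapply() returns them as a list rather than simplifying to a matrix.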
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of the function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and the formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis, and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities for normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid omitted: the numbered squares (1–27) cannot be reproduced in plain text; see the URL given for a printable version.]
Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming, and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley.
Editor Programmer's Niche:
Bill Venables
Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Converting Packages to S4
by Douglas Bates
Introduction
R now has two systems of classes and methods, known informally as the 'S3' and 'S4' systems. Both systems are based on the assignment of a class to an object and the use of generic functions that invoke different methods according to the class of their arguments. Classes organize the representation of information and methods organize the actions that are applied to these representations.
'S3' classes and methods for the S language were introduced in Chambers and Hastie (1992) (see also Venables and Ripley, 2000, ch. 4) and have been implemented in R from its earliest public versions. Because many widely-used R functions, such as print, plot, and summary, are S3 generics, anyone using R inevitably (although perhaps unknowingly) uses the S3 system of classes and methods.
Authors of R packages can take advantage of the S3 class system by assigning a class to the objects created in their package and defining methods for this class, for either their own generic functions or generic functions defined elsewhere, especially those in the base package. Many R packages do exactly this. To get an idea of how many S3 classes are used in R, attach the packages that you commonly use and call methods(print). The result is a vector of names of methods for the print generic. It will probably list dozens, if not hundreds, of methods. (Even that list may be incomplete, because recent versions of some popular packages use namespaces to hide some of the S3 methods defined in the package.)
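For instance, a quick check in a running session (the exact list depends on which packages are attached) looks like this:

```r
# List the S3 methods visible for the print generic;
# each entry is a function named print.<class>.
m <- methods("print")
length(m)  # typically dozens, more with contributed packages
head(m)
```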
The S3 system of classes and methods is both popular and successful. However, some aspects of its design are at best inconvenient and at worst dangerous. To address these inadequacies, John Chambers introduced what is now called the 'S4' system of classes and methods (Chambers, 1998; Venables and Ripley, 2000). John implemented this system for R in the methods package which, beginning with the 1.7.0 release of R, is one of the packages that are attached by default in each R session.
There are many more packages that use S3 classes and methods than use S4. Probably the greatest use of S4 at present is in packages from the Bioconductor project (Gentleman and Carey, 2002). I hope by this article to encourage package authors to make greater use of S4.
Conversion from S3 to S4 is not automatic. A package author must perform this conversion manually, and the conversion can require a considerable amount of effort. I speak from experience: Saikat DebRoy and I have been redesigning the linear mixed-effects models part of the nlme package, including conversion to S4, and I can attest that it has been a lot of work. The purpose of this article is to indicate what is gained by conversion to S4 and to describe some of our experiences in doing so.
S3 versus S4 classes and methods
S3 classes are informal: the class of an object is determined by its class attribute, which should consist of one or more character strings, and methods are found by combining the name of the generic function with the class of the first argument to the function. If a function having this combined name is on the search path, it is assumed to be the appropriate method. Classes and their contents are not formally defined in the S3 system; at best there is a "gentleman's agreement" that objects in a class will have certain structure with certain component names.
The informality of S3 classes and methods is convenient but dangerous. There are obvious dangers in that any R object can be assigned a class, say "foo", without any attempt to validate the names and types of components in the object. That is, there is no guarantee that an object that claims to have class "foo" is compatible with methods for that class. Also, a method is recognized solely by its name, so a function named print.foo is assumed to be the method for the print generic applied to an object of class "foo". This can lead to surprising and unpleasant errors for the unwary.
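A minimal sketch of this informality; the class name "foo" and its components are invented here for illustration:

```r
# S3: any object can claim class "foo"; nothing validates it.
print.foo <- function(x, ...) cat("foo with value", x$value, "\n")

good <- structure(list(value = 42), class = "foo")
bad  <- structure(list(other = 1),  class = "foo")

print(good)  # dispatches to print.foo as intended
print(bad)   # also dispatches, silently using a NULL x$value
```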
Another disadvantage of using function names to identify methods is that the class of only one argument, the first argument, is used to determine the method that the generic will use. Often when creating a plot or when fitting a statistical model we want to examine the class of more than one argument to determine the appropriate method, but S3 does not allow this.
There are more subtle disadvantages to the S3 system. Often it is convenient to describe a class as being a special case of another class; for example, a model fit by aov is a special case of a linear model (class lm). We say that class aov inherits from class lm. In the informal S3 system this inheritance is described by assigning the class c("aov", "lm") to an object fit by aov. Thus the inheritance of classes becomes a property of the object, not a property of the class, and there is no guarantee that it will be consistent across all members of the class. Indeed there were examples in some packages where the class inheritance was not consistently assigned.
By contrast, S4 classes must be defined explicitly. The number of slots in objects of the class, and the names and classes of the slots, are established at the time of class definition. When an object of the class
is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.
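A sketch of what this looks like; the class and slot names here are invented for illustration, not taken from nlme:

```r
library(methods)

# Slot names and slot classes are fixed when the class is defined.
setClass("fitResult", representation(coef = "numeric",
                                     logLik = "numeric"))

# Inheritance is declared in the class definition itself ...
setClass("penalizedFit", representation(penalty = "numeric"),
         contains = "fitResult")

pf <- new("penalizedFit", coef = c(1, 2), logLik = -3.5, penalty = 0.1)
is(pf, "fitResult")  # TRUE for every object of the class
validObject(pf)      # checks the slots against the definition

# ... and creating an object with a wrongly-typed slot fails:
try(new("fitResult", coef = "not numeric", logLik = 0))
```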
S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named "ANY" and "missing" can be used in the signature.
S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a ... argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
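A sketch of these declarations, with invented names; note that the signature can constrain several arguments at once, which S3 dispatch on the first argument cannot do:

```r
library(methods)

# An explicit generic declaration:
setGeneric("describe",
           function(x, y, ...) standardGeneric("describe"))

# Methods are declared with a signature naming argument classes;
# "missing" (and "ANY") are allowed in the signature.
setMethod("describe", signature(x = "numeric", y = "numeric"),
          function(x, y, ...) "two numeric vectors")

setMethod("describe", signature(x = "numeric", y = "missing"),
          function(x, y, ...) "one numeric vector")

describe(1:3, 4:6)  # selects the numeric,numeric method
describe(1:3)       # selects the numeric,missing method
```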
In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization and, in my opinion, a cleaner structure to the package.
Package conversion
Chambers (1998, ch. 7 and 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.
S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.
Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.
Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.
Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.
We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.
We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and to use these classes in both the interpreted language and the compiled language.
The definition of S4 methods is more formal than in S3, but also more flexible because of the ability to match an argument signature, rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We
would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
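The pattern can be sketched as follows, with invented names (fitModel here is not the nlme generic, and the "model fit" is reduced to a least-squares solve):

```r
library(methods)

# Canonical, wordy form: the "collector" method does the real work.
setGeneric("fitModel",
           function(response, design, ...) standardGeneric("fitModel"))

setMethod("fitModel", signature(response = "numeric", design = "matrix"),
          function(response, design, ...) {
            list(coef = qr.coef(qr(design), response))
          })

# Convenience method: rearranges its arguments into the canonical
# form and re-dispatches via callGeneric().
setMethod("fitModel", signature(response = "numeric",
                                design = "data.frame"),
          function(response, design, ...)
            callGeneric(response, as.matrix(design), ...))

fitModel(c(2, 4, 6), data.frame(x = c(1, 2, 3)))$coef  # slope 2
```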
Pitfalls
S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.
We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.
Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.
Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.
One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call, so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
Conclusions
Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.
S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4 and, as more experience is gained, we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.
J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.
R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.
W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.
Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu
The genetics Package
Utilities for handling genetic data
by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side-effect profile.
Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases, where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.
Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'); sometimes they are simply concatenated (e.g., 'AT').
A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'A/T' and 'T/A' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of copies of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.
The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.
My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F )
• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )
• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )
• A dataframe or matrix with two columns:
  gm <- cbind(
         c('A','A','C','C','','A','A','A'),
         c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:
categorical Each allele combination acts differently.

  This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )
additive The effect depends on the number of copies of a specific allele (0, 1, or 2).

  The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )
dominant/recessive The effect depends only on the presence or absence of a specific allele.

  The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles, when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.
The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:
  allele.x <- attr(x, "allele.map")
  allele.x[genotype.var, which]
where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
The genotype class is accompanied by a full complement of helper methods for standard R operators ( [], [<-, ==, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:
allele Extracts individual alleles (as a matrix).
allele.names Extracts the set of allele names.
homozygote Creates a logical vector indicating whether both alleles of each observation are the same.
heterozygote Creates a logical vector indicating whether the alleles of each observation differ.
carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.
getlocus Extracts locus, gene, or marker information.
makeGenotypes Converts appropriate columns in a dataframe to genotypes or haplotypes.
write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.
write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).
write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).
The genetics package provides four functions related to Hardy-Weinberg Equilibrium:
diseq Estimates or computes a confidence interval for the single-marker Hardy-Weinberg disequilibrium.
HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.
HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.
HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.
as well as three related to linkage disequilibrium (LD):
LD Computes pairwise linkage disequilibrium between genetic markers.
LDtable Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color-coded by significance.
LDplot Plots linkage disequilibrium as a function of marker location.
and one function for sample size calculation:
gregorius Computes the probability of observing all alleles with a given frequency in a sample of a specified size.
The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

      C     T
C          0.12
T  0.12

Scaled Disequlibrium for each allele pair (D')

      C     T
C          0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C   1.00 -0.56
T  -0.56  1.00

Overall Values:

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI            NA's
Overall D     0.121  ( 0.073,  0.159)  0
Overall D'    0.560  ( 0.373,  0.714)  0
Overall r    -0.560  (-0.714, -0.373)  0

            Contains Zero?
Overall D   NO
Overall D'  NO
Overall r   NO

Significance Test:

	Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137,
N2 = 63, p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display
Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr      -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100
a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr              -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100
>
> LDtable(ld)   # graphical display

[Figure: "Linkage Disequilibrium" table plot of (-)D' (N) and p-value for each marker pair, color-coded by significance: c104t vs. a1691g: -0.0487 (100), p = 0.68780; c104t vs. c2249t: -0.9978 (100), p = 0.00354; a1691g vs. c2249t: -0.3076 (100), p = 0.25439.]
> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max
 -2.9818  -0.5917  -0.0303   0.6666   2.7101

Coefficients:
                           Estimate Std. Error
(Intercept)                 -0.1807     0.5996
homozygote(c104t, "C")TRUE   1.0203     0.2290
allele.count(a1691g, "G")   -0.0905     0.1175
c2249tT/C                    0.4291     0.6873
c2249tT/T                    0.3476     0.5848
                           t value Pr(>|t|)
(Intercept)                  -0.30     0.76
homozygote(c104t, "C")TRUE    4.46  2.3e-05
allele.count(a1691g, "G")    -0.77     0.44
c2249tT/C                     0.62     0.53
c2249tT/T                     0.59     0.55
---
Signif. codes: 0 '***' 0.001 '**' 0.01
  '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion

The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.
In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model

$$ y = \underset{n \times p}{X}\,\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$

to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when

(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$

where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is

$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\,x_j' C x_j, \qquad j = 2, \ldots, p. $$
It can also be written in the form

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$

where

$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$

is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
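To make the identity concrete, here is a small R sketch (not from the article; the data and all object names are invented) that checks $\mathrm{VIF}_j = 1/(1 - R_j^2)$ against the definition via $\mathrm{var}(\hat\beta_j)$:

```r
## Sketch: numerically verify VIF_j = 1/(1 - R_j^2) on invented data.
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, sd = 0.5)   # deliberately correlated with x2
y  <- 1 + x2 + x3 + rnorm(n)

## R_j^2: regress x2 on the remaining regressors (with intercept)
r2   <- summary(lm(x2 ~ x3))$r.squared
vif2 <- 1 / (1 - r2)

## Definition: sigma^-2 * var(beta_2) * x2' C x2
fit <- lm(y ~ x2 + x3)
alt <- vcov(fit)["x2", "x2"] / summary(fit)$sigma^2 *
         sum((x2 - mean(x2))^2)

all.equal(vif2, alt)   # TRUE
```

The two routes agree exactly (up to rounding), since the identity is purely algebraic.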
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
    UseMethod("vif")

> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if (k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors

As pointed out by Belsley (1991, p. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 147.4653?
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

$$ \tilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$

instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)       x2       x3       x4       x5
   2540.378 2510.568 27.92701 24.89071 4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \tilde{R}_j^2$. Hence, the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
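A quick R sketch (invented data, not from the article) illustrating this inequality for one regressor:

```r
## Sketch: the centered R^2 never exceeds the uncentered version,
## because the centered denominator x'Cx is at most x'x.
set.seed(2)
n  <- 30
x  <- 5 + rnorm(n, sd = 0.1)   # nearly constant, collinear with the intercept
z  <- rnorm(n)
mx <- resid(lm(x ~ z))          # x'M x via residuals on the other regressors
r2.centered   <- 1 - sum(mx^2) / sum((x - mean(x))^2)
r2.uncentered <- 1 - sum(mx^2) / sum(x^2)
r2.centered <= r2.uncentered    # TRUE
```

With such a nearly constant column the gap is large: the uncentered $\tilde{R}^2$ is close to 1 while the centered $R^2$ stays small, mirroring the example above.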
Conclusion

When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if (k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(CenteredVIF = c.struct,
                  UncenteredVIF = uc.struct))
    } else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(UncenteredVIF = uc.struct))
    }
  }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting: if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:

• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find what exactly each step does by looking into the Makefile.
Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools, and unpacks the cross-compiler tools into that subdirectory.
Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, run

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.

Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R

by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed, and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included: for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$, and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is

$$ \sum_i \frac{1}{\pi_i} Y_i $$

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.

It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
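As a toy illustration of inverse-probability weighting (invented data, not from the article), the Horvitz-Thompson total is just a weighted sum over the sample:

```r
## Sketch: Horvitz-Thompson estimation of a population total
## under Poisson sampling with known, unequal inclusion probabilities.
set.seed(42)
pop <- rexp(1000)                   # the (normally unknown) population
p   <- ifelse(pop > 1, 0.2, 0.05)   # deliberately oversample large values
s   <- runif(1000) < p              # who gets sampled
ht  <- sum(pop[s] / p[s])           # sum of Y_i / pi_i over the sample
## ht is unbiased for sum(pop), despite the unequal sampling
```

Averaging ht over repeated samples recovers the true total sum(pop), which is the design-based sense in which the estimator is unbiased.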
Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).

If we index strata by $s$, clusters by $c$, and observations by $i$, then

$$ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_c \left( \sum_i \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 $$

where $S$ is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.

Standard errors for statistics other than means are developed from a first-order Taylor series expansion: that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
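The stratified between-cluster calculation can be sketched directly in R. This is illustrative only, not the svyCprod implementation, and all names here (vhat.total, st, cl, w, y) are invented:

```r
## Illustrative sketch: sum over strata of n_s/(n_s - 1) times the sum of
## squared weighted per-cluster residual totals.
vhat.total <- function(strata, cluster, w, y) {
  out <- 0
  for (s in unique(strata)) {
    i    <- strata == s
    ybar <- weighted.mean(y[i], w[i])                      # weighted stratum mean
    zc   <- tapply(w[i] * (y[i] - ybar), cluster[i], sum)  # per-cluster totals
    ns   <- length(zc)                                     # clusters in stratum
    out  <- out + ns / (ns - 1) * sum(zc^2)
  }
  out
}

## Example: two strata with three clusters each, two observations per cluster
set.seed(3)
d <- data.frame(st = rep(1:2, each = 6),
                cl = rep(1:6, each = 2),
                w  = rep(c(2, 4), 6),
                y  = rnorm(12))
v <- vhat.total(d$st, d$cl, d$w, d$y)
```

The estimate is a sum of squares scaled by positive factors, so it is always nonnegative, as a variance estimate must be.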
Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.
The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object, created by the svydesign function. A simple example call is:
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGE.P+POLCT.C, design=imdsgn)

produces a table of number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGE.P+pmin(3,POLCT.C), design=imdsgn)
     pmin(3, POLCT.C)
AGE.P      0      1      2       3
    0 473079 411177 655211  276013
    1 162985  72498 365445  863515
    2 144000  84519 126126 1057654
    3  89108  68925 110523 1123405
    4 133902 111098 128061 1069026
    5 137122  31668  19027 1121521
    6 127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate

svytable(~AGE.P+pmin(3,POLCT.C)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example, the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCT.C>=3)~factor(AGE.P)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCT.C>=3)~factor(AGE.P)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights $1/\pi_i$ (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas $\Delta_i$, or compute them from the information matrix $I$ and score contributions $U_i$ as $\Delta = I^{-1} U_i$.

4. Pass $\Delta_i$ and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References

Horvitz, D.G., Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison, and Ian A. Mason
Introduction
The Beowulf cluster Becker et al (1995) Scyld Com-puting Corporation (1998) is a recent advance incomputing technology that harnesses the power ofa network of desktop computers using communica-tion software such as PVM Geist et al (1994) and MPIMessage Passing Interface Forum (1997) Whilst thepotential of a computing cluster is obvious expertisein programming is still developing in the statisticalcommunity
Recent articles in R-news Li and Rossini (2001)and Yu (2002) entice statistical programmers to con-sider whether their solutions could be effectively cal-culated in parallel Another R package SNOW Tier-ney (2002) Rossini et al (2003) aims to skillfullyprovide a wrapper interface to these packages in-dependent of the underlying cluster communicationmethod used in parallel computing This article con-centrates on RPVM and wishes to build upon the con-tribution of Li and Rossini (2001) by taking an exam-ple with obvious orthogonal components and detail-ing the R code necessary to allocate the computationsto each node of the cluster The statistical techniqueused to motivate our RPVM application is the gene-shaving algorithm Hastie et al (2000ba) for whichS-PLUS code has been written by Do and Wen (2002)to perform the calculations serially
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000a,b) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.
2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k - R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_Sk, from each row of X:

   newX[i,] = X[i,] (I - x_Sk (x_Sk^T x_Sk)^{-1} x_Sk^T)

   where the parenthesized projection factor is denoted IPx.
6. The next optimal cluster is found by repeating the process using newX.
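To make steps 2 and 3 concrete, a single serial shaving step can be sketched in a few lines of R. This is an illustrative sketch only; the function name shave.once, the use of svd(), and the exact 10% rule are our own choices, not the authors' code:

```r
# One shaving step: compute the leading principal component of X
# (the "eigen-gene", a vector of length p) and drop the 10% of rows
# (genes) whose squared correlation with it is lowest.
shave.once <- function(X) {
  Xc <- sweep(X, 1, rowMeans(X))          # centre each gene (row)
  eigen.gene <- svd(Xc)$v[, 1]            # leading right singular vector
  r2 <- as.vector(cor(t(X), eigen.gene))^2
  X[r2 > quantile(r2, 0.10), , drop = FALSE]
}
```

Calling shave.once() repeatedly yields the successive reduced matrices X*_1, X*_2, and so on.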
Figure 1 Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_Sk.
In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by:
Speedup = 1 / (r_s + r_p/n)
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
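As a sketch (our own helper function, not from the paper's code), Amdahl's law translates directly into R; with the serial proportion r_s = 0.22 measured for this algorithm it gives speedups of about 1.64 and 3.36 for n = 2 and n = 10, in line with the estimates quoted in this article:

```r
# Amdahl's law: estimated speedup when the parallel fraction
# rp = 1 - rs of a program runs on n processors.
amdahl <- function(rs, n) 1 / (rs + (1 - rs) / n)

amdahl(0.22, 2)    # approximately 1.64
amdahl(0.22, 10)   # approximately 3.36
```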
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children – each slave task performs one permutation of the bootstrapping process.
• Orthogonalization children – each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
[Diagram: a master process connected to slaves 1, 2, 3, ..., n.]

Figure 2: The master-slave paradigm, without slave-slave communication.
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
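In serial form the within-row permutation amounts to only a few lines of R; this sketch is our own (with the R² computation omitted) and mirrors what each slave task does:

```r
# Produce one bootstrap matrix X* by independently permuting the
# elements within each row of X. Illustrative serial sketch.
permute.rows <- function(X) {
  Xb <- X
  for (i in 1:nrow(X))
    Xb[i, ] <- X[i, sample(ncol(X))]
  Xb
}
```

Equivalently, t(apply(X, 1, sample)) gives the apply-based formulation mentioned above.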
Orthogonalization
The master R process distributes a portion of the matrix X to each of its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
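For illustration, the projection matrix can be built directly from its definition in step 5; make.IPx is a hypothetical helper of ours, not part of the paper's code:

```r
# Build IPx = I - x (x'x)^{-1} x' for a mean gene x of length p.
# Any row multiplied by IPx becomes orthogonal to x.
make.IPx <- function(x.Sk) {
  x <- matrix(x.Sk, ncol = 1)
  diag(length(x.Sk)) - x %*% solve(t(x) %*% x) %*% t(x)
}
```

A row X[i,] %*% IPx then has zero inner product with x.Sk, which is exactly what each orthogonalization slave computes.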
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*²_i, and returns this to the master process. The master averages R*²_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations in the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=FALSE)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. Of interest is the proportion of the total computation time used by the bootstrapping and orthogonalization: if these proportions are significant, the steps are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
[Plot: computation time (seconds) against size of expression matrix (rows), with curves for overall gene-shaving, bootstrapping, and orthogonalization.]

Figure 3: Serial gene-shaving results.
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization there is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time of no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
[Plots: computation time (seconds) against size of expression matrix (rows), in three panels - Bootstrap Comparisons, Orthogonalization Comparisons, and Overall Comparisons - each showing serial, 2 node and 16 node times.]

Figure 4: Comparison of computation times for variable sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments.
The observed speedup for the two runs of parallel gene-shaving, using the N = 12800 expression matrix, is as follows:
S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing the bootstrap than expected. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.
Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A. and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.
Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.
Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D. and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L. and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.
Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.
Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.
Rossini, A., Tierney, L. and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.
Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.
Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.
Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
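For example, some typical calls to these facilities (the output will vary with the packages installed):

```r
## Examples of R's help facilities:
help("seq")              # same as ?seq: show the help page for seq()
help.search("quantile")  # search names, titles and keywords
apropos("mean")          # objects whose names partially match "mean"
```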
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.
R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.
R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.
R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.
R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.
Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.
Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics, such as linear models, regression and analysis of variance, are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques, such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics, are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues, such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R". Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences, and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences between the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R
Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers
at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc.; the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
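A minimal sketch (not taken from the book) of the deparse(substitute()) idiom mentioned above; the function name with_title is hypothetical:

```r
## The default for 'main' becomes the text of the expression the
## caller supplied, e.g. "rnorm(100)".
with_title <- function(x, main = deparse(substitute(x))) {
  main  # in real use this would be passed on, e.g. plot(x, main = main)
}
with_title(rnorm(100))  # "rnorm(100)"
```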
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory
data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from its target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
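A small sketch of using RNGversion() to reproduce an old simulation stream (recent R versions will emit a warning when pinning pre-3.6.0 generators):

```r
## Pin the generators to those of R 1.6.2, then draw as before.
RNGversion("1.6.2")
set.seed(1)
old <- runif(3)  # matches the stream an R 1.6.2 user would have seen
```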
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).
• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).
• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)
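A quick illustration of the padding behavior described above:

```r
## A names vector shorter than the data is padded with NA.
x <- 1:4
names(x) <- c("a", "b")
names(x)  # "a" "b" NA NA
```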
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.
• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch; it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4, 2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
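A short example of the recursive indexing just described:

```r
## x[[c(i, j)]] descends the list one level per index.
x <- list(list("a", "b"), list("c", list("d", "e")))
x[[c(2, 2, 1)]]  # same as x[[2]][[2]][[1]], i.e. "d"
```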
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.
• The postscript output makes use of relative moves, and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
• New function as.difftime() for time-interval data.
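A sketch of converting plain numbers into "difftime" objects with as.difftime():

```r
## Interpret the numbers as minutes; the result carries its units.
d <- as.difftime(c(30, 90), units = "mins")
as.numeric(d, units = "hours")  # 0.5 1.5
```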
• basename() and dirname() are now vectorized.
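For instance, a whole vector of paths can now be split at once:

```r
p <- c("/tmp/a.R", "/usr/local/lib/b.R")
basename(p)  # "a.R" "b.R"
dirname(p)   # "/tmp" "/usr/local/lib"
```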
• biplot.default() in package mva allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
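A minimal example of capture.output() collecting printed output as a character vector:

```r
## Each printed line becomes one element of the result.
out <- capture.output(print("hi"))
out  # '[1] "hi"'
```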
• ccf() (package ts) now coerces its x and y arguments to class ts.
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., it works for "dist" objects.
• Generic function confint() and an 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example() (suggested by Andy Liaw).
• New function file.symlink() to create symbolic file links where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
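A sketch of the standard use of force(): evaluating an argument eagerly so a closure captures its current value (make_adder is a hypothetical name):

```r
make_adder <- function(n) {
  force(n)          # evaluate n now rather than lazily on first call
  function(x) x + n
}
add2 <- make_adder(2)
add2(5)  # 7
```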
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.
load() now uses gzcon(), so it can read compressed saves from suitable connections.
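A small round-trip sketch: write gzip-compressed text, then read it back through gzcon() wrapped around an ordinary binary connection:

```r
f <- tempfile()
con <- gzfile(f, "w")
writeLines("hello", con)
close(con)
## gzcon() adds transparent decompression to the file connection:
con2 <- gzcon(file(f, "rb"))
readLines(con2)  # "hello"
close(con2)
```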
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.
To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
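A tiny example of isoreg(): the fitted values are the closest non-decreasing sequence to the data (pooling the violating pair 3, 2 into 2.5, 2.5):

```r
ir <- isoreg(c(1, 3, 2, 4))
ir$yf  # 1.0 2.5 2.5 4.0
```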
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting (suggested by U. Ligges).
• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument, allowing one not to compute the casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now returns, invisibly, a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
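A minimal mapply() example, applying a function elementwise over several argument vectors in parallel:

```r
## Adds corresponding elements: 1+10, 2+20, 3+30.
mapply(function(x, y) x + y, 1:3, c(10, 20, 30))  # 11 22 33
```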
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() in package eda now has an 'na.rm' argument. (PR#2298)
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization, by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.
• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n×m inputs with m >> n.
• princomp() no longer allows the use of more variables than units: use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)
• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
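For example, each sampled table below has row sums (3, 7) and column sums (4, 6):

```r
set.seed(1)
tabs <- r2dtable(2, r = c(3, 7), c = c(4, 6))  # list of 2 random 2x2 tables
rowSums(tabs[[1]])  # 3 7
colSums(tabs[[1]])  # 4 6
```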
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().
• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().
• require() has a new argument named character.only, to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
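A quick sketch: each column of rmultinom() is one multinomial draw, so the counts in each column sum to the given size:

```r
set.seed(42)
x <- rmultinom(3, size = 10, prob = c(0.2, 0.5, 0.3))  # 3 draws of 10 items
colSums(x)  # 10 10 10
dmultinom(c(1, 2, 1), prob = c(0.2, 0.5, 0.3))  # probability of this outcome
```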
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
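A brief slice.index() example: each element of the result gives that element's index along the chosen margin:

```r
a <- array(1:24, dim = c(2, 3, 4))
s <- slice.index(a, 2)   # same shape as a; values are the dim-2 index
all(s[, 2, ] == 2)       # TRUE: everything in slice 2 is labelled 2
```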
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NAs, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments (requested by Michael Friendly).
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added a "[" method for "terms" objects.
• New argument 'silent' to try().
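With silent = TRUE, try() suppresses the error message while still returning a "try-error" object:

```r
res <- try(stop("oops"), silent = TRUE)  # no message printed
inherits(res, "try-error")  # TRUE
```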
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently it is little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
– New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).
– Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility of ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages, using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the
name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages, using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.
If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++, written in the Scythe Statistical Library, Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind: it takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel
amap Hierarchical Clustering and Principal Compo-nent Analysis (With general and robusts meth-ods) By Antoine Lucas
anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad
climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad
dispmod Functions for modelling dispersion inGLM By Luca Scrucca
effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox
gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway
genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch
glmmML A Maximum Likelihood approach tomixed models By Goumlran Brostroumlm
gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta
grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz
gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others
hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally
hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald
ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson
locfit Local Regression Likelihood and density esti-mation By Catherine Loader
mimR An R interface to MIM for graphical mod-elling in R By Soslashren Hoslashjsgaard
mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley
multidim multidimensional descritptive statisticsfactorial methods and classification Originalby Andreacute Carlier amp Alain Croquette R port byMathieu Ros amp Antoine Lucas
normalp A collection of utilities refered to normalof order p distributions (General Error Distri-butions) By Angelo M Mineo
plspcr Multivariate regression by PLS and PCR ByRon Wehrens
polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg
rimage This package provides functions for imageprocessing including sobel filter rank filtersfft histogram equalization and reading JPEGfile This package requires fftw httpwwwfftworg and libjpeg httpwwwijgorggtBy Tomomi Takashina
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini, and Na Li.
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou, and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with numbered squares 1–27 omitted; see the URL above for a printable version.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (25)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
is created, and at some points during computation with the object, it is validated against the definition. Inheritance of classes is also specified at the time of class definition and thus becomes a property of the class, not a (possibly inconsistent) property of objects in the class.
S4 also requires formal declarations of methods, unlike the informal system of using function names to identify a method in S3. An S4 method is declared by a call to setMethod, giving the name of the generic and the "signature" of the arguments. The signature identifies the classes of one or more named arguments to the generic function. Special meta-classes named ANY and missing can be used in the signature.
S4 generic functions are automatically created when a method is declared for an existing function, in which case the function becomes generic and the current definition becomes the default method. A new generic function can be declared explicitly by a call to setGeneric. When a method for the generic is declared with setMethod, the number, names, and order of its arguments are matched against those of the generic. (If the generic has a '...' argument, the method can add new arguments, but it can never omit or rename arguments of the generic.)
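To make the mechanics concrete, here is a minimal sketch; the 'interval' class is invented for illustration and does not appear in the article:

```r
library(methods)

## a small example class with two numeric slots
setClass("interval", representation(lo = "numeric", hi = "numeric"))

## declaring a method for the existing function 'mean' turns it into an
## S4 generic; the previous definition becomes the default method
setMethod("mean", signature(x = "interval"),
          function(x, ...) (x@lo + x@hi) / 2)

mean(new("interval", lo = 0, hi = 4))  # dispatches on the S4 class: 2
mean(1:5)                              # default method still works: 3
```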
In summary, the S4 system is much more formal regarding classes, generics, and methods than is the S3 system. This formality means that more care must be taken in the design of classes and methods for S4. In return, S4 provides greater security, a more well-defined organization and, in my opinion, a cleaner structure to the package.
Package conversion
Chambers (1998, ch. 7, 8) and Venables and Ripley (2000, ch. 5) discuss creating new packages based on S4. Here I will concentrate on converting an existing package from S3 to S4.
S4 requires formal definitions of generics, classes, and methods. The generic functions are usually the easiest to convert. If the generic is defined externally to the package, then the package author can simply begin defining methods for the generic, taking care to ensure that the argument sequences of the method are compatible with those of the generic. As described above, assigning an S4 method for, say, coef automatically creates an S4 generic for coef with the current externally-defined coef function as the default method. If an S3 generic was defined internally to the package, then it is easy to convert it to an S4 generic.
Converting classes is less easy. We found that we could use the informal set of classes from the S3 version as a guide when formulating the S4 classes, but we frequently reconsidered the structure of the classes during the conversion. We frequently found ourselves adding slots during the conversion, so we had more slots in the S4 classes than components in the S3 objects.
Increasing the number of slots may be an inevitable consequence of revising the package (we tend to add capabilities more frequently than we remove them), but it may also be related to the fact that S4 classes must be declared explicitly, and hence we must consider the components or slots of the classes, and the relationships between the classes, more carefully.
Another reason that we incorporated more slots in the S4 classes than in the S3 prototype is that we found it convenient that the contents of slots in S4 objects can easily be accessed and modified in C code called through the .Call interface. Several C macros for working with S4 classed objects, including GET_SLOT, SET_SLOT, MAKE_CLASS and NEW (all described in Chambers (1998)), are available in R. The combination of the formal classes of S4, the .Call interface, and these macros allows a programmer to manipulate S4 classed objects in C code nearly as easily as in R code. Furthermore, when the C code is called from a method, the programmer can be confident of the classes of the objects passed in the call and the classes of the slots of those objects. Much of the checking of classes or modes, and possible coercion of modes, that is common in C code called from R can be bypassed.
We found that we would initially write methods in R, then translate them into C if warranted. The nature of our calculations, frequently involving multiple decompositions and manipulations of sections of arrays, was such that the calculations could be expressed in R, but not very cleanly. Once we had the R version working satisfactorily, we could translate into C the parts that were critical for performance or were awkward to write in R. An important advantage of this mode of development is that we could use the same slots in the C version as in the R version and create the same types of objects to be returned.
We feel that defining S4 classes and methods in R, then translating parts of method definitions to C functions called through .Call, is an extremely effective mode for numerical computation. Programmers who have experience working in C++ or Java may initially find it more convenient to define classes and methods in the compiled language and perhaps define a parallel series of classes in R. (We did exactly that when creating the Matrix package for R.) We encourage such programmers to try instead this method of defining only one set of classes, the S4 classes in R, and use these classes in both the interpreted language and the compiled language.
The definition of S4 methods is more formal than in S3 but also more flexible, because of the ability to match an argument signature rather than being constrained to matching just the class of the first argument. We found that we used this extensively when defining methods for generics that fit models. We
would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
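A sketch of this collector pattern (the fitModel generic and the lm stand-in are hypothetical, not code from the package being described):

```r
library(methods)

setGeneric("fitModel",
           function(formula, data, ...) standardGeneric("fitModel"))

## the "collector" method matches the canonical (wordy) argument form
## and does the actual work -- here lm() stands in for the real fitter
setMethod("fitModel",
          signature(formula = "formula", data = "data.frame"),
          function(formula, data, ...) lm(formula, data = data))

## a convenience method: rearrange the arguments into the canonical
## form and re-dispatch, rather than duplicating the fitting code
setMethod("fitModel",
          signature(formula = "formula", data = "list"),
          function(formula, data, ...)
            fitModel(formula, data = as.data.frame(data), ...))

fit <- fitModel(y ~ x, list(x = 1:10, y = 2 * (1:10)))
```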
Pitfalls
S4 classes and methods are powerful tools. With these tools, a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.
We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system, methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.
Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately, there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.
Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.
One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call, so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric, reassign the call slot, and return this object. We discovered, to our chagrin, that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
Conclusions
Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.
S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11–16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.
Douglas Bates
University of Wisconsin–Madison, USA
bates@stat.wisc.edu
The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.
Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.
Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').
A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'A/T' and 'T/A' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.
The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.
My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:

g1 <- genotype( c('A/A','A/C','C/C','C/A',
                  NA,'A/A','A/C','A/C') )

g3 <- genotype( c('A A','A C','C C','C A',
                  '','A A','A C','A C'),
                sep=' ', remove.spaces=F )
• A single vector with a positional separator:

g2 <- genotype( c('AA','AC','CC','CA','',
                  'AA','AC','AC'), sep=1 )
• Two separate vectors:

g4 <- genotype(
        c('A','A','C','C','','A','A','A'),
        c('A','C','C','A','','A','C','C')
      )
• A dataframe or matrix with two columns:
gm <- cbind(
        c('A','A','C','C','','A','A','A'),
        c('A','C','C','A','','A','C','C') )
g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:
categorical Each allele combination acts differently. This situation is handled by entering the genotype variable without modification into a model; in this case it will be treated as a factor:

lm( outcome ~ genotype.var + confounder )
additive The effect depends on the number of copies of a specific allele (0, 1 or 2). The function allele.count( gene, allele ) returns the number of copies of the specified allele:

lm( outcome ~ allele.count(genotype.var,'A')
    + confounder )
dominant/recessive The effect depends only on the presence or absence of a specific allele. The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

lm( outcome ~ carrier(genotype.var,'A')
    + confounder )
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.
The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:
allele.x <- attr(x, "allele.map")
allele.x[genotype.var, which]
where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
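The representation described above can be mimicked directly; this is a simplified illustration only, since the real genotype constructor also reorders alleles and validates its input:

```r
## build a toy genotype-like object by hand
g <- factor(c("A/A", "A/C", "C/C", "A/C"))          # levels: A/A, A/C, C/C
attr(g, "allele.names") <- c("A", "C")
attr(g, "allele.map")   <- cbind(c("A", "A", "C"),  # first allele per level
                                 c("A", "C", "C"))  # second allele per level
class(g) <- c("genotype", "factor")

## indexing the map by the underlying factor codes recovers, e.g.,
## the first allele of each observation
allele1 <- attr(g, "allele.map")[as.integer(g), 1]
## allele1 is "A" "A" "C" "A"
```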
The genotype class is accompanied by a full complement of helper methods for standard R operators ( [, [<-, ==, etc. ) and object methods ( summary, print, is.genotype, as.genotype, etc. ). Additional functions for manipulating genotypes include:

allele Extracts individual alleles as a character matrix.

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).
The genetics package provides four functions related to Hardy-Weinberg equilibrium:

diseq Estimate or compute confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.
as well as three related to linkage disequilibrium (LD):

LD Compute pairwise linkage disequilibrium between genetic markers.

LDtable Generate a graphical table showing the LD estimate, number of observations and p-value for each marker combination, color coded by significance.

LDplot Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Probability of observing all alleles with a given frequency in a sample of a specified size.
The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTABMI c104t a1691g c2249t
1 1127409     0.62   C/C    G/G    T/T
2  246311     1.31   C/C    A/A    T/T
3  295185     0.15   C/C    G/G    T/T
4   34301     0.72   C/T    A/A    T/T
5   96890     0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)
Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)
-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------
Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

       C     T
C         0.12
T   0.12

Scaled Disequilibrium for each allele pair (D')

       C     T
C         0.56
T   0.56

Correlation coefficient for each allele pair (r)

       C     T
C   1.00 -0.56
T  -0.56  1.00

Overall Values

     Value
  D   0.12
  D'  0.56
  r  -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI             NA's
Overall D      0.121 ( 0.073,  0.159)      0
Overall D'     0.560 ( 0.373,  0.714)      0
Overall r     -0.560 (-0.714, -0.373)      0

            Contains Zero?
Overall D               NO
Overall D'              NO
Overall r               NO

Significance Test:

    Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137,
N2 = 63, p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTABMI
in: LD.data.frame(data)
> ld  # text display

Pairwise LD
-----------
                 a1691g  c2249t
c104t  D          -0.01   -0.03
c104t  D'          0.05    1.00
c104t  Corr.      -0.03   -0.21
c104t  X^2         0.16    8.51
c104t  P-value     0.69  0.0035
c104t  n            100     100

a1691g D                  -0.01
a1691g D'                  0.31
a1691g Corr.              -0.08
a1691g X^2                 1.30
a1691g P-value             0.25
a1691g n                    100
>
> LDtable(ld)  # graphical display
[Figure: output of LDtable(ld), a color-coded graphical table titled "Linkage Disequilibrium", with rows for Marker 1 (c104t, a1691g) and columns for Marker 2 (a1691g, c2249t); each cell shows (-)D' with the number of observations (N) and the p-value for the marker pair, color coded by significance.]
> # fit a model
> summary(lm( DELTABMI ~
+             homozygote(c104t,'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTABMI ~ homozygote(c104t, "C") +
   allele.count(a1691g, "G") + c2249t,
   data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                           Estimate Std. Error
(Intercept)                 -0.1807     0.5996
homozygote(c104t, "C")TRUE   1.0203     0.2290
allele.count(a1691g, "G")   -0.0905     0.1175
c2249tT/C                    0.4291     0.6873
c2249tT/T                    0.3476     0.5848
                           t value Pr(>|t|)
(Intercept)                  -0.30     0.76
homozygote(c104t, "C")TRUE    4.46  2.3e-05
allele.count(a1691g, "G")    -0.77     0.44
c2249tT/C                     0.62     0.53
c2249tT/T                     0.59     0.55
---
Signif. codes: 0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage equilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future, I intend to add functions to perform haplotype imputation and to generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.
I welcome comments and contributions
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors
by Jürgen Groß
A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model
$$y = X\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n),$$
where $X$ is $n \times p$, to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat{\beta}_j)$ of the $j$-th element $\hat{\beta}_j$ of the ordinary least squares estimator $\hat{\beta} = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat{\beta}_j)$ in relation to the variance of $\hat{\beta}_j$ as it would have been when
(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.
This variance is given by
$$v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n',$$
where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
$$\mathrm{VIF}_j = \frac{\mathrm{var}(\hat{\beta}_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat{\beta}_j)\,x_j' C x_j, \qquad j = 2, \ldots, p.$$
It can also be written in the form
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
where
$$R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j',$$
is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
      UseMethod("vif")
> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")

> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if (k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
      } else {
          v1 <- diag(V)
          v2 <- diag(Vi)
          warning(paste("No intercept term",
                        "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors
As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 * x2 + 1 * x3 + 2 * x4 + 4 * x5 +
      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors obtained via vif.lm(cl.lm) are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 147.4653?
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
$$\widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}$$
instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain
(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde{R}_j^2$. Hence, the usual critical value $R_j^2 > 0.9$ (corresponding to centered $\mathrm{VIF}_j > 10$) for harmful collinearity should be greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      v1 <- diag(V)
      v2 <- diag(Vi)
      uc.struct <- structure(v1 * v2, names = nam)
      if (k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
          c.struct <- structure(v1 * v2, names = nam)
          return(list(CenteredVIF = c.struct,
                      UncenteredVIF = uc.struct))
      }
      else {
          warning(paste("No intercept term",
                        "detected. Uncentered VIFs computed."))
          return(list(UncenteredVIF = uc.struct))
      }
  }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
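For concreteness, here is a sketch of how the extended function might be called, assuming the vif generic defined earlier and the cl.lm fit from the simulated example above are in the workspace:

```r
# dispatches to the extended vif.lm method
res <- vif(cl.lm)
res$CenteredVIF    # centered VIFs (intercept excluded)
res$UncenteredVIF  # uncentered VIFs (intercept included)
```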
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF: Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux: A Package Developer's Tool
by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed), placing all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area and unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, run

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, ending with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?
If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included: for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$, and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is
$$\sum_i \frac{1}{\pi_i} Y_i,$$
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis were done on the whole population.
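The Horvitz-Thompson estimator is straightforward to compute directly; here is a minimal sketch with invented values and sampling probabilities:

```r
# hypothetical sample: observed values and sampling probabilities
y  <- c(10, 12, 9, 15)
pi <- c(0.1, 0.1, 0.2, 0.2)

# Horvitz-Thompson estimate of the population total
sum(y / pi)  # 340
```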
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance: the former decreasing it, and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by $s$, clusters by $c$, and observations by $i$, then
$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2,$$
where $S$ is the weighted sum, the $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
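For intuition, the between-cluster form of this variance can be sketched in a few lines of base R. This is a simplified illustration, not the svyCprod implementation: the function name and arguments are hypothetical, and no finite population correction is applied.

```r
# Stratified between-cluster variance of a weighted total:
# within each stratum, form inverse-probability weighted cluster
# totals, then take n_s/(n_s - 1) times their sum of squared
# deviations from the stratum mean.
svy.total.var <- function(y, pi, stratum, cluster) {
  z <- y / pi                 # inverse-probability weighted values
  v <- 0
  for (s in unique(stratum)) {
    zc <- tapply(z[stratum == s], cluster[stratum == s], sum)
    ns <- length(zc)          # number of clusters in stratum s
    v  <- v + ns / (ns - 1) * sum((zc - mean(zc))^2)
  }
  v
}
```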
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used the NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
     pmin(3, POLCTC)
AGEP       0      1      2       3
   0  473079 411177 655211  276013
   1  162985  72498 365445  863515
   2  144000  84519 126126 1057654
   3   89108  68925 110523 1123405
   4  133902 111098 128061 1069026
   5  137122  31668  19027 1121521
   6  127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example, the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
    dm <- 2*(x - mean)/(2*sd^2)
    ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights $1/\pi_i$ (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas $\Delta_i$, or compute them from the information matrix $I$ and score contributions $U_i$ as $\Delta_i = I^{-1} U_i$.

4. Pass $\Delta_i$ and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other meth-ods such as residuals as is shown for Pearson resid-uals in residualssvyglm
A second important type of extension is to addprediction methods for small-area extrapolation ofsurveys That is given some variables measured in asmall geographic area (a new cluster) predict othervariables using the regression relationship from thewhole survey The predict methods for survey re-gression methods current give predictions only forindividuals
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G. and Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
R News ISSN 1609-3631
Vol. 3/1, June 2003
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and builds upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³ and p usually no greater than 10².
2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_Sk, from each row of X:

   newX[i,] = X[i,] (I − x_Sk (x_Sk^T x_Sk)⁻¹ x_Sk^T)

   where the projection matrix in parentheses is denoted IPx.
6. The next optimal cluster is found by repeating the process using newX.
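Steps 2 and 3 can be sketched in a few lines of plain R. This is an illustrative reconstruction on a small random matrix, not the authors' implementation:

```r
# Illustrative sketch of one "shave" (steps 2-3) on made-up data
set.seed(1)
N <- 100; p <- 8
X <- matrix(rnorm(N * p), nrow = N)
Xc <- sweep(X, 1, rowMeans(X))       # centre each gene (row)
eigen.gene <- svd(Xc)$v[, 1]         # leading principal component of X
r2 <- apply(Xc, 1, function(g) cor(g, eigen.gene)^2)
keep <- r2 > quantile(r2, 0.1)       # shave the lowest 10% of R^2
Xstar <- X[keep, , drop = FALSE]     # [0.9 N] genes remain
```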
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_Sk.
In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
Speedup = 1 / (r_s + r_p/n)
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children – each slave task performs one permutation of the bootstrapping process.
• Orthogonalization children – each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication (a master task connected to slaves 1, 2, 3, ..., n)
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
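The serial row-wise permutation presumably looks something like the following sketch (the article does not show the serial code, so this is an assumed form):

```r
# Assumed form of the serial within-row permutation using apply
set.seed(2)
X <- matrix(rnorm(5 * 8), nrow = 5)
Xb <- t(apply(X, 1, sample))   # each row permuted independently
# every row of Xb holds the same values as the matching row of X
```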
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent the R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
R News ISSN 1609-3631
Vol 31 June 2003 23
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
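The block strategy can be sketched with a small helper; split.rows is our own hypothetical name, not part of the article's code:

```r
# Hypothetical helper: split a matrix into (nearly) even row blocks,
# one block per slave
split.rows <- function(X, tasks) {
  grp <- cut(seq_len(nrow(X)), tasks, labels = FALSE)
  lapply(split(seq_len(nrow(X)), grp),
         function(i) X[i, , drop = FALSE])
}
blocks <- split.rows(matrix(1:24, nrow = 8), 4)  # 4 blocks of 2 rows each
```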
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages R*²_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations in the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends the result back to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved by parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results; computation time in seconds against size of expression matrix in rows, for gene-shaving overall, bootstrapping and orthogonalization
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
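The arithmetic can be checked directly in R (the quoted 1.63 and 3.35 appear to be truncated rather than rounded values; amdahl is our own helper name):

```r
# Amdahl's law: speedup = 1 / (r_s + r_p/n), with r_p = 0.78 as above
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(0.78, 2)    # about 1.64
amdahl(0.78, 10)   # about 3.36
```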
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time; in fact, six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times for variable-sized expression matrices, for bootstrapping, orthogonalization and the overall algorithm, under the different environments (serial, 2 nodes, 16 nodes); computation time in seconds against size of expression matrix in rows
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion (almost 80%) of it can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the one which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm, which can fully utilise each cluster node, will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS spring joint computer conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A. and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D. and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L. and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L. and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and compiling and linking this code into R.
Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Besides the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
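For example (the exact output depends on which packages are installed):

```r
# search installed documentation for a topic (fuzzy matching by default)
help.search("linear model")
# find objects by partial name, here using a regular expression
apropos("^mean")
```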
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of
keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, and hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker; but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews
Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
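The problem the reviewer describes is easy to reproduce in R; the toy data frames below are our own invention, not from the book:

```r
# two invented data frames sharing a column name
d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)
attach(d1)
attach(d2)   # masks d1's x on the search path
x            # now refers to d2$x, not d1$x
detach(d2)
detach(d1)   # explicit detach restores a clean search path
```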
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2, base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
R News ISSN 1609-3631
Vol. 3/1, June 2003
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard
Introductory Statistics with R
Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers
at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
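The deparse(substitute()) idiom mentioned above can be sketched as follows (a hypothetical function, not one from the book):

```r
# The default 'main' captures the unevaluated argument and uses its
# text as the plot title, so myplot(heights) is titled "heights".
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
}
heights <- c(1.7, 1.8, 1.6, 1.9)
myplot(heights)
```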
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data and the F test for equality of variances.
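The distribution functions covered in chapter two follow R's d/p/q/r naming scheme; for example, for the normal distribution:

```r
dnorm(0)        # density of the standard normal at 0
pnorm(1.96)     # cumulative probability, roughly 0.975
qnorm(0.975)    # the matching quantile, roughly 1.96
rnorm(3)        # three pseudo-random normal deviates
```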
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox
An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory
data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
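The distinction that Anova() addresses can be seen in base R alone (car is not needed for this sketch): anova() gives sequential tests that depend on term order, while drop1() tests each term against the full model.

```r
# Sequential (order-dependent) versus marginal tests on a built-in data set.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
anova(fit)              # sequential sums of squares, order-dependent
drop1(fit, test = "F")  # each term dropped from the full model
```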
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
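The generator change and RNGversion() can be exercised as follows (a minimal sketch; the exact numbers drawn are of course version-dependent):

```r
set.seed(123)          # seeds the new "Mersenne-Twister" default
new <- runif(5)

RNGversion("1.6.2")    # restore the pre-1.7.0 generators
set.seed(123)
old <- runif(5)        # reproduces output of the earlier version

RNGkind("default", "default")  # back to the current defaults
```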
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform, and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.

• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)
• An attempt to open() an already open connection will be detected and ignored, with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the "wrong tail" in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n×m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.
bull str() now shows ordered factors differentfrom unordered ones It also differentiatesNA and ascharacter(NA) also for factorlevels
bull symnum() has a new logical argumentlsquoabbrcolnamesrsquo
bull summary(ltlogicalgt) now mentions NArsquos assuggested by Goumlran Brostroumlm
bull summaryRprof() now prints times with a pre-cision appropriate to the sampling intervalrather than always to 2dp
bull New function Sysgetpid() to get the processID of the R session
bull table() now allows exclude= with factor argu-ments (requested by Michael Friendly)
bull The tempfile() function now takes an op-tional second argument giving the directoryname
bull The ordering of terms for
termsformula(keeporder=FALSE)
is now defined on the help page and used con-sistently so that repeated calls will not alter theordering (which is why deleteresponse()was failing see the bug fixes) The formula isnot simplified unless the new argument lsquosim-plifyrsquo is true
bull added [ method for terms objects
bull New argument lsquosilentrsquo to try()
bull ts() now allows arbitrary values for y instartend = c(x y) it always allowed y lt1 but objected to y gt frequency
bull uniquedefault() now works for POSIXct ob-jects and hence so does factor()
bull Package tcltk now allows return values fromthe R side to the Tcl side in callbacks and theR_eval command If the return value from theR function or expression is of class tclObj thenit will be returned to Tcl
bull A new HIGHLY EXPERIMENTAL graphicaluser interface using the tcltk package is pro-vided Currently little more than a proof ofconcept It can be started by calling R -g Tk(this may change in later versions) or by eval-uating tkStartGUI() Only Unix-like systemsfor now It is not too stable at this point in par-ticular signal handling is not working prop-erly
bull Changes to support name spaces
ndash Placing base in a name spacecan no longer be disabled bydefining the environment variableR_NO_BASE_NAMESPACE
ndash New function topenv() to determine thenearest top level environment (usuallyGlobalEnv or a name space environ-ment)
ndash Added name space support for packagesthat do not use methods
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
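A few of the additions above can be tried directly at the prompt; for instance recursive list indexing, mapply(), and capture.output():

```r
x <- list(a = 1, b = list(c = 3, d = 4))
x[[c(2, 2)]]                  # same as x[[2]][[2]], i.e. 4

mapply(rep, 1:3, 3:1)         # a multivariate lapply()

out <- capture.output(summary(1:10))
out                           # the printed summary, as a character vector
```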
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class "slides".
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include: allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996), and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities related to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini, and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou, and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with numbered cells 1-27 omitted; see the URL above for a printable version.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
would define the "canonical" form of the arguments, which frequently was a rather wordy form, and one "collector" method that matched this form of the arguments. The collector method is the one that actually does the work, such as fitting the model. All the other methods are designed to take more conveniently expressed forms of the arguments and rearrange them into the canonical form. Our subjective impression is that the resulting methods are much easier to understand than those in the S3 version.
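The pattern can be sketched with a toy generic (invented here for illustration; real model-fitting generics are of course more elaborate). The numeric method is the "collector" that does the work; the matrix method merely rearranges its argument into canonical form and re-dispatches:

```r
library(methods)

setGeneric("center", function(x, ...) standardGeneric("center"))

## "collector" method: the canonical form that does the work
setMethod("center", signature(x = "numeric"),
          function(x, ...) x - mean(x))

## convenience method: rearranges its argument into the
## canonical form, then re-dispatches via callGeneric()
setMethod("center", signature(x = "matrix"),
          function(x, ...) {
              x[] <- callGeneric(as.numeric(x), ...)
              x
          })
```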
Pitfalls
S4 classes and methods are powerful tools. With these tools a programmer can exercise fine-grained control over the definition of classes and methods. This encourages a programming style where many specialized classes and methods are defined. One of the difficulties that authors of packages then face is documenting all these classes, generics, and methods.

We expect that generic functions and classes will be described in detail in the documentation, usually in a separate documentation file for each class and each generic function, although in some cases closely related classes could be described in a single file. It is less obvious whether, how, and where to document methods. In the S4 system, methods are associated with a generic function and an argument signature. It is not clear if a given method should be documented separately, or in conjunction with the generic function, or with a class definition.

Any of these places could make sense for some methods. All of them have disadvantages. If all methods are documented separately there will be an explosion of the number of documentation files to create and maintain. That is an unwelcome burden. Documenting methods in conjunction with the generic can work for internally defined generics, but not for those defined externally to the package. Documenting methods in conjunction with a class only works well if the method signature consists of one argument only, but part of the power of S4 is the ability to dispatch on the signature of multiple arguments.

Others working with S4 packages, including the Bioconductor contributors, are faced with this problem of organizing documentation. Gordon Smyth and Vince Carey have suggested some strategies for organizing documentation, but more experience is definitely needed.
One problem with creating many small methods that rearrange arguments and then call callGeneric() to get eventually to some collector method is that the sequence of calls has to be unwound when returning the value of the call. In the case of a model-fitting generic, it is customary to preserve the original call in a slot or component named call, so that it can be displayed in summaries and also so it can be used to update the fitted model. To preserve the call, each of the small methods that just manipulate the arguments must take the result of callGeneric(), reassign the call slot, and return this object. We discovered to our chagrin that this caused the entire object, which can be very large in some of our examples, to be copied in its entirety as each method call was unwound.
Conclusions
Although the formal structure of the S4 system of classes and methods requires greater discipline by the programmer than does the informal S3 system, the resulting clarity and security of the code makes S4 worthwhile. Moreover, the ability in S4 to work with a single class structure in R code and in C code to be called by R is a big win.

S4 encourages creating many related classes and many methods for generics. Presently this creates difficult decisions on how to organize documentation and how to unwind nested method calls without unwanted copying of large objects. However, it is still early days with regard to the construction of large packages based on S4, and as more experience is gained we expect that knowledge of the best practices will be disseminated in the community, so we can all benefit from the S4 system.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

J. M. Chambers and T. J. Hastie. Statistical Models in S. Chapman & Hall, London, 1992. ISBN 0-412-83040-X.

R. Gentleman and V. Carey. Bioconductor. R News, 2(1):11-16, March 2002. URL http://CRAN.R-project.org/doc/Rnews/.

W. N. Venables and B. D. Ripley. S Programming. Springer, 2000. URL http://www.stats.ox.ac.uk/pub/MASS3/Sprog/. ISBN 0-387-98966-8.

Douglas Bates
University of Wisconsin-Madison, USA
Bates@stat.wisc.edu
The genetics Package

Utilities for handling genetic data

by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.

Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases, where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.

Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').

A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'A/T' and 'T/A' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.

The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.
My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:

g1 <- genotype( c('A/A','A/C','C/C','C/A',
                NA,'A/A','A/C','A/C') )

g3 <- genotype( c('A A','A C','C C','C A',
                '','A A','A C','A C'),
                sep=' ', remove.spaces=F)
• A single vector with a positional separator:

g2 <- genotype( c('AA','AC','CC','CA','',
                'AA','AC','AC'), sep=1 )
• Two separate vectors:

g4 <- genotype(
        c('A','A','C','C','','A','A','A'),
        c('A','C','C','A','','A','C','C')
      )
• A dataframe or matrix with two columns:
gm <- cbind(
        c('A','A','C','C','','A','A','A'),
        c('A','C','C','A','','A','C','C') )
g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:
categorical Each allele combination acts differently.

This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

lm( outcome ~ genotype.var + confounder )
additive The effect depends on the number of copies of a specific allele (0, 1, or 2).

The function allele.count(gene, allele) returns the number of copies of the specified allele:

lm( outcome ~ allele.count(genotype.var,'A')
    + confounder )
dominant/recessive The effect depends only on the presence or absence of a specific allele.

The function carrier(gene, allele) returns a boolean flag if the specified allele is present:

lm( outcome ~ carrier(genotype.var,'A')
    + confounder )
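The three codings can be seen side by side on a small invented genotype vector; a sketch assuming the genetics package is installed and loaded:

```r
library(genetics)

g <- genotype(c('A/A','A/C','C/C','C/A','A/A'))

# categorical: one factor level per allele combination
table(g)

# additive: number of 'A' alleles carried -- 2 1 0 1 2
allele.count(g, 'A')

# dominant/recessive: is at least one 'A' present?
# TRUE TRUE FALSE TRUE TRUE
carrier(g, 'A')
```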
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1, allele2, sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order, controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.

The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:
allele.x <- attr(x, "allele.map")
allele.x[genotype.var, which]
where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
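The factor-based representation described above can be inspected directly; a short sketch, again assuming the genetics package is loaded (the exact level ordering may differ):

```r
library(genetics)

g <- genotype(c('A/A','A/C','C/C'))

class(g)                 # c("genotype", "factor")
levels(g)                # allele pairs pasted with sep="/"
attr(g, "allele.names")  # the set of allele names
attr(g, "allele.map")    # two-column matrix, one row per level
```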
The genotype class is accompanied by a full complement of helper methods for standard R operators ([], [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:

allele Extracts individual alleles, as a character matrix.

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).
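A quick tour of the extraction helpers on a toy genotype vector (a sketch assuming the genetics package is installed; the expected results follow from the definitions above):

```r
library(genetics)

g <- genotype(c('A/A','A/C','C/C'))

allele(g, 1)          # first allele of each observation
allele.names(g)       # "A" "C"
homozygote(g)         # TRUE FALSE TRUE
heterozygote(g)       # FALSE TRUE FALSE
carrier(g, 'C')       # FALSE TRUE TRUE
allele.count(g, 'C')  # 0 1 2
```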
The genetics package provides four functions related to Hardy-Weinberg equilibrium:

diseq Estimate or compute confidence intervals for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Compute pairwise linkage disequilibrium between genetic markers.

LDtable Generate a graphical table showing the LD estimate, number of observations and p-value for each marker combination, color coded by significance.

LDplot Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Computes the probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages and the corresponding package documentation.
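To give a flavor of what the chi-square HWE test computes, the statistic can be reproduced from observed genotype counts in a few lines of base R. This is only a sketch of the textbook calculation (counts chosen to match the example below); the package's HWE.chisq and HWE.test handle the general case and provide bootstrap intervals:

```r
# Observed genotype counts for a two-allele (C/T) marker
obs <- c(CC = 59, CT = 19, TT = 22)
n   <- sum(obs)

# Estimated frequency of the C allele
p <- (2 * obs[["CC"]] + obs[["CT"]]) / (2 * n)

# Expected counts under Hardy-Weinberg proportions
expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)

# Chi-square statistic, 1 df for a two-allele marker
X2 <- sum((obs - expected)^2 / expected)
X2                                      # approx. 31.3
pchisq(X2, df = 1, lower.tail = FALSE)  # approx. 2.2e-08
```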
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

      C     T
C          0.12
T  0.12

Scaled Disequilibrium for each allele pair (D')

      C     T
C          0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed          95% CI    NA's
Overall D     0.121  ( 0.073,  0.159)      0
Overall D'    0.560  ( 0.373,  0.714)      0
Overall r    -0.560  (-0.714, -0.373)      0

            Contains Zero?
Overall D     *NO*
Overall D'    *NO*
Overall r     *NO*

Significance Test:

        Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137,
N2 = 63, p-value = 3.463e-08
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables
with more or less than two alleles detected.
These variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld  # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr.     -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n        100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr.             -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                100
> LDtable(ld)  # graphical display

[Figure: "Linkage Disequilibrium" matrix plot produced by LDtable, with rows and columns labeled Marker 1 and Marker 2 (c104t, a1691g, c2249t); each cell shows (-)D' and the number of observations (N) above the corresponding p-value, color coded by significance.]
> # fit a model
> summary(lm( DELTA.BMI ~
+    homozygote(c104t,'C') +
+    allele.count(a1691g, 'G') +
+    c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                          Estimate Std. Error
(Intercept)                -0.1807     0.5996
homozygote(c104t, C)TRUE    1.0203     0.2290
allele.count(a1691g, G)    -0.0905     0.1175
c2249tT/C                   0.4291     0.6873
c2249tT/T                   0.3476     0.5848
                          t value Pr(>|t|)
(Intercept)                 -0.30     0.76
homozygote(c104t, C)TRUE     4.46  2.3e-05
allele.count(a1691g, G)     -0.77     0.44
c2249tT/C                    0.62     0.53
c2249tT/T                    0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
    '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees
    of freedom
Multiple R-Squared: 0.176,
    Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
    p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future, I intend to add functions to perform haplotype imputation and to generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors

by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model

  y = X β + ε,   X = (1_n, x_2, ..., x_p),   ε ~ (0, σ² I_n),

where X is n × p, to diagnose collinearity is to measure how much the variance var(β̂_j) of the j-th element β̂_j of the ordinary least squares estimator β̂ = (X′X)⁻¹ X′y is inflated by near linear dependencies of the columns of X. This is done by putting var(β̂_j) in relation to the variance of β̂_j as it would have been when

(a) the nonconstant columns x_2, ..., x_p of X had been centered, and

(b) the centered columns had been orthogonal to each other.

This variance is given by

  v_j = σ² / (x′_j C x_j),   C = I_n − (1/n) 1_n 1′_n,

where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is

  VIF_j = var(β̂_j) / v_j = σ⁻² var(β̂_j) x′_j C x_j,   j = 2, ..., p.
It can also be written in the form

  VIF_j = 1 / (1 − R²_j),

where

  R²_j = 1 − (x′_j M_j x_j) / (x′_j C x_j),   M_j = I_n − X_j (X′_j X_j)⁻¹ X′_j,

is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
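The identity VIF_j = 1/(1 − R²_j) can be checked numerically. The following minimal sketch (not from the article; the data are made up) computes the centered VIF of one regressor from the auxiliary regression of that regressor on the others:

```r
# Minimal sketch (hypothetical data): centered VIF of x2 via the
# auxiliary regression of x2 on the remaining regressors
# (intercept included).
set.seed(1)
n  <- 50
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- x2 + x3 + rnorm(n)   # induces some collinearity with x2

r2   <- summary(lm(x2 ~ x3 + x4))$r.squared  # centered R^2_j
vif2 <- 1 / (1 - r2)                         # centered VIF of x2
vif2
```

Since R²_j lies in [0, 1), the resulting VIF is always at least 1, matching the remark above about the smallest possible VIF.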
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
vif <- function(object, ...)
  UseMethod("vif")

vif.default <- function(object, ...)
  stop("No default method for vif. Sorry.")

vif.lm <- function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object))
  k <- match("(Intercept)", nam,
             nomatch = FALSE)
  if (k) {
    v1 <- diag(V)[-k]
    v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
    nam <- nam[-k]
  } else {
    v1 <- diag(V)
    v2 <- diag(Vi)
    warning(paste("No intercept term",
                  "detected. Results may surprise."))
  }
  structure(v1 * v2, names = nam)
}
Uncentered variance inflation factors

As pointed out by Belsley (1991, p. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899
Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

  R̃²_j = 1 − (x′_j M_j x_j) / (x′_j x_j)

instead of the centered R²_j. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)        x2       x3       x4       x5
   2540.378  2510.568 2.792701 2.489071 4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that x′_j C x_j ≤ x′_j x_j, so that always R²_j ≤ R̃²_j. Hence the usual critical value R²_j > 0.9 (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
vif.lm <- function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object))
  k <- match("(Intercept)", nam,
             nomatch = FALSE)
  v1 <- diag(V)
  v2 <- diag(Vi)
  uc.struct <- structure(v1 * v2, names = nam)
  if (k) {
    v1 <- diag(V)[-k]
    v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
    nam <- nam[-k]
    c.struct <- structure(v1 * v2, names = nam)
    return(list(Centered.VIF = c.struct,
                Uncentered.VIF = uc.struct))
  } else {
    warning(paste("No intercept term",
                  "detected. Uncentered VIFs computed."))
    return(list(Uncentered.VIF = uc.struct))
  }
}
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux

A Package Developer's Tool

by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the Makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area

We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.

Create work area

The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:

- downloads is the location to store all the sources needed for cross-building.

- cross-tools is the location to unpack the cross-building tools.

- WinR is the location to unpack the R source and cross-build R.

- LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
- pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

- WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources

make down

This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup

make xtools

This rule creates a separate subdirectory cross-tools in the build area for the cross-tools, and unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code

make prepsrc

This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed

make LinuxR

This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R

Configuring

If a current Linux R is available from the system and on your search path, then run the following command:

make mkrules

This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.

If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling

Now we can cross-compile R:

make R

The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles

Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile

The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
Possible pitfalls

We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments

We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R

by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.

The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

  Σ_i (1/π_i) Y_i,

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
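As a toy illustration (not from the article; the numbers are made up), the Horvitz-Thompson estimate of a total is just the sum of the observations divided by their sampling probabilities:

```r
# Hypothetical sample: observed values and their sampling probabilities.
y  <- c(12, 7, 30, 5)
pi <- c(0.1, 0.1, 0.5, 0.25)

# Horvitz-Thompson estimate of the population total.
ht.total <- sum(y / pi)
ht.total   # 270
```

An individual sampled with probability 0.1 stands in for ten population members, so their value is counted ten times.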
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors

The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

  v̂ar[S] = Σ_s  n_s/(n_s − 1)  Σ_{c,i} ( (1/π_sci) (Y_sci − Ȳ_s) )²,

where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
Subsetting surveys

Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example

The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.
The survey package

In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object, created by the svydesign function. A simple example call is:

imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses

Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGE.P + POLCT.C, design = imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do

> svytable(~AGE.P + pmin(3, POLCT.C), design = imdsgn)
     pmin(3, POLCT.C)
AGE.P      0      1      2       3
    0 473079 411177 655211  276013
    1 162985  72498 365445  863515
    2 144000  84519 126126 1057654
    3  89108  68925 110523 1123405
    4 133902 111098 128061 1069026
    5 137122  31668  19027 1121521
    6 127487  38406  51318  838825

where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size wecould tabulate
svytable(~AGEP+pmin(3POLCTC)+MSASIZEP
design=imdsgn)
where MSASIZEP is the size of the lsquometropolitan sta-tistical arearsquo in which the child lives with categoriesranging from 5000000+ to under 250000 and a finalcategory for children who donrsquot live in a metropoli-tan statistical area As MSASIZEP has 7 categories thedata end up fairly sparse A regression model mightbe more useful
Regression models

The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation

The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example, the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCT.C >= 3) ~ factor(AGE.P) + factor(MSASIZE.P),
       design = imdsgn, family = binomial)
svyglm(I(POLCT.C >= 3) ~ factor(AGE.P) + MSASIZE.P,
       design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 -
        sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I⁻¹ U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames = FALSE, design.summaries = FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References

Horvitz, D.G., and Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.
A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.
2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, ..., and 1 in X*_N.
4. The optimal cluster size is selected from the X*_k. The selection process requires that we estimate the distribution of R^2 by the bootstrap and calculate G_k = R^2_k − R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:

   newX[i,] = X[i,] (I − x̄_{S_k} (x̄_{S_k}^T x̄_{S_k})^{-1} x̄_{S_k}^T),

   where the bracketed factor is denoted IPx.
6. The next optimal cluster is found by repeating the process using newX.
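The shave step above can be sketched in a few lines of R. This is our own illustrative code on simulated data, not the authors' implementation; all variable names and the use of prcomp are assumptions for the sketch.

```r
## Illustrative sketch of one shave step of the gene-shaving algorithm
## (not the authors' code); prcomp() supplies the leading principal
## component of the samples, the "eigen-gene".
set.seed(1)
X <- matrix(rnorm(100 * 8), nrow = 100)  # 100 "genes" x 8 "samples"

## Step 2: the eigen-gene is the leading principal component
eigen.gene <- prcomp(t(X))$x[, 1]

## Step 3: shave the rows with the lowest 10% of R^2
r2 <- apply(X, 1, function(row) cor(row, eigen.gene)^2)
X.star <- X[r2 > quantile(r2, 0.10), ]
nrow(X.star)  # about [0.9 N] genes remain
```

Iterating this shave on X.star produces the nested sequence X*_1, X*_2, ... described in step 3.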
R News ISSN 1609-3631
Vol. 3/1, June 2003
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

   Speedup = 1 / (r_s + r_p / n),

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
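Amdahl's law is easy to express as a one-line R function, which also lets a reader check the speedup figures quoted later in the article. This sketch is ours, not code from the article.

```r
## Amdahl's law: rp is the parallel proportion of the program,
## n is the number of processors; returns the predicted speedup.
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)

amdahl(0.78, 2)   # ~1.64: two nodes, 78% of the work parallel
amdahl(0.78, 10)  # ~3.36: ten nodes
amdahl(1, 5)      # 5: the idealised, fully parallel case
```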
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication (a master process connected to slaves 1, 2, 3, ..., n).
Bootstrapping
Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
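The serial permutation step just described can be written in one line with apply and sample. This is our own sketch on a toy matrix (the article's serial code is not shown), so the variable names are assumptions.

```r
## Serial row-wise permutation: sample() shuffles each row of X via
## apply(); apply() returns its results as columns, hence the t().
X <- matrix(1:12, nrow = 3)  # toy stand-in for an expression matrix
Xb <- t(apply(X, 1, sample))
all(sort(Xb[1, ]) == sort(X[1, ]))  # each row is a permutation of the original
```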
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the ma-trix X to its children Each portion will be approxi-mately equal in size and the number of portions willbe equal to the number of slaves spawned for orthog-onalization The master also calculates the value ofIPx as defined in step 5 of the gene-shaving algo-rithm
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
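The block-wise strategy amounts to computing a start and end row for each slave. A small helper along these lines makes the bookkeeping explicit; it is a hypothetical function of ours, not part of the article's code.

```r
## Row ranges for splitting an nrows-row matrix among nslaves slaves;
## the last block absorbs any remainder rows.
block.ranges <- function(nrows, nslaves) {
  size  <- as.integer(nrows / nslaves)
  start <- (seq(nslaves) - 1) * size + 1
  end   <- c(start[-1] - 1, nrows)
  cbind(start, end)
}
block.ranges(10, 3)  # blocks of 3, 3 and 4 rows
```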
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2_i and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations in the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master.
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results: computation time (seconds) against size of expression matrix (rows), for gene-shaving overall, bootstrapping, and orthogonalization.
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full-cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the overall algorithm, the orthogonalization and the bootstrapping, under the serial, 2-node and 16-node environments.
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

   S = T_serial / T_parallel

   S_2nodes = 1301.18 / 825.34 = 1.6

   S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing the bootstrapping than expected. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the one which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.
Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.
Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.
Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.
Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.
Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.
Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.
Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.
Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.
Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help - R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as instructions on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.
Occasionally, library(help=PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
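For instance, a session looking for variance-related functionality might try the following (the topic strings here are only examples of ours, not from the article):

```r
## Three ways of searching the installed documentation:
help.search("variance")  # fuzzy search of name/title/keyword entries
apropos("var")           # objects whose names partially match "var"
help("var")              # a specific help page, same as ?var
```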
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable; see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.
R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.
R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.
R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.
R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.
Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.
Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews
Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics, such as linear models, regression and analysis of variance, are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques, such as the bootstrap, generalized linear models, mixed effects models and spatial statistics, are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues, such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
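The masking problem the reviewer describes can also be avoided by pairing every attach with a detach. A generic sketch of that discipline (our code, not from SC; the data frame is hypothetical):

```r
## Attach a data frame for convenient column access, then detach it
## so its column names cannot mask objects or later data frames.
d <- data.frame(x = 1:3, y = 4:6)  # hypothetical data
attach(d)
m <- mean(x)  # refers to d$x via the search path
detach(d)
m  # 2
```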
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences, and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in it are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
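The deparse(substitute()) idiom mentioned above can be sketched as follows (my own minimal example, not taken from the book):

```r
## Minimal sketch: the plot title defaults to the (unevaluated)
## expression the caller passed for 'x'.
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
}

heights <- rnorm(20)
myplot(heights)   # the plot is titled "heights"
```

Because default arguments are evaluated inside the function frame, substitute(x) recovers the caller's expression rather than its value.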
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

R News ISSN 1609-3631
Vol. 3/1, June 2003
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com

John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory
data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression, contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
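A small illustration of the new behaviour (my own sketch, not from the release notes):

```r
## class() now gives a non-null implicit class...
class(1:3)      # "integer"
## ...while the class *attribute* is still unset:
oldClass(1:3)   # NULL

## Dispatch uses the implicit class, so a method such as
## foo.integer is found for a plain integer vector:
foo <- function(x) UseMethod("foo")
foo.integer <- function(x) "integer method"
foo(1:3)        # "integer method"
```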
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() in package mva allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.
load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().
• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added "[" method for terms objects.

• New argument 'silent' to try().
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:
– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML A Maximum Likelihood approach tomixed models By Goumlran Brostroumlm
gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta
grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz
gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others
hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally
hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald
ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson
locfit Local Regression Likelihood and density esti-mation By Catherine Loader
mimR An R interface to MIM for graphical mod-elling in R By Soslashren Hoslashjsgaard
mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley
multidim multidimensional descritptive statisticsfactorial methods and classification Originalby Andreacute Carlier amp Alain Croquette R port byMathieu Ros amp Antoine Lucas
normalp A collection of utilities refered to normalof order p distributions (General Error Distri-butions) By Angelo M Mineo
plspcr Multivariate regression by PLS and PCR ByRon Wehrens
polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg
rimage This package provides functions for imageprocessing including sobel filter rank filtersfft histogram equalization and reading JPEGfile This package requires fftw httpwwwfftworg and libjpeg httpwwwijgorggtBy Tomomi Takashina
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini, and Na Li
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou, and Kurt Hornik
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid figure omitted; see the URL above for a printable version.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events: DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, to applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming, and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
The genetics Package
Utilities for handling genetic data
by Gregory R. Warnes
Introduction
In my work as a statistician in the Non-Clinical Statistics and Biostatistical Applications group within Pfizer Global Research and Development, I have the opportunity to perform statistical analysis in a wide variety of domains. One of these domains is pharmacogenomics, in which we attempt to determine the relationship between the genetic variability of individual patients and disease status, disease progression, treatment efficacy, or treatment side effect profile.
Our normal approach to pharmacogenomics is to start with a small set of candidate genes. We then look for markers of genetic variability within these genes. The most common marker types we encounter are Single Nucleotide Polymorphisms (SNPs). SNPs are locations where some individuals differ from the norm by the substitution of one of the 4 DNA bases, adenine (A), thymine (T), guanine (G), and cytosine (C), by one of the other bases. For example, a single cytosine (C) might be replaced by a single thymine (T) in the sequence 'CCTCAGC', yielding 'CCTTAGC'. We also encounter simple sequence length polymorphisms (SSLP), which are also known as microsatellite DNA. SSLP are simple repeating patterns of bases where the number of repeats can vary. E.g., at a particular position, some individuals might have 3 repeats of the pattern 'CT', 'ACCTCTCTAA', while others might have 5 repeats, 'ACCTCTCTCTCTAA'.
Regardless of the type or location of genetic variation, each individual has two copies of each chromosome, hence two alleles (variants), and consequently two data values for each marker. This information is often presented together by providing a pair of allele names. Sometimes a separator is used (e.g., 'A/T'), sometimes they are simply concatenated (e.g., 'AT').
A further layer of complexity arises from the inability of most laboratory methods to determine which observed variant comes from which copy of the chromosome. (Statistical methods are often necessary to impute this information when it is needed.) For this type of data, 'A/T' and 'T/A' are equivalent.
The genetics package
The genetics package, available from CRAN, includes classes and methods for creating, representing, and manipulating genotypes (unordered allele pairs) and haplotypes (ordered allele pairs). Genotypes and haplotypes can be annotated with chromosome, locus (location on a chromosome), gene, and marker information. Utility functions compute genotype and allele frequencies, flag homozygotes or heterozygotes, flag carriers of certain alleles, count the number of a specific allele carried by an individual, and extract one or both alleles. These functions make it easy to create and use single-locus genetic information in R's statistical modeling functions.
The genetics package also provides a set of functions to estimate and test for departure from Hardy-Weinberg equilibrium (HWE). HWE specifies the expected allele frequencies for a single population when none of the variant alleles impart a survival benefit. Departure from HWE is often indicative of a problem with the laboratory assay, and is often the first statistical method applied to genetic data. In addition, the genetics package provides functions to test for linkage disequilibrium (LD), the non-random association of marker alleles, which can arise from marker proximity or from selection bias. Further, to assist in sample size calculations when investigating potential markers, we provide a function which computes the probability of observing all alleles with a given true frequency.
My primary motivation in creating the genetics package was to overcome the difficulty of representing and manipulating genotypes in general-purpose statistical packages. Without an explicit genotype variable type, handling genetic variables requires considerable string manipulation, which can be quite messy and tedious. The genotype function has been designed to remove the need to perform string manipulation by allowing allele pairs to be specified in any of four commonly occurring notations:
• A single vector with a character separator:

  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  g3 <- genotype( c('A A','A C','C C','C A',
                    '','A A','A C','A C'),
                  sep=' ', remove.spaces=F)
• A single vector with a positional separator:

  g2 <- genotype( c('AA','AC','CC','CA','',
                    'AA','AC','AC'), sep=1 )
• Two separate vectors:

  g4 <- genotype(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C')
        )
• A dataframe or matrix with two columns:
  gm <- cbind(
          c('A','A','C','C','','A','A','A'),
          c('A','C','C','A','','A','C','C') )
  g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)
A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:

categorical Each allele combination acts differently.
This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )
additive The effect depends on the number of copies of a specific allele (0, 1, or 2).
The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A')
      + confounder )
dominant/recessive The effect depends only on the presence or absence of a specific allele.
The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A')
      + confounder )
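As a small, self-contained illustration of these supporting functions (a sketch; the g1 vector is the one created in the notation examples above, and the printed values are not reproduced here):

  library(genetics)

  # Unordered allele pairs, as in the first notation example
  g1 <- genotype( c('A/A','A/C','C/C','C/A',
                    NA,'A/A','A/C','A/C') )

  allele.count(g1, 'A')  # copies of 'A' (0, 1, or 2) per observation
  carrier(g1, 'A')       # TRUE when at least one 'A' is present

The results of both calls can be passed directly to lm and friends, exactly as in the model formulas shown above.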
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class list c("genotype","factor"). The names of the factor levels are constructed as paste(allele1,allele2,sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order, controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.
The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:

  allele.x <- attr(x, "allele.map")
  allele.x[ genotype.var, which ]
where which is 1, 2, or c(1,2), as appropriate.
Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.
The genotype class is accompanied by a full complement of helper methods for standard R operators ([], [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:
allele Extracts individual alleles, as a vector or matrix.
allele.names Extracts the set of allele names.
homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Converts appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).
The genetics package provides four functions related to Hardy-Weinberg Equilibrium:
diseq Estimates or computes a confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.
as well as three related to linkage disequilibrium (LD):

LD Computes pairwise linkage disequilibrium between genetic markers.

LDtable Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color coded by significance.

LDplot Plots linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Computes the probability of observing all alleles with a given frequency in a sample of a specified size.
The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
>
gt HWEtest(data$c104t)
-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------
Call
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

       C     T
C           0.12
T   0.12

Scaled Disequlibrium for each allele pair (D')

       C     T
C           0.56
T   0.56

Correlation coefficient for each allele pair (r)

       C      T
C   1.00  -0.56
T  -0.56   1.00

Overall Values

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap
using 1000 samples

            Observed 95% CI            NA's
Overall D      0.121 ( 0.073,  0.159)    0
Overall D'     0.560 ( 0.373,  0.714)    0
Overall r     -0.560 (-0.714, -0.373)    0

            Contains Zero?
Overall D       NO
Overall D'      NO
Overall r       NO

Significance Test:

	Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137,
N2 = 63, p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr.     -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr.             -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100
>
> LDtable(ld) # graphical display
[Figure: graphical LD table for markers c104t, a1691g, and c2249t, showing D', the number of observations (N), and the p-value for each marker pair, color coded by significance.]
> # fit a model
> summary(lm( DELTA.BMI ~
+     homozygote(c104t,'C') +
+     allele.count(a1691g, 'G') +
+     c2249t, data=data))
Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                           Estimate Std. Error
(Intercept)                 -0.1807     0.5996
homozygote(c104t, "C")TRUE   1.0203     0.2290
allele.count(a1691g, "G")   -0.0905     0.1175
c2249tT/C                    0.4291     0.6873
c2249tT/T                    0.3476     0.5848
                           t value Pr(>|t|)
(Intercept)                  -0.30     0.76
homozygote(c104t, "C")TRUE    4.46  2.3e-05
allele.count(a1691g, "G")    -0.77     0.44
c2249tT/C                     0.62     0.53
c2249tT/T                     0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.
As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.
In the future, I intend to add functions to perform haplotype imputation and generate standard genetics plots.
I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.
I welcome comments and contributions.
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors
by Jürgen Groß
A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model
$$ y = X\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$
to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when
(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and
(b) the centered columns had been orthogonal toeach other
This variance is given by
$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$
where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
$$ \mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \ldots, p. $$
It can also be written in the form
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$
where
$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$
is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
      UseMethod("vif")
> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
      } else {
          v1 <- diag(V)
          v2 <- diag(Vi)
          warning(paste("No intercept term",
                        "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors
As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are:
      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899
Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 147.4653!
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
$$ \tilde R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$
instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)       x2       x3       x4       x5
   2540.378 2510.568 27.92701 24.89071 4.518498
as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \tilde R_j^2$. Hence, the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
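For the example model cl.lm above, the same identity can be applied interactively, without modifying any function (a sketch; V and Vi are the two quantities used inside vif.lm):

  # Uncentered VIFs directly: diag of (X'X)^{-1} times diag of X'X,
  # one value per coefficient, intercept included.
  V  <- summary(cl.lm)$cov.unscaled      # (X'X)^{-1}
  Vi <- crossprod(model.matrix(cl.lm))   # X'X
  diag(V) * diag(Vi)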
> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      v1 <- diag(V)
      v2 <- diag(Vi)
      uc.struct <- structure(v1 * v2, names = nam)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
          c.struct <- structure(v1 * v2, names = nam)
          return(list(Centered.VIF = c.struct,
                      Uncentered.VIF = uc.struct))
      }
      else {
          warning(paste("No intercept term",
                        "detected. Uncentered VIFs computed."))
          return(Uncentered.VIF = uc.struct)
      }
  }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF, Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux: A Package Developer's Tool
by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting, if a current version of Linux R is in the search path, then
make CrossCompileBuild
will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.
These directories can be changed by modifying the Makefile. One may find out exactly what each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:
make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.
Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.
We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.
We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do
make bundle-VR_7.0-11
The Windows binaries of packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show exactly what each step does.
The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu
A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R

by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

  ∑_i (1/π_i) Y_i

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
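The unbiasedness of the weighted total can be demonstrated on a tiny hypothetical population (the numbers are invented for illustration): under independent Bernoulli sampling with known inclusion probabilities, averaging the Horvitz-Thompson estimate over every possible sample recovers the population total exactly.

```r
y <- c(3, 7, 2, 9)          # population values
p <- c(0.2, 0.5, 0.4, 0.8)  # inclusion probability of each unit

ht.total <- function(ys, ps) sum(ys / ps)

# Exact expectation: enumerate all 2^4 possible samples.
expected <- 0
for (s in 0:15) {
  inc <- as.logical(bitwAnd(s, 2^(0:3)))   # which units are sampled
  expected <- expected +
    prod(ifelse(inc, p, 1 - p)) * ht.total(y[inc], p[inc])
}
expected   # equals sum(y) = 21
```

The expectation equals sum(y) whatever the π_i are, which is the point of the weighting.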
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

  v̂ar[S] = ∑_s n_s/(n_s − 1) ∑_{c,i} ((1/π_sci)(Y_sci − Ȳ_s))²

where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
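A direct transcription of the variance formula can be sketched on invented toy data, with one observation per cluster and two clusters per stratum (all names and numbers here are illustrative, not from the NHIS example, and the sum over clusters and observations collapses to one term per cluster):

```r
d <- data.frame(stratum = c(1, 1, 2, 2),
                cluster = c(1, 2, 3, 4),
                prob    = c(0.5, 0.5, 0.25, 0.25),  # sampling probs
                y       = c(2, 4, 1, 3))

vhat <- 0
for (s in unique(d$stratum)) {
  ds   <- d[d$stratum == s, ]
  ns   <- length(unique(ds$cluster))        # clusters in stratum s
  ybar <- weighted.mean(ds$y, 1 / ds$prob)  # weighted stratum mean
  vhat <- vhat + ns / (ns - 1) * sum(((ds$y - ybar) / ds$prob)^2)
}
vhat   # 16 + 64 = 80 for these numbers
```

Working the two strata by hand (weighted means 3 and 2, squared weighted deviations 8 and 32, each scaled by n_s/(n_s − 1) = 2) gives the same 80.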
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGEP+POLCTC, design=imdsgn)
produces a table of number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
pmin(3, POLCTC)
AGEP 0 1 2 3
0 473079 411177 655211 276013
1 162985 72498 365445 863515
2 144000 84519 126126 1057654
3 89108 68925 110523 1123405
4 133902 111098 128061 1069026
5 137122 31668 19027 1121521
6 127487 38406 51318 838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
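The "just over 80%" claim can be read straight off the age-3 row of the table above:

```r
# Age-3 row of the weighted table: doses 0, 1, 2, and 3+.
row3 <- c(89108, 68925, 110523, 1123405)
row3[4] / sum(row3)   # about 0.807: just over 80% with 3+ doses
```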
To examine whether this varies by city size, we could tabulate
svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
design=imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2, 5, 0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignores the survey design.
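An analytic gradient of the kind passed to svymle in Figure 2 is easy to get wrong, so a quick numerical sanity check is worthwhile. The sketch below (toy values, self-contained) compares the closed-form derivatives of the normal log-density with central finite differences of dnorm(..., log=TRUE):

```r
gr <- function(x, mean, sd) {
  # d/d mean and d/d sd of log dnorm(x, mean, sd)
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 -
    sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

x <- 1.3; m <- 0.4; s <- 2.1; eps <- 1e-6
num.dm <- (dnorm(x, m + eps, s, log = TRUE) -
           dnorm(x, m - eps, s, log = TRUE)) / (2 * eps)
num.ds <- (dnorm(x, m, s + eps, log = TRUE) -
           dnorm(x, m, s - eps, log = TRUE)) / (2 * eps)
max(abs(gr(x, m, s) - c(num.dm, num.ds)))   # essentially zero
```

Agreement to numerical precision gives some confidence before handing the gradient to the optimiser.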
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G., and Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison, and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001), by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³, and p usually no greater than 10².
2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*_k. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_Sk, from each row of X:

  newX[i,] = X[i,] (I − x̄_Sk (x̄_Skᵀ x̄_Sk)⁻¹ x̄_Skᵀ)

where the bracketed factor is denoted IP_x.
6. The next optimal cluster is found by repeating the process using newX.
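One shaving pass (steps 2-3 above) can be sketched in a few lines of R on simulated data; the function name and data here are purely illustrative, not taken from the S-PLUS implementation mentioned in the text:

```r
set.seed(42)
X <- matrix(rnorm(100 * 8), nrow = 100)   # 100 genes x 8 samples

shave.once <- function(X) {
  # leading principal component over samples: the "eigen-gene"
  eigen.gene <- prcomp(t(X))$x[, 1]
  # squared correlation of each gene with the eigen-gene
  r2 <- apply(X, 1, function(g) cor(g, eigen.gene)^2)
  # shave the 10% of rows least correlated with the eigen-gene
  X[r2 > quantile(r2, 0.10), , drop = FALSE]
}
dim(shave.once(X))   # 90 genes remain, 8 samples
```

Iterating shave.once produces the nested sequence X*_1, X*_2, ... from which the gap statistic then picks the optimal cluster.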
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_Sk. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
  Speedup = 1 / (r_s + r_p/n)
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
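Plugging illustrative numbers (not measurements from this application) into Amdahl's law shows how quickly the serial fraction dominates:

```r
speedup <- function(rs, rp, n) 1 / (rs + rp / n)

# If 60% of the run parallelizes perfectly over 16 nodes:
speedup(rs = 0.4, rp = 0.6, n = 16)   # 1/0.4375, about 2.29
```

Even with 16 processors, a 40% serial fraction caps the speedup well below 16.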
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm, without slave-slave communication
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
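The serial within-row permutation described above takes one line with apply and sample (a small invented matrix for illustration):

```r
set.seed(1)
X  <- matrix(1:12, nrow = 3)
# apply collects results as columns, so transpose to restore shape
Xb <- t(apply(X, 1, sample))
# each row of Xb holds the same values as the matching row of X,
# just reordered
stopifnot(all(rowSums(Xb) == rowSums(X)))
```

It is this per-row work, repeated for every bootstrap replicate, that the parallel version farms out to slave tasks.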
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent the R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IP_x, as defined in step 5 of the gene-shaving algorithm.
The value of IP_x is passed to each child. Each child then applies IP_x to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small, and the number of columns is very large.
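The "even portions" scheme can be sketched as follows; the helper name is invented, and nrow(X) is assumed divisible by the number of slaves, as in the text:

```r
split.rows <- function(X, nslaves) {
  blk <- nrow(X) / nslaves   # rows per slave
  lapply(seq_len(nslaves), function(j)
    X[((j - 1) * blk + 1):(j * blk), , drop = FALSE])
}

X <- matrix(1:24, nrow = 6)
parts <- split.rows(X, 3)                        # three 2-row blocks
stopifnot(identical(do.call(rbind, parts), X))   # row order preserved
```

Because the blocks are sent and received in a fixed order, stacking the returned blocks reconstructs the matrix with rows in their original positions.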
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop; as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations of the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
R News, Vol. 3/1, June 2003 (ISSN 1609-3631)
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master.
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
[Figure 3: Serial gene-shaving results. The plot shows computation time (seconds) against size of expression matrix (rows) for gene-shaving overall, bootstrapping, and orthogonalization.]
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through its parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results, one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
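These estimates can be checked with a few lines of R (the function name amdahl is ours, not part of the article's code):

```r
# Amdahl's law: maximum speedup when a proportion p of the
# work can be spread over n processors
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(0.78, c(2, 10))  # reproduces, up to rounding, the 1.63 and 3.35 above
```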
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
[Figure 4: Comparison of computation times for variable sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes). Each panel plots computation time (seconds) against size of expression matrix (rows).]
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization, here it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help: R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (on objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file '.Rprofile'.
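For instance, the following calls all reach the same documentation (a sketch; set the options only if the corresponding help format is installed):

```r
?median                   # shorthand for help("median")
help("median")            # functional form, accepts further arguments
options(htmlhelp = TRUE)  # prefer HTML help for the rest of the session
```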
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
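Two illustrative calls (the search terms are arbitrary examples):

```r
help.search("linear model")  # search name, title and keyword entries
apropos("^var")              # visible objects matching a regular expression
```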
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the immediate problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics, such as linear models, regression and analysis of variance, are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers
at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
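The deparse(substitute()) idiom mentioned above can be sketched as follows (myplot is a made-up name, not taken from the book):

```r
# The unevaluated argument expression becomes the default plot title
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
}
# myplot(rnorm(100)) titles the plot "rnorm(100)"
```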
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory, such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
R News ISSN 1609-3631
Vol. 3/1, June 2003
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new function RNGversion() allows you to restore the generators of an earlier R version if reproducibility is required.
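As a brief sketch of how RNGversion() can be combined with set.seed() (the object names here are invented for illustration; exact draws depend on the generator in use):

```r
set.seed(1)                 # uses the new 'Mersenne-Twister' + 'Inversion' defaults
u.new <- runif(3)

RNGversion("1.6.2")         # restore the generators of an earlier release
set.seed(1)
u.old <- runif(3)           # reproduces pre-1.7.0 output

RNGkind("default", "default")   # switch back to the current defaults
```

The two sequences differ because the underlying generators differ, even though the seed is the same.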
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358)
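A minimal illustration of the padding behaviour just described (variable names invented for the example):

```r
x <- 1:5
names(x) <- c("a", "b")   # shorter than x: the remaining names become NA
names(x)                  # "a" "b" NA NA NA
```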
• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588)
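A short sketch of the recursive indexing shorthand (the list and its names are made up for illustration):

```r
x <- list(a = 1, b = list(c = 2, d = list(e = 3)))

x[[c(2, 2, 1)]]        # same as x[[2]][[2]][[1]], i.e. 3
x[[c("b", "d", "e")]]  # recursive indexing also works by name
```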
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines, and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
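A small example of capture.output() collecting printed output as a character vector:

```r
txt <- capture.output(summary(1:10))
txt              # a character vector, one element per printed line
writeLines(txt)  # reprint the captured output
```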
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush(), with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.
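A sketch of reading a compressed save back through a gzcon()-wrapped connection (the file is a temporary one created just for this example):

```r
f <- tempfile()
x <- 1:10
save(x, file = f, compress = TRUE)  # gzip-compressed save

rm(x)
con <- gzcon(file(f, "rb"))  # wrap a raw binary connection for decompression
load(con)                    # restores x
close(con)
```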
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.

• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing the casewise changed coefficients not to be computed. This makes plot.lm() much quicker for large data sets.

• load() now returns, invisibly, a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().
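A small sketch of mapply() applying a function elementwise over several argument vectors:

```r
# rep(1, 3), rep(2, 2), rep(3, 1): results of unequal length stay a list
mapply(rep, 1:3, 3:1)

# equal-length results are simplified by default
sums <- mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9
```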
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for that class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
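A short sketch of r2dtable() drawing tables with fixed margins:

```r
set.seed(123)
tabs <- r2dtable(2, r = c(3, 7), c = c(4, 6))  # two random 2 x 2 tables

tabs[[1]]
rowSums(tabs[[1]])  # always c(3, 7)
colSums(tabs[[1]])  # always c(4, 6)
```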
• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
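A minimal sketch of the new multinomial functions:

```r
set.seed(1)
x <- rmultinom(5, size = 10, prob = c(0.2, 0.3, 0.5))
x            # a 3 x 5 matrix of counts; each column sums to 10
colSums(x)   # 10 10 10 10 10

p <- dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))  # probability of one outcome
```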
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added a "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new, HIGHLY EXPERIMENTAL, graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This fixes some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with numbered squares 1-27 omitted from this text version.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC
2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening
While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants
Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
gm <- cbind(
  c('A','A','C','C','','A','A','A'),
  c('A','C','C','A','','A','C','C') )
g5 <- genotype( gm )
For simplicity, the functions makeGenotypes and makeHaplotypes can be used to convert all of the genetic variables contained in a dataframe in a single pass. (See the help page for details.)

A second difficulty in using genotypes is the need to represent the information in different forms at different times. To simplify the use of genotype variables, each of the three basic ways of modeling the effect of the allele combinations is directly supported by the genetics package:
categorical: Each allele combination acts differently. This situation is handled by entering the genotype variable without modification into a model. In this case it will be treated as a factor:

  lm( outcome ~ genotype.var + confounder )

additive: The effect depends on the number of copies of a specific allele (0, 1, or 2). The function allele.count( gene, allele ) returns the number of copies of the specified allele:

  lm( outcome ~ allele.count(genotype.var,'A') + confounder )

dominant/recessive: The effect depends only on the presence or absence of a specific allele. The function carrier( gene, allele ) returns a boolean flag if the specified allele is present:

  lm( outcome ~ carrier(genotype.var,'A') + confounder )
Implementation
The basic functionality of the genetics package is provided by the genotype class and the haplotype class, which is a simple extension of the former. Friedrich Leisch and I collaborated on the design of the genotype class. We had four goals: First, we wanted to be able to manipulate both alleles as a single variable. Second, we needed a clean way of accessing the individual alleles when this was required. Third, a genotype variable should be able to be stored in dataframes as they are currently implemented in R. Fourth, the implementation of genotype variables should be space-efficient.
After considering several potential implementations, we chose to implement the genotype class as an extension to the in-built factor variable type, with additional information stored in attributes. Genotype objects are stored as factors and have the class attribute c("genotype", "factor"). The names of the factor levels are constructed as paste(allele1, allele2, sep="/"). Since most genotyping methods do not indicate which allele comes from which member of a chromosome pair, the alleles for each individual are placed in a consistent order, controlled by the reorder argument. In cases when the allele order is informative, the haplotype class, which preserves the allele order, should be used instead.
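The storage scheme can be sketched in a few lines of base R (an illustration only, with made-up variable names; the genetics package's actual implementation differs in detail):

```r
## Sketch of the storage idea: allele pairs become factor levels "a1/a2",
## plus an attribute mapping each level back to its two alleles.
a1 <- c("A", "A", "C")
a2 <- c("A", "C", "C")
## place each pair in a consistent order, since most genotyping
## methods do not distinguish the chromosome of origin
swap <- a1 > a2
tmp <- a1[swap]; a1[swap] <- a2[swap]; a2[swap] <- tmp
g <- factor(paste(a1, a2, sep = "/"))
## level -> allele translation table (two-column character matrix)
attr(g, "allele.map") <- do.call(rbind, strsplit(levels(g), "/"))
## extracting allele 1 for every observation is then a table lookup
allele1 <- attr(g, "allele.map")[as.integer(g), 1]
```

Because the object is still a factor, model-fitting functions treat it as one; only operations that need the individual alleles consult the attribute.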
The set of allele names is stored in the attribute allele.names. A translation table from the factor levels to the names of each of the two alleles is stored in the attribute allele.map. This map is a two-column character matrix with one row per factor level. The columns provide the individual alleles for each factor level. Accessing the individual alleles, as performed by the allele function, is accomplished by simply indexing into this table:

  alleles.x <- attr(x, "allele.map")
  alleles.x[genotype.var, which]
where which is 1, 2, or c(1,2), as appropriate.

Finally, there is often additional meta-information associated with a genotype. The functions locus, gene, and marker create objects to store information, respectively, about genetic loci, genes, and markers. Any of these objects can be included as part of a genotype object using the locus argument. The print and summary functions for genotype objects properly display this information when it is present.
This implementation of the genotype class met our four design goals and offered an additional benefit: in most contexts factors behave the same as the desired default behavior for genotype objects. Consequently, relatively few additional methods needed to be written. Further, in the absence of the genetics package, the information stored in genotype objects is still accessible in a reasonable way.

The genotype class is accompanied by a full complement of helper methods for standard R operators ([, [<-, ==, etc.) and object methods (summary, print, is.genotype, as.genotype, etc.). Additional functions for manipulating genotypes include:
allele Extracts individual alleles as a character matrix.

allele.names Extracts the set of allele names.

homozygote Creates a logical vector indicating whether both alleles of each observation are the same.

heterozygote Creates a logical vector indicating whether the alleles of each observation differ.

carrier Creates a logical vector indicating whether the specified alleles are present.
allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Converts appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop/) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT/).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:
diseq Estimates or computes the confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Computes pairwise linkage disequilibrium between genetic markers.

LDtable Generates a graphical table showing the LD estimate, number of observations, and p-value for each marker combination, color coded by significance.

LDplot Plots linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:

> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name="A1691G",
+          type="SNP",
+          locus.name="MBP2",
+          chromosome=9,
+          arm="q",
+          index.start=35,
+          bp.start=1691,
+          relative.to="intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)

Marker: MBP2:C-104T (9q35:-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22
>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

-----------------------------------
Test for Hardy-Weinberg-Equilibrium
-----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequilibrium for each allele pair (D)

      C    T
C         0.12
T  0.12

Scaled Disequilibrium for each allele pair (D')

      C    T
C         0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values:

   Value
D   0.12
D'  0.56
r  -0.56

Confidence intervals computed via bootstrap using 1000 samples

           Observed 95% CI           NA's
Overall D     0.121 ( 0.073, 0.159)     0
Overall D'    0.560 ( 0.373, 0.714)     0
Overall r    -0.560 (-0.714,-0.373)     0

           Contains Zero?
Overall D              NO
Overall D'             NO
Overall r              NO

Significance Test:

        Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63,
p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with more or less than two alleles detected. These variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr.     -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr.             -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100
gt
> LDtable(ld) # graphical display
[Figure: "Linkage Disequilibrium" table produced by LDtable, showing -D', the number of observations (N), and the p-value for each pair of the markers c104t, a1691g, and c2249t, color coded by significance.]
> # fit a model
> summary(lm( DELTA.BMI ~
+               homozygote(c104t,'C') +
+               allele.count(a1691g, 'G') +
+               c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                           Estimate Std. Error
(Intercept)                 -0.1807     0.5996
homozygote(c104t, "C")TRUE   1.0203     0.2290
allele.count(a1691g, "G")   -0.0905     0.1175
c2249tT/C                    0.4291     0.6873
c2249tT/T                    0.3476     0.5848
                           t value Pr(>|t|)
(Intercept)                  -0.30     0.76
homozygote(c104t, "C")TRUE    4.46  2.3e-05
allele.count(a1691g, "G")    -0.77     0.44
c2249tT/C                     0.62     0.53
c2249tT/T                     0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
                '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of freedom
Multiple R-Squared: 0.176
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future, I intend to add functions to perform haplotype imputation and to generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.

Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors
by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model

    y = Xβ + ε,   X = (1_n, x_2, ..., x_p),   ε ~ (0, σ² I_n),

where X is an n × p matrix, to diagnose collinearity is to measure how much the variance var(β̂_j) of the j-th element β̂_j of the ordinary least squares estimator β̂ = (X′X)⁻¹ X′y is inflated by near linear dependencies of the columns of X. This is done by putting var(β̂_j) in relation to the variance of β̂_j as it would have been when
(a) the nonconstant columns x_2, ..., x_p of X had been centered, and

(b) the centered columns had been orthogonal to each other.
This variance is given by

    v_j = σ² / (x_j′ C x_j),   C = I_n − (1/n) 1_n 1_n′,

where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is

    VIF_j = var(β̂_j) / v_j = σ⁻² var(β̂_j) x_j′ C x_j,   j = 2, ..., p.

It can also be written in the form

    VIF_j = 1 / (1 − R²_j),

where

    R²_j = 1 − (x_j′ M_j x_j) / (x_j′ C x_j),   M_j = I_n − X_j (X_j′ X_j)⁻¹ X_j′,

is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
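The equivalence of the two expressions for VIF_j can be checked numerically in base R (a small simulated sketch; the variable names are arbitrary):

```r
## Check: sigma^-2 var(beta_j) x_j' C x_j  equals  1/(1 - R^2_j)
set.seed(1)
n  <- 50
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, 0, 0.5)   # deliberately correlated with x2
x4 <- rnorm(n)
y  <- 1 + x2 + x3 + x4 + rnorm(n)
fit <- lm(y ~ x2 + x3 + x4)

## route 1: auxiliary regression of x2 on the other regressors
R2      <- summary(lm(x2 ~ x3 + x4))$r.squared
vif.aux <- 1 / (1 - R2)

## route 2: the variance-based definition
var.b2  <- vcov(fit)["x2", "x2"]      # estimated var(beta_2)
sigma2  <- summary(fit)$sigma^2       # estimated sigma^2
vif.var <- var.b2 / sigma2 * sum((x2 - mean(x2))^2)

all.equal(vif.aux, vif.var)           # TRUE
```

The σ̂² factors cancel, so the two routes agree exactly (up to floating point), whichever variance estimate is used.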
The centered variance inflation factor (as well as a generalized variance inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
    UseMethod("vif")
> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653!
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

    R̃²_j = 1 − (x_j′ M_j x_j) / (x_j′ x_j)

instead of the centered R²_j. This can also be done for the intercept as an independent 'variable'. For the above example we obtain
(Intercept)       x2       x3       x4       x5
   254.0378 251.0568 2.792701 2.489071 4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that x_j′ C x_j ≤ x_j′ x_j, so that always R²_j ≤ R̃²_j. Hence, the usual critical value R²_j > 0.9 (corresponding to a centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if(k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(Centered.VIF = c.struct,
                  Uncentered.VIF = uc.struct))
    } else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(Uncentered.VIF = uc.struct))
    }
  }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF: Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R Packages under Intel Linux: A Package Developer's Tool

by Jun Yan and A.J. Rossini

It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.

For the impatient and trusting, if a current version of Linux R is in the search path, then

  make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:

- downloads is the location to store all the sources needed for cross-building.

- cross-tools is the location to unpack the cross-building tools.

- WinR is the location to unpack the R source and cross-build R.

- LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.

- pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

- WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area and unpacks the cross-compiler tools into that subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR, to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, run

  make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R, in addition to the variables above.
Compiling
Now we can cross-compile R:
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
  make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do

  make bundle-VR_7.0-11

The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is
If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included: for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

    ∑_i (1/π_i) Y_i,

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
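A toy illustration in base R (made-up numbers): under independent inclusion indicators with known probabilities, the average of the Horvitz-Thompson estimate over many samples recovers the population total:

```r
## a tiny hypothetical 'population' and unequal inclusion probabilities
Y  <- c(4, 7, 1, 9, 3)
pi <- c(0.9, 0.5, 0.3, 0.7, 0.5)
set.seed(42)
ht <- replicate(20000, {
  s <- runif(length(Y)) < pi    # independent inclusion indicators
  sum(Y[s] / pi[s])             # sum of Y_i / pi_i over the sample
})
c(mean.HT = mean(ht), population.total = sum(Y))
```

The mean of the estimates is close to the true total of 24, illustrating the unbiasedness; the more uneven the π_i, the larger the variance of the estimator.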
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

    v̂ar[S] = ∑_s  n_s/(n_s − 1)  ∑_{c,i} ( (1/π_sci) (Y_sci − Ȳ_s) )²,

where S is the weighted sum, Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
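With one observation per cluster, the inner sum reduces to deviations of the weighted values around their stratum mean; a direct base-R transcription of that special case (a sketch with a hypothetical helper name, not the svyCprod code):

```r
## z: weighted values (1/pi)*Y, one per cluster; strata: stratum ids
var.weighted.sum <- function(z, strata) {
  sum(tapply(z, strata, function(zs) {
    ns <- length(zs)                        # clusters in this stratum
    ns / (ns - 1) * sum((zs - mean(zs))^2)  # stratum contribution
  }))
}
z      <- c(5, 7, 6, 10, 14)
strata <- c(1, 1, 1, 2, 2)
var.weighted.sum(z, strata)   # 3 + 16 = 19
```

Each stratum contributes its own sum of squared deviations, scaled by n_s/(n_s − 1), and the stratum contributions simply add, which is why stratification can only be handled stratum by stratum.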
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object, created by the svydesign function. A simple example call is
R News ISSN 1609-3631
Vol. 3/1, June 2003
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g. strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGEP+POLCTC, design=imdsgn)
produces a table of number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
    pmin(3, POLCTC)
AGEP      0      1      2       3
   0 473079 411177 655211  276013
   1 162985  72498 365445  863515
   2 144000  84519 126126 1057654
   3  89108  68925 110523 1123405
   4 133902 111098 128061 1069026
   5 137122  31668  19027 1121521
   6 127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate
svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done.
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/πi (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆i, or compute them from the information matrix I and score contributions Ui as ∆i = I⁻¹Ui.

4. Pass the ∆i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
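Computing the delta-betas from the information matrix and score contributions, as in step 3 above, can be sketched in a couple of lines; infmat and U are illustrative names for the estimated information matrix and the matrix of per-observation score contributions:

```r
## Sketch: one-step delta-betas Delta_i = I^{-1} U_i, one row per
## observation. 'infmat' is p x p, 'U' is n x p.
delta.betas <- function(infmat, U) {
  t(solve(infmat, t(U)))
}
```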
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G., Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X(N,p), which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³, and p usually no greater than 10².
2. The leading principal component of X is calculated, and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*₁, ..., X*_N. There are [0.9N] genes in X*₁, [0.81N] in X*₂, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap, and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, \(\bar{x}_{S_k}\), from each row of X:

\[ \mathrm{newX}[i,] = X[i,]\ \underbrace{\left( I - \bar{x}_{S_k} \left( \bar{x}_{S_k}^T \bar{x}_{S_k} \right)^{-1} \bar{x}_{S_k}^T \right)}_{IPx} \]
6. The next optimal cluster is found by repeating the process using newX.
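Steps 2 and 3 above can be sketched as follows (an illustrative fragment only, not the package code; centering and other refinements are omitted):

```r
## One 'shave': find the eigen-gene (leading principal component),
## correlate each gene (row) with it, and drop the lowest 10%.
shave.once <- function(X) {
  eigen.gene <- svd(X)$v[, 1]
  r2 <- apply(X, 1, function(g) cor(g, eigen.gene)^2)
  X[r2 > quantile(r2, 0.1), , drop = FALSE]
}
```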
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to \(\bar{x}_{S_k}\).
In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

\[ \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}} \]
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
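Amdahl's Law is easily expressed as an R function; with the parallel proportion of 0.78 measured later in this article, it reproduces the quoted speedup estimates:

```r
## Speedup predicted by Amdahl's law for parallel proportion rp
## on n processors (the serial proportion is 1 - rp).
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)

amdahl(0.78, 2)   # approximately 1.63
amdahl(0.78, 10)  # approximately 3.35
```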
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication (a master node connected to slaves 1 through n)
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
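In serial form the within-row permutation amounts to a single line (toy matrix for illustration; t() is needed because apply() assembles its results column-wise):

```r
X  <- matrix(rnorm(20), nrow = 4)  # toy 4 x 5 expression matrix
Xb <- t(apply(X, 1, sample))       # permute within each row
```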
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent the R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
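The projection matrix of step 5 might be computed along these lines (an illustrative sketch only; make.IPx and xbar are assumed names, with xbar the mean gene of the optimal cluster):

```r
## IPx = I - x (x'x)^{-1} x', the matrix that sweeps the mean gene
## 'xbar' (a vector of length p) out of each row of X.
make.IPx <- function(xbar) {
  xbar <- as.matrix(xbar)  # column vector, p x 1
  diag(nrow(xbar)) - xbar %*% solve(t(xbar) %*% xbar) %*% t(xbar)
}
```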
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R² value calculated and returned to the master. The following code performs these operations in the child tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows, and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows = as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1))+1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end,]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i==1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master.
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest; if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times for variable sized expression matrices, for the overall algorithm, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes). Computation time (seconds) is plotted against size of expression matrix (rows).
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
\[ S = \frac{T_{serial}}{T_{parallel}} \]

\[ S_{2nodes} = \frac{1301.18}{825.34} = 1.6 \qquad S_{16nodes} = \frac{1301.18}{549.59} = 2.4 \]
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm, which can fully utilise each cluster node, will show greater performance over a serial implementation than the particular example we have presented.
Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A. and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D. and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L. and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L. and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with probably its most fundamental topic: getting help.
Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003), and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹ http://CRAN.R-project.org/
² http://CRAN.R-project.org/faqs.html
R News ISSN 1609-3631
Vol. 3/1, June 2003
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and of compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
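For example (an illustrative session sketch; any installed help topic will do):

```r
## Two equivalent ways of reading the documentation for mean():
?mean
help("mean")

## Switch to HTML help (with links to related topics) for the
## rest of the session, if it is available:
options(htmlhelp = TRUE)
help("mean")
```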
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
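A short illustrative example of both search functions:

```r
## Search names, titles and keywords of installed documentation:
help.search("quantile")

## Find objects on the search path by partial name
## (regular expressions are accepted):
apropos("^help")
```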
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other information resources, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search [3] provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists [4]: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list [5]. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked (and answered) several times, focussing on slightly different aspects and hence illuminating a bigger context.
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
[3] http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
[4] accessible via http://www.R-project.org/mail.html
[5] archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003): The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a): R Data Import/Export. URL http://CRAN.R-project.org/manuals.html ISBN 3-901167-53-6.

R Development Core Team (2003b): R Installation and Administration. URL http://CRAN.R-project.org/manuals.html ISBN 3-901167-52-8.

R Development Core Team (2003c): R Language Definition. URL http://CRAN.R-project.org/manuals.html ISBN 3-901167-56-0.

R Development Core Team (2003d): Writing R Extensions. URL http://CRAN.R-project.org/manuals.html ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000): S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002): Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003): An Introduction to R. URL http://CRAN.R-project.org/manuals.html ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2, base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course "Basic Statistics for Health Researchers" at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, with methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
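Two of the changes above can be sketched in a short session (an illustrative sketch, not from the release notes; comments describe the intended behaviour rather than verbatim output):

```r
## class() now gives a non-null class even without attaching
## other packages, so S3 dispatch finds methods like foo.matrix:
x <- matrix(1:4, nrow = 2)
class(x)      # non-null, suitable for method dispatch
oldClass(x)   # NULL: the class *attribute* itself is still unset

## RNGversion() restores the pre-1.7.0 default generators when
## old simulation results must be reproduced exactly:
RNGversion("1.6.2")
set.seed(123)
runif(3)
RNGversion("1.7.0")   # back to 'Mersenne-Twister' and 'Inversion'
```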
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC), and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() ('mva' package) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace(), and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues(), and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance(), and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the "wrong tail" in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n×m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
bull If a prompt() method is called with rsquofilenamersquoas rsquoNArsquo a list-style representation of the doc-umentation shell generated is returned Newfunction promptData() for documenting ob-jects as data sets
bull qqnorm() and qqline() have an optional log-ical argument lsquodataxrsquo to transpose the plot (S-PLUS compatibility)
bull qr() now has the option to use LAPACK rou-tines and the results can be used by the helperroutines qrcoef() qrqy() and qrqty()The LAPACK-using versions may be muchfaster for large matrices (using an optimizedBLAS) but are less flexible
bull QR objects now have class qr andsolveqr() is now just the method for solve()for the class
bull New function r2dtable() for generating ran-dom samples of two-way tables with givenmarginals using Patefieldrsquos algorithm
bull rchisq() now has a non-centrality parameterlsquoncprsquo and therersquos a C API for rnchisq()
bull New generic function reorder() with a den-drogram method new orderdendrogram()and heatmap()
bull require() has a new argument namedcharacteronly to make it align with library
• New functions rmultinom() and dmultinom(), the first one with a C API.
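A brief sketch (cell probabilities invented):

```r
set.seed(42)
# Three multinomial draws of 10 items over 3 cells; one draw per column
x <- rmultinom(3, size = 10, prob = c(0.5, 0.3, 0.2))
colSums(x)  # each column sums to 10

# Density of one particular outcome
dmultinom(c(5, 3, 2), size = 10, prob = c(0.5, 0.3, 0.2))
```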
• New function runmed() for fast running medians ('modreg' package).
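A quick sketch on an invented series:

```r
x <- c(1, 5, 2, 8, 3, 9, 4)
# Running median over a window of 3 points; much faster than
# computing median() over a sliding window by hand
runmed(x, k = 3)
```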
• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).
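A small sketch (factor levels invented):

```r
f <- factor(c("a", "b", "a", "c"))
# The level "c" is dropped from the tabulation even though f is a factor
table(f, exclude = "c")
```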
• The tempfile() function now takes an optional second argument giving the directory name.
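A sketch of the two-argument form (the prefix "mydata" is arbitrary):

```r
d <- tempdir()
f <- tempfile("mydata", tmpdir = d)  # name a file inside directory d
dirname(f) == d                      # TRUE
```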
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().
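For instance:

```r
res <- try(log("oops"), silent = TRUE)  # error is caught, nothing is printed
inherits(res, "try-error")              # TRUE
```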
• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.
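For example (dates invented), an out-of-range period in start is simply folded into the following year:

```r
# Month 13 of 2000 wraps around to January 2001
z <- ts(1:24, start = c(2000, 13), frequency = 12)
start(z)
```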
• unique.default() now works for POSIXct objects, and hence so does factor().
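A small illustration (timestamps invented):

```r
tt <- as.POSIXct("2003-06-01 12:00", tz = "UTC") + c(0, 0, 3600)
unique(tt)   # the duplicated time appears only once
factor(tt)   # and factor() on date-times works too
```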
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() in methods is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions Manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

clim.pact The package contains R functions for retrieving data, making climate analysis, and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic', in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid omitted; numbered squares 1 to 27]
Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, to applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
allele.count Returns the number of copies of the specified alleles carried by each observation.

getlocus Extracts locus, gene, or marker information.

makeGenotypes Convert appropriate columns in a dataframe to genotypes or haplotypes.

write.pop.file Creates a 'pop' data file, as used by the GenePop (http://wbiomed.curtin.edu.au/genepop) and LinkDos (http://wbiomed.curtin.edu.au/genepop/linkdos.html) software packages.

write.pedigree.file Creates a 'pedigree' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

write.marker.file Creates a 'marker' data file, as used by the QTDT software package (http://www.sph.umich.edu/statgen/abecasis/QTDT).

The genetics package provides four functions related to Hardy-Weinberg Equilibrium:

diseq Estimate or compute confidence interval for the single-marker Hardy-Weinberg disequilibrium.

HWE.chisq Performs a Chi-square test for Hardy-Weinberg equilibrium.

HWE.exact Performs a Fisher's exact test of Hardy-Weinberg equilibrium for two-allele markers.

HWE.test Computes estimates and bootstrap confidence intervals, as well as testing for Hardy-Weinberg equilibrium.

as well as three related to linkage disequilibrium (LD):

LD Compute pairwise linkage disequilibrium between genetic markers.

LDtable Generate a graphical table showing the LD estimate, number of observations and p-value for each marker combination, color-coded by significance.

LDplot Plot linkage disequilibrium as a function of marker location.

and one function for sample size calculation:

gregorius Probability of observing all alleles with a given frequency in a sample of a specified size.

The algorithms used in the HWE and LD functions are beyond the scope of this article, but details are provided in the help pages or the corresponding package documentation.
Example
Here is a partial session using tools from the genetics package to examine the features of 3 simulated markers and their relationships with a continuous outcome:
> library(genetics)
[...]
> # Load the data from a CSV file
> data <- read.csv("example_data.csv")
>
> # Convert genotype columns to genotype variables
> data <- makeGenotypes(data)
>
> # Annotate the genes
> marker(data$a1691g) <-
+   marker(name = "A1691G",
+          type = "SNP",
+          locus.name = "MBP2",
+          chromosome = 9,
+          arm = "q",
+          index.start = 35,
+          bp.start = 1691,
+          relative.to = "intron 1")
[...]
>
> # Look at some of the data
> data[1:5,]
      PID DELTA.BMI c104t a1691g c2249t
1 1127409      0.62   C/C    G/G    T/T
2  246311      1.31   C/C    A/A    T/T
3  295185      0.15   C/C    G/G    T/T
4   34301      0.72   C/T    A/A    T/T
5   96890      0.37   C/C    A/A    T/T
>
> # Get allele information for c104t
> summary(data$c104t)
Marker: MBP2:C-104T (9q35-104) Type: SNP

Allele Frequency:
  Count Proportion
C   137       0.68
T    63       0.32

Genotype Frequency:
    Count Proportion
C/C    59       0.59
C/T    19       0.19
T/T    22       0.22

>
>
> # Check Hardy-Weinberg Equilibrium
> HWE.test(data$c104t)

        -----------------------------------
        Test for Hardy-Weinberg-Equilibrium
        -----------------------------------

Call:
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

       C     T
C           0.12
T   0.12

Scaled Disequlibrium for each allele pair (D')

       C     T
C           0.56
T   0.56

Correlation coefficient for each allele pair (r)

       C     T
C   1.00 -0.56
T  -0.56  1.00

Overall Values

    Value
D    0.12
D'   0.56
r   -0.56

Confidence intervals computed via bootstrap
using 1000 samples

           Observed 95% CI             NA's
Overall D     0.121 ( 0.073,  0.159)      0
Overall D'    0.560 ( 0.373,  0.714)      0
Overall r    -0.560 (-0.714, -0.373)      0

           Contains Zero?
Overall D              NO
Overall D'             NO
Overall r              NO

Significance Test:

        Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2 = 63,
p-value = 3.463e-08

>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld  # text display
Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr.     -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n           100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr.             -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                   100
> LDtable(ld)  # graphical display
[Figure: LDtable(ld) output, a table titled "Linkage Disequilibrium" showing -D' (N) and P-value for each pair of the markers c104t, a1691g and c2249t, color-coded by significance]
> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t, 'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data = data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
    allele.count(a1691g, "G") + c2249t,
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                            Estimate Std. Error
(Intercept)                  -0.1807     0.5996
homozygote(c104t, "C")TRUE    1.0203     0.2290
allele.count(a1691g, "G")    -0.0905     0.1175
c2249tT/C                     0.4291     0.6873
c2249tT/T                     0.3476     0.5848
                            t value Pr(>|t|)
(Intercept)                   -0.30     0.76
homozygote(c104t, "C")TRUE     4.46  2.3e-05
allele.count(a1691g, "G")     -0.77     0.44
c2249tT/C                      0.62     0.53
c2249tT/T                      0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
                '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.
In the future, I intend to add functions to perform haplotype imputation and generate standard genetics plots.
I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.
I welcome comments and contributions.
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors
by Jürgen Groß
A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model

    y = Xβ + ε,   X = (1_n, x_2, ..., x_p),   ε ~ (0, σ²I_n),

where X is an n × p matrix, to diagnose collinearity is to measure how much the variance var(β̂_j) of the j-th element β̂_j of the ordinary least squares estimator β̂ = (X′X)⁻¹X′y is inflated by near linear dependencies of the columns of X. This is done by putting var(β̂_j) in relation to the variance of β̂_j as it would have been when
(a) the nonconstant columns x_2, ..., x_p of X had been centered, and

(b) the centered columns had been orthogonal to each other.
This variance is given by

    v_j = σ² / (x_j′ C x_j),   C = I_n − (1/n) 1_n 1_n′,
where C is the usual centering matrix. Then the centered variance inflation factor for the j-th variable is

    VIF_j = var(β̂_j) / v_j = σ⁻² var(β̂_j) x_j′ C x_j,   j = 2, ..., p.
It can also be written in the form

    VIF_j = 1 / (1 − R²_j),
where

    R²_j = 1 − (x_j′ M_j x_j) / (x_j′ C x_j),   M_j = I_n − X_j (X_j′ X_j)⁻¹ X_j′,
is the centered coefficient of determination when the variable x_j is regressed on the remaining independent variables (including the intercept). The matrix X_j is obtained from X by deleting its j-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
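The identity VIF_j = 1/(1 − R²_j) can be checked numerically; the following sketch, with simulated data, compares the variance-ratio definition with the R² form for one regressor:

```r
set.seed(123)
n  <- 30
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, 0, 0.5)   # deliberately correlated with x2
y  <- 1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x2 + x3)

# Variance-ratio form: sigma^-2 var(beta_2) x2' C x2,
# where summary(fit)$cov.unscaled is (X'X)^{-1}
V   <- summary(fit)$cov.unscaled
lhs <- V["x2", "x2"] * sum((x2 - mean(x2))^2)

# R^2 form: 1 / (1 - R^2) from regressing x2 on the other regressors
R2  <- summary(lm(x2 ~ x3))$r.squared
rhs <- 1 / (1 - R2)

all.equal(lhs, rhs)  # TRUE
```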
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
    UseMethod("vif")
> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if (k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors
As pointed out by Belsley (1991, p. 28/29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 147.4653.
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

    R̃²_j = 1 − (x_j′ M_j x_j) / (x_j′ x_j)
instead of the centered R2j This can also be done for
the intercept as an independent lsquovariablersquo For theabove example we obtain
(Intercept) x2 x3 x4 x5
2540378 2510568 2792701 2489071 4518498
as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that x′_j C x_j ≤ x′_j x_j, so that the centered R²_j can never exceed the uncentered R²_j. Hence the usual critical value R²_j > 0.9 (corresponding to a centered VIF > 10) for harmful collinearity should be chosen greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3 and 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model.
vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if (k) {
        v1 <- diag(V)[-k]
        v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
        nam <- nam[-k]
        c.struct <- structure(v1 * v2, names = nam)
        return(list(CenteredVIF = c.struct,
                    UncenteredVIF = uc.struct))
    }
R News ISSN 1609-3631
Vol. 3/1, June 2003
    else {
        warning(paste("No intercept term",
                      "detected. Uncentered VIFs computed."))
        return(list(UncenteredVIF = uc.struct))
    }
}
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html.
Jürgen Groß
University of Dortmund, Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux: A Package Developer's Tool
by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the Makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.
These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section, run

make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R:
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.
Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.
We will use the geepack package as an example for cross-building. First, put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.
We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do
make bundle-VR_7.0-11
The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.
The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?
If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is
    ∑_i (1/π_i) Y_i
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
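As a toy illustration of the estimator above (the numbers are ours, not from the article), the Horvitz-Thompson estimate of a total is just the sum of the sampled values divided by their inclusion probabilities:

```r
# Toy Horvitz-Thompson estimate of a population total
y  <- c(4, 7, 9)          # observed values for the sampled individuals
pr <- c(0.5, 0.25, 0.125) # their inclusion probabilities
sum(y / pr)               # 4/0.5 + 7/0.25 + 9/0.125 = 108
```

Each observation stands in for 1/π_i population members, so rarely-sampled individuals receive large weights.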
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

    v̂ar[S] = ∑_s  n_s/(n_s − 1)  ∑_{c,i} ( (1/π_sci) (Y_sci − Ȳ_s) )²

where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
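A literal transcription of the variance formula above can be sketched as follows. This is our own illustration, not code from the survey package, and it simplifies to one observation per cluster; the function and argument names are ours.

```r
# Sketch of the stratified variance formula above
# (one observation per cluster, for simplicity)
svy_var <- function(y, pr, strata) {
    tot <- 0
    for (s in unique(strata)) {
        i <- strata == s
        ybar <- weighted.mean(y[i], 1/pr[i])  # weighted stratum mean
        z <- (y[i] - ybar) / pr[i]            # inverse-probability weighted deviations
        ns <- sum(i)                          # clusters in this stratum
        tot <- tot + ns / (ns - 1) * sum(z^2)
    }
    tot
}
```

For a single stratum with equal probabilities this reduces to n/(n−1) times the sum of squared deviations, the usual between-cluster variance term.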
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used the NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is
imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact, there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGE.P + POLCT.C, design = imdsgn)
produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGE.P + pmin(3, POLCT.C), design = imdsgn)
     pmin(3, POLCT.C)
AGE.P      0      1      2       3
    0 473079 411177 655211  276013
    1 162985  72498 365445  863515
    2 144000  84519 126126 1057654
    3  89108  68925 110523 1123405
    4 133902 111098 128061 1069026
    5 137122  31668  19027 1121521
    6 127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
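Percentages like these can be read off with prop.table. As a toy check (using only the age-0 row of the weighted counts above, entered by hand), row proportions are:

```r
# Row proportions for the age-0 counts from the table above
tab <- matrix(c(473079, 411177, 655211, 276013), nrow = 1,
              dimnames = list(AGE.P = "0", doses = c("0", "1", "2", "3+")))
round(prop.table(tab, margin = 1), 3)
```

On the full svytable result, prop.table(..., margin = 1) gives the within-age dose distribution directly.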
To examine whether this varies by city size, we could tabulate
svytable(~AGE.P + pmin(3, POLCT.C) + MSASIZE.P,
         design = imdsgn)
where MSASIZE.P is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZE.P has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weightedlikelihood It takes as arguments a loglikelihoodfunction and formulas or fixed values for each pa-rameter in the likelihood For example a linear re-gression could be implemented by the code in Fig-ure 2 In this example the log=TRUE argument ispassed to dnorm and is the only fixed argument Theother two arguments mean and sd are specified by
svyglm(I(POLCT.C >= 3) ~ factor(AGE.P) + factor(MSASIZE.P), design = imdsgn, family = binomial)
svyglm(I(POLCT.C >= 3) ~ factor(AGE.P) + MSASIZE.P, design = imdsgn, family = binomial)
Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
    dm <- 2 * (x - mean) / (2 * sd^2)
    ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 -
        sqrt(2 * pi) / (sd * sqrt(2 * pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)
Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G., and Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³, and p usually no greater than 10².
2. The leading principal component of X is calculated, and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by the bootstrap, and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene x_Sk of the optimal cluster from each row of X:

    newX[i,] = X[i,] (I − x_Sk (x_Sk′ x_Sk)⁻¹ x_Sk′)

where the term in parentheses is denoted IPx.
6 The next optimal cluster is found by repeatingthe process using newX
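The sweep in step 5 can be written compactly in R. The following sketch (the function name sweep_cluster is ours, not from the article) forms IPx once and applies it to every row of X in a single matrix product:

```r
# Sweep the mean gene xS out of every row of X at once
sweep_cluster <- function(X, xS) {
    p <- length(xS)
    # IPx: projection orthogonal to xS, as defined in step 5
    IPx <- diag(p) - xS %*% solve(crossprod(xS)) %*% t(xS)
    X %*% IPx
}
```

Every row of the result is orthogonal to xS, so the discovered cluster cannot be found again in the next pass.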
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X; and secondly, the orthogonalization of the matrix X with respect to x_Sk.
In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

    Speedup = 1 / (r_s + r_p/n)

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
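Amdahl's Law is easy to encode; for example (our illustration, not from the article), a program that is 80% parallelizable (r_s = 0.2) run on 16 nodes gains a speedup of only about 4:

```r
# Amdahl's law: Speedup = 1 / (rs + rp/n), with rp = 1 - rs
speedup <- function(rs, n) 1 / (rs + (1 - rs) / n)
speedup(0.2, 16)  # approximately 4
speedup(0, 16)    # fully parallel: 16
```

The serial fraction quickly dominates: even with infinitely many nodes, the speedup can never exceed 1/r_s.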
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm, without slave-slave communication
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
R News ISSN 1609-3631
Vol. 3/1, June 2003
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
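The first (even-block) method is the one used below. As an illustrative sketch (block_rows is a hypothetical helper of ours, not part of the article's code), the block boundaries could be computed as:

```r
# Hedged sketch: start and end row indices of the block sent to each slave,
# assuming the row count divides evenly by the number of slaves.
block_rows <- function(nrows, slaves) {
  size <- nrows %/% slaves
  data.frame(slave = 1:slaves,
             start = (0:(slaves - 1)) * size + 1,
             end   = (1:slaves) * size)
}
b <- block_rows(6400, 16)   # 16 blocks of 400 rows each
```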
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes the elements within each row of X, calculates R^{*2}_i, and returns this to the master process. The master averages R^{*2}_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean of all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations in the child tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by first packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends the result back to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
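These estimates follow from Amdahl's law, S(n) = 1/((1 - P) + P/n), with parallel proportion P = 0.78. A quick check (the function name amdahl is ours, not the article's):

```r
# Expected maximum speedup under Amdahl's law for a parallel fraction P
# of the work run on n processors.
amdahl <- function(P, n) 1 / ((1 - P) + P / n)
amdahl(0.78, 2)    # about 1.64 (quoted as 1.63 in the text)
amdahl(0.78, 10)   # about 3.36 (quoted as 3.35 in the text)
```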
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as in the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
Figure 4: Comparison of computation times for variable-sized expression matrices, for overall, orthogonalization, and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes). Each panel plots computation time (seconds) against size of expression matrix (rows).
The observed speedup for the two runs of parallel gene-shaving, using the N = 12800 expression matrix, is as follows:

S = T_serial / T_parallel

S_2nodes = 13011.8 / 8253.4 = 1.6

S_16nodes = 13011.8 / 5495.9 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to more time being spent computing the bootstrap than expected. Although there is a gain with the orthogonalization, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node, other than the node which runs the master process, will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help: R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with probably its most fundamental topic: getting help.

Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore a description of the available manuals, help functions within R, and mailing lists will be given, as well as some "instructions" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g. objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and of compiling and linking this code into R.

Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
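For example (standard calls as described above; the search terms are arbitrary):

```r
# Search the installed documentation by name, title, or keyword:
help.search("linear model")   # fuzzy matching by default
# Find objects whose names partially match a pattern:
apropos("help")               # matches "help", "help.search", "help.start", ...
```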
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched there.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked (and answered) several times, focussing on slightly different aspects and hence illuminating a bigger context.

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics, such as linear models, regression, and analysis of variance, are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences, and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters; I don't think most people expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
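The deparse(substitute()) idiom mentioned here can be illustrated with a small sketch of our own (my.plot is not from the book):

```r
# deparse(substitute(x)) recovers, as a string, the expression the caller
# passed for x; a handy default for a plot title.
my.plot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
}
heights <- c(1.62, 1.75, 1.68)
# my.plot(heights) produces a plot titled "heights"
```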
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, and bar plots. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but provides enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1ndash2 give a basic introduction to R andS-Plus including a thorough treatment of data im-portexport and handling of missing values Chap-ter 3 complements the introduction with exploratory
data analysis and Box-Cox data transformations Thetarget audience of the first three sections are clearlynew users who might never have used R or S-Plusbefore
Chapters 4ndash6 form the main part of the book andshow how to apply linear models generalized linearmodels and regression diagnostics respectively Thisincludes detailed explanations of model formulaedummy-variable regression and contrasts for cate-gorical data and interaction effects Analysis of vari-ance is mostly explained using ldquoType IIrdquo tests as im-plemented by the Anova() function in package carThe infamous ldquoType IIIrdquo tests which are requestedfrequently on the r-help mailing list are providedby the same function but their usage is discouragedquite strongly in both book and help page
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities and variable selection.
Finally, Chapters 7 and 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
R News ISSN 1609-3631
Vol. 3/1, June 2003
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
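The practical consequence is that simulations run with earlier defaults are not reproduced out of the box; a minimal sketch of switching the generators back and forth might look like this:

```r
RNGversion("1.6.2")   # restore the default generators of R 1.6.x
set.seed(123)
old <- runif(5)

RNGversion("1.7.0")   # back to 'Mersenne-Twister' and 'Inversion'
set.seed(123)
new <- runif(5)       # a different stream than 'old'
```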
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).
• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).
• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)
• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.
• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
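For example (a small sketch of the new behaviour), a single subscript vector reaches into nested lists, and a character vector indexes recursively by name:

```r
x <- list(a = 1, b = list(c = 2, d = list(e = 3)))
x[[c(2, 2, 1)]]        # same as x[[2]][[2]][[1]]
x[[c("b", "d", "e")]]  # recursive indexing by names
```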
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.
• The postscript output makes use of relative moves, and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
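A brief sketch of the new function, using a standard data set (the file name in the second call is hypothetical):

```r
fit <- lm(dist ~ speed, data = cars)
out <- capture.output(summary(fit))  # one element per printed line
head(out, 3)
# or divert the printed output to a file instead:
capture.output(summary(fit), file = "fit-summary.txt")
```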
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., works for "dist" objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example() (suggested by Andy Liaw).
• New function file.symlink() to create symbolic file links where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
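force() is mainly useful when building closures, where lazy evaluation can otherwise capture a promise rather than a value; a minimal sketch:

```r
make.pow <- function(k) {
  force(k)             # evaluate 'k' now rather than on first use
  function(x) x^k
}
square <- make.pow(2)
square(4)              # 16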
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.
  load() now uses gzcon(), so can read compressed saves from suitable connections.
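A sketch of reading a compressed save() file back through an explicit connection (the file name is only illustrative):

```r
x <- rnorm(10)
save(x, file = "x.RData", compress = TRUE)
con <- gzcon(file("x.RData", "rb"))  # gzip-aware wrapper around a raw connection
load(con)
close(con)
```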
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.
  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).
• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now returns invisibly a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
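For example, mapply() applies a function to corresponding elements of several vectors:

```r
mapply(rep, 1:3, 3:1)    # ragged results are returned as a list
mapply(function(a, b) a + b, 1:3, c(10, 20, 30))  # 11 22 33
```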
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.
• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekström.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n x m inputs with m >> n.
• princomp() no longer allows the use of more variables than units: use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)
• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
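For example, a sketch drawing three random 2 x 2 tables with fixed margins:

```r
set.seed(1)
tabs <- r2dtable(3, c(3, 7), c(4, 6))  # row sums 3,7; column sums 4,6
sapply(tabs, sum)                      # each table sums to 10
```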
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().
• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().
• require() has a new argument named 'character.only' to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors different from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NA's, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments (requested by Michael Friendly).
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added "[" method for "terms" objects.
• New argument 'silent' to try().
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).
  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.
  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP: see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani
SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng, GPC library by Alan Murta

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton, ported to R by F. Fivaz
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A.J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with numbered squares 1-27 omitted.]
Across
9 Possessing a file, or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes, moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board: Douglas Bates and Thomas Lumley
Editor, Programmer's Niche: Bill Venables
Editor, Help Desk: Uwe Ligges
Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, and all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
HWE.test.genotype(x = data$c104t)

Raw Disequlibrium for each allele pair (D)

      C     T
C          0.12
T  0.12

Scaled Disequlibrium for each allele pair (D')

      C     T
C          0.56
T  0.56

Correlation coefficient for each allele pair (r)

      C     T
C  1.00 -0.56
T -0.56  1.00

Overall Values

     Value
  D   0.12
  D'  0.56
  r  -0.56

Confidence intervals computed via bootstrap
using 1000 samples

           Observed 95% CI            NA's
Overall D     0.121 ( 0.073,  0.159)     0
Overall D'    0.560 ( 0.373,  0.714)     0
Overall r    -0.560 (-0.714, -0.373)     0

           Contains Zero?
Overall D   NO
Overall D'  NO
Overall r   NO

Significance Test:

        Exact Test for Hardy-Weinberg Equilibrium

data:  data$c104t
N11 = 59, N12 = 19, N22 = 22, N1 = 137, N2
= 63, p-value = 3.463e-08
>
> # Check Linkage Disequilibrium
> ld <- LD(data)
Warning message:
Non-genotype variables or genotype variables with
more or less than two alleles detected. These
variables will be omitted: PID, DELTA.BMI
in: LD.data.frame(data)
> ld   # text display

Pairwise LD
-----------
                a1691g  c2249t
c104t  D         -0.01   -0.03
c104t  D'         0.05    1.00
c104t  Corr.     -0.03   -0.21
c104t  X^2        0.16    8.51
c104t  P-value    0.69  0.0035
c104t  n        100     100

a1691g D                 -0.01
a1691g D'                 0.31
a1691g Corr.             -0.08
a1691g X^2                1.30
a1691g P-value            0.25
a1691g n                100
> LDtable(ld)   # graphical display

[Figure: "Linkage Disequilibrium" matrix plot of Marker 1 against Marker 2, showing (minus) D' (N) and P-value for each marker pair: c104t vs. a1691g: -0.0487 (100), P = 0.68780; c104t vs. c2249t: -0.9978 (100), P = 0.00354; a1691g vs. c2249t: -0.3076 (100), P = 0.25439.]
> # fit a model
> summary(lm( DELTA.BMI ~
+             homozygote(c104t, 'C') +
+             allele.count(a1691g, 'G') +
+             c2249t, data=data))

Call:
lm(formula = DELTA.BMI ~ homozygote(c104t, "C") +
     allele.count(a1691g, "G") + c2249t,
     data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9818 -0.5917 -0.0303  0.6666  2.7101

Coefficients:
                           Estimate Std. Error
(Intercept)                 -0.1807     0.5996
homozygote(c104t, "C")TRUE   1.0203     0.2290
allele.count(a1691g, "G")   -0.0905     0.1175
c2249tT/C                    0.4291     0.6873
c2249tT/T                    0.3476     0.5848
                           t value Pr(>|t|)
(Intercept)                  -0.30     0.76
homozygote(c104t, "C")TRUE    4.46  2.3e-05
allele.count(a1691g, "G")    -0.77     0.44
c2249tT/C                     0.62     0.53
c2249tT/T                     0.59     0.55
---
Signif. codes:  0 '***' 0.001 '**' 0.01
                '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.
In the future I intend to add functions to perform haplotype imputation and generate standard genetics plots.
I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.
I welcome comments and contributions
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors
by Jürgen Groß
A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model
$$y = X\beta + \varepsilon, \qquad X = (\mathbf{1}_n, x_2, \ldots, x_p) \in \mathbb{R}^{n \times p}, \qquad \varepsilon \sim (0, \sigma^2 I_n),$$
to diagnose collinearity is to measure how much the variance $\mathrm{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\mathrm{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when
(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and
(b) the centered columns had been orthogonal toeach other
This variance is given by
$$v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n',$$
where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
$$\mathrm{VIF}_j = \frac{\mathrm{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\,\mathrm{var}(\hat\beta_j)\,x_j' C x_j, \qquad j = 2, \ldots, p.$$
It can also be written in the form
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
where
$$R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j'$$
is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
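As a small numeric sketch (hypothetical data, not from the article), the centered VIF of a regressor can be obtained directly from the $R^2$ of its auxiliary regression on the other regressors:

```r
# Sketch: centered VIF via the auxiliary-regression R^2
# (variable names and data are illustrative only)
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, sd = 0.5)          # deliberately correlated with x2
r2  <- summary(lm(x2 ~ x3))$r.squared  # centered R^2 of x2 on the rest
vif <- 1 / (1 - r2)                    # VIF_2 = 1 / (1 - R_2^2)
```

Because x3 is built from x2, `r2` is large and `vif` is well above 1, mirroring the inflation the formula describes.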
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
    UseMethod("vif")
> vif.default <- function(object, ...)
    stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    if (k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
    } else {
      v1 <- diag(V)
      v2 <- diag(Vi)
      warning(paste("No intercept term",
                    "detected. Results may surprise."))
    }
    structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors
As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
    rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899
Several runs produce similar results. Usually values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 147.4653!
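Belsley's scaled condition number can be reproduced in a few lines of R. This sketch regenerates data of the same form as above; the seed is arbitrary (not from the article), so the exact value will differ from the one quoted:

```r
# Sketch: scaled condition number of the design matrix,
# with columns scaled to unit length and the intercept included
set.seed(42)                                 # arbitrary seed
n  <- 50
x2 <- 5 + rnorm(n, 0, 0.1)
x3 <- 1:n
x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
x5 <- (x3/100)^5 * 100
X  <- cbind(1, x2, x3, x4, x5)
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # unit-length columns
d  <- svd(Xs)$d
max(d) / min(d)                              # scaled condition number
```

The near-proportionality of the intercept column and x2 drives the ratio of singular values far above Belsley's rule-of-thumb threshold of 30.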
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
$$\widetilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}$$
instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain
(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498
as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \widetilde{R}_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF $> 10$) for harmful collinearity should be greater when the uncentered VIFs are regarded.
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g. for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.
By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    k <- match("(Intercept)", nam,
               nomatch = FALSE)
    v1 <- diag(V)
    v2 <- diag(Vi)
    uc.struct <- structure(v1 * v2, names = nam)
    if (k) {
      v1 <- diag(V)[-k]
      v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
      nam <- nam[-k]
      c.struct <- structure(v1 * v2, names = nam)
      return(list(Centered.VIF = c.struct,
                  Uncentered.VIF = uc.struct))
    }
    else {
      warning(paste("No intercept term",
                    "detected. Uncentered VIFs computed."))
      return(list(Uncentered.VIF = uc.struct))
    }
  }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography

D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R Packages under Intel Linux
A Package Developer's Tool
by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting, if a current version of Linux R is in the search path, then

make CrossCompileBuild

will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes: (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created, for use as described:
bull downloads is the location to store all the sources needed for cross-building.
bull cross-tools is the location to unpack the cross-building tools.

bull WinR is the location to unpack the R source and cross-build R.
bull LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
R News ISSN 1609-3631
Vol 31 June 2003 16
bull pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.
bull WinRlibs is the location to put the resulting Windows binaries of the packages.
These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:
make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R:
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.
Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.
We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.
We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do
make bundle-VR_7.0-11
The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
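The naming convention can be sketched in shell: the make target described here is simply the tarball's name with its ".tar.gz" suffix stripped and a "pkg-" prefix added (the file name below is the example from the text):

```shell
# Derive the make target from a package tarball name
tarball="geepack_0.2-4.tar.gz"
target="pkg-${tarball%.tar.gz}"   # strip the .tar.gz suffix
echo "$target"                    # pkg-geepack_0.2-4
```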
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.
The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?
If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual $i$ is $\pi_i$, and the value of some variable or statistic for that person is $Y_i$, then an unbiased estimate of the population total is
$$\sum_i \frac{1}{\pi_i} Y_i,$$
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
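As a tiny numeric sketch (made-up values, not survey data), the estimator simply weights each sampled value by the inverse of its sampling probability:

```r
# Horvitz-Thompson estimate of a population total
ht_total <- function(y, pi) sum(y / pi)

y  <- c(10, 20, 30)       # observed values for the sampled units
pi <- c(0.1, 0.5, 0.25)   # known inclusion probabilities
ht_total(y, pi)           # 10/0.1 + 20/0.5 + 30/0.25 = 260
```

A unit sampled with probability 0.1 stands in for ten population units, so its value is counted ten times.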
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by $s$, clusters by $c$, and observations by $i$, then
$$\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_c \left( \sum_i \frac{1}{\pi_{sci}} \bigl( Y_{sci} - \bar{Y}_s \bigr) \right)^2,$$
where $S$ is the weighted sum, $\bar{Y}_s$ are weighted stratum means, and $n_s$ is the number of clusters in stratum $s$.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
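A direct transcription of the variance formula might look like the sketch below (a hypothetical helper with w = 1/π, not the package's svyCprod code, which handles many more cases):

```r
# Stratified, clustered variance of a weighted total:
# per stratum, square the per-cluster totals of the weighted
# residuals from the weighted stratum mean
strat_var <- function(y, w, strata, cluster) {
  v <- 0
  for (s in unique(strata)) {
    idx  <- strata == s
    ybar <- sum(w[idx] * y[idx]) / sum(w[idx])  # weighted stratum mean
    z    <- w[idx] * (y[idx] - ybar)            # weighted residuals
    zc   <- tapply(z, cluster[idx], sum)        # per-cluster totals
    ns   <- length(zc)
    v    <- v + ns / (ns - 1) * sum(zc^2)
  }
  v
}

strat_var(y = c(1, 2, 3, 4), w = rep(1, 4),
          strata = rep(1, 4), cluster = c(1, 1, 2, 2))  # 16
```

In the toy call, the two cluster totals of the residuals are -2 and 2, giving 2/1 * (4 + 4) = 16.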
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001) conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is
imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same psu id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGE.P + POLCT.C, design = imdsgn)
produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGE.P + pmin(3, POLCT.C), design = imdsgn)
      pmin(3, POLCT.C)
AGE.P      0      1      2       3
    0 473079 411177 655211  276013
    1 162985  72498 365445  863515
    2 144000  84519 126126 1057654
    3  89108  68925 110523 1123405
    4 133902 111098 128061 1069026
    5 137122  31668  19027 1121521
    6 127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses and roughly 10% have received none.
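Such percentages can be read off by normalising a row of the table; with base R and the age-3 counts from above:

```r
# Row proportions for the age-3 row of the weighted table
counts <- matrix(c(89108, 68925, 110523, 1123405), nrow = 1,
                 dimnames = list("3", c("0", "1", "2", "3+")))
round(prop.table(counts, margin = 1), 2)
#       0    1    2   3+
# 3  0.06 0.05 0.08 0.81
```

prop.table divides each cell by its row total, so the weighted counts become estimated population proportions.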
To examine whether this varies by city size, we could tabulate
svytable(~AGE.P + pmin(3, POLCT.C) + MSASIZE.P,
         design = imdsgn)
where MSASIZE.P is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZE.P has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a log-likelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm and is the only fixed argument. The other two arguments, mean and sd, are specified by
R News ISSN 1609-3631
Vol. 3/1, June 2003
svyglm(I(POLCTC>=3) ~ factor(AGEP) + factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3) ~ factor(AGEP) + MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).
2. Evaluate the call.
3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I^{-1} U_i.
4. Pass the ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.
5. Add a svywhatever class to the object, and a copy of the design object.
6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).
7. Override any likelihood extraction methods to fail.
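Step 3 is plain linear algebra; the following is a minimal self-contained sketch, with a made-up information matrix and score matrix purely for illustration (this is not code from svycoxph):

```r
set.seed(42)
n <- 6; p <- 2                     # toy sizes: n observations, p parameters
U <- matrix(rnorm(n * p), n, p)    # score contributions, one row per observation
A <- matrix(rnorm(p * p), p, p)
I <- crossprod(A) + diag(p)        # a positive definite "information" matrix

# Delta_i = I^{-1} U_i for every observation, in one solve
Delta <- t(solve(I, t(U)))
```

The rows of Delta are what would then be passed, together with the cluster, strata and finite population correction information, to the variance computation.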
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D. G. and Thompson, D. J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.
2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k − R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_{S_k}, from each row of X:

    new.X[i,] = X[i,] (I − x_{S_k} (x_{S_k}^T x_{S_k})^{-1} x_{S_k}^T)

where the bracketed projection term is the matrix IPx used below.
6. The next optimal cluster is found by repeating the process using new.X.
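The sweeping operation in step 5 is an orthogonal projection, which can be checked with a few lines of base R. This is an illustrative sketch with toy sizes and an arbitrary "cluster", not the authors' code:

```r
set.seed(1)
N <- 10; p <- 5
X    <- matrix(rnorm(N * p), N, p)
xbar <- colMeans(X[1:3, , drop = FALSE])           # mean gene of a toy cluster
IPx  <- diag(p) - outer(xbar, xbar) / sum(xbar^2)  # I - x (x^T x)^{-1} x^T
new.X <- X %*% IPx                                 # sweeps every row at once
```

Every row of new.X is now orthogonal to xbar: the product new.X %*% xbar is numerically zero.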
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel. Firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
    Speedup = 1 / (r_s + r_p / n)
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
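As a quick sanity check, the formula can be coded directly (a throwaway helper for this article, not part of RPVM):

```r
# Amdahl's law: expected speedup for parallel proportion rp on n processors
amdahl <- function(rp, n) {
  rs <- 1 - rp            # serial proportion
  1 / (rs + rp / n)
}

amdahl(0.78, 2)    # roughly 1.64
amdahl(0.78, 10)   # roughly 3.36
```

These values are essentially the 1.63 and 3.35 estimates used in the results section; the small difference comes from rounding the parallel proportion.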
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.
• Orthogonalization children - each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
[Figure 2: The master-slave paradigm without slave-slave communication: a master process connected to Slave 1, Slave 2, Slave 3, ..., Slave n.]
Bootstrapping
Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b) the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
R News ISSN 1609-3631
Vol 31 June 2003 23
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
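The even-block scheme amounts to simple index arithmetic; here is a minimal sketch with toy sizes (the actual RPVM sends appear in the code below):

```r
rows  <- 12   # rows of X (toy value)
tasks <- 3    # number of orthogonalization slaves
nrows <- rows %/% tasks
# row indices handled by each slave, in slave order
blocks <- lapply(1:tasks, function(i) {
  start <- nrows * (i - 1) + 1
  end   <- min(nrows * i, rows)
  start:end
})
```

Together the blocks cover every row exactly once, and keeping them in slave order preserves the row order of X on reassembly.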
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2 and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations of the child tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(D.funct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of D.funct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:
parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends the result back to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
[Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping and orthogonalization.]
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
[Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for variable sized expression matrices: panels for bootstrapping, orthogonalization and overall gene-shaving, each comparing the serial, 2 node and 16 node environments.]
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
    S = T_serial / T_parallel

    S_{2 nodes} = 1301.18 / 825.34 = 1.6

    S_{16 nodes} = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.
I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.
Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice - it also accepts regular expressions.
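For example (the exact matches depend on the packages installed):

```r
help("mean")             # documentation for a known name; same idea as ?mean
help.search("linear")    # search names, titles and keywords of installed packages
apropos("mean")          # objects whose names partially match "mean"
```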
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions already have been asked - and answered - several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed-effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics." It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only."
Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new function, RNGversion(), allows you to restore the generators of an earlier R version if reproducibility is required.
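Since the defaults changed, simulations seeded under an older R will not reproduce under 1.7.0 unless the old generators are restored first. A minimal sketch (the seed and version string are illustrative):

```r
## Draws from the current default generators are reproducible given a seed.
set.seed(42)
u1 <- runif(3)
set.seed(42)
u2 <- runif(3)
identical(u1, u2)   # TRUE

## To reproduce output from an earlier R, first restore its default
## generators, e.g. RNGversion("1.6.2"), then set the original seed.
```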
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).
• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).
• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)
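A small sketch of the padding behaviour (the vector and names are made up for illustration):

```r
x <- 1:4
names(x) <- c("a", "b")   # shorter than x: remaining names become NA
names(x)                  # "a" "b" NA NA
```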
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.
• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588.)
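For readers unfamiliar with the shorthand, a quick sketch (the list and values are made up for illustration; name vectors work the same way as index vectors):

```r
x <- list(a = 1, b = list(c = 3, d = 4))
x[[c(2, 1)]]       # same as x[[2]][[1]], i.e. 3
x[[c("b", "d")]]   # same as x[["b"]][["d"]], i.e. 4
```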
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.
• The postscript output makes use of relative moves and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
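A minimal sketch of capture.output() (the captured expression is illustrative): the printed representation comes back as a character vector, one element per line, instead of going to the console.

```r
txt <- capture.output(summary(1:10))
is.character(txt)   # TRUE: the printed summary, one string per line
```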
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., it works for "dist" objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example() (suggested by Andy Liaw).
• New function file.symlink() to create symbolic file links where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
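A small sketch of why force() is useful when building closures (the function names here are made up for illustration): it evaluates the argument when the enclosing function runs, rather than lazily at the first use inside the returned function.

```r
make_adder <- function(n) {
  force(n)              # evaluate n now, not lazily at first call of the closure
  function(x) x + n
}
add2 <- make_adder(2)
add2(5)   # 7
```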
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.
load() now uses gzcon(), so it can read compressed saves from suitable connections.
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.
To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).
• library() was changed so that, when the methods package is attached, it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument allowing the casewise changed coefficients not to be computed. This makes plot.lm() much quicker for large data sets.
• load() now invisibly returns a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
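A minimal sketch of mapply() (the function and vectors are made up for illustration): the supplied function is applied to the first elements of each argument, then the second elements, and so on.

```r
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
sums   # 5 7 9
```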
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.
• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekström.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n x m inputs with m >> n.
• princomp() no longer allows the use of more variables than units: use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)
• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
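A small sketch (the marginals are made up for illustration; row and column sums must agree in total): each generated table has exactly the requested row and column sums.

```r
set.seed(1)
tabs <- r2dtable(2, r = c(3, 2), c = c(2, 3))  # two random 2x2 tables
tabs[[1]]                                      # row sums 3, 2; column sums 2, 3
```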
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().
• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().
• require() has a new argument named character.only, to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
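A brief sketch of the pair (sample sizes and probabilities made up for illustration): rmultinom() draws counts, one column per draw, and dmultinom() evaluates the probability of a particular count vector.

```r
set.seed(7)
draws <- rmultinom(5, size = 10, prob = c(0.2, 0.3, 0.5))  # 3 x 5 matrix
colSums(draws)                                             # each column sums to 10
p <- dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))        # probability of one outcome
```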
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NA's, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments (requested by Michael Friendly).
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added a "[" method for terms objects.
• New argument 'silent' to try().
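A minimal sketch of the new argument (the error is made up for illustration): with silent = TRUE the error message is suppressed, but the returned object still records the failure.

```r
res <- try(stop("boom"), silent = TRUE)  # no error message printed
inherits(res, "try-error")               # TRUE: the failure is still detectable
```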
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
– New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).
– Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed' by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
bull TITLE files in packages are no longer used theTitle field in the DESCRIPTION file being pre-ferred TITLE files will be ignored in both in-stalled packages and source packages
bull When searching for a Fortran 77 compiler con-figure by default now also looks for Fujitsursquos frtand Compaqrsquos fort but no longer for cf77 andcft77
bull Configure checks that mixed CFortran codecan be run before checking compatibility onints and doubles the latter test was sometimesfailing because the Fortran libraries were notfound
bull PCRE and bzip2 are built from versions in the Rsources if the appropriate library is not found
bull New configure option lsquo--with-lapackrsquo to al-low high-performance LAPACK libraries to beused a generic LAPACK library will be used iffound This option is not the default
bull New configure options lsquo--with-libpngrsquolsquo--with-jpeglibrsquo lsquo--with-zlibrsquo lsquo--with-bzlibrsquoand lsquo--with-pcrersquo principally to allow theselibraries to be avoided if they are unsuitable
bull If the precious variable R_BROWSER is set atconfigure time it overrides the automatic selec-tion of the default browser It should be set tothe full path unless the browser appears at dif-ferent locations on different client machines
bull Perl requirements are down again to 5004 ornewer
bull Autoconf 257 or later is required to build theconfigure script
bull Configure provides a more comprehensivesummary of its results
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
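A NAMESPACE file is a plain-text list of directives; a minimal hypothetical example (the package and function names are invented for illustration):

```
## NAMESPACE of an imaginary package 'foo'
export(fooFit)              # only fooFit() is visible to users
S3method(print, fooFit)     # register print.fooFit() as an S3 method
```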
Deprecated & defunct
• The assignment operator '_' will be removed in the next release and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.
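The recommended idiom can be sketched as follows; try() returns an object of class "try-error" when the evaluated expression fails:

```r
# Trap a failure with try() instead of the defunct restart()
res <- try(stop("example failure"), silent = TRUE)
if (inherits(res, "try-error")) {
  res <- NA  # fall back to a default value
}
```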
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
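For illustration, the replacement form works like this (the class name is made up):

```r
x <- 1:3
class(x) <- "myclass"   # calls the replacement function class<-()
class(x)                # now "myclass"
x <- unclass(x)         # strip the class again, back to a plain integer vector
```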
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
R News ISSN 1609-3631
Vol. 3/1, June 2003
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
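The reason for redefining T and F can be sketched as follows; this illustrates the failure mode, not the actual stub that R CMD check installs:

```r
ok <- function(x) any(x, na.rm = T)  # sloppy: relies on the binding of T
T <- 0                               # T is an ordinary variable and can be rebound
ok(c(FALSE, NA))                     # na.rm = 0 acts like FALSE, so this gives NA
```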
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
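In a package's src/Makevars this might look like the following (the package itself is hypothetical; only the PKG_LIBS line is taken from the item above):

```
## src/Makevars of a package calling LAPACK routines from compiled code
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS)
```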
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, which is included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion inGLM By Luca Scrucca
effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in "An Introduction to Statistical Modeling of Extreme Values" by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.
locfit Local Regression Likelihood and density esti-mation By Catherine Loader
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[The crossword grid itself is not reproduced in this text version; see the URL above.]
Across
9. Possessing a file, or boat we hear (9)
10, 20 down. Developer code is binary peril (5,6)
11. Maturer code has no need for this (7)
12. And a volcano implodes moderately slowly (7)
13. Partly not an hyperbolic function (4)
14. Give 500 euro to R Foundation, then stir your café Breton (10)
16. Catches crooked parents (7)
17, 23 down. Developer code bloated Gauss (7,5)
19. Fast header code is re-entrant (6-4)
22, 4 down. Developer makes toilet compartments (4,8)
24. Former docks ship files (7)
25. Analysis of a skinned banana and eggs (2,5)
26. Rearrange array or a hairstyle (5)
27. Atomic number 13 is a component (9)
Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22ac
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (13,5)
7. Failed or applied (3,3)
8. International body member is ordered back before baker but can't be shown (15)
15. I hear you consort with a skeleton, having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot loot souk (8)
20. See 10ac
21. A computer sets my puzzle (6)
23. See 17ac
Recent Events: DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
homozygote(c104t, "C")TRUE     4.46  2.3e-05
allele.count(a1691g, "G")     -0.77  0.44
c2249tT/C                      0.62  0.53
c2249tT/T                      0.59  0.55
---
Signif. codes: 0 '***' 0.001 '**' 0.01
               '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11 on 95 degrees of
freedom
Multiple R-Squared: 0.176,
Adjusted R-squared: 0.141
F-statistic: 5.06 on 4 and 95 DF,
p-value: 0.000969
Conclusion
The current release of the genetics package, 1.0.0, provides a complete set of classes and methods for handling single-locus genetic data, as well as functions for computing and testing for departure from Hardy-Weinberg and linkage disequilibrium using a variety of estimators.

As noted earlier, Friedrich Leisch and I collaborated on the design of the data structures. While I was primarily motivated by the desire to provide a natural way to include single-locus genetic variables in statistical models, Fritz also wanted to support multiple genetic changes spread across one or more genes. As of the current version, my goal has largely been realized, but more work is necessary to fully support Fritz's goal.

In the future, I intend to add functions to perform haplotype imputation and generate standard genetics plots.

I would like to thank Friedrich Leisch for his assistance in designing the genotype data structure, David Duffy for contributing the code for the gregorius and HWE.exact functions, and Michael Man for error reports and helpful discussion.

I welcome comments and contributions.
Gregory R. Warnes
Pfizer Global Research and Development
gregory_r_warnes@groton.pfizer.com
Variance Inflation Factors
by Jürgen Groß

A short discussion of centered and uncentered variance inflation factors. It is recommended to report both types of variance inflation for a linear model.
Centered variance inflation factors
A popular method in the classical linear regression model
$$ y = X\beta + \varepsilon, \qquad X = (1_n, x_2, \ldots, x_p), \qquad \varepsilon \sim (0, \sigma^2 I_n), $$
where $X$ is $n \times p$, to diagnose collinearity is to measure how much the variance $\operatorname{var}(\hat\beta_j)$ of the $j$-th element $\hat\beta_j$ of the ordinary least squares estimator $\hat\beta = (X'X)^{-1}X'y$ is inflated by near linear dependencies of the columns of $X$. This is done by putting $\operatorname{var}(\hat\beta_j)$ in relation to the variance of $\hat\beta_j$ as it would have been when
(a) the nonconstant columns $x_2, \ldots, x_p$ of $X$ had been centered, and

(b) the centered columns had been orthogonal to each other.
This variance is given by
$$ v_j = \frac{\sigma^2}{x_j' C x_j}, \qquad C = I_n - \frac{1}{n} 1_n 1_n', $$
where $C$ is the usual centering matrix. Then the centered variance inflation factor for the $j$-th variable is
$$ \mathrm{VIF}_j = \frac{\operatorname{var}(\hat\beta_j)}{v_j} = \sigma^{-2}\, \operatorname{var}(\hat\beta_j)\, x_j' C x_j, \qquad j = 2, \ldots, p. $$
It can also be written in the form
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, $$
where
$$ R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' C x_j}, \qquad M_j = I_n - X_j (X_j' X_j)^{-1} X_j', $$
is the centered coefficient of determination when the variable $x_j$ is regressed on the remaining independent variables (including the intercept). The matrix $X_j$ is obtained from $X$ by deleting its $j$-th column. From this it is also intuitively clear that the centered VIF only works satisfactorily in a model with intercept.
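The identity between the two representations can be checked numerically; this sketch (with simulated data invented for illustration) compares $1/(1 - R_j^2)$, computed from the auxiliary regression of one column on the others, with the variance ratio $\sigma^{-2}\operatorname{var}(\hat\beta_j)\, x_j' C x_j$:

```r
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, sd = 0.5)   # deliberately correlated regressors
y  <- 1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x2 + x3)

# VIF for x2 via the auxiliary regression of x2 on the other regressor(s)
r2   <- summary(lm(x2 ~ x3))$r.squared
vif2 <- 1 / (1 - r2)

# VIF for x2 via var(beta_hat_2) relative to the 'ideal' variance:
# sigma^{-2} var(beta_hat_2) = [(X'X)^{-1}]_{22}, and x2'Cx2 is the
# centered sum of squares of x2
V     <- summary(fit)$cov.unscaled
vif2b <- V["x2", "x2"] * sum((x2 - mean(x2))^2)

stopifnot(isTRUE(all.equal(vif2, vif2b)))  # the two agree
```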
The centered variance inflation factor (as well as a generalized variance-inflation factor) is implemented in the R package car by Fox (2002), see also Fox (1997), and can also be computed with the function listed below, which has been communicated by Venables (2002) via the r-help mailing list:
> vif <- function(object, ...)
      UseMethod("vif")
R News ISSN 1609-3631
Vol 31 June 2003 14
> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if (k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
      } else {
          v1 <- diag(V)
          v2 <- diag(Vi)
          warning(paste("No intercept term",
                        "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors
As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:
> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4*x2 + 1*x3 + 2*x4 + 4*x5 +
      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)
Then centered variance inflation factors obtained via vif.lm(cl.lm) are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.
No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653!
As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination
$$ \tilde R_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j} $$
instead of the centered $R_j^2$. This can also be done for the intercept as an independent 'variable'. For the above example we obtain
(Intercept)        x2        x3        x4        x5
   2540.378  2510.568  27.92701  24.89071  4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.
It should be noted that $x_j' C x_j \le x_j' x_j$, so that always $R_j^2 \le \tilde R_j^2$. Hence the usual critical value $R_j^2 > 0.9$ (corresponding to centered VIF $> 10$) for harmful collinearity should be greater when the uncentered VIFs are regarded.
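The inequality is easy to verify numerically (arbitrary simulated data; the centering matrix $C$ is built directly from its definition):

```r
set.seed(2)
x <- 5 + rnorm(20)                    # a column with nonzero mean
C <- diag(20) - matrix(1/20, 20, 20)  # centering matrix I_n - (1/n) 1_n 1_n'
lhs <- drop(t(x) %*% C %*% x)         # x'Cx: the centered sum of squares
rhs <- sum(x^2)                       # x'x:  the raw sum of squares
stopifnot(lhs <= rhs)                 # centering can only remove variation
```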
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Secs. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model.
> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      v1 <- diag(V)
      v2 <- diag(Vi)
      uc.struct <- structure(v1 * v2, names = nam)
      if (k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
          c.struct <- structure(v1 * v2, names = nam)
          return(list(Centered.VIF = c.struct,
                      Uncentered.VIF = uc.struct))
      } else {
          warning(paste("No intercept term",
                        "detected. Uncentered VIFs computed."))
          return(list(Uncentered.VIF = uc.struct))
      }
  }
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF: Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R packages under Intel Linux: A Package Developer's Tool
by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.

These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting: if a current version of Linux R is in the search path, then
make CrossCompileBuild
will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.
These directories can be changed by modifying the Makefile. One may find out exactly what each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools, such as curl or links/elinks, can be used as well. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR as at the end of the last section:
make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show exactly what each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker Stephen Eglen andBrian Ripley for helpful discussions
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed, and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

    If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is \(\pi_i\), and the value of some variable or statistic for that person is \(Y_i\), then an unbiased estimate of the population total is

\[
\sum_i \frac{1}{\pi_i} Y_i
\]

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis were done on the whole population.
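To make the idea concrete, here is a small numeric sketch (plain Python, not the survey package; the population values and sampling probabilities are invented for illustration) showing that weighting each sampled value by \(1/\pi_i\) recovers the population total on average:

```python
import random

# Hypothetical population; in a real survey this would be unknown.
population = [float(i % 7 + 1) for i in range(1000)]
true_total = sum(population)  # 3997.0 for this construction

# Unequal sampling probabilities pi_i: the first 200 units are sampled
# with probability 0.5, the remaining 800 with probability 0.1.
probs = [0.5] * 200 + [0.1] * 800

def ht_estimate(pop, probs, rng):
    """One Horvitz-Thompson estimate: sum of y_i / pi_i over the sample."""
    return sum(y / p for y, p in zip(pop, probs) if rng.random() < p)

rng = random.Random(42)
estimates = [ht_estimate(population, probs, rng) for _ in range(500)]
mean_est = sum(estimates) / len(estimates)
# mean_est lands close to true_total despite the unequal probabilities
```

Averaged over repeated draws, the estimate settles near the true total even though the two groups are sampled at very different rates.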
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

\[
\widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2
\]

where S is the weighted sum, the \(\bar{Y}_s\) are weighted stratum means, and \(n_s\) is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
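As a toy illustration of the variance formula above (the data and function name are invented; this is not code from the survey package), each stratum contributes \(n_s/(n_s-1)\) times the squared deviations of the weighted cluster values about their stratum mean:

```python
# (y, weight) pairs, one per cluster, grouped by stratum; all invented.
strata = {
    "stratum 1": [(2.0, 10.0), (3.0, 10.0), (2.5, 10.0)],
    "stratum 2": [(1.0, 20.0), (1.5, 20.0)],
}

def strat_clust_var(strata):
    """Sum over strata of n_s/(n_s-1) * sum((z_c - zbar_s)^2),
    where z_c = y_c * w_c is the weighted cluster value."""
    total = 0.0
    for clusters in strata.values():
        ns = len(clusters)
        z = [y * w for y, w in clusters]
        zbar = sum(z) / ns
        total += ns / (ns - 1) * sum((zc - zbar) ** 2 for zc in z)
    return total

print(strat_clust_var(strata))  # -> 175.0
```

The stratum terms are simply added, mirroring the way independent variance contributions combine across strata in the formula.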
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Survey (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used the NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package: they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object, created by the svydesign function. A simple example call is
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)
to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to the variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
     pmin(3, POLCTC)
AGEP      0      1      2       3
   0 473079 411177 655211  276013
   1 162985  72498 365445  863515
   2 144000  84519 126126 1057654
   3  89108  68925 110523 1123405
   4 133902 111098 128061 1069026
   5 137122  31668  19027 1121521
   6 127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x - mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignores the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights \(1/\pi_i\) (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas \(\Delta_i\), or compute them from the information matrix I and score contributions \(U_i\) as \(\Delta = I^{-1}U_i\).

4. Pass the \(\Delta_i\) and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000a,b), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000a,b) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix \(X_{N,p}\), which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size \(10^3\) and p usually no greater than \(10^2\).

2. The leading principal component of X is calculated, and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of \(R^2\) values are shaved. This process is repeated to produce successive reduced matrices \(X^*_1, \dots, X^*_N\). There are [0.9N] genes in \(X^*_1\), [0.81N] in \(X^*_2\), and 1 in \(X^*_N\).

4. The optimal cluster size is selected from the \(X^*\). The selection process requires that we estimate the distribution of \(R^2\) by bootstrap, and calculate \(G_k = R^2_k - \bar{R}^{*2}_k\), where \(\bar{R}^{*2}_k\) is the mean \(R^2\) from B bootstrap samples of \(X_k\). The statistic \(G_k\) is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, \(\bar{x}_{S_k}\), from each row of X:

\[
\mathrm{new}X_{[i,]} = X_{[i,]} \underbrace{\left( I - \bar{x}_{S_k} \left( \bar{x}_{S_k}^T \bar{x}_{S_k} \right)^{-1} \bar{x}_{S_k}^T \right)}_{IP_x}
\]
6. The next optimal cluster is found by repeating the process using newX.
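The sweep in step 5 is an orthogonal projection: each row loses its component along the mean gene. A minimal self-contained sketch in plain Python (the matrix, vector and function name are invented for illustration, not taken from the article's code):

```python
def sweep_rows(X, v):
    """Apply I - v (v^T v)^-1 v^T to each row of X, written row-wise:
    subtract from each row its projection onto v."""
    vv = sum(c * c for c in v)
    swept = []
    for row in X:
        coef = sum(a * b for a, b in zip(row, v)) / vv
        swept.append([a - coef * b for a, b in zip(row, v)])
    return swept

X = [[1.0, 2.0, 3.0], [4.0, 0.0, 1.0]]   # two "genes", three "samples"
v = [1.0, 1.0, 1.0]                      # stand-in for the mean gene
newX = sweep_rows(X, v)
# every row of newX is now orthogonal to v
for row in newX:
    assert abs(sum(a * b for a, b in zip(row, v))) < 1e-12
```

Because each row is treated independently, this operation splits naturally across slave processes, which is what the parallel implementation below exploits.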
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to \(\bar{x}_{S_k}\). In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
\[
\mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}
\]
where \(r_s\) is the proportion of the program computed in serial, \(r_p\) is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
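For example, with the bootstrap accounting for roughly 78% of the serial run time (the proportion measured later in the article), Amdahl's law gives ceilings of about 1.63 for two nodes and 3.35 for ten tasks, matching the estimates quoted below. A quick sketch:

```python
def amdahl_speedup(r_p, n):
    """Amdahl's law: the serial fraction r_s = 1 - r_p runs on one
    processor, while the parallel fraction r_p is divided across n."""
    r_s = 1.0 - r_p
    return 1.0 / (r_s + r_p / n)

# With 78% of the work parallelizable:
print(round(amdahl_speedup(0.78, 2), 2))    # two nodes
print(round(amdahl_speedup(0.78, 10), 2))   # ten bootstrap tasks
```

Note the plateau: even with unlimited processors, the speedup can never exceed \(1/r_s \approx 4.5\) for this algorithm.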
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
- Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

- Orthogonalization children: each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication: a master task connected to slaves 1, 2, 3, ..., n.
Bootstrapping
Bootstrapping is used to find \(\bar{R}^{*2}\), used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
The matrix X is transformed into \(X^*\) B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic \(R^2_b\) is derived, and the mean of \(R^2_b\), b = 1, ..., B, is calculated to represent the \(R^2\) which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of \(IP_x\), as defined in step 5 of the gene-shaving algorithm.
The value of \(IP_x\) is passed to each child. Each child then applies \(IP_x\) to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
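The even-block split can be sketched as simple index arithmetic (1-based, to match the R code in the next section; as the article notes, this assumes the number of rows divides evenly among the slaves):

```python
def row_blocks(rows, tasks):
    """1-based (start, end) row ranges, one per slave, as in the
    master's loop: block i covers rows nrows*(i-1)+1 .. nrows*i."""
    nrows = rows // tasks
    return [(nrows * (i - 1) + 1, min(nrows * i, rows))
            for i in range(1, tasks + 1)]

print(row_blocks(12, 4))  # four slaves get rows 1-3, 4-6, 7-9, 10-12
```

One send per slave, rather than one per row, keeps the buffer packing and unpacking overhead to a minimum.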
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates \(R^{*2}\), and returns this to the master process. The master averages the \(R^{*2}_i\), i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated \(R^2\) values. This is controlled by a simple for loop; as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean of all the \(R^2\) values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an \(R^2\) value is calculated and returned to the master. The following code performs these operations in the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the \(R^2\) calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the \(IP_x\) matrix and the expression matrix X. The first step of this procedure is to broadcast the \(IP_x\) matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the number of rows must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows, and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
[Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping and orthogonalization.]
Clearly the bootstrapping (blue) step is quite time-consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization there is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
[Figure 4: Comparison of computation times for variable-sized expression matrices for the overall algorithm, orthogonalization and bootstrapping, under the serial, two-node and sixteen-node environments. Each panel plots computation time (seconds) against size of expression matrix (rows).]
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
\[
S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}}, \qquad
S_{2\,\mathrm{nodes}} = \frac{1301.188}{825.34} = 1.6, \qquad
S_{16\,\mathrm{nodes}} = \frac{1301.188}{549.59} = 2.4
\]
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing the bootstrap than expected. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node, other than the node which runs the master process, will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm, which can fully utilise each cluster node, will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
R News ISSN 1609-3631
Vol. 3/1, June 2003
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed or the available R version is outdated, recent versions of the manuals are also available from CRAN [1]. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well [2].
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and compiling and linking this code into R.

[1] http://CRAN.R-project.org
[2] http://CRAN.R-project.org/faqs.html
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
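For example (using package 'tools', shipped with R, purely as a stand-in for any installed package):

```r
# Overview of an installed package: title, index of functions and
# data sets, and a pointer to any package-specific documentation
info <- library(help = "tools")
# print(info) displays the listing
```

The same information is also available via help(package = "tools").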
Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
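A short session illustrating these access paths (the commented-out options are those named above and only take effect where the corresponding help format is available):

```r
h <- help("mean")   # locate the help page for mean(); ?mean is shorthand
# print(h) displays the page, just as typing ?mean at the prompt does

# switch the whole session to HTML help, where available:
# options(htmlhelp = TRUE)
# or, on Windows, to Compiled HTML help:
# options(chmhelp = TRUE)
```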
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
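Both search functions can be tried directly; a minimal example:

```r
# search installed documentation by name, title and keyword entries
hits <- help.search("regression")
# print(hits) lists the matching help pages with their packages

# find objects whose names match a regular expression
apropos("^help")    # e.g. "help", "help.search", "help.start"
```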
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search [3] provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists [4]: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list [5]. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
[3] http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
[4] accessible via http://www.R-project.org/mail.html
[5] the archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ. Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing – An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
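The clutter described here is easy to avoid by pairing every attach() with a detach(); the data frame below is made up purely for illustration:

```r
# a made-up data frame standing in for one of the book's data sets
growth <- data.frame(height = c(12, 15, 19), week = 1:3)

attach(growth)
m <- mean(height)   # 'height' is found via the attached data frame
detach(growth)      # clean up the search path again

# after detach(), 'height' is no longer visible, so a later script
# with its own 'height' column cannot pick up this one by mistake
```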
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2, base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book; many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
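As an illustration of the milder changes, the two function substitutions mentioned above look like this in R (using R's built-in warpbreaks data; the S-Plus calls appear only in comments):

```r
# S-Plus: ks.gof(x)       -- Kolmogorov-Smirnov test in R:
x <- rnorm(50)
ks.test(x, "pnorm")

# S-Plus: multicomp(fit)  -- Tukey multiple comparisons in R:
fit <- aov(breaks ~ tension, data = warpbreaks)
TukeyHSD(fit)
```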
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R
Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
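The deparse(substitute()) trick mentioned above captures the expression a caller supplied for an argument; a minimal sketch (myplot is a made-up function name, not one from the book):

```r
# use the caller's expression for 'x' as the default plot title
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
  invisible(main)
}

heights <- c(150, 163, 171)
# myplot(heights) draws the plot titled "heights", automatically
```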
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
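The idea behind the "Type II" tests can be sketched in base R (car's Anova() automates this; the built-in warpbreaks data is used here only for illustration):

```r
# Type II test for the main effect of wool: compare models that both
# contain the other main effect (tension) but no interaction
smaller <- lm(breaks ~ tension,        data = warpbreaks)
larger  <- lm(breaks ~ tension + wool, data = warpbreaks)
anova(smaller, larger)   # F test for adding wool after tension
```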
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).
• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.
• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() in package mva allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting (suggested by U. Ligges).
• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now returns invisibly a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.
• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n×m inputs with m >> n.
• princomp() no longer allows the use of more variables than units: use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)
• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as NA, a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().
• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().
• require() has a new argument named 'character.only' to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NA's, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments (requested by Michael Friendly).
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added "[" method for "terms" objects.
• New argument 'silent' to try().
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).
  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
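To make a few of the items above concrete, here is a small usage sketch; this is my own illustration, not part of the newsletter, using only the functions and arguments named in the entries (mapply(), r2dtable(), and qr() with LAPACK = TRUE):

```r
# mapply(): a multivariate lapply(); applies the function to the first
# elements of each argument, then the second, and so on.
sums <- mapply(function(x, y) x + y, 1:3, c(10, 20, 30))

# r2dtable(): a random two-way table with the given row and column sums.
set.seed(1)
tab <- r2dtable(1, c(5, 5, 5), c(10, 5))[[1]]

# qr() with LAPACK routines, used through the helper routine qr.coef().
A <- matrix(c(2, 0, 1, 3), 2, 2)
qa <- qr(A, LAPACK = TRUE)
x <- qr.coef(qa, c(5, 6))   # coefficients solving A x = c(5, 6)
```

Each draw from r2dtable() preserves the requested margins exactly, which is what makes it useful for conditional tests on contingency tables.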
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch

New contributed packages

Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis, and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic', in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid (numbered squares 1-27) not reproduced in this text version.]
Across
9. Possessing a file or boat, we hear (9)
10, 20 down. Developer code is binary peril (5,6)
11. Maturer code has no need for this (7)
12. And a volcano implodes moderately slowly (7)
13. Partly not an hyperbolic function (4)
14. Give 500 euro to R Foundation, then stir your café Breton (10)
16. Catches crooked parents (7)
17, 23 down. Developer code bloated Gauss (7,5)
19. Fast header code is re-entrant (6-4)
22, 4 down. Developer makes toilet compartments (4,8)
24. Former docks ship files (7)
25. Analysis of a skinned banana and eggs (2,5)
26. Rearrange array or a hairstyle (5)
27. Atomic number 13 is a component (9)
Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22ac
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (1,3,5)
7. Failed or applied (3,3)
8. International body member is ordered back before baker but can't be shown (15)
15. I hear you consort with a skeleton having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot loot souk (8)
20. See 10ac
21. A computer sets my puzzle (6)
23. See 17ac
Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
> vif.default <- function(object, ...)
      stop("No default method for vif. Sorry.")
> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
      } else {
          v1 <- diag(V)
          v2 <- diag(Vi)
          warning(paste("No intercept term",
                        "detected. Results may surprise."))
      }
      structure(v1 * v2, names = nam)
  }
Uncentered variance inflation factors

As pointed out by Belsley (1991, pp. 28-29), the centered VIF can be disadvantageous when collinearity effects are related to the intercept. To illustrate this point, let us consider the following small ad-hoc example:

> n <- 50
> x2 <- 5 + rnorm(n, 0, 0.1)
> x3 <- 1:n
> x4 <- 10 + x2 + x3 + rnorm(n, 0, 10)
> x5 <- (x3/100)^5 * 100
> y <- 10 + 4 + x2 + 1*x3 + 2*x4 + 4*x5 +
      rnorm(n, 0, 50)
> cl.lm <- lm(y ~ x2 + x3 + x4 + x5)

Then the centered variance inflation factors, obtained via vif.lm(cl.lm), are

      x2       x3       x4       x5
1.017126 6.774374 4.329862 3.122899

Several runs produce similar results. Usually, values greater than 10 are said to indicate harmful collinearity, which is not the case above, although the VIFs report moderate collinearity related to x3 and x4. Since the smallest possible VIF is 1, the above result seems to indicate that x2 is completely unharmed by collinearity.

No harmful collinearity in the model, when the second column x2 of the matrix X is nearly five times the first column, and the scaled condition number, see Belsley (1991, Sec. 3.3), is 1474.653?

As a matter of fact, the centered VIF requires an intercept in the model, but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects.
Now, as indicated by Belsley (1991), an alternative way to measure variance inflation is simply to apply the classical (uncentered) coefficient of determination

    \tilde{R}_j^2 = 1 - \frac{x_j' M_j x_j}{x_j' x_j}

instead of the centered R_j^2. This can also be done for the intercept as an independent 'variable'. For the above example we obtain

(Intercept)       x2       x3       x4       x5
   25.40378 25.10568 2.792701 2.489071 4.518498

as uncentered VIFs, revealing that the intercept and x2 are strongly affected by collinearity, which is in fact the case.

It should be noted that x_j' C x_j <= x_j' x_j, so that always R_j^2 <= \tilde{R}_j^2. Hence the usual critical value R_j^2 > 0.9 (corresponding to centered VIF > 10) for harmful collinearity should be greater when the uncentered VIFs are regarded.
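As a quick self-contained sketch (my own code, not the article's): the uncentered VIF for coefficient j equals diag((X'X)^{-1})_jj * (X'X)_jj, the product used below, here applied to R's built-in cars data:

```r
# Sketch: uncentered VIFs for a fitted lm, intercept included,
# computed as diag((X'X)^{-1}) * diag(X'X).
uvif <- function(object) {
  V  <- summary(object)$cov.unscaled      # (X'X)^{-1}
  Vi <- crossprod(model.matrix(object))   # X'X
  structure(diag(V) * diag(Vi), names = names(coef(object)))
}
fit <- lm(dist ~ speed, data = cars)
uvif(fit)   # named vector for (Intercept) and speed, each >= 1
```

By the Cauchy-Schwarz inequality, each such product is at least 1, so the usual interpretation of VIFs as inflation factors carries over.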
Conclusion
When we use collinearity analysis for finding a possible relationship between impreciseness of estimation and near linear dependencies, then an intercept in the model should not be excluded from the analysis, since the corresponding parameter is as important as any other parameter (e.g., for prediction purposes). See also Belsley (1991, Sec. 6.3, 6.8) for a discussion of mean centering and the constant term in connection with collinearity.

By reporting both centered and uncentered VIFs for a linear model, we can obtain an impression about the possible involvement of the intercept in collinearity effects. This can easily be done by complementing the above vif.lm function, since the uncentered VIFs can be computed as in the else part of the function, namely as diag(V) * diag(Vi). By this, the uncentered VIFs are computed anyway, and not only in the case that no intercept is in the model:
> vif.lm <- function(object, ...) {
      V <- summary(object)$cov.unscaled
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      k <- match("(Intercept)", nam,
                 nomatch = FALSE)
      v1 <- diag(V)
      v2 <- diag(Vi)
      uc.struct <- structure(v1 * v2, names = nam)
      if(k) {
          v1 <- diag(V)[-k]
          v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
          nam <- nam[-k]
          c.struct <- structure(v1 * v2, names = nam)
          return(list(Centered.VIF = c.struct,
                      Uncentered.VIF = uc.struct))
      } else {
          warning(paste("No intercept term",
                        "detected. Uncentered VIFs computed."))
          return(list(Uncentered.VIF = uc.struct))
      }
  }

The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html

Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions ofR and R packages under Intel LinuxA Package Developerrsquos Tool
by Jun Yan and AJ Rossini
It is simple to build R and R packages for MicrosoftWindows from an ix86 Linux machine This is veryuseful to package developers who are familiar withUnix tools and feel that widespread disseminationof their work is important The resulting R binariesand binary library packages require only a minimalamount of work to install under Microsoft WindowsWhile testing is always important for quality assur-ance and control we have found that the procedureusually results in reliable packages This documentprovides instructions for obtaining installing andusing the cross-compilation tools
These steps have been put into a Makefile whichaccompanies this document for automating the pro-cess The Makefile is available from the contributeddocumentation area on Comprehensive R ArchiveNetwork (CRAN) The current version has beentested with R-170
For the impatient and trusting if a current ver-sion of Linux R is in the search path then
make CrossCompileBuild
will run all Makefile targets described up to the section on building R packages and bundles. This assumes that (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section Building R packages and bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed), placing everything into the RCrossBuild/downloads directory. The wget program is used to get all needed sources; other downloading tools such as curl or links/elinks can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area and unpacks the cross-compiler tools into it.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR, in which the cross-building of R is carried out, and unpacks the R sources into it. As of 1.7.0 the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and don't have permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/MkRules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR as described at the end of the last section, use

make LinuxFresh=YES mkrules

This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R:
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. It may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.

This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, that this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.

Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.

We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do
make bundle-VR_7.0-11
The Windows binaries of the packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.

The Makefile assumes a tarred and gzipped source for an R package, whose name ends with ".tar.gz" and contains the version number together with the package name. This is usually created through the R CMD build command.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show exactly what each step does.

The commands can also be cut-and-pasted out of the Makefile, with minor modification (such as changing $$ to $ for environment variable names), and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R

by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed, and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.

The basic question of survey statistics is:

If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.

For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.

Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

  \sum_i \frac{1}{\pi_i} Y_i

regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis were done on the whole population.
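As a toy illustration (base R only, with made-up numbers, not NHIS data), the Horvitz–Thompson total is just the sum of the sampled values divided by their inclusion probabilities:

```r
## Hypothetical sample of 3 units with known inclusion
## probabilities pi; y holds the observed values.
y  <- c(12, 9, 15)
pi <- c(0.5, 0.25, 0.25)

## Horvitz-Thompson estimate of the population total:
## each observation "stands in" for 1/pi_i population units.
T.hat <- sum(y / pi)
T.hat   # 12/0.5 + 9/0.25 + 15/0.25 = 120
```

Each unit sampled with probability 1/4 represents four population units, which is why the weighted total is much larger than the sample total.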
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

  \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2

where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object, created by the svydesign function. A simple example call is:
imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.

In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata = ~region + rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP + POLCTC, design = imdsgn)

produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
      pmin(3, POLCTC)
AGEP        0      1      2       3
   0   473079 411177 655211  276013
   1   162985  72498 365445  863515
   2   144000  84519 126126 1057654
   3    89108  68925 110523 1123405
   4   133902 111098 128061 1069026
   5   137122  31668  19027 1121521
   6   127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
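These percentages can be checked directly from the weighted counts printed above (rows for ages 3 to 6, copied by hand into base R):

```r
## Weighted counts for ages 3-6; columns are 0, 1, 2, and 3+ doses,
## copied from the svytable output above.
counts <- rbind(
  "3" = c( 89108,  68925, 110523, 1123405),
  "4" = c(133902, 111098, 128061, 1069026),
  "5" = c(137122,  31668,  19027, 1121521),
  "6" = c(127487,  38406,  51318,  838825))

sum(counts[, 4]) / sum(counts)  # proportion with 3+ doses: about 0.80
sum(counts[, 1]) / sum(counts)  # proportion with no doses: about 0.09
```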
To examine whether this varies by city size we could tabulate

svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, plus a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse; a regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, with 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP), design = imdsgn, family = binomial)
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP, design = imdsgn, family = binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
    dm <- 2 * (x - mean)/(2 * sd^2)
    ds <- (x - mean)^2 * (2 * (2 * sd))/(2 * sd^2)^2 - sqrt(2 * pi)/(sd * sqrt(2 * pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix; with independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I⁻¹ U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svy.whatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames = FALSE, design.summaries = FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for the survey regression functions currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G. and Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663–685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
R News ISSN 1609-3631
Vol. 3/1, June 2003
Computational Gains Using RPVM on a Beowulf Cluster

by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents the expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³, and p usually no greater than 10².

2. The leading principal component of X is calculated, and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R̄*²_k, where R̄*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:

  \mathrm{new}X_{[i,]} = X_{[i,]} \underbrace{\left( I - \bar{x}_{S_k} (\bar{x}_{S_k}^T \bar{x}_{S_k})^{-1} \bar{x}_{S_k}^T \right)}_{IP_x}
6. The next optimal cluster is found by repeating the process using newX.
Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
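Before turning to the parallel code, one serial shave step (steps 2 and 3 above) can be sketched in a few lines of base R. This is a toy illustration on random data, not the authors' implementation; for simplicity the eigen-gene is taken as the leading right singular vector of X:

```r
set.seed(1)
N <- 100; p <- 8
X <- matrix(rnorm(N * p), nrow = N)     # toy "expression matrix"

shave <- function(X, prop = 0.1) {
  eigen.gene <- svd(X)$v[, 1]           # leading right singular vector
  r2 <- apply(X, 1, function(g) cor(g, eigen.gene)^2)
  X[r2 > quantile(r2, prop), , drop = FALSE]  # drop lowest 10% of R^2
}

X1 <- shave(X)
nrow(X1)   # 90 genes survive the first shave
```

Iterating shave() on its own output yields the nested sequence X*_1, X*_2, ... described in step 3.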
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel; it can be represented in mathematical form by

  \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}}

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
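As a quick numerical illustration of the formula (the serial fractions below are made-up figures, not measurements of gene-shaving):

```r
## Amdahl's Law: estimated speedup when a fraction rs of the
## work is serial and the remainder runs on n processors.
speedup <- function(rs, n) 1 / (rs + (1 - rs)/n)

speedup(0.5, 16)   # if half the work is serial: about 1.9, not 16
speedup(0.1, 16)   # even at 90% parallel: about 6.4
```

This is why a partly parallel algorithm on 16 nodes falls well short of a 16-fold speedup.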
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children: each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as the slaves are independent of one another.
Figure 2: The master-slave paradigm, without slave-slave communication
Bootstrapping
Bootstrapping is used to find R̄*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
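In base R, that serial permutation step is essentially a one-liner (toy matrix shown; apply returns its results as columns, hence the transpose):

```r
set.seed(42)
X  <- matrix(rnorm(5 * 8), nrow = 5)   # 5 "genes", 8 samples
Xb <- t(apply(X, 1, sample))           # permute within each row
dim(Xb)                                # same 5 x 8 shape as X
```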
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of the R²_b, b = 1, ..., B, is calculated to represent the R² that arises by chance only. In RPVM, a task is spawned for each bootstrap sample; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to each of its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions equals the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program; these steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes the elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*²_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
par.bootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop; as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean of all R² values returned by the children. In the slave tasks the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations in the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = F)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(D.funct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap permutation is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of D.funct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
R News ISSN 1609-3631
Vol. 3/1, June 2003
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:
parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1])
  Z[i, ] <- Xpart[i, ] %*% IPx
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. rpvm is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results: computation time (seconds) against size of the expression matrix (rows), for the complete gene-shaving run, the bootstrapping step and the orthogonalization step.
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization there is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
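These estimates follow directly from Amdahl's law, S(n) = 1/((1 - p) + p/n), with p = 0.78. A minimal R sketch (the function name amdahl is ours, not from the article):

```r
## Amdahl's law: maximum speedup when a fraction p of the
## work can run in parallel on n processors
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(0.78, 2)    # about 1.6
amdahl(0.78, 10)   # about 3.4
```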
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times for variable sized expression matrices, for the overall algorithm, orthogonalization and bootstrapping, under the different environments (serial, 2 nodes and 16 nodes). Each panel plots computation time (seconds) against size of expression matrix (rows).
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = Tserial / Tparallel

S2nodes = 1301.18 / 825.34 = 1.6

S16nodes = 1301.18 / 549.59 = 2.4
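As a quick check, assuming timings (in seconds) for the N = 12800 runs of 1301.18 serially, 825.34 on two nodes and 549.59 on sixteen nodes, the observed speedups can be computed in a couple of lines of R:

```r
t.serial <- 1301.18            # serial run, seconds
t.par    <- c(825.34, 549.59)  # 2-node and 16-node runs
round(t.serial / t.par, 1)     # about 1.6 and 2.4
```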
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node, other than the node which runs the master process, will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS spring joint computer conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.

Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with probably its most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and on compiling and linking this code into R.

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
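For instance (a sketch; the topic seq is an arbitrary example, and the option names are those described above):

```r
help("seq")              # formatted text help for seq()
?seq                     # the usual shorthand
options(htmlhelp = TRUE) # switch to HTML help for this session
options(chmhelp = TRUE)  # Windows: Compiled HTML help
```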
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
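A brief illustration of these two search facilities (the search strings are arbitrary examples, not from the text):

```r
help.search("linear model")           # fuzzy matching by default
help.search("^glm", fields = "name")  # regular expression, names only
apropos("^var")                       # objects whose names start with "var"
```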
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the immediate problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable; see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.

Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset, transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression, contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, with methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
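At the prompt, the new behaviour looks like this (a sketch for R 1.7.0):

```r
m <- matrix(1:4, nrow = 2)
class(m)       # "matrix": the implicit class, now always reported
oldClass(m)    # NULL: the class attribute itself is unset
oldClass(m) <- "myclass"
class(m)       # "myclass"
```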
bull The default random number generators havebeen changed to lsquoMersenne-Twisterrsquo and lsquoIn-versionrsquo A new RNGversion() function allowsyou to restore the generators of an earlier R ver-sion if reproducibility is required
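For example, a minimal sketch of restoring and then reverting the generators (generator versions as documented in ?RNGversion):

```r
## Restore the generators in use up to R 1.6.2, then seed as before
RNGversion("1.6.2")
set.seed(42)
old <- runif(3)

## Switch back to the current defaults (Mersenne-Twister / Inversion)
RNGversion(as.character(getRversion()))
set.seed(42)
new <- runif(3)   # in general differs from 'old'
```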
• Name spaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a name space. All the base packages (except methods and tcltk) have name spaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)
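The new padding behaviour can be sketched as:

```r
x <- 1:3
names(x) <- "a"   # too short: remaining names padded with NA
names(x)          # "a" NA NA
```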
• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588.)
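A small illustration of the recursive indexing:

```r
x <- list(1, 2, 3, list("a", "b"))
x[[c(4, 2)]]               # same as x[[4]][[2]], i.e. "b"

## Names work in the same way:
y <- list(outer = list(inner = 42))
y[[c("outer", "inner")]]   # 42
```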
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
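For example, capturing printed output as a character vector:

```r
out <- capture.output(summary(1:10))
class(out)            # character vector, one element per printed line
cat(out, sep = "\n")  # re-emit the captured lines
```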
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.
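A small sketch of the interface (the feasible region is {theta : ui %*% theta >= ci}, and the starting value must satisfy the constraints strictly; the objective function here is just an illustration):

```r
## Minimise (x - 1)^2 + (y - 2)^2 subject to x + y <= 2,
## written as -x - y >= -2 for constrOptim()
f <- function(p) sum((p - c(1, 2))^2)
res <- constrOptim(theta = c(0.5, 0.5), f = f, grad = NULL,
                   ui = matrix(c(-1, -1), nrow = 1), ci = -2)
res$par   # close to the constrained optimum (0.5, 1.5)
```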
• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with name spaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so can read compressed saves from suitable connections.
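A minimal sketch of writing and reading a compressed save through gzcon():

```r
tf <- tempfile()
x <- 1:10

con <- gzcon(file(tf, "wb"))   # writes gzip-compressed data
save(x, file = con)
close(con)

con <- gzcon(file(tf, "rb"))   # decompresses on the fly
load(con)                      # restores 'x'
close(con)
```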
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().
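For example, applying a function elementwise over several argument vectors:

```r
## rep(1, 3), rep(2, 2), rep(3, 1) in one call;
## unequal result lengths, so a list is returned
mapply(rep, 1:3, 3:1)
```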
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the "wrong tail" in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
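For example, generating random tables with fixed margins:

```r
set.seed(1)
## Two random 2x2 tables with row sums c(3, 2) and column sums c(2, 3)
tabs <- r2dtable(2, r = c(3, 2), c = c(2, 3))
sapply(tabs, rowSums)   # every column is c(3, 2)
sapply(tabs, colSums)   # every column is c(2, 3)
```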
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NAs, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
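The effect of 'keep.order' can be sketched as:

```r
f <- y ~ a:b + a
attr(terms(f), "term.labels")                     # reordered: "a" "a:b"
attr(terms(f, keep.order = TRUE), "term.labels")  # as given:  "a:b" "a"
```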
• Added a "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().
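For example, with duplicated time stamps:

```r
tt <- as.POSIXct("2003-06-01 12:00:00", tz = "GMT") + c(0, 3600, 0)
unique(tt)    # the two distinct time points
factor(tt)    # levels built from the unique times
```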
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind, and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated &amp; defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl/) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP: see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen &amp; Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert &amp; Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in "An Introduction to Statistical Modeling of Extreme Values" by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier &amp; Alain Croquette; R port by Mathieu Ros &amp; Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass, and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with numbered squares 1-27, not reproducible in text.]
Across
9. Possessing a file or boat, we hear (9)
10, 20 down. Developer code is binary peril (5,6)
11. Maturer code has no need for this (7)
12. And a volcano implodes moderately slowly (7)
13. Partly not an hyperbolic function (4)
14. Give 500 euro to R Foundation, then stir your café Breton (10)
16. Catches crooked parents (7)
17, 23 down. Developer code bloated Gauss (7,5)
19. Fast header code is re-entrant (6-4)
22, 4 down. Developer makes toilet compartments (4,8)
24. Former docks ship files (7)
25. Analysis of a skinned banana and eggs (2,5)
26. Rearrange array or a hairstyle (5)
27. Atomic number 13 is a component (9)

Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22ac
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (13,5)
7. Failed or applied (3,3)
8. International body member is ordered back before baker, but can't be shown (15)
15. I hear you consort with a skeleton, having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot, loot souk (8)
20. See 10ac
21. A computer sets my puzzle (6)
23. See 17ac
Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combination methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria
Editorial BoardDouglas Bates and Thomas Lumley
Editor Programmerrsquos NicheBill Venables
Editor Help DeskUwe Ligges
Email of editors and editorial boardfirstnamelastname R-projectorg
R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
R News ISSN 1609-3631
Vol. 3/1, June 2003
else
    warning(paste("No intercept term",
        "detected. Uncentered VIFs computed."))
return(Uncentered.VIF = ucstruct)
The user of such a function should allow greater critical values for the uncentered than for the centered VIFs.
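A minimal sketch may make the centered VIF computation concrete. The function below is an illustrative reimplementation, not the article's function: centered VIFs are the diagonal entries of the inverse correlation matrix of the predictors.

```r
# Illustrative reimplementation (not the article's code): centered
# VIFs are the diagonal of the inverse correlation matrix of the
# predictors.
vif_centered <- function(X) {
  # X: numeric matrix of predictors (no intercept column)
  diag(solve(cor(X)))
}

# Two strongly correlated predictors and one independent one:
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(100)
round(vif_centered(cbind(x1, x2, x3)), 2)
```

The VIFs of x1 and x2 come out large, while that of x3 stays near 1.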
Bibliography
D. A. Belsley (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York.

J. Fox (1997). Applied Regression, Linear Models, and Related Methods. Sage Publications, Thousand Oaks.

J. Fox (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks.

W. N. Venables (2002). [R] RE: [S] VIF, Variance Inflation Factor. http://www.R-project.org/nocvs/mail/r-help/2002/8566.html
Jürgen Groß
University of Dortmund
Vogelpothsweg 87
D-44221 Dortmund, Germany
gross@statistik.uni-dortmund.de
Building Microsoft Windows Versions of R and R Packages under Intel Linux: A Package Developer's Tool
by Jun Yan and A.J. Rossini
It is simple to build R and R packages for Microsoft Windows from an ix86 Linux machine. This is very useful to package developers who are familiar with Unix tools and feel that widespread dissemination of their work is important. The resulting R binaries and binary library packages require only a minimal amount of work to install under Microsoft Windows. While testing is always important for quality assurance and control, we have found that the procedure usually results in reliable packages. This document provides instructions for obtaining, installing, and using the cross-compilation tools.
These steps have been put into a Makefile, which accompanies this document, for automating the process. The Makefile is available from the contributed documentation area on the Comprehensive R Archive Network (CRAN). The current version has been tested with R-1.7.0.
For the impatient and trusting: if a current version of Linux R is in the search path, then
make CrossCompileBuild
will run all Makefile targets described up to the section Building R Packages and Bundles. This assumes (1) you have the Makefile in a directory RCrossBuild (empty except for the makefile), and (2) that RCrossBuild is your current working directory. After this, one should manually set up the packages of interest for building, though the makefile will still help with many important steps. We describe this in detail in the section on Building R Packages and Bundles.
Setting up the build area
We first create a separate directory for R cross-building, say RCrossBuild, and will put the Makefile into this directory. From now on, this directory is stored in the environment variable $RCB. Make sure the file name is Makefile, unless you are familiar with using the make program with other control file names.
Create work area
The following steps should take place in the RCrossBuild directory. Within this directory, the Makefile assumes the following subdirectories either exist or can be created for use as described:
• downloads is the location to store all the sources needed for cross-building.

• cross-tools is the location to unpack the cross-building tools.

• WinR is the location to unpack the R source and cross-build R.

• LinuxR is the location to unpack the R source and build a fresh Linux R, which is only needed if your system doesn't have the current version of R.
• pkgsrc is the location of package sources (tarred and gzipped from R CMD build) that need to cross-build.

• WinRlibs is the location to put the resulting Windows binaries of the packages.
These directories can be changed by modifying the Makefile. One may find out exactly what each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources. Other downloading tools, such as curl or links/elinks, can be used as well. We place these sources into the RCrossBuild/downloads directory. The URLs of these sources are available in file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
Cross-tools setup
make xtools
This rule creates a separate subdirectory, cross-tools, in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory, WinR, to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required, as of R-1.7.0, if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:
make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember, though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.
Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.
We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.
We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory, and then do
make bundle-VR_7.0-11
The Windows binaries of packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show exactly what each step does.
The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker Stephen Eglen andBrian Ripley for helpful discussions
Jun Yan
University of Wisconsin–Madison, USA
jyan@stat.wisc.edu
A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed, and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is
If we did exactly this analysis on the whole population, what result would we get?
If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

    ∑_i (1/π_i) Y_i
regardless of clustering or stratification. This is called the Horvitz–Thompson estimator (Horvitz & Thompson, 1952). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis were done on the whole population.
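The estimator above can be checked on a toy population. The following sketch uses entirely hypothetical data (Poisson sampling with known, unequal inclusion probabilities); it is not from the article.

```r
# Toy illustration: Horvitz-Thompson estimation of a population
# total. Units are sampled independently with known, unequal
# inclusion probabilities p_i (Poisson sampling).
set.seed(42)
N   <- 1000
y   <- rgamma(N, shape = 2)           # variable of interest
p_i <- runif(N, 0.05, 0.5)            # known inclusion probabilities
s   <- as.logical(rbinom(N, 1, p_i))  # sampling indicators

ht_total <- sum(y[s] / p_i[s])        # inverse-probability weighted total

c(true = sum(y), estimate = ht_total)
```

The weighted total tracks the true population total, even though low-probability units are underrepresented in the sample.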
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not, in general, maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage, and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance: the former decreasing it, and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

    var̂[S] = ∑_s [n_s/(n_s − 1)] ∑_{c,i} ((1/π_sci)(Y_sci − Ȳ_s))²

where S is the weighted sum, the Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
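The stratified variance formula above can be transcribed almost literally into R. The sketch below is schematic only (it is ours, not the svyCprod implementation): it assumes one observation per cluster, takes the stratum mean over the weighted values, and requires at least two clusters per stratum.

```r
# Schematic transcription of the stratified variance formula
# (illustrative only, not the survey package's svyCprod).
# y: observations; prob: sampling probabilities; stratum: labels.
# One observation per cluster; >= 2 clusters per stratum assumed.
svy_var_sum <- function(y, prob, stratum) {
  z <- y / prob                      # inverse-probability weighted values
  total <- 0
  for (s in unique(stratum)) {
    zs <- z[stratum == s]
    ns <- length(zs)                 # clusters in stratum s
    total <- total + ns / (ns - 1) * sum((zs - mean(zs))^2)
  }
  total
}

# Two strata of two clusters each, equal weights:
svy_var_sum(c(1, 2, 3, 4), rep(1, 4), c("a", "a", "b", "b"))
```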
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Survey (NHIS, 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object, created by the svydesign function. A simple example call is:
imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same psuid in different strata are actually different clusters.
In fact there are further levels of clustering and itwould be possible to write
imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study, the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGEP + POLCTC, design = imdsgn)
produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP + pmin(3, POLCTC), design = imdsgn)
        pmin(3, POLCTC)
AGEP         0      1      2       3
   0    473079 411177 655211  276013
   1    162985  72498 365445  863515
   2    144000  84519 126126 1057654
   3     89108  68925 110523 1123405
   4    133902 111098 128061 1069026
   5    137122  31668  19027 1121521
   6    127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size wecould tabulate
svytable(~AGEP + pmin(3, POLCTC) + MSASIZEP,
         design = imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example, the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC >= 3) ~ factor(AGEP) + factor(MSASIZEP), design = imdsgn, family = binomial)

svyglm(I(POLCTC >= 3) ~ factor(AGEP) + MSASIZEP, design = imdsgn, family = binomial)
Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)
Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I⁻¹U_i.

4. Pass the ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G. and Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison, and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers, using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³, and p usually no greater than 10².
2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap, and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x_Sk, from each row of X:

       newX[i,] = X[i,] (I − x_Sk (x_Sk^T x_Sk)⁻¹ x_Sk^T)

   where the bracketed projection term is denoted IPx.
6 The next optimal cluster is found by repeatingthe process using newX
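Step 5 above can be sketched directly in R. The function and toy data below are illustrative (ours, not the article's code); the point is that every row of the swept matrix is orthogonal to x_Sk.

```r
# Illustrative sketch of step 5: "sweeping" the mean gene x_Sk of a
# cluster out of every row of X (not the article's code).
sweep_cluster <- function(X, xSk) {
  xS <- matrix(xSk, ncol = 1)            # column vector, one entry per sample
  p  <- length(xSk)
  # complement of the projection onto x_Sk: I - x (x'x)^{-1} x'
  IPx <- diag(p) - xS %*% solve(crossprod(xS)) %*% t(xS)
  X %*% IPx
}

# After sweeping, every row of newX is orthogonal to xSk:
set.seed(1)
X    <- matrix(rnorm(5 * 4), nrow = 5)   # 5 "genes", 4 samples
xSk  <- rnorm(4)
newX <- sweep_cluster(X, xSk)
max(abs(newX %*% xSk))                   # numerically zero
```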
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_Sk. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
    Speedup = 1 / (r_s + r_p/n)
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
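The law is simple to encode as a function; the 80% parallel fraction in the example below is purely illustrative.

```r
# Amdahl's Law as stated above: speedup from parallelizing a
# fraction rp of the work over n processors (rs is the serial part).
amdahl <- function(rs, rp, n) 1 / (rs + rp / n)

amdahl(rs = 0.2, rp = 0.8, n = 16)   # about 4, versus 16 if fully parallel
```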
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm, without slave-slave communication
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version, this would be done using R's apply function, whereby each row of X has the sample function applied to it.
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample, a task is spawned; for B bootstrap samples, we spawn B tasks.
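The serial version mentioned above can be sketched as follows. Dfunct here is only a placeholder for the article's R² computation (which this excerpt does not show); the within-row permutation with apply and sample is the part being illustrated.

```r
# Illustrative serial sketch of the within-row permutation bootstrap.
# Dfunct is a placeholder for the article's R^2 computation; here it
# returns the variance fraction of the leading principal component.
Dfunct <- function(X) {
  pc <- prcomp(t(X))$sdev^2
  pc[1] / sum(pc)
}

serial.bootstrap <- function(X, B = 10) {
  rsq <- numeric(B)
  for (b in 1:B) {
    Xb <- t(apply(X, 1, sample))   # permute within each row
    rsq[b] <- Dfunct(Xb)
  }
  mean(rsq)                        # the bootstrap mean R^2
}

set.seed(1)
X <- matrix(rnorm(20 * 6), nrow = 20)
serial.bootstrap(X, B = 5)
```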
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
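The even-block scheme can be sketched with a small helper (ours, not part of the RPVM code): the row indices of X are split into approximately equal blocks, one per slave.

```r
# Illustrative helper (not part of the RPVM code): split the row
# indices of X into approximately even blocks, one per slave.
row_blocks <- function(nrows, nslaves) {
  split(seq_len(nrows), cut(seq_len(nrows), nslaves, labels = FALSE))
}

row_blocks(10, 3)   # three blocks covering rows 1..10
```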
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages R*², i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  ## send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  ## get the R squared values back and
  ## take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R2 values. This is controlled by a simple for loop: as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R2 value is calculated and returned to the master. The following code performs these operations of the children tasks:
## receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
## perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
## send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:
parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  ## initialize the send buffer
  .PVM.initsend()
  ## pack the IPx matrix and send it to
  ## all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  ## divide the X matrix into blocks and
  ## send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  ## wait for the blocks to be returned
  ## and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
## wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
## orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
## send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case, the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
[Figure: computation time (seconds) against size of expression matrix (rows), with curves for gene-shaving, bootstrapping and orthogonalization.]
Figure 3: Serial gene-shaving results
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
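These estimates can be checked with a one-line R function implementing Amdahl's law, S(n) = 1/((1 - p) + p/n), for parallel proportion p = 0.78:

```r
## Amdahl's law: maximum speedup when a proportion p of the
## work runs in parallel on n processors.
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(0.78, 2)   # about 1.64
amdahl(0.78, 10)  # about 3.36
```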
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
[Figure: three panels, "Bootstrap Comparisons", "Orthogonalization Comparisons" and "Overall Comparisons", each plotting computation time (seconds) against size of expression matrix (rows) for the serial, 2 node and 16 node runs.]
Figure 4: Comparison of computation times for variable-sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node, other than the node which runs the master process, will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm, which can fully utilise each cluster node, will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help - R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R, and different R-related experiences, require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice - it also accepts regular expressions.
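For example (the exact matches returned depend on the installed packages, so no fixed output is shown):

```r
help("median")           # equivalent to ?median
help.search("quantile")  # fuzzy search over names, titles and keywords
apropos("^mean")         # objects whose names match a regular expression
```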
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names, and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions have already been asked - and answered - several times (focussing on slightly different aspects, hence these are illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews
Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues, such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism - no explanation of how to do this is given - presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences, and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R
Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset, transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.), as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA, and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests, as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
R News ISSN 1609-3631
Vol. 3/1, June 2003
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).
• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).
• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)
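A minimal sketch of the new padding behaviour (the variable names are illustrative):

```r
x <- 1:3
names(x) <- "a"  # value too short: remaining names padded with character NAs
stopifnot(identical(names(x), c("a", NA, NA)))
```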
• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.
• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
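A quick illustration of recursive list indexing (the list here is made up for the example):

```r
x <- list(a = list(b = 1, d = 2), e = 3)
# x[[c(1, 2)]] descends into the first element, then its second element
stopifnot(identical(x[[c(1, 2)]], x[[1]][[2]]))  # both are 2
```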
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.
• The postscript output makes use of relative moves, and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC), and vcov() methods.
• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
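A small sketch of what capture.output() returns when no connection is given:

```r
out <- capture.output(summary(c(1, 2, 3)))
# 'out' is a character vector holding the lines the expression would have printed
stopifnot(is.character(out), length(out) >= 1)
```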
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., works for "dist" objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example(). (Suggested by Andy Liaw.)
• New function file.symlink() to create symbolic file links, where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
• New functions getFromNamespace(), fixInNamespace(), and getS3method() to facilitate developing code in packages with namespaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.
load() now uses gzcon(), so can read compressed saves from suitable connections.
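A round-trip sketch of gzcon(): data written through the wrapped connection is gzip-compressed, and reading it back through another gzcon() decompresses transparently (the temporary file is just for the example):

```r
f <- tempfile()
con <- gzcon(file(f, "wb"))          # writes are compressed
writeBin(as.raw(1:10), con)
close(con)
con <- gzcon(file(f, "rb"))          # reads are decompressed
stopifnot(identical(readBin(con, "raw", n = 10), as.raw(1:10)))
close(con)
```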
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.
To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues(), and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance(), and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)
• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now returns invisibly a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)
• New function mapply(), a multivariate lapply().
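A one-line sketch of mapply(): the function is applied to the first elements of each argument, then the second, and so on:

```r
res <- mapply(function(a, b) a + b, 1:3, 4:6)
stopifnot(res == c(5, 7, 9))
```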
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)
• An attempt to open() an already open connection will be detected and ignored, with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.
• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.
• plot.dendrogram ('mva' package) now draws leaf labels if present, by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekström.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n×m inputs with m >> n.
• princomp() no longer allows the use of more variables than units: use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)
• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy(), and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
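A sketch of r2dtable(): each sampled table honours the requested row and column sums (the marginals below are made up for the example):

```r
set.seed(1)
tabs <- r2dtable(2, r = c(5, 5), c = c(4, 6))  # 2 random 2x2 tables
ok <- sapply(tabs, function(tb)
  all(rowSums(tb) == c(5, 5)) && all(colSums(tb) == c(4, 6)))
stopifnot(length(tabs) == 2, all(ok))
```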
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().
• New generic function reorder() with a dendrogram method, new order.dendrogram(), and heatmap().
• require() has a new argument named character.only, to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
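A sketch of rmultinom(): it returns one column per draw, with the counts in each column summing to the requested size:

```r
set.seed(1)
x <- rmultinom(4, size = 10, prob = c(0.2, 0.3, 0.5))
# 3 categories (rows) x 4 draws (columns); each draw distributes 10 counts
stopifnot(nrow(x) == 3, ncol(x) == 4, all(colSums(x) == 10))
```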
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NA's, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing: see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added "[" method for "terms" objects.
• New argument 'silent' to try().
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
– New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).
– Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib', and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo, and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.
If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine(), and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran(), and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl/) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by interface author.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis, Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour, and Daniel Chessel.
amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.
anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.
climpact The package contains R functions for retrieving data, making climate analysis, and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.
effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.
genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.
gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, and J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.
hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996), and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.
ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.
locfit Local Regression, Likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.
multidim Multidimensional descriptive statistics, factorial methods, and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.
normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.
pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.
polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass, and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini, and Na Li.
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou, and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list, and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object, or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with numbered squares 1–27; see the URL for a printable version.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array, or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22ac
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (13,5)
7. Failed or applied (3,3)
8. International body member is ordered back before baker, but can't be shown (15)
15. I hear you consort with a skeleton, having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot loot souk (8)
20. See 10ac
21. A computer sets my puzzle (6)
23. See 17ac
Recent Events: DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming, and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
• pkgsrc is the location of package sources (tarred and gzipped, from R CMD build) that need to be cross-built.

• WinRlibs is the location to put the resulting Windows binaries of the packages.

These directories can be changed by modifying the Makefile. One may find out what exactly each step does by looking into the Makefile.
Download tools and sources
make down
This command downloads the i386-mingw32 cross-compilers and the R source (starting from R-1.7.0, other sources that used to be necessary for cross-building, pcre and bzip2, are no longer needed). It places all these sources into a separate directory called RCrossBuild/downloads. The wget program is used to get all needed sources; other downloading tools such as curl or links/elinks can be used as well. The URLs of these sources are available in the file src/gnuwin32/INSTALL, which is located within the R source tree (or see the Makefile associated with this document).
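For readers who prefer to stay inside R, the same kind of fetch can be sketched with R's own download.file function. The URL below is a made-up placeholder, not one of the real source URLs (those are listed in src/gnuwin32/INSTALL):

```r
# Hedged sketch only: fetch one source archive into the download area.
# The URL is a placeholder assumption; substitute a real one from
# src/gnuwin32/INSTALL.
url  <- "http://example.org/pub/R-1.7.0.tgz"
dest <- file.path("RCrossBuild", "downloads", basename(url))
dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
## download.file(url, destfile = dest, mode = "wb")  # uncomment to fetch
basename(url)  # "R-1.7.0.tgz"
```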
Cross-tools setup
make xtools
This rule creates a separate subdirectory cross-tools in the build area for the cross-tools. It unpacks the cross-compiler tools into the cross-tools subdirectory.
Prepare source code
make prepsrc
This rule creates a separate subdirectory WinR to carry out the cross-building process of R, and unpacks the R sources into it. As of 1.7.0, the sources for pcre and bzip2 are included in the R source tar archive; before that, we needed to worry about them.
Build Linux R if needed
make LinuxR
This step is required as of R-1.7.0 if you don't have a current version of Linux R on your system and you don't have the permission to install one for system-wide use. This rule will build and install a Linux R in $RCB/LinuxR/R.
Building R
Configuring
If a current Linux R is available from the system and on your search path, then run the following command:
make mkrules
This rule modifies the file src/gnuwin32/Mkrules from the R source tree to set BUILD=CROSS and HEADER=$RCB/cross-tools/mingw32/include.
If a current Linux R is not available from the system, and a Linux R has just been built by make LinuxR from the end of the last section:
make LinuxFresh=YES mkrules
This rule will set R_EXE=$RCB/LinuxR/R/bin/R in addition to the variables above.
Compiling
Now we can cross-compile R
make R
The environment variable $PATH is modified in this make target to ensure that the cross-tools are at the beginning of the search path. This step, as well as initiation of the compile process, is accomplished by the R makefile target. This may take a while to finish. If everything goes smoothly, a compressed file Win-R-1.7.0.tgz will be created in the RCrossBuild/WinR directory.
This gzip'd tar-file contains the executables and supporting files for R which will run on a Microsoft Windows machine. To install, transfer and unpack it on the desired machine. Since there is no InstallShield-style installer executable, one will have to manually create any desired desktop shortcuts to the executables in the bin directory, such as Rgui.exe. Remember though, this isn't necessarily the same as the R for Windows version on CRAN.
Building R packages and bundles
Now we have reached the point of interest. As one might recall, the primary goal of this document is to be able to build binary packages for R libraries, which can be simply installed for R running on Microsoft Windows. All the work up to this point simply obtains the required working build of R for Microsoft Windows.
Now create the pkgsrc subdirectory to put the package sources into, and the WinRlibs subdirectory to put the Windows binaries into.
We will use the geepack package as an example for cross-building. First put the source geepack_0.2-4.tar.gz into the subdirectory pkgsrc, and then do
make pkg-geepack_0.2-4
If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.
We can easily build bundled packages as well. For example, to build the packages in the bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do
make bundle-VR_7.0-11
The Windows binaries of packages MASS, class, nnet, and spatial in the VR bundle will appear in the WinRlibs subdirectory.
This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
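As a small illustration of that naming convention, a hypothetical helper (not part of the Makefile) could split such a filename into its name and version parts:

```r
# Hypothetical helper: split "<name>_<version>.tar.gz" into its parts.
split.pkg <- function(file) {
  stem <- sub("\\.tar\\.gz$", "", file)   # drop the extension
  c(name    = sub("_.*$", "", stem),      # text before the underscore
    version = sub("^[^_]*_", "", stem))   # text after the underscore
}
split.pkg("geepack_0.2-4.tar.gz")  # name "geepack", version "0.2-4"
```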
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.
The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen, and Brian Ripley for helpful discussions.
Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics, the population data are fixed and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is:

   If we did exactly this analysis on the whole population, what result would we get?
If individuals were sampled independently, with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

   Σ_i (1/π_i) Y_i
regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
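As a toy numeric illustration (made-up data, not from any survey), the Horvitz-Thompson estimate is simply the inverse-probability weighted sum:

```r
# Toy Horvitz-Thompson estimate of a population total.
y  <- c(10, 12, 9, 15, 11)         # observed values Y_i
p  <- c(0.1, 0.1, 0.2, 0.5, 0.5)   # inclusion probabilities pi_i
ht <- sum(y / p)                   # weighted sum, whatever the design
ht  # 317
```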
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance: the former decreasing it and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

   vâr[S] = Σ_s [n_s/(n_s − 1)] Σ_{c,i} ( (1/π_sci) (Y_sci − Ȳ_s) )²

where S is the weighted sum, Ȳ_s are weighted stratum means, and n_s is the number of clusters in stratum s.
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
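A toy sketch of the variance formula, following it literally for a single stratum with three clusters of two observations each, all sampled with the same probability (made-up numbers, not survey package output):

```r
# Variance of a weighted sum: one stratum, equal inclusion probability.
y    <- c(10, 12, 9, 15, 11, 8)  # Y_sci, clusters laid out consecutively
p    <- 0.2                      # pi_sci, equal for everyone here
ns   <- 3                        # number of clusters in the stratum
ybar <- mean(y)                  # weighted stratum mean (weights equal)
vhat <- ns / (ns - 1) * sum(((y - ybar) / p)^2)
vhat  # 1156.25
```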
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster, and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used the NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster, and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is
imdsgn <- svydesign(id = ~PSU, strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same psuid in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id = ~PSU + HHX + FMX + PX,
                    strata = ~STRATUM,
                    weights = ~WTFA.IM,
                    data = immunize,
                    nest = TRUE)
to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGEP+POLCTC, design=imdsgn)
produces a table of the number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
        pmin(3, POLCTC)
AGEP          0        1        2        3
   0     473079   411177   655211   276013
   1     162985    72498   365445   863515
   2     144000    84519   126126  1057654
   3      89108    68925   110523  1123405
   4     133902   111098   128061  1069026
   5     137122    31668    19027  1121521
   6     127487    38406    51318   838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate
svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
design=imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)
Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log) {
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike = dnorm, gradient = gr, design = dxi,
             formulas = list(mean = y ~ x + z, sd = ~1),
             start = list(c(2, 5, 0), c(4)),
             log = TRUE)
Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignores the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆ = I⁻¹U_i.

4. Pass ∆_i and the cluster, strata, and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G. and Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison, and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994), and R's PVM wrapper RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10³, and p usually no greater than 10².

2. The leading principal component of X is calculated, and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap, and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_Sk, from each row of X:

      newX[i,] = X[i,] (I − x̄_Sk (x̄ᵀ_Sk x̄_Sk)⁻¹ x̄ᵀ_Sk)

   where the bracketed projection matrix is denoted IPx.

6. The next optimal cluster is found by repeating the process using newX.
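The sweeping in step 5 can be sketched directly in R. The projection below uses a small random matrix and a made-up cluster consisting of the first two rows (an illustration only, not the full algorithm):

```r
# Sweep the mean gene x of a putative cluster from every row of X:
# afterwards each row of newX is orthogonal to x.
set.seed(1)
X <- matrix(rnorm(20), nrow = 4)           # 4 genes, 5 samples (toy data)
x <- matrix(colMeans(X[1:2, ]), ncol = 1)  # mean gene of rows 1-2
IPx  <- diag(nrow(x)) - x %*% solve(t(x) %*% x) %*% t(x)
newX <- X %*% IPx                          # sweep all rows at once
max(abs(newX %*% x))                       # effectively zero
```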
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel. Firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_Sk. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
   Speedup = 1 / (r_s + r_p/n)
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
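Amdahl's Law is easy to express as a small R function (a sketch; r_s and n are the quantities defined above):

```r
# Estimated speedup from Amdahl's Law: rs = serial fraction,
# 1 - rs = parallel fraction, n = number of processors.
speedup <- function(rs, n) 1 / (rs + (1 - rs) / n)
speedup(0, 16)    # fully parallel: 16
speedup(0.5, 16)  # half the work serial: about 1.88
```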
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children: each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm, without slave-slave communication
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
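That serial version can be sketched in a few lines (toy matrix; sample permutes each row independently):

```r
# Serial within-row permutation of X, as described above.
set.seed(42)
X  <- matrix(1:12, nrow = 3)        # small toy expression matrix
Xb <- t(apply(X, 1, sample))        # shuffle each row; t() restores shape
dim(Xb)                             # still 3 x 4
all(sort(Xb[1, ]) == sort(X[1,ued]))
```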
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving, and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
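Splitting into even blocks can be sketched as follows (toy matrix; assumes, as in the text, that the number of rows divides evenly by the number of slaves):

```r
# Partition the rows of X into one block per slave.
X       <- matrix(1:24, nrow = 6)                 # 6 rows, 4 columns
nslaves <- 3
grp     <- rep(1:nslaves, each = nrow(X) / nslaves)
blocks  <- lapply(split(1:nrow(X), grp),
                  function(r) X[r, , drop = FALSE])
sapply(blocks, nrow)   # 2 2 2: each slave gets two rows
```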
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program; these steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages R*², i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
  par.bootstrap <- function(X) {
    brsq <- 0
    tasks <- length(boot.children)
    .PVM.initsend()
    # send a copy of X to everyone
    .PVM.pkdblmat(X)
    .PVM.mcast(boot.children, 0)
    # get the R squared values back and
    # take the mean
    for (i in 1:tasks) {
      .PVM.recv(boot.children[i], 0)
      rsq <- .PVM.upkdouble(1, 1)
      brsq <- brsq + (rsq / tasks)
    }
    return(brsq)
  }
After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
  rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations of the children tasks:
  # receive the matrix
  .PVM.recv(.PVM.parent(), 0)
  X <- .PVM.upkdblmat()
  Xb <- X
  dd <- dim(X)
  # perform the permutation
  for (i in 1:dd[1]) {
    permute <- sample(1:dd[2], replace = FALSE)
    Xb[i, ] <- X[i, permute]
  }
  # send it back to the master
  .PVM.initsend()
  .PVM.pkdouble(Dfunct(Xb))
  .PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth(). This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth() function used by the master process:
  par.orth <- function(IPx, X) {
    rows <- dim(X)[1]
    tasks <- length(orth.children)
    # initialize the send buffer
    .PVM.initsend()
    # pack the IPx matrix and send it to
    # all children
    .PVM.pkdblmat(IPx)
    .PVM.mcast(orth.children, 0)
    # divide the X matrix into blocks and
    # send a block to each child
    nrows <- as.integer(rows / tasks)
    for (i in 1:tasks) {
      start <- (nrows * (i - 1)) + 1
      end <- nrows * i
      if (end > rows)
        end <- rows
      Xpart <- X[start:end, ]
      .PVM.pkdblmat(Xpart)
      .PVM.send(orth.children[i], start)
    }
    # wait for the blocks to be returned
    # and re-construct the matrix
    for (i in 1:tasks) {
      .PVM.recv(orth.children[i], 0)
      rZ <- .PVM.upkdblmat()
      if (i == 1)
        Z <- rZ
      else
        Z <- rbind(Z, rZ)
    }
    dimnames(Z) <- dimnames(X)
    Z
  }
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
  # wait for the IPx and X matrices to arrive
  .PVM.recv(.PVM.parent(), 0)
  IPx <- .PVM.upkdblmat()
  .PVM.recv(.PVM.parent(), -1)
  Xpart <- .PVM.upkdblmat()
  Z <- Xpart
  # orthogonalize X
  for (i in 1:dim(Xpart)[1]) {
    Z[i, ] <- Xpart[i, ] %*% IPx
  }
  # send back the new matrix Z
  .PVM.initsend()
  .PVM.pkdblmat(Z)
  .PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
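A simulated data set of the stated dimensions can be generated along these lines (a sketch; the article does not give its exact generator, so the use of rnorm() here is an assumption):

```r
# One simulated 'expression matrix': N genes (rows) by p arrays (columns)
N <- 1600
p <- 80
X <- matrix(rnorm(N * p), nrow = N, ncol = p)
dim(X)  # N rows, p columns
```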
[Plot: computation time (seconds) against size of expression matrix (rows), with curves for gene-shaving, bootstrapping and orthogonalization.]
Figure 3: Serial gene-shaving results.
Clearly the bootstrapping (blue) step is quite time-consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
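Amdahl's law bounds the speedup on n processors by S(n) = 1 / ((1 − P) + P/n), where P is the parallel fraction; the figures above follow directly (a minimal sketch):

```r
# Amdahl's law: maximum speedup for parallel fraction P on n processors
amdahl <- function(P, n) 1 / ((1 - P) + P / n)

amdahl(0.78, 2)   # approximately 1.6
amdahl(0.78, 10)  # approximately 3.4
```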
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
[Three plots of computation time (seconds) against size of expression matrix (rows): bootstrap comparisons, orthogonalization comparisons and overall comparisons, each with curves for the serial, 2 node and 16 node runs.]
Figure 4: Comparison of computation times for variable-sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments.
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
  S = T_serial / T_parallel

  S_2nodes = 13011.8 / 8253.4 = 1.6

  S_16nodes = 13011.8 / 5495.9 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the one which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core
Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Besides the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
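For example (the htmlhelp and chmhelp options are those of R 1.7.0 as described above; later versions of R changed this mechanism):

```r
# display the help page for mean() as formatted text
help("mean")
# '?' is shorthand for the same call
?mean
# prefer HTML help (with hyperlinks) for the rest of the session
options(htmlhelp = TRUE)
```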
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
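For instance:

```r
# fuzzy search over the help pages of installed packages
res <- help.search("regression")
# objects whose names partially match 'mean'
apropos("mean")
# apropos() also accepts regular expressions
apropos("^wilcox")
```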
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of
keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews
Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism, rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed-effects models and spatial statistics are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2, base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R". Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R
Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers
at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.), as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both the language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory
data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, with methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal andhands-on software usage is demonstrated using ex-amples without lengthy discussions of theory Overwide parts the text can be directly used in class todemonstrate regression modelling with S to students(after more theoretical lectures on the topic) How-ever the book contains enough ldquoremindersrdquo aboutthe theory such that it is not necessary to switch to amore theoretical text while reading it
The book is self-contained with respect to S andshould easily get new users started The placementof R before S-Plus in the title is not only lexicographicthe author uses R as the primary S engine Dif-ferences of S-Plus (as compared with R) are clearlymarked in framed boxes the graphical user interfaceof S-Plus is not covered at all
R News ISSN 1609-3631
Vol. 3/1, June 2003
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models, the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
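As a concrete illustration of the RNG change above, the new RNGversion() function can roll the generators back to those of an earlier release; a minimal base-R sketch (the version strings are only examples):

```r
# R 1.7.0 defaults are "Mersenne-Twister" (uniform) and "Inversion" (normal).
# RNGversion() restores the generator defaults of an earlier R release,
# e.g. to reproduce simulation results produced before the change.
RNGversion("1.6.2")        # pre-1.7.0 generator defaults
old.draws <- runif(3)      # draws from the old uniform generator

RNGversion("1.7.0")        # back to the new defaults
stopifnot(RNGkind()[1] == "Mersenne-Twister",
          length(old.draws) == 3)
```

Note that restoring an old generator only affects future draws; it does not re-seed, so set.seed() is still needed to reproduce a particular stream.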
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() in package mva allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).
• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() in package eda now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
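A few of the new functions listed above can be exercised directly at the prompt; a minimal base-R sketch (the data values are chosen purely for illustration):

```r
# Recursive list indexing: x[[c(4, 2)]] is shorthand for x[[4]][[2]].
x <- list("a", "b", "c", list(10, 20))
stopifnot(identical(x[[c(4, 2)]], x[[4]][[2]]))

# mapply(): a multivariate lapply()/sapply(), applying a function
# elementwise over several argument vectors in parallel.
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
stopifnot(identical(sums, c(5L, 7L, 9L)))

# capture.output(): collect printed output as a character vector
# instead of sending it to the console.
txt <- capture.output(print(summary(1:10)))
stopifnot(is.character(txt), length(txt) >= 1)
```

Each of these works in base R without any additional packages.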
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway
genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium, ... By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others
hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer; see http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[The crossword grid is not reproduced in this text version; only the clue numbers 1–27 survived extraction. See the URL above for the grid.]
Across
9. Possessing a file or boat we hear (9)
10, 20 down. Developer code is binary peril (5,6)
11. Maturer code has no need for this (7)
12. And a volcano implodes moderately slowly (7)
13. Partly not an hyperbolic function (4)
14. Give 500 euro to R Foundation then stir your café Breton (10)
16. Catches crooked parents (7)
17, 23 down. Developer code bloated Gauss (7,5)
19. Fast header code is re-entrant (6-4)
22, 4 down. Developer makes toilet compartments (4,8)
24. Former docks ship files (7)
25. Analysis of a skinned banana and eggs (2,5)
26. Rearrange array or a hairstyle (5)
27. Atomic number 13 is a component (9)
Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22 across
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (135)
7. Failed or applied (3,3)
8. International body member is ordered back before baker but can't be shown (15)
15. I hear you consort with a skeleton having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot loot souk (8)
20. See 10 across
21. A computer sets my puzzle (6)
23. See 17 across
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
make pkg-geepack_0.2-4

If there is no error, the Windows binary geepack.zip will be created in the WinRlibs subdirectory, which is then ready to be shipped to a Windows machine for installation.

We can easily build bundled packages as well. For example, to build the packages in bundle VR, we place the source VR_7.0-11.tar.gz into the pkgsrc subdirectory and then do

make bundle-VR_7.0-11

The Windows binaries of packages MASS, class, nnet and spatial in the VR bundle will appear in the WinRlibs subdirectory.

This Makefile assumes a tarred and gzipped source for an R package, which ends with ".tar.gz". This is usually created through the R CMD build command. It takes the version number together with the package name.
The Makefile
The associated Makefile is used to automate many of the steps. Since many Linux distributions come with the make utility as part of their installation, it hopefully will help rather than confuse people cross-compiling R. The Makefile is written in a format similar to shell commands, in order to show what exactly each step does.

The commands can also be cut-and-pasted out of the Makefile with minor modification (such as changing $$ to $ for environment variable names) and run manually.
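As a purely illustrative example of this modification (the variable name below is invented, not taken from the actual Makefile):

```shell
# Inside a Makefile recipe, a shell variable is escaped with $$, e.g.
#     @echo "building in $$BUILDDIR"
# Cut-and-pasted into an interactive shell, the same command is
# written with a single $:
BUILDDIR=/tmp/cross
echo "building in $BUILDDIR"
```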
Possible pitfalls
We have very little experience with cross-building packages (for instance, Matrix) that depend on external libraries such as atlas, blas, lapack, or Java libraries. Native Windows building, or at least a substantial amount of testing, may be required in these cases. It is worth experimenting, though.
Acknowledgments
We are grateful to Ben Bolker, Stephen Eglen and Brian Ripley for helpful discussions.

Jun Yan
University of Wisconsin-Madison, USA
jyan@stat.wisc.edu

A.J. Rossini
University of Washington, USA
rossini@u.washington.edu
Analysing Survey Data in R
by Thomas Lumley
Introduction
Survey statistics has always been a somewhat specialised area, due in part to its view of the world. In the rest of the statistical world, data are random and we model their distributions. In traditional survey statistics the population data are fixed, and only the sampling is random. The advantage of this 'design-based' approach is that the sampling, unlike the population, is under the researcher's control.
The basic question of survey statistics is
If we did exactly this analysis on the whole population, what result would we get?

If individuals were sampled independently with equal probability, the answer would be very straightforward. They aren't, and it isn't.
To simplify the operations of a large survey, it is routine to sample in clusters. For example, a random sample of towns or counties can be chosen, and then individuals or families sampled from within those areas. There can be multiple levels of cluster sampling, e.g., individuals within families within towns. Cluster sampling often results in people having an unequal chance of being included; for example, the same number of individuals might be surveyed regardless of the size of the town.
For statistical efficiency, and to make the results more convincing, it is also usual to stratify the population and sample a prespecified number of individuals or clusters from each stratum, ensuring that all strata are fairly represented. Strata for a national survey might divide the country into geographical regions, and then subdivide into rural and urban. Ensuring that smaller strata are well represented may also involve sampling with unequal probabilities.
Finally, unequal probabilities might be deliberately used to increase the representation of groups that are small or particularly important. In US health surveys there is often interest in the health of the poor and of racial and ethnic minority groups, who might thus be oversampled.
The resulting sample may be very different in
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.
Weighted estimation
The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i, and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

\[ \sum_i \frac{1}{\pi_i} Y_i \]

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of loglikelihoods and estimating functions, thus allowing almost any standard model to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
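As a toy illustration (the numbers are invented, not from the article), the Horvitz-Thompson total can be computed directly in R:

```r
## Toy sketch of the Horvitz-Thompson estimator (invented data)
y  <- c(10, 12, 9, 30)        # observed values for sampled individuals
pi <- c(0.5, 0.5, 0.1, 0.1)   # inclusion probabilities
sum(y / pi)                   # estimated population total: 434
```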
It is important to note that estimates obtained by maximising an inverse-probability weighted loglikelihood are not in general maximum likelihood estimates under the model whose loglikelihood is being used. In some cases, semiparametric maximum likelihood estimators are known that are substantially more efficient than the survey-weighted estimators, if the model is correctly specified. An important example is a two-phase study, where a large sample is taken at the first stage and then further variables are measured on a subsample.
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance: the former decreasing it and the latter (typically) increasing it, relative to independent sampling.

The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

\[ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 \]

where S is the weighted sum, the \bar{Y}_s are weighted stratum means, and n_s is the number of clusters in stratum s.
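A toy sketch of one stratum's contribution to this estimate (invented numbers; in practice the computation is done by svyCprod):

```r
## One stratum with four sampled observations (invented data)
Y    <- c(3, 5, 4, 6)                       # observed values
pi   <- rep(0.25, 4)                        # sampling probabilities
Ybar <- weighted.mean(Y, 1/pi)              # weighted stratum mean
ns   <- length(Y)                           # clusters in the stratum
vhat <- ns/(ns - 1) * sum((1/pi * (Y - Ybar))^2)
```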
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.

The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is
R News ISSN 1609-3631
Vol 31 June 2003 19
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFAIM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFAIM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of number of polio immunisations by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
    pmin(3, POLCTC)
AGEP      0      1      2       3
   0 473079 411177 655211  276013
   1 162985  72498 365445  863515
   2 144000  84519 126126 1057654
   3  89108  68925 110523 1123405
   4 133902 111098 128061 1069026
   5 137122  31668  19027 1121521
   6 127487  38406  51318  838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ down to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse; a regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation, they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities (2.5 million to 5 million people) do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
R News ISSN 1609-3631
Vol 31 June 2003 20
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2 * (x - mean) / (2 * sd^2)
  ds <- (x - mean)^2 * (2 * (2 * sd)) / (2 * sd^2)^2 - sqrt(2 * pi) / (sd * sqrt(2 * pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done:

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/π_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆_i, or compute them from the information matrix I and score contributions U_i as ∆_i = I^{-1} U_i.

4. Pass the ∆_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
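Step 3 can be sketched in a few lines (the object names U and I.mat below are hypothetical stand-ins for the score contributions and the information matrix, not part of the package):

```r
## Sketch of computing one-step delta-betas from scores and information
U     <- matrix(rnorm(20), nrow = 10)  # invented 10 x 2 score contributions
I.mat <- crossprod(U)                  # stand-in for the information matrix
Delta <- U %*% solve(I.mat)            # one row of delta-betas per observation
```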
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.

A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression models currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G. and Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000a,b), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf cluster to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000a,b) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3, and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R^2 by bootstrap, and calculate G_k = R^2_k - R̄*^2_k, where R̄*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:

\[ \mathrm{newX}[i,] = X[i,] \underbrace{\left( I - \bar{x}_{S_k} \left( \bar{x}_{S_k}^T \bar{x}_{S_k} \right)^{-1} \bar{x}_{S_k}^T \right)}_{IP_x} \]
6. The next optimal cluster is found by repeating the process using newX.
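The sweep in step 5 can be sketched in a few lines of R (toy sizes and invented data; xSk here stands for the mean gene of the optimal cluster):

```r
## Toy sketch of the sweeping step
N <- 6; p <- 4
X    <- matrix(rnorm(N * p), N, p)
xSk  <- matrix(colMeans(X[1:3, ]), ncol = 1)  # mean gene as a p x 1 vector
IPx  <- diag(p) - xSk %*% solve(t(xSk) %*% xSk) %*% t(xSk)
newX <- X %*% IPx                             # rows now orthogonal to xSk
```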
Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by

\[ \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}} \]

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
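Amdahl's Law is easily written as an R function; the values below use the parallel proportion of 0.78 reported later in this article:

```r
## Amdahl's Law: rp = parallel fraction, n = number of processors
amdahl <- function(rp, n) 1 / ((1 - rp) + rp / n)
amdahl(0.78, 2)   # two-node speedup (roughly 1.6)
amdahl(0.78, 10)  # ten-processor speedup (roughly 3.4)
```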
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
[Diagram: a master process connected to slaves 1, 2, 3, ..., n.]

Figure 2: The master-slave paradigm without slave-slave communication
Bootstrapping
Bootstrapping is used to find R̄*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
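In serial form this row-wise permutation might look like the following (a sketch with invented data, not the authors' exact code):

```r
## Serial sketch: permute elements within each row of X
X  <- matrix(rnorm(12), nrow = 3)
Xb <- t(apply(X, 1, sample))  # apply() returns columns, so transpose
```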
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
R News ISSN 1609-3631
Vol 31 June 2003 23
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2, and returns this to the master process. The master averages the R*^2, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(bootchildren, 0) to send the data to all children.
parbootstrap <- function(X){
  brsq <- 0
  tasks <- length(bootchildren)
  .PVM.initsend()
  ## send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(bootchildren, 0)
  ## get the R squared values back and
  ## take the mean
  for(i in 1:tasks){
    .PVM.recv(bootchildren[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives each answer via .PVM.recv(bootchildren[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations in the children tasks:
## receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
## perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
## send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orthchildren, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
R News ISSN 1609-3631
Vol 31 June 2003 24
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows, and receiving them back in the order the children sit in the array of task ids (orthchildren). The following code is the parorth function used by the master process:
parorth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orthchildren)
  ## initialize the send buffer
  .PVM.initsend()
  ## pack the IPx matrix and send it to
  ## all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orthchildren, 0)
  ## divide the X matrix into blocks and
  ## send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orthchildren[i], start)
  }
  ## wait for the blocks to be returned
  ## and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orthchildren[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
## wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
## orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
## send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that canbe achieved from parallelizing The proportion ofthe total computation time the bootstrapping andorthogonalization use is of interest If these are ofsignificant proportion they are worth performing inparallel
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
[Figure 3 plots computation time in seconds against the size of the expression matrix (rows, 2000 to 12800) for three curves: gene-shaving overall, bootstrapping, and orthogonalization.]

Figure 3: Serial gene-shaving results
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization of it is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
R News ISSN 1609-3631
Vol 31 June 2003 25
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
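Amdahl's law, described earlier in the article, bounds the speedup obtainable when only a proportion P of the work is parallelizable; in its standard form S(n) = 1/((1 - P) + P/n). The quoted estimates can be reproduced directly (a sketch; the function name is ours, not from the article's code):

```r
## Maximal speedup from Amdahl's law: parallel proportion P on n processors
amdahl <- function(P, n) 1 / ((1 - P) + P / n)

amdahl(0.78, 2)    # approximately 1.64 (quoted as 1.63 in the text)
amdahl(0.78, 10)   # approximately 3.36 (quoted as 3.35)
```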
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
[Figure 4 shows three panels ("Bootstrap Comparisons", "Orthogonalization Comparisons" and "Overall Comparisons"), each plotting computation time in seconds against the size of the expression matrix (rows) for the serial, 2-node and 16-node environments.]

Figure 4: Comparison of computation times for variable-sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments
The observed speedup for the two runs of parallel gene-shaving, using the N = 12800 expression matrix, is as follows:
S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.
Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.
Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.
Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.
Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.
Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html
Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193
Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html
Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html
Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN1. Let me cite the "R Installation and Administration" manual by the R Development Core
Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well2.
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
1 http://CRAN.R-project.org
2 http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file '.Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
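For example, a short search session combining these functions might look as follows (the search strings are arbitrary; the hits returned depend on the installed packages):

```r
h <- help("mean")                         # documentation for a known topic
hs <- help.search("confidence interval")  # fuzzy search over names, titles, keywords
apropos("^mea")                           # objects whose names match a regular expression
```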
Documentation and help in HTML format can also be accessed by help.start(). This function starts a web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of
keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search3 provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists4: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list5. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
3 http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
4 accessible via http://www.R-project.org/mail.html
5 archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.
R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.
R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.
R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.
R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.
Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.
Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R
Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers
at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:
1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1-p)]). On the whole, however, this is a very well written book, with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory
data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis: only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
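As a sketch of what this looks like for a package author: a minimal NAMESPACE file contains only export and import directives (the object and package names below are purely illustrative, not from any real package):

```
export(shaveGenes, plotShave)   # only these objects are visible to package users
importFrom(mva, prcomp)         # import prcomp() from the mva package
```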
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).
• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).
bull The internal random number generators willnow never return values of 0 or 1 for runif()This might affect simulation output in ex-tremely rare cases Note that this is not guar-anteed for user-supplied random-number gen-erators nor when the standalone Rmath libraryis used
bull When assigning names to a vector a value thatis too short is padded by character NAs (Wish-list part of PR2358)
bull It is now recommended to use the rsquoSystemRe-quirementsrsquo field in the DESCRIPTION file forspecifying dependencies external to the R sys-tem
bull Output text connections no longer have a line-length limit
R News ISSN 1609-3631
Vol. 3/1, June 2003
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch; it used to if the default method had one argument but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs were a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588.)
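A quick sketch of the recursive-indexing shorthand (the list here is hypothetical, just to show the equivalence):

```r
# x[[c(4, 2)]] descends recursively: first the 4th element, then its 2nd.
x <- list("a", "b", "c", list("d", "e"))

x[[c(4, 2)]]                               # same as x[[4]][[2]], i.e. "e"
identical(x[[c(4, 2)]], x[[4]][[2]])       # TRUE
```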
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.
• The postscript output makes use of relative moves and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
• biplot.default() in package mva allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
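A small sketch of capture.output() collecting printed output as text (the data are made up):

```r
# capture.output() returns the printed representation as a character
# vector, one element per output line, instead of writing to the console.
txt <- capture.output(print(summary(c(1, 2, 3, 4))))

is.character(txt)   # TRUE: the printout is now ordinary text
length(txt)         # number of printed lines captured
```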
• ccf() (package ts) now coerces its x and y arguments to class ts.
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., it works for dist objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example(). (Suggested by Andy Liaw.)
• New function file.symlink() to create symbolic file links, where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.
load() now uses gzcon(), so it can read compressed saves from suitable connections.
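One way to round-trip a compressed save through connections, as a sketch (the temporary file and the object are hypothetical; saving to a plain filename with compression is the more common route):

```r
# Write a gzip-compressed save through a gzcon()-wrapped connection...
tf  <- tempfile()
con <- gzcon(file(tf, "wb"))
x <- 1:10
save(x, file = con)
close(con)

# ...and read it back through a gzcon()-wrapped read connection.
rm(x)
con <- gzcon(file(tf, "rb"))
load(con)        # restores 'x'
close(con)

x                # 1 2 3 ... 10 again
```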
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.
To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)
• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument that allows not computing the casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now returns, invisibly, a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)
• New function mapply(), a multivariate lapply().
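A minimal sketch of mapply() iterating over several argument vectors in parallel (the inputs are made up):

```r
# mapply() applies a function to the first elements of each argument,
# then the second elements, and so on: here rep(1, 3), rep(2, 2), rep(3, 1).
r <- mapply(rep, 1:3, 3:1)
r   # a list, since the results have unequal lengths
```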
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() in package eda now has an 'na.rm' argument. (PR#2298.)
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main' and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration and is available as the 'pdfviewer' option.
• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n x m inputs with m >> n.
• princomp() no longer allows the use of more variables than units: use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)
• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the generated documentation shell is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for that class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
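A small sketch of r2dtable() in use (the margins below are arbitrary illustrative values):

```r
set.seed(42)
# Two random 3x2 tables with row sums (5, 5, 10) and column sums (10, 10);
# every generated table honours both margins exactly.
tabs <- r2dtable(2, r = c(5, 5, 10), c = c(10, 10))

rowSums(tabs[[1]])   # 5 5 10
colSums(tabs[[1]])   # 10 10
```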
• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().
• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().
• require() has a new argument named 'character.only' to align it with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
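A brief sketch of the new multinomial functions (sizes and probabilities are made-up examples):

```r
set.seed(1)
# Three multinomial draws of 100 items into three cells; each draw is one
# column of the result, so every column sums to the total size.
x <- rmultinom(3, size = 100, prob = c(0.2, 0.3, 0.5))
colSums(x)   # 100 100 100

# dmultinom() gives the probability of one particular outcome:
dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))   # 0.08505
```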
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NAs, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added a "[" method for terms objects.
• New argument 'silent' to try().
• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1 but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).
  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed' by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility of ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.
If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, which is included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and the formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.
amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.
anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.
climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.
effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.
genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo-/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.
gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.
hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.
ismev Functions to support the computations carried out in "An Introduction to Statistical Modeling of Extreme Values" by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.
locfit Local regression, likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.
multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.
normalp A collection of utilities related to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.
pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.
polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage This package provides functions for image processing, including a Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R. By Luke Tierney, A.J. Rossini and Na Li.
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid not reproduced in this text version.]
Across
9. Possessing a file, or boat we hear (9)
10, 20 down. Developer code is binary peril (5,6)
11. Maturer code has no need for this (7)
12. And a volcano implodes moderately slowly (7)
13. Partly not an hyperbolic function (4)
14. Give 500 euro to R Foundation, then stir your café Breton (10)
16. Catches crooked parents (7)
17, 23 down. Developer code bloated Gauss (7,5)
19. Fast header code is re-entrant (6-4)
22, 4 down. Developer makes toilet compartments (4,8)
24. Former docks ship files (7)
25. Analysis of a skinned banana and eggs (2,5)
26. Rearrange array or a hairstyle (5)
27. Atomic number 13 is a component (9)
Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22 across
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (13,5)
7. Failed or applied (3,3)
8. International body member is ordered back before baker, but can't be shown (15)
15. I hear you consort with a skeleton, having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot, loot souk (8)
20. See 10 across
21. A computer sets my puzzle (6)
23. See 17 across
Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
structure from the population, but it is different in ways that are precisely known and can be accounted for in the analysis.

Weighted estimation

The major technique is inverse-probability weighting to estimate population totals. If the probability of sampling individual i is π_i and the value of some variable or statistic for that person is Y_i, then an unbiased estimate of the population total is

\[ \sum_i \frac{1}{\pi_i} Y_i \]

regardless of clustering or stratification. This is called the Horvitz-Thompson estimator (Horvitz & Thompson, 1951). Inverse-probability weighting can be used in an obvious way to compute population means and higher moments. Less obviously, it can be used to estimate the population version of log-likelihoods and estimating functions, thus allowing almost any standard models to be fitted. The resulting estimates answer the basic survey question: they are consistent estimators of the result that would be obtained if the same analysis was done to the whole population.
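As a toy illustration of the Horvitz-Thompson estimator, with hypothetical observed values and inclusion probabilities (not from any real survey):

```r
# Horvitz-Thompson estimate of a population total: weight each sampled
# value by the inverse of its sampling probability.
y <- c(4, 7, 2, 9)            # observed values (hypothetical)
p <- c(0.1, 0.2, 0.1, 0.4)    # inclusion probabilities (hypothetical)

ht_total <- sum(y / p)
ht_total   # 4/0.1 + 7/0.2 + 2/0.1 + 9/0.4 = 117.5
```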
It is important to note that estimates obtained bymaximising an inverse-probability weighted loglike-lihood are not in general maximum likelihood esti-mates under the model whose loglikelihood is beingused In some cases semiparametric maximum like-lihood estimators are known that are substantiallymore efficient than the survey-weighted estimatorsif the model is correctly specified An important ex-ample is a two-phase study where a large sample istaken at the first stage and then further variables aremeasured on a subsample
Standard errors
The standard errors that result from the more common variance-weighted estimation are incorrect for probability weighting, and in any case are model-based and so do not answer the basic question. To make matters worse, stratified and clustered sampling also affect the variance, the former decreasing it and the latter (typically) increasing it, relative to independent sampling.
The formulas for standard error estimation are developed by first considering the variance of a sum or mean under independent unequal-probability sampling. This can then be extended to cluster sampling, where the clusters are independent, and to stratified sampling, by adding variance estimates from each stratum. The formulas are similar to the 'sandwich variances' used in longitudinal and clustered data (but not quite the same).
If we index strata by s, clusters by c, and observations by i, then

\[ \widehat{\mathrm{var}}[S] = \sum_s \frac{n_s}{n_s - 1} \sum_{c,i} \left( \frac{1}{\pi_{sci}} \left( Y_{sci} - \bar{Y}_s \right) \right)^2 \]

where S is the weighted sum, \bar{Y}_s are weighted stratum means, and n_s is the number of clusters in stratum s.
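As a hedged illustration of the stratum terms in this formula (toy numbers, not the actual internals of the package), one stratum's contribution can be computed from the weighted deviations totalled within each of its clusters:

```r
# z: weighted deviations (1/pi)(Y - Ybar) totalled within each cluster
# of one stratum; the values are invented for illustration
z  <- c(1.2, -0.7, -0.5)
ns <- length(z)               # number of clusters in the stratum
(ns / (ns - 1)) * sum(z^2)    # this stratum's contribution: 3.27
```

Summing such contributions over strata gives the variance estimate for the weighted sum.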
Standard errors for statistics other than means are developed from a first-order Taylor series expansion; that is, they use the same formula applied to the mean of the estimating functions. The computations for variances of sums are performed in the svyCprod function, ensuring that they are consistent across the survey functions.
Subsetting surveys
Estimates on a subset of a survey require some care. Firstly, it is important to ensure that the correct weight, cluster and stratum information is kept matched to each observation. Secondly, it is not correct simply to analyse a subset as if it were a designed survey of its own.
The correct analysis corresponds approximately to setting the weight to zero for observations not in the subset. This is not how it is implemented, since observations with zero weight still contribute to constraints in R functions such as glm, and they still take up memory. Instead, the survey.design object keeps track of how many clusters (PSUs) were originally present in each stratum, and svyCprod uses this to adjust the variance.
Example
The examples in this article use data from the National Health Interview Study (NHIS 2001), conducted by the US National Center for Health Statistics. The data and documentation are available from http://www.cdc.gov/nchs/nhis.htm. I used NCHS-supplied SPSS files to read the data, and then read.spss in the foreign package to load them into R. Unfortunately, the dataset is too big to include these examples in the survey package; they would increase the size of the package by a couple of orders of magnitude.
The survey package
In order to keep the stratum, cluster and probability information associated with the correct observations, the survey package uses a survey.design object created by the svydesign function. A simple example call is:
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write
imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)
to indicate the sampling of households, families and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.
The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.
Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.
The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some description of the design, such as the number of largest-level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command
svytable(~AGEP+POLCTC, design=imdsgn)
produces a table of number of polio immunisations, by age in years, for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
       pmin(3, POLCTC)
AGEP 0 1 2 3
0 473079 411177 655211 276013
1 162985 72498 365445 863515
2 144000 84519 126126 1057654
3 89108 68925 110523 1123405
4 133902 111098 128061 1069026
5 137122 31668 19027 1121521
6 127487 38406 51318 838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.
To examine whether this varies by city size, we could tabulate
svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)
where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.
To look at associations with city size, we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2, cities from 2.5 million to 5 million people, does show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function, and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)
Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)
Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side, specifying the response variable.
It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done.
1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/\pi_i (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas \Delta_i, or compute them from the information matrix I and score contributions U_i as \Delta = I^{-1}U_i.

4. Pass \Delta_i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object, and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
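The delta-beta computation in step 3 of the recipe is just a matrix product. A hedged sketch with invented toy values (not output from an actual model fit):

```r
# Score contributions: one row per observation, one column per parameter
U <- matrix(c( 0.5, -0.2, -0.3,
               0.1,  0.2, -0.3), ncol = 2)
I <- diag(c(2, 4))         # toy information matrix
Delta <- U %*% solve(I)    # row i is observation i's delta-beta
Delta
```

The rows of Delta, combined with the cluster and stratum information, are what the variance computation consumes.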
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.
Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G., & Thompson, D.J. (1951). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.
Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM, and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000a,b), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.
The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000a,b) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N,p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3, and p usually no greater than 10^2.
2. The leading principal component of X is calculated, and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X_1^*, ..., X_N^*. There are [0.9N] genes in X_1^*, [0.81N] in X_2^*, and 1 in X_N^*.
4. The optimal cluster size is selected from X^*. The selection process requires that we estimate the distribution of R^2 by bootstrap, and calculate G_k = R_k^2 - \bar{R}_k^{*2}, where \bar{R}_k^{*2} is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, \bar{x}_{S_k}, from each row of X:

\[ newX[i,] = X[i,] \underbrace{\left( I - \bar{x}_{S_k} \left( \bar{x}_{S_k}^T \bar{x}_{S_k} \right)^{-1} \bar{x}_{S_k}^T \right)}_{IPx} \]

6. The next optimal cluster is found by repeating the process using newX.
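The sweeping operation in step 5 is an ordinary linear projection, and can be checked directly in R. The sketch below uses an invented mean gene and a toy matrix, not micro-array data:

```r
x.mean <- c(1, 2, 2)   # hypothetical mean gene of a cluster (length p = 3)
# IPx projects onto the space orthogonal to x.mean
IPx <- diag(3) - x.mean %*% solve(t(x.mean) %*% x.mean) %*% t(x.mean)

X <- matrix(1:12, nrow = 4)   # toy expression matrix: 4 genes, 3 samples
newX <- X %*% IPx             # every row of newX is orthogonal to x.mean
max(abs(newX %*% x.mean))     # effectively zero
```

After the sweep, no row of newX retains any component along the removed cluster's mean gene, so the next iteration finds genuinely new structure.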
Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process, by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to \bar{x}_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by:

\[ \mathrm{Speedup} = \frac{1}{r_s + \frac{r_p}{n}} \]

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
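Amdahl's Law is simple to evaluate. The sketch below uses a parallel proportion of 0.78 (and hence a serial proportion of 0.22), the figure measured for the bootstrap in the results section; up to rounding it reproduces the speedup estimates quoted there:

```r
# Amdahl's Law: rs = serial proportion, rp = parallel proportion,
# n = number of processors
speedup <- function(rs, rp, n) 1 / (rs + rp / n)

speedup(0.22, 0.78, n = 2)    # approximately 1.64
speedup(0.22, 0.78, n = 10)   # approximately 3.36
```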
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
- Bootstrapping children: each slave task performs one permutation of the bootstrapping process.

- Orthogonalization children: each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm, without slave-slave communication
Bootstrapping
Bootstrapping is used to find \bar{R}^{*2}, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations; quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X^* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R_b^2 is derived, and the mean of R_b^2, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small, and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R^{*2}, and returns this to the master process. The master averages R_i^{*2}, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
par.bootstrap <- function(X){
  b.rsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    b.rsq <- b.rsq + (rsq/tasks)
  }
  return(b.rsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, an R^2 value calculated and returned to the master. The following code performs these operations in the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves, and reconstruct the matrix. Order of rows is maintained by sending blocks of rows, and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function, used by the master process:
par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  n.rows = as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (n.rows * (i-1)) + 1
    end <- n.rows * i
    if(end > rows)
      end <- rows
    X.part <- X[start:end, ]
    .PVM.pkdblmat(X.part)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i==1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master.
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
X.part <- .PVM.upkdblmat()
Z <- X.part
# orthogonalize X
for(i in 1:dim(X.part)[1]){
  Z[i,] <- X.part[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping, and orthogonalization.
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs, we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs, we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping, and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows), for bootstrapping, orthogonalization and the overall computation, under the different environments (serial, 2 nodes, 16 nodes).
The observed speedup for the two runs of parallel gene-shaving, using the N = 12800 expression matrix, is as follows:

\[ S = \frac{T_{serial}}{T_{parallel}} \]

\[ S_{2\,nodes} = \frac{1301.18}{825.34} = 1.6 \qquad S_{16\,nodes} = \frac{1301.18}{549.59} = 2.4 \]
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm, which can fully utilise each cluster node, will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS spring joint computer conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this Section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN1. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.
I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003), and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and correspondinganswers) are collected in the ldquoR FAQrdquo (Hornik 2003)an important resource not only for beginners It is agood idea to look into the ldquoR FAQrdquo when a questionarises that cannot be answered by reading the othermanuals cited above Specific FAQs for Macintosh(by Stefano M Iacus) and Windows (by Brian D Rip-ley) are available as well2
Users who gained some experiences might wantto write their own functions making use of the lan-guage in an efficient manner The manual ldquoR Lan-guage Definitionrdquo (R Core 2003c still labelled asldquodraftrdquo) is a comprehensive reference manual for acouple of important language aspects eg objectsworking with expressions working with the lan-guage (objects) description of the parser and debug-ging I recommend this manual to all users who planto work regularly with R since it provides really il-luminating insights
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++, or Fortran code, and of compiling and linking this code into R.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html

R News ISSN 1609-3631

Vol. 3/1, June 2003 27
Occasionally, library(help=PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Besides the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply them in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
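For instance, a short session hunting for documentation might look like this (a small illustration; the exact matches depend on the packages installed):

```r
## Search installed documentation by name, title, and keyword entries;
## help.search() uses fuzzy matching by default and also accepts
## regular expressions:
res <- help.search("linear models")

## Find objects on the search path whose names partially match a
## pattern; apropos() returns a character vector of matching names:
hits <- apropos("mean")
```

Printing `res` displays the matching help topics.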
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.

Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).

Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).

It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's recent problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.

³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ. Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.

Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion

The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both the book and the help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, with methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
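A small sketch of how seeding interacts with the generator kinds: selecting the kinds explicitly and re-seeding reproduces a simulation exactly.

```r
## Explicitly select the new default kinds and seed the generator:
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
x <- runif(5)

## Re-seeding with the same kind and seed reproduces the same stream:
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
y <- runif(5)   # identical to x
```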
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)
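A minimal illustration of the padding behaviour:

```r
## Assigning a names vector that is too short pads with character NAs:
x <- 1:3
names(x) <- c("a", "b")   # one name short of length(x)
names(x)                  # "a" "b" NA
```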
bull It is now recommended to use the rsquoSystemRe-quirementsrsquo field in the DESCRIPTION file forspecifying dependencies external to the R sys-tem
bull Output text connections no longer have a line-length limit
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
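As a quick sketch, a logical matrix on the left-hand side selects cells of the data frame just as it would for a matrix:

```r
d <- data.frame(a = c(1, 5, 3), b = c(9, 2, 8))
## d > 4 is a logical matrix; the replacement applies cell-wise:
d[d > 4] <- 0
d   # column a becomes 1, 0, 3 and column b becomes 0, 2, 0
```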
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)
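For example:

```r
x <- list("a", "b", "c", list(first = 10, second = 20))
## Recursive indexing: element 2 of element 4
x[[c(4, 2)]]   # same as x[[4]][[2]], i.e. 20
```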
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC), and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
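A brief sketch of capture.output() diverting printed output into a character vector, one element per output line:

```r
## Capture what print() would have sent to the console:
txt <- capture.output(summary(1:10))
txt   # character vector holding the printed summary
```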
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and an 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.
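A small worked sketch of constrOptim(): the feasible region is expressed as ui %*% theta - ci >= 0, and the starting value must lie strictly inside it. Here a quadratic whose unconstrained optimum (2, 2) is cut off by the constraint x + y <= 1 is minimized, so the solution lands on the boundary near (0.5, 0.5).

```r
f  <- function(p) sum((p - 2)^2)   # objective, unconstrained optimum (2, 2)
gr <- function(p) 2 * (p - 2)      # its gradient

## Encode x + y <= 1 as ui %*% theta - ci >= 0 with ui = (-1, -1), ci = -1:
res <- constrOptim(theta = c(0, 0), f = f, grad = gr,
                   ui = matrix(c(-1, -1), nrow = 1), ci = -1)
res$par   # close to (0.5, 0.5), on the constraint boundary
```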
• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace(), and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.
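A sketch of the round trip: wrapping a file connection with gzcon() compresses serialized data on the way out and transparently decompresses it on the way back in.

```r
f <- tempfile()

## Write gzip-compressed serialized data:
con <- gzcon(file(f, "wb"))
serialize(1:100, con)
close(con)

## Read it back; gzcon() undoes the compression:
con <- gzcon(file(f, "rb"))
y <- unserialize(con)
close(con)
unlink(f)
```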
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues(), and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance(), and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that, when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing it not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now invisibly returns a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
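A one-line sketch: mapply() walks several vectors in parallel, passing the i-th element of each to the function and simplifying the result where possible.

```r
## Add corresponding elements of two vectors via mapply():
s <- mapply(function(a, b) a + b, 1:3, c(10, 20, 30))
s   # 11 22 33
```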
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the "wrong tail" in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
R News ISSN 1609-3631
Vol 31 June 2003 34
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
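For instance (marginals chosen arbitrarily for illustration):

```r
set.seed(42)
tabs <- r2dtable(3, r = c(3, 7), c = c(4, 6))  # three random 2x2 tables
sapply(tabs, rowSums)   # every table has row sums 3 and 7
sapply(tabs, colSums)   # ... and column sums 4 and 6
```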
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
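A quick sketch of the pair (the probabilities here are arbitrary):

```r
set.seed(7)
draws <- rmultinom(2, size = 10, prob = c(0.2, 0.3, 0.5))
colSums(draws)   # each draw allocates all 10 counts across the 3 cells
dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))  # probability of one outcome
```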
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo, and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk's event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid not reproduced here; see the URL above for a printable version.]
Across
9. Possessing a file or boat we hear (9)
10, 20 down. Developer code is binary peril (5,6)
11. Maturer code has no need for this (7)
12. And a volcano implodes moderately slowly (7)
13. Partly not an hyperbolic function (4)
14. Give 500 euro to R Foundation, then stir your café Breton (10)
16. Catches crooked parents (7)
17, 23 down. Developer code bloated Gauss (7,5)
19. Fast header code is re-entrant (6-4)
22, 4 down. Developer makes toilet compartments (4,8)
24. Former docks ship files (7)
25. Analysis of a skinned banana and eggs (25)
26. Rearrange array or a hairstyle (5)
27. Atomic number 13 is a component (9)
Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22ac
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (135)
7. Failed or applied (3,3)
8. International body member is ordered back before baker but can't be shown (15)
15. I hear you consort with a skeleton, having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot loot souk (8)
20. See 10ac
21. A computer sets my puzzle (6)
23. See 17ac
Recent Events
DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
imdsgn <- svydesign(id=~PSU, strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

The data for this, the immunization section of NHIS, are in the data frame immunize. The strata are specified by the STRATUM variable, and the inverse probability weights are in the WTFA.IM variable. The statement specifies only one level of clustering, by PSU. The nest=TRUE option asserts that clusters are nested in strata, so that two clusters with the same PSU id in different strata are actually different clusters.
In fact there are further levels of clustering, and it would be possible to write

imdsgn <- svydesign(id=~PSU+HHX+FMX+PX,
                    strata=~STRATUM,
                    weights=~WTFA.IM,
                    data=immunize,
                    nest=TRUE)

to indicate the sampling of households, families, and individuals. This would not affect the analysis, which depends only on the largest level of clustering. The main reason to provide this option is to allow sampling probabilities for each level to be supplied instead of weights.

The strata argument can have multiple terms, e.g., strata=~region+rural, specifying strata defined by the interaction of region and rural, but NCHS studies provide a single stratum variable, so this is not needed. Finally, a finite population correction to variances can be specified with the fpc argument, which gives either the total population size or the overall sampling fraction of top-level clusters (PSUs) in each stratum.

Note that variables to be looked up in the supplied data frame are all specified by formulas, removing the scoping ambiguities in some modelling functions.

The resulting survey.design object has methods for subscripting and na.action, as well as the usual print and summary. The print method gives some descriptions of the design, such as the number of largest level clusters (PSUs) in each stratum, the distribution of sampling probabilities, and the names of all the data variables. In this immunisation study the sampling probabilities vary by a factor of 100, so ignoring the weights may be very misleading.
Analyses
Suppose we want to see how many children have had their recommended polio immunisations recorded. The command

svytable(~AGEP+POLCTC, design=imdsgn)

produces a table of number of polio immunisations by age in years for children with good immunisation data, scaled by the sampling weights so that it corresponds to the US population. To fit the table more easily on the page, and since three doses is regarded as sufficient, we can do
> svytable(~AGEP+pmin(3,POLCTC), design=imdsgn)
      pmin(3, POLCTC)
AGEP        0       1       2        3
   0   473079  411177  655211   276013
   1   162985   72498  365445   863515
   2   144000   84519  126126  1057654
   3    89108   68925  110523  1123405
   4   133902  111098  128061  1069026
   5   137122   31668   19027  1121521
   6   127487   38406   51318   838825
where the last column refers to 3 or more doses. For ages 3 and older, just over 80% of children have received 3 or more doses, and roughly 10% have received none.

To examine whether this varies by city size, we could tabulate

svytable(~AGEP+pmin(3,POLCTC)+MSASIZEP,
         design=imdsgn)

where MSASIZEP is the size of the 'metropolitan statistical area' in which the child lives, with categories ranging from 5,000,000+ to under 250,000, and a final category for children who don't live in a metropolitan statistical area. As MSASIZEP has 7 categories, the data end up fairly sparse. A regression model might be more useful.
Regression models
The svyglm and svycoxph functions have similar syntax to their non-survey counterparts, but use a design rather than a data argument. For simplicity of implementation they do require that all variables in the model be found in the design object, rather than floating free in the calling environment.

To look at associations with city size we fit the logistic regression models in Figure 1. The resulting svyglm objects have a summary method similar to that for glm. There isn't a consistent association, but category 2 cities, from 2.5 million to 5 million people, do show some evidence of lower vaccination rates. Similar analyses using family income don't show any real evidence of an association.
General weighted likelihood estimation
The svymle function maximises a specified weighted likelihood. It takes as arguments a loglikelihood function and formulas or fixed values for each parameter in the likelihood. For example, a linear regression could be implemented by the code in Figure 2. In this example the log=TRUE argument is passed to dnorm, and is the only fixed argument. The other two arguments, mean and sd, are specified by
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP), design=imdsgn, family=binomial)
svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP, design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
    dm <- 2*(x - mean)/(2*sd^2)
    ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
    cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas. Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package
It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done.

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/πi (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆i, or compute them from the information matrix I and score contributions Ui as ∆i = I⁻¹Ui.

4. Pass ∆i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a "svywhatever" class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.
It may also be necessary to override other methods, such as residuals, as is shown for Pearson residuals in residuals.svyglm.
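Steps 1 and 2 of this recipe can be sketched for a toy linear-model wrapper. The $variables and $prob components and the svylmsketch() name used here are invented stand-ins for whatever the real design object stores, purely for illustration:

```r
# Build and evaluate a call to the underlying fitter (here lm),
# adding inverse-probability weights 1/pi_i (steps 1 and 2 above).
svylmsketch <- function(formula, design) {
  w <- 1 / design$prob   # Horvitz-Thompson style weights
  do.call("lm", list(formula = formula, data = design$variables,
                     weights = w))
}

# Toy 'design': 20 observations with unequal sampling probabilities.
set.seed(1)
dsgn <- list(variables = data.frame(y = rnorm(20), x = rnorm(20)),
             prob = runif(20, 0.1, 1))
fit <- svylmsketch(y ~ x, dsgn)
coef(fit)   # weighted point estimates
```

The remaining steps (delta-betas, svyCprod, classes and print methods) depend on the package internals and are not shown.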
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary
Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.
References
Horvitz, D.G., Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen-node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual-processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:
1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro-array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.
2. The leading principal component of X is calculated and termed the "eigen-gene".
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R² are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R² by bootstrap and calculate G_k = R²_k − R*²_k, where R*²_k is the mean R² from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_Sk, from each row of X:

    newX[i,] = X[i,] (I − x̄_Sk (x̄_Sk^T x̄_Sk)^(−1) x̄_Sk^T)

where the term in parentheses is denoted IPx.
6. The next optimal cluster is found by repeating the process using newX.
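Steps 2 and 3 can be sketched serially as follows. This is our own illustration, not the authors' implementation; the function name shave.once and the use of svd() for the leading principal component are assumptions made for this example.

```r
# Illustrative sketch of one shave (steps 2-3); not the authors' code.
shave.once <- function(X) {
  Xc <- sweep(X, 1, rowMeans(X))             # center each gene (row)
  eigen.gene <- svd(Xc)$v[, 1]               # leading principal component, length p
  r2 <- as.vector(cor(t(X), eigen.gene))^2   # R^2 of every gene with the eigen-gene
  keep <- r2 > quantile(r2, 0.10)            # shave the lowest 10% of R^2
  X[keep, , drop = FALSE]
}
```

Iterating shave.once() yields the nested sequence X*_1, ..., X*_N described above.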
Figure 1: Clustering of genes.
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_Sk.
In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
    Speedup = 1 / (rs + rp/n)
where rs is the proportion of the program computed in serial, rp is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load, and the like.
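As a quick illustration, the formula can be evaluated directly in R for the parallel proportion rp = 0.78 measured later in this article (the helper name amdahl is our own, for illustration):

```r
# Amdahl's law: predicted speedup for a program whose parallel
# proportion is rp (serial proportion 1 - rp), run on n processors.
amdahl <- function(rp, n) 1 / ((1 - rp) + rp/n)

amdahl(0.78, 2)   # about 1.6 with the work spread over two nodes
amdahl(0.78, 10)  # about 3.4 with ten processors
```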
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait
for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children: each slave task performs one permutation of the bootstrapping process.
• Orthogonalization children: each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary here, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm without slave-slave communication.
Bootstrapping
Bootstrapping is used to find R*², used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU-intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
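In serial form this amounts to a one-liner. The sketch below is ours, not the authors' code; note that apply() collects its results as columns, so the result must be transposed back:

```r
# Permute the elements within each row of X (serial bootstrap step).
# apply() returns one column per row of X, hence the t().
boot.perm <- function(X) {
  Xb <- t(apply(X, 1, sample))
  dimnames(Xb) <- dimnames(X)
  Xb
}
```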
The matrix X is transformed into X* B times, by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R²_b is derived, and the mean of R²_b, b = 1, ..., B, is calculated to represent that R² which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples, we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to each of its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
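The even-block scheme can be sketched with a small helper (hypothetical, not part of the article's code) that returns the contiguous row ranges each slave would receive:

```r
# Split 'rows' matrix rows into contiguous blocks, one block per slave task.
block.rows <- function(rows, tasks) {
  nrows <- as.integer(rows/tasks)
  lapply(1:tasks, function(i) {
    start <- nrows * (i - 1) + 1
    end <- min(nrows * i, rows)
    start:end
  })
}

block.rows(8, 4)  # four blocks of two rows each
```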
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*², and returns this to the master process. The master averages the R*², i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children.
    par.bootstrap <- function(X) {
      brsq <- 0
      tasks <- length(boot.children)
      .PVM.initsend()
      # send a copy of X to everyone
      .PVM.pkdblmat(X)
      .PVM.mcast(boot.children, 0)
      # get the R squared values back and
      # take the mean
      for (i in 1:tasks) {
        .PVM.recv(boot.children[i], 0)
        rsq <- .PVM.upkdouble(1, 1)
        brsq <- brsq + (rsq/tasks)
      }
      return(brsq)
    }
After the master has sent the matrix to the children, it waits for them to return their calculated R² values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
    rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R² values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R² value is calculated and returned to the master. The following code performs these operations in the child tasks:
    # receive the matrix
    .PVM.recv(.PVM.parent(), 0)
    X <- .PVM.upkdblmat()
    Xb <- X
    dd <- dim(X)
    # perform the permutation
    for (i in 1:dd[1]) {
      permute <- sample(1:dd[2], replace = F)
      Xb[i, ] <- X[i, permute]
    }
    # send it back to the master
    .PVM.initsend()
    .PVM.pkdouble(D.funct(Xb))
    .PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap permutation is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of D.funct(Xb) (the R² calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
    par.orth <- function(IPx, X) {
      rows <- dim(X)[1]
      tasks <- length(orth.children)
      # initialize the send buffer
      .PVM.initsend()
      # pack the IPx matrix and send it to
      # all children
      .PVM.pkdblmat(IPx)
      .PVM.mcast(orth.children, 0)
      # divide the X matrix into blocks and
      # send a block to each child
      nrows <- as.integer(rows/tasks)
      for (i in 1:tasks) {
        start <- (nrows * (i - 1)) + 1
        end <- nrows * i
        if (end > rows)
          end <- rows
        X.part <- X[start:end, ]
        .PVM.pkdblmat(X.part)
        .PVM.send(orth.children[i], start)
      }
      # wait for the blocks to be returned
      # and re-construct the matrix
      for (i in 1:tasks) {
        .PVM.recv(orth.children[i], 0)
        rZ <- .PVM.upkdblmat()
        if (i == 1)
          Z <- rZ
        else
          Z <- rbind(Z, rZ)
      }
      dimnames(Z) <- dimnames(X)
      Z
    }
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
    # wait for the IPx and X matrices to arrive
    .PVM.recv(.PVM.parent(), 0)
    IPx <- .PVM.upkdblmat()
    .PVM.recv(.PVM.parent(), -1)
    X.part <- .PVM.upkdblmat()
    Z <- X.part
    # orthogonalize X
    for (i in 1:dim(X.part)[1]) {
      Z[i, ] <- X.part[i, ] %*% IPx
    }
    # send back the new matrix Z
    .PVM.initsend()
    .PVM.pkdblmat(Z)
    .PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results (computation time in seconds against size of expression matrix in rows, for gene-shaving overall, bootstrapping, and orthogonalization).
Clearly the bootstrapping (blue) step is quite time-consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two-node runs we could expect a reduction in the bootstrapping time to no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) for variable-sized expression matrices (rows), for the bootstrapping, orthogonalization, and overall computations, under the different environments (serial, 2 nodes, 16 nodes).
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
    S = T_serial / T_parallel

    S_2nodes = 1301.188 / 812.534 = 1.6

    S_16nodes = 1301.188 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm, which can fully utilise each cluster node, will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.
Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.
Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.
Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.
Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.
Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.
Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.
Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.
Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.
Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and R-related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core
Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org/  ²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and on compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Besides the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
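A few typical queries illustrate these facilities (the exact output depends on the packages installed; the examples are ours):

```r
help("mean")            # same as ?mean: documentation for a known name
help.search("t-test")   # search names, titles and keywords of installed packages
apropos("^mean")        # objects whose names match a regular expression
```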
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information, and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's recent problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed-effects models and spatial statistics are also presented. For each topic some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R was itself originally written as a teaching tool and many are using it as such (including the author), I am sure its publication is welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course "Basic Statistics for Health Researchers" at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book I was a bit surprised to see the last four chapters: I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author, though, that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute(...)) as a default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from one-way ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to two-way ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

R News ISSN 1609-3631

Vol. 3/1, June 2003, page 30
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred that the coverage of variable selection in multiple linear regression be omitted entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both the language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three chapters is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be used directly in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
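As a quick sketch of what this means in practice (assuming only the documented behaviour of RNGversion() and set.seed(); the seed value is arbitrary):

```r
## Restore the generators that were the default before 1.7.0
RNGversion("1.6.2")
set.seed(42)
old <- runif(3)

## Switch back to the new defaults (Mersenne-Twister / Inversion)
RNGversion("1.7.0")
set.seed(42)
new <- runif(3)

identical(old, new)  # FALSE: same seed, different generator kind
```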
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese characters have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)
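A tiny illustration (values invented):

```r
x <- 1:3
names(x) <- "a"   # value shorter than the vector: padded with NAs
names(x)          # "a" NA NA
```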
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
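For example (a made-up nested list, indices chosen to fit its depth):

```r
x <- list(a = 1, b = list(c = 2, d = list(e = 3)))
x[[c(2, 2)]]                                  # same as x[[2]][[2]]
identical(x[[c(2, 2, 1)]], x[[2]][[2]][[1]])  # TRUE
```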
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
• New function as.difftime() for time-interval data.
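A short sketch of its use (example values invented; character input is parsed as H:M:S by default, numeric input needs explicit units):

```r
d1 <- as.difftime(c("0:30:00", "1:15:30"))  # parsed as hours:minutes:seconds
d2 <- as.difftime(30, units = "mins")       # numeric value with units
```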
• basename() and dirname() are now vectorized.

• biplot.default() in package mva allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
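For instance (the expression captured here is arbitrary):

```r
txt <- capture.output(summary(c(1, 2, 3, 4)))
is.character(txt)   # the printed summary is now a character vector
```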
• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for dist objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
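A small sketch (objective and constraint invented for illustration): minimize (x − 2)^2 + (y − 2)^2 subject to x + y <= 1, written in the ui %*% theta >= ci form that constrOptim() expects:

```r
f  <- function(th) sum((th - 2)^2)   # objective
gr <- function(th) 2 * (th - 2)      # its gradient

## x + y <= 1  rewritten as  -x - y >= -1; start at the interior point (0, 0)
res <- constrOptim(theta = c(0, 0), f = f, grad = gr,
                   ui = matrix(c(-1, -1), nrow = 1), ci = -1)
res$par   # close to c(0.5, 0.5), on the boundary x + y = 1
```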
• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.
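A minimal sketch of the wrapping idea (file name and contents invented; the gzip file is created with gzfile(), then read back through a plain binary connection wrapped by gzcon()):

```r
tmp <- tempfile(fileext = ".gz")
writeLines("hello, gzcon", gzfile(tmp))  # write a gzip-compressed text file

con <- gzcon(file(tmp, "rb"))  # wrap an ordinary binary connection
readLines(con)                 # decompressed transparently on reading
close(con)
```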
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
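For example (toy arguments): mapply() applies a function to the first elements of each argument, the second elements, and so on, simplifying the result where possible:

```r
mapply(function(x, y) x + y, 1:3, 4:6)  # 5 7 9
mapply(rep, 1:3, 3:1)  # results of unequal length are returned as a list
```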
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() in package eda now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
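A quick sketch (margins invented; each generated table has the requested row and column sums):

```r
set.seed(1)
tabs <- r2dtable(2, r = c(3, 2), c = c(2, 3))  # two random 2 x 2 tables
rowSums(tabs[[1]])   # 3 2 -- row margins preserved
colSums(tabs[[1]])   # 2 3 -- column margins preserved
```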
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram(), and heatmap().

• require() has a new argument named character.only to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
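For example (counts and probabilities invented):

```r
dmultinom(c(1, 2, 1), prob = c(0.2, 0.5, 0.3))    # P(X = (1,2,1)) for n = 4
set.seed(7)
rmultinom(3, size = 10, prob = c(0.2, 0.5, 0.3))  # each column sums to 10
```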
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T. Jake Luciani
SenSrivastava Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton, ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

plspcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid, cells numbered 1-27]
Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (25)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (33)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, to graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
svyglm(I(POLCTC>=3)~factor(AGEP)+factor(MSASIZEP),
       design=imdsgn, family=binomial)

svyglm(I(POLCTC>=3)~factor(AGEP)+MSASIZEP,
       design=imdsgn, family=binomial)

Figure 1: Logistic regression models for vaccination status
gr <- function(x, mean, sd, log){
  dm <- 2*(x - mean)/(2*sd^2)
  ds <- (x-mean)^2*(2*(2*sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
  cbind(dm, ds)
}

m2 <- svymle(loglike=dnorm, gradient=gr, design=dxi,
             formulas=list(mean=y~x+z, sd=~1),
             start=list(c(2,5,0), c(4)),
             log=TRUE)

Figure 2: Weighted maximum likelihood estimation
formulas: Exactly one of these formulas should have a left-hand side specifying the response variable.

It is necessary to specify the gradient in order to get survey-weighted standard errors. If the gradient is not specified, you can get standard errors based on the information matrix. With independent weighted sampling, the information matrix standard errors will be accurate if the model is correctly specified. As is always the case with general-purpose optimisation, it is important to have reasonable starting values; you could often get them from an analysis that ignored the survey design.
Extending the survey package

It is easy to extend the survey package to include any new class of model where weighted estimation is possible. The svycoxph function shows how this is done.

1. Create a call to the underlying model (coxph), adding a weights argument with weights 1/πi (or a rescaled version).

2. Evaluate the call.

3. Extract the one-step delta-betas ∆i, or compute them from the information matrix I and score contributions Ui as ∆i = I^{-1} Ui.

4. Pass ∆i and the cluster, strata and finite population correction information to svyCprod to compute the variance.

5. Add a svywhatever class to the object and a copy of the design object.

6. Override the print and print.summary methods to include print(x$design, varnames=FALSE, design.summaries=FALSE).

7. Override any likelihood extraction methods to fail.

It may also be necessary to override other methods such as residuals, as is shown for Pearson residuals in residuals.svyglm.
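The steps above can be condensed into a skeleton wrapper. This is only a sketch: mymodel, my.estfun and my.infmat are hypothetical stand-ins for the underlying fitter and its score/information extractors, and the svyCprod arguments shown are indicative rather than the documented interface.

```r
svymymodel <- function(formula, design, ...) {
  pwts <- 1 / design$prob                      # steps 1-2: fit with weights 1/pi_i
  fit <- mymodel(formula, weights = pwts, ...)
  U <- my.estfun(fit)                          # score contributions U_i (assumed)
  delta <- U %*% solve(my.infmat(fit))         # step 3: one-step delta-betas
  fit$var <- svyCprod(delta, design$strata,    # step 4: survey variance
                      design$cluster, design$fpc)
  fit$design <- design                         # step 5: keep the design
  class(fit) <- c("svymymodel", class(fit))
  fit
}
```

The remaining steps (print methods and disabling likelihood extraction) are plain S3 method definitions for the new class.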
A second important type of extension is to add prediction methods for small-area extrapolation of surveys. That is, given some variables measured in a small geographic area (a new cluster), predict other variables using the regression relationship from the whole survey. The predict methods for survey regression methods currently give predictions only for individuals.

Extensions in a third direction would handle different sorts of survey design. The current software cannot handle surveys where strata are defined within clusters, partly because I don't know what the right variance formula is.
Summary

Analysis of complex sample surveys used to require expensive software on expensive mainframe hardware. Computers have advanced sufficiently for a reasonably powerful PC to analyze data from large national surveys such as NHIS, and R makes it possible to write a useful survey analysis package in a fairly short time. Feedback from practising survey statisticians on this package would be particularly welcome.

References

Horvitz, D.G., Thompson, D.J. (1952). A Generalization of Sampling without Replacement from a Finite Universe. Journal of the American Statistical Association 47, 663-685.

Thomas Lumley
Biostatistics, University of Washington
tlumley@u.washington.edu
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason

Introduction

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.

Recent articles in R News, Li and Rossini (2001) and Yu (2002), entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000a,b), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the "strength" of a cluster, and the rendering of each matrix row orthogonal to the "eigen-gene". The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.

A parallel computing environment

The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet, and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper, RPVM (Li and Rossini, 2001).
Gene-shaving

The gene-shaving algorithm (Hastie et al., 2000a,b) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the "eigen-gene".

3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9N] genes in X*_1, [0.81N] in X*_2, and 1 in X*_N.

4. The optimal cluster size is selected from the X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k - R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.

5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by "sweeping" the mean gene of the optimal cluster, x̄_{S_k}, from each row of X:

   newX[i,] = X[i,] (I - x̄_{S_k} (x̄_{S_k}^T x̄_{S_k})^{-1} x̄_{S_k}^T)

   where the projection term in parentheses is denoted IPx.

6. The next optimal cluster is found by repeating the process using newX.
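For illustration, one shaving iteration (steps 2 and 3) can be sketched in a few lines of R. The function shave.once and its argument names are ours, not part of the S-PLUS code by Do and Wen (2002):

```r
shave.once <- function(X, frac = 0.1) {
  # step 2: leading principal component over samples, the "eigen-gene"
  eigen.gene <- svd(X)$v[, 1]
  # step 3: squared correlation of each gene (row) with the eigen-gene
  r2 <- apply(X, 1, function(gene) cor(gene, eigen.gene)^2)
  # shave the rows yielding the lowest 10% of R^2
  X[r2 > quantile(r2, frac), , drop = FALSE]
}
```

Iterating this function produces the nested sequence X*_1, ..., X*_N described above.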
Figure 1: Clustering of genes

The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving

There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x̄_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by:

    Speedup = 1 / (r_s + r_p/n)

where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
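Amdahl's law is easy to code as a quick check of the estimates used later in this article; the function name amdahl is ours:

```r
# speedup predicted by Amdahl's law for parallel proportion rp
# (serial proportion rs = 1 - rp) on n processors
amdahl <- function(rp, n) 1 / ((1 - rp) + rp/n)

amdahl(0.78, 2)    # about 1.64
amdahl(0.78, 10)   # about 3.36
```

With the parallel proportion of 0.78 measured below, this reproduces, up to rounding, the speedups of 1.63 and 3.35 quoted later for n = 2 and n = 10.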
The parallel setup

Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

- Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

- Orthogonalization children - each slave computes a portion of the orthogonalization process.

Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
Figure 2: The master-slave paradigm, without slave-slave communication
Bootstrapping

Bootstrapping is used to find R*^2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.

The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R^2_b is derived, and the mean of R^2_b, b = 1, ..., B, is calculated to represent that R^2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization

The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm. The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code

Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping

Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*^2, and returns this to the master process. The master averages R*^2_i, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function par.bootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
par.bootstrap <- function(X){
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for(i in 1:tasks){
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R^2 values. This is controlled by a simple for loop; as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean for all R^2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R^2 value is calculated and returned to the master. The following code performs these operations of the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for(i in 1:dd[1]){
  permute <- sample(1:dd[2], replace=F)
  Xb[i,] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R^2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization

The orthogonalization in the master process is handled by the function par.orth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the par.orth function used by the master process:
par.orth <- function(IPx, X){
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks){
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end,]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks){
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i==1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]){
  Z[i,] <- Xpart[i,] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results. Computation time (seconds) against size of expression matrix (rows), for gene-shaving overall, bootstrapping, and orthogonalization.
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for bootstrapping, orthogonalization, and the overall computation, under the serial, 2 node and 16 node environments.
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

    S = T_serial / T_parallel

    S_2nodes = 1301.18/825.34 = 1.6

    S_16nodes = 1301.18/549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
R News ISSN 1609-3631
Vol. 3/1, June 2003
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report, Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R-related experience require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++, or Fortran code, and compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Besides the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file 'Rprofile'.
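As a short sketch of the facilities just described (the htmlhelp option is as documented for R of this vintage):

```r
# Two equivalent ways to open the documentation for mean():
?mean
help("mean")

# Request HTML help for the whole session (if available):
options(htmlhelp = TRUE)
```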
When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
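A minimal sketch of the two search mechanisms just mentioned:

```r
# Search installed documentation by concept rather than exact name:
help.search("linear models")

# apropos() matches object names and accepts regular expressions;
# here, all visible objects whose names start with "glm":
apropos("^glm")
```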
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³ http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable; see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ. Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed-effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
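The clutter problem described here is easy to reproduce; a minimal sketch (the data frames are invented for illustration, not from the book):

```r
# Two data frames sharing a column name 'x'
d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)

attach(d1)
attach(d2)      # R warns that 'x' from d1 is now masked
m <- mean(x)    # silently uses d2's x (giving 5), perhaps not what was meant

# Pairing every attach() with a detach() keeps the search path clean
detach(d2)
detach(d1)
```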
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R". Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed for interactive use in most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute(...)) as the default argument for the title of a plot) is also described.
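The deparse(substitute(...)) idiom mentioned here can be sketched as follows (the function name is invented for illustration, not taken from the book):

```r
# The default title becomes the unevaluated expression the caller supplied:
# substitute(x) captures it, deparse() turns it into a string.
plot_with_title <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
  invisible(main)
}

# plot_with_title(rnorm(50)) titles the plot "rnorm(50)"
```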
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material: from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
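The any/all erratum matters because the vectorized iteration must continue until every element has converged. A minimal sketch of the corrected version (not the book's exact code; assumes strictly positive inputs):

```r
# Vectorized Newton iteration for square roots. Note all(), not any():
# with any(), the loop would stop once a single element had converged.
newton_sqrt <- function(y, eps = 1e-10) {
  x <- y / 2
  while (!all(abs(x^2 - y) < eps))   # 'all', per the corrected text
    x <- (x + y / x) / 2
  x
}

newton_sqrt(c(4, 9, 16))  # values very close to c(2, 3, 4)
```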
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression, and contrasts for categorical data and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real-world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory, such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression (‘modreg’ package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new ‘plot’ argument. Setting it to FALSE gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument ‘recursive’.

• lm.influence() has a new ‘do.coef’ argument allowing not to compute the casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
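A minimal illustration of the new function (our own example): mapply() applies a function to the first elements of each argument, then the second elements, and so on:

```r
## Element-wise application over several argument vectors
mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9

## When result lengths differ, a list is returned instead of a vector
mapply(rep, 1:3, 3:1)                    # list(c(1,1,1), c(2,2), 3)
```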
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an ‘na.rm’ argument (PR#2298).

• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for ‘main’, and supports the ‘las’ argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization, by supplying a move function as the ‘gr’ argument (contributed by Adrian Trapletti).
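A toy sketch of the combinatorial use (the problem and all names here are our own, purely for illustration): order the numbers 1..10 to minimise the sum of absolute differences between neighbours, with ‘gr’ proposing a random neighbouring state:

```r
## Objective: total absolute difference between adjacent elements
tourlen <- function(p) sum(abs(diff(p)))

## 'gr' for SANN: propose a neighbouring state by swapping two positions
move <- function(p, ...) {
  i <- sample(seq_along(p), 2)
  p[rev(i)] <- p[i]
  p
}

set.seed(42)
res <- optim(sample(10), tourlen, move, method = "SANN")
sort(res$par)   # still a permutation of 1:10
```

Since SANN only visits states produced by the move function, the result is guaranteed to remain a permutation.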
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the ‘pdfviewer’ option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() (‘mva’ package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a ‘strict’ argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument ‘formula’, so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with ‘filename’ as ‘NA’, a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument ‘datax’ to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
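For example (our own illustration), two random 2 × 2 tables with row sums c(3, 7) and column sums c(5, 5):

```r
## Random two-way tables with fixed margins
set.seed(1)
tabs <- r2dtable(2, c(3, 7), c(5, 5))
sapply(tabs, rowSums)   # every column is c(3, 7)
sapply(tabs, colSums)   # every column is c(5, 5)
```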
• rchisq() now has a non-centrality parameter ‘ncp’, and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
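A quick sketch of the pair (our own example): three multinomial draws of 10 balls into cells with probabilities 0.2, 0.3 and 0.5, and the density of an extreme outcome:

```r
## Each column of the rmultinom() result sums to the sample size
set.seed(1)
x <- rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))
colSums(x)                                        # 10 10 10

## Probability that all 10 balls land in the last cell: 0.5^10
dmultinom(c(0, 0, 10), prob = c(0.2, 0.3, 0.5))   # 0.0009765625
```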
• New function runmed() for fast running medians (‘modreg’ package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new ‘right’ argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument ‘abbr.colnames’.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to two decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument ‘simplify’ is true.

• Added a "[" method for terms objects.

• New argument ‘silent’ to try().
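For instance (our own illustration), with silent = TRUE the error message is not printed, while the failure is still signalled through the returned "try-error" object:

```r
## Trap an error quietly and inspect the result afterwards
res <- try(stop("this would fail"), silent = TRUE)
inherits(res, "try-error")   # TRUE
```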
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.
• Formal classes and methods can be ‘sealed’, by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option ‘--with-lapack’ to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options ‘--with-libpng’, ‘--with-jpeglib’, ‘--with-zlib’, ‘--with-bzlib’ and ‘--with-pcre’, principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing ‘INDEX’ file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the ‘demo/00Index’ file, and the \VignetteIndexEntry{} entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the ‘Meta’ subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The ‘CONTENTS’ file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using ‘--save’ signals an error; this is a temporary measure.
Deprecated amp defunct
• The assignment operator ‘_’ will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of ‘_’ becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments ‘pkg’ and ‘lib’ of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined, to give an error if these are used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class ‘slides’.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see ‘Writing R Extensions’ for details.

• New entry point R_GetX11Image, and formerly undocumented ptr_R_GetX11Image, are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies useful functions for the Davies quantilefunction and the Generalized Lambda distribu-tion By Robin Hankin
GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger
ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel
amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad
climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad
dispmod Functions for modelling dispersion inGLM By Luca Scrucca
effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton, ported to R by F. Fivaz

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others
hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally
hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald
ismev Functions to support the computations carried out in ‘An Introduction to Statistical Modeling of Extreme Values’ by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson
locfit Local Regression Likelihood and density esti-mation By Catherine Loader
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including sobel filter, rank filters, fft, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina
session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes
snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li
statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth
survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley
vcd Functions and data sets based on the book “Visualizing Categorical Data” by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang
RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang
RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang
SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects, describing the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object, or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are ‘cryptic’ in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid omitted]
Across
9. Possessing a file, or boat, we hear (9)
10, 20 down. Developer code is binary peril (5,6)
11. Maturer code has no need for this (7)
12. And a volcano implodes moderately slowly (7)
13. Partly not an hyperbolic function (4)
14. Give 500 euro to R Foundation, then stir your café Breton (10)
16. Catches crooked parents (7)
17, 23 down. Developer code bloated Gauss (7,5)
19. Fast header code is re-entrant (6-4)
22, 4 down. Developer makes toilet compartments (4,8)
24. Former docks ship files (7)
25. Analysis of a skinned banana and eggs (2,5)
26. Rearrange array or a hairstyle (5)
27. Atomic number 13 is a component (9)
Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22 across
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (135)
7. Failed or applied (3,3)
8. International body member is ordered back before baker but can't be shown (15)
15. I hear you consort with a skeleton having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot loot souk (8)
20. See 10 across
21. A computer sets my puzzle (6)
23. See 17 across
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: “At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting.”
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Computational Gains Using RPVM on a Beowulf Cluster
by Brett Carson, Robert Murison and Ian A. Mason
Introduction
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) is a recent advance in computing technology that harnesses the power of a network of desktop computers using communication software such as PVM (Geist et al., 1994) and MPI (Message Passing Interface Forum, 1997). Whilst the potential of a computing cluster is obvious, expertise in programming is still developing in the statistical community.
Recent articles in R News (Li and Rossini, 2001; Yu, 2002) entice statistical programmers to consider whether their solutions could be effectively calculated in parallel. Another R package, SNOW (Tierney, 2002; Rossini et al., 2003), aims to skillfully provide a wrapper interface to these packages, independent of the underlying cluster communication method used in parallel computing. This article concentrates on RPVM and wishes to build upon the contribution of Li and Rossini (2001) by taking an example with obvious orthogonal components and detailing the R code necessary to allocate the computations to each node of the cluster. The statistical technique used to motivate our RPVM application is the gene-shaving algorithm (Hastie et al., 2000b,a), for which S-PLUS code has been written by Do and Wen (2002) to perform the calculations serially.

The first section is a brief description of the Beowulf cluster used to run the R programs discussed in this paper. This is followed by an explanation of the gene-shaving algorithm, identifying the opportunities for parallel computing of bootstrap estimates of the “strength” of a cluster, and the rendering of each matrix row orthogonal to the “eigen-gene”. The code for spawning child processes is then explained comprehensively, and the conclusion compares the speed of RPVM on the Beowulf to serial computing.
A parallel computing environment
The Beowulf cluster (Becker et al., 1995; Scyld Computing Corporation, 1998) used to perform the parallel computation outlined in the following sections is a sixteen node homogeneous cluster. Each node consists of an AMD XP1800+ with 1.5Gb of memory, running Redhat Linux 7.3. The nodes are connected by 100Mbps Ethernet and are accessible via the dual processor AMD XP1800+ front-end. Each node has installed versions of R, PVM (Geist et al., 1994) and R's PVM wrapper RPVM (Li and Rossini, 2001).
Gene-shaving
The gene-shaving algorithm (Hastie et al., 2000b,a) is one of the many DNA micro-array gene clustering techniques to be developed in recent times. The aim is to find clusters with small variance between genes, and large variance between samples (Hastie et al., 2000b). The algorithm consists of a number of steps:

1. The process starts with the matrix X_{N×p}, which represents expressions of genes in the micro array. Each row of X is data from a gene; the columns contain samples. N is usually of size 10^3 and p usually no greater than 10^2.

2. The leading principal component of X is calculated and termed the “eigen-gene”.
3. The correlation of each row with the eigen-gene is calculated, and the rows yielding the lowest 10% of R^2 are shaved. This process is repeated to produce successive reduced matrices X*_1, ..., X*_N. There are [0.9 N] genes in X*_1, [0.81 N] in X*_2, and 1 in X*_N.
4. The optimal cluster size is selected from X*. The selection process requires that we estimate the distribution of R^2 by bootstrap and calculate G_k = R^2_k − R*^2_k, where R*^2_k is the mean R^2 from B bootstrap samples of X_k. The statistic G_k is termed the gap statistic, and the optimal cluster is that which has the largest gap statistic.
5. Once a cluster has been determined, the process repeats, searching for more clusters after removing the information of previously determined clusters. This is done by “sweeping” the mean gene of the optimal cluster, x_{S_k}, from each row of X:

   newX[i,] = X[i,] ( I − x_{S_k} (x_{S_k}^T x_{S_k})^{−1} x_{S_k}^T ),

   where the bracketed projection matrix is denoted IPx.

6. The next optimal cluster is found by repeating the process using newX.
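The sweep in step 5 can be sketched serially as follows (the function name and toy data are ours, not from the authors' code); after the sweep, every row of the result is orthogonal to the swept mean gene:

```r
## Remove the component of each row of X lying along the mean gene x:
## IPx = I - x (x'x)^{-1} x' is the projection matrix of step 5.
sweep_cluster <- function(X, x) {
  IPx <- diag(length(x)) - outer(x, x) / sum(x * x)
  X %*% IPx
}

set.seed(1)
X <- matrix(rnorm(20), nrow = 4)   # toy data: 4 genes, 5 samples
x <- colMeans(X[1:2, ])            # mean gene of a (toy) cluster
newX <- sweep_cluster(X, x)
max(abs(newX %*% x))               # effectively zero
```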
R News ISSN 1609-3631
Vol 31 June 2003 22
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_{S_k}. In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by:
Speedup = 1 / (r_s + r_p / n),
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed-limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
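The formula is easy to evaluate directly; a small helper (our own, for illustration) shows why a partly-serial program gains far less than the node count suggests:

```r
## Amdahl's law: speedup when a fraction rp of the work runs in
## parallel on n processors (the remaining 1 - rp runs serially).
speedup <- function(rp, n) 1 / ((1 - rp) + rp / n)

speedup(1.0, 16)   # fully parallel: 16
speedup(0.5, 16)   # half parallel: only about 1.88
```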
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master), and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing, and return an answer to the master. There are two varieties of children/slaves:

• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.

• Orthogonalization children - each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup, without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of the others.
[Figure: a master process connected to Slave 1, Slave 2, Slave 3, ..., Slave n.]

Figure 2: The master-slave paradigm, without slave-slave communication.
Bootstrapping
Bootstrapping is used to find R*2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations: quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
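A minimal serial sketch of this row-wise permutation (the matrix here is a small random stand-in, not data from the article):

```r
set.seed(42)
X <- matrix(rnorm(6 * 4), nrow = 6)  # stand-in expression matrix

# sample() with no further arguments returns a permutation of its
# input; apply() collects the permuted rows as columns, hence the
# transpose to restore the original orientation.
Xstar <- t(apply(X, 1, sample))
```

Each row of Xstar contains the same values as the corresponding row of X, in a random order.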
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R*2_b is derived, and the mean of R*2_b, b = 1, ..., B, is calculated to represent that R2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
Orthogonalization
The master R process distributes a portion of the matrix X to each of its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
R News ISSN 1609-3631
Vol. 3/1, June 2003
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
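Applying IPx row by row is equivalent to a single matrix product, which is what makes the block-wise distribution straightforward. A small self-contained check (the projection matrix IPx below is an illustrative stand-in built from a random vector x, not the article's data):

```r
set.seed(1)
X   <- matrix(rnorm(5 * 3), nrow = 5)
x   <- rnorm(3)
# projection onto the orthogonal complement of x
IPx <- diag(3) - (x %*% t(x)) / sum(x^2)

# row-by-row application, as each child does with its block of X ...
Z <- X
for (i in 1:nrow(X))
  Z[i, ] <- X[i, ] %*% IPx

# ... agrees with the one-shot matrix product X %*% IPx,
# and every row of Z is orthogonal to x
all.equal(Z, X %*% IPx)   # TRUE
max(abs(Z %*% x))         # effectively zero
```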
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*2_i, and returns this to the master process. The master averages R*2_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  ## send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  ## get the R squared values back and
  ## take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R*2 values. This is controlled by a simple for loop: as the master receives each answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with
rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R*2 values returned by the children. In the slave tasks, the matrix is unpacked and a bootstrap permutation is performed, an R*2 value calculated and returned to the master. The following code performs these operations in the child tasks:
## receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
## perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
## send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R*2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. The order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:
parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  ## initialize the send buffer
  .PVM.initsend()
  ## pack the IPx matrix and send it to
  ## all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  ## divide the X matrix into blocks and
  ## send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  ## wait for the blocks to be returned
  ## and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization, and sends back the result to the master:
## wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
## orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
## send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
[Figure: computation time (seconds) plotted against size of expression matrix (rows), with curves for gene-shaving (red), bootstrapping (blue) and orthogonalization (black).]

Figure 3: Serial gene-shaving results.
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as in the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
[Figure: three panels, "Bootstrap Comparisons", "Orthogonalization Comparisons" and "Overall Comparisons", each plotting computation time (seconds) against size of expression matrix (rows) for the serial, 2 node and 16 node runs.]

Figure 4: Comparison of computation times for variable sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments.
The observed speedup for the two runs of parallel gene-shaving, using the N = 12800 expression matrix, is as follows:
    S = T_serial / T_parallel

    S_2nodes = 1301.18 / 825.34 = 1.6

    S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for the bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, Sunnyvale, California. IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk

Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), a description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and on compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in the file 'Rprofile'.
When the exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
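For example (the exact hits of the search depend on the packages installed):

```r
# search documentation titles, names and keywords,
# with fuzzy or regular-expression matching
help.search("help facilities")

# find objects by partial name; regular expressions are accepted
apropos("^mean")    # matches mean(), mean.default(), ...
```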
Documentation and help in HTML format can also be accessed by help.start(). This function starts a web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or for documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is a need for discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information, and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the asker's immediate problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g. convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. http://www.ci.tuwien.ac.at/~hornik/R/ ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M., and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment, and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences, and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences between the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core, and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in it are mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.), as well as how to graphically display summaries such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide very nice, brief summaries of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
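For example, a script that must reproduce simulations made under an earlier release could use a sketch along these lines:

```r
## Restore the pre-1.7.0 default generators for reproducibility
RNGversion("1.6.2")
set.seed(123)
old.draws <- runif(3)

## Switch back to the new defaults (Mersenne-Twister / Inversion)
RNGversion("1.7.0")
set.seed(123)
new.draws <- runif(3)   # generally differs from old.draws
```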
• Name spaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a name space. All the base packages (except methods and tcltk) have name spaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
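As a minimal illustration (package and function names here are invented), a NAMESPACE file simply declares what a package exposes; everything not exported stays internal:

```r
## NAMESPACE file for a hypothetical package 'foo'
export(fitModel)            # only exported objects are visible to users
S3method(print, fooFit)     # register an S3 method for print()
```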
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)
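A quick sketch of the padding behaviour:

```r
x <- 1:4
names(x) <- c("a", "b")   # too short: remaining names become NA
names(x)                  # "a" "b" NA NA
```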
• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
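For instance, all negative cells of a numeric data frame can now be replaced in one step:

```r
df <- data.frame(a = c(-1, 2), b = c(3, -4))
df[df < 0] <- 0   # logical matrix index, treated as for a matrix
df                # column a: 0 2; column b: 3 0
```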
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)
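A brief sketch:

```r
x <- list(1, 2, 3, list("a", "b"))
x[[c(4, 2)]]   # same as x[[4]][[2]], i.e. "b"
```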
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
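Both now map over a whole vector of paths:

```r
paths <- c("/tmp/a.txt", "/var/log/b.log")
basename(paths)   # "a.txt" "b.log"
dirname(paths)    # "/tmp"  "/var/log"
```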
• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
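A minimal use with a linear model fit:

```r
fit <- lm(dist ~ speed, data = cars)
confint(fit)              # 95% confidence intervals for the coefficients
confint(fit, level = 0.9) # or any other level
```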
• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).
• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with name spaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so can read compressed saves from suitable connections.
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
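A short sketch of element-wise application over several argument vectors:

```r
mapply(function(x, y) x + y, 1:3, c(10, 20, 30))
## 11 22 33
```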
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
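For example, three random 2x2 tables with fixed row sums (3, 7) and column sums (4, 6):

```r
set.seed(42)
r2dtable(3, r = c(3, 7), c = c(4, 6))
## a list of three 2x2 matrices, each with the requested marginals
```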
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
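A quick sketch of the pair:

```r
set.seed(1)
rmultinom(2, size = 10, prob = c(0.2, 0.3, 0.5))  # 3 x 2 matrix of counts
dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))    # probability of one outcome
```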
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for "terms" objects.
• New argument 'silent' to try().
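With silent = TRUE the error message is suppressed but still available in the returned object:

```r
res <- try(stop("boom"), silent = TRUE)  # no message printed
inherits(res, "try-error")               # TRUE
```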
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image, and the formerly undocumented ptr_R_GetX11Image, are in the new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism thatis toolkit independent and can be used to to re-place the R event loop This allows one to usethe Gtk or TclTkrsquos event loop which gives bet-ter response and coverage of all event sources(eg idle events timers etc) By Duncan Tem-ple Lang
RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk
[Crossword grid]
Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board: Douglas Bates and Thomas Lumley.
Editor Programmer's Niche: Bill Venables
Editor Help Desk: Uwe Ligges
Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage: http://www.R-project.org

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
Figure 1: Clustering of genes
The bootstrap and orthogonalization processes are candidates for parallel computing. Figure 1 illustrates the overall gene-shaving process by showing the discovery of clusters of genes.
Parallel implementation of gene-shaving
There are two steps in the shaving process that can be computed in parallel: firstly, the bootstrapping of the matrix X, and secondly, the orthogonalization of the matrix X with respect to x_Sk.
In general, for an algorithm that can be fully parallelized, we could expect a speedup of no more than 16 on a cluster of 16 nodes. However, given that this particular algorithm will not be fully parallelized, we would not expect such a reduction in computation time. In fact, the cluster will not be fully utilised during the overall computation, as most of the nodes will see significant idle time. Amdahl's Law (Amdahl, 1967) can be used to calculate the estimated speedup of performing gene-shaving in parallel. Amdahl's Law can be represented in mathematical form by
Speedup = 1 / (r_s + r_p / n)
where r_s is the proportion of the program computed in serial, r_p is the parallel proportion, and n is the number of processors used. In a perfect world, if a particular step computed in parallel were spread over five nodes, the speedup would equal five. This is very rarely the case, however, due to speed limiting factors like network load, PVM overheads, communication overheads, CPU load and the like.
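As a quick sketch (not part of the article's code; the function name amdahl is made up for illustration), Amdahl's law is easy to evaluate in R:

```r
## Amdahl's law: speedup = 1 / (r_s + r_p / n), with r_s = 1 - r_p.
## The value r_p = 0.78 is the parallel proportion measured later
## in the article.
amdahl <- function(rp, n) {
  rs <- 1 - rp            # serial proportion
  1 / (rs + rp / n)
}

amdahl(0.78, 2)   # roughly 1.6 on two processors
amdahl(0.78, 10)  # roughly 3.4 on ten processors
```
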
The parallel setup
Both the parallel versions of bootstrapping and orthogonalization consist of a main R process (the master) and a number of child processes that the master spawns (the slaves). The master process controls the entire flow of the shaving process. It begins by starting the child processes, which then proceed to wait for the master to send data for processing and return an answer to the master. There are two varieties of children/slaves:
• Bootstrapping children - each slave task performs one permutation of the bootstrapping process.
• Orthogonalization children - each slave computes a portion of the orthogonalization process.
Figure 2 shows a basic master-slave setup without any communication between the slaves themselves. The slaves receive some data from the parent, process it, and send the result back to the parent. A more elaborate setup would see the slaves communicating with each other; this is not necessary in this case, as each of the slaves is independent of each other.
Figure 2: The master-slave paradigm without slave-slave communication
Bootstrapping
Bootstrapping is used to find R*2, used in the gap statistic outlined in step 4 of the gene-shaving algorithm, to determine whether selected clusters could have arisen by chance. Elements of each row of the matrix are rearranged, which requires N iterations, quite a CPU intensive operation, even more so for larger expression matrices. In the serial R version this would be done using R's apply function, whereby each row of X has the sample function applied to it.
The matrix X is transformed into X* B times by randomly permuting within rows. For each of the B bootstrap samples (indexed by b), the statistic R*2_b is derived, and the mean of R*2_b, b = 1, ..., B, is calculated to represent that R*2 which arises by chance only. In RPVM, for each bootstrap sample a task is spawned; for B bootstrap samples we spawn B tasks.
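The serial within-row permutation described above can be sketched in a few lines of R (X here is a small stand-in matrix, not the authors' data, and the R*2 computation is omitted):

```r
## Serial bootstrap permutation: permute the entries within each row of X,
## using apply() and sample() as in the serial version (illustrative sketch).
set.seed(1)
X <- matrix(rnorm(5 * 4), nrow = 5)  # small stand-in expression matrix
Xstar <- t(apply(X, 1, sample))      # one bootstrap permutation X*
```

Each row of Xstar contains exactly the same values as the corresponding row of X, only reordered.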
Orthogonalization
The master R process distributes a portion of the matrix X to its children. Each portion will be approximately equal in size, and the number of portions will be equal to the number of slaves spawned for orthogonalization. The master also calculates the value of IPx, as defined in step 5 of the gene-shaving algorithm.
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.
There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions will be equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*2, and returns this to the master process. The master averages R*2_i, i = 1, ..., B, and computes the gap statistic.
The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq / tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R*2 values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)
The value obtained is used in calculating an overall mean for all R*2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R*2 value is calculated and returned to the master. The following code performs these operations of the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R*2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by firstly packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:
parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows / tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i - 1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
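Note that the row-by-row loop in the slave is just a matrix product computed one row at a time; a small self-contained check (with hypothetical stand-in matrices, not the article's data) illustrates the equivalence:

```r
## The orthogonalization loop is equivalent to a single matrix product
## (stand-in matrices for illustration only).
set.seed(2)
Xpart <- matrix(rnorm(3 * 4), nrow = 3)  # a block of rows of X
IPx <- diag(4) - 0.25                    # an arbitrary 4 x 4 matrix
Z <- Xpart
for (i in 1:dim(Xpart)[1])
  Z[i, ] <- Xpart[i, ] %*% IPx           # row-wise, as in the slave code
all.equal(Z, Xpart %*% IPx)              # same result as one matrix product
```
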
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time the bootstrapping and orthogonalization use is of interest. If these are of significant proportion, they are worth performing in parallel.
The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results: computation time (seconds) against size of expression matrix (rows), for overall gene-shaving, bootstrapping and orthogonalization
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times (seconds) against size of expression matrix (rows) for the bootstrapping, orthogonalization and overall computations, under the serial, 2-node and 16-node environments
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
S = T_serial / T_parallel

S_2nodes = 13011.8 / 8253.4 = 1.6

S_16nodes = 13011.8 / 5495.9 = 2.4
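These ratios can be reproduced directly from the reported wall-clock times (a minimal sketch, using the timings quoted above):

```r
## Observed speedup S = T_serial / T_parallel from the reported timings
t.serial <- 13011.8
round(t.serial / 8253.4, 1)  # two nodes: 1.6
round(t.serial / 5495.9, 1)  # sixteen nodes: 2.4
```
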
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster; each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS spring joint computer conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4–7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: Getting help.
Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals
The manuals described in this Section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.
I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹ http://CRAN.R-project.org
² http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.
Occasionally, library(help=PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g. introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
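For example (the search terms below are illustrative, and the results depend on the packages installed, so no output is shown):

```r
## Searching the installed documentation (illustrative calls)
hits <- help.search("linear models")  # fuzzy search in names, titles, keywords
apropos("mean")                       # objects partially matching "mean"
apropos("^var")                       # regular expressions work as well
```
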
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails; the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might be still hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions already have been asked – and answered – several times (focussing on slightly different aspects, hence these are illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³ http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data filesscripts with all the code from each chapter of thebook and a few additional (rough) chapters Thescripts pre-suppose that the data sets have alreadybeen imported into S-Plus by some mechanismndashnoexplanation of how to do this is givenndashpresentingthe novice S-Plus user with a minor obstacle Oncethe data is loaded the scripts can be used to re-dothe analyses in the book Nearly every worked ex-ample begins by attaching a data frame and thenworking with the variable names of the data frameNo detach command is present in the scripts so thework environment can become cluttered and evencause errors if multiple data frames have commoncolumn names Avoid this problem by quitting theS-Plus (or R) session after working through the scriptfor each chapter
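The clutter the reviewer describes is easy to reproduce. A minimal sketch, using two hypothetical data frames d1 and d2 that share a column name:

```r
# Two data frames sharing the column name 'x' (hypothetical example)
d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)

attach(d1)
attach(d2)   # recent R versions warn that 'x' from d1 is masked
mean(x)      # resolves to d2$x (attached last), giving 5

# Detaching in reverse order restores the original search path
detach(d2)
detach(d1)
search()     # confirm the data frames are gone
```

Quitting the session, as the reviewer suggests, has the same cleansing effect as the explicit detach() calls.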
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R) and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

R News, ISSN 1609-3631, Vol. 3/1, June 2003
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002) but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.

As R itself was originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package) and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people would expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
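The deparse(substitute()) idiom mentioned above can be sketched as follows (myplot is a hypothetical function name, not one from the book):

```r
# The default for 'main' is the text of the expression the caller
# supplied for 'x', captured before evaluation
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
}

heights <- rnorm(20)
myplot(heights)   # plot title defaults to the string "heights"
```

This works because default arguments are evaluated lazily, inside the function, where substitute(x) still sees the caller's unevaluated expression.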
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data and the F test for equality of variances.

Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred that the coverage of variable selection in multiple linear regression be omitted entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191 the logit should be defined as log[p/(1-p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and the corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
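Two of the user-visible changes above can be illustrated in a short session (a sketch; the seed is arbitrary):

```r
# Restore the pre-1.7.0 default generators if old simulation
# results must be reproduced exactly
RNGversion("1.6.2")
set.seed(1)
old <- runif(2)

# Switch back to the new defaults ('Mersenne-Twister' + 'Inversion')
RNGversion("1.7.0")

# class() is now never NULL; oldClass() still exposes the raw
# class attribute
x <- 1:3
class(x)      # "integer"
oldClass(x)   # NULL -- no class attribute is actually set
```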
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.
• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

  load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument, allowing one not to compute the casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now invisibly returns a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
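A few of the new functions listed above can be sketched in a short session (return values noted only where they are determined by the semantics):

```r
# mapply(): a multivariate lapply(); unequal result lengths
# prevent simplification, so a list is returned
mapply(rep, 1:3, 3:1)          # list(c(1,1,1), c(2,2), 3)

# Recursive list indexing: x[[c(4, 2)]] is x[[4]][[2]]
x <- list(1, 2, 3, list("a", "b"))
x[[c(4, 2)]]                   # "b"

# r2dtable(): one random 2x2 table with row sums c(3, 2)
# and column sums c(2, 3)
r2dtable(1, r = c(3, 2), c = c(2, 3))

# try(silent = TRUE): capture a failure without printing it
res <- try(stop("oops"), silent = TRUE)
inherits(res, "try-error")     # TRUE
```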
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages, using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages, using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error: this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, which is included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad

climpact The package contains R functions for retrieving data, making climate analysis, and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad
dispmod Functions for modelling dispersion in GLM. By Luca Scrucca

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others
hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson

locfit Local Regression, Likelihood and density estimation. By Catherine Loader

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas

normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with numbered squares omitted; see the URL above for a printable version.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
The value of IPx is passed to each child. Each child then applies IPx to each row of X that the master has given it.

There are two ways in which X can be distributed. The matrix can be split into approximately even portions, where the number of portions is equal to the number of slaves. Alternatively, the matrix can be distributed on a row-by-row basis, where each slave processes a row, sends it back, is given a new row to process, and so on. The latter method becomes slow and inefficient, as more sending, receiving and unpacking of buffers is required. Such a method may become useful where the number of rows of the matrix is small and the number of columns is very large.
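The even-portions scheme can be sketched in a few lines of plain R (a hypothetical illustration of the idea, not part of the article's code):

```r
## Split the row indices 1..nrows into one roughly even block
## per slave; each block of rows is then sent to one slave.
block.indices <- function(nrows, nslaves) {
  size <- ceiling(nrows / nslaves)
  split(1:nrows, rep(1:nslaves, each = size, length.out = nrows))
}
block.indices(10, 3)  # blocks 1:4, 5:8, 9:10
```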
The RPVM code
Li and Rossini (2001) have explained the initial steps of starting the PVM daemon and spawning RPVM tasks from within an R program. These steps precede the following code. The full source code is available from http://mcs.une.edu.au/~bcarson/RPVM/.
Bootstrapping
Bootstrapping can be performed in parallel by treating each bootstrap sample as a child process. The number of children is the number of bootstrap samples (B), and the matrix is copied to each child (1, ..., B). Child process i permutes elements within each row of X, calculates R*2 and returns this to the master process. The master averages R*2, i = 1, ..., B, and computes the gap statistic.

The usual procedure for sending objects to PVM child processes is to initialize the send buffer, pack the buffer with data, and send the buffer. Our function parbootstrap is a function used by the master process, and includes .PVM.initsend() to initialize the send buffer, .PVM.pkdblmat(X) to pack the buffer, and .PVM.mcast(boot.children, 0) to send the data to all children:
parbootstrap <- function(X) {
  brsq <- 0
  tasks <- length(boot.children)
  .PVM.initsend()
  # send a copy of X to everyone
  .PVM.pkdblmat(X)
  .PVM.mcast(boot.children, 0)
  # get the R squared values back and
  # take the mean
  for (i in 1:tasks) {
    .PVM.recv(boot.children[i], 0)
    rsq <- .PVM.upkdouble(1, 1)
    brsq <- brsq + (rsq/tasks)
  }
  return(brsq)
}
After the master has sent the matrix to the children, it waits for them to return their calculated R2 values. This is controlled by a simple for loop; as the master receives the answer via .PVM.recv(boot.children[i], 0), it is unpacked from the receive buffer with

rsq <- .PVM.upkdouble(1, 1)

The value obtained is used in calculating an overall mean of all R2 values returned by the children. In the slave tasks, the matrix is unpacked, a bootstrap permutation is performed, and an R2 value is calculated and returned to the master. The following code performs these operations of the children tasks:
# receive the matrix
.PVM.recv(.PVM.parent(), 0)
X <- .PVM.upkdblmat()
Xb <- X
dd <- dim(X)
# perform the permutation
for (i in 1:dd[1]) {
  permute <- sample(1:dd[2], replace = FALSE)
  Xb[i, ] <- X[i, permute]
}
# send it back to the master
.PVM.initsend()
.PVM.pkdouble(Dfunct(Xb))
.PVM.send(.PVM.parent(), 0)
.PVM.upkdblmat() is used to unpack the matrix. After the bootstrap is performed, the send is initialized with .PVM.initsend() (as seen in the master), the result of Dfunct(Xb) (the R2 calculation) is packed into the send buffer, and then the answer is sent with .PVM.send() to the parent of the slave, which will be the master process.
Orthogonalization
The orthogonalization in the master process is handled by the function parorth. This function takes as arguments the IPx matrix and the expression matrix X. The first step of this procedure is to broadcast the IPx matrix to all slaves, by first packing it and then sending it with .PVM.mcast(orth.children, 0). The next step is to split the expression matrix into portions and distribute them to the slaves. For simplicity, this example divides the matrix by the number of slaves, so the row size must be divisible by the number of children. Using even blocks on a homogeneous cluster will work reasonably well, as the load-balancing issues outlined in Rossini et al. (2003) are more relevant to heterogeneous clusters. The final step performed by the master is to receive these portions
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:
parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for (i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if (end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for (i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if (i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master:
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for (i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results
Serial gene-shaving
The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time that the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations are used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
Figure 3: Serial gene-shaving results. [Line plot of computation time (seconds) against size of expression matrix (rows, 2000-12800) for the overall gene-shaving run, the bootstrapping step, and the orthogonalization step.]
Clearly the bootstrapping (blue) step is quite time consuming, the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.

We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e. a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.

Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
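These estimates can be reproduced with a one-line R function implementing Amdahl's law (our sketch, not part of the article's code):

```r
## Amdahl's law: maximum speedup when a fraction P of the
## work runs in parallel on n processors.
amdahl <- function(P, n) 1 / ((1 - P) + P / n)

amdahl(0.78, 2)   # about 1.6 (cf. the 1.63 quoted in the text)
amdahl(0.78, 10)  # about 3.4 (cf. the 3.35 quoted in the text)
```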
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
Figure 4: Comparison of computation times for variable sized expression matrices, for the overall algorithm, orthogonalization, and bootstrapping, under the different environments (serial, 2 nodes, 16 nodes). [Three panels: Bootstrap Comparisons, Orthogonalization Comparisons, and Overall Comparisons; each plots computation time (seconds) against size of expression matrix (rows).]
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:
S = Tserial / Tparallel

S(2 nodes) = 1301.18 / 825.34 = 1.6

S(16 nodes) = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to more time than expected being spent computing the bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node, other than the node which runs the master process, will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and different R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an 'instruction' on how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN1. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core for short, 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well2.
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
1 http://CRAN.R-project.org/
2 http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the 'optimal' choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply them in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching, using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice – it also accepts regular expressions.
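For example (a hypothetical session; the results of the search functions depend on the installed packages):

```r
?median                   # open the help page for median()
help("median")            # the same, using the function form
help.search("quantile")   # search names, titles and keywords
apropos("^med")           # objects whose names start with "med"
```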
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.

Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search3 provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists4: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list5. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked (and answered) several times, focussing on slightly different aspects and hence illuminating a bigger context.
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned from which one benefits in other applications.
3 http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
4 accessible via http://www.R-project.org/mail.html
5 archives are searchable, see http://CRAN.R-project.org/search.html
R News ISSN 1609-3631
Vol. 3/1, June 2003
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.

The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.

A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.

SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.

Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
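As an illustration of one such substitution (my sketch, not from the review): where the book's S-Plus scripts use multicomp for pairwise comparisons, an R user can fit the model with aov() and call TukeyHSD():

```r
# Pairwise comparisons in R via Tukey's honest significant differences,
# in place of S-Plus's multicomp(); the PlantGrowth data set ships with R.
fit <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(fit)   # confidence intervals for all pairwise group differences
```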
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley, Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course "Basic Statistics for Health Researchers" at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset() and transform(), among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as the default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory, such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option defaultPackages has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
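A small sketch (my example, not part of the NEWS entry) of how the new defaults interact with RNGversion():

```r
set.seed(42)
RNGkind()            # first two entries: "Mersenne-Twister", "Inversion"
x <- runif(3)

# Restore the generators of an earlier R version to reproduce old results:
RNGversion("1.6.2")
RNGkind()            # now reports the pre-1.7.0 generators
RNGversion("1.7.0")  # and back to the new defaults
```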
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)
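For example (my illustration of the padding behaviour):

```r
x <- 1:3
names(x) <- c("a", "b")  # too short: the names vector is padded with NA
names(x)                 # "a" "b" NA
```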
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588.)
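For instance (my illustration), the recursive index may also be a character vector of names:

```r
x <- list(a = 1, b = list(c = 2, d = 3))
x[[c(2, 2)]]      # same as x[[2]][[2]], i.e. 3
x[[c("b", "d")]]  # recursive indexing by name also works
```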
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument, allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
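A quick sketch of mapply() in action (my example):

```r
# mapply() applies a function to the elements of several arguments in parallel,
# simplifying the result to a vector or matrix where possible:
mapply(function(x, y) x + y, 1:3, c(10, 20, 30))  # 11 22 33

# When the results have unequal lengths, a list is returned:
mapply(rep, 1:3, 3:1)
```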
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main' and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n×m inputs with m >> n.

• princomp() no longer allows the use of more variables than units; use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
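For example (my sketch), the generated tables honour the requested marginals exactly:

```r
set.seed(1)
# One random 2x2 table with row sums (3, 7) and column sums (5, 5):
tab <- r2dtable(1, r = c(3, 7), c = c(5, 5))[[1]]
rowSums(tab)   # 3 7
colSums(tab)   # 5 5
```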
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
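A short sketch (my example) of the two functions:

```r
set.seed(7)
# Five multinomial draws of size 10 over three cells: a 3 x 5 count matrix.
draws <- rmultinom(5, size = 10, prob = c(0.2, 0.3, 0.5))
colSums(draws)   # each column (draw) sums to the given size, 10

# Probability of observing counts (2, 3, 5) under those cell probabilities:
dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))
```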
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass() or setMethod(). New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.
bull Index generation now happens when installingsource packages using R code in package toolsAn existing rsquoINDEXrsquo file is used as is other-wise it is automatically generated from the
name and title entries in the Rd files Datademo and vignette indices are computed fromall available files of the respective kind andthe corresponding index information (in theRd files the rsquodemo00Indexrsquo file and theVignetteIndexEntry entries respectively)These index files as well as the package Rdcontents data base are serialized as R ob-jects in the rsquoMetarsquo subdirectory of the top-level package directory allowing for faster andmore reliable index-based computations (egin helpsearch())
bull The Rd contents data base is now computedwhen installing source packages using R codein package tools The information is repre-sented as a data frame without collapsing thealiases and keywords and serialized as an Robject (The rsquoCONTENTSrsquo file in Debian Con-trol Format is still written as it is used by theHTML search engine)
bull A NAMESPACE file in root directory of asource package is copied to the root of the pack-age installation directory Attempting to in-stall a package with a NAMESPACE file usinglsquo--saversquo signals an error this is a temporarymeasure
Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.
If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• print.NoClass() in methods is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava: Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.
amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.
anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.
climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.
effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.
genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.
gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regressions analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.
hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.
ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.
locfit Local Regression, Likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.
multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.
normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.
pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.
polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass, and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini, and Na Li.
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk's event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with cells numbered 1-27; not reproducible in this text version.]
Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (13,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events
DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
back from the slaves and reconstruct the matrix. Order of rows is maintained by sending blocks of rows and receiving them back in the order the children sit in the array of task ids (orth.children). The following code is the parorth function used by the master process:
parorth <- function(IPx, X) {
  rows <- dim(X)[1]
  tasks <- length(orth.children)
  # initialize the send buffer
  .PVM.initsend()
  # pack the IPx matrix and send it to
  # all children
  .PVM.pkdblmat(IPx)
  .PVM.mcast(orth.children, 0)
  # divide the X matrix into blocks and
  # send a block to each child
  nrows <- as.integer(rows/tasks)
  for(i in 1:tasks) {
    start <- (nrows * (i-1)) + 1
    end <- nrows * i
    if(end > rows)
      end <- rows
    Xpart <- X[start:end, ]
    .PVM.pkdblmat(Xpart)
    .PVM.send(orth.children[i], start)
  }
  # wait for the blocks to be returned
  # and re-construct the matrix
  for(i in 1:tasks) {
    .PVM.recv(orth.children[i], 0)
    rZ <- .PVM.upkdblmat()
    if(i == 1)
      Z <- rZ
    else
      Z <- rbind(Z, rZ)
  }
  dimnames(Z) <- dimnames(X)
  Z
}
The slaves execute the following code, which is quite similar to the code of the bootstrap slaves. Each slave receives both matrices, then performs the orthogonalization and sends back the result to the master.
# wait for the IPx and X matrices to arrive
.PVM.recv(.PVM.parent(), 0)
IPx <- .PVM.upkdblmat()
.PVM.recv(.PVM.parent(), -1)
Xpart <- .PVM.upkdblmat()
Z <- Xpart
# orthogonalize X
for(i in 1:dim(Xpart)[1]) {
  Z[i, ] <- Xpart[i, ] %*% IPx
}
# send back the new matrix Z
.PVM.initsend()
.PVM.pkdblmat(Z)
.PVM.send(.PVM.parent(), 0)
Gene-shaving results

Serial gene-shaving

The results for the serial implementation of gene-shaving are a useful guide to the speedup that can be achieved from parallelizing. The proportion of the total computation time the bootstrapping and orthogonalization use is of interest: if these are of significant proportion, they are worth performing in parallel.

The results presented were obtained from running the serial version on a single cluster node. RPVM is not used in the serial version, and the program contains no parallel code at all. The data used are randomly generated matrices of size N = 1600, 3200, 6400 and 12800 respectively, with a constant p of size 80, to simulate variable-sized micro-arrays. In each case the first two clusters are found, and ten permutations used in each bootstrap. Figure 3 is a plot of the overall computation time, the total time of all invocations of the bootstrapping process, and the total time of all invocations of the orthogonalization process.
[Plot: computation time (seconds) versus size of expression matrix (rows), with curves for gene-shaving, bootstrapping and orthogonalization.]
Figure 3: Serial gene-shaving results
Clearly the bootstrapping (blue) step is quite time consuming, with the mean proportion being 0.78 of the time taken to perform gene-shaving (red). The orthogonalization (black) equates to no more than 1%, so any speedup through parallelization is not going to have any significant effect on the overall time.
We have two sets of parallel results: one using two cluster nodes, the other using all sixteen nodes of the cluster. In the two node runs we could expect a reduction in the bootstrapping time to be no more than half, i.e., a speedup of 2. In full cluster runs we could expect a speedup of no more than 10, as we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35 respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
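As a quick cross-check of these figures, Amdahl's law can be evaluated directly in R. This is a minimal sketch: the helper name amdahl is ours, and P = 0.78 is the parallel proportion measured above.

```r
# Amdahl's law: maximum speedup when a proportion P of the work
# is parallelized over n processors, S = 1 / ((1 - P) + P / n)
amdahl <- function(P, n) 1 / ((1 - P) + P / n)

amdahl(0.78, 2)   # approximately 1.64
amdahl(0.78, 10)  # approximately 3.36
```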
Parallel gene-shaving

The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and all sixteen nodes use this same R program. In both cases ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU intensive task of the entire algorithm.
[Three plots of computation time (seconds) versus size of expression matrix (rows): bootstrap comparisons, orthogonalization comparisons, and overall comparisons, each for the serial, 2-node and 16-node runs.]
Figure 4: Comparison of computation times for variable sized expression matrices for overall, orthogonalization and bootstrapping under the different environments
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = T_serial / T_parallel

S_2nodes = 13011.8 / 8253.4 = 1.6

S_16nodes = 13011.8 / 5495.9 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions

Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography

Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS spring joint computer conference. Sunnyvale, California: IBM.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.

Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.

Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.

Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.

Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Technical Report Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.

Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.

Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help - R's Help Facilities and Manuals

Uwe Ligges

Introduction

This issue of the R Help Desk deals with its probably most fundamental topic: getting help.

Different amounts of knowledge about R and R related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" how to use these facilities.

A secondary objective of the article is to point out that manuals, help functions and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions and search for information in mailing list archives.
Manuals

The manuals described in this Section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself, but also for contributed packages.

I think it is not possible to use adequately a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003), and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.

Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².

Users who gained some experiences might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c, still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++, or Fortran code, and compiling and linking this code into R.

¹http://CRAN.R-project.org/
²http://CRAN.R-project.org/faqs.html
Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.

Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions

Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.

When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice - it also accepts regular expressions.
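For instance, the two search functions just described can be used as follows (a small sketch; the search strings are ours, and the matches returned depend on the installed packages):

```r
# full-text search over name, title and keyword entries
# of all installed help pages
res <- help.search("linear model")

# find objects on the search path by partial name;
# regular expressions are accepted as well
apropos("mean")
apropos("^mea")  # names starting with "mea"
```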
Documentation and help in HTML format canalso be accessed by helpstart() This functionsstarts a Web browser with an index that links toHTML versions of the manuals mentioned above theFAQs a list of available packages of the current in-stallation several other information and a page con-taining a search engine and keywords On the lat-ter page the search engine can be used to ldquosearchfor keywords function and data names and text inhelp page titlesrdquo of installed packages The list of
keywords is useful when the search using other fa-cilities fails the user gets a complete list of functionsand data sets corresponding to a particular keyword
Of course, it is not possible to search for functionality of (unknown) contributed packages or documentation of packages (and functions or data sets defined therein) that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search3 provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists4: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list5. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked – and answered – several times (focusing on slightly different aspects, and hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
3 http://finzi.psych.upenn.edu/search.html, see also http://CRAN.R-project.org/search.html
4 accessible via http://www.R-project.org/mail.html
5 archives are searchable, see http://CRAN.R-project.org/search.html
R News ISSN 1609-3631
Vol. 3/1, June 2003
Bibliography
Hornik, K. (2003): The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a): R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b): R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c): R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d): Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000): S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002): Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003): An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as the bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts presuppose that the data sets have already been imported into S-Plus by some mechanism – no explanation of how to do this is given – presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R was itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication will be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expect to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide very nice, brief summaries of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but provides enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both the language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be used directly in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from its target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as do the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588.)
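A small sketch of the new recursive list indexing:

```r
# A nested list: the fourth element is itself a list
x <- list(1, 2, 3, list("a", "b"))

# Indexing with a vector descends into the structure:
x[[c(4, 2)]]        # same as x[[4]][[2]], i.e. "b"
```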
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
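For illustration, a sketch of capture.output() using the built-in 'cars' data set:

```r
# Capture printed output as a character vector, one element per line
txt <- capture.output(summary(lm(dist ~ speed, data = cars)))
cat(txt[1:3], sep = "\n")   # first lines of the printed summary
```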
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that, when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns, invisibly, a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().
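As a quick sketch of mapply(), applying rep() element-wise over two vectors:

```r
# rep(1, 3), rep(2, 2) and rep(3, 1) in one call;
# results of unequal length are returned as a list
res <- mapply(rep, 1:3, 3:1)
res
```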
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units; use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the generated documentation shell is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for that class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
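A brief sketch of the two new multinomial functions:

```r
set.seed(1)
# Two random draws of 10 items into three cells with the given
# probabilities; each draw is one column of the result
m <- rmultinom(n = 2, size = 10, prob = c(0.2, 0.3, 0.5))
colSums(m)          # each column sums to 10

# Probability of one particular outcome
d <- dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))
```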
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NAs, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
bull Formal classes and methods can be lsquosealedrsquoby using the corresponding argument tosetClass or setMethod New functionsisSealedClass() and isSealedMethod() testsealing
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct: use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• printNoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.
• New entry point R_GetX11Image, and formerly undocumented ptr_R_GetX11Image, are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by interface author.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.
amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.
anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.
climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.
effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.
genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML A maximum likelihood approach to mixed models. By Göran Broström.
gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.
hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996), and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.
ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.
locfit Local regression, likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.
multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.
normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.
plspcr Multivariate regression by PLS and PCR. By Ron Wehrens.
polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object, or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid with numbered cells 1-27; not reproduced here.]
Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across
Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences over 90% of the stuff is not interesting to me; at this one over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley.
Editor Programmer's Niche:
Bill Venables
Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
we can use ten of the sixteen processors (one processor for each permutation). It is now possible to estimate the expected maximum speedup of parallelization with Amdahl's law, as described earlier, for expression matrices of size N = 12800 and p = 80, given the parallel proportion of 0.78.
Application of Amdahl's law for the cases (n = 2, 10) gives speedups of 1.63 and 3.35, respectively. We have implemented the orthogonalization in parallel more for demonstration purposes, as it could become useful with very large matrices.
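The quoted estimates can be checked directly; a small R sketch of Amdahl's law with parallel fraction P = 0.78 (the helper name amdahl is ours):

```r
# Amdahl's law: speedup on n processors when a fraction P of the
# work is parallelizable.
amdahl <- function(n, P = 0.78) 1 / ((1 - P) + P / n)
amdahl(2)    # 1.6393...
amdahl(10)   # 3.3557...
```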
Parallel gene-shaving
The parallel version of gene-shaving only contains parallel code for the bootstrapping and orthogonalization procedures; the rest of the process remains the same as the serial version. Both the runs using two nodes and those using all sixteen nodes use this same R program. In both cases, ten tasks are spawned for bootstrapping and sixteen tasks are spawned for orthogonalization. Clearly this is an under-utilisation of the Beowulf cluster in the sixteen-node runs, as most of the nodes are going to see quite a lot of idle time, and in fact six of the nodes will not be used to compute the bootstrap, the most CPU-intensive task of the entire algorithm.
[Figure 4: three panels, "Bootstrap Comparisons", "Orthogonalization Comparisons" and "Overall Comparisons", each plotting computation time (seconds) against the size of the expression matrix (rows); the overall panel compares serial, 2-node and 16-node runs.]
Figure 4: Comparison of computation times for variable-sized expression matrices, for overall, orthogonalization and bootstrapping, under the different environments.
The observed speedup for the two runs of parallel gene-shaving using the N = 12800 expression matrix is as follows:

S = T_serial / T_parallel

S_2nodes = 1301.18 / 825.34 = 1.6

S_16nodes = 1301.18 / 549.59 = 2.4
The observed speedup for two nodes is very close to the estimate of 1.63. The results for sixteen nodes are not as impressive, due to greater time spent computing than expected for bootstrapping. Although there is a gain with the orthogonalization here, it is barely noticeable in terms of the whole algorithm. The orthogonalization is only performed once for each cluster found, so it may well become an issue when searching for many clusters on much larger datasets than covered here.
Conclusions
Whilst the gene-shaving algorithm is not completely parallelizable, a large portion of it (almost 80%) can be implemented to make use of a Beowulf cluster. This particular problem does not fully utilise the cluster: each node other than the node which runs the master process will see a significant amount of idle time. Regardless, the results presented in this paper show the value and potential of Beowulf clusters in processing large sets of statistical data. A fully parallelizable algorithm which can fully utilise each cluster node will show greater performance over a serial implementation than the particular example we have presented.
Bibliography
Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference. Sunnyvale, California: IBM.
Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A parallel workstation for scientific computation. In International Conference on Parallel Processing.
Do, K.-A. and Wen, S. (2002). Documentation. Technical report, Department of Biostatistics, MD Anderson Cancer Center. http://odin.mdacc.tmc.edu/~kim/.
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1994). PVM: Parallel Virtual Machine. A user's guide and tutorial for networked parallel computing. MIT Press.
Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000a). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). Gene shaving: a new class of clustering methods for expression arrays. Technical report, Stanford University.
Li, M. N. and Rossini, A. (2001). RPVM: Cluster statistical computing in R. R News, 1(3):4-7.
Message Passing Interface Forum (1997). MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org/docs/docs.html.
Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel statistical computing in R. Working Paper 193, UW Biostatistics Working Paper Series. http://www.bepress.com/uwbiostat/paper193.
Scyld Computing Corporation (1998). Beowulf introduction & overview. http://www.beowulf.org/intro.html.
Tierney, L. (2002). Simple Network of Workstations for R. Technical report, School of Statistics, University of Minnesota. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.
Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10-14.
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help – R's Help Facilities and Manuals
by Uwe Ligges
Introduction
This issue of the R Help Desk deals with its probably most fundamental topic: getting help.
Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore, a description of the available manuals, help functions within R, and mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or the available R version is outdated, recent versions of the manuals are also available from CRAN.¹ Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well.²
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g., objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core,
¹ http://CRAN.R-project.org/
² http://CRAN.R-project.org/faqs.html
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and on compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp = TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp = TRUE) in file '.Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title, or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
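For example (a small sketch; the exact matches returned depend on the packages installed):

```r
# apropos() matches object names in attached packages by regular
# expression; "^mean" finds mean() and its methods.
apropos("^mean")

# help.search() scans names, titles and keywords of the installed
# documentation, with fuzzy matching by default.
res <- help.search("linear model")
class(res)   # "hsearch"
```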
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists:⁴ r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list.⁵ These archives contain a huge amount of information and may be much more helpful than a direct answer on a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers on r-help will probably solve the asker's current problem, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³ http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴ accessible via http://www.R-project.org/mail.html
⁵ archives are searchable, see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.
R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.
R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.
R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.
R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.
Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. 4th edition. New York: Springer-Verlag.
Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe LiggesFachbereich Statistik Universitaumlt Dortmund Germanyliggesstatistikuni-dortmundde
Book Reviews

Michael Crawley: Statistical Computing. An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002
770 pages, ISBN 0-471-56040-5
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism -- no explanation of how to do this is given -- presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
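The masking problem described here is easy to reproduce in R; the data frames below are invented for illustration. A safer habit than attaching everything is to detach as soon as the analysis of one data set is finished, or to avoid attach altogether:

```r
# Two hypothetical data frames sharing a column name 'x'
d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = c(10, 20, 30))

attach(d1)
attach(d2)          # masks d1's x; R warns about the masked object
mean(x)             # refers to d2$x, perhaps not what was intended

# Cleaning up avoids a cluttered search path
detach(d2)
detach(d1)

# Alternative that needs no cleanup at all
with(d1, mean(x))
```

Using with() scopes the variable lookup to a single expression, so nothing is left on the search path afterwards.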
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
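As a sketch of the kind of translation involved, the multiple-comparison step that S-Plus handles with multicomp can be done in R with TukeyHSD() on a fitted aov object (the data here are made up for illustration):

```r
# Toy one-way layout, invented for illustration
set.seed(1)
dat <- data.frame(
  y = c(rnorm(5, 0), rnorm(5, 1), rnorm(5, 3)),
  g = factor(rep(c("a", "b", "c"), each = 5))
)

fit <- aov(y ~ g, data = dat)
TukeyHSD(fit)   # pairwise confidence intervals, R's analogue of multicomp
```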
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.

Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for data sets in the package), and Compendium (short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset, transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
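The deparse(substitute()) idiom mentioned here lets a default argument be computed from the unevaluated expression the caller passed in; the plotting wrapper below is a made-up sketch of the pattern, not code from the book:

```r
# Hypothetical wrapper: the default title is the expression passed as 'x'
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
}

heights <- rnorm(20)
myplot(heights)                          # title becomes "heights" automatically
myplot(heights, main = "Custom title")   # an explicit value still wins
```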
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team

User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
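The RNGversion() switch mentioned in this list can be sketched as follows; the generator names are those documented for pre-1.7.0 releases, and in recent versions of R the call emits a compatibility warning, hence the suppressWarnings() wrapper:

```r
# Restore the generators that were the default before R 1.7.0
# (uniform: Marsaglia-Multicarry, normal: Kinderman-Ramage)
suppressWarnings(RNGversion("1.6.2"))
RNGkind()            # first element is "Marsaglia-Multicarry"
set.seed(42)
old <- runif(3)      # reproduces a pre-1.7.0 simulation

# Return to the current defaults (Mersenne-Twister uniform generator)
suppressWarnings(RNGversion(as.character(getRversion())))
set.seed(42)
new <- runif(3)      # generally differs from 'old'
```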
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist; part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() {mva} allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for dist objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.
• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects -- all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() {package eda} now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekström.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.
• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
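A few of the new features above can be tried together in a short session; everything here uses only functions and behaviors named in this list:

```r
# Recursive list indexing: x[[c(4, 2)]] is shorthand for x[[4]][[2]]
x <- list(1, 2, 3, list(a = "skipped", b = "found"))
x[[c(4, 2)]]                                      # "found"

# mapply(), a multivariate lapply()
mapply(function(a, b) a + b, 1:3, c(10, 20, 30))  # 11 22 33

# capture.output() collects printed output as a character vector
out <- capture.output(print(1:3))                 # one line of text

# A too-short names value is padded with character NAs
v <- 1:3
names(v) <- "a"
names(v)                                          # "a" NA NA
```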
Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility of ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.
bull Index generation now happens when installingsource packages using R code in package toolsAn existing rsquoINDEXrsquo file is used as is other-wise it is automatically generated from the
name and title entries in the Rd files Datademo and vignette indices are computed fromall available files of the respective kind andthe corresponding index information (in theRd files the rsquodemo00Indexrsquo file and theVignetteIndexEntry entries respectively)These index files as well as the package Rdcontents data base are serialized as R ob-jects in the rsquoMetarsquo subdirectory of the top-level package directory allowing for faster andmore reliable index-based computations (egin helpsearch())
bull The Rd contents data base is now computedwhen installing source packages using R codein package tools The information is repre-sented as a data frame without collapsing thealiases and keywords and serialized as an Robject (The rsquoCONTENTSrsquo file in Debian Con-trol Format is still written as it is used by theHTML search engine)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.
If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
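A minimal illustration of the transition (behaviour as described above for R 1.7.0; exact warning text may differ):

```r
x _ 5    # old-style underscore assignment: still works, but warns on every use
x <- 5   # the recommended replacement
```

With R_NO_UNDERLINE set in the environment before starting R, the first line becomes a syntax error instead of a warning.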
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• print.NoClass() in package methods is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
R News ISSN 1609-3631
Vol. 3/1, June 2003
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
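For example, a package whose compiled code calls LAPACK could pick these up in its 'src/Makevars' as follows (a sketch; the make variables are substituted by R's build system as described above):

```make
## src/Makevars (sketch): link the package's compiled code
## against R's LAPACK shared library and the BLAS in use
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS)
```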
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
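A sketch of such a pair of routines for a hypothetical shared library 'mypkg' (the names follow the R_init_<dll name>/R_unload_<dll name> convention described above; the function bodies are placeholders, and 'Writing R Extensions' is the authoritative reference for the registration API):

```c
#include <R_ext/Rdynload.h>

/* called automatically by dyn.load() */
void R_init_mypkg(DllInfo *info)
{
    /* e.g., register .C()/.Call() routines via the registration mechanism */
}

/* called automatically by dyn.unload() */
void R_unload_mypkg(DllInfo *info)
{
    /* release any resources acquired in R_init_mypkg() */
}
```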
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by the interface author.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.
amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.
anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.
climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.
effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.
genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.
gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.
hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996), and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.
ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.
locfit Local Regression, Likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.
multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.
normalp A collection of utilities relating to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.
pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.
polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML A collection of S functions that provide an interface for creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but
most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array, or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events: DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität in Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest on the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board:
Douglas Bates and Thomas Lumley
Editor Programmer's Niche:
Bill Venables
Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Brett Carson, Robert Murison, Ian A. Mason
The University of New England, Australia
bcarson@turing.une.edu.au
rmurison@turing.une.edu.au
iam@turing.une.edu.au
R Help Desk
Getting Help: R's Help Facilities and Manuals
Uwe Ligges
Introduction
This issue of the R Help Desk deals with probably its most fundamental topic: getting help.
Different amounts of knowledge about R and different R-related experiences require different levels of help. Therefore, a description of the available manuals, of the help functions within R, and of the mailing lists will be given, as well as an "instruction" on how to use these facilities.
A secondary objective of the article is to point out that manuals, help functions, and mailing list archives are commonly much more comprehensive than answers on a mailing list. In other words, there are good reasons to read manuals, use help functions, and search for information in mailing list archives.
Manuals
The manuals described in this section are shipped with R (in directory 'doc/manual' of a regular R installation). Nevertheless, when problems occur before R is installed, or when the available R version is outdated, recent versions of the manuals are also available from CRAN¹. Let me cite the "R Installation and Administration" manual by the R Development Core Team (R Core, for short; 2003b) as a reference for information on installation and updates, not only for R itself but also for contributed packages.
I think it is not possible to adequately use a programming language and environment like R without reading any introductory material. Fundamentals for using R are described in "An Introduction to R" (Venables et al., 2003) and the "R Data Import/Export" manual (R Core, 2003a). Both manuals are the first references for any programming tasks in R.
Frequently asked questions (and corresponding answers) are collected in the "R FAQ" (Hornik, 2003), an important resource not only for beginners. It is a good idea to look into the "R FAQ" when a question arises that cannot be answered by reading the other manuals cited above. Specific FAQs for Macintosh (by Stefano M. Iacus) and Windows (by Brian D. Ripley) are available as well².
Users who have gained some experience might want to write their own functions, making use of the language in an efficient manner. The manual "R Language Definition" (R Core, 2003c; still labelled as "draft") is a comprehensive reference manual for a couple of important language aspects, e.g.: objects, working with expressions, working with the language (objects), description of the parser, and debugging. I recommend this manual to all users who plan to work regularly with R, since it provides really illuminating insights.
For those rather experienced users who plan to create their own packages, write help pages, and optimize their code, "Writing R Extensions" (R Core, 2003d) is the appropriate reference. It also includes descriptions of interfacing to C, C++ or Fortran code, and of compiling and linking this code into R.

¹http://CRAN.R-project.org
²http://CRAN.R-project.org/faqs.html
Occasionally, library(help=PackageName) indicates that package-specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help or help("help"). The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. By default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in file 'Rprofile'.
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
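As a small illustration of these facilities (the search terms are arbitrary examples; the results depend on the packages installed):

```r
help("seq")           # documentation for seq(); equivalent to ?seq
help.search("seq")    # fuzzy search in names, titles and keywords
apropos("seq")        # objects whose names partially match "seq"
apropos("^is\\.")     # apropos() also accepts regular expressions
```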
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search³ provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals, and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists⁴: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list⁵. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from the voluntary help providers of r-help will probably solve the current problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
³http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
⁴accessible via http://www.R-project.org/mail.html
⁵archives are searchable; see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003). The R FAQ, version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a). R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b). R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c). R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d). Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000). S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003). An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews
Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus
John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5.
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize "graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing" (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics such as linear models, regression and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented, and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences, and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R
Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9.
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers
at the University of Copenhagen. A couple of caveats are stated in the Preface. Given the origin of the book, the examples used in the book are mostly, if not all, from the health sciences. The author also makes it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis
plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate the many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset() and transform(), among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
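The deparse(substitute()) idiom mentioned above lets a function label its output with the expression the caller actually typed. A minimal sketch (the function name label_of is made up for illustration):

```r
## Default argument computed from the caller's expression for 'x':
## substitute(x) recovers the unevaluated argument, deparse() turns
## it into a string.
label_of <- function(x, label = deparse(substitute(x))) label

height <- c(1.2, 1.5, 1.8)
label_of(height)        # "height"
label_of(log(height))   # "log(height)"
```

Plot functions use the same trick to supply default axis titles from the argument expressions.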
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from one-way ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to two-way ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

R News ISSN 1609-3631, Vol. 3/1, June 2003
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the coverage of variable selection in multiple linear regression to be omitted entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191, the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory
data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen(), and svd() now use LAPACK routines, unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
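A sketch of the back-compatibility switch: code whose published results depend on the pre-1.7.0 generators can restore them explicitly (the version string "1.6.2" here stands for whichever earlier release is to be reproduced):

```r
## Restore the generators that were the default before R 1.7.0
## ("Marsaglia-Multicarry" for uniforms, "Kinderman-Ramage" for
## normals); recent R versions emit a compatibility warning.
RNGversion("1.6.2")
set.seed(123)
old_draws <- runif(3)   # matches pre-1.7.0 output for this seed

## Switch back to the running version's defaults
## ("Mersenne-Twister" / "Inversion" since 1.7.0).
RNGversion(paste(R.version$major, R.version$minor, sep = "."))
RNGkind()               # reports the generators currently in use
```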
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358)
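A quick illustration of the padding behaviour:

```r
x <- 1:4
names(x) <- c("a", "b")   # shorter than x: padded with NA

names(x)                  # "a" "b" NA NA
x[["a"]]                  # named access works for the named elements
```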
• It is now recommended to use the 'SystemRequirements' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)
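The recursive form also accepts a character vector of names, one per nesting level. A small sketch:

```r
lst <- list(a = 1, b = list(c = 2, d = list(e = 3)))

lst[[c(2, 2, 1)]]        # same as lst[[2]][[2]][[1]], i.e. 3
lst[[c("b", "d", "e")]]  # name-based equivalent, also 3
```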
• Most of the time series functions now check explicitly for a numeric time series, rather than failing at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC), and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
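By default the printed output is captured into a character vector, one element per line. A minimal sketch:

```r
## Capture what summary() would print at the console
out <- capture.output(summary(c(1, 2, 10)))

is.character(out)   # TRUE: each element is one printed line
writeLines(out)     # reprint the captured text verbatim
```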
• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace(), and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so it can read compressed saves from suitable connections.
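As a sketch of the round trip, any binary-mode connection can be wrapped so that bytes are gzip-compressed on the way out and transparently decompressed on the way back in:

```r
f <- tempfile()

## write gzip-compressed bytes through an ordinary file connection
out <- gzcon(file(f, "wb"))
writeBin(as.raw(1:100), out)
close(out)

## reading back through gzcon() decompresses transparently
inp <- gzcon(file(f, "rb"))
bytes <- readBin(inp, what = "raw", n = 100)
close(inp)

identical(bytes, as.raw(1:100))   # TRUE
unlink(f)
```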
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues(), and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance(), and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it FALSE gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing one not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
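mapply() applies a function to the first elements of each argument, then the second elements, and so on, simplifying the result where possible. A small sketch:

```r
## element-wise sums of two vectors
mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9

## arguments of different lengths are recycled as in lapply();
## here rep() is called as rep("a", 1), rep("b", 2), rep("c", 3)
mapply(rep, x = c("a", "b", "c"), times = 1:3)
```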
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels, if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy(), and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
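A small sketch: each generated table is a matrix whose row and column sums match the requested marginals exactly:

```r
set.seed(1)
## two random 2x2 tables with row sums (3, 7) and column sums (4, 6)
tabs <- r2dtable(2, r = c(3, 7), c = c(4, 6))

tabs[[1]]            # one sampled table
rowSums(tabs[[1]])   # 3 7
colSums(tabs[[1]])   # 4 6
```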
• rchisq() now has a non-centrality parameter 'ncp', and there is a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram(), and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
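A sketch of the pair in use: rmultinom() draws whole count vectors, one per column, and dmultinom() evaluates the corresponding probability mass function:

```r
set.seed(7)
## 3 draws of 10 items into categories with probabilities 0.2, 0.3, 0.5
x <- rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))

x            # 3 x 3 matrix of counts; each column sums to 10
colSums(x)   # 10 10 10

## probability of observing exactly the counts (2, 3, 5)
dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))
```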
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.
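The 'right' argument only changes the value taken at the knots, i.e. whether the intervals are treated as closed on the left (the default) or on the right. A minimal sketch:

```r
## single jump from 0 to 1 at x = 1
s_default <- stepfun(1, c(0, 1))                # intervals closed on the left
s_right   <- stepfun(1, c(0, 1), right = TRUE)  # intervals closed on the right

s_default(1)   # 1: at the knot the new value already applies
s_right(1)     # 0: at the knot the old value still applies
```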
• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().
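With silent = TRUE the error message is suppressed at the console but still recoverable from the returned object. A minimal sketch:

```r
res <- try(stop("boom"), silent = TRUE)   # nothing is printed

inherits(res, "try-error")                # TRUE: the failure is detectable
ok <- try(sqrt(4), silent = TRUE)         # non-errors pass through unchanged
```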
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib', and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo, and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine(), and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• print.NoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran(), and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays This isa generalization of cbind and rbind Takesa sequence of vectors matrices or arraysand produces a single array of the same orhigher dimension By Tony Plate and RichardHeiberger
ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel
amap Hierarchical Clustering and Principal Compo-nent Analysis (With general and robusts meth-ods) By Antoine Lucas
anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad
climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad
dispmod Functions for modelling dispersion inGLM By Luca Scrucca
effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox
gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway
genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch
glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging; sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

plspcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
R News ISSN 1609-3631
Vol. 3/1, June 2003
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk's event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword/ for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid (clues numbered 1-27) not reproduced here.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (13,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
2003d) is the appropriate reference. It also includes descriptions on interfacing to C, C++ or Fortran code, and compiling and linking this code into R.
Occasionally, library(help = PackageName) indicates that package specific manuals (or papers corresponding to packages), so-called vignettes, are available in directory 'library/PackageName/doc'.
Beside the manuals, several books dealing with R are available these days, e.g., introductory textbooks, books specialized on certain applications such as regression analysis, and books focussed on programming and the language. Here I do not want to advertise any of those books, because the "optimal" choice of a book depends on the tasks a user is going to perform in R. Nevertheless, I would like to mention Venables and Ripley (2000) as a reference for programming and the language, and Venables and Ripley (2002) as a comprehensive book dealing with several statistical applications and how to apply those in R (and S-PLUS).
Help functions
Access to the documentation on the topic help (in this case a function name) is provided by ?help. The "?" is probably the most frequently used way of getting help, whereas the corresponding function help() is rather flexible. As the default, help is shown as formatted text. This can be changed (by arguments to help(), or for a whole session with options()) to, e.g., HTML (htmlhelp=TRUE), if available. The HTML format provides convenient features like links to other related help topics. On Windows, I prefer to get help in Compiled HTML format by setting options(chmhelp=TRUE) in the file 'Rprofile'.
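A short interactive sketch of these alternatives (whether HTML help is available depends on how your R installation was built):

```r
## Plain text help, the usual way:
?lm
help("lm")              # equivalent functional form

## Ask for HTML rendering for a single call ...
help("lm", htmlhelp = TRUE)

## ... or switch the whole session to HTML help:
options(htmlhelp = TRUE)
```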
When exact names of functions or other topics are unknown, the user can search for relevant documentation (matching a given character string in, e.g., name, title or keyword entries) with help.search(). This function allows quite flexible searching using either regular expression or fuzzy matching; see ?help.search for details. If it is required to find an object by its partial name, apropos() is the right choice; it also accepts regular expressions.
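For example (the hits reported depend on which packages are installed):

```r
## Search names, titles and keywords of installed help pages:
help.search("linear models")

## Regular expressions are accepted as well, e.g. topics
## mentioning 'spline' or 'splines':
help.search("splines?")

## Objects on the search path whose name starts with 'lm':
apropos("^lm")
```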
Documentation and help in HTML format can also be accessed by help.start(). This function starts a Web browser with an index that links to HTML versions of the manuals mentioned above, the FAQs, a list of available packages of the current installation, several other pieces of information, and a page containing a search engine and keywords. On the latter page, the search engine can be used to "search for keywords, function and data names and text in help page titles" of installed packages. The list of keywords is useful when the search using other facilities fails: the user gets a complete list of functions and data sets corresponding to a particular keyword.
Of course, it is not possible to search for functionality of (unknown) contributed packages, or documentation of packages (and functions or data sets defined therein), that are not available in the current installation of R. A listing of contributed packages and their titles is given on CRAN, as well as their function indices and reference manuals. Nevertheless, it might still be hard to find the function or package of interest. For that purpose, an R site search [3] provided by Jonathan Baron is linked on the search page of CRAN. Help files, manuals and mailing list archives (see below) can be searched.
Mailing lists
Sometimes manuals and help functions do not answer a question, or there is need for a discussion. For these purposes there are the mailing lists [4]: r-announce for important announcements, r-help for questions and answers about problems and solutions using R, and r-devel for the development of R.
Obviously, r-help is the appropriate list for asking questions that are not covered in the manuals (including the "R FAQ") or the help pages, and regularly those questions are answered within a few hours (or even minutes).
Before asking, it is a good idea to look into the archives of the mailing list [5]. These archives contain a huge amount of information and may be much more helpful than a direct answer to a question, because most questions have already been asked, and answered, several times (focussing on slightly different aspects, hence illuminating a bigger context).
It is really easy to ask questions on r-help, but I would like to point out the advantage of reading manuals before asking. Of course, reading (a section of) some manual takes much more time than asking, and answers from voluntary help providers of r-help will probably solve the recent problem of the asker, but quite similar problems will certainly arise pretty soon. Instead, the time one spends reading the manuals will be saved in subsequent programming tasks, because many side aspects (e.g., convenient and powerful functions, programming tricks, etc.) will be learned, from which one benefits in other applications.
[3] http://finzi.psych.upenn.edu/search.html; see also http://CRAN.R-project.org/search.html
[4] accessible via http://www.R-project.org/mail.html
[5] archives are searchable; see http://CRAN.R-project.org/search.html
Bibliography
Hornik, K. (2003): The R FAQ, Version 1.7-3. URL http://www.ci.tuwien.ac.at/~hornik/R/. ISBN 3-901167-51-X.

R Development Core Team (2003a): R Data Import/Export. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-53-6.

R Development Core Team (2003b): R Installation and Administration. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-52-8.

R Development Core Team (2003c): R Language Definition. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-56-0.

R Development Core Team (2003d): Writing R Extensions. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-54-4.

Venables, W. N. and Ripley, B. D. (2000): S Programming. New York: Springer-Verlag.

Venables, W. N. and Ripley, B. D. (2002): Modern Applied Statistics with S. New York: Springer-Verlag, 4th edition.

Venables, W. N., Smith, D. M. and the R Development Core Team (2003): An Introduction to R. URL http://CRAN.R-project.org/manuals.html. ISBN 3-901167-55-2.
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews
Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5. http://www.bio.ic.ac.uk/research/mjcraw/statcomp/
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation and model criticism rather than classical hypothesis testing (p. ix). Within this framework, the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability and distributions is included in the text. The traditional topics, such as linear models, regression and analysis of variance, are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models and spatial statistics are also presented. For each topic, some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text, but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases, a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002. 282 pages, ISBN 0-387-95475-9. http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text. For teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R; Data sets in the ISwR package (help files for the data sets in the package); and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more, e.g., the use of deparse(substitute()) as default argument for the title of a plot) is also described.
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.), as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary of the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, any should have been all; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002. 328 pages, ISBN 0-7619-2280-6. http://www.socsci.mcmaster.ca/jfox/Books/companion/
The aim of this book is to enable people with knowledge on linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).
Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.
Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts, the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects, and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
R News ISSN 1609-3631
Vol. 3/1, June 2003
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]] etc. (Wishlist PR#1588.)
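As a minimal sketch of this new indexing (the list here is made up for illustration):

```r
## Recursive list indexing: a vector inside [[ ]] descends level by level.
x <- list(a = 1, b = list(c = 2, d = 3))
x[[c(2, 2)]]   # same as x[[2]][[2]], i.e. the element named "d"
stopifnot(identical(x[[c(2, 2)]], x[[2]][[2]]))
```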
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
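A quick sketch of capture.output() in use:

```r
## Capture what print() would have written to the console as a character vector.
out <- capture.output(print(1:3))
out   # "[1] 1 2 3"
```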
• ccf() (package ts) now coerces its x and y arguments to class ts.

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for dist objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added a 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
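For example, getS3method() fetches a method by generic and class name (a minimal sketch using a base method):

```r
## Look up an S3 method even when it is hidden in a namespace.
m <- getS3method("t", "data.frame")   # the t() method for data frames
is.function(m)
```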
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so it can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().
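A minimal sketch of mapply() over two parallel vectors:

```r
## mapply() applies a function elementwise across several vector arguments.
mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9
```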
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.
• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
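A small sketch of r2dtable() (the marginals here are chosen arbitrarily):

```r
## One random 2x2 table with row sums c(3, 3) and column sums c(2, 4).
set.seed(1)
tab <- r2dtable(1, r = c(3, 3), c = c(2, 4))[[1]]
rowSums(tab)   # 3 3
colSums(tab)   # 2 4
```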
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
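A minimal sketch of the two functions (the probabilities are chosen arbitrarily):

```r
## Three multinomial draws, one per column; each column sums to 'size'.
set.seed(1)
x <- rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))
colSums(x)   # 10 10 10
## Density of one particular outcome:
dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))
```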
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.
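A small sketch of the new argument (named 'tmpdir' on the help page):

```r
## Ask for a temporary file name inside a specific directory.
f <- tempfile("scratch", tmpdir = tempdir())
dirname(f) == tempdir()   # the file name lives in the requested directory
```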
• The ordering of terms for

  terms.formula(keep.order = FALSE)

is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y); it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
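As a sketch of how a package might use these make variables (the package is hypothetical; $(FLIBS), the Fortran runtime variable, is an assumption based on the usual 'Writing R Extensions' advice):

```make
## src/Makevars of a hypothetical package whose C code calls LAPACK/BLAS.
## LAPACK_LIBS expands to -lRlapack (or an external LAPACK), BLAS_LIBS to
## the BLAS selected at configure time, FLIBS to the Fortran runtime libs.
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)
```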
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in the new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.
normalp A collection of utilities related to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including the Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid not reproduced: only the clue numbers (1-27) survived text extraction.]
Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences over 90% of the stuff is not interesting to me; at this one over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Uwe Ligges
Fachbereich Statistik, Universität Dortmund, Germany
ligges@statistik.uni-dortmund.de
Book Reviews

Michael Crawley: Statistical Computing: An Introduction to Data Analysis Using S-Plus

John Wiley and Sons, New York, USA, 2002. 770 pages, ISBN 0-471-56040-5.
http://www.bio.ic.ac.uk/research/mjcraw/statcomp
The number of monographs related to the S programming language is sufficiently small (a few dozen at most) that there are still great opportunities for new books to fill some voids in the literature. Statistical Computing (hereafter referred to as SC) is one such text, presenting an array of modern statistical methods in S-Plus at an introductory level.
The approach taken by the author is to emphasize graphical data inspection, parameter estimation, and model criticism rather than classical hypothesis testing (p. ix). Within this framework the text covers a vast range of applied data analysis topics. Background material on S-Plus, probability, and distributions is included in the text. The traditional topics such as linear models, regression, and analysis of variance are covered, sometimes to a greater extent than is common (for example, a split-split-split-split plot experiment and a whole chapter on ANCOVA). More advanced modeling techniques such as bootstrap, generalized linear models, mixed effects models, and spatial statistics are also presented. For each topic some elementary theory is presented and usually an example is worked by hand. Crawley also discusses more philosophical issues such as randomization, replication, model parsimony, appropriate transformations, etc. The book contains a very thorough index of more than 25 pages. In the index, all S-Plus language words appear in bold type.
A supporting web site contains all the data files, scripts with all the code from each chapter of the book, and a few additional (rough) chapters. The scripts pre-suppose that the data sets have already been imported into S-Plus by some mechanism (no explanation of how to do this is given), presenting the novice S-Plus user with a minor obstacle. Once the data is loaded, the scripts can be used to re-do the analyses in the book. Nearly every worked example begins by attaching a data frame and then working with the variable names of the data frame. No detach command is present in the scripts, so the work environment can become cluttered and even cause errors if multiple data frames have common column names. Avoid this problem by quitting the S-Plus (or R) session after working through the script for each chapter.
SC contains a very broad use of S-Plus commands, to the point that the author says about subplot: "Having gone to all this trouble, I can't actually see why you would ever want to do this in earnest" (p. 158). Chapter 2 presents a number of functions (grep, regexpr, solve, sapply) that are not used within the text but are usefully included for reference.
Readers of this newsletter will undoubtedly want to know how well the book can be used with R (version 1.6.2 base with recommended packages). According to the preface of SC, "The computing is presented in S-Plus, but all the examples will also work in the freeware program called R." Unfortunately, this claim is too strong. The book's web site does contain (as of March 2003) three modifications for using R with the book. Many more modifications
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).
Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.
In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R

Springer, Heidelberg, Germany, 2002.
282 pages, ISBN 0-387-95475-9.
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author would be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken to not use too-clever constructs that are likely to leave novices bewildered.
As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters: I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for data analysis tasks in practical research.
Chapter one covers the basics of the R language, as much as needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset and transform, among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute()) as a default argument for the title of a plot) is also described.
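The deparse(substitute()) idiom mentioned above can be sketched in a couple of lines; the wrapper name myplot is made up for illustration:

```
## A plot wrapper whose default title is the expression the caller passed.
myplot <- function(x, main = deparse(substitute(x))) {
  plot(x, main = main)
}
myplot(rnorm(10))  # the title of the plot reads "rnorm(10)"
```

Because default arguments are evaluated lazily inside the function, substitute(x) still sees the unevaluated caller expression.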
Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparison, as well as paired data and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test, and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.
The last four chapters provide a very nice, brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics with such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA, and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book, with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.
Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002.
328 pages, ISBN 0-7619-2280-6.
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).
Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models, and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression, and contrasts for categorical data and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.
Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify(), and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.
Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".
Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.
The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.
The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes

• solve(), chol(), eigen(), and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls', and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.

• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart', and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey, and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
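The new class() behaviour in the first items above can be sketched as follows; the output comments describe what the change implies for R 1.7.0 (newer R versions report c("matrix", "array") for a matrix):

```
m <- matrix(1:4, 2)
class(m)      # "matrix", even without the methods package attached
x <- 1:3
class(x)      # "integer": implicit class, so a foo.integer method dispatches
oldClass(x)   # NULL: the underlying "class" attribute is still unset
```

oldClass() thus gives the pre-1.7.0 behaviour for code that needs to know whether a class attribute was explicitly set.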
New features

• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile(), and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)

• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC), and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links, where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace(), and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so can read compressed saves from suitable connections.

• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending how prettification of the range operated.)

• New generic functions influence(), hatvalues(), and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance(), and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305, and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it 'FALSE' gives size information without plotting. (Suggested by U. Ligges.)

• library() was changed so that when the methods package is attached, it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)

• New function mapply(), a multivariate lapply().

• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(, cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy(), and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().

• require() has a new argument named character.only, to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order=FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for "terms" objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1 but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj", then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
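A few of the new functions listed above can be sketched together in one short session; the values in the comments follow from the definitions of the functions:

```
## Recursive list indexing: x[[c(2, 1)]] is shorthand for x[[2]][[1]]
x <- list(a = 1, b = list(c = 2, d = 3))
x[[c(2, 1)]]           # 2
x[[c("b", "d")]]       # 3; names work as well as positions

## mapply(), the multivariate lapply()
mapply(rep, 1:3, 3:1)  # a list: rep(1, 3), rep(2, 2), rep(3, 1)

## capture.output() returns printed output as a character vector
out <- capture.output(print(summary(1:10)))

## rmultinom(): 3 x 2 integer matrix whose columns each sum to 10
rmultinom(2, size = 10, prob = c(0.2, 0.3, 0.5))
```

Each call is a direct use of the corresponding entry in the list; no arguments beyond those documented there are assumed.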
Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles: the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib', and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo, and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error: this is a temporary measure.
Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine(), and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
R News ISSN 1609-3631
Vol. 3/1, June 2003
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the 'Writing R Extensions' manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo-/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A maximum likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local regression, likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities for normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid not reproduced here; clue numbers 1–27 as listed below.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (13,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität in Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
are needed. A rough estimate would be that 90% of the S-Plus code in the scripts needs no modification to work in R. Basic calculations and data manipulations are nearly identical in S-Plus and R. In other cases a slight change is necessary (ks.gof in S-Plus, ks.test in R), alternate functions must be used (multicomp in S-Plus, TukeyHSD in R), and/or additional packages (nlme) must be loaded into R. Other topics (bootstrapping, trees) will require more extensive code modification efforts, and some functionality is not available in R (varcomp, subplot).

Because of these differences and the minimal documentation for R within the text, users inexperienced with an S language are likely to find the incompatibilities confusing, especially for self-study of the text. Instructors wishing to use the text with R for teaching should not find it too difficult to help students with the differences in the dialects of S. A complete set of online complements for using R with SC would be a useful addition to the book's web site.

In summary, SC is a nice addition to the literature of S. The breadth of topics is similar to that of Modern Applied Statistics with S (Venables and Ripley; Springer, 2002), but at a more introductory level. With some modification, the book can be used with R. The book is likely to be especially useful as an introduction to a wide variety of data analysis techniques.
Kevin Wright
Pioneer Hi-Bred International
Kevin.Wright@pioneer.com
Peter Dalgaard: Introductory Statistics with R
Springer, Heidelberg, Germany, 2002
282 pages, ISBN 0-387-95475-9
http://www.biostat.ku.dk/~pd/ISwR.html
This is the first book, other than the manuals that are shipped with the R distribution, that is solely devoted to R (and its use for statistical analysis). The author will be familiar to the subscribers of the R mailing lists: Prof. Dalgaard is a member of R Core and has been providing people on the lists with insightful, elegant solutions to their problems. The author's R wizardry is demonstrated in the many valuable R tips and tricks sprinkled throughout the book, although care is taken not to use too-clever constructs that are likely to leave novices bewildered.

As R is itself originally written as a teaching tool, and many are using it as such (including the author), I am sure its publication is to be welcomed by many. The targeted audience of the book is "nonstatistician scientists in various fields and students of statistics". It is based upon a set of notes the author developed for the course Basic Statistics for Health Researchers at the University of Copenhagen. A couple of caveats were stated in the Preface. Given the origin of the book, the examples used in the book were mostly, if not all, from the health sciences. The author also made it clear that the book is not intended to be used as a stand-alone text: for teaching, another standard introductory text should be used as a supplement, as many key concepts and definitions are only briefly described. For scientists who have had some statistics, however, this book by itself may be sufficient for self-study.
The titles of the chapters are:

1. Basics
2. Probability and distributions
3. Descriptive statistics and graphics
4. One- and two-sample tests
5. Regression and correlation
6. ANOVA and Kruskal-Wallis
7. Tabular data
8. Power and computation of sample sizes
9. Multiple regression
10. Linear Models
11. Logistic regression
12. Survival analysis

plus three appendices: Obtaining and installing R, Data sets in the ISwR package (help files for the data sets in the package), and Compendium (a short summary of the R language). Each chapter ends with a few exercises involving further analysis of example data. When I first opened the book, I was a bit surprised to see the last four chapters. I don't think most people expected to see those topics covered in an introductory text. I do agree with the author that these topics are quite essential for the data analysis tasks of practical research.
Chapter one covers the basics of the R language, as much as is needed in interactive use for most data analysis tasks. People with some experience with S-PLUS but new to R will soon appreciate the many nuggets of features that make R easier to use (e.g., some vectorized graphical parameters such as col, pch, etc., and the functions subset() and transform(), among others). As much programming as is typically needed for interactive use (and maybe a little more; e.g., the use of deparse(substitute(x)) as the default argument for the title of a plot) is also described.

Chapter two covers the calculation of probability distributions and random number generation for simulations in R. Chapter three describes descriptive statistics (mean, SD, quantiles, summary tables, etc.) as well as how to graphically display summary plots such as Q-Q plots, empirical CDFs, boxplots, dotcharts, bar plots, etc. Chapter four describes the t and Wilcoxon tests for one- and two-sample comparisons, as well as paired data, and the F test for equality of variances.
Chapter five covers simple linear regression and bivariate correlations. Chapter six contains quite a bit of material, from oneway ANOVA, pairwise comparisons, Welch's modified F test and Bartlett's test for equal variances, to twoway ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice, brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of the coverage of variable selection in multiple linear regression entirely. The coverage of these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)
As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, 'any' should have been 'all'; on page 191, the logit should be defined as log[p/(1 − p)]). On the whole, however, this is a very well written book with a logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression
Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion
The aim of this book is to enable people with knowledge of linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both the language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox; Sage Publications, 1997).

Chapters 1–2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and the handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections is clearly new users who might never have used R or S-Plus before.
Chapters 4–6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics, with methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasizes the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, the detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1–3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples, without lengthy discussions of theory. Over wide parts, the text can be used directly in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences in S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
The only drawback of the book follows directly from its target audience of possible S novices: more experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, and nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0

by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
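A short sketch of how RNGversion() might be used (the version strings here are illustrative):

```r
## Hypothetical reproduction of an old simulation: restore the
## pre-1.7.0 generators, rerun, then return to the new defaults.
RNGversion("1.6.2")   # generators as used by R 1.6.2
set.seed(123)
old <- runif(3)

RNGversion("1.7.0")   # back to 'Mersenne-Twister'/'Inversion'
set.seed(123)
new <- runif(3)       # will generally differ from 'old'
```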
bull Namespaces can now be defined for packagesother than lsquobasersquo see lsquoWriting R ExtensionsrsquoThis hides some internal objects and changesthe search path from objects in a namespaceAll the base packages (except methods andtcltk) have namespaces as well as the rec-ommended packages lsquoKernSmoothrsquo lsquoMASSrsquolsquobootrsquo lsquoclassrsquo lsquonnetrsquo lsquorpartrsquo and lsquospatialrsquo
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath), etc.) to demo(plotmath), etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)
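For example, the new padding behaviour can be seen with:

```r
## Assigning too few names now pads the rest with character NAs:
x <- 1:4
names(x) <- c("a", "b")
names(x)   # "a" "b" NA NA
```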
• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch; it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
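A minimal sketch of this idiom, replacing all missing values in one step:

```r
## is.na() on a data frame returns a logical matrix, which can now
## be used directly as a replacement index:
d <- data.frame(x = c(1, NA, 3), y = c(NA, 5, 6))
d[is.na(d)] <- 0
d   # all NAs replaced by 0
```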
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist, PR#1588.)
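For example:

```r
## Recursive indexing: one [[ ]] call descends several levels.
x <- list(a = 1, b = list(p = 2, q = 3))
x[[c(2, 2)]]   # same as x[[2]][[2]], i.e. 3
```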
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.
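A small example of its use:

```r
## Capture the printed summary of a regression as character lines:
fit <- lm(dist ~ speed, data = cars)
txt <- capture.output(summary(fit))   # one element per output line

## Alternatively, divert the output to a file:
capture.output(summary(fit), file = "fit-summary.txt")
```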
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., it works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)

• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example(). (Suggested by Andy Liaw.)

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.
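The classic use is inside functions that build closures, where lazy evaluation could otherwise leave an argument unevaluated until too late. A sketch:

```r
## force(i) evaluates the promise for 'i' immediately, so each
## returned function keeps its own copy of the value:
make_adder <- function(i) {
  force(i)
  function(x) x + i
}
adders <- lapply(1:3, make_adder)
adders[[1]](10)   # 11
```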
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
bull New function gzcon() which wraps a connec-tion and provides (de)compression compatiblewith gzip
load() now uses gzcon() so can read com-pressed saves from suitable connections
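A sketch of the combination (illustrative; the temporary file stands in for any suitable connection):

```r
tf <- tempfile()
x <- 1:10
save(x, file = tf, compress = TRUE)  # a gzip-compressed save
con <- gzcon(file(tf, "rb"))         # wrap the raw connection for decompression
nms <- load(con)                     # restores x; returns the restored names
close(con)
nms  # "x"
```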
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

  To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
R News ISSN 1609-3631
Vol. 3/1, June 2003
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it FALSE gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns, invisibly, a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
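For illustration, mapply() walks several argument vectors in parallel:

```r
# rep(1, 3), rep(2, 2) and rep(3, 1) in one call; because the result
# lengths differ, the result is returned as a list
mapply(rep, 1:3, 3:1)
```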
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
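A quick illustrative use of md5sum() on a throwaway file:

```r
library(tools)
tf <- tempfile()
writeLines("hello", tf)
md5sum(tf)  # a named character vector of 32-character hex digests
```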
• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See print.default() for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
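Illustrative call (not from the newsletter): every generated table has the requested margins.

```r
tabs <- r2dtable(2, r = c(3, 2), c = c(2, 3))  # two random 2x2 tables
rowSums(tabs[[1]])  # always c(3, 2)
colSums(tabs[[1]])  # always c(2, 3)
```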
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named character.only to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
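A short illustrative sketch of both functions:

```r
x <- rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))
colSums(x)  # each column is one draw of 10 items, so each sums to 10
d <- dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))  # probability of one outcome
```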
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates "NA" and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added a "[" method for terms objects.

• New argument 'silent' to try().
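Illustration of the new argument:

```r
# With silent = TRUE the error message is not printed; the returned
# "try-error" object can be inspected instead
res <- try(stop("boom"), silent = TRUE)
inherits(res, "try-error")  # TRUE
```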
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.

• unique.default() now works for POSIXct objects, and hence so does factor().
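For illustration (the dates are invented):

```r
tt <- as.POSIXct("2003-06-01 12:00", tz = "GMT") + c(0, 0, 3600)
u <- unique(tt)          # drops the duplicated time point
length(u)                # 2
inherits(u, "POSIXct")   # the class is preserved
```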
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling "R -g Tk" (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  – New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).

  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes

• TITLE files in packages are no longer used, the 'Title' field in the 'DESCRIPTION' file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated &amp; defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage.

  If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities

• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class slides.
C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis: Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities for normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, fft, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org/) and libjpeg (http://www.ijg.org/). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid: a numbered grid with entries 1 to 27 appears here in the printed newsletter; it cannot be reproduced in this text version.]
Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003

Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, to graphs and graphical models, database connectivity, and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R developments were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
comparisons, Welch's modified F test, Bartlett's test for equal variances, to two-way ANOVA and even a repeated measures example. Nonparametric procedures are also described. Chapter seven covers basic inferences for categorical data (tests for proportions, contingency tables, etc.). Chapter eight covers power and sample size calculations for means and proportions in one- and two-sample problems.

The last four chapters provide a very nice brief summary on the more advanced topics. The graphical presentations of regression diagnostics (e.g., color coding the magnitudes of the diagnostic measures) can be very useful. However, I would have preferred the omission of coverage on variable selection in multiple linear regression entirely. The coverage on these topics is necessarily brief, but does provide enough material for the reader to follow through the examples. It is impressive to cover this many topics in such brevity without sacrificing clarity of the descriptions. (The linear models chapter covers polynomial regression, ANCOVA and regression diagnostics, among others.)

As with (almost) all first editions, typos are inevitable (e.g., on page 33, in the modification to vectorize Newton's method for calculating square roots, "any" should have been "all"; on page 191 the logit should be defined as log[p/(1 - p)]). On the whole, however, this is a very well written book with logical ordering of the content and a nice flow. Ample code examples are interspersed throughout the text, making it easy for the reader to follow along on a computer. I whole-heartedly recommend this book to people in the "intended audience" population.

Andy Liaw
Merck Research Laboratories
andy_liaw@merck.com
John Fox: An R and S-Plus Companion to Applied Regression

Sage Publications, Thousand Oaks, CA, USA, 2002
328 pages, ISBN 0-7619-2280-6
http://www.socsci.mcmaster.ca/jfox/Books/companion/

The aim of this book is to enable people with knowledge on linear regression modelling to analyze their data using R or S-Plus. Any textbook on linear regression may provide the necessary background; of course, both language and contents of this companion are closely aligned with the author's own Applied Regression, Linear Models, and Related Methods (Fox, Sage Publications, 1997).

Chapters 1-2 give a basic introduction to R and S-Plus, including a thorough treatment of data import/export and handling of missing values. Chapter 3 complements the introduction with exploratory data analysis and Box-Cox data transformations. The target audience of the first three sections are clearly new users who might never have used R or S-Plus before.

Chapters 4-6 form the main part of the book and show how to apply linear models, generalized linear models and regression diagnostics, respectively. This includes detailed explanations of model formulae, dummy-variable regression and contrasts for categorical data, and interaction effects. Analysis of variance is mostly explained using "Type II" tests as implemented by the Anova() function in package car. The infamous "Type III" tests, which are requested frequently on the r-help mailing list, are provided by the same function, but their usage is discouraged quite strongly in both book and help page.

Chapter 5 reads as a natural extension of linear models and concentrates on error distributions provided by the exponential family and corresponding link functions. The main focus is on models for categorical responses and count data. Much of the functionality provided by package car is on regression diagnostics and has methods both for linear and generalized linear models. Heavy usage of graphics, identify() and text-based menus for plots emphasize the interactive and exploratory side of regression modelling. Other sections deal with Box-Cox transformations, detection of non-linearities, and variable selection.

Finally, Chapters 7 & 8 extend the introductory Chapters 1-3 with material on drawing graphics and S programming. Although the examples use regression models, e.g., plotting the surface of fitted values of a generalized linear model using persp(), both chapters are rather general and not "regression-only".

Several real world data sets are used throughout the book. All S code necessary to reproduce numerical and graphical results is shown and explained in detail. Package car, which contains all data sets used in the book and functions for ANOVA and regression diagnostics, is available from CRAN and the book's homepage.

The style of presentation is rather informal and hands-on; software usage is demonstrated using examples without lengthy discussions of theory. Over wide parts the text can be directly used in class to demonstrate regression modelling with S to students (after more theoretical lectures on the topic). However, the book contains enough "reminders" about the theory such that it is not necessary to switch to a more theoretical text while reading it.

The book is self-contained with respect to S and should easily get new users started. The placement of R before S-Plus in the title is not only lexicographic: the author uses R as the primary S engine. Differences of S-Plus (as compared with R) are clearly marked in framed boxes; the graphical user interface of S-Plus is not covered at all.
R News ISSN 1609-3631
Vol. 3/1, June 2003
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.
In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.
Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org
Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.
• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See Startup and options for details. Note that .First() is no longer used by R itself.
• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the "class" attribute as R without 'methods' used to.
• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
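A short sketch of the reproducibility mechanism; the version string "1.6.2" is just an illustrative choice, and on recent R versions requesting an old generator emits a warning, hence the suppressWarnings():

```r
## Simulate under the new defaults ("Mersenne-Twister" + "Inversion")
set.seed(1)
x_new <- runif(3)

## Restore the generators of the R 1.6.x series and redo the simulation
suppressWarnings(RNGversion("1.6.2"))
set.seed(1)
x_old <- runif(3)

## Different generator families give different streams
identical(x_new, x_old)    # FALSE

## Return to the current defaults
suppressWarnings(RNGversion(as.character(getRversion())))
```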
• Namespaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a namespace. All the base packages (except methods and tcltk) have namespaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.
• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms): the results are now much closer to those of S.
• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.
• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).
• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.
• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).
• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.
• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist part of PR#2358.)
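A one-line illustration of the new padding behaviour:

```r
v <- 1:4
names(v) <- c("a", "b")    # value too short: padded with NAs
names(v)                   # "a" "b" NA NA
```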
• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.
• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
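For example:

```r
x <- list(a = list(b = 1:3, c = "hi"), d = TRUE)
x[[c(1, 2)]]          # same as x[[1]][[2]], i.e. "hi"
x[[c("a", "b")]]      # names work too: x[["a"]][["b"]]
```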
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.
• The postscript output makes use of relative moves, and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
• biplot.default() ('mva' package) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
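When no file or connection is given, capture.output() returns the printed output as a character vector, one element per line:

```r
txt <- capture.output(summary(cars))
is.character(txt)    # TRUE
length(txt) > 0      # TRUE: one element per printed line
```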
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., it works for "dist" objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example(). (Suggested by Andy Liaw.)
• New function file.symlink() to create symbolic file links, where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with namespaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon(), which wraps a connection and provides (de)compression compatible with gzip.
load() now uses gzcon(), so it can read compressed saves from suitable connections.
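A sketch of the round trip: save() to a gzcon()-wrapped file connection, then load() the compressed result back.

```r
f <- tempfile()
con <- gzcon(file(f, "wb"))    # output written to con is gzip-compressed
x <- 1:10
save(x, file = con)
close(con)

rm(x)
con <- gzcon(file(f, "rb"))
load(con)                      # load() reads the compressed save
close(con)
x                              # 1:10 again
unlink(f)
```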
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.
To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects – all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is no longer clipped by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting. (Suggested by U. Ligges.)
• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument allowing one not to compute the casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now returns invisibly a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs. (Requested by Frank Harrell.)
• New function mapply(), a multivariate lapply().
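mapply() applies a function to the first elements of each argument, then the second elements, and so on, simplifying the result where possible:

```r
mapply(function(a, b) a + b, 1:3, 4:6)    # 5 7 9
mapply(rep, 1:3, 3:1)                     # list: c(1,1,1), c(2,2), 3
```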
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() (package eda) now has an 'na.rm' argument. (PR#2298.)
• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument. (Contributed by Uwe Ligges and Wolfram Fischer.)
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument. (Contributed by Adrian Trapletti.)
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.
• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex = 1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n×m inputs with m >> n.
• princomp() no longer allows the use of more variables than units: use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See print.default for the fine print. (PR#2506.)
• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
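For example, simulating 2×2 tables with both margins fixed (the kind of simulation used for conditional tests on contingency tables):

```r
set.seed(123)
tabs <- r2dtable(3, r = c(10, 20), c = c(15, 15))
tabs[[1]]               # one 2 x 2 table
rowSums(tabs[[1]])      # 10 20
colSums(tabs[[1]])      # 15 15
```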
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().
• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().
• require() has a new argument named character.only, to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
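A quick sketch of the two functions:

```r
set.seed(1)
x <- rmultinom(5, size = 10, prob = c(0.2, 0.3, 0.5))
dim(x)         # 3 x 5: one column per draw
colSums(x)     # each column sums to the size, 10

## probability of one particular outcome
dmultinom(c(2, 3, 5), prob = c(0.2, 0.3, 0.5))
```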
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NA's, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments. (Requested by Michael Friendly.)
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added a "[" method for "terms" objects.
• New argument 'silent' to try().
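With silent = TRUE no error message is printed; the returned object of class "try-error" can be tested instead:

```r
res <- try(log("not a number"), silent = TRUE)
inherits(res, "try-error")    # TRUE
```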
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
– New function topenv() to determine the nearest top-level environment (usually .GlobalEnv or a name space environment).
– Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and is serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.
If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This fixes some vertical spacing problems when using the LaTeX class slides.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP – see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies: useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS: interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.
MCMCpack: this package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice: a graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava: collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.
abind: combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4: multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.
amap: hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.
anm: the package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.
clim.pact: the package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod: functions for modelling dispersion in GLM. By Luca Scrucca.
effects: graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.
gbm: this package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.
genetics: classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML: a Maximum Likelihood approach to mixed models. By Göran Broström.
gpclib: general polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper: R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.
gstat: variogram modelling; simple, ordinary and universal point or block (co)kriging; and sequential Gaussian or indicator (co)simulation. Variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hier.part: variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.
hwde: fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.
ismev: functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.
locfit: local regression, likelihood and density estimation. By Catherine Loader.
mimR: an R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix: estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.
multidim: multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.
normalp: a collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.
pls.pcr: multivariate regression by PLS and PCR. By Ron Wehrens.
polspline: routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage: this package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session: utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow: support for simple parallel computing in R. By Luke Tierney, A.J. Rossini and Na Li.
statmod: miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey: summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd: functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop: an abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf: S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra: a collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade: S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML: a collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk: facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry: provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs: this provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[The crossword grid (numbered squares 1–27) cannot be reproduced in this text version; see the URL above for the grid.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22 across
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot, loot souk (8)
20 See 10 across
21 A computer sets my puzzle (6)
23 See 17 across
Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combine methods, spatial statistics, user-interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences over 90% of the stuff is not interesting to me; at this one over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de

Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
The only drawback of the book follows directly from the target audience of possible S novices. More experienced S users will probably be disappointed by the ratio of S introduction to material on regression analysis. Only about 130 out of 300 pages deal directly with linear models; the major part is an introduction to S. The online appendix at the book's homepage (mirrored on CRAN) contains several extension chapters on more advanced topics like bootstrapping, time series, nonlinear or robust regression.

In summary, I highly recommend the book to anyone who wants to learn or teach applied regression analysis with S.

Friedrich Leisch
Universität Wien, Austria
Friedrich.Leisch@R-project.org

Changes in R 1.7.0
by the R Core Team
User-visible changes
• solve(), chol(), eigen() and svd() now use LAPACK routines unless a new back-compatibility option is turned on. The signs and normalization of eigen/singular vectors may change from earlier versions.

• The 'methods', 'modreg', 'mva', 'nls' and 'ts' packages are now attached by default at startup (in addition to 'ctest'). The option "defaultPackages" has been added, which contains the initial list of packages. See ?Startup and ?options for details. Note that .First() is no longer used by R itself.

• class() now always (not just when 'methods' is attached) gives a non-null class, and UseMethod() always dispatches on the class that class() returns. This means that methods like foo.matrix and foo.integer will be used. Functions oldClass() and oldClass<-() get and set the class attribute as R without 'methods' used to.

• The default random number generators have been changed to 'Mersenne-Twister' and 'Inversion'. A new RNGversion() function allows you to restore the generators of an earlier R version if reproducibility is required.
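A quick sketch of how RNGversion() can be used (the seed values here are arbitrary examples, not taken from the release notes):

```r
## R 1.7.0 defaults: Mersenne-Twister (uniform) and Inversion (normal)
set.seed(1)
x <- runif(3)

## Restore the generators an earlier release used, e.g. to
## reproduce a simulation that was run under R 1.6.2:
RNGversion("1.6.2")
set.seed(1)
y <- runif(3)   # the stream R 1.6.2 would have produced

## Back to the new defaults:
RNGversion("1.7.0")
```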
• Name spaces can now be defined for packages other than 'base': see 'Writing R Extensions'. This hides some internal objects and changes the search path from objects in a name space. All the base packages (except methods and tcltk) have name spaces, as well as the recommended packages 'KernSmooth', 'MASS', 'boot', 'class', 'nnet', 'rpart' and 'spatial'.

• Formulae are no longer automatically simplified when terms() is called, so the formulae in results may still be in the original form rather than the equivalent simplified form (which may have reordered the terms); the results are now much closer to those of S.

• The tables for plotmath, Hershey and Japanese have been moved from the help pages (example(plotmath) etc.) to demo(plotmath) etc.

• Errors and warnings are sent to stderr, not stdout, on command-line versions of R (Unix and Windows).

• The R_X11 module is no longer loaded until it is needed, so do test that x11() works in a new Unix-alike R installation.
New features
• if() and while() give a warning if called with a vector condition.

• Installed packages under Unix without compiled code are no longer stamped with the platform and can be copied to other Unix-alike platforms (but not to other OSes, because of potential problems with line endings and OS-specific help files).

• The internal random number generators will now never return values of 0 or 1 for runif(). This might affect simulation output in extremely rare cases. Note that this is not guaranteed for user-supplied random-number generators, nor when the standalone Rmath library is used.

• When assigning names to a vector, a value that is too short is padded by character NAs. (Wishlist, part of PR#2358.)

• It is now recommended to use the 'SystemRequirements:' field in the DESCRIPTION file for specifying dependencies external to the R system.

• Output text connections no longer have a line-length limit.
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.

• The Math group generic does not check the number of arguments supplied before dispatch; it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().

• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).

• Recursive indexing of lists is allowed, so x[[c(4, 2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588.)
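A small illustration of this recursive indexing (the example list is made up here):

```r
x <- list(1, "a", TRUE, list("b", list(3.14)))

x[[c(4, 2)]]      # same as x[[4]][[2]],      i.e. list(3.14)
x[[c(4, 2, 1)]]   # same as x[[4]][[2]][[1]], i.e. 3.14
```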
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.

• The postscript output makes use of relative moves, and so is somewhat more compact.

• %*% and crossprod() for complex arguments make use of BLAS routines, and so may be much faster on some platforms.

• arima() has coef(), logLik() (and hence AIC) and vcov() methods.

• New function as.difftime() for time-interval data.

• basename() and dirname() are now vectorized.

• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)

• New function capture.output() to send printed output from an expression to a connection or a text string.

• ccf() (package ts) now coerces its x and y arguments to class "ts".

• chol() and chol2inv() now use LAPACK routines by default.

• as.dist() is now idempotent, i.e., works for "dist" objects.

• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).

• New function constrOptim() for optimisation under linear inequality constraints.

• Added 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.

• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.

• The edit.matrix() and edit.data.frame() editors can now handle logical data.

• New argument 'local' for example() (suggested by Andy Liaw).

• New function file.symlink() to create symbolic file links where supported by the OS.

• New generic function flush() with a method to flush connections.

• New function force() to force evaluation of a formal argument.

• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with name spaces.

• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.

• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip.

load() now uses gzcon(), so can read compressed saves from suitable connections.
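As a sketch of the idea (the file name is invented for illustration):

```r
x <- rnorm(100)
save(x, file = "x.rda", compress = TRUE)   # a gzip-compressed save

## gzcon() wraps a plain binary connection so that load()
## can decompress on the fly:
con <- gzcon(file("x.rda", "rb"))
load(con)
close(con)
```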
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.

• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances.

To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)

• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).

• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it FALSE gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
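For instance (a made-up example, not from the release notes):

```r
## mapply() applies a function to the first elements of each
## argument, then the second elements, and so on:
mapply(function(a, b) a + b, 1:3, 4:6)   # c(5, 7, 9)

## When the per-call results have unequal lengths, the value
## cannot be simplified and a list is returned:
mapply(rep, 1:3, 3:1)   # list(c(1, 1, 1), c(2, 2), 3)
```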
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).

• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and is available as the 'pdfviewer' option.

• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)

• prcomp() is now fast for n x m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.

• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506.)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.

• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.

• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.

• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.

• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.

• Added "[" method for terms objects.

• New argument 'silent' to try().

• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
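For example (made up here): with monthly data, a start period larger than the frequency now simply wraps into the following year:

```r
## y > frequency is now accepted in start/end = c(x, y):
z <- ts(1:24, start = c(1998, 14), frequency = 12)
start(z)   # period 14 of 1998, i.e. the same as c(1999, 2)
```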
• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

  - Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

  - New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

  - Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.

• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage.

If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.

• The Perl code to build help now removes an existing example file if there are no examples in the current help file.

• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.

• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.

C-level facilities

• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org xml standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis, Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng, GPC library by Alan Murta.

grasper R-GRASP uses generalized regressions analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson

Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
1 2 3 4 5 6 7 8
9 10
11 12
13 14
15
16 17 18
19 20 21 22
23
24 25
26 27
Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)
Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac
R News ISSN 1609-3631
Vol 31 June 2003 40
Recent EventsDSC 2003
Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation
Two events took place in Vienna just prior to DSC
2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening
While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants
Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo
Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende
Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria
Editorial BoardDouglas Bates and Thomas Lumley
Editor Programmerrsquos NicheBill Venables
Editor Help DeskUwe Ligges
Email of editors and editorial boardfirstnamelastname R-projectorg
R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)
R Project HomepagehttpwwwR-projectorg
This newsletter is available online athttpCRANR-projectorgdocRnews
R News ISSN 1609-3631
Vol. 3/1, June 2003
• On platforms where vsnprintf does not return the needed buffer size, the output line-length limit for fifo(), gzfile() and bzfile() has been raised from 10k to 100k chars.
• The Math group generic does not check the number of arguments supplied before dispatch: it used to if the default method had one argument, but not if it had two. This allows trunc.POSIXt() to be called via the group generic trunc().
• Logical matrix replacement indexing of data frames is now implemented (interpreted as if the lhs was a matrix).
• Recursive indexing of lists is allowed, so x[[c(4,2)]] is shorthand for x[[4]][[2]], etc. (Wishlist PR#1588)
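A minimal sketch of the recursive-indexing behaviour (the list and names here are illustrative, not from the release notes):

```r
# Recursive indexing: a vector of indices descends one level per element
x <- list(a = 1, b = list(c = 2, d = 3))

x[[c(2, 2)]]  # descends to x[[2]][[2]], i.e. 3
stopifnot(identical(x[[c(2, 2)]], x[[2]][[2]]))
```

A character vector works the same way, so x[[c("b", "d")]] also selects the nested element by name.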
• Most of the time series functions now check explicitly for a numeric time series, rather than fail at a later stage.
• The postscript output makes use of relative moves and so is somewhat more compact.
• %*% and crossprod() for complex arguments make use of BLAS routines and so may be much faster on some platforms.
• arima() has coef(), logLik() (and hence AIC) and vcov() methods.
• New function as.difftime() for time-interval data.
• basename() and dirname() are now vectorized.
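For example, a sketch of the vectorized behaviour (the paths are illustrative):

```r
# basename()/dirname() now map over a whole vector of paths at once
paths <- c("/tmp/data/a.txt", "/var/log/b.log")

basename(paths)  # "a.txt" "b.log"
dirname(paths)   # "/tmp/data" "/var/log"
```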
• biplot.default() (package mva) allows 'xlab' and 'ylab' parameters to be set (without partially matching to 'xlabs' and 'ylabs'). (Thanks to Uwe Ligges.)
• New function capture.output() to send printed output from an expression to a connection or a text string.
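A small sketch of capturing printed output into a character vector:

```r
# capture.output() returns what an expression would have printed,
# one element per output line
out <- capture.output(print(1:3))
out  # "[1] 1 2 3"
```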
• ccf() (package ts) now coerces its x and y arguments to class "ts".
• chol() and chol2inv() now use LAPACK routines by default.
• as.dist() is now idempotent, i.e., works for "dist" objects.
• Generic function confint() and 'lm' method (formerly in package MASS, which has 'glm' and 'nls' methods).
• New function constrOptim() for optimisation under linear inequality constraints.
• Add 'difftime' subscript method and methods for the group generics. (Thereby fixing PR#2345.)
• download.file() can now use HTTP proxies which require 'basic' username/password authentication.
• dump() has a new argument 'envir'. The search for named objects now starts by default in the environment from which dump() is called.
• The edit.matrix() and edit.data.frame() editors can now handle logical data.
• New argument 'local' for example() (suggested by Andy Liaw).
• New function file.symlink() to create symbolic file links, where supported by the OS.
• New generic function flush() with a method to flush connections.
• New function force() to force evaluation of a formal argument.
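A sketch of the classic use of force() when building closures (the function name make_adder is illustrative):

```r
# Without force(), n would stay an unevaluated promise until the
# returned closure is first called; force(n) evaluates it immediately
make_adder <- function(n) {
  force(n)
  function(x) x + n
}

add3 <- make_adder(3)
add3(10)  # 13
```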
• New functions getFromNamespace(), fixInNamespace() and getS3method() to facilitate developing code in packages with name spaces.
• glm() now accepts 'etastart' and 'mustart' as alternative ways to express starting values.
• New function gzcon() which wraps a connection and provides (de)compression compatible with gzip. load() now uses gzcon(), so can read compressed saves from suitable connections.
• help.search() can now reliably match individual aliases and keywords, provided that all packages searched were installed using R 1.7.0 or newer.
• hist.default() now returns the nominal break points, not those adjusted for numerical tolerances. To guard against unthinking use, 'include.lowest' in hist.default() is now ignored, with a warning, unless 'breaks' is a vector. (It either generated an error or had no effect, depending on how prettification of the range operated.)
• New generic functions influence(), hatvalues() and dfbeta() with lm and glm methods; the previously normal functions rstudent(), rstandard(), cooks.distance() and dfbetas() became generic. These have changed behavior for glm objects, all originating from John Fox's car package.
R News ISSN 1609-3631
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.
• New isoreg() function and class for isotonic regression ('modreg' package).
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.
• legend() has a new 'plot' argument. Setting it to FALSE gives size information without plotting (suggested by U. Ligges).
• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.
• list.files()/dir() have a new argument 'recursive'.
• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.
• load() now returns invisibly a character vector of the names of the objects which were restored.
• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).
• New function mapply(), a multivariate lapply().
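A minimal sketch of mapply() iterating over several argument vectors in parallel:

```r
# The function is applied to the first elements of each vector,
# then the second elements, and so on
mapply(function(x, y) x + y, 1:3, c(10, 20, 30))  # 11 22 33
```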
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).
• methods() now looks for registered methods in name spaces, and knows about many objects that look like methods but are not.
• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).
• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.
• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).
• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.
• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.
• pie() has always accepted graphical pars but only passed them on to title(). Now pie(..., cex = 1.5) works.
• plot.dendrogram() ('mva' package) now draws leaf labels if present by default.
• New plot.design() function, as in S.
• The postscript() and pdf() drivers now allow the title to be set.
• New function power.anova.test(), contributed by Claus Ekstrøm.
• power.t.test() now behaves correctly for negative delta in the two-tailed case.
• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
• prcomp() is now fast for n x m inputs with m >> n.
• princomp() no longer allows the use of more variables than units; use prcomp() instead.
• princomp.formula() now has principal argument 'formula', so update() can be used.
• Printing an object with attributes now dispatches on the class(es) of the attributes. See print.default for the fine print. (PR#2506)
• print.matrix() and prmatrix() are now separate functions: prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.
• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.
• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).
• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS), but are less flexible.
• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
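A sketch of what r2dtable() produces (the margins chosen here are illustrative): each simulated table has exactly the requested row and column sums.

```r
set.seed(1)
# Three random 2x2 tables with row sums (5, 5) and column sums (6, 4)
tabs <- r2dtable(3, r = c(5, 5), c = c(6, 4))

rowSums(tabs[[1]])  # 5 5
colSums(tabs[[1]])  # 6 4
```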
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().
• New generic function reorder() with a dendrogram method; new order.dendrogram() and heatmap().
• require() has a new argument named character.only to make it align with library().
• New functions rmultinom() and dmultinom(), the first one with a C API.
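A small sketch of the pair (the counts and probabilities are illustrative):

```r
# dmultinom(): probability of observing counts (1, 2, 1) in 4 draws
# from cells with probabilities (0.25, 0.5, 0.25)
dmultinom(c(1, 2, 1), prob = c(0.25, 0.5, 0.25))  # 0.1875

# rmultinom(): each column is one draw of 10 balls into 2 cells,
# so every column sums to the sample size
set.seed(1)
r <- rmultinom(2, size = 10, prob = c(0.5, 0.5))
colSums(r)  # 10 10
```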
• New function runmed() for fast running medians ('modreg' package).
• New function slice.index() for identifying indexes with respect to slices of an array.
• solve.default(a) now gives the dimnames one would expect.
• stepfun() has a new 'right' argument for right-continuous step function construction.
• str() now shows ordered factors differently from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.
• symnum() has a new logical argument 'abbr.colnames'.
• summary(<logical>) now mentions NA's, as suggested by Göran Broström.
• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2 decimal places.
• New function Sys.getpid() to get the process ID of the R session.
• table() now allows exclude= with factor arguments (requested by Michael Friendly).
• The tempfile() function now takes an optional second argument giving the directory name.
• The ordering of terms for terms.formula(keep.order = FALSE) is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing; see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
• Added "[" method for terms objects.
• New argument 'silent' to try().
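A sketch of the new argument: with silent = TRUE the error message is suppressed, while the returned "try-error" object can still be inspected.

```r
# No error message is printed; the error is returned as a value
res <- try(stop("boom"), silent = TRUE)

inherits(res, "try-error")  # TRUE
```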
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
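For illustration, a monthly series may now start at any month (the values are arbitrary):

```r
# 24 monthly observations starting in July 1959
x <- ts(1:24, start = c(1959, 7), frequency = 12)

start(x)  # 1959 7
end(x)    # 1961 6
```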
• unique.default() now works for POSIXct objects, and hence so does factor().
• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class "tclObj" then it will be returned to Tcl.
• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.
• Changes to support name spaces:
  – Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.
  – New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).
  – Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used; a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time it overrides the automatic selection of the default browser. It should be set to the full path, unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the name and title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the 'VignetteIndexEntry' entries, respectively). These index files, as well as the package Rd contents data base, are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• print.NoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated; the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty name and title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows fixing some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications" (Springer). Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.
amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.
anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.
clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.
effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.
genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.
gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.
hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.
ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.
locfit Local Regression, Likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.
multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.
normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.
pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.
polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML A collection of S functions that provide an interface for creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object, or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid]
Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (25)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (33)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.

Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.

While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.

Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."

Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria
Editorial BoardDouglas Bates and Thomas Lumley
Editor Programmerrsquos NicheBill Venables
Editor Help DeskUwe Ligges
Email of editors and editorial boardfirstnamelastname R-projectorg
R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)
R Project HomepagehttpwwwR-projectorg
This newsletter is available online athttpCRANR-projectorgdocRnews
R News ISSN 1609-3631
Vol. 3/1, June 2003
• interaction.plot() has several new arguments, and the legend is not clipped anymore by default. It internally uses axis(1) instead of mtext(). This also addresses bugs PR#820, PR#1305 and PR#1899.

• New isoreg() function and class for isotonic regression ('modreg' package).
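As a quick illustration (toy data made up for the example), isoreg() pools adjacent violators so that the fitted values are monotone non-decreasing:

```r
# Isotonic least-squares regression: fitted values never decrease
y <- c(1, 3, 2, 4, 5, 4)
fit <- isoreg(y)        # x defaults to 1:length(y)
fit$yf                  # pooled, non-decreasing fitted values
```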
• La.chol() and La.chol2inv() now give interpretable error messages rather than LAPACK error codes.

• legend() has a new 'plot' argument. Setting it FALSE gives size information without plotting (suggested by U. Ligges).

• library() was changed so that when the methods package is attached it no longer complains about formal generic functions not specific to the library.

• list.files()/dir() have a new argument 'recursive'.

• lm.influence() has a new 'do.coef' argument allowing not to compute casewise changed coefficients. This makes plot.lm() much quicker for large data sets.

• load() now returns invisibly a character vector of the names of the objects which were restored.

• New convenience function loadURL() to allow loading data files from URLs (requested by Frank Harrell).

• New function mapply(), a multivariate lapply().
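For instance (a small made-up example), mapply() walks several vectors in parallel, passing the i-th element of each to the function:

```r
# rep() is called as rep(1, 3), rep(2, 2), rep(3, 1);
# results of unequal length are returned as a list
mapply(rep, 1:3, 3:1)
```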
• New function md5sum() in package tools to calculate MD5 checksums on files (e.g., on parts of the R installation).
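A short example (the file is a temporary file created just for the demonstration):

```r
library(tools)
f <- tempfile()
writeLines("some content", f)
md5sum(f)   # named character vector of 32-hex-digit checksums
```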
• medpolish() (package eda) now has an 'na.rm' argument (PR#2298).

• methods() now looks for registered methods in namespaces, and knows about many objects that look like methods but are not.

• mosaicplot() has a new default for 'main', and supports the 'las' argument (contributed by Uwe Ligges and Wolfram Fischer).

• An attempt to open() an already open connection will be detected and ignored with a warning. This avoids improperly closing some types of connections if they are opened repeatedly.

• optim(method = "SANN") can now cover combinatorial optimization by supplying a move function as the 'gr' argument (contributed by Adrian Trapletti).

• PDF files produced by pdf() have more extensive information fields, including the version of R that produced them.

• On Unix(-alike) systems the default PDF viewer is now determined during configuration, and available as the 'pdfviewer' option.

• pie() has always accepted graphical pars, but only passed them on to title(). Now pie(..., cex=1.5) works.

• plot.dendrogram() ('mva' package) now draws leaf labels if present, by default.

• New plot.design() function, as in S.

• The postscript() and pdf() drivers now allow the title to be set.

• New function power.anova.test(), contributed by Claus Ekstrøm.

• power.t.test() now behaves correctly for negative delta in the two-tailed case.

• power.t.test() and power.prop.test() now have a 'strict' argument that includes rejections in the wrong tail in the power calculation. (Based in part on code suggested by Ulrich Halekoh.)
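For example (the design values here are invented), computing the power of a two-sample t-test with 20 observations per group and an effect of one standard deviation, or solving for the sample size instead:

```r
# Power for n = 20 per group, delta = 1 SD, two-sided alpha = 0.05
p <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05)
p$power          # roughly 0.87

# Leave n unspecified to solve for the per-group sample size
power.t.test(power = 0.9, delta = 1)$n
```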
• prcomp() is now fast for n × m inputs with m >> n.

• princomp() no longer allows the use of more variables than units: use prcomp() instead.

• princomp.formula() now has principal argument 'formula', so update() can be used.
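The distinction matters for wide data (simulated here): prcomp() handles more variables than cases, while princomp() now refuses them:

```r
set.seed(1)
X <- matrix(rnorm(5 * 100), nrow = 5)   # 5 cases, 100 variables
pc <- prcomp(X)
length(pc$sdev)                          # min(n, m) = 5 components

# princomp() signals an error for such data
res <- try(princomp(X), silent = TRUE)
inherits(res, "try-error")               # TRUE
```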
• Printing an object with attributes now dispatches on the class(es) of the attributes. See ?print.default for the fine print. (PR#2506)

• print.matrix() and prmatrix() are now separate functions. prmatrix() is the old S-compatible function, and print.matrix() is a proper print method, currently identical to print.default(). prmatrix() and the old print.matrix() did not print attributes of a matrix, but the new print.matrix() does.

• print.summary.lm() and print.summary.glm() now default to symbolic.cor = FALSE, but symbolic.cor can be passed to the print methods from the summary methods. print.summary.lm() and print.summary.glm() print correlations to 2 decimal places, and the symbolic printout avoids abbreviating labels.
• If a prompt() method is called with 'filename' as 'NA', a list-style representation of the documentation shell generated is returned. New function promptData() for documenting objects as data sets.

• qqnorm() and qqline() have an optional logical argument 'datax' to transpose the plot (S-PLUS compatibility).

• qr() now has the option to use LAPACK routines, and the results can be used by the helper routines qr.coef(), qr.qy() and qr.qty(). The LAPACK-using versions may be much faster for large matrices (using an optimized BLAS) but are less flexible.

• QR objects now have class "qr", and solve.qr() is now just the method for solve() for the class.
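A small least-squares sketch (data made up) using the LAPACK path together with the qr.* helpers:

```r
# Fit b = 3 + 2*x by QR least squares
a  <- cbind(1, 1:5)            # design matrix: intercept and slope columns
b  <- 3 + 2 * (1:5)
qa <- qr(a, LAPACK = TRUE)
qr.coef(qa, b)                 # recovers the coefficients c(3, 2)
```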
• New function r2dtable() for generating random samples of two-way tables with given marginals, using Patefield's algorithm.
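Each generated table matches the requested row and column sums exactly; for example, with invented marginals:

```r
set.seed(123)
tabs <- r2dtable(3, r = c(3, 2), c = c(2, 3))   # three random 2x2 tables
# verify the fixed marginals of every table
sapply(tabs, function(t) all(rowSums(t) == c(3, 2)) &&
                         all(colSums(t) == c(2, 3)))
```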
• rchisq() now has a non-centrality parameter 'ncp', and there's a C API for rnchisq().

• New generic function reorder() with a dendrogram method, new order.dendrogram() and heatmap().

• require() has a new argument named 'character.only' to make it align with library().

• New functions rmultinom() and dmultinom(), the first one with a C API.
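For instance, the multinomial density of a concrete outcome, and random draws that always sum to the requested size:

```r
dmultinom(c(1, 1), prob = c(0.5, 0.5))    # 2!/(1!1!) * 0.5^2 = 0.5

set.seed(7)
x <- rmultinom(3, size = 10, prob = c(0.2, 0.3, 0.5))
colSums(x)                                # each draw sums to 'size'
```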
• New function runmed() for fast running medians ('modreg' package).

• New function slice.index() for identifying indexes with respect to slices of an array.

• solve.default(a) now gives the dimnames one would expect.

• stepfun() has a new 'right' argument for right-continuous step function construction.
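The 'right' argument controls which side of each interval is closed, i.e. which value the function takes exactly at a knot (toy knots below):

```r
f  <- stepfun(1:2, c(0, 1, 2))                # default: right = FALSE
fr <- stepfun(1:2, c(0, 1, 2), right = TRUE)
# at the knot x = 1 the jump is taken on different sides:
c(f(1), fr(1))
```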
• str() now shows ordered factors different from unordered ones. It also differentiates NA and as.character(NA), also for factor levels.

• symnum() has a new logical argument 'abbr.colnames'.

• summary(<logical>) now mentions NA's, as suggested by Göran Broström.

• summaryRprof() now prints times with a precision appropriate to the sampling interval, rather than always to 2dp.

• New function Sys.getpid() to get the process ID of the R session.

• table() now allows exclude= with factor arguments (requested by Michael Friendly).

• The tempfile() function now takes an optional second argument giving the directory name.

• The ordering of terms for

terms.formula(keep.order = FALSE)

is now defined on the help page and used consistently, so that repeated calls will not alter the ordering (which is why delete.response() was failing, see the bug fixes). The formula is not simplified unless the new argument 'simplify' is true.
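To see the difference, compare the term labels with and without keep.order (the formula is a toy example):

```r
# Default: terms are sorted by interaction order, main effects first
attr(terms(y ~ a:b + c), "term.labels")                     # "c"  "a:b"
# keep.order = TRUE preserves the order as written
attr(terms(y ~ a:b + c, keep.order = TRUE), "term.labels")  # "a:b" "c"
```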
• Added "[" method for "terms" objects.

• New argument 'silent' to try().
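With silent = TRUE the error message is suppressed, but the "try-error" object is still returned for inspection:

```r
res <- try(stop("boom"), silent = TRUE)   # no message printed
class(res)                                # "try-error"
```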
• ts() now allows arbitrary values for y in start/end = c(x, y): it always allowed y < 1, but objected to y > frequency.
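An out-of-range second component is now simply normalised; e.g. "period 13 of 2000" becomes period 1 of 2001:

```r
x <- ts(1:12, start = c(2000, 13), frequency = 12)
start(x)   # 2001 1
```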
• unique.default() now works for POSIXct objects, and hence so does factor().

• Package tcltk now allows return values from the R side to the Tcl side in callbacks and the R_eval command. If the return value from the R function or expression is of class tclObj, then it will be returned to Tcl.

• A new HIGHLY EXPERIMENTAL graphical user interface using the tcltk package is provided. Currently little more than a proof of concept. It can be started by calling R -g Tk (this may change in later versions) or by evaluating tkStartGUI(). Only Unix-like systems for now. It is not too stable at this point; in particular, signal handling is not working properly.

• Changes to support name spaces:

– Placing base in a name space can no longer be disabled by defining the environment variable R_NO_BASE_NAMESPACE.

– New function topenv() to determine the nearest top level environment (usually .GlobalEnv or a name space environment).

– Added name space support for packages that do not use methods.
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass() or setMethod(). New functions isSealedClass() and isSealedMethod() test sealing.
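A minimal sketch (the class name "SealedDemo" is invented for the example): once a class is sealed, an attempt to redefine it should fail.

```r
library(methods)
setClass("SealedDemo", representation(x = "numeric"), sealed = TRUE)
isSealedClass("SealedDemo")   # TRUE

# redefining a sealed class is an error
res <- try(setClass("SealedDemo", representation(y = "numeric")),
           silent = TRUE)
inherits(res, "try-error")
```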
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.

Installation changes

• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.

• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.

• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.

• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.

• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.

• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.

• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.

• Perl requirements are down again to 5.004 or newer.

• Autoconf 2.57 or later is required to build the configure script.

• Configure provides a more comprehensive summary of its results.

• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).

• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame, without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)

• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.

Deprecated & defunct

• The assignment operator '_' will be removed in the next release, and users are now warned on every usage: you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.

• machine(), Machine() and Platform() are defunct.

• restart() is defunct. Use try(), as has long been recommended.

• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.

• printNoClass() (package methods) is deprecated (and moved to base, since it was a copy of a base function).

• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.

• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
bull The Perl code to build help now removes anexisting example file if there are no examplesin the current help file
bull R CMD Rdindex is now deprecated in favor offunction Rdindex() in package tools
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class 'slides'.
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.

• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.

• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.

• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.

• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.

• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.

• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN

by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS, others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand, wrapper and helper functions modified from various originals by interface author.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.

SenSrivastava Collection of datasets from Sen & Srivastava, Regression Analysis, Theory, Methods and Applications, Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.

abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap Hierarchical Clustering and Principal Component Analysis (with general and robust methods). By Antoine Lucas.

anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.

effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood and AdaBoost exponential loss. By Greg Ridgeway.

genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.

gpclib General polygon clipping routines for R, based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.

gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hierpart Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds and point processes. Original S functions by Stuart Coles, R port and R documentation files by Alec Stephenson.

locfit Local Regression, Likelihood and density estimation. By Catherine Loader.

mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer, R port by Brian Ripley.

multidim Multidimensional descriptive statistics, factorial methods and classification. Original by André Carlier & Alain Croquette, R port by Mathieu Ros & Antoine Lucas.

normalp A collection of utilities referred to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

plspcr Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage This package provides functions for image processing, including sobel filter, rank filters, fft, histogram equalization and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.

New Omegahat packages

REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk's event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid omitted; numbered squares 1-27 as referenced in the clues below.]
Across
9 Possessing a file or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (1,3,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics: from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, to graphs and graphical models, database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
R News ISSN 1609-3631
Vol 31 June 2003 34
bull If a prompt() method is called with rsquofilenamersquoas rsquoNArsquo a list-style representation of the doc-umentation shell generated is returned Newfunction promptData() for documenting ob-jects as data sets
bull qqnorm() and qqline() have an optional log-ical argument lsquodataxrsquo to transpose the plot (S-PLUS compatibility)
bull qr() now has the option to use LAPACK rou-tines and the results can be used by the helperroutines qrcoef() qrqy() and qrqty()The LAPACK-using versions may be muchfaster for large matrices (using an optimizedBLAS) but are less flexible
bull QR objects now have class qr andsolveqr() is now just the method for solve()for the class
bull New function r2dtable() for generating ran-dom samples of two-way tables with givenmarginals using Patefieldrsquos algorithm
bull rchisq() now has a non-centrality parameterlsquoncprsquo and therersquos a C API for rnchisq()
bull New generic function reorder() with a den-drogram method new orderdendrogram()and heatmap()
bull require() has a new argument namedcharacteronly to make it align with library
bull New functions rmultinom() and dmultinom()the first one with a C API
bull New function runmed() for fast runnning me-dians (lsquomodregrsquo package)
bull New function sliceindex() for identifyingindexes with respect to slices of an array
bull solvedefault(a) now gives the dimnamesone would expect
bull stepfun() has a new lsquorightrsquo argument forright-continuous step function construction
bull str() now shows ordered factors differentfrom unordered ones It also differentiatesNA and ascharacter(NA) also for factorlevels
bull symnum() has a new logical argumentlsquoabbrcolnamesrsquo
bull summary(ltlogicalgt) now mentions NArsquos assuggested by Goumlran Brostroumlm
bull summaryRprof() now prints times with a pre-cision appropriate to the sampling intervalrather than always to 2dp
bull New function Sysgetpid() to get the processID of the R session
bull table() now allows exclude= with factor argu-ments (requested by Michael Friendly)
bull The tempfile() function now takes an op-tional second argument giving the directoryname
bull The ordering of terms for
termsformula(keeporder=FALSE)
is now defined on the help page and used con-sistently so that repeated calls will not alter theordering (which is why deleteresponse()was failing see the bug fixes) The formula isnot simplified unless the new argument lsquosim-plifyrsquo is true
bull added [ method for terms objects
bull New argument lsquosilentrsquo to try()
bull ts() now allows arbitrary values for y instartend = c(x y) it always allowed y lt1 but objected to y gt frequency
bull uniquedefault() now works for POSIXct ob-jects and hence so does factor()
bull Package tcltk now allows return values fromthe R side to the Tcl side in callbacks and theR_eval command If the return value from theR function or expression is of class tclObj thenit will be returned to Tcl
bull A new HIGHLY EXPERIMENTAL graphicaluser interface using the tcltk package is pro-vided Currently little more than a proof ofconcept It can be started by calling R -g Tk(this may change in later versions) or by eval-uating tkStartGUI() Only Unix-like systemsfor now It is not too stable at this point in par-ticular signal handling is not working prop-erly
bull Changes to support name spaces
ndash Placing base in a name spacecan no longer be disabled bydefining the environment variableR_NO_BASE_NAMESPACE
ndash New function topenv() to determine thenearest top level environment (usuallyGlobalEnv or a name space environ-ment)
ndash Added name space support for packagesthat do not use methods
R News ISSN 1609-3631
Vol. 3/1, June 2003
• Formal classes and methods can be 'sealed', by using the corresponding argument to setClass or setMethod. New functions isSealedClass() and isSealedMethod() test sealing.
• Packages can now be loaded with version numbers. This allows for multiple versions of files to be installed (and potentially loaded). Some serious testing will be going on, but it should have no effect unless specifically asked for.
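A brief sketch of sealing a formal class (the class name 'Interval' is a made-up example):

```r
library(methods)

## Define a class and seal it, so it cannot later be redefined:
## a second setClass("Interval", ...) would signal an error.
setClass("Interval",
         representation(lo = "numeric", hi = "numeric"),
         sealed = TRUE)

isSealedClass("Interval")   # TRUE
```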
Installation changes
• TITLE files in packages are no longer used, the Title field in the DESCRIPTION file being preferred. TITLE files will be ignored in both installed packages and source packages.
• When searching for a Fortran 77 compiler, configure by default now also looks for Fujitsu's frt and Compaq's fort, but no longer for cf77 and cft77.
• Configure checks that mixed C/Fortran code can be run before checking compatibility on ints and doubles; the latter test was sometimes failing because the Fortran libraries were not found.
• PCRE and bzip2 are built from versions in the R sources if the appropriate library is not found.
• New configure option '--with-lapack' to allow high-performance LAPACK libraries to be used: a generic LAPACK library will be used if found. This option is not the default.
• New configure options '--with-libpng', '--with-jpeglib', '--with-zlib', '--with-bzlib' and '--with-pcre', principally to allow these libraries to be avoided if they are unsuitable.
• If the precious variable R_BROWSER is set at configure time, it overrides the automatic selection of the default browser. It should be set to the full path unless the browser appears at different locations on different client machines.
• Perl requirements are down again to 5.004 or newer.
• Autoconf 2.57 or later is required to build the configure script.
• Configure provides a more comprehensive summary of its results.
• Index generation now happens when installing source packages using R code in package tools. An existing 'INDEX' file is used as is; otherwise, it is automatically generated from the \name and \title entries in the Rd files. Data, demo and vignette indices are computed from all available files of the respective kind and the corresponding index information (in the Rd files, the 'demo/00Index' file, and the \VignetteIndexEntry{} entries, respectively). These index files as well as the package Rd contents data base are serialized as R objects in the 'Meta' subdirectory of the top-level package directory, allowing for faster and more reliable index-based computations (e.g., in help.search()).
• The Rd contents data base is now computed when installing source packages using R code in package tools. The information is represented as a data frame without collapsing the aliases and keywords, and serialized as an R object. (The 'CONTENTS' file in Debian Control Format is still written, as it is used by the HTML search engine.)
• A NAMESPACE file in the root directory of a source package is copied to the root of the package installation directory. Attempting to install a package with a NAMESPACE file using '--save' signals an error; this is a temporary measure.
Deprecated & defunct
• The assignment operator '_' will be removed in the next release, and users are now warned on every usage; you may even see multiple warnings for each usage. If the environment variable R_NO_UNDERLINE is set to anything of positive length, then use of '_' becomes a syntax error.
• machine(), Machine() and Platform() are defunct.
• restart() is defunct. Use try(), as has long been recommended.
• The deprecated arguments 'pkg' and 'lib' of system.file() have been removed.
• printNoClass() (methods) is deprecated (and moved to base, since it was a copy of a base function).
• Primitives dataClass() and objWithClass() have been replaced by class() and class<-(); they were internal support functions for use by package methods.
• The use of SIGUSR2 to quit a running R process under Unix is deprecated: the signal may need to be reclaimed for other purposes.
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This makes it possible to fix some vertical spacing problems when using the LaTeX class 'slides'.
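The emitted structure looks roughly like this (a sketch of Sweave's LaTeX output for a single chunk):

```latex
\begin{Schunk}
\begin{Sinput}
> 1 + 1
\end{Sinput}
\begin{Soutput}
[1] 2
\end{Soutput}
\end{Schunk}
```

Because Sinput and Soutput are now wrapped in a single Schunk environment, vertical spacing can be adjusted for the chunk as a whole rather than per environment.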
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
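In a package's src/Makevars this becomes (a minimal sketch; src/Makevars is the standard location, the package itself hypothetical):

```make
## src/Makevars: link the package's compiled code against R's LAPACK
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS)
```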
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h, and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.
GRASS Interface between GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
RSvgDevice A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in
the book. By Kjetil Halvorsen.
abind Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.
ade4 Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.
amap Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.
anm The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.
clim.pact The package contains R functions for retrieving data, making climate analysis and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.
dispmod Functions for modelling dispersion in GLM. By Luca Scrucca.
effects Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.
gbm This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.
genetics Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.
glmmML A Maximum Likelihood approach to mixed models. By Göran Broström.
gpclib General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.
grasper R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick, J.McC. Overton; ported to R by F. Fivaz.
gstat Variogram modelling; simple, ordinary and universal point or block (co)kriging, and sequential Gaussian or indicator (co)simulation; variogram and map plotting utility functions. By Edzer J. Pebesma and others.
hier.part Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.
hwde Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.
ismev Functions to support the computations carried out in "An Introduction to Statistical Modeling of Extreme Values" by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.
locfit Local regression, likelihood and density estimation. By Catherine Loader.
mimR An R interface to MIM for graphical modelling in R. By Søren Højsgaard.
mix Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.
multidim Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.
normalp A collection of utilities referring to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.
pls.pcr Multivariate regression by PLS and PCR. By Ron Wehrens.
polspline Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.
rimage This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth.
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR50 is on offer. See http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid not reproduced in this text version.]
Across
9 Possessing a file, or boat we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)

Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (13,5)
7 Failed or applied (3,3)
8 International body member is ordered back before baker, but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac
Recent Events
DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user-interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting. Tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details
Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk
1 2 3 4 5 6 7 8
9 10
11 12
13 14
15
16 17 18
19 20 21 22
23
24 25
26 27
Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)
Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac
R News ISSN 1609-3631
Vol 31 June 2003 40
Recent EventsDSC 2003
Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation
Two events took place in Vienna just prior to DSC
2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening
While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants
Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo
Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende
Editor-in-ChiefFriedrich LeischInstitut fuumlr Statistik und WahrscheinlichkeitstheorieTechnische Universitaumlt WienWiedner Hauptstraszlige 8-101071A-1040 Wien Austria
Editorial BoardDouglas Bates and Thomas Lumley
Editor Programmerrsquos NicheBill Venables
Editor Help DeskUwe Ligges
Email of editors and editorial boardfirstnamelastname R-projectorg
R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)
R Project HomepagehttpwwwR-projectorg
This newsletter is available online athttpCRANR-projectorgdocRnews
R News ISSN 1609-3631
Vol. 3/1, June 2003
Utilities
• R CMD check more compactly displays the tests of DESCRIPTION meta-information. It now reports demos and vignettes without available index information. Unless installation tests are skipped, checking is aborted if the package dependencies cannot be resolved at run time. Rd files are now also explicitly checked for empty \name and \title entries. The examples are always run with T and F redefined to give an error if used instead of TRUE and FALSE.
• The Perl code to build help now removes an existing example file if there are no examples in the current help file.
• R CMD Rdindex is now deprecated in favor of function Rdindex() in package tools.
• Sweave() now encloses the Sinput and Soutput environments of each chunk in an Schunk environment. This allows one to fix some vertical spacing problems when using the LaTeX class slides.
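As a sketch of what this produces (the chunk content here is illustrative), each echoed chunk is now wrapped as:

```latex
\begin{Schunk}
\begin{Sinput}
> 1 + 1
\end{Sinput}
\begin{Soutput}
[1] 2
\end{Soutput}
\end{Schunk}
```

Since the whole chunk sits in one environment, a document can redefine Schunk in its preamble to adjust the vertical spacing around code chunks.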
C-level facilities
• A full double-precision LAPACK shared library is made available as -lRlapack. To use this, include $(LAPACK_LIBS) $(BLAS_LIBS) in PKG_LIBS.
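For a package with compiled code, this amounts to a one-line src/Makevars fragment (sketched here; the comment describes the intent):

```make
# src/Makevars: link the package's shared library against
# R's full LAPACK library and the BLAS
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS)
```

R CMD SHLIB then picks up PKG_LIBS when linking the package's shared library.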
• Header file R_ext/Lapack.h added. C declarations of BLAS routines moved to R_ext/BLAS.h and included in R_ext/Applic.h and R_ext/Linpack.h for backward compatibility.
• R will automatically call initialization and unload routines, if present, in shared libraries/DLLs during dyn.load() and dyn.unload() calls. The routines are named R_init_<dll name> and R_unload_<dll name>, respectively. See the Writing R Extensions manual for more information.
• Routines exported directly from the R executable for use with .C(), .Call(), .Fortran() and .External() are now accessed via the registration mechanism (optionally) used by packages. The ROUTINES file (in src/appl) and associated scripts to generate FFTab.h and FFDecl.h are no longer used.
• Entry point Rf_append is no longer in the installed headers (but is still available). It is apparently unused.
• Many conflicts between other headers and R's can be avoided by defining STRICT_R_HEADERS and/or R_NO_REMAP; see 'Writing R Extensions' for details.
• New entry point R_GetX11Image and formerly undocumented ptr_R_GetX11Image are in new header R_ext/GetX11Image.h. These are used by package tkrplot.
Changes on CRAN
by Kurt Hornik and Friedrich Leisch
New contributed packages
Davies: Useful functions for the Davies quantile function and the Generalized Lambda distribution. By Robin Hankin.

GRASS: Interface between the GRASS 5.0 geographical information system and R, based on starting R from within the GRASS environment, using values of environment variables set in the GISRC file. Interface examples should be run outside GRASS; others may be run within. Wrapper and helper functions are provided for a range of R functions to match the interface metadata structures. Interface functions by Roger Bivand; wrapper and helper functions modified from various originals by the interface author.

MCMCpack: This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.3. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

RSvgDevice: A graphics device for R that uses the new w3.org XML standard for Scalable Vector Graphics. By T. Jake Luciani.
SenSrivastava: Collection of datasets from Sen & Srivastava, "Regression Analysis: Theory, Methods and Applications", Springer. Sources for individual data files are more fully documented in the book. By Kjetil Halvorsen.
abind: Combine multi-dimensional arrays. This is a generalization of cbind and rbind. Takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension. By Tony Plate and Richard Heiberger.

ade4: Multivariate data analysis and graphical display. By Jean Thioulouse, Anne-Beatrice Dufour and Daniel Chessel.

amap: Hierarchical clustering and principal component analysis (with general and robust methods). By Antoine Lucas.

anm: The package contains an analog model for statistical/empirical downscaling. By Alexandra Imbert & Rasmus E. Benestad.

climpact: The package contains R functions for retrieving data, making climate analyses, and downscaling of monthly mean and daily mean global climate scenarios. By Rasmus E. Benestad.

dispmod: Functions for modelling dispersion in GLMs. By Luca Scrucca.

effects: Graphical and tabular effect displays, e.g., of interactions, for linear and generalised linear models. By John Fox.

gbm: This package implements extensions to Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. By Greg Ridgeway.

genetics: Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, computing disequilibrium, and testing Hardy-Weinberg equilibrium. By Gregory Warnes and Friedrich Leisch.

glmmML: A maximum likelihood approach to mixed models. By Göran Broström.

gpclib: General polygon clipping routines for R based on Alan Murta's C library. R interface by Roger D. Peng; GPC library by Alan Murta.

grasper: R-GRASP uses generalized regression analyses to automate the production of spatial predictions. By A. Lehmann, J.R. Leathwick and J.McC. Overton; ported to R by F. Fivaz.

gstat: Variogram modelling; simple, ordinary and universal point or block (co)kriging; and sequential Gaussian or indicator (co)simulation. Variogram and map plotting utility functions. By Edzer J. Pebesma and others.

hier.part: Variance partition of a multivariate data set. By Chris Walsh and Ralph MacNally.

hwde: Fits models for genotypic disequilibria, as described in Huttley and Wilson (2000), Weir (1996) and Weir and Wilson (1986). Contrast terms are available that account for first order interactions between loci. By J.H. Maindonald.

ismev: Functions to support the computations carried out in 'An Introduction to Statistical Modeling of Extreme Values' by Stuart Coles. The functions may be divided into the following groups: maxima/minima, order statistics, peaks over thresholds, and point processes. Original S functions by Stuart Coles; R port and R documentation files by Alec Stephenson.

locfit: Local regression, likelihood and density estimation. By Catherine Loader.

mimR: An R interface to MIM for graphical modelling in R. By Søren Højsgaard.

mix: Estimation/multiple imputation programs for mixed categorical and continuous data. Original by Joseph L. Schafer; R port by Brian Ripley.
multidim: Multidimensional descriptive statistics: factorial methods and classification. Original by André Carlier & Alain Croquette; R port by Mathieu Ros & Antoine Lucas.

normalp: A collection of utilities related to normal of order p distributions (General Error Distributions). By Angelo M. Mineo.

pls.pcr: Multivariate regression by PLS and PCR. By Ron Wehrens.

polspline: Routines for the polynomial spline fitting routines hazard regression, hazard estimation with flexible tails, logspline, lspec, polyclass and polymars, by C. Kooperberg and co-authors. By Charles Kooperberg.

rimage: This package provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files. This package requires fftw (http://www.fftw.org) and libjpeg (http://www.ijg.org). By Tomomi Takashina.
session: Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes.

snow: Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini and Na Li.

statmod: Miscellaneous biostatistical modelling functions. By Gordon Smyth.

survey: Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley.

vcd: Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou and Kurt Hornik.
New Omegahat packages
REventLoop: An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang.

RGdkPixbuf: S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang.

RGtkExtra: A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list and directory tree. By Duncan Temple Lang.

RGtkGlade: S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang.

RGtkHTML: A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang.

RGtk: Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang.

SWinRegistry: Provides access from within R to read and write the Windows registry. By Duncan Temple Lang.

SWinTypeLibs: This provides ways to extract type information from type libraries and/or DCOM objects that describe the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or automatically generate S and C code to interface to such objects. By Duncan Temple Lang.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword
by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer; see http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.

Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid omitted; entries are numbered 1-27.]
Across
9. Possessing a file or boat, we hear (9)
10, 20 down. Developer code is binary peril (5,6)
11. Maturer code has no need for this (7)
12. And a volcano implodes moderately slowly (7)
13. Partly not an hyperbolic function (4)
14. Give 500 euro to R Foundation, then stir your café Breton (10)
16. Catches crooked parents (7)
17, 23 down. Developer code bloated Gauss (7,5)
19. Fast header code is re-entrant (6-4)
22, 4 down. Developer makes toilet compartments (4,8)
24. Former docks ship files (7)
25. Analysis of a skinned banana and eggs (2,5)
26. Rearrange array or a hairstyle (5)
27. Atomic number 13 is a component (9)

Down
1. Scots king goes with lady president (6,9)
2. Assemble rain tent for private space (8)
3. Output sounds sound (5)
4. See 22 ac
5. Climb tree to get new software (6)
6. Naughty number could be 8 down (1,3,5)
7. Failed or applied (3,3)
8. International body member is ordered back before baker, but can't be shown (15)
15. I hear you consort with a skeleton, having rows (9)
17. Left dead encrypted file in zip archive (8)
18. Guards riot loot souk (8)
20. See 10 ac
21. A computer sets my puzzle (6)
23. See 17 ac
Recent Events

DSC 2003
Inspired by the success of its predecessors in 1999 and 2001, the Austrian Association for Statistical Computing and the R Foundation for Statistical Computing jointly organised the third workshop on distributed statistical computing, which took place at the Technische Universität in Vienna from March 20 to March 22, 2003. More than 150 participants from all over the world enjoyed this most interesting event. The talks presented at DSC 2003 covered a wide range of topics, from bioinformatics, visualisation, resampling and combining methods, spatial statistics, user interfaces and office integration, graphs and graphical models, to database connectivity and applications. Compared to DSC 2001, a greater fraction of talks addressed problems not directly related to R; however, most speakers mentioned R as their favourite computing engine. An online proceedings volume is currently under preparation.
Two events took place in Vienna just prior to DSC 2003. Future directions of R development were discussed at a two-day open R-core meeting, and tutorials on the Bioconductor project, R programming and R graphics attracted a lot of interest the day before the conference opening.
While the talks were given in the beautiful rooms of the main building of the Technische Universität at Karlsplatz in the city centre, the fruitful discussions continued in Vienna's beer pubs and at the conference dinner in one of the famous Heurigen restaurants.
Finally, one participant summarised this event most to the point in an email sent to me right after DSC 2003: "At most statistics conferences, over 90% of the stuff is not interesting to me; at this one, over 90% of it was interesting."
Torsten Hothorn
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Torsten.Hothorn@rzmail.uni-erlangen.de
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
Editorial Board: Douglas Bates and Thomas Lumley.

Editor Programmer's Niche: Bill Venables.

Editor Help Desk: Uwe Ligges.
Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage: http://www.R-project.org

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
R News ISSN 1609-3631
Vol 31 June 2003 37
the book By Kjetil Halvorsen
abind Combine multi-dimensional arrays This isa generalization of cbind and rbind Takesa sequence of vectors matrices or arraysand produces a single array of the same orhigher dimension By Tony Plate and RichardHeiberger
ade4 Multivariate data analysis and graphical dis-play By Jean Thioulouse Anne-Beatrice Du-four and Daniel Chessel
amap Hierarchical Clustering and Principal Compo-nent Analysis (With general and robusts meth-ods) By Antoine Lucas
anm The package contains an analog model for sta-tisticalempirical downscaling By AlexandraImbert amp Rasmus E Benestad
climpact The package contains R functions for re-trieving data making climate analysis anddownscaling of monthly mean and daily meanglobal climate scenarios By Rasmus E Benes-tad
dispmod Functions for modelling dispersion inGLM By Luca Scrucca
effects Graphical and tabular effect displays eg ofinteractions for linear and generalised linearmodels By John Fox
gbm This package implements extensions to Fre-und and Schapirersquos AdaBoost algorithm andJ Friedmanrsquos gradient boosting machine In-cludes regression methods for least squaresabsolute loss logistic Poisson Cox propor-tional hazards partial likelihood and AdaBoostexponential loss By Greg Ridgeway
genetics Classes and methods for handling geneticdata Includes classes to represent geno-types and haplotypes at single markers up tomultiple markers on multiple chromosomesFunction include allele frequencies flagginghomoheterozygotes flagging carriers of cer-tain alleles computing disequilibrium testingHardy-Weinberg equilibrium By GregoryWarnes and Friedrich Leisch
glmmML A Maximum Likelihood approach tomixed models By Goumlran Brostroumlm
gpclib General polygon clipping routines for Rbased on Alan Murtarsquos C library By R interfaceby Roger D Peng GPC library by Alan Murta
grasper R-GRASP uses generalized regressionsanalyses to automate the production of spatialpredictions By A Lehmann JR LeathwickJMcC Overton ported to R by F Fivaz
gstat variogram modelling simple ordinary anduniversal point or block (co)kriging and se-quential Gaussian or indicator (co)simulationvariogram and map plotting utility functionsBy Edzer J Pebesma and others
hierpart Variance partition of a multivariate dataset By Chris Walsh and Ralph MacNally
hwde Fits models for genotypic disequilibria as de-scribed in Huttley and Wilson (2000) Weir(1996) and Weir and Wilson (1986) Contrastterms are available that account for first orderinteractions between loci By JH Maindonald
ismev Functions to support the computations car-ried out in lsquoAn Introduction to Statistical Mod-eling of Extreme Valuesrsquo by Stuart Coles Thefunctions may be divided into the follow-ing groups maximaminima order statisticspeaks over thresholds and point processesOriginal S functions by Stuart Coles R port andR documentation files by Alec Stephenson
locfit Local Regression Likelihood and density esti-mation By Catherine Loader
mimR An R interface to MIM for graphical mod-elling in R By Soslashren Hoslashjsgaard
mix Estimationmultiple imputation programs formixed categorical and continuous data Origi-nal by Joseph L Schafer R port by Brian Ripley
multidim multidimensional descritptive statisticsfactorial methods and classification Originalby Andreacute Carlier amp Alain Croquette R port byMathieu Ros amp Antoine Lucas
normalp A collection of utilities refered to normalof order p distributions (General Error Distri-butions) By Angelo M Mineo
plspcr Multivariate regression by PLS and PCR ByRon Wehrens
polspline Routines for the polynomial spline fit-ting routines hazard regression hazard estima-tion with flexible tails logspline lspec poly-class and polymars by C Kooperberg and co-authors By Charles Kooperberg
rimage This package provides functions for imageprocessing including sobel filter rank filtersfft histogram equalization and reading JPEGfile This package requires fftw httpwwwfftworg and libjpeg httpwwwijgorggtBy Tomomi Takashina
R News ISSN 1609-3631
Vol 31 June 2003 38
session Utility functions for interacting with R pro-cesses from external programs This packageincludes functions to save and restore sessioninformation (including loaded packages andattached data objects) as well as functions toevaluate strings containing R commands andreturn the printed results or an execution tran-script By Gregory R Warnes
snow Support for simple parallel computing in RBy Luke Tierney A J Rossini and Na Li
statmod Miscellaneous biostatistical modellingfunctions By Gordon Smyth
survey Summary statistics generalised linear mod-els and general maximum likelihood estima-tion for stratified cluster-sampled unequallyweighted survey samples By Thomas Lumley
vcd Functions and data sets based on the bookldquoVisualizing Categorical Datardquo by MichaelFriendly By David Meyer Achim ZeileisAlexandros Karatzoglou and Kurt Hornik
New Omegahat packages
REventLoop An abstract event loop mechanism thatis toolkit independent and can be used to to re-place the R event loop This allows one to usethe Gtk or TclTkrsquos event loop which gives bet-ter response and coverage of all event sources(eg idle events timers etc) By Duncan Tem-ple Lang
RGdkPixbuf S language functions to access the fa-cilities in GdkPixbuf for manipulating imagesThis is used for loading icons to be used in wid-gets such as buttons HTML renderers etc ByDuncan Temple Lang
RGtkExtra A collection of S functions that providean interface to the widgets in the gtk+extra li-brary such as the GtkSheet data-grid display
icon list file list and directory tree By DuncanTemple Lang
RGtkGlade S language bindings providing an inter-face to Glade the interactive Gnome GUI cre-ator This allows one to instantiate GUIs builtwith the interactive Glade tool directly withinS The callbacks can be specified as S functionswithin the GUI builder By Duncan TempleLang
RGtkHTML A collection of S functions that pro-vide an interface to creating and controlling anHTML widget which can be used to displayHTML documents from files or content gener-ated dynamically in S This also supports em-bedded objects created and managed within Scode include graphics devices etc By DuncanTemple Lang
RGtk Facilities in the S language for programminggraphical interfaces using Gtk the Gnome GUItoolkit By Duncan Temple Lang
SWinRegistry Provides access from within R to readand write the Windows registry By DuncanTemple Lang
SWinTypeLibs This provides ways to extract typeinformation from type libraries andor DCOMobjects that describes the methods propertiesetc of an interface This can be used to discoveravailable facilities in a COM object or automat-ically generate S and C code to interface to suchobjects By Duncan Temple Lang
Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg
Friedrich LeischTechnische Universitaumlt Wien AustriaFriedrichLeischR-projectorg
R News ISSN 1609-3631
Vol 31 June 2003 39
Crosswordby Barry Rowlingson
Following a conversation with 1 down at DSC 2003post-conference dinner I have constructed a cross-word for R-news Most of the clues are based onmatters relating to R or computing in general Theclues are lsquocrypticrsquo in that they rely on word-play but
most will contain a definition somewhere A prizeof EUR50 is on offer See httpwwwmathslancsacuk~rowlingsCrossword for details
Barry RowlingsonLancaster University UKBRowlingsonlancasteracuk
1 2 3 4 5 6 7 8
9 10
11 12
13 14
15
16 17 18
19 20 21 22
23
24 25
26 27
Across9 Possessing a file or boat we hear (9)10 20 down Developer code is binary peril (56)11 Maturer code has no need for this (7)12 And a volcano implodes moderately slowly (7)13 Partly not an hyperbolic function (4)14 Give 500 euro to R Foundation then stir your cafeacute Breton(10)16 Catches crooked parents (7)17 23 down Developer code bloated Gauss (75)19 Fast header code is re-entrant (6-4)22 4 down Developer makes toilet compartments (48)24 Former docks ship files (7)25 Analysis of a skinned banana and eggs (25)26 Rearrange array or a hairstyle (5)27 Atomic number 13 is a component (9)
Down1 Scots king goes with lady president (69)2 Assemble rain tent for private space (8)3 Output sounds sound (5)4 See 22ac5 Climb tree to get new software (6)6 Naughty number could be 8 down (135)7 Failed or applied (33)8 International body member is ordered back before baker butcanrsquot be shown (15)15 I hear you consort with a skeleton having rows (9)17 Left dead encrypted file in zip archive (8)18 Guards riot loot souk (8)20 See 10ac21 A computer sets my puzzle (6)23 See 17ac
R News ISSN 1609-3631
Vol 31 June 2003 40
Recent EventsDSC 2003
Inspired by the success of its predecessors in 1999and 2001 the Austrian Association for StatisticalComputing and the R Foundation for StatisticalComputing jointly organised the third workshop ondistributed statistical computing which took placeat the Technische Universitaumlt Vienna from March20 to March 22 2003 More than 150 participantsfrom all over the world enjoyed this most interest-ing event The talks presented at DSC 2003 cov-ered a wide range of topics from bioinformatics vi-sualisation resampling and combine methods spa-tial statistics user-interfaces and office integrationgraphs and graphical models database connectivityand applications Compared to DSC 2001 a greaterfraction of talks addressed problems not directly re-lated to R however most speakers mentioned R astheir favourite computing engine An online pro-ceedings volume is currently under preparation
Two events took place in Vienna just prior to DSC
2003 Future directions of R developments were dis-cussed at a two-day open R-core meeting Tutorialson the Bioconductor project R programming and Rgraphics attended a lot of interest the day before theconference opening
While the talks were given in the beautiful roomsof the main building of the Technische Universitaumlt atKarlsplatz in the city centre the fruitful discussionscontinued in Viennarsquos beer pubs and at the confer-ence dinner in one of the famous Heurigen restau-rants
Finally one participant summarised this eventmost to the point in an email sent to me right afterDSC 2003 rdquoAt most statistics conferences over 90of the stuff is not interesting to me at this one over90 of it was interestingrdquo
Torsten HothornFriedrich-Alexander-Universitaumlt Erlangen-NuumlrnbergGermanyTorstenHothornrzmailuni-erlangende
Editor-in-Chief:
Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editorial Board:
Douglas Bates and Thomas Lumley

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, and all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631
Vol. 3/1, June 2003
session Utility functions for interacting with R processes from external programs. This package includes functions to save and restore session information (including loaded packages and attached data objects), as well as functions to evaluate strings containing R commands and return the printed results or an execution transcript. By Gregory R. Warnes
snow Support for simple parallel computing in R. By Luke Tierney, A. J. Rossini, and Na Li
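The snow interface is deliberately small: start a cluster of worker processes, farm work out to them, and shut them down. A minimal sketch, assuming the snow package is installed and socket connections between processes are permitted, might look like this:

```r
library(snow)

# start two R worker processes connected via sockets
cl <- makeCluster(2, type = "SOCK")

# distribute the elements of 1:4 over the workers and
# apply the function to each element in parallel
squares <- parLapply(cl, 1:4, function(x) x^2)

# always shut the workers down again
stopCluster(cl)
```

Other cluster types (e.g., MPI or PVM) can be selected via the `type` argument where the corresponding support packages are available.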
statmod Miscellaneous biostatistical modelling functions. By Gordon Smyth
survey Summary statistics, generalised linear models, and general maximum likelihood estimation for stratified, cluster-sampled, unequally weighted survey samples. By Thomas Lumley
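The survey package works by declaring the sampling design once, in a design object, and passing that object to the analysis functions. As a hedged sketch, using the apistrat example data set shipped with the package (a stratified sample of Californian schools; the variable names below come from that data set):

```r
library(survey)
data(api)  # provides apistrat, among other example data sets

# declare the design: strata given by school type (stype),
# sampling weights in pw, no clustering (ids = ~1)
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat)

# design-based mean and standard error of the API score
svymean(~api00, dstrat)
```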
vcd Functions and data sets based on the book "Visualizing Categorical Data" by Michael Friendly. By David Meyer, Achim Zeileis, Alexandros Karatzoglou, and Kurt Hornik
New Omegahat packages
REventLoop An abstract event loop mechanism that is toolkit independent and can be used to replace the R event loop. This allows one to use the Gtk or Tcl/Tk event loop, which gives better response and coverage of all event sources (e.g., idle events, timers, etc.). By Duncan Temple Lang
RGdkPixbuf S language functions to access the facilities in GdkPixbuf for manipulating images. This is used for loading icons to be used in widgets such as buttons, HTML renderers, etc. By Duncan Temple Lang
RGtkExtra A collection of S functions that provide an interface to the widgets in the gtk+extra library, such as the GtkSheet data-grid display, icon list, file list, and directory tree. By Duncan Temple Lang
RGtkGlade S language bindings providing an interface to Glade, the interactive Gnome GUI creator. This allows one to instantiate GUIs built with the interactive Glade tool directly within S. The callbacks can be specified as S functions within the GUI builder. By Duncan Temple Lang
RGtkHTML A collection of S functions that provide an interface to creating and controlling an HTML widget, which can be used to display HTML documents from files or content generated dynamically in S. This also supports embedded objects created and managed within S code, including graphics devices, etc. By Duncan Temple Lang
RGtk Facilities in the S language for programming graphical interfaces using Gtk, the Gnome GUI toolkit. By Duncan Temple Lang
SWinRegistry Provides access from within R to read and write the Windows registry. By Duncan Temple Lang
SWinTypeLibs This provides ways to extract type information from type libraries and/or DCOM objects that describes the methods, properties, etc. of an interface. This can be used to discover available facilities in a COM object or to automatically generate S and C code to interface to such objects. By Duncan Temple Lang
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org
Crossword

by Barry Rowlingson
Following a conversation with 1 down at the DSC 2003 post-conference dinner, I have constructed a crossword for R News. Most of the clues are based on matters relating to R or computing in general. The clues are 'cryptic' in that they rely on word-play, but most will contain a definition somewhere. A prize of EUR 50 is on offer; see http://www.maths.lancs.ac.uk/~rowlings/Crossword for details.
Barry Rowlingson
Lancaster University, UK
B.Rowlingson@lancaster.ac.uk
[Crossword grid omitted in this text version; the entries are numbered 1-27. A printable version is available at the URL given above.]
Across
9 Possessing a file or boat, we hear (9)
10, 20 down Developer code is binary peril (5,6)
11 Maturer code has no need for this (7)
12 And a volcano implodes moderately slowly (7)
13 Partly not an hyperbolic function (4)
14 Give 500 euro to R Foundation, then stir your café Breton (10)
16 Catches crooked parents (7)
17, 23 down Developer code bloated Gauss (7,5)
19 Fast header code is re-entrant (6-4)
22, 4 down Developer makes toilet compartments (4,8)
24 Former docks ship files (7)
25 Analysis of a skinned banana and eggs (2,5)
26 Rearrange array or a hairstyle (5)
27 Atomic number 13 is a component (9)
Down
1 Scots king goes with lady president (6,9)
2 Assemble rain tent for private space (8)
3 Output sounds sound (5)
4 See 22ac
5 Climb tree to get new software (6)
6 Naughty number could be 8 down (135)
7 Failed or applied (3,3)
8 International body member is ordered back before baker but can't be shown (15)
15 I hear you consort with a skeleton, having rows (9)
17 Left dead encrypted file in zip archive (8)
18 Guards riot loot souk (8)
20 See 10ac
21 A computer sets my puzzle (6)
23 See 17ac