
R News: The Newsletter of the R Project, Volume 7/3, December 2007

Editorial

by Torsten Hothorn

Shortly before the end of 2007 it's a great pleasure for me to welcome you to the third and Christmas issue of R News. Also, it is the last issue for me as editorial board member and, before John Fox takes over as editor-in-chief, I would like to thank Doug Bates, Paul Murrell, John Fox and Vince Carey, whom I had the pleasure to work with during the last three years.

It is amazing to see how many new packages have been submitted to CRAN since October, when Kurt Hornik previously provided us with the latest CRAN news. Kurt's new list starts at page 57. Most of us have already installed the first patch release in the 2.6.0 series. The most important facts about R 2.6.1 are given on page 56.

The contributed papers in the present issue may be divided into two groups. The first group focuses on applications and the second group reports on tools that may help to interact with R. Sanford Weisberg and Hadley Wickham give us some hints for when our brain fails to remember the name of some important R function. Patrick Mair and Reinhold Hatzinger started a CRAN Psychometrics Task View and give us a snapshot of current developments. Robin Hankin deals with very large numbers in R using his Brobdingnag package. Two papers focus on graphical user interfaces. From a high-level point of view, John Fox shows how the functionality of his R Commander can be extended by plug-in packages. John Verzani gives an introduction to low-level GUI programming using the gWidgets package.

Applications presented here include a study on the performance of financial advice given in the Mad Money television show on CNBC, as investigated by Bill Alpert. Hee-Seok Oh and Donghoh Kim present a package for the analysis of scattered spherical data, such as certain environmental conditions measured over some area. Sebastián Luque follows aquatic animals into the depth of the sea and analyzes their diving behaviour. Three packages concentrate on bringing modern statistical methodology to our computers: parametric and semi-parametric Bayesian inference is implemented in the DPpackage by Alejandro Jara, Guido Schwarzer reports on the meta package for meta-analysis and, finally, a new version of the well-known multtest package is described by Sandra L. Taylor and her colleagues.

The editorial board wants to thank all authors and referees who worked with us in 2007 and wishes all of you a Merry Christmas and a Happy New Year 2008!

Torsten Hothorn
Ludwig-Maximilians-Universität München
[email protected]

    Contents of this issue:

Editorial . . . 1
SpherWave: An R Package for Analyzing Scattered Spherical Data by Spherical Wavelets . . . 2
Diving Behaviour Analysis in R . . . 8
Very Large Numbers in R: Introducing Package Brobdingnag . . . 15
Applied Bayesian Non- and Semi-parametric Inference using DPpackage . . . 17
An Introduction to gWidgets . . . 26
Financial Journalism with R . . . 34
Need A Hint? . . . 36
Psychometrics Task View . . . 38
meta: An R Package for Meta-Analysis . . . 40
Extending the R Commander by Plug-In Packages . . . 46
Improvements to the Multiple Testing Package multtest . . . 52
Changes in R 2.6.1 . . . 56
Changes on CRAN . . . 57


SpherWave: An R Package for Analyzing Scattered Spherical Data by Spherical Wavelets

by Hee-Seok Oh and Donghoh Kim

Introduction

Given scattered surface air temperatures observed on the globe, we would like to estimate the temperature field for every location on the globe. Since the temperature data have inherent multiscale characteristics, spherical wavelets with localization properties are particularly effective in representing multiscale structures. Spherical wavelets have been introduced in Narcowich and Ward (1996) and Li (1999). A successful statistical application has been demonstrated in Oh and Li (2004).

SpherWave is an R package implementing the spherical wavelets (SWs) introduced by Li (1999) and the SW-based spatially adaptive methods proposed by Oh and Li (2004). This article provides a general description of SWs and their statistical applications, and it explains the use of the SpherWave package through an example using real data.

Before explaining the algorithm in detail, we first consider the average surface air temperatures (in degrees Celsius) during the period from December 1967 to February 1968 observed at 939 weather stations, as illustrated in Figure 1.

Figure 1: Average surface air temperatures observed at 939 weather stations during the years 1967-1968.

In the SpherWave package, the data are obtained by

    > library("SpherWave")

    > ### Temperature data from year 1961 to 1990

    > ### list of year, grid, observation

    > data("temperature")

    > temp67 latlon sw.plot(z=temp67, latlon=latlon, type="obs",

    + xlab="", ylab="")

Similarly, various signals such as meteorological or geophysical signals in nature can be measured at scattered and unevenly distributed locations. However, inferring the substantial effect of such signals at an arbitrary location on the globe is a crucial task. The first objective of using SWs is to estimate the signal at an arbitrary location on the globe by extrapolating the scattered observations. An example is the representation in Figure 2, which is obtained by extrapolating the observations in Figure 1. This result can be obtained by simply executing the function sbf(). The details of its arguments will be presented later.

> netlab <- ...
> eta <- ...
> out.pls <- ...
> sw.plot(out.pls, type="field", xlab="Longitude",
+         ylab="Latitude")

Figure 2: An extrapolation for the observations in Figure 1.

Note that the representation in Figure 2 has inherent multiscale characteristics, which originate from the observations in Figure 1. For example, observe the global cold patterns near the north pole with local anomalies of the extreme cold in the central Canadian shield. Thus, classical methods such as spherical harmonics or smoothing splines are not very efficient in representing temperature data since they do not capture local properties. It is important to detect and explain local activities and variabilities as well as global trends. The second objective of using SWs is to decompose the signal properly according to spatial scales so as to capture the various activities of fields. Finally, SWs can be employed in developing a procedure to denoise the observations that are corrupted by noise. This article illustrates these procedures through an analysis of temperature data. In summary, the aim of this article is to explain how the SpherWave package is used in the following:

1) estimating the temperature field T(x) for an arbitrary location x on the globe, given the scattered observations y_i, i = 1, ..., n, from the model

    y_i = T(x_i) + ε_i,   i = 1, 2, ..., n,    (1)

where x_i denote the locations of observations on the globe and ε_i are the measurement errors;

2) decomposing the signal by the multiresolution analysis; and

3) obtaining a SW estimator using a thresholding approach.

As will be described in detail later, the multiresolution analysis and SW estimators of the temperature field can be derived from the procedure termed multiscale spherical basis function (SBF) representation.

Theory

In this section, we summarize the SWs proposed by Li (1999) and its statistical applications proposed by Oh and Li (2004) for an understanding of the methodology and promoting the usage of the SpherWave package.

Narcowich and Ward (1996) proposed a method to construct SWs for scattered data on a sphere. They proposed an SBF representation, which is a linear combination of localized SBFs centered at the locations of the observations. However, the Narcowich-Ward method suffers from a serious problem: the SWs have a constant spatial scale regardless of the intended multiscale decomposition. Li (1999) introduced a new multiscale SW method that overcomes the single-scale problem of the Narcowich-Ward method and truly represents spherical fields with multiscale structure.

When a network of n observation stations N_1 := {x_i}_{i=1}^n is given, we can construct nested networks N_1 ⊃ N_2 ⊃ ··· ⊃ N_L for some L. We re-index the subscript of the location x_i so that x_{li} belongs to N_l \ N_{l+1} = {x_{li}}_{i=1}^{M_l} (l = 1, ..., L; N_{L+1} := ∅), and use the convention that the scale moves from the finest to the smoothest as the resolution level index l increases. The general principle of the multiscale SBF representation proposed by Li (1999) is to employ linear combinations of SBFs with various scale parameters to approximate the underlying field T(x) of the model in equation (1). That is, for some L,

    T_1(x) = Σ_{l=1}^{L} Σ_{i=1}^{M_l} β_{li} φ_{η_l}(Θ(x, x_{li})),    (2)

where φ_{η_l} denotes SBFs with a scale parameter η_l and Θ(x, x_i) is the cosine of the angle between two locations x and x_i represented by the spherical coordinate system. Thus geodetic distance is used for spherical wavelets, which is desirable for data on the globe. An SBF φ(Θ(x, x_i)) for a given spherical location x_i is a spherical function of x that peaks at x = x_i and decays in magnitude as x moves away from x_i. A typical example is the Poisson kernel used by Narcowich and Ward (1996) and Li (1999).

Now, let us describe a multiresolution analysis procedure that decomposes the SBF representation (2) into global and local components. As will be seen later, the networks N_l can be arranged in such a manner that the sparseness of stations in N_l increases as the index l increases, and the bandwidth of φ can also be chosen to increase with l to compensate for the sparseness of stations in N_l. By this construction, the index l becomes a true scale parameter. Suppose T_l, l = 1, ..., L, belongs to the linear subspace of all SBFs that have scales greater than or equal to l. Then T_l can be decomposed as

    T_l(x) = T_{l+1}(x) + D_l(x),

where T_{l+1} is the projection of T_l onto the linear subspace of SBFs on the network N_{l+1}, and D_l is the orthogonal complement of T_l. Note that the field D_l can be interpreted as the field containing the local information. This local information cannot be explained by the field T_{l+1}, which only contains the global trend extrapolated from the coarser network N_{l+1}. Therefore, T_{l+1} is called the global component of scale l + 1 and D_l is called the local component of scale l. Thus, the field T_1 in its SBF representation (equation (2)) can be successively decomposed as

    T_1(x) = T_L(x) + Σ_{l=1}^{L-1} D_l(x).    (3)

In general wavelet terminology, the coefficients of T_L and D_l of the SW representation in equation (3) can be considered as the smooth coefficients and detailed coefficients of scale l, respectively.

The extrapolated field may not be a stable estimator of the underlying field T because of the noise in the data. To overcome this problem, Oh and Li (2004) propose the use of a thresholding approach pioneered by Donoho and Johnstone (1994). Typical thresholding types are hard and soft thresholding. By hard thresholding, small SW coefficients, considered as originating from the zero-mean noise, are set to zero while the other coefficients, considered as originating from the signal, are left unchanged. In soft thresholding, not only are the small coefficients set to zero but the large coefficients are also shrunk toward zero, based on the assumption that they are corrupted by additive noise. A reconstruction from these coefficients yields the SW estimators.
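The two rules are easy to state in code. The following generic sketch (not part of SpherWave; lambda is an arbitrary threshold value) shows what hard and soft thresholding do to a vector of detailed coefficients:

hard <- function(d, lambda) ifelse(abs(d) > lambda, d, 0)
soft <- function(d, lambda) sign(d) * pmax(abs(d) - lambda, 0)

d <- c(-3, -0.5, 0.2, 1.5)
hard(d, lambda = 1)   # -3.0  0.0  0.0  1.5
soft(d, lambda = 1)   # -2.0  0.0  0.0  0.5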


Network design and bandwidth selection

As mentioned previously, a judiciously designed network N_l and properly chosen bandwidths for the SBFs are required for a stable multiscale SBF representation.

In the SpherWave package, we design a network for the observations in Figure 1 as

> netlab <- ...
> sw.plot(z=netlab, latlon=latlon, type="network",
+         xlab="", ylab="", cex=0.6)

We then obtain the network in Figure 3, which consists of 6 subnetworks.

> table(netlab)
netlab
  1   2   3   4   5   6
686 104  72  44  25   8

Note that the number of stations at each level decreases as the resolution level increases. The most detailed subnetwork 1 consists of 686 stations while the coarsest subnetwork 6 consists of 8 stations.

    Figure 3: Network Design

The network design in the SpherWave package depends only on the location of the data and the template grid, which is predetermined without considering geophysical information. To be specific, given a template grid and a radius for the spherical cap, we can design a network satisfying two conditions for stations: 1) choose the stations closest to the template grid so that the stations could be distributed as uniformly as possible over the sphere, and 2) select stations between consecutive resolution levels so that the resulting stations between two levels are not too close for the minimum radius of the spherical cap. This scheme ensures that the density of N_l decreases as the resolution level index l increases. The function network.design() is controlled by the following parameters: latlon denotes the matrix of grid points (latitude, longitude) of the observation locations. The SpherWave package uses the following convention. Latitude is the angular distance in degrees of a point north or south of the equator, and North and South are represented by "+" and "-" signs, respectively. Longitude is the angular distance in degrees of a point east or west of the prime (Greenwich) meridian, and East and West are represented by "+" and "-" signs, respectively. method has four options for making a template grid: "Gottlemann", "ModifyGottlemann", "Oh", and "cover". For details of the first three methods, see Oh and Kim (2007). "cover" is the option for utilizing the function cover.design() in the package fields. Only when using the method "cover", provide nlevel, which denotes a vector of the number of observations in each level, starting from the resolution level 1. type denotes the type of template grid; it is specified as either "regular" or "reduce". The option "reduce" is designed to overcome the problem of a regular grid, which produces a strong concentration of points near the poles. The parameter x is the minimum radius of the spherical cap.

Since the index l is a scale index in the resulting multiscale analysis, as l increases, the density of N_l decreases and the bandwidth η_l increases. The bandwidths can be supplied by the user. Alternatively, the SpherWave package provides its own function for the automatic choice of the bandwidths. For example, the bandwidths for the network design using "ModifyGottlemann" can be chosen by the following procedure.

> eta <- ...


method: the method for the calculation of coefficients of equation (2), "ls" or "pls"

approx: approx = TRUE will use the approximation matrix

grid.size: the size of the grid (latitude, longitude) of the extrapolation site

lambda: smoothing parameter for method = "pls".

method has two options, "ls" and "pls". method = "ls" calculates the coefficients by the least squares method, and method = "pls" uses the penalized least squares method. Thus, the smoothing parameter lambda is required only when using method = "pls". approx = TRUE implies that we obtain the coefficients using m (< n) selected sites from among the n observation sites, while the interpolation method (approx = FALSE) uses all the observation sites. The function sbf() returns an object of class "sbf". See Oh and Kim (2006) for details. The following code performs the approximate multiscale SBF representation by the least squares method, and Figure 4 illustrates the results.

> out.ls <- ...
> sw.plot(out.ls, type="field",
+         xlab="Longitude", ylab="Latitude")

Figure 4: An approximate multiscale SBF representation for the observations in Figure 1.

As can be observed, the result in Figure 4 is different from that in Figure 2, which is performed by the penalized least squares interpolation method. Note that the value of the smoothing parameter lambda used in Figure 2 is chosen by generalized cross-validation. For the implementation, run the following procedure.

> lam <- ...
> gcv <- ...
> for(i in 1:length(lam))
+     gcv <- ...
> lam[gcv == min(gcv)]
[1] 0.8
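The structure of this grid search can be sketched generically; the snippet below uses a toy criterion function so that it is self-contained, whereas SpherWave evaluates a real generalized cross-validation score for each candidate lambda:

crit <- function(lambda) (lambda - 0.8)^2   # toy stand-in for the GCV score
lam  <- seq(0.1, 0.9, by = 0.1)             # candidate smoothing parameters
gcv  <- sapply(lam, crit)
lam[which.min(gcv)]                         # here: 0.8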

    Multiresolution analysis

Here, we explain how to decompose the multiscale SBF representation into the global field of scale l + 1, T_{l+1}(x), and the local field of scale l, D_l(x). Use the function swd() for this operation.

> out.dpls <- swd(out.pls)
> sw.plot(out.dpls, type="swcoeff", pch=19,
+         cex=1.1)
> sw.plot(out.dpls, type="decom")

    Figure 5: Plot of SW smooth coefficients and detailedcoefficients at different levels l = 1,2,3,4,5.

    Spherical wavelet estimators

We now discuss the statistical techniques of smoothing based on SWs. The theoretical background is based on the works of Donoho and Johnstone (1994) and Oh and Li (2004). The thresholding function swthresh() for SW estimators is

    > swthresh(swd, policy, by.level, type, nthresh,

    + value=0.1, Q=0.05)

This function swthresh() thresholds or shrinks detailed coefficients stored in an swd object, and returns the thresholded detailed coefficients in a modified swd object. The thresh.info list of an swd object has the thresholding information. The available policies are "universal", "sure", "fdr", "probability", and "Lorentz". For the first three thresholding policies, see Donoho and Johnstone (1994, 1995) and Abramovich and Benjamini (1996).


Figure 6: Multiresolution analysis of the multiscale SBF representation T_1(x) in Figure 2. Note that the field T_1(x) is decomposed as T_1(x) = T_6(x) + D_1(x) + D_2(x) + D_3(x) + D_4(x) + D_5(x).

    Figure 7: Thresholding result obtained by using the FDR policy

Q specifies the false discovery rate (FDR) of the FDR policy. policy = "probability" performs thresholding using the user-supplied threshold represented by a quantile value. In this case, the quantile value is supplied by value. The Lorentz policy takes the thresholding parameter λ as the mean sum of squares of the detailed coefficients.


by.level controls the method for estimating the noise variance. In practice, we assume that the noise variances are globally the same or level-dependent. by.level = TRUE estimates the noise variance at each level l. A level-dependent thresholding is provided only for the universal, Lorentz, and FDR policies. The two approaches, hard and soft thresholding, can be specified by type. In addition, the Lorentz type q(t, λ) := sign(t) √(t² − λ²) I(|t| > λ) is supplied. Note that only soft type thresholding is appropriate for the SURE policy. By providing the number of resolution levels to be thresholded by nthresh, we can also specify the truncation parameter.

The following procedures perform thresholding using the FDR policy and the reconstruction. Comparing Figure 6 with Figure 7, we can observe that the local components of resolution levels 1, 2, and 3 of Figure 7 are shrunk so that its reconstruction (Figure 8) illustrates a smoothed temperature field. For the reconstruction, the function swr() is used on an object of class "swd".

> ### Thresholding
> out.fdr <- ...
> sw.plot(out.fdr, type = "decom")
> ### Reconstruction
> out.reconfdr <- swr(out.fdr)
> sw.plot(z=out.reconfdr, type="recon",
+         xlab="Longitude", ylab="Latitude")

    Figure 8: Reconstruction

We repeatedly use sw.plot() for display. To summarize its usage, the function sw.plot() displays the observation, network design, SBF representation, SW coefficients, decomposition result or reconstruction result, as specified by type = "obs", "network", "field", "swcoeff", "decom" or "recon", respectively. Either argument sw or z specifies the object to be plotted. z is used for observations, subnetwork labels and reconstruction results, and sw is used for an sbf or swd object.

Concluding remarks

We introduce SpherWave, an R package implementing SWs. In this article, we analyze surface air temperature data using SpherWave and obtain meaningful and promising results; furthermore, we provide a step-by-step tutorial introduction for the wide potential applicability of SWs. Our hope is that SpherWave makes SW methodology practical, and encourages interested readers to apply the SWs to real world applications.

    Acknowledgements

    This work was supported by the SRC/ERC programof MOST/KOSEF (R11-2000-073-00000).

    Bibliography

F. Abramovich and Y. Benjamini. Adaptive thresholding of wavelet coefficients. Computational Statistics & Data Analysis, 22(4):351-361, 1996.

D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425-455, 1994.

D. L. Donoho and I. M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200-1224, 1995.

T-H. Li. Multiscale representation and analysis of spherical data by spherical wavelets. SIAM Journal of Scientific Computing, 21(3):924-953, 1999.

F. J. Narcowich and J. D. Ward. Nonstationary wavelets on the m-sphere for scattered data. Applied and Computational Harmonic Analysis, 3(4):324-336, 1996.

H-S. Oh and D. Kim. SpherWave: Spherical wavelets and SW-based spatially adaptive methods, 2006. URL http://CRAN.R-project.org/src/contrib/Descriptions/SpherWave.html.

H-S. Oh and D. Kim. Network design and pre-processing for multi-scale spherical basis function representation. Journal of the Korean Statistical Society, 36(2):209-228, 2007.

H-S. Oh and T-H. Li. Estimation of global temperature fields from scattered observations by a spherical-wavelet-based spatially adaptive method. Journal of the Royal Statistical Society B, 66(1):221-238, 2004.

Hee-Seok Oh
Seoul National University, Korea
[email protected]

Donghoh Kim
Sejong University, Korea
[email protected]


Diving Behaviour Analysis in R

An Introduction to the diveMove Package

by Sebastián P. Luque

    Introduction

Remarkable developments in technology for electronic data collection and archival have increased researchers' ability to study the behaviour of aquatic animals while reducing the effort involved and impact on study animals. For example, interest in the study of diving behaviour led to the development of minute time-depth recorders (TDRs) that can collect more than 15 MB of data on depth, velocity, light levels, and other parameters as animals move through their habitat. Consequently, extracting useful information from TDRs has become a time-consuming and tedious task. Therefore, there is an increasing need for efficient software to automate these tasks, without compromising the freedom to control critical aspects of the procedure.

There are currently several programs available for analyzing TDR data to study diving behaviour. The large volume of peer-reviewed literature based on results from these programs attests to their usefulness. However, none of them are in the free software domain, to the best of my knowledge, with all the disadvantages it entails. Therefore, the main motivation for writing diveMove was to provide an R package for diving behaviour analysis allowing for more flexibility and access to intermediate calculations. The advantage of this approach is that researchers have all the elements they need at their disposal to take the analyses beyond the standard information returned by the program.

The purpose of this article is to outline the functionality of diveMove, demonstrating its most useful features through an example of a typical diving behaviour analysis session. Further information can be obtained by reading the vignette that is included in the package (vignette("diveMove")), which is currently under development, but already shows basic usage of its main functions. diveMove is available from CRAN, so it can easily be installed using install.packages().

    The diveMove Package

diveMove offers functions to perform the following tasks:

Identification of wet vs. dry periods, defined by consecutive readings with or without depth measurements, respectively, lasting more than a user-defined threshold. Depending on the sampling protocol programmed in the instrument, these correspond to wet vs. dry periods, respectively. Each period is individually identified for later retrieval.

Calibration of depth readings, which is needed to correct for shifts in the pressure transducer. This can be done using a tcltk graphical user interface (GUI) for chosen periods in the record, or by providing a value determined a priori for shifting all depth readings.

Identification of individual dives, with their different phases (descent, bottom, and ascent), using various criteria provided by the user. Again, each individual dive and dive phase is uniquely identified for future retrieval.

Calibration of speed readings using the method described by Blackwell et al. (1999), providing a unique calibration for each animal and deployment. Arguments are provided to control the calibration based on given criteria. Diagnostic plots can be produced to assess the quality of the calibration.

Summary of time budgets for wet vs. dry periods.

Dive statistics for each dive, including maximum depth, dive duration, bottom time, post-dive duration, and summaries for each dive phase, among other standard dive statistics.

tcltk plots to conveniently visualize the entire dive record, allowing for zooming and panning across the record. Methods are provided to include the information obtained in the points above, allowing the user to quickly identify what part of the record is being displayed (period, dive, dive phase).

Additional features are included to aid in analysis of movement and location data, which are often collected concurrently with TDR data. They include calculation of distance and speed between successive locations, and filtering of erroneous locations using various methods. However, diveMove is primarily a diving behaviour analysis package, and other packages are available which provide more extensive animal movement analysis features (e.g. trip).

The tasks described above are possible thanks to the implementation of three formal S4 classes to represent TDR data. Classes TDR and TDRspeed are used to represent data from TDRs without and with speed sensor readings, respectively. The latter class inherits from the former, and other concurrent data can be included with either of these objects. A third formal class (TDRcalibrate) is used to represent data obtained during the various intermediate steps described above. This structure greatly facilitates the retrieval of useful information during analyses.

    Data Preparation

TDR data are essentially a time series of depth readings, possibly with other concurrent parameters, typically taken regularly at a user-defined interval. Depending on the instrument and manufacturer, however, the files obtained may contain various errors, such as repeated lines, missing sampling intervals, and invalid data. These errors are better dealt with using tools other than R, such as awk and its variants, because such stream editors use much less memory than R for this type of problem, especially with the typically large files obtained from TDRs. Therefore, diveMove currently makes no attempt to fix these errors. Validity checks for the TDR classes, however, do test for time series being in increasing order.
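If one nevertheless wants a quick sanity check from within R before building the objects, a minimal sketch is the following; the file name, column names and time format are hypothetical:

raw <- read.csv("tdr_file.csv", stringsAsFactors = FALSE)   # hypothetical file
raw <- raw[!duplicated(raw), ]                               # drop repeated lines
tt  <- as.POSIXct(paste(raw$date, raw$time),                 # hypothetical columns
                  format = "%d/%m/%Y %H:%M:%S", tz = "GMT")
stopifnot(!any(is.na(tt)), all(diff(as.numeric(tt)) > 0))    # times parse and increase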

Most TDR manufacturers provide tools for downloading the data from their TDRs, but often in a proprietary format. Fortunately, some of these manufacturers also offer software to convert the files from their proprietary format into a portable format, such as comma-separated values (csv). At least one of these formats can easily be understood by R, using standard functions, such as read.table() or read.csv(). diveMove provides constructors for its two main formal classes to read data from files in one of these formats, or from simple data frames.

How to Represent TDR Data?

TDR is the simplest class of objects used to represent TDR data in diveMove. This class, and its TDRspeed subclass, stores information on the source file for the data, the sampling interval, the time and depth readings, and an optional data frame containing additional parameters measured concurrently. The only difference between TDR and TDRspeed objects is that the latter ensures the presence of a speed vector in the data frame with concurrent measurements. These classes have the following slots:

    file: character,

    dtime: numeric,

    time: POSIXct,

    depth: numeric,

    concurrentData: data.frame

Once the TDR data files are free of errors and in a portable format, they can be read into a data frame, using e.g.:

R> ff <- ...
R> tdrXcsv <- read.csv(ff)
R> library("diveMove")
R> ddtt.str <- ...
R> ddtt <- ...
R> time.posixct <- ...
R> tdrX <- ...
R> tdrX
R> plotTDR(tdrX)

Figure 1: The plotTDR() method for TDR objects produces an interactive plot of the data, allowing for zooming and panning.


Several arguments for readTDR() allow mapping of data from the source file to the different slots in diveMove's classes, the time format in the input and the time zone attribute to use for the time readings.

Various methods are available for displaying TDR objects, including show(), which provides an informative summary of the data in the object, and extractor and replacement methods for all the slots. There is a plotTDR() method (Figure 1) for both TDR and TDRspeed objects. The interact argument allows for suppression of the tcltk interface. Information on these methods is available from methods?TDR.

TDR objects can easily be coerced to data frame (as.data.frame() method), without losing information from any of the slots. TDR objects can additionally be coerced to TDRspeed, whenever it makes sense to do so, using an as.TDRspeed() method.

Identification of Activities at Various Scales

One of the first steps of dive analysis involves correcting depth for shifts in the pressure transducer, so that surface readings correspond to zero. Such shifts are usually constant for an entire deployment period, but there are cases where the shifts vary within a particular deployment, so shifts remain difficult to detect and dives are often missed. Therefore, a visual examination of the data is often the only way to detect the location and magnitude of the shifts. Visual adjustment for shifts in depth readings is tedious, but has many advantages which may save time during later stages of analysis. These advantages include increased understanding of the data, and early detection of obvious problems in the records, such as instrument malfunction during certain intervals, which should be excluded from analysis.

Zero-offset correction (ZOC) is done using the function zoc(). However, a more efficient method of doing this is with function calibrateDepth(), which takes a TDR object to perform three basic tasks. The first is to ZOC the data, optionally using the tcltk package to be able to do it interactively:

R> dcalib <- calibrateDepth(tdrX)


A more refined call to calibrateDepth() for object tdrX may be:

R> dcalib <- ...
R> plotTDR(dcalib, concurVars = "light",
+          concurVarTitles = c("speed (m/s)",
+                              "light"), surface = TRUE)

Figure 2: The plotTDR() method for TDRcalibrate objects displays information on the major activities identified throughout the record (wet/dry periods here).

The dcalib object contains a TDRspeed object in its tdr slot, and speed is plotted by default in this case. Additional measurements obtained concurrently can also be plotted using the concurVars argument. Titles for the depth axis and the concurrent parameters use separate arguments; the former uses ylab.depth, while the latter uses concurVarTitles. Convenient default values for these are provided. The surface argument controls whether post-dive readings should be plotted; it is FALSE by default, causing only dive readings to be plotted, which saves time plotting and re-plotting the data. All plot methods use the underlying plotTD() function, which has other useful arguments that can be passed from these methods.


A more detailed view of the record can be obtained by using a combination of the diveNo and the labels arguments to this plotTDR() method. This is useful if, for instance, closer inspection of certain dives is needed. The following call displays a plot of dives 2 through 8 (Figure 3):

R> plotTDR(dcalib, diveNo = 2:8,
+          labels = "dive.phase")

Figure 3: The plotTDR() method for TDRcalibrate objects can also display information on the different activities during each dive record (descent=D, descent/bottom=DB, bottom=B, bottom/ascent=BA, ascent=A, X=surface).

The labels argument allows the visualization of the identified dive phases for all dives selected. The same information can also be obtained with the extractDive() method for TDRcalibrate objects:

R> extractDive(dcalib, diveNo = 2:8)

Other useful extractors include getGAct() and getDAct(). These methods extract the whole gross.activity and dive.activity, respectively, if given only the TDRcalibrate object, or a particular component of these slots, if supplied a string with the name of the component. For example: getGAct(dcalib, "trip.act") would retrieve the factor identifying each reading with a wet/dry activity and getDAct(dcalib, "dive.activity") would retrieve a more detailed factor with information on whether the reading belongs to a dive or a brief aquatic period.

With the information obtained during this calibration procedure, it is possible to calculate dive statistics for each dive in the record.

    Dive Summaries

A table providing summary statistics for each dive can be obtained with the function diveStats() (Figure 4).

diveStats() returns a data frame with the final summaries for each dive (Figure 4), providing the following information:

The time of start of the dive, the end of descent, and the time when ascent began.

The total duration of the dive, and that of the descent, bottom, and ascent phases.

The vertical distance covered during the descent, the bottom (a measure of the level of "wiggling", i.e. up and down movement performed during the bottom phase), and the vertical distance covered during the ascent.

The maximum depth attained.

The duration of the post-dive interval.

A summary of time budgets of wet vs. dry periods can be obtained with timeBudget(), which returns a data frame with the beginning and ending times for each consecutive period (Figure 4). It takes a TDRcalibrate object and another argument (ignoreZ) controlling whether aquatic periods that were briefer than the user-specified threshold² should be collapsed within the enclosing period of dry activity.

These summaries are the primary goal of diveMove, but they form the basis from which more elaborate and customized analyses are possible, depending on the particular research problem. These include investigation of descent/ascent rates based on the depth profiles, and bout structure analysis. Some of these will be implemented in the future.
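As a rough sketch of one such customized analysis, average descent and ascent rates can be computed from the per-dive summaries using the column names shown in Figure 4 (the interpretation as vertical distance per unit time follows from those columns):

R> tdrXSumm1 <- diveStats(dcalib)
R> desc.rate <- tdrXSumm1$descdist / tdrXSumm1$desctim
R> asc.rate  <- tdrXSumm1$ascdist  / tdrXSumm1$asctim
R> summary(cbind(desc.rate, asc.rate))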

In the particular case of TDRspeed objects, however, it may be necessary to calibrate the speed readings before calculating these statistics.

    Calibrating Speed Sensor Readings

Calibration of speed sensor readings is performed using the procedure described by Blackwell et al. (1999). Briefly, the method rests on the principle that for any given rate of depth change, the lowest measured speeds correspond to the steepest descent angles, i.e. vertical descent/ascent. In this case, measured speed and rate of depth change are expected to be equal. Therefore, a line drawn through the bottom edge of the distribution of observations in a plot of measured speed vs. rate of depth change would provide a calibration line. The calibrated speeds, therefore, can be calculated by reverse estimation of rate of depth change from the regression line.
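The quantile-regression step of that idea can be sketched on its own with the quantreg package; here speed and depth.rate are assumed to be vectors already extracted from the record, and the full procedure automated by calibrateSpeed() (described below) adds a kernel-density contour step before the fit:

R> library("quantreg")
R> calib <- rq(speed ~ depth.rate, tau = 0.1)   # line through the lower edge of the cloud
R> coef(calib)                                  # intercept and slope of the calibration line
R> speed.calibrated <- (speed - coef(calib)[1]) / coef(calib)[2]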

² This corresponds to the value given as the wet.thr argument to calibrateDepth().


R> tdrXSumm1 <- diveStats(dcalib)
R> names(tdrXSumm1)
 [1] "begdesc"         "enddesc"          "begasc"       "desctim"
 [5] "botttim"         "asctim"           "descdist"     "bottdist"
 [9] "ascdist"         "desc.tdist"       "desc.mean.speed" "desc.angle"
[13] "bott.tdist"      "bott.mean.speed"  "asc.tdist"    "asc.mean.speed"
[17] "asc.angle"       "divetim"          "maxdep"       "postdive.dur"
[21] "postdive.tdist"  "postdive.mean.speed"
R> tbudget <- timeBudget(dcalib, ignoreZ = TRUE)
R> head(tbudget, 4)
  phaseno activity                 beg                 end
1       1        W 2002-01-05 11:32:00 2002-01-06 06:30:00
2       2        L 2002-01-06 06:30:05 2002-01-06 17:01:10
3       3        W 2002-01-06 17:01:15 2002-01-07 05:00:30
4       4        L 2002-01-07 05:00:35 2002-01-07 07:34:00
R> trip.labs <- ...
R> tdrXSumm2 <- ...
R> names(tdrXSumm2)
 [1] "trip.no"         "trip.type"        "beg"          "end"
 [5] "begdesc"         "enddesc"          "begasc"       "desctim"
 [9] "botttim"         "asctim"           "descdist"     "bottdist"
[13] "ascdist"         "desc.tdist"       "desc.mean.speed" "desc.angle"
[17] "bott.tdist"      "bott.mean.speed"  "asc.tdist"    "asc.mean.speed"
[21] "asc.angle"       "divetim"          "maxdep"       "postdive.dur"
[25] "postdive.tdist"  "postdive.mean.speed"

Figure 4: Per-dive summaries can be obtained with the function diveStats(), and a summary of time budgets with timeBudget(). diveStats() takes a TDRcalibrate object as a single argument (object dcalib above, see text for how it was created).

diveMove implements this procedure with function calibrateSpeed(). This function performs the following tasks:

1. Subset the necessary data from the record. By default only data corresponding to depth changes > 0 are included in the analysis, but higher constraints can be imposed using the z argument. A further argument limiting the data to be used for calibration is bad, which is a vector with the minimum rate of depth change and minimum speed readings to include in the calibration. By default, values > 0 for both parameters are used.

2. Calculate the binned bivariate kernel density and extract the desired contour. Once the proper data were obtained, a bivariate normal kernel density grid is calculated from the relationship between measured speed and rate of depth change (using the KernSmooth package). The choice of bandwidths for the binned kernel density is made using bw.nrd. The contour.level argument to calibrateSpeed() controls which particular contour should be extracted from the density grid. Since the interest is in defining a regression line passing through the lower densities of the grid, this value should be relatively low (it is set to 0.1 by default).

3. Define the regression line passing through the lower edge of the chosen contour. A quantile regression through a chosen quantile is used for this purpose. The quantile can be specified using the tau argument, which is passed to the rq() function in package quantreg. tau is set to 0.1 by default.

4. Finally, the speed readings in the TDR object are calibrated.

As recognized by Blackwell et al. (1999), the advantage of this method is that it calibrates the instrument based on the particular deployment conditions (i.e. it controls for effects of position of the instrument on the animal, and size and shape of the instrument, relative to the animal's morphometry, among others). However, it is possible to supply the coefficients of this regression if they were estimated separately; for instance, from an experiment. The argument coefs can be used for this purpose, which is then assumed to contain the intercept and the slope of the line. calibrateSpeed() returns a TDRcalibrate object, with calibrated speed readings included in its tdr slot, and the coefficients used for calibration.

For instance, to calibrate speed readings using the 0.1 quantile regression of measured speed vs. rate of depth change, based on the 0.1 contour of the bivariate kernel densities, and including only changes in depth > 1, measured speeds and rates of depth change > 0:

R> vcalib <- calibrateSpeed(dcalib, tau = 0.1,
+                           contour.level = 0.1,
+                           z = 1, bad = c(0, 0))


Very Large Numbers in R: Introducing Package Brobdingnag

Logarithmic representation for floating-point numbers

by Robin K. S. Hankin

    Introduction

The largest floating point number representable in standard double precision arithmetic is a little under 2^1024, or about 1.79 × 10^308. This is too small for some applications.

The R package Brobdingnag (Swift, 1726) overcomes this limit by representing a real number x using a double precision variable with value log|x|, and a logical corresponding to x ≥ 0; the S4 class of such objects is brob. Complex numbers with large absolute values (class glub) may be represented using a pair of brobs to represent the real and imaginary components.

The package allows user-transparent access to the large numbers allowed by Brobdingnagian arithmetic. The package also includes a vignette, brob, which documents the S4 methods used and includes a step-by-step tutorial. The vignette also functions as a "Hello, World!" example of S4 methods as used in a simple package. It also includes a full description of the glub class.

    Package Brobdingnag in use

Most readers will be aware of a googol, which is equal to 10^100. Its factorial is far beyond double precision arithmetic but poses no problem for brobs, here approximated by the leading term of Stirling's formula:

> require(Brobdingnag)
> googol <- as.brob(10)^100
> stirling <- function(n) n^n * exp(-n) * sqrt(2 * pi * n)
> stirling(googol)
[1] +exp(2.2926e+102)

Note the transparent coercion to brob form within function stirling().

It is also possible to represent numbers very close to 1. Thus

    > 2^(1/googol)

    [1] +exp(6.9315e-101)

It is worth noting that if x has an exact representation in double precision, then e^x is exactly representable using the system described here. Thus e and e^1000 are represented exactly.
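For instance (output shown in the same format as the examples above):

> exp(as.brob(1000))
[1] +exp(1000)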

    Accuracy

For small numbers (that is, numbers representable using standard double precision floating point arithmetic), Brobdingnag suffers a slight loss of precision compared to the normal representation. Consider the following function f(), whose return value for nonzero arguments is algebraically zero:

> f <- ...
> f(1/7)

    [1] 1.700029e-16

    > f(as.brob(1/7))

    [1] -1.886393e-16

This typical example shows that Brobdingnagian numbers suffer a slight loss of precision for numbers of moderate magnitude. This degradation increases with the magnitude of the argument:

    > f(1e+100)

    [1] -2.185503e-16

    > f(as.brob(1e+100))

    [1] -3.219444e-14

Here, the brob's accuracy is about two orders of magnitude worse than double precision arithmetic: this would be expected, as the number of bits required to specify the exponent goes as log log x.

    Compare


    > f(as.brob(10)^1000)

    [1] 1.931667e-13

showing a further degradation of precision. However, observe that conventional double precision arithmetic cannot deal with numbers this big, and the package returns about 12 correct significant figures.

A practical example

In the field of population dynamics, and especially the modelling of biodiversity (Hankin, 2007b; Hubbell, 2001), complicated combinatorial formulae often arise.

Etienne (2005), for example, considers a sample of N individual organisms taken from some natural population; the sample includes S distinct species, and each individual is assigned a label in the range 1 to S. The sample comprises n_i members of species i, with 1 ≤ i ≤ S and Σ n_i = N. For a given sample D, Etienne defines, amongst other terms, K(D, A) for 1 ≤ A ≤ N − S + 1 as

    K(D, A) = Σ_{{a_1,...,a_S | Σ_{i=1}^S a_i = A}}  Π_{i=1}^S  s(n_i, a_i) s(a_i, 1) / s(n_i, 1)    (1)

where s(n, a) is the Stirling number of the second kind (Abramowitz and Stegun, 1965). The summation is over a_i = 1, ..., n_i with the restriction that the a_i sum to A, as carried out by blockparts() of the partitions package (Hankin, 2006, 2007a).
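For readers unfamiliar with blockparts(), a tiny call (a toy case, unrelated to the data analysed below) illustrates the kind of enumeration involved: each column of the returned matrix is one vector of non-negative parts bounded above by (2, 3) that sums to 3; in the formula above the parts are additionally required to be at least 1.

> library("partitions")
> blockparts(c(2, 3), 3)   # the three compositions (0,3), (1,2) and (2,1)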

Taking an intermediate-sized dataset due to Saunders¹ of only 5903 individuals (a relatively small dataset in this context), the maximal element of K(D, A) is about 1.435 × 10^1165. The accuracy of package Brobdingnag in this context may be assessed by comparing it with that computed by PARI/GP (Batut et al., 2000) with a working precision of 100 decimal places; the natural logs of the two values are 2682.8725605988689 and 2682.87256059887 respectively: identical to 14 significant figures.

    Conclusions

The Brobdingnag package allows representation and manipulation of numbers larger than those covered by standard double precision arithmetic, although accuracy is eroded for very large numbers. This facility is useful in several contexts, including combinatorial computations such as encountered in theoretical modelling of biodiversity.

    Acknowledgments

I would like to acknowledge the many stimulating and helpful comments made by the R-help list over the years.

    Bibliography

M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions. New York: Dover, 1965.

C. Batut, K. Belabas, D. Bernardi, H. Cohen, and M. Olivier. User's guide to PARI/GP. Technical Reference Manual, 2000. URL http://www.parigp-home.de/.

R. S. Etienne. A new sampling formula for neutral biodiversity. Ecology Letters, 8:253-260, 2005. doi: 10.1111/j.1461-0248.2004.00717.x.

R. K. S. Hankin. Additive integer partitions in R. Journal of Statistical Software, 16(Code Snippet 1), May 2006.

R. K. S. Hankin. Urn sampling without replacement: Enumerative combinatorics in R. Journal of Statistical Software, 17(Code Snippet 1), January 2007a.

R. K. S. Hankin. Introducing untb, an R package for simulating ecological drift under the Unified Neutral Theory of Biodiversity, 2007b. Under review at the Journal of Statistical Software.

S. P. Hubbell. The Unified Neutral Theory of Biodiversity and Biogeography. Princeton University Press, 2001.

J. Swift. Gulliver's Travels. Benjamin Motte, 1726.

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-PLUS. Springer, 1997.

Robin K. S. Hankin
Southampton Oceanography Centre
Southampton, United Kingdom
[email protected]

¹ The dataset comprises species counts on kelp holdfasts; here saunders.exposed.tot of package untb (Hankin, 2007b) is used.


Applied Bayesian Non- and Semi-parametric Inference using DPpackage

by Alejandro Jara

    Introduction

In many practical situations, a parametric model cannot be expected to describe in an appropriate manner the chance mechanism generating an observed dataset, and unrealistic features of some common models could lead to unsatisfactory inferences. In these cases, we would like to relax parametric assumptions to allow greater modeling flexibility and robustness against misspecification of a parametric statistical model. In the Bayesian context such flexible inference is typically achieved by models with infinitely many parameters. These models are usually referred to as Bayesian Nonparametric (BNP) or Semiparametric (BSP) models depending on whether all or at least one of the parameters is infinite dimensional (Müller & Quintana, 2004).

While BSP and BNP methods are extremely powerful and have a wide range of applicability within several prominent domains of statistics, they are not as widely used as one might guess. At least part of the reason for this is the gap between the type of software that many applied users would like to have for fitting models and the software that is currently available. The most popular programs for Bayesian analysis, such as BUGS (Gilks et al., 1992), are generally unable to cope with nonparametric models. The variety of different BSP and BNP models is huge; thus, building for all of them a general software package which is easy to use, flexible, and efficient may be close to impossible in the near future.

This article is intended to introduce an R package, DPpackage, designed to help bridge the previously mentioned gap. Although its name is motivated by the most widely used prior on the space of probability distributions, the Dirichlet Process (DP) (Ferguson, 1973), the package considers, and will consider in the future, other priors on functional spaces. Currently, DPpackage (version 1.0-5) allows the user to perform Bayesian inference via simulation from the posterior distributions for models considering DP, Dirichlet Process Mixtures (DPM), Polya Trees (PT), Mixtures of Triangular distributions, and Random Bernstein Polynomials priors. The package also includes generalized additive models considering penalized B-Splines. The rest of the article is organized as follows. We first discuss the general syntax and design philosophy of the package. Next, the main features of the package and some illustrative examples are presented. Comments on future developments conclude the article.

Design philosophy and general syntax

The design philosophy behind DPpackage is quite different from that of a general purpose language. The most important design goal has been the implementation of model-specific MCMC algorithms. A direct benefit of this approach is that the sampling algorithms can be made dramatically more efficient.

Fitting a model in DPpackage begins with a call to an R function that can be called, for instance, DPmodel or PTmodel. Here "model" denotes a descriptive name for the model being fitted. Typically, the model function will take a number of arguments that govern the behavior of the MCMC sampling algorithm. In addition, the model(s) formula(s), data, and prior parameters are passed to the model function as arguments. The common elements in any model function are:

i) prior: an object list which includes the values of the prior hyperparameters.

ii) mcmc: an object list which must include the integers nburn giving the number of burn-in scans, nskip giving the thinning interval, nsave giving the total number of scans to be saved, and ndisplay giving the number of saved scans to be displayed on screen: the function reports on the screen when every ndisplay scans have been carried out and returns the process's runtime in seconds. For some specific models, one or more tuning parameters for Metropolis steps may be needed and must be included in this list. The names of these tuning parameters are explained in each specific model description in the associated help files. (A short sketch of these common arguments is given after this list.)

iii) state: an object list giving the current values of the parameters, when the analysis is the continuation of a previous analysis, or giving the starting values for a new Markov chain, which is useful for running multiple chains starting from different points.

iv) status: a logical variable indicating whether it is a new run (TRUE) or the continuation of a previous analysis (FALSE). In the latter case the current values of the parameters must be specified in the object state.
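A minimal sketch of these common arguments (the particular numbers are arbitrary illustrations, and model-specific tuning parameters are omitted):

mcmc   <- list(nburn = 5000, nsave = 10000,
               nskip = 10, ndisplay = 1000)
state  <- NULL    # no previous analysis to continue from
status <- TRUE    # start a new Markov chain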

Inside the R model function the inputs to the model function are organized in a more useable form, the MCMC sampling is performed by calling a shared library written in a compiled language, and the posterior sample is summarized, labeled, assigned into an output list, and returned. The output list includes:

i) state: a list of objects containing the current values of the parameters.

ii) save.state: a list of objects containing the MCMC samples for the parameters. This list contains two matrices, randsave and thetasave, which contain the MCMC samples of the variables with random distribution (errors, random effects, etc.) and the parametric part of the model, respectively.

In order to exemplify the extraction of the output elements, consider the abstract model fit:

fit <- DPmodel(formula, prior = prior, mcmc = mcmc,
               state = state, status = TRUE, ...)
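The components of the returned list can then be extracted in the usual way; for instance (a sketch, following the naming described above):

fit$state                   # current parameter values, e.g. for restarting the chain
fit$save.state$thetasave    # MCMC samples for the parametric part of the model
fit$save.state$randsave     # MCMC samples for the random distributions / effects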


model with a binary distribution, respectively. DPraschpoisson and FPTraschpoisson, employing a Poisson distribution.

iv) Semiparametric meta-analysis models: DPmeta and DPMmeta for the random (mixed) effects meta-analysis models, using a DP/MDP and DPM of normals prior, respectively.

v) Binary regression with nonparametric link: CSDPbinary, using Newton et al. (1996)'s centrally standardized DP prior. DPbinary and FPTbinary, using a DP and a finite PT prior for the inverse of the link function, respectively.

vi) AFT model for interval-censored data: DPsurvint, using a MDP prior for the error distribution.

vii) ROC curve estimation: DProc, using DPM of normals.

viii) Median regression model: PTlm, using a median-0 MPT prior for the error distribution.

ix) Generalized additive models: PSgam, using penalized B-Splines.

Additional tools included in the package are DPelicit, to elicit the DP prior using the exact and approximated formulas for the mean and variance of the number of clusters given the total mass parameter and the number of subjects (see Jara et al. 2007); and PsBF, to compute the Pseudo-Bayes factors for model comparison.

    Examples

    Bivariate Density Estimation

As an illustration of bivariate density estimation using DPM of normals (DPdensity) and MPT models (PTdensity), part of the dataset in Chambers et al. (1983) is considered. Here, n = 111 bivariate observations y_i = (y_{i1}, y_{i2})^T on radiation y_{i1} and the cube root of ozone concentration y_{i2} are modeled. The original dataset has the additional variables wind speed and temperature. These were analyzed by Müller et al. (1996) and Hanson (2006).

The DPdensity function considers the multivariate extension of the univariate Dirichlet Process Mixture of Normals model discussed in Escobar & West (1995),

    y_i | G  ~iid  ∫ N_k(μ, Σ) G(dμ, dΣ),
    G | M, G_0  ~  DP(M G_0),
    G_0  ≡  N_k(μ | m_1, (1/κ_0) Σ) IW_k(Σ | ν_1, Ψ_1),
    M | a_0, b_0  ~  Γ(a_0, b_0),
    m_1 | m_2, S_2  ~  N_k(m_2, S_2),
    κ_0 | τ_1, τ_2  ~  Γ(τ_1/2, τ_2/2),
    Ψ_1 | ν_2, Ψ_2  ~  IW_k(ν_2, Ψ_2),

where N_k(μ, Σ) refers to a k-variate normal distribution with mean and covariance matrix μ and Σ, respectively, IW_k(ν, Ψ) refers to an inverted-Wishart distribution with shape and scale parameter ν and Ψ, respectively, and Γ(a, b) refers to a gamma distribution with shape and rate parameter a and b, respectively. Note that the inverted-Wishart prior is parameterized such that its mean is given by (ν_1 − k − 1)^{-1} Ψ_1^{-1}.

The PTdensity function considers a Mixture of multivariate Polya Trees model discussed in Hanson (2006),

    y_i | G  ~iid  G,                        (1)
    G | α, μ, Σ, M  ~  PT^M(Π^{μ,Σ}, A),     (2)
    p(μ, Σ)  ∝  |Σ|^{-(d+1)/2},              (3)
    α | a_0, b_0  ~  Γ(a_0, b_0),            (4)

where the PT prior is centered around a N_k(μ, Σ) distribution. To fit these models we used the following commands:

# Data
  data("airquality")
  attach(airquality)
  ozone <- Ozone**(1/3)    # cube root of the ozone concentration, as in the text
  radiation <- Solar.R     # variable name assumed; the listing is truncated in the source
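The fitting calls themselves are sketched below. The hyperparameter lists priorDPM and priorMPT (and the MCMC settings) used in the article are not reproduced here; see ?DPdensity and ?PTdensity for the elements these lists must contain.

y     <- na.omit(cbind(radiation, ozone))   # the 111 complete bivariate observations
state <- NULL                               # start a fresh chain
mcmc  <- list(nburn = 5000, nsave = 5000, nskip = 9,
              ndisplay = 1000)              # illustrative values only

fitDPM <- DPdensity(y = y, prior = priorDPM, mcmc = mcmc,
                    state = state, status = TRUE)
fitMPT <- PTdensity(y = y, prior = priorMPT, mcmc = mcmc,
                    state = state, status = TRUE)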


For the AFT model fitted by DPsurvint, P(T_i > t | x_i) = G((t exp{x_i^T β}, +∞)), thus providing a model with a simple interpretation of the regression coefficients for practitioners.

Classical treatments of the semiparametric AFT model with interval-censored data were presented, for instance, in Lin & Zhang (1998). Note, however, that for semiparametric AFT models there is nothing comparable to a partial likelihood function. Therefore, the vector of regression coefficients and the baseline survival distribution must be estimated simultaneously, complicating matters enormously in the interval-censored case. The more recent classical approaches only provide inferences about the regression coefficients and not for the survival function.

In the Bayesian semiparametric context, Christensen & Johnson (1998) assigned a simple DP prior, centered in a single distribution, to baseline survival for nested interval-censored data. A marginal likelihood for the vector of regression coefficients is maximized to provide a point estimate and resulting survival curves. However, this approach does not allow the computation of credible intervals for the


parameters. Moreover, it may be difficult in practice to specify a single centering distribution for the DP prior and, once specified, a single centering distribution may affect inferences. To overcome these difficulties, a MDP prior can be considered. Under this approach, it is not very difficult to demonstrate that the computations involved for a full Bayesian solution are horrendous at best, even for the non-censored data problem. The analytic intractability of the Bayesian semiparametric AFT model has been overcome using MCMC methods by Hanson & Johnson (2004).

To test whether chemotherapy in addition to radiotherapy has an effect on time to breast retraction, an AFT model T_i = exp(-x_i^T β) V_i, i = 1, ..., n, was considered. We model the baseline distribution in the AFT model using a MDP prior centered in a standard parametric family, the log-normal distribution,

    V_1, ..., V_n | G  ~iid  G,
    G | α, μ, σ^2  ~  DP(α G_0),   G_0 = LN(μ, σ^2),
    μ | m_0, s_0  ~  N(m_0, s_0),
    σ^2 | τ_1, τ_2  ~  Γ(τ_1/2, τ_2/2),
    β | β_0, S_0  ~  N_p(β_0, S_0),

where LN(m, s^2) and N(m, s^2) refer to a log-normal and normal distribution, respectively, with location m and scale parameter s^2. The precision parameter of the MDP prior was chosen to be α = 10, allowing for moderate deviations from the log-normal family. We allow the parametric family to hold only approximately, and the resulting model is robust against mis-specification of the baseline survival distribution. The covariate of interest is trt_i = 0 if the ith patient had radiotherapy only and trt_i = 1 if the ith patient had radiotherapy and chemotherapy. The following commands were used to fit the model,

# Data
  data("deterioration")
  attach(deterioration)
  y <- cbind(left, right)   # interval limits; completion assumed, the listing is truncated in the source
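A sketch of the fit itself follows. The prior list (which encodes α = 10 and the hyperparameters of the log-normal centering family) is not reproduced here; see ?DPsurvint for the elements it must contain, and note that the mcmc values below are illustrative.

state <- NULL
mcmc  <- list(nburn = 5000, nsave = 5000, nskip = 9, ndisplay = 1000)

fit <- DPsurvint(y ~ trt, prior = prior, mcmc = mcmc,
                 state = state, status = TRUE)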


    > summary(fit)

    Bayesian Semiparametric AFT Regression Model

    Call:

    DPsurvint.default(formula = y ~ trt, prior = prior, mcmc = mcmc,

    state = state, status = TRUE)

    Posterior Predictive Distributions (log):

    Min. 1st Qu. Median Mean 3rd Qu. Max.

    -4.5920 -2.3570 -1.4600 -1.6240 -0.7121 -0.1991

    Regression coefficients:

    Mean Median Std. Dev. Naive Std.Error 95%HPD-Low 95%HPD-Upp

    trt 0.502282 0.513219 0.195521 0.001955 0.120880 0.820614

    Baseline distribution:

    Mean Median Std. Dev. Naive Std.Error 95%HPD-Low 95%HPD-Upp

    mu 3.255374 3.255518 0.173132 0.001731 2.917770 3.589759

    sigma2 1.021945 0.921764 0.469061 0.004691 0.366900 1.908676

    Precision parameter:

    Mean Median Std. Dev. Naive Std.Error 95%HPD-Low 95%HPD-Upp

    ncluster 27.58880 28.00000 3.39630 0.03396 20.00000 33.00000

    Acceptance Rate for Metropolis Step = 0.2637435

    Number of Observations: 94

    Figure 2: Posterior summary for the Breast Cancer Data fit using DPsurvint.

GLMMs provide a popular framework for the analysis of longitudinal measures and clustered data. The models account for correlation among clustered observations by including random effects in the linear predictor component of the model. Although GLMM fitting is typically complex, standard random intercept and random intercept/slope models with normally distributed random effects can now be routinely fitted in standard software. Such models are quite flexible in accommodating heterogeneous behavior, but they suffer from the same lack of robustness against departures from distributional assumptions as other statistical models based on Gaussian distributions.

A common strategy for guarding against such mis-specification is to build more flexible distributional assumptions for the random effects into the model. Following Lesaffre & Spiessens (2001), we consider a logistic mixed effects model to examine the probability of moderate or severe toenail separation (Y = 1) versus the probability of absent or mild separation (Y = 0), including as covariates treatment (trt) (0 or 1), time (t) (continuous), and the time × treatment interaction,

    logit{P(Y_ij = 1 | β, θ_i)} = θ_i + β_1 Trt_i + β_2 Time_ij + β_3 Trt_i × Time_ij.

However, we replace the normality assumption of the random intercepts by using a DPM of normals prior (see, e.g., Müller et al. 2007),

    θ_i | G  ~iid  G,
    G | P, Σ_k  =  ∫ N(m, Σ_k) P(dm),
    P | α, μ, Σ  ~  DP(α N(μ, Σ)),
    β | β_0, S_0  ~  N_p(β_0, S_0),
    Σ_k | ν_0, T  ~  IW_k(ν_0, T),
    μ | m_b, S_b  ~  N_q(m_b, S_b),
    Σ | ν_b, T_b  ~  IW_k(ν_b, T_b),
    α | a_0, b_0  ~  Γ(a_0, b_0).

The semiparametric GLMM using DPM of normals model can be fitted using function DPMglmm and the following code,

# MCMC parameters
  mcmc <- list(nburn = 5000, nsave = 5000, nskip = 20,
               ndisplay = 100)   # values illustrative; the listing is truncated in the source


               Sbeta0 = diag(100,3))   # tail of the 'prior' list; earlier elements truncated in the source

# Fitting the model
  fitDPM <- DPMglmm(fixed = infect ~ trt + time * trt, random = ~ 1 | idnr,
                    family = binomial(logit), prior = prior, mcmc = mcmc,
                    state = state, status = TRUE)

> anova(fitDPM)

    Table of Pseudo Contour Probabilities

    Response: infect

    Df PsCP

    trt 1 0.512

    time 1


    > summary(fitDPM)

    Bayesian semiparametric generalized linear mixed effect model

    Call:

    DPMglmm.default(fixed = infect ~ trt + time * trt, random = ~1 |

    idnr, family = binomial(logit), prior = prior, mcmc = mcmc,

    state = state, status = TRUE)

    Posterior Predictive Distributions (log):

    Min. 1st Qu. Median Mean 3rd Qu. Max.

    -9.644e+00 -2.335e-01 -4.190e-02 -2.442e-01 -8.629e-03 -4.249e-05

Model's performance:

    Dbar Dhat pD DIC LPML

    753.0 603.6 149.4 902.5 -466.0

    Regression coefficients:

    Mean Median Std. Dev. Naive Std.Error 95%HPD-Low 95%HPD-Upp

(Intercept) -2.508419 -2.440589 0.762218 0.005390 -4.122867 -1.091684
trt          0.300309  0.304453 0.478100 0.003381 -0.669604  1.242553

    time -0.392343 -0.390384 0.046101 0.000326 -0.482329 -0.302442

    trt:time -0.128891 -0.128570 0.072272 0.000511 -0.265813 0.018636

    Kernel variance:

    Mean Median Std. Dev. Naive Std.Error 95%HPD-Low 95%HPD-Upp

    sigma-(Intercept) 0.0318682 0.0130737 0.0966504 0.0006834 0.0009878 0.1069456

    Baseline distribution:

    Mean Median Std. Dev. Naive Std.Error 95%HPD-Low 95%HPD-Upp

    mub-(Intercept) -2.624227 -2.558427 1.405269 0.009937 -5.621183 0.008855

    sigmab-(Intercept) 26.579978 23.579114 13.640300 0.096451 7.714973 52.754246

    Precision parameter:

    Mean Median Std. Dev. Naive Std.Error 95%HPD-Low 95%HPD-Upp

    ncluster 70.6021 64.0000 36.7421 0.2598 11.0000 143.0000

    alpha 38.4925 25.7503 44.1123 0.3119 1.1589 112.1120

    Acceptance Rate for Metropolis Steps = 0.8893615 0.9995698

    Number of Observations: 1908

    Number of Groups: 294

    Figure 5: Posterior summary for the Toe-nail Data fit using DPMglmm.

    trt:time 1 0.075 .

    ---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, information about the posterior distribution of the subject-specific effects can be obtained by using the DPMrandom function as follows,

    > DPMrandom(fitDPM)

    Random effect information for the DP object:

    Call:

    DPMglmm.default(fixed = infect ~ trt +

    time * trt, random = ~1 |idnr,

    family = binomial(logit),

    prior = prior, mcmc = mcmc,

    state = state, status = TRUE)

    Posterior mean of subject-specific components:

    (Intercept)

    1 1.6239

    .

    .

    383 2.0178


Summary and Future Developments

As the main obstacle for the practical use of BSP and BNP methods has been the lack of estimation tools, we presented an R package for fitting some frequently used models. Until the release of DPpackage, the two options for researchers who wished to fit a BSP or BNP model were to write their own code or to rely heavily on particular parametric approximations to some specific processes using the BUGS code given in Peter Congdon's books (see, e.g., Congdon 2001). DPpackage is geared primarily towards users who are not willing to bear the costs associated with both of these options.

Many improvements to the current status of the package can be made. For example, all DPpackage modeling functions compute CPOs for model comparison. However, only some of them compute the effective number of parameters pD and the deviance information criterion (DIC), as presented by Spiegelhalter et al. (2002). These and other model comparison criteria will be included for all the functions in future versions of DPpackage.

The implementation of more models, the development of general-purpose sampling functions, and the ability to run several Markov chains at once and to handle large dataset problems through the use of sparse matrix techniques, are areas of further improvement.

    Acknowledgments

I wish to thank (in alphabetical order) Timothy Hanson, George Karabatsos, Emmanuel Lesaffre, Peter Müller, and Fernando Quintana for many valuable discussions, comments, and suggestions during the development of the package. Timothy Hanson and Fernando Quintana are also co-authors of the functions PTdensity and FPTbinary, and BDPdensity, respectively. I gratefully acknowledge the partial support of the KUL-PUC bilateral (Belgium-Chile) grant No BIL05/03 and of the IAP research network grant No P6/03 of the Belgian government (Belgian Science Policy). The author thanks Novartis, Belgium, for permission to use their dermatological data for statistical research.

    Bibliography

J. Besag, P. Green, D. Higdon, and K. Mengersen. Bayesian computation and stochastic systems (with Discussion). Statistical Science, 10:3–66, 1995.

J. M. Chambers, W. S. Cleveland, B. Kleiner, and P. A. Tukey. Graphical Methods for Data Analysis. Boston, USA: Duxbury, 1983.

M. H. Chen and Q. M. Shao. Monte Carlo estimation of Bayesian credible and HPD intervals. Journal of Computational and Graphical Statistics, 8(1):69–92, 1999.

R. Christensen and W. O. Johnson. Modeling Accelerated Failure Time With a Dirichlet Process. Biometrika, 75:693–704, 1998.

P. Congdon. Bayesian Statistical Modelling. New York, USA: John Wiley and Sons, 2001.

M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.

T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.

S. Geisser and W. Eddy. A predictive approach to model selection. Journal of the American Statistical Association, 74:153–160, 1979.

W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. Software for the Gibbs sampler. Computing Science and Statistics, 24:439–448, 1992.

T. Hanson. Inference for Mixtures of Finite Polya Tree Models. Journal of the American Statistical Association, 101:1548–1565, 2006.

T. Hanson and W. O. Johnson. A Bayesian Semiparametric AFT Model for Interval-Censored Data. Journal of Computational and Graphical Statistics, 13(2):341–361, 2004.

A. Jara, M. J. Garcia-Zattera, and E. Lesaffre. A Dirichlet Process Mixture model for the analysis of correlated binary responses. Computational Statistics and Data Analysis, 51:5402–5415, 2007.

J. Klein and M. Moeschberger. Survival Analysis. New York, USA: Springer-Verlag, 1997.

E. Lesaffre and B. Spiessens. On the effect of the number of quadrature points in a logistic random-effects model: an example. Applied Statistics, 50:325–335, 2001.

G. Lin and C. Zhang. Linear Regression With Interval Censored Data. The Annals of Statistics, 26:1306–1327, 1998.

P. Müller and F. A. Quintana. Nonparametric Bayesian Data Analysis. Statistical Science, 19(1):95–110, 2004.

P. Müller, A. Erkanli, and M. West. Bayesian Curve Fitting Using Multivariate Normal Mixtures. Biometrika, 83:67–79, 1996.

P. Müller, F. A. Quintana, and G. Rosner. Semiparametric Bayesian Inference for Multilevel Repeated Measurement Data. Biometrics, 63(1):280–289, 2007.


M. A. Newton, C. Czado, and R. Chappell. Bayesian inference for semiparametric binary regression. Journal of the American Statistical Association, 91:142–153, 1996.

M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Output analysis and diagnostics for MCMC. R package version 0.12-1, 2007.

B. J. Smith. boa: Bayesian Output Analysis Program (BOA) for MCMC. R package version 1.1.6-1, 2007.

D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. Van der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64:583–639, 2002.

Alejandro Jara
Biostatistical Centre
Catholic University of Leuven
Leuven, Belgium
[email protected]

An Introduction to gWidgets

by John Verzani

    Introduction

CRAN has several different packages that interface R with toolkits for making graphical user interfaces (GUIs). For example, among others, there are RGtk2 (Lawrence and Temple Lang, 2006), rJava, and tcltk (Dalgaard, 2001). These primarily provide a mapping between library calls for the toolkits and similarly named R functions. To use them effectively to create a GUI, one needs to learn quite a bit about the underlying toolkit. Not only does this add complication for many R users, it can also be tedious, as there are often several steps required to set up a basic widget. The gWidgets package adds another layer between the R user and these packages, providing an abstract, simplified interface that tries to be as familiar to the R user as possible. By abstracting the toolkit it is possible to use the gWidgets interface with many different toolkits. Although open to the criticism that such an approach can only provide a least-common-denominator user experience, we'll see that gWidgets, despite not being as feature-rich as any underlying toolkit, can be used to produce fairly complicated GUIs without having as steep a learning curve as the toolkits themselves.

As of this writing there are implementations for three toolkits, RGtk2, tcltk, and rJava (with progress on a port to RwxWidgets). The gWidgetsRGtk2 package was the first and is the most complete. The gWidgetstcltk package is not as complete, due to limitations of the base libraries, but it has many useful widgets implemented. Installation of these packages requires the base toolkit libraries be installed. For gWidgetstcltk these are bundled with the Windows distribution; for others they may require a separate download.

    Dialogs

We begin by loading the package. Both the package and at least one toolkit implementation must be installed prior to this. If more than one toolkit implementation has been installed, you will be queried as to which one to use.

    library("gWidgets")

The easiest GUI elements to create are the basic dialogs (Figure 1). These are useful for sending out quick messages, such as:¹

    gconfirm("Are we having fun?")

Figure 1: Simple dialog created by gconfirm using the RGtk2 toolkit.

A basic dialog could be used to show error messages

options(error = function() {
  err = geterrmessage()
  gmessage(err, icon="error")
})

    or, be an alternative to file.choose

    source(gfile())

In gWidgets, these basic dialogs are modal, meaning the user must take some action before control of R is returned. The return value is a logical or string, as appropriate, and can be used as input to a further command.
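For instance (an illustrative one-liner, not from the article), the logical returned by gconfirm can guard a destructive command:

if (gconfirm("Really remove all objects from the workspace?"))
  rm(list = ls())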

¹The code for these examples is available from http://www.math.csi.cuny.edu/pmg/gWidgets/rnews.R


Modal dialogs can be confusing if the dialog gets hidden by another window. Additionally, modal dialogs disrupt a user's flow. A more typical GUI will allow the R session to continue and will only act when a user initiates some action with the GUI by mouse click, text entry, etc. The GUI designer adds handlers to respond to these events. The gWidgets programming interface is based around facilitating the following basic tasks in building a GUI: constructing widgets that a user can control to affect the state of the GUI, using generic functions to programmatically manipulate these widgets, simplifying the layout of widgets within containers, and facilitating the assigning of handlers to events initiated by the user of the GUI.

    Selecting a CRAN site

Figure 2: GUI to select a CRAN mirror shown using gWidgetstcltk. The filter feature of gtable has been used to narrow the selection to USA sites. Double-clicking a row causes the CRAN repository to be set.

Selecting an item from a list of items or a table of items is a very common task in GUIs. Our next example presents a GUI that allows a user to select with the mouse a CRAN repository from a table. The idea comes from a tcltk GUI created by the chooseCRANmirror function. This example shows how the widget constructors allow specification of both containers and event handlers.

We will use this function to set the CRAN repository from a URL.

setCRAN <- function(URL) {
  ## body assumed: the definition is truncated in the source; setting the
  ## "repos" option to the chosen URL is the intended effect
  repos <- getOption("repos")
  repos["CRAN"] <- URL
  options(repos = repos)
}
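The table itself can be built with gtable; the sketch below (not the article's exact listing) shows the constructor taking both a container and a handler, so that double-clicking a row calls setCRAN on the selected URL, as in Figure 2. The filter.column argument provides the filtering shown in the figure.

mirrors <- utils::getCRANmirrors()[, c("Name", "URL")]
win <- gwindow("Choose a CRAN mirror")
tbl <- gtable(mirrors, chosencol = 2, filter.column = 1, cont = win,
              handler = function(h, ...) {
                setCRAN(svalue(h$obj))   # URL of the double-clicked row
                dispose(win)             # close the window afterwards
              })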


The ggroup function produces a container that may be visualized as a box that allows new widgets to be packed in from left to right (the default) or from top to bottom (horizontal=FALSE). Nesting such containers gives a wide range of flexibility.

Widgets are added to containers through the container argument at the time of construction (which hereafter we shorten to cont) or using the add method for containers. However, gWidgetstcltk requires one to use the container argument when a widget is constructed. That is, except with gWidgetstcltk, the two forms sketched below are equivalent.

## Sketch (the original listing is truncated in the source): two ways of
## placing a button into a window
win <- gwindow("Example")
b1 <- gbutton("click me", cont = win)   # container given at construction time
add(win, gbutton("click me"))           # constructed first, then added (not gWidgetstcltk)


side of a dialog. First we pack a ggroup instance into a top-level container.

Figure 4: Illustration of the addSpace and addSpring methods of ggroup using the gWidgetsRGtk2 package after resizing the main window. The buttons are pushed to the right side by addSpring.

## Sketch (the original listing is truncated in the source): a horizontal
## ggroup in a top-level window; addSpring() pushes the buttons to the right
win <- gwindow("Spring example")
g <- ggroup(cont = win)
addSpring(g)
gbutton("cancel", cont = g)
gbutton("ok", cont = g)


construction and layout. Figure 5 shows a realization using rJava.

The widgets have various arguments to specify their values that depend on what the widget does. For example, the slider and spinbutton select a value from a sequence, so the arguments mirror those of seq. Some widgets have a coerce.with argument which allows a text-based widget, like gedit, to return numeric values through svalue. The gdroplist widget is used above to pop up its possible values for selection. Additionally, if the editable=TRUE argument is given, it becomes a combo-box allowing a user to input a different value.
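For example (a minimal sketch, not from the article), coerce.with makes svalue return a number rather than a string:

e <- gedit("0.95", coerce.with = as.numeric, cont = gwindow("coerce example"))
svalue(e)   # numeric 0.95 rather than the string "0.95"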

In Figure 5 we see in the slider widget one of the limitations of the gWidgets approach: it is not as feature-rich as any individual toolkit. Although the underlying JSlider widget has some means to adjust the labels on the slider, gWidgets provides none in its API, which in this case is slightly annoying, as Java's chosen values of 5, 28, 51, 74, and 97 just don't seem right.

The values specified by these widgets are to be fed into the function makeCIs, which is not shown here, but uses matplot to construct a graphic displaying simulated confidence intervals. To make this GUI interactive, the following handler will be used to respond to changes in the GUI. We need to call svalue on each of the widgets and then pass these results to the makeCIs function. This task is streamlined by using lapply and do.call:

## Sketch (the original listing is truncated in the source); the widget
## names collected into 'allWidgets' are illustrative
allWidgets <- list(n = nWidget, mean = meanWidget, sd = sdWidget,
                   conf.level = confWidget)
updateCIs <- function(h, ...) {
  vals <- lapply(allWidgets, svalue)   # read the current value of each widget
  do.call(makeCIs, vals)               # and hand them to makeCIs
}


widget has the default handler respond to the event of a user pressing the ENTER key, whereas the addHandlerKeystroke method can be used to add a handler that responds to any keystroke. Officially, only one handler can be assigned per event, although some toolkits allow more. Handlers return an ID which can be used with removeHandler to remove the response to the event.

The signature of a handler function has a first argument, h, to which is passed a list containing components obj, action and perhaps others. The obj component refers to the underlying widget. In the examples above we found this widget after storing it to a variable name; this provides an alternative. The action component may be used to pass along an arbitrary value to the handler. The other components depend on the widget. For the ggraphics widget (currently just gWidgetsRGtk2), the components x and y report the coordinates of a mouse click for addHandlerClicked, whereas for addDropTarget the dropdata component contains a string specifying the value being dragged.
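A minimal sketch of this signature (the widget, the action value and the handler body are illustrative, not from the article):

b  <- gbutton("press me", cont = gwindow("handler example"))
id <- addHandlerClicked(b, action = "pressed!",
        handler = function(h, ...) {
          svalue(h$obj) <- h$action   # h$obj is the widget, h$action the extra value
        })
## removeHandler(b, id) detaches the handler again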

To illustrate the drag-and-drop handlers the following creates a button and a label. The drop source needs a handler which returns the value passed along to the dropdata component of the drop target handler. In this example, clicking on the first button and dragging its value onto the label will change the text displayed on the label.

g <- ggroup(cont = gwindow("DnD example"))
## The rest of the listing is truncated in the source; a sketch:
l1 <- gbutton("drag me", cont = g)
l2 <- glabel("drop here", cont = g)
addDropSource(l1, handler = function(h, ...) svalue(h$obj))
addDropTarget(l2, handler = function(h, ...) svalue(h$obj) <- h$dropdata)


Figure 7: The GUI for getQuotes created by ggenericwidget using the RGtk2 toolkit.

We define two functions to display graphical summaries of the stock, using the zoo package to produce the plots. Rather than use a plot device (as only gWidgetsRGtk2 implements an embeddable one through ggraphics), we choose instead to make a file containing the plot as a graphic and display this using the gimage widget. In gWidgetstcltk the number of graphic formats that can be displayed is limited, so we use an external program, convert, to create a gif file. This is not needed if using RGtk2 or rJava.

## Sketch only (the original listing is truncated in the source). It assumes
## a function getQuotes(), defined earlier in the article, returning a zoo
## series of prices for a ticker symbol.
showTrend <- function(sym) {
  pngfile <- tempfile(fileext = ".png")
  png(pngfile); plot(getQuotes(sym), main = sym); dev.off()
  giffile <- sub("png$", "gif", pngfile)
  system(paste("convert", pngfile, giffile))   # 'convert' only needed for gWidgetstcltk
  gimage(giffile, cont = gwindow(sym))
}


        coerce.with=as.numeric,
        handler=updateNB)
e$pg


Financial Journalism with R

by Bill Alpert

R proved itself a sharp tool in testing the stock picks of Jim Cramer, a popular US financial journalist. We examined Cramer's advice for a recent cover story in Barron's, the Dow Jones & Co. stock market weekly, where I am a reporter and floundering R user (Alpert, 2007). The August 20, 2007 story should be accessible without subscription at the Barron's website (http://www.barrons.com).

The 52-year-old Cramer once ran a hedge fund which racked up 24% annualized returns over about a dozen years. His current celebrity comes from the Mad Money television show on the cable network CNBC, in which he makes Buy and Sell recommendations to the accompaniment of wacky sound effects and clownish antics. A few have attempted to measure the performance of modest samples of Cramer's picks (Engelberg et al., 2007; Nayda, 2006). Yet Cramer makes almost 7,000 recommendations a year, according to the count of a database at his Mad Money website (http://madmoney.thestreet.com/). He has never reported the performance of all those stock picks. I figured I'd try.

As in most projects, data collection was the hard part. I called Cramer and asked for any records of his Mad Money picks. After ripping into me and all journalists who've reviewed his show, he stopped taking my calls. Meanwhile, I found a website maintained by a retired stock analyst, who had tallied about 1,300 of Cramer's Mad Money recommendations over two years. I also found the abovementioned database at Cramer's official website, which recorded over 3,400 recommendations from the preceding six months. This Cramer site classified his picks in ways useful for subsetting in R, but conspicuously lacked any performance summaries. I turned these Web records of his stock picks into Excel spreadsheets. Then I downloaded stock price

histories from Edgar Online's I-Metrix service (http://I-Metrix.Edgar-Online.com), using some Excel macros. None of this was trivial work, because I wanted a year's worth of stock prices around each recommendation and the date ranges varied over the thousands of stocks. Financial data suppliers can deliver a wealth of information for a single, common date range, but an "event study" like mine involved hundreds of date ranges for thousands of stocks. Most finance researchers deal with this hassle by using SAS and a third-party add-on called Eventus that eases data collection. But I wanted to use R.

I reached out to quantitative finance programmer Pat Burns and Pat wrote some R code for our event study style analysis. Pat has posted his own working paper on our study at his website (http://www.burns-stat.com). R's flexibility was useful because we needed a novel data structure for the Cramer analysis. In most analyses, the data for prices (or for returns) are in a matrix where each column is a different stock and each row is a specific date. In our case, each stock recommendation had the same number of data points, so a matrix was a logical choice. However, instead of each row being a specific date, it was a specific offset from the recommendation date. We still needed the actual date, though, in order to get the difference in return between the stocks and the S&P 500 on each day to see if Cramer's picks "beat the market." Pat's solution was to have a matrix of dates as a companion to the matrix of prices. It was then a trivial subscripting exercise to get a matrix of S&P returns that matched the matrix of returns for the recommendations. Many stocks were recommended multiple times, so the column names of the matrices were not unique.
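That structure can be sketched in a few lines of R (all object names and the indexing scheme here are illustrative, not the actual code): a returns matrix with one column per recommendation and one row per day-offset, a same-shaped matrix of calendar dates, and a named vector of S&P 500 returns indexed by date.

## 'rets'  : daily log returns, columns = recommendations, row i = i-th day after broadcast
## 'dates' : matrix of the same shape giving the calendar date of each cell
## 'sp500' : vector of S&P 500 daily log returns, named by date
sp.matched <- matrix(sp500[as.character(dates)],
                     nrow = nrow(rets), ncol = ncol(rets))
rel.rets <- rets - sp.matched                          # market-adjusted returns
avg.cum  <- cumsum(rowMeans(rel.rets, na.rm = TRUE))   # average cumulative excess return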

Once the data were in R, it was fairly easy to test various investment strategies, such as buying the day after Cramer's 6 pm show or, instead, waiting two days before buying. Any investment strategy that actively picks stocks should at least beat the returns you'd get from a passive market-mimicking portfolio with comparable risk (that is, similar return variance) and it should beat the market by enough to cover trading costs. Otherwise you ought to keep your money in an index fund. You can see the importance of market-adjusting Cramer's returns by comparing the red and blue lines in Figure 1. The Nasdaq Composite Index is arguably a better match to the riskiness of Cramer's widely-varying returns. We made his performance look better when we used the S&P.

Figure 1: Average cumulative percentage log return from the day of broadcast for approx. 1,100 Mad Money recommendations recorded at http://www.yourmoneywatch.com. Blue line shows return relative to S&P 500 Index.

The red line shows his average pick's unadjusted log return, while the blue shows the log


return relative to the Standard & Poor's 500 Index. The data were roughly 1,100 Buy recommendations over the period from July 2005 to July 2007. The lines' left-hand peaks mark the trading day after each evening broadcast, when enthusiastic fans have bid up Cramer's pick. We ultimately tested several dozen investment strategies.

The results were disappointing for someone who wants to follow Cramer's advice. You could not beat the market by a statistically significant amount if you followed his recommendations in any way readily available to a viewer. But we did find that you might make an interesting return if you went against Cramer's recommendations, shorting his Buys the morning after his shows, while putting on offsetting S&P 500 positions. This shorting opportunity appears in Figure 1, as the widening difference between the red and blue lines. If a viewer shorted only Cramer's recommendations that jumped 2% or more on the day after his broadcasts, that difference could earn the viewer an annualized 12% on average (less trading costs). The bootstrapped 95% confidence intervals of this difference ranged from 3% to 21%. (For background on bootstrap techniques, see Efron and Tibshirani, 1993.)
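A bootstrap interval of that sort needs only a few lines of R. The sketch below resamples a hypothetical vector excess holding the per-recommendation annualized excess returns of the shorting strategy; it is not the authors' code:

set.seed(1)
boot.means <- replicate(10000, mean(sample(excess, replace = TRUE)))
quantile(boot.means, c(0.025, 0.975))   # bootstrapped 95% confidence interval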

[Figure 2 here. Panel: "Negative lightning"; x-axis: Days (0 to 30); y-axis: Cumulative abnormal return.]

Figure 2: Cumulative log return relative to S&P 500 on about 480 of Cramer's off-the-cuff Lightning Round Sell recommendations, as recorded at madmoney.thestreet.com. The yellow band is a bootstrapped 95% confidence interval, showing that the negative return (desirable for Sell recommendations) is clearly different from zero.

One reason we tested so many different strategies is that when I finally started getting responses from Cramer and CNBC, they kept changing their story. They argued about selection. Neither of my databases were reliable records of Cramer's picks, CNBC said, not even the database endorse