



Volume 21, December 2010

A joint newsletter of the Statistical Computing & Statistical Graphics Sections of the American Statistical Association

A Word from our 2010 Section Chairs

LUKE TIERNEY, COMPUTING

SIMON URBANEK, GRAPHICS

The 2010 JSM saw the presentation of the first Statistical Computing and Graphics Award to Ross Ihaka and Robert Gentleman in recognition of their work in initiating the R Project for Statistical Computing.

Fortunately both were able to receive their awards in person in a very well attended session and present their (very different!) views on future directions for R.

Continued on page 2 . . .

The main purpose of statistical graphics is to display data, models, or properties in a way that yields new insights. It is very important that graphics be guided by this principle, which is often forgotten in the race for the fanciest or ‘cool’-looking displays, especially outside of statistical graphics.

Although we should keep our eyes open for new approaches that allow us to display important

Continued on page 2 . . .

Contents of this volume:
A Word from our 2010 Section Chairs . . . . . 1
Highlights from the Joint Statistical Meetings . 2
JSM 2011 Announcements . . . . . . . . . . . 4
barNest: Illustrating nested summary measures . 5
You say “graph invariant,” I say “test statistic” . 11
Computation in Large-Scale Scientific and Internet Data Applications is a Focus of MMDS 2010 . . . 15
Section Officers . . . . . 21


Computing Chair
Continued from page 1.

This coming JSM will again have a Data Expo competition. The topic is the Deepwater Horizon oil spill. For details visit http://stat-computing.org/dataexpo/. Data sources are still being collected on the competition web page, and diligent searching may turn up many more, so it is a great opportunity for you to show your skills as a data sleuth.

Thanks to Thomas Lumley for a great program at Vancouver. David Poole has put together a strong invited program for the 2011 JSM at Miami Beach. We are also hoping to offer several exciting CE courses, so you may want to keep an eye out for those. The contributed program of course depends on you: I encourage you to submit contributions and mark them for our section.

This is my final note to you as chair, so I would like to thank all current and retiring board members for their hard work and welcome the new members and our new chair, Rich Heiberger.

Luke Tierney
University of Iowa

Graphics Chair
Continued from page 1.

properties better or to handle large data, we should always keep in mind that the goal is to represent the underlying data in such a way that we can understand what we see. Only then will graphics serve an analytical purpose.

This may sound familiar to most of our members, but we need to get this message out to other visualization communities as well. Interdisciplinary collaboration is very common, and we should take this opportunity to learn from other fields but also share the knowledge we have accumulated in our community over many years. Better graphics, even for very specialized applications, are in high demand. It is always a pleasure to see excited domain experts when discussing results in graphical form – on paper or interactively.

Part of this sharing, at least among statisticians, happens at the JSM, and we had a very strong program for which I would like to thank Heike Hofmann. Webster West is preparing next year’s program, so please feel free to suggest topic contributed sessions and do not forget to submit abstracts – the JSM website is open for submissions. I would also like to encourage more graphics contributions to the Student Paper Competition, which closes very soon (December 13). Our biennial Data Expo 2011 competition is on again – please see the website http://stat-computing.org/dataexpo/ for details.

Finally, on behalf of the section I would like

to thank Rick Wicklin for his excellent service to the section as Secretary/Treasurer and welcome Jay Emerson in this role. I would also like to thank everyone in the Statistical Graphics and Statistical Computing sections, as well as the section officers, for a great year. Juergen Symanzik will be taking over as the chair of Graphics in 2011.

Simon Urbanek
AT&T Research

Highlights from the Joint Statistical Meetings

Stat Computing program

This JSM saw the inaugural Statistical Computing and Graphics award, given to Ross Ihaka and Robert Gentleman for the creation of R. We had a well-attended session including a fairly large contingent from Auckland, where Ross and Robert were working at the time. Among the other invited sessions in Computing was the session on network models that started off the current SAMSI program on complex networks. Our Topic Contributed sessions included an interesting and diverse set of student paper award presentations; congratulations to the students, and thanks to the reviewers. Highlights of the contributed paper sessions included a Tuesday morning session on software and user interfaces. The joint Computing and Graphics mixer was well attended, and we thank all the companies and individuals who provided door prizes. Finally, the city of Vancouver was itself a highlight of the JSM, a convention location that really is at its best in August.

Thomas Lumley
University of Auckland

Graphical Highlights from JSM 2010

Vancouver was a gorgeous setting for the Joint Statistical Meetings, with water and mountains in view from the conference center. The Graphics section had a good turnout of attendees and sessions.

We had two invited sessions this year. ‘Quantified Self: Personal Data Collection, Analysis, and Exploration’ had outside speakers, giving us a perspective on collecting data on and around oneself. Seth Roberts from Tsinghua University shared his views and his very personal data, and got a great discussion going. The other invited session was one that is very close to my heart and research. It introduced two new approaches for getting highly interactive displays into R: Simon Urbanek showcased the iPlots eXtreme package Acinonyx, and Hadley Wickham and Michael Lawrence presented the Qt-based packages qtbase and qtpaint. A few months down the road, further development on all these packages shows more and more clearly that highly interactive graphics in R will soon become the rule rather than the exception for everybody!

We had two topic contributed sessions this year. Thanks to Charlotte Wickham and Dawn Woodberger for organizing them! One looked at last year’s Data Expo: we saw data on millions of flights, and various approaches to thinking about exploring this mass of data. It seemed to be a good venue for another look at what we had only seen briefly in poster form the year before. The other topic contributed session linked Bayesian and graphical statistics, and provided a theoretical approach to implementing user feedback in a data exploration.

One of the sessions that caused a lot of talk right at the JSM and afterwards was the session on ‘Graphics in clinical trials’, which drew so much interest that people queued in the hallways! In case you were unlucky or missed the session: Jürgen Symanzik pulled all the slides together, and you will be able to find them and session details at http://stat-computing.org/events/2010-jsm/.

Overall, one of the highlights of Vancouver’s JSM was the session for the inaugural Computing-Graphics Software Award, which went to R & R – Robert Gentleman and Ross Ihaka – for bringing R to us. The award will be ongoing and is used to ‘recognize an individual or team for innovation in computing, software, or graphics that has had a great impact on statistical practice or research’. A Revolution Analytics comment summarizes it nicely: “It really can’t be overstated how well-deserved this award is.”

Speaking of awards, the section will again sponsor the ASA Data Expo in 2011. It is already under way, but there is still plenty of time to start working on the data! This year’s topic is the effect of the BP oil spill. More details and a sign-up for the competition are available at http://streaming.stat.iastate.edu/dataexpo/2011/ or just google for ‘Data Expo’.

Don’t forget to make use of the opportunity to put in topic contributed sessions; it is a great way to get talks on a similar topic into the same session, while at the same time helping to strengthen the section’s allocation of invited sessions in the future. Deadlines for topic contributed sessions are the same as for regular contributed talks.

See you all at Miami Beach next year,

Heike Hofmann
Iowa State University



JSM 2011 Announcements

Roundtable Discussion Leaders

Dear members of the sections on statistical computing and graphics,

We are looking for people to lead roundtable discussions at JSM 2011 in Miami. Roundtable discussions are small, informal discussions of 10 or fewer people, led by an expert in a particular topic. These roundtables are a good way to meet other folks interested in specific topics in statistical computing, engage in some good discussion, and get a free meal!

Note that leading a roundtable does NOT count towards your one allotted JSM event – i.e., you can lead a roundtable and still be eligible to give another presentation at JSM.

Roundtables in 2010 included:

• Behind the scenes at the Netflix prize

• Facilitating communication about analysis data needs across functional areas

• Effective collaboration of statistical programmers and clinical statisticians

• R’s graphical user interface ‘R Commander’ in the intro stats classroom

• R graphics with an Excel front end

• R graphics for EDA

Abstracts for roundtables need to be submitted in January, so if you are interested please contact Chris Volinsky ([email protected]) and Hadley Wickham ([email protected]).

Regards,

Chris Volinsky
AT&T Labs

Data Expo 2011

The ASA Statistical Graphics and Computing sections are pleased to announce Data Expo 2011, a poster competition at the Joint Statistical Meetings in Miami Beach, Florida, Jul 30-Aug 4, 2011. Details of the competition are at:

http://streaming.stat.iastate.edu/dataexpo/2011/ (or web search "Data Expo 2011")

http://groups.google.com/group/data-expo-2011

Explore how the Deepwater Horizon oil impacts the gulf. Visualize where the oil is and where it is going. Show us how the oil spill has affected you. Entries will be judged during the poster session at JSM 2011, Miami Beach, Florida. Cash prizes will be awarded, and winners will be invited to submit to Computational Statistics. Separate awards for college and high school students.

Photo by http://www.flickr.com/photos/kk/4713571958

ASA Data Expo 2011

This year’s competition focuses on publicly collected government data monitoring the effects of the BP oil spill following the April 20, 2010 Deepwater Horizon rig explosion.

Entry is open to all. Cash prizes will be awarded. Please send an email expression of interest to [email protected] if you plan on participating. The first deadline is to submit a poster abstract to the meetings web site by Feb 1, 2011.

Please spread the word as widely as you are willing. The more entrants, the better the competition!

Dianne Cook
Iowa State University



barNest: Illustrating nested summary measures

Jim Lemon and Ofir Levy

Abstract

The challenge of illustrating nested summary measures is introduced, and methods for meeting this challenge are outlined. The nested bar plot is suggested as one method that finds support in studies of perception and is similar to other common illustrations. Its major advantage is that most viewers need little or no explanation to understand the relationships illustrated.

Introduction

Descriptive studies of large samples often include a number of categorical measures that define disjoint subsamples, such as males and females or working and unemployed. Summaries of other measures are often of interest, for example the mean ages of males and females or the median income of those working and unemployed. The categories may also be combined to discover whether a comparison within one category is different from that within another. Illustrating such interactions, especially across the different levels of categories, may not be easy.

An example of the problem: women and children first

The classic survival data from the sinking of the Titanic provide a good example of this. The survival rate broken down by accommodation class, sex and age of the passengers and crew is often used to illustrate the unequal outcomes that were once accepted as the normal course of events, as well as the gallantry of those who chose to sacrifice their lives for others. Making the relationships between these different effects apparent to the average audience is the challenge attempted here.

Perceptual considerations

The intended illustration must make both the hierarchical grouping of the summary measures and the comparisons between different levels of grouping obvious. The categories that are combined to make the groups must also be clear to the viewer. The bar plot was chosen as the plot style. One of the principles of gestalt perception is that objects in the visual field will be grouped by their similarities and the groups will be separated by dissimilarities or discontinuities. Thus bar plots in which bars are grouped by their horizontal position provide an easy cue to their relationship in the data set. While it is possible to add an extra level of grouping by stacking bars within a group defined by horizontal position, this makes the relative heights of the bars almost impossible to compare. Additionally, it is difficult to see how this could be extended to further levels of grouping.

A response to the problem: the nested bar plot

The approach taken was to devise a function that would display bars at each level of categorization, making the hierarchical structure obvious by displaying each group of categorized summary measures within the aggregated summary measure that contained them. Thus Figure 1 begins with a bar that is almost the full width of the plot, with each successive set of summary measures plotted within the bar that represents the superordinate category. The three small plots above the main plot show how it is built up. Each level of categorization has a different set of colors, and while the colors are repeated within each level of categorization, it was intended that both the separation between groups and the underlying bars would help to avoid confusion between the groups. The labels beneath the plot were intended to reinforce these distinctions.

Proceeding from the overall proportion surviving, it is clear that the accommodation class had a considerable effect. First class passengers were half again as likely to survive as second class, second class half again as likely to survive as third class. There was little difference in survival between the third class passengers and the crew. Within the accommodation classes, it is again obvious that children were much more likely to survive than adults, particularly in the first and second classes, in which no children lost their lives. Finally, the innermost sex breakdown shows that females were much more likely to survive in every class and age division apart from the universal survival of children in first and second class mentioned above. Conjectures about the relative gallantry of the males in each class might even be made.

What are the alternatives?

There is an unavoidable compromise in illustrating one summary measure at the expense of others. The nested bar plot attempts to clearly display one summary measure, in this case the proportion of survivors, across the entire hierarchical breakdown. Another approach is to translate the number of survivors and non-survivors in each subset into tiled rectangles, the area of which is proportional to the frequency of each cell. This is known as a mosaic plot, initially attributed to Hartigan and Kleiner [2]. It is possible to construct a mosaic plot with two dimensions of disaggregation on the two axes of a plot and add more levels by nesting (Figure 3). While the mosaic plot accurately portrays the quantitative relationships between the observations, a glance at such a plot of the Titanic data shows the difficulty of visually comparing the proportions across even one level, let alone three.

One commonly suggested alternative to the nested bar plot is the “doubledecker” plot [1]. Figure 2 is an example using the same data with this plot. As the number of children was small in each class and non-existent in the crew, the identifying labels become illegible, even with a wide graphics device. As there is no nesting of the rectangles, the intermediate levels of aggregation are not displayed. All variations on this technique known to the authors suffer from similar problems. The superbarplot function in the ‘UsingR’ package uses nested bars to display extreme values and variability, but presents separate subsets of data, such as daily temperature measures.
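Both alternative displays can be reproduced from the built-in Titanic contingency table. The sketch below uses base R’s mosaicplot for a Figure 3-style display and, assuming the ‘vcd’ package (which is not otherwise used in this article) and its documented formula interface, a Figure 2-style doubledecker.

```r
# Figure 3 style: a mosaic plot from the built-in Titanic contingency
# table, using base R's mosaicplot and its formula interface.
mosaicplot(~ Class + Sex + Age, data = Titanic, color = TRUE,
           main = "Titanic")

# Figure 2 style: a doubledecker plot; this assumes the 'vcd' package
# is installed and follows its documented formula interface.
if (requireNamespace("vcd", quietly = TRUE)) {
  vcd::doubledecker(Survived ~ Class + Sex + Age, data = Titanic)
}
```

Comparing the two devices side by side makes the labeling problems described above easy to see for yourself.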

The nested bar plot departs from the conventional stance that bar plots should be restricted to displaying frequencies. However, the available alternatives are much more difficult, and in some cases impossible, to display in a visual hierarchy.

The programming of barNest

The R function that displays nested bar plots proceeded through a number of stages before producing satisfactory plots. The first consideration was to define and calculate a data object that would contain the summary measures to be plotted. Fortunately, a list containing the summary measures for each level of disaggregation is easy to produce in R.

Displaying the plot requires recursive calls, as the number of category levels typically differs at different levels of categorization. The display is produced by a function that calls itself for each subcategory until there are no further levels of categories to illustrate.
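The recursive structure can be sketched without reference to the actual ‘barNest’ source. The helper below is our own illustrative construction (the name nested_summary is hypothetical, not part of ‘plotrix’): it computes a summary at the current level, then calls itself for each subcategory of the next grouping variable.

```r
# Hypothetical illustration of the recursive structure (not the real
# barNest code): summarize at the current level, then recurse into each
# subcategory defined by the next grouping factor.
nested_summary <- function(x, groups, FUN = mean) {
  out <- list(value = FUN(x))
  if (length(groups) > 0) {
    pieces <- split(seq_along(x), groups[[1]])
    out$children <- lapply(pieces, function(idx) {
      sub_groups <- lapply(groups[-1], function(g) g[idx])
      nested_summary(x[idx], sub_groups, FUN)
    })
  }
  out
}

# Example: mean mpg overall, then by cylinders, then by gears.
res <- nested_summary(mtcars$mpg,
                      list(factor(mtcars$cyl), factor(mtcars$gear)))
res$value                  # overall mean
res$children[["4"]]$value  # mean for 4-cylinder cars
```

The recursion bottoms out when the list of grouping factors is empty, which mirrors the article’s description of the display function.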

Using the barNest function

Users of ‘barNest’ will typically have a data frame containing a column of numeric values that are to be grouped by two or more columns of categorical values. The order of grouping is specified in the formula that is the first argument of the function, with the leftmost variable defining the first set of categories. Most of the arguments, such as ‘main’ and ‘ylab’, will be familiar to those who use other plot functions. The ‘trueval’ argument is used when the proportion of one value in a categorical variable is to be plotted rather than a measure of central tendency in a numeric variable.

The code used to produce Figure 1 is the first example on the ‘barNest’ help page in the ‘plotrix’ package, and recreates the Titanic data set as a data frame. The formula and data set are then passed directly to ‘barNest’.
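A minimal call might look like the following sketch, which expands the built-in Titanic table into one row per person. The argument names and color values follow our reading of the ‘plotrix’ documentation and may differ slightly across package versions, so treat this as illustrative rather than a verbatim copy of the help-page example.

```r
# Expand the built-in Titanic contingency table into one row per person.
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(seq_len(nrow(titanic)), titanic$Freq),
                   c("Class", "Age", "Sex", "Survived")]

# One set of colors per level of categorization
# (overall, class, age, sex); the particular hues are our choice.
titanic.colors <- list("gray",
                       c("#0000ff", "#7f7fff", "#9f9fff", "#bfbfff"),
                       c("#ff0000", "#ff7f7f"),
                       c("#00ff00", "#7fff7f"))

# trueval = "Yes": plot the proportion of survivors rather than a mean.
if (requireNamespace("plotrix", quietly = TRUE)) {
  plotrix::barNest(Survived ~ Class + Age + Sex, data = titanic,
                   col = titanic.colors, showall = TRUE,
                   main = "Titanic survival by class, age and sex",
                   ylab = "Proportion surviving", trueval = "Yes")
}
```

The leftmost variable after the tilde defines the outermost breakdown, as described above.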

In order to define the colors, it is necessary to create a list with one element for each level of categorization. The number of colors in each level should be at least as many as the number of categories. Choosing colors that are easily distinguished may require some thought. Apart from the ‘errbars’ option, the user may also ask for only the final set of categories to be displayed, or not to display the labels below the bars. These bar labels can be passed as a list similar to the bar colors if the labels of the category variables are not suitable. There are a number of fine adjustments, like the size



[Figure 1 here: a nested bar plot titled “Titanic survival by class, age and sex”. Three small panels above the main plot show how it is built up: “Overall only”, “Overall + Class”, and “Overall + Class + Age”. The y-axis is the proportion surviving (0.0 to 1.0); the bar labels run Overall; 1st/2nd/3rd/Crew; Adult/Child; and F/M.]

Figure 1: Survival on the Titanic by accommodation class, age and sex (barNest).



[Figure 2 here: a doubledecker plot of the Titanic data, with Class, Sex and Age labels along the bottom and shading for Survived (Yes/No).]

Figure 2: Survival on the Titanic by accommodation class, age and sex (doubledecker).

of the characters in the labels and the proportion by which to shrink each succeeding set of bars. Detailed instructions for the use of ‘barNest’, as well as the function itself, can be found in the ‘plotrix’ package.

As a final example of what ‘barNest’ can accomplish, we will turn to a lighter set of data, concerned with making holes in paper rather than ocean liners. The scores for a year in two matches for a small pistol club will be used to demonstrate an interesting, if sobering, effect of age. The club members were categorized by age as being under 20 years, 20 to 40 years, and over 40 years of age. Figure 4 shows the mean scores and 0.25 and 0.975 quantiles broken down by match and age. Only the ultimate breakdown is displayed, but the grouping labels are retained. It is clear that the 20 to 40 year old members have a considerable advantage in both matches. Perhaps just as interesting is that they are also much less variable in their scores. This is not entirely due to the opposing influences of training and age, but is in part a ceiling effect of the possible score of 600 points. However, the ceiling effect is clearly stronger in Open Sport Pistol than in Standard Pistol. Readers who are pistol shooters may be interested in speculating upon this.

Limitations of barNest

Figure 1 displays four levels of summary measures. Even this modest level of complexity may require a wider than normal graphics device to avoid crowding of the labels, although it does not in this instance. The visual complexity of the plot is apparent, yet viewers typically understand the nested structure and grasp the relationships between the values.

The illustration of dispersion available in



[Figure 3 here: a mosaic plot of the Titanic table, with Class on one axis and Sex on the other, subdivided by Age (Child/Adult) and shaded by Survived (Yes/No).]

Figure 3: Survival on the Titanic by accommodation class, age and sex (mosaicplot).

‘barNest’ is easily misused. Setting the ‘errbars’ argument to TRUE causes the values calculated by the second and possibly third summary functions passed as arguments to be displayed as “error bars”. It is left to the user to decide whether these convey valuable, or even correct, information to the viewer. At best, the error bars can give a rough visual indication of the significance of differences in the summary measures or, as in Figure 4, differences in variability that may be of interest.

Summary

The nested bar plot offers the user a method to illustrate summary values that are nested in increasingly complex combinations of categorical variables. Its major advantage is the ease with which an audience can grasp the relationships between hierarchical summary measures with little or no explanation of the illustration. We expect that researchers wanting to illustrate such relationships to non-specialist audiences, particularly when there is no opportunity to explain the illustration, will find it useful.

Bibliography

[1] M. Friendly. Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, (89):190–200, 1994.

[2] J. A. Hartigan and B. Kleiner. A mosaic of television ratings. The American Statistician, (9):17–23, 1984.

Jim Lemon
bitwrit software
16A Imperial Avenue, Gladesville, NSW
[email protected]

Ofir Levy
Department of Zoology
Tel Aviv University, Tel Aviv
[email protected]

[Figure 4 here: nested bars titled “Pistol scores by match and age”; the y-axis is mean score (0 to 500); labels run Overall, then Open Sport Pistol and Standard Pistol, each broken down into under20 / 20−40 / over40.]

Figure 4: Mean scores and 0.25 and 0.975 quantiles for two matches by age.



You say “graph invariant,” I say “test statistic”

Carey E. Priebe, Glen A. Coppersmith and Andrey Rukhin

Introduction: Statistical Inference on Random Graphs

Hypothesis testing on graphs g ∈ G has application in areas as diverse as connectome inference (wherein vertices are neurons or brain regions), social network analysis (wherein vertices represent individual actors or organizations), and text processing (wherein vertices represent authors or documents). Graph invariants – functions T : G → R that do not depend on the particular labeling of the vertices – can be used as test statistics on a random graph G ∼ F for deciding H0 : F ∈ F0 vs HA : F ∈ FA.

However, even for simple models the exact distribution is unavailable for most invariants. Furthermore, comparative analyses of statistical power at some given Type I error rate for competing invariants, via both Monte Carlo and large sample approximation, demonstrate that simple settings can yield interesting comparative power phenomena.

In particular, two forthcoming articles investigating comparative power of simple invariants in the independent edge setting show that for small graphs (10³ vertices) the comparative power surface is complicated [1] and limiting behavior may be misleading except for astronomically large graphs [2].

Random Graphs: Simple Models and Simple Invariants

We consider simple graphs (undirected, no weights, no self-loops) on n vertices¹. Perhaps the simplest of all random graph models is H0: ER(n, p), wherein all (n choose 2) potential edges are independent and identically distributed Bernoulli(p). A simple but interesting and illustrative alternative is given by HA: (n, p, m, q), wherein there exists a subset of m vertices and all (m choose 2) potential edges amongst this subset are identically distributed Bernoulli(q), while the remaining (n choose 2) − (m choose 2) potential edges are identically distributed Bernoulli(p), with all (n choose 2) potential edges independent. We consider p known, so that the null is simple; m (and in particular, which m) and q are unknown, so that the alternative is composite. (These models consist of independent coin flips . . . how can this yield interesting (i.e., non-trivial) results? In response to this question we quote Béla Bollobás’ comment [3] that a particular J.E. Littlewood paper [4] “is highly recommended to those who think that the binomial distribution is too simple to deserve study.”)

Among the simplest of all graph invariants is size, the total number of edges in the graph. Under H0, size(G) is binomial((n choose 2), p), while under HA this invariant is the sum of independent binomials with different success probabilities; limiting Gaussian distributions are available in both cases. Another simple invariant is maxdegree, the maximum over all vertices v of the degree d(v). The individual degrees d(v) are binomial(n−1, p) under H0 and the sum of independent binomials with different success probabilities under HA; the collection {d(v)} is not independent, but limiting Gumbel distributions are available in both cases.

Many more interesting random graph models – in particular, latent position models – are available for study (see for instance [5], Section 3), as are more elaborate graph invariants – in particular, the number of triangles can dominate size, and the graph scan statistic can dominate maxdegree, in terms of statistical power on the inference task (see [1], Figures 13 and 11, respectively). There are significant issues involved in actually computing many candidate invariants on large graphs and in estimation of percentiles for testing, so the trade-offs between better invariants for inference and computational issues demand investigation. Nevertheless, even our simple models and simple invariants yield interesting and illustrative results.

¹ We assume familiarity with basic graph terminology, or see e.g. [1, 2, 3].



Figure 1: Our first comparative power plot, from 2006. This plot considers the maximum average degree (mad) invariant against the graph scan statistic (scan) via Monte Carlo. H0: ER(n = 100, p = 0.1), HA: (n = 100, p = 0.1, m, q), level α = 0.05, Δ(β) ≡ power(mad) − power(scan). We see the phenomenon of interest: both invariants have power β ≈ 1 for large m and q and power β ≈ α for small m and q, as expected, while for moderate m and q, the comparative behavior of the invariants is quite complicated. In particular, there is a ridge/trough phenomenon running (nonlinearly) from large q, small m (where scan is superior) to small q, large m (where mad is superior) which differentiates the invariants. That is, the specific alternative – how many anomalous vertices (m) and by how much they are anomalous (q) – determines the most powerful statistic. This phenomenon is the impetus for on-going research.

Interesting and Illustrative Results

The first plot we made, years ago, which is the impetus for an on-going pursuit of understanding, is shown and described in Figure 1. The ridge/trough phenomenon apparent in the figure begs investigation!

Alas, the maximum average degree is a complicated invariant, both computationally and from a null and alternative distribution perspective, and so we have reason to consider simpler invariants as described above.

The forthcoming JCGS paper [1] presents results of a thorough Monte Carlo investigation. For example, Figure 2 is analogous to our original Figure 1 but compares size and maxdegree at n = 1000; the left panel depicts Monte Carlo results, and the right panel depicts the use of asymptotic results to provide approximate distributions analytically. In Figure 2 the Monte Carlo is accurate except for variability due to a finite number of replicates, while the accuracy of the large sample approximations must be verified. For the case depicted (n = 1000) the two comparative power surfaces are structurally similar, although there are differences beyond just Monte Carlo variation. For smaller graphs (n = 100) the Monte Carlo is still accurate but the asymptotic approximations are not; for larger n the Monte Carlo is computationally prohibitive but the asymptotics are accurate. Consider


[Figure 2 appears here: two 3-D comparative power surfaces, axes m = 10–100 and q = 0–1, vertical axis Δ(β) from −1 to 1.]

Figure 2: Comparative power plots for size vs. maxdegree via Monte Carlo (left) and large sample approximation (right) for H0: ER(n = 1000, p = 0.1), HA: (n = 1000, p = 0.1, m, q), level α = 0.05, Δ(β) ≡ power(maxdegree) − power(size). The ridge/trough phenomenon is apparent. Note that both axes in this Figure have been flipped with respect to Figure 1; with this understanding, one can see that the orientation of the ridge/trough is consistent.

a taxonomy of graph sizes wherein “small” means vertices and edges all fit into memory, “moderate” means vertices fit into memory but not edges, and “large” means even the vertices do not all fit into memory at once. In such a scenario we see that Monte Carlo analysis does not scale well with the number of vertices: for any “large” graph in this taxonomy even the simplest tasks (e.g. performing an edge-count) become computationally challenging.
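Where Monte Carlo is too expensive, the Gaussian limits mentioned earlier give the power of the size test in closed form. A sketch under my reading of the model (size ~ Binomial(C(n,2), p) under H0; under the alternative, the C(m,2) pairs among the m anomalous vertices instead appear with probability q):

```python
from math import comb, sqrt
from statistics import NormalDist

def size_power_normal(n, p, m, q, alpha=0.05):
    """Approximate power of the edge-count (size) test using its
    Gaussian limits: Binomial(N, p) under H0, a sum of two
    independent binomials under the alternative."""
    N, M = comb(n, 2), comb(m, 2)   # all pairs; pairs in the anomalous group
    mu0, sd0 = N * p, sqrt(N * p * (1 - p))
    crit = NormalDist(mu0, sd0).inv_cdf(1 - alpha)   # null critical value
    mu1 = (N - M) * p + M * q
    sd1 = sqrt((N - M) * p * (1 - p) + M * q * (1 - q))
    return 1 - NormalDist(mu1, sd1).cdf(crit)
```

When q = p the alternative collapses to the null and the approximate power is exactly α; as m or q grows the power climbs toward one, with no simulation required.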

The forthcoming JSPI paper [2] presents theoretical results comparing size and maxdegree, and shows that when m = Θ(√n), size dominates asymptotically but that this domination does not take effect until astronomically large n. This demonstrates that a comparison of test statistics based on limiting power can be misleading for graph inference. Figure 3 illustrates this effect, with m = √n.

[Figure 3 appears here: Δ(β) plotted against n on a log scale from 1e+02 to 1e+26, vertical axis from −1.0 to 1.0.]

Figure 3: Comparative power plot demonstrating that size dominates maxdegree asymptotically but that this domination does not take effect until astronomically large n. H0: ER(n, p = 0.1), HA: (n, p = 0.1, m = √n, q = 0.9), level α = 0.05, Δ(β) ≡ power(maxdegree) − power(size).


Conclusions

The conclusion of this short story is three-fold: (1) even for simple models and simple invariants, complex and interesting behavior is evident; (2) much work – theoretical, computational, and experimental – remains to provide results for realistic models and methods; and (3) an interesting plot (such as Figure 1) can provide years of enjoyment!

Bibliography

[1] H. Pao, G. A. Coppersmith, and C. E. Priebe. Statistical inference on random graphs: Comparative power analyses via Monte Carlo. Journal of Computational and Graphical Statistics, forthcoming.

[2] A. Rukhin and C. E. Priebe. A comparative power analysis of the maximum degree and size invariants for random graph inference. Journal of Statistical Planning and Inference, forthcoming.

[3] B. Bollobás. Random Graphs. Cambridge University Press, 2001.

[4] J. E. Littlewood. On the probability in the tail of a binomial distribution. Advances in Applied Probability, 1(1):43–72, 1969.

[5] A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129–233, 2009.

Carey E. Priebe, Glen A. Coppersmith, Andrey Rukhin
Department of Applied Mathematics and Statistics
Johns Hopkins University, Baltimore, MD
[email protected]


Computation in Large-Scale Scientific and Internet Data Applications is a Focus of MMDS 2010

Michael W. Mahoney

The 2010 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2010) was held at Stanford University, June 15–18. The goals of MMDS 2010 were (1) to explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured scientific and Internet data sets; and (2) to bring together computer scientists, statisticians, applied mathematicians, and data analysis practitioners to promote cross-fertilization of ideas. MMDS 2010 followed on the heels of two previous MMDS workshops. The first, MMDS 2006, addressed the complementary perspectives brought by the numerical linear algebra and theoretical computer science communities to matrix algorithms in modern informatics applications [1]; and the second, MMDS 2008, explored more generally fundamental algorithmic and statistical challenges in modern large-scale data analysis [2].

The MMDS 2010 program drew well over 200 participants, with 40 talks and 13 poster presentations from a wide spectrum of researchers in modern large-scale data analysis, including both academics and industrial practitioners. As with the previous meetings, MMDS 2010 generated intense interdisciplinary interest and was extremely successful, clearly indicating the desire among many research communities to begin to distill out and establish the algorithmic and statistical basis for the analysis of complex large-scale data sets, as well as the desire to move increasingly sophisticated theoretical ideas to the solution of practical problems.

Several Recurring Themes

Several themes—recurring melodies, as one participant later blogged, that played as background music throughout many of the presentations—emerged over the course of the four days of the meeting. One major theme was that many modern data sets of practical interest are better described by (typically sparse and poorly-structured) graphs or matrices than as dense flat tables. While this may be obvious to some—after all, both graphs and matrices are mathematical structures that provide a “sweet spot” between descriptive flexibility and computational tractability—this also poses considerable research and implementation challenges, given the way that databases have historically been constructed and the way that supercomputers have historically been designed. A second major theme was that computations involving massive data are closely tied to hardware considerations in ways that are very different than have been encountered historically in scientific computing and computer science—and this is true both for computations involving a single machine (recall recent developments in multicore computing) and for computations run across many machines (such as in large distributed data centers).

Given that these and other themes were touched upon from many complementary perspectives and that there was a wide range of backgrounds among the participants, MMDS 2010 was organized loosely around six hour-long tutorial presentations.

Large-Scale Informatics: Problems, Methods, and Models

On the first day of the workshop, participants heard two tutorials that addressed computational issues in large-scale data analysis from two very different perspectives. The first was by Peter Norvig of Google, and the second was by John Gilbert of the University of California at Santa Barbara.

Norvig kicked off the meeting with a tutorial on “Internet-Scale Data Analysis,” during which he described the practical problems of running, as well as the enormous potential of having, a data center so massive that “six-sigma” events, like cosmic rays, drunken hunters, blasphemous infidels, and shark attacks, are legitimate concerns. At this size scale, the data can easily consist of billions to trillions of examples, each of which is described by millions to billions of features. In most data-intensive Internet applications, the peak performance of a machine is less important than the price-performance ratio. Thus, at this size scale, computations are typically performed on clusters of tens or hundreds of thousands of relatively inexpensive commodity-grade CPUs, carefully organized into hierarchies of servers, racks, and warehouses, with high-speed connections between different machines at different levels of the hierarchy. Given this cluster design, working within a software framework like MapReduce that provides stateless, distributed, and parallel computation has benefits; developing methods to maximize energy efficiency is increasingly important; and developing software protocols to handle ever-present hardware faults and failures is a must.
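The stateless map/shuffle/reduce pattern mentioned here can be caricatured in a few lines of single-process code; this is only the programming model, not Google's implementation, and the word-count task is the standard toy example:

```python
from collections import defaultdict

def map_phase(doc):
    """Mapper: emit (key, 1) for every word, with no shared state."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: combine each key's values independently of all others."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "big failures"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

Because mappers and reducers hold no shared state, the framework is free to rerun any task on another machine after a hardware fault, which is exactly the fault-tolerance property the cluster design demands.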

Given all of this infrastructure, one can then do impressive things, as large Internet companies such as Google have demonstrated. Norvig surveyed a range of applications, such as modeling flu trends with search terms, image analysis for scene completion (removing undesirable parts of an image and filling in the background with pixels taken from a large corpus of other images), and using simple models of text to perform spelling correction. In these and other Web-scale applications, simpler models trained on more data can often beat more complex models trained on less data. This can be surprising for those with experience in small-scale machine learning, where the curse of dimensionality and overfitting the data are paramount issues. In Internet-scale data analysis, though, more data mean different data, and throwing away even rare events can be a bad idea, since much Web data consists of individually rare but collectively frequent events.

John Gilbert then provided a complementary perspective in his tutorial “Combinatorial Scientific Computing: Experience and Challenges.” Combinatorial Scientific Computing (CSC) is a research area at the interface between scientific computing and algorithmic computer science, and an important goal of CSC is the development, application, and analysis of combinatorial algorithms to enable scientific and engineering computations. As an example, consider so-called fill-reducing matrix factorizations that arise in the solution of sparse linear systems, a workhorse for traditional high-performance scientific computation. “Fill” refers to the introduction of new non-zero entries into a factor, and an important component of sparse matrix solvers is an algorithm that attempts to solve the combinatorial problem of choosing an optimal ordering of the columns and rows of the initial matrix in order to minimize the fill. Similar combinatorial problems arise in scientific problems as diverse as mesh generation, iterative methods, climate modeling, computational biology, and parallel computing. Throughout his tutorial, Gilbert focused on two broad challenges—the challenge of architecture and algorithms, and the challenge of primitives—in applying CSC methods to large-scale data analysis.

The “challenge of architecture and algorithms” refers to the nuts and bolts of getting high-quality implementations to run rapidly on machines, e.g., given architectural constraints imposed by communication and memory hierarchy issues or the existence of multiple processing units on a single chip. As an example of the impact of architecture on even simple computations, consider the ubiquitous three-loop algorithm for multiplying two n × n matrices, A and B: foreach i, j, k,

C(i, j) = C(i, j) + A(i, k) · B(k, j).

It seems obvious that this algorithm should run in O(n³) time (and it does in the Random Access Model of computation); but empirical results demonstrate that the actual scaling on real machines of this naïve algorithm for matrix multiplication can be closer to O(n⁵). Theoretical results in the Uniform Memory Hierarchy model of computation explain this scaling behavior, and it is only more sophisticated BLAS-3 GEMM and recursive blocked algorithms that take into account memory hierarchy issues that run in O(n³) time.

The “challenge of primitives” refers to the need

to develop algorithmic tools that provide a framework to express concisely a broad scope of computations; that allow programming at the appropriate level of abstraction; and that are applicable over a wide range of platforms, hiding architecture-specific details from the users. Historically, linear algebra has served as the “middleware” of scientific computing. That is, by providing mathematical tools, interactive environments, and high-quality software libraries, it has provided an “impedance match” between the theory of continuous physical modeling and the practice of high-performance hardware implementations. Although there are deep theoretical connections between linear algebra and graph theory, Gilbert noted that it is not yet clear to what extent these connections can be exploited practically to create an analogous middleware for very large-scale analytics on graphs and other discrete data. Perhaps some of the functionality that is currently being added onto the basic MapReduce framework (and that draws strength from experiences in relational database management or high-performance parallel scientific computing) will serve this role, but this remains to be seen.
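Gilbert's three-loop example, and the blocked reorganization that repairs its memory behavior, can be written down directly. Note that the O(n³)-versus-O(n⁵) effect is about memory traffic in compiled code on real hardware and will not be visible in an interpreted toy like this; the sketch only shows that the blocked schedule performs the same arithmetic:

```python
def matmul_naive(A, B):
    """The three-loop algorithm: C(i,j) = sum over k of A(i,k)*B(k,j)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, b=2):
    """Same arithmetic reorganized over b-by-b blocks so each
    submatrix panel is reused while it is 'hot' (in cache)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        for k in range(kk, min(kk + b, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

A tuned BLAS-3 GEMM does essentially what `matmul_blocked` does, with block sizes chosen per cache level and with vectorized inner kernels.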

New Perspectives on Old Approaches to Networked Data

Although graphs and networks provide a popular way to model large-scale data, their use in statistical data analysis has had a long history. Describing recent developments in a broader historical context was the subject of tutorials by Peter Bickel of the University of California at Berkeley and Sebastiano Vigna of the Università degli Studi di Milano.

In his tutorial on “Statistical Inference for Networks,” Bickel described a nonparametric statistical framework for the analysis of clustering structure in unlabeled networks, as well as for parametric network models more generally. As background, recall the basic Erdős–Rényi (ER) random graph model: given n vertices, connect each pair of vertices with probability p. If p ≫ log(n)/n, such graphs are “dense” and fairly regular—due to the high-dimensional phenomenon of measure concentration, such graphs are fully connected; they are expanders (i.e., there do not exist any good cuts, or partitions, of them into two or more pieces); and the empirically observed degrees are very close to their mean. On the other hand, for the much less well-studied regime 1/n < p < log(n)/n, these graphs are very sparse and very irregular—such graphs are not even fully connected; and when considering just the giant component, there are many small but deep cuts, and empirically observed degrees can be much larger than their mean. This lack of large-scale regularity is also seen in “power law” generalizations of the basic ER model; its signatures are seen empirically in a wide range of very large social and information networks; and it renders traditional methods of statistical inference of limited usefulness for these very large real-world networks.

Bickel considered a class of models applicable to both the dense/regular and sparse/irregular regimes, but for which the assumption of statistical exchangeability holds for the nodes. This exchangeability assumption provides a regularity such that any undirected random graph whose vertices are exchangeable can be written as a mixture of “simple” graphs that can be parametrized by a function h(·, ·) of two arguments. Popular stochastic blockmodels are examples of parametric models which approximate this class of nonparametric models—the blockmodel with K classes is a simple exchangeable graph model, and blockmodels can be used to approximate a general function h. In this framework, Bickel considered questions of identifiability and consistency; and he showed that, under assumptions such as that the expected degree is sufficiently high, it is possible to recover “ground truth” clusters in this model.
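The two ER regimes Bickel contrasted are easy to see in simulation. A small sketch (the parameter choices are mine) that samples G(n, p) on either side of the log(n)/n connectivity threshold and checks connectivity with a union-find:

```python
import math
import random

def er_is_connected(n, p, rng):
    """Sample G(n, p) and test connectivity via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    comps = n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
                    comps -= 1
    return comps == 1

rng = random.Random(0)
n = 200
dense = er_is_connected(n, 3 * math.log(n) / n, rng)   # p well above log(n)/n
sparse = er_is_connected(n, 0.5 / n, rng)              # p well below log(n)/n
```

With p a few multiples of log(n)/n the sampled graph is connected with overwhelming probability, while below the threshold isolated vertices appear almost surely and connectivity fails.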

Sebastiano Vigna provided a tutorial on “Spectral Ranking,” a general umbrella name for techniques that apply the theory of linear functions, e.g., eigenvalues and eigenvectors, to matrices that do not represent geometric transformations but instead represent some other kind of relationship between entities. For example, the matrix M may be the adjacency matrix of a graph or network, where the entries of M represent some sort of binary relations between entities. In this case, a common goal is to use this information to obtain a meaningful ranking of the entities; and a common difficulty is that the matrix M may contain “contradictory” information—e.g., i likes j, and j likes k, but i does not like k; or i is better than j, j is better than k, but i is not better than k.

A variant of this was considered by J.R. Seeley who, in an effort to rank children back in 1949, argued that the rank of a child should be defined recursively as the sum of the ranks of the children that like him. In modern terminology, this led to the computation of a dominant left eigenvector of M (normalized by row to get a stochastic matrix). A dual variant was considered by T.H. Wei who, in 1952, wanted to rank sports teams and argued that the score of a team should be related to the sum of the scores of the teams it defeated. This led to the computation of a dominant right eigenvector of M (with no normalization). Since then, numerous

17

Page 18: A Word from our 2010 Section Chairs - JHU Center for ...parky/CEP-Publications/scgn-21-2.pdfA Word from our 2010 Section Chairs . . . . . 1 ... In case you were unlucky or missed the

domain-specific considerations led researchers to propose methods that, in retrospect, are variants of this basic framework. For example, in 1953, L. Katz was interested in whether individual i endorses or votes for individual j, and he argued that the importance of i depends on not just the number of voters, but on the number of the voters’ voters, etc., with a suitable attenuation α at each step. Since, if M is a zero/one matrix representing a directed graph, the (i, j) entry of M^k contains the number of directed paths from i to j, he was led to compute 1 Σ_{n=0}^∞ α^n M^n = 1(I − αM)^{−1}, where 1 is the all-ones row vector. Similarly, in 1965, C.H. Hubbell was interested in a form of clustering used by sociologists known as clique identification. He argued that one can define a status index r by using the recursive equation r = v + rM, where v is a “boundary condition” or “initial preference,” and this led him to compute v Σ_{n=0}^∞ M^n = v(I − M)^{−1}.

From this broader perspective, the popular

PageRank is the damped spectral ranking of the normalized adjacency matrix of the web graph; the boundary condition is the so-called preference vector; and this vector can be used for various generalizations, such as to bias PageRank with respect to a topic or to generate trust scores. Remarkably, although PageRank is one of the most talked-about algorithms ever, there is no reproducible scientific proof that it works for the problem of ranking web pages, there is a large body of empirical evidence that it does not work, and it is likely to be of minuscule importance in today’s ranking algorithms. Nevertheless, partly because the basic ideas underlying spectral ranking are so intuitive, there are “gazillions” of small variants that could be (and are still being) introduced regularly in many areas of machine learning and data analysis. Unfortunately, this is often without reproducible scientific justification or careful evaluation of which variants are meaningful or useful.
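Katz's attenuated path-counting and its closed form can be checked numerically: truncating the series 1 Σ α^n M^n reproduces 1(I − αM)^{-1} whenever α is below the reciprocal of the spectral radius of M. A sketch on a made-up endorsement graph (the matrix and α are mine, for illustration):

```python
import numpy as np

# A small directed "endorsement" graph: M[i, j] = 1 if i votes for j
M = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

alpha = 0.2   # attenuation; must be below 1 / spectral_radius(M)
ones = np.ones(M.shape[0])

# Truncated series: 1 * sum_{n=0}^{100} alpha^n M^n, as a row vector
score_series = ones.copy()
term = ones.copy()
for _ in range(100):
    term = alpha * (term @ M)       # next term alpha^n * 1 M^n
    score_series = score_series + term

# Closed form from the geometric series: 1 (I - alpha M)^{-1}
score_closed = ones @ np.linalg.inv(np.eye(4) - alpha * M)
```

Vertex 2, which receives endorsements from three of the four vertices, ends up with the highest score, matching the intuition that the index counts attenuated incoming paths.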

Matrix Computations in Data Applications

Challenges and tradeoffs in performing matrix computations in MMDS applications were the subject of the final pair of tutorials—one by Piotr Indyk of the Massachusetts Institute of Technology, and one by Petros Drineas of Rensselaer Polytechnic Institute.

Indyk discussed recent developments in “Sparse Recovery Using Sparse Matrices.” This problem arises when the data can be modeled by a vector x that is sparse in some (often unknown) basis; and it has received attention recently in areas such as compressive sensing, data stream computing, and combinatorial group testing. Traditional approaches first capture the entire signal and then process it for compression, transmission, or storage. Alternatively, one can obtain a succinct approximate representation by acquiring a small number of linear measurements of the signal. That is, if x is an n-vector, the representation is Ax, for some m × n matrix A. Although typically m ≪ n, the matrix A can be constructed such that one can use a recovery algorithm to obtain a sparse approximation to x. It is often useful (and sometimes crucial) that the measurement matrix A be sparse, in that it contains very few non-zero elements per column. For example, sparsity can be exploited computationally—one can compute the product Ax very quickly if A is sparse. Similarly, in data stream processing, the time needed to update the sketch Ax under an update δᵢ is proportional to the number of non-zero elements in the i-th column of A.

Indyk described tradeoffs that arise when designing recovery schemes to satisfy the tricriterion of short sketches, low algorithmic complexity, and strong recovery guarantees. Randomization has proved to be an important computational resource, and thus a key issue has been to identify properties that hold for very sparse random matrices and also are sufficient to support efficient and accurate recovery algorithms. A key challenge is that, whereas dense random matrices are fairly homogeneous (e.g., since measure concentrates, their eigenvalues follow Wigner’s semicircle law), very sparse random matrices are much less regular. One can say that a matrix A satisfies the RIP(p, k, ε) property if

(1 − ε)‖x‖ₚ ≤ ‖Ax‖ₚ ≤ ‖x‖ₚ

holds for any k-sparse vector x. (This generalizes the well-known Restricted Isometry Property from p = 2 to general p.) Although very sparse matrices cannot satisfy the RIP(2, k, ε) property unless k or ε is rather large, Indyk showed that the adjacency matrices of constant-degree expander graphs do satisfy this property for p = 1 and that several previous algorithms generalize to very sparse matrices if this condition is satisfied.
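The column-sparsity argument for streaming updates is easy to make concrete: store A by columns, and an update to coordinate i touches only that column's non-zeros. A toy sketch (the data structure and matrix are mine, not from the talk):

```python
class SparseSketch:
    """Maintain the sketch y = A x under streaming updates
    x_i += delta, storing the m-by-n measurement matrix A by
    columns so each update touches only one column's non-zeros."""

    def __init__(self, m, columns):
        self.y = [0.0] * m          # the sketch A x, with x starting at 0
        self.columns = columns      # columns[i] = list of (row, value)

    def update(self, i, delta):
        # cost: O(nnz(column i)), not O(m)
        for row, value in self.columns[i]:
            self.y[row] += value * delta

# A toy 3-by-4 sparse A with two non-zeros per column
cols = {0: [(0, 1.0), (2, 1.0)],
        1: [(1, 1.0), (2, -1.0)],
        2: [(0, -1.0), (1, 1.0)],
        3: [(1, 1.0), (2, 1.0)]}
sk = SparseSketch(3, cols)
sk.update(0, 2.0)    # x is now (2, 0, 0, 0)
sk.update(2, 1.0)    # x is now (2, 0, 1, 0)
```

After the two updates the sketch equals A x for the accumulated x, yet neither update cost more than two additions; this is exactly the data-stream advantage of sparse measurement matrices.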

In his tutorial on “Randomized Algorithms in Linear Algebra and Large Data Applications,” Petros Drineas used his work on DNA single-nucleotide polymorphisms (SNPs) to illustrate the uses of randomized matrix algorithms in data analysis. SNPs are sites in the human genome where a nonnegligible fraction of the population has one allele and a nonnegligible fraction has a second allele. Thus, they are of interest in population genetics and personalized medicine. In addition, they can be naturally represented as a {−1, 0, +1} matrix A, where Aᵢⱼ represents whether the i-th individual is homozygous for the major allele, heterozygous, or homozygous for the minor allele.

While some SNP data sets are rather small, data consisting of thousands or more individuals typed at hundreds of thousands of SNPs are increasingly common. Size is an issue, since even getting off-the-shelf SVD and QR decomposition code to run on dense matrices of size, say, 5000 × 500,000 is nontrivial on commodity laptops. The challenge is especially daunting if the computations need to be performed thousands of times in the course of a cross-validation experiment. Perhaps less obvious is the issue of interpretability—even if the data cluster well in the span of the top k “eigenSNPs,” these eigenSNPs cannot be assayed in the lab and they cannot be easily thought about. Thus, while eigenvector-based methods for dimensionality reduction are popular among data analysts, the geneticists were more interested in the k actual SNPs that were most important.

Drineas described how to address these two challenges—the “challenge of size” and the “challenge of interpretability”—in a unified manner. He described a randomized approximation algorithm for choosing the best set of exactly k columns from an arbitrary matrix. The key structural insight was to choose columns according to an importance sampling distribution proportional to the diagonal elements of the projection matrix onto the span of the top k right singular vectors. These quantities can be computed exactly by computing a basis for that space, or they can be approximated more rapidly with more sophisticated methods. Importantly for interpretability, these quantities are the diagonal elements of the so-called “hat matrix,” and thus they have a natural interpretation in terms of statistical leverage and regression diagnostics. Importantly for size and speed, Hadamard-based random projections approximately uniformize these scores, washing out interesting structure and providing a basis where simple uniform sampling performs well. This has led in recent years to fast high-quality numerical implementations of these and related randomized algorithms.

Conclusions and Future Directions

In addition to these tutorial presentations, MMDS participants heard about and discussed a wide range of theoretical and practical issues having to do with algorithm development and the challenges of working with modern massive data sets. As with previous MMDS meetings, the presentations from all speakers can be found at the conference website, http://mmds.stanford.edu; and as with previous MMDS meetings, participant feedback made it clear that there is a lot of interest in MMDS as a developing research area at the interface between computer science, statistics, applied mathematics, and scientific and Internet data applications. So keep an eye out for future MMDSs!

Acknowledgments

I am grateful to the numerous individuals who provided assistance prior to and during MMDS 2010; to my co-organizers Alex Shkolnik, Petros Drineas, Lek-Heng Lim, and Gunnar Carlsson; and to each of the speakers, poster presenters, and other participants, without whom MMDS 2010 would not have been such a success.

Michael Mahoney ([email protected]) is in the Department of Mathematics at Stanford University. His research interests include algorithmic and statistical aspects of large-scale data analysis, including randomized algorithms for very large linear algebra problems and graph algorithms for analytics on very large informatics graphs.

Bibliography

[1] G.H. Golub, M.W. Mahoney, P. Drineas, and L.-H. Lim, “Bridging the gap between numerical linear algebra, theoretical computer science, and data applications,” SIAM News, 39, no. 8, 2006.


[2] M.W. Mahoney, L.-H. Lim, and G.E. Carlsson, “Algorithmic and Statistical Challenges in Modern Large-Scale Data Analysis are the Focus of MMDS 2008,” SIGKDD Explorations, 10, no. 2, pp. 57–60, 2008.

Michael W. Mahoney
Dept. of Mathematics
Stanford University
[email protected]


Section Officers

Statistical Computing Section Officers 2010

Luke Tierney, Chair
[email protected]
(319) 335-3386

Jose C. Pinheiro, Past Chair
[email protected]
(908) 927-5204

Richard M. Heiberger, Chair-Elect
[email protected]
(215) 808-1808

Usha S. Govindarajulu, Secretary/Treasurer
[email protected]
(617) 525-1237

Montserrat Fuentes, COMP/COS Rep.
[email protected]
(919) 515-1921

Thomas Lumley, Program Chair
[email protected]
+64 9 373 7599 ext 83785

David J. Poole, Program Chair-Elect
[email protected]
(973) 360-7337

Barbara A. Bailey, Publications Officer
[email protected]
(619) 594-4170

Jane Lea Harvill, Computing Section Rep.
[email protected]
(254) 710-1517

Donna F. Stroup (see Statistical Graphics list)
Monica D. Clark (see Statistical Graphics list)

Nicholas Lewin-Koh, Newsletter Editor
[email protected]

Statistical Graphics Section Officers 2010

Simon Urbanek, Chair
[email protected]
(973) 360-7056

Antony Unwin, Past Chair
[email protected]
+49-821-598-2218

Juergen Symanzik, Chair-Elect
[email protected]
(435) 797-0696

Rick Wicklin, Secretary/Treasurer
[email protected]
(919) 531-6629

Peter Craigmile, GRPH COS Rep.
[email protected]
(614) 688-3634

Mark Greenwood, GRPH COS Rep.
[email protected]
(406) 994-1962

Heike Hofmann, Program Chair
[email protected]
(515) 294-8948

Webster West, Program Chair-Elect
[email protected]
(803) 351-5087

Donna F. Stroup, Council of Sections
[email protected]
(404) 218-0841

Monica D. Clark, ASA Staff Liaison
[email protected]
(703) 684-1221

Martin Theus, Newsletter Editor
[email protected]


The Statistical Computing & Statistical Graphics Newsletter is a publication of the Statistical Computing and Statistical Graphics Sections of the ASA. All communications regarding the publication should be addressed to:

Nicholas Lewin-Koh
Editor, Statistical Computing
[email protected]

Martin Theus
Editor, Statistical Graphics
[email protected]

All communications regarding ASA membership and the Statistical Computing and Statistical Graphics Sections, including change of address, should be sent to:

American Statistical Association
1429 Duke Street
Alexandria, VA 22314-3402 USA
TEL (703) 684-1221
FAX (703) [email protected]
