4
Barcodes, Generalized Pairs Plots, and Sparkmats Jay Emerson, Walton Green, and John Hartigan,Yale University June 17, 2006, UseR! Vienna, Austria The barcode plot aids in comparing distributions. It shares some of the characteristics of side-by-side histograms or boxplots, and of rugs or stripplots. We have found it particularly useful with clumped data, when other methods obscure detail. As an example, we display the 2003 Tour de France stage results. Flat stages (Nevers and Lyon) are easily identified by the tallest spikes (finishers in the peleton arriving together). In contrast, mountain stages (Gap and L’Alpe d’Huez) stretch the distribution of finishing times. In some mountain stages (Gap and Morzine), trailing clumps of slower (exhausted) riders finish together (in a group called the autobus), avoiding elimination. Tour de France 2003 ParisPrologueTT 0 5000 10000 15000 20000 25000 Stage finishing times (seconds) Meaux Sedan StDizier StDizierTT Nevers Lyon Morzine AlpedHuez Gap Marseille Toulouse GapDecouverteITT Bonascre Loudenvielle LuzArdiden Bayonne Bordeaux StMaixentlEcole NantesITT Paris

Barcodes, Generalized Pairs Plots, and Sparkmatsjay/R/Vienna/Vienna.pdfNASA for a JSM poster competition, also show unexplained change points and truncation in pressure measurements

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Barcodes, Generalized Pairs Plots, and Sparkmatsjay/R/Vienna/Vienna.pdfNASA for a JSM poster competition, also show unexplained change points and truncation in pressure measurements

Barcodes, Generalized Pairs Plots, and Sparkmats

Jay Emerson, Walton Green, and John Hartigan,Yale University

June 17, 2006, UseR! Vienna, Austria

The barcode plot aids in comparing distributions. It shares some of the characteristics of side-by-side histograms or boxplots, and of rugs or stripplots. We have found it particularly useful with clumped data, when other methods obscure detail. As an example, we display the 2003 Tour de France stage results. Flat stages (Nevers and Lyon) are easily identified by the tallest spikes (finishers in the peleton arriving together). In contrast, mountain stages (Gap and L’Alpe d’Huez) stretch the distribution of finishing times. In some mountain stages (Gap and Morzine), trailing clumps of slower (exhausted) riders finish together (in a group called the autobus), avoiding elimination.

Tour de France 2003

ParisPrologueTT

0 5000 10000 15000 20000 25000

Stage finishing times (seconds)

Meaux

Sedan

StDizier

StDizierTT

Nevers

Lyon

Morzine

AlpedHuez

Gap

Marseille

Toulouse

GapDecouverteITT

Bonascre

Loudenvielle

LuzArdiden

Bayonne

Bordeaux

StMaixentlEcole

NantesITT

Paris

Page 2: Barcodes, Generalized Pairs Plots, and Sparkmatsjay/R/Vienna/Vienna.pdfNASA for a JSM poster competition, also show unexplained change points and truncation in pressure measurements

Traditional pairs plots (or scatterplot matrices) show relationships between quantitative variables, and package vcd provides similar plots for categorical data using mosaic tiles. Many data frames, however, include both quantitative and categorical variables, and the generalized pairs plot recognizes the fundamental importance of the relationships between different types of variables. Characteristics of a sample of properties in New Haven, CT, USA, illustrate the generalized pairs plot. Not all the properties are residential; outliers are evident, dominating the distribution of the data (for example, one property has more than 80 acres of land), and transformations are clearly needed. Some properties seem to have negative values of depreciation!

CurrentValue

50000 2e+05 20 40 60 80 0 2 4 6 8

RS

2

01e+0

72e

+07

3e+0

74e

+07

5e+0

76e

+07RS

1R

M2

RM

1O

ther

0000e+050000e+050000e+05

SquareFeet

Depreciation

-150-1

00-50

050

100

20406080 Acres

A/C

TRUE

FALSE

02468 Bedrooms

Bathrooms

02

46

1e+074e+07

RS2RS1RM2RM1Other

-150 -500 50FALSE TRUE

0 2 4 6

Zone

Page 3: Barcodes, Generalized Pairs Plots, and Sparkmatsjay/R/Vienna/Vienna.pdfNASA for a JSM poster competition, also show unexplained change points and truncation in pressure measurements

Here, we limit our attention to residential properties (removing many of the outliers). The current property value and square footage are logged, while the square root of property acreage is used. This latter choice retains condominiums without land in the plots (which would have been log(0)= -Inf using the log transformation). RS1-zoned properties are larger single-family homes in more affluent areas; not surprisingly, these homes have more bedrooms. The barcode plots show clumping of depreciation (most noticeable in the barcode of depreciation versus air-conditioning). According to the City of New Haven, about half of all residential properties have exactly 18% or 28% depreciation.

log(CurVal)

66.577.588.5 0.2 0.6 1 2 3 4 5 6 7

RS

2

910

1112

13

RS

1R

M2

RM

1O

ther

66.5

77.5

88.5 log(SqFeet)

Depreciation

020

4060

80

0.20.40.60.8

1sqrt(Acres)

A/C

TRUE

FALSE

234567 Bedrooms

Bathrooms1

23

45

6

9 1011 1213

RS2RS1RM2RM1Other

20 40 60 80FALSE TRUE

2 3 4 5 6

Zone

Page 4: Barcodes, Generalized Pairs Plots, and Sparkmatsjay/R/Vienna/Vienna.pdfNASA for a JSM poster competition, also show unexplained change points and truncation in pressure measurements

Finally, we present a new graphical tool called a sparkmat for spatially distributed time series. It generalizes Edward Tufte's sparklines for continuous variables distributed in both space and time. In the example below, we can identify certain peculiarities of three 6-year-long monthly time series of climate data on a 24 by 24 raster covering an area of South and Central America. Note the difference in the amplitude of temperature cycles (black) at different latitudes (less variability and warmer temperatures near the equator). The data, provided by NASA for a JSM poster competition, also show unexplained change points and truncation in pressure measurements (red).

-110 -100 -90 -80 -70 -60

-20

-10

0

10

20

30

Black = Temperature, Red = Pressure, Green = Low clouds

Contact: Jay Emerson ([email protected]) barcode() and gpairs() source, documentation, and examples:

http://www.stat.yale.edu/~jay/R/ViennaUseR/