Data Visualization Brian D. Ripley Professor of Applied Statistics University of Oxford [email protected] http://www.stats.ox.ac.uk/ripley

Data Visualization - Oxford Statisticsripley/DataVisualization/DataVis.pdf · 2007-02-01 · ‘Data Visualization’ Data visualization is the art of looking at data in ’high’

Data Visualization

Brian D. Ripley

Professor of Applied Statistics
University of Oxford

[email protected]

http://www.stats.ox.ac.uk/~ripley

‘Data Visualization’

Data visualization is the art of looking at data in ‘high’ (3 or more) dimensions.

The lecture will major on interactive/dynamic ways to do so such as brushing, (directed) grand tours, projection pursuit and multidimensional scaling, with demonstrations using R and GGobi.

Several recent/forthcoming books, such as:

Cook, D. and Swayne, D. F. (2007?) Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi.

Unwin, A., Theus, M. and Hoffmann, H. (2006) Graphics of Large Datasets. Visualizing a Million. Springer.

Young, F. W., Valero-Mora, P. M. and Friendly, M. (2006) Visual Statistics: Seeing Data with Dynamic Interactive Graphics. Wiley.

Why Now?

• Because we can.

Actually it is not that recent, and Cleveland (1993) is about Visualizing Data, and I heard about most of the ideas in the 1980s. But now everyone can afford to do it.

• ‘Graphics for the video-game generation’. (Ross Ihaka, R wishlist ca 1998.)

• Data are increasingly being collected automatically, and in areas like data mining they are often very high-dimensional.

Prime example: genomics.

• Usable software is becoming available.

As the MSc class will find out in an assessed practical tomorrow.

A Brief History

From Young et al (2006):

1600–1699 Measurement and Theory

1700–1799 New Graphics Forms and Data

1800–1899 Modern Graphics and the Golden Age

1900–1950 The Dark Ages of Statistical Graphics: The Golden Age of Mathematical Statistics

1950–1975 Rebirth of Statistical Graphics [Tukey’s Exploratory Data Analysis.]

1975–2000 Statistical Graphics comes of Age.

[Apparently developments stopped then!]

Visualizing What?

• Three or more continuous variables.

• Contingency tables (mosaic plots, correspondence analysis).

• Mixed types of variables.

• Patterns of missingness.

• Imputations.

Two of the books mentioned have chapters on missing data.

Three or More Continuous Variables

Human beings are quite proficient in seeing in 2.5 dimensions. We don’t really do this by stereoscopic vision, but more by

• Perspective.

• Shading / lighting.

• Texture.

• Motion.

The ways we have to visualize three or more continuous variables are almost all by two- (or occasionally three-) dimensional ‘windows’ on a high-dimensional point cloud. But there are some others, e.g. via glyphs.

What are we looking for?

• Multivariate outliers.

• Subgroups (clusters).

• Gradations.

[Figure: one face per US state, Alabama to Wyoming.]

Chernoff faces plot of the state.x77 dataset, from S-PLUS.

[Figure: one face per US state, Alabama to Wyoming.]

Chernoff faces plot of the state.x77 dataset, from the R TeachingDemos package (faces).

[Figure: one face per US state, Alabama to Wyoming.]

Chernoff faces plot of the state.x77 dataset, from the R TeachingDemos package (faces2).

[Figure: one star per US state, Alabama to Wyoming. Key: Frost, Life Exp, HS Grad, Income, Murder, Illiteracy.]

R version of stars plot of the state.x77 dataset.

Three Ways to Use Windows on Point Clouds

1. Projections of random/guided rotations, with motion and perhaps depth cues. Example: exploratory projection pursuit.

2. Parallel axes, so-called parallel coordinate plots. Think of this as multiple 1D views.

3. Multiple low-dim views.

[Categorization by Young et al (2006).]

[Figure: axes Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost, Area.]

Parallel coordinate plot of the state.x77 dataset.

[Figure: axes Frost, Life Exp, HS Grad, Income, Murder, Illiteracy.]

A better parallel coordinate plot of the state.x77 dataset. I re-ordered the variables and flipped some signs.
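
A parallel coordinate plot needs every variable on a common vertical scale before the axes can be drawn side by side. The sketch below shows one way to do that, scaling by range, with an optional sign flip. The function name `scale_to_range`, the `flip` argument and the toy numbers are illustrative assumptions, not the code used for the slides.

```python
def scale_to_range(columns, flip=()):
    """Rescale each variable to [0, 1]; variables named in `flip` are reversed."""
    scaled = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        if span == 0:
            # A constant variable carries no information; park it mid-axis.
            unit = [0.5 for _ in values]
        else:
            unit = [(v - lo) / span for v in values]
        if name in flip:
            # Flipping the sign of a variable is a reflection of its axis.
            unit = [1.0 - u for u in unit]
        scaled[name] = unit
    return scaled


# Made-up values for three cases (NOT taken from state.x77).
toy = {"Income": [3000.0, 4000.0, 5000.0], "Murder": [2.0, 10.0, 6.0]}
scaled = scale_to_range(toy, flip=("Murder",))
```

Each case is then drawn as a polyline through its scaled values; re-ordering the variables is just a matter of choosing the order in which the axes are laid out.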

Interacting with Plots

Parallel coordinate plots rapidly become unusable with large numbers of cases or variables. We can make some progress by highlighting cases or groups of cases.

• Colour/glyph for cases/groups, transiently or permanently.

• Shadow cases

• Unshadow cases

• Identify

The first three are done by brushing, often linked to some other display.

The idea of linking as in parcoord plots can be done in other multiple views too, especially dynamically (brushing) and by colour or glyph type.
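
The mechanics of linked brushing are simple: a brush in one view selects a set of case indices, and every other view colours the same cases. A minimal sketch follows, with hypothetical helper names `brush` and `colours` and made-up coordinates. It is not how GGobi implements brushing.

```python
def brush(xs, ys, x_range, y_range):
    """Indices of the cases whose (x, y) point lies inside the brush rectangle."""
    (x0, x1), (y0, y1) = x_range, y_range
    return {
        i
        for i, (x, y) in enumerate(zip(xs, ys))
        if x0 <= x <= x1 and y0 <= y <= y1
    }


def colours(n_cases, selected, on="red", off="grey"):
    """Colour vector for ANY linked view: it depends only on the case index."""
    return [on if i in selected else off for i in range(n_cases)]


# Made-up measurements for four cases (not the real crabs data).
fl = [8.1, 12.3, 15.0, 20.2]
rw = [6.7, 9.9, 11.5, 14.8]

selected = brush(fl, rw, x_range=(10.0, 16.0), y_range=(9.0, 12.0))
linked = colours(len(fl), selected)
```

The linking lives entirely in the shared case index, which is why it works across scatterplots, parallel coordinate plots and any other view of the same cases.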

Multiple Low-Dimensional Views

The classic examples are 1D (parallel coordinate plots) and scatterplot matrices. We can link the latter by glyph, by colour, or dynamically.

Leptograpsus variegatus Crabs

200 crabs from Western Australia. Two colour forms, blue and orange; collected 50 of each form of each sex. Are the colour forms species?

Measurements of carapace (shell) length CL and width CW, the size of the frontal lobe FL, rear width RW and body depth BD.

[Figure: scatterplot matrix of FL, RW, CL, CW and BD. Legend: blue male, blue female, orange male, orange female.]

Transformations

Do not forget that your data may need transformation even for visualization.

• Univariate transformations, of the types Tukey promoted in EDA (and which go back a long way).

Also scaling to a common visual scale (as parcoords did): by range, mean/variance, median/IQR, ...

• Multivariate transformations.

– Removal of correlation: one use of principal components. (May need to do this robustly.)

– Sphering, a multivariate rescaling to common scale. Most commonly done by changing to principal components and scaling each to unit variance.
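
Sphering means finding a linear map under which the sample covariance matrix becomes the identity. The slide's route is via principal components. The sketch below deliberately swaps in a Cholesky factor of the covariance matrix instead, purely because it needs no eigen-solver. The result differs from the principal-component version by a rotation but has the same identity covariance. All function names and the toy data are assumptions for illustration.

```python
import math


def covariance(rows):
    """Return (centred rows, sample covariance matrix with divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    return centred, cov


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a positive-definite matrix a."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def sphere(rows):
    """Map each centred case c to z solving L z = c, so that cov(z) = I."""
    centred, cov = covariance(rows)
    low = cholesky(cov)
    out = []
    for c in centred:
        z = []
        for i in range(len(c)):  # forward substitution
            z.append((c[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
        out.append(z)
    return out


# Made-up, correlated two-variable data.
toy = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
sphered = sphere(toy)
```

This uses the ordinary covariance matrix, so it inherits its sensitivity to outliers. A robust covariance estimate could be substituted in `covariance`.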

[Figure: scatterplot matrix of Comp. 1, Comp. 2 and Comp. 3. Legend: blue male, blue female, orange male, orange female.]

First three principal components on log scale.

Grand Tours

Now our first idea, of a moving 2D window into the point cloud (technically, an orthogonal projection).

In Daniel Asimov’s grand tour one chooses a random projection (a random rotation of the point cloud), and rotates towards it along the geodesic (the kD analogue of the great circle route).

Motion helps us in several ways:

• Front vs back.

• Outliers move at different speeds and directions from the bulk of the data, and will (perhaps briefly) appear at the periphery of a view.

• Groups move together.

It can be helpful to add trails, as in what Young calls an orbitplot.
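
One frame of such a tour is an orthogonal projection onto a randomly chosen 2D plane. The sketch below draws a random orthonormal pair of directions in k dimensions and projects the cases onto it. It covers only the choice of a target view, not the geodesic interpolation between successive views. The names `random_frame` and `project` are hypothetical.

```python
import math
import random


def random_frame(k, rng):
    """Two orthonormal k-vectors: Gram-Schmidt applied to two Gaussian vectors."""
    a = [rng.gauss(0.0, 1.0) for _ in range(k)]
    b = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm_a = math.sqrt(sum(v * v for v in a))
    a = [v / norm_a for v in a]
    # Remove from b its component along a, then normalise what is left.
    dot = sum(x * y for x, y in zip(a, b))
    b = [y - dot * x for x, y in zip(a, b)]
    norm_b = math.sqrt(sum(v * v for v in b))
    b = [v / norm_b for v in b]
    return a, b


def project(rows, frame):
    """Orthogonal projection of each k-dimensional case onto the 2D frame."""
    a, b = frame
    return [
        (sum(x * u for x, u in zip(r, a)), sum(x * u for x, u in zip(r, b)))
        for r in rows
    ]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
frame = random_frame(5, rng)
cases = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 1.0, 0.0]]
view = project(cases, frame)
```

A tour would generate a sequence of such frames and animate the projected points while moving smoothly from each frame to the next.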

Guided Tours

There are far too many possible views in k ≥ 5 dimensions to have a chance of coming close to an ‘interesting’ view and spotting it.

We need some guidance, and that is the ‘pursuit’ of (exploratory) projection pursuit.

Choose an index of ‘interestingness’, and optimize the views to high values of the index. (Lots of research ideas in the 1980s.) These are normally applied to sphered data, so random views look like samples from the standard bivariate normal distribution, and ‘interesting’ means ‘non-normal’ (but remember the Anna Karenina principle).

• ‘Holes’: look for views with relatively few points in the centre. Tends to find clusters.

• ‘Central mass’: look for relatively many points in the centre. Tends to find multivariate outliers.
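
To make the two indices concrete, here is a sketch of one common way to define them for a projected view of sphered data: average a Gaussian kernel exp(-|y|²/2) over the projected points and rescale. The exact formulas below are written from memory of the form used in the GGobi literature and should be treated as an assumption. The slides do not give a definition.

```python
import math


def mean_kernel(view):
    """Average of exp(-|y|^2 / 2) over the projected points y."""
    return sum(math.exp(-0.5 * sum(c * c for c in y)) for y in view) / len(view)


def holes_index(view):
    """Large when few points sit near the centre of the view."""
    p = len(view[0])  # dimension of the projection, 2 for a scatterplot
    return (1.0 - mean_kernel(view)) / (1.0 - math.exp(-p / 2.0))


def central_mass_index(view):
    """Large when many points sit near the centre of the view."""
    p = len(view[0])
    return (mean_kernel(view) - math.exp(-p / 2.0)) / (1.0 - math.exp(-p / 2.0))


# Two made-up views: an empty-centred ring and a tight central blob.
ring = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
blob = [(0.0, 0.0)] * 4

ring_holes = holes_index(ring)
blob_holes = holes_index(blob)
blob_mass = central_mass_index(blob)
```

With these definitions the two indices always sum to one, so maximizing one is the same as minimizing the other. A guided tour would steer the projection towards higher values of whichever index is chosen.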

The Anna Karenina principle

Happy families are all alike; every unhappy family is unhappy in its own way.

‘With this dramatic sentence, Leo Tolstoy begins his famous novel Anna Karenina about the struggles of multiple, interconnected families to find happiness.’

[Figure: four panels, (a) to (d).]

Projections of the Leptograpsus crabs data found by projection pursuit. View (a) is a random projection. View (b) was found using the natural Hermite index, view (c) by the Friedman–Tukey index and view (d) by Friedman’s (1987) index.

Note that we do not need to use projections: we could ‘squeeze’ the point cloud to 2D (or 1D, seriation, or 3D), known to psychologists as multidimensional scaling.

Multidimensional Scaling

Aim is to represent distances between points well.

Suppose we have distances $(d_{ij})$ between all pairs of $n$ points, or a dissimilarity matrix. Classical MDS plots the first $k$ principal components, and minimizes

\[ \sum_{i \ne j} \left( d_{ij}^2 - \tilde d_{ij}^2 \right) \]

where $(\tilde d_{ij})$ are the Euclidean distances in the $k$D space.

Shepard and Kruskal (1962–4) proposed only to preserve the ordering of distances, minimizing

\[ \mathrm{STRESS}^2 = \frac{\sum_{i \ne j} \left[ \theta(d_{ij}) - \tilde d_{ij} \right]^2}{\sum_{i \ne j} \tilde d_{ij}^2} \]

over both the configuration of points and an increasing function $\theta$.

The optimization task is quite difficult and this can be slow.
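
The STRESS criterion is easy to evaluate for any one candidate configuration. The hard part is the search over configurations and over θ. The sketch below computes STRESS for a fixed configuration and a fixed θ, defaulting to the identity, which is a simplifying assumption: a real order-preserving fit would re-estimate θ by isotonic regression at each step. The function name `stress` and the toy points are illustrative.

```python
import math


def stress(dissim, config, theta=lambda d: d):
    """STRESS of a configuration for given dissimilarities and a FIXED theta."""
    n = len(config)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                fitted = math.dist(config[i], config[j])  # Euclidean distance
                num += (theta(dissim[i][j]) - fitted) ** 2
                den += fitted ** 2
    return math.sqrt(num / den)


# Three made-up points forming a 3-4-5 triangle, and their exact distances.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
dissim = [[math.dist(p, q) for q in points] for p in points]

exact = stress(dissim, points)  # configuration reproduces the distances
shrunk = stress(dissim, [(x / 2, y / 2) for x, y in points])
```

A configuration that reproduces the dissimilarities exactly has zero STRESS, while the half-size copy does not, because with θ fixed at the identity every fitted distance is too small.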

[Figure: legend blue male, blue female, orange male, orange female.]

An order-preserving MDS plot of the (raw) crabs data.

[Figure: legend blue male, blue female, orange male, orange female.]

After re-scaling to (approximately) constant carapace area.

Now for GGobi demonstrations

A Forensic Example

Data on 214 fragments of glass collected at scenes of crimes. Each has a measured refractive index and composition (weight percent of oxides of Na, Mg, Al, Si, K, Ca, Ba and Fe).

Grouped as window float glass (70), window non-float glass (76), vehicle window glass (17) and other (containers, tableware, headlamps) (22).

[Figure: one panel per glass type (WinF, WinNF, Veh, Con, Tabl, Head), each with rows RI, Na, Mg, Al, Si, K, Ca, Ba and Fe on a common scale from -4 to 8.]

Strip plot by type of glass.

[Figure: one panel per analyte (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe), each with rows WinF, WinNF, Veh, Con, Tabl and Head.]

Strip plot by type of analyte.

[Figure: legend WinF, WinNF, Veh, Con, Tabl, Head.]

Isotonic multidimensional scaling representation.