Page 1:

Data Visualization (DSC 530/CIS 568)

Marks & Channels

Dr. David Koop

D. Koop, DSC 530, Spring 2019

Page 2:

d3.js

Data-Driven Documents

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3's emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

See more examples.

Page 3:

D3 Introduction
• Ogievetsky has put together a nice set of interactive examples that show off the major features of D3
• http://dakoop.github.io/IntroD3/
  - (Updated from original for D3 v5 with new joins)
• https://beta.observablehq.com/@dakoop/d3-intro
• Other references:
  - Murray's book on Interactive Data Visualization for the Web
  - The D3 website: d3js.org
  - Ros's slides on v4: https://iros.github.io/d3-v4-whats-new/

Page 4:

D3 Data Joins
• Two groups: data and visual elements
• Three parts of the join between them: enter, update, and exit
• enter: s.enter(), update: s, exit: s.exit() (see the sketch below)

[Diagram: data and visual elements linked by enter, update, and exit]
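To make the join concrete, here is a minimal sketch (not from the slides) of the enter/update/exit pattern in D3 v5; the data values and attribute choices are illustrative.

const data = [10, 20, 30, 40];
const svg = d3.select("svg");

// s is the update selection: existing circles already bound to data
const s = svg.selectAll("circle").data(data);

// enter: one new circle per datum that has no element yet
s.enter().append("circle")
    .attr("r", 5)
  .merge(s)                          // enter + update handled together
    .attr("cx", (d, i) => 40 * i + 20)
    .attr("cy", 50)
    .attr("fill", "steelblue");

// exit: elements left over when there are fewer data than circles
s.exit().remove();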

Page 5:

Assignment 2
• Create a stacked bar chart using three different tools: Tableau, Vega-Lite, and D3
• Due Monday, Feb. 25

Page 6:

D3 Examples
• Bar Chart:
  - Start: http://codepen.io/dakoop/pen/dNxjYL
  - Simple Solution: http://codepen.io/dakoop/pen/aJoLBp
• With Axes and Scales: http://codepen.io/dakoop/pen/WpeZOV
• With Objects and Margin Convention: http://codepen.io/dakoop/pen/MJNGwZ
• More on Margin Convention (see the sketch after this list):
  - https://bl.ocks.org/mbostock/3019563 (Note this is D3 v3!)
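As a rough guide to what the scales, axes, and margin-convention examples above set up, here is a minimal D3 v5 bar-chart sketch; the sizes and data values are made up for illustration.

const margin = {top: 20, right: 20, bottom: 30, left: 40};
const width = 480 - margin.left - margin.right;
const height = 300 - margin.top - margin.bottom;

// The outer svg uses the full size; the inner group is shifted by the margins,
// so everything drawn inside it lives in the plot area
const g = d3.select("body").append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform", `translate(${margin.left},${margin.top})`);

const data = [4, 8, 15, 16, 23, 42];

const x = d3.scaleBand().domain(d3.range(data.length)).range([0, width]).padding(0.1);
const y = d3.scaleLinear().domain([0, d3.max(data)]).range([height, 0]);

g.selectAll("rect").data(data).enter().append("rect")
    .attr("x", (d, i) => x(i))
    .attr("y", d => y(d))
    .attr("width", x.bandwidth())
    .attr("height", d => height - y(d))
    .attr("fill", "steelblue");

g.append("g").attr("transform", `translate(0,${height})`).call(d3.axisBottom(x));
g.append("g").call(d3.axisLeft(y));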


Page 7:

Visual Encoding
• How should we visualize this data?

Name          | Region                | Population | Life Expectancy | Income
China         | East Asia & Pacific   | 1335029250 | 73.28           | 7226.07
India         | South Asia            | 1140340245 | 64.01           | 2731
United States | America               | 306509345  | 79.43           | 41256.08
Indonesia     | East Asia & Pacific   | 228721000  | 71.17           | 3818.08
Brazil        | America               | 193806549  | 72.68           | 9569.78
Pakistan      | South Asia            | 176191165  | 66.84           | 2603
Bangladesh    | South Asia            | 156645463  | 66.56           | 1492
Nigeria       | Sub-Saharan Africa    | 141535316  | 48.17           | 2158.98
Japan         | East Asia & Pacific   | 127383472  | 82.98           | 29680.68
Mexico        | America               | 111209909  | 76.47           | 11250.37
Philippines   | East Asia & Pacific   | 94285619   | 72.1            | 3203.97
Vietnam       | East Asia & Pacific   | 86970762   | 74.7            | 2679.34
Germany       | Europe & Central Asia | 82338100   | 80.08           | 31191.15
Ethiopia      | Sub-Saharan Africa    | 79996293   | 55.69           | 812.16
Turkey        | Europe & Central Asia | 72626967   | 72.06           | 8040.78
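One way to encode this table (along the lines of the Gapminder views on the following pages) is a scatterplot with income on x, life expectancy on y, population mapped to circle area, and region mapped to color. A minimal D3 v5 sketch, assuming the table is available as an array named countries and that a group g with width and height has been set up as in the margin-convention example above:

// Hypothetical array built from the table above (remaining rows omitted)
const countries = [
  {name: "China", region: "East Asia & Pacific", population: 1335029250, lifeExp: 73.28, income: 7226.07},
  {name: "India", region: "South Asia", population: 1140340245, lifeExp: 64.01, income: 2731}
  // ...
];

const x = d3.scaleLog().domain([500, 50000]).range([0, width]);      // income
const y = d3.scaleLinear().domain([40, 90]).range([height, 0]);      // life expectancy
const r = d3.scaleSqrt().domain([0, 1.4e9]).range([0, 40]);          // population -> circle area
const color = d3.scaleOrdinal(d3.schemeCategory10);                  // region -> hue

g.selectAll("circle").data(countries).enter().append("circle")
    .attr("cx", d => x(d.income))
    .attr("cy", d => y(d.lifeExp))
    .attr("r", d => r(d.population))
    .attr("fill", d => color(d.region))
    .attr("opacity", 0.7);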

Page 8:

Potential Solution

[Gapminder screenshot: bubble chart with life expectancy (30-80) on the y-axis, bubbles sized by population and colored by world region, with a time axis spanning 1800-2015]

[Gapminder, Wealth & Health of Nations]

Page 9:

Another Solution

[Gapminder screenshot: an alternative view with bubble size mapped to total population and color mapped to world region]

[Gapminder, Wealth & Health of Nations]

Page 10:

What about change over years?

Page 11:

Another Solution showing trends over time

[Gapminder screenshot: income per person (GDP/capita, PPP$ inflation-adjusted) on a logarithmic axis (4000, 8000, 16k, 32k), with trails over time for the United States, Russia, China, and Nigeria, colored by world region]

[Gapminder, Wealth & Health of Nations]

Page 12:

Visual Encoding
• How do we encode data visually?
  - Marks are the basic graphical elements in a visualization
  - Channels are ways to control the appearance of the marks
• Marks classified by dimensionality: points, lines, and areas
• Also can have surfaces, volumes
• Think of marks as a mathematical definition, or if familiar with tools like Adobe Illustrator or Inkscape, the path & point definitions (a small SVG sketch follows below)
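For reference, the three mark types correspond directly to SVG primitives; a tiny illustrative snippet, assuming an svg selection as in the earlier examples (the coordinates are arbitrary, not from the slides):

// Point mark: only position is intrinsic, so size and shape are free to encode data
svg.append("circle").attr("cx", 50).attr("cy", 40).attr("r", 4);

// Line mark: already has a length, so length is not freely available as a channel
svg.append("line")
    .attr("x1", 10).attr("y1", 80)
    .attr("x2", 120).attr("y2", 80)
    .attr("stroke", "black");

// Area mark: a closed region whose size is fixed by its geometry (e.g., a map region)
svg.append("path").attr("d", "M10,120 L60,100 L110,130 Z").attr("fill", "lightgray");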

Page 13:

Bertin - Visual Variables

Page 14:

Visual Channels

[Figure: visual channels: position (horizontal, vertical, both), color, shape, tilt, and size (length, area, volume)]

[Munzner (ill. Maguire), 2014]

Page 15:

Visual Attributes Survey

[R. Brath]

Page 16:

More Visual Attributes

[R. Brath]

Page 17:

Channels
• Usually map an attribute to a single channel
  - Could use multiple channels but...
  - Limited number of channels
• Restrictions on size and shape
  - Points are nothing but location, so size and shape are ok
  - Lines have a length; cannot easily encode an attribute as length
  - Maps with boundaries have area; changing size can be problematic

Page 18:

Cartograms

[Election Results by Population, M. Newman, 2012]

Page 19:

Channel Types
• Identity => what or where; Magnitude => how much

Channels: Expressiveness Types and Effectiveness Ranks
Magnitude channels (ordered attributes), ranked from most to least effective:
  Position on common scale, Position on unaligned scale, Length (1D size), Tilt/angle, Area (2D size), Depth (3D position), Color luminance, Color saturation, Curvature, Volume (3D size)
Identity channels (categorical attributes), ranked from most to least effective:
  Spatial region, Color hue, Motion, Shape

[Munzner (ill. Maguire), 2014]

Page 20:

Mark Types
• Can have marks for items and links
  - Connection => pairwise relationship
  - Containment => hierarchical relationship

[Figure: marks as items/nodes (points, lines, areas) and marks as links (connection, containment)]

[Munzner (ill. Maguire), 2014]

Page 21:

Expressiveness and Effectiveness
• Expressiveness Principle: all data from the dataset, and nothing more, should be shown
  - Do encode ordered data in an ordered fashion
  - Don't encode categorical data in a way that implies an ordering (see the scale sketch below)
• Effectiveness Principle: the most important attributes should be the most salient
  - Saliency: how noticeable something is
  - How do the channels we have discussed measure up?
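One way to respect the expressiveness principle in D3 is to pick the scale type to match the attribute type; a small sketch (the names circles, d.income, and d.region are assumptions, not from the slides):

// Ordered attribute -> ordered channel: a sequential luminance/color ramp
const incomeColor = d3.scaleSequential(d3.interpolateBlues).domain([0, 45000]);

// Categorical attribute -> identity channel: distinct hues with no implied order
const regionColor = d3.scaleOrdinal(d3.schemeCategory10);

circles.attr("fill", d => incomeColor(d.income));     // ordered data, ordered encoding
// circles.attr("fill", d => regionColor(d.region));  // categorical data, categorical encoding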


Page 22:

Mackinlay's Ranking of Perceptual Tasks

[Mackinlay, 1986]

Page 23:

Iliinsky's Best Uses, +Ordering, +NumValues

Page 24:

How do we get these rankings?

Page 25:

Test % difference in length between elements

…esting set of perceptual tasks, we replicated Cleveland & McGill's [7] classic study (Exp. 1A) of proportionality estimates across spatial encodings (position, length, angle), and Stone & Bartram's [30] alpha contrast experiment (Exp. 2), involving transparency (luminance) adjustment of chart grid lines. Our second goal was to conduct additional experiments that demonstrate the use of Mechanical Turk for generating new insights. We studied rectangular area judgments (Exp. 1B), following the methodology of Cleveland & McGill to enable comparison, and then investigated optimal chart heights and gridline spacing (Exp. 3). Our third goal was to analyze data from across our experiments to characterize the use of Mechanical Turk as an experimental platform.

In the following four sections, we describe our experiments and focus on details specific to visualization. Results of a more general nature are visited in our performance and cost analysis; for example, we delay discussion of response time results. Our experiments were initially launched with a limited number of assignments (typically 3) to serve as a pilot. Upon completion of the trial assignments and verification of the results, the number of assignments was increased.

EXPERIMENT 1A: PROPORTIONAL JUDGMENT
We first replicated Cleveland & McGill's seminal study [7] on Mechanical Turk. Their study was among the first to rank visual variables empirically by their effectiveness for conveying quantitative values. It also has influenced the design of automated presentation techniques [21, 22] and been successfully extended by others (e.g., [36]). As such, it is a natural experiment to replicate to assess crowdsourcing.

Method
Seven judgment types, each corresponding to a visual encoding (such as position or angle), were tested. The first five correspond to Cleveland & McGill's original position-length experiment; types 1 through 3 use position encoding along a common scale (Figure 1), while 4 and 5 use length encoding. Type 6 uses angle (as a pie chart) and type 7 uses circular area (as a bubble chart, see Figure 2).

Ten charts were constructed at a resolution of 380×380 pixels, for a total of 70 trials (HITs). We mimicked the number, values and aesthetics of the original charts as closely as possible. For each chart, N=50 subjects were instructed first to identify the smaller of two marked values, and then "make a quick visual judgment" to estimate what percentage the smaller was of the larger. The first question served broadly to verify responses; only 14 out of 3,481 were incorrect (0.4%). Subjects were paid $0.05 per judgment.

To participate in the experiment, subjects first had to complete a qualification test consisting of two labeled example charts and three test charts. The test questions had the same format as the experiment trials, but with multiple choice rather than free text responses; only one choice was correct, while the others were grossly wrong. The qualification thus did not filter inaccurate subjects (which would bias the responses) but ensured that subjects understood the instructions. A pilot run of the experiment omitted this qualification and over 10% of the responses were unusable. We discuss this observation in more detail later in the paper.

Figure 1: Stimuli for judgment tasks T1, T2 & T3. Subjects estimated percent differences between elements.

Figure 2: Area judgment stimuli. Top left: Bubble chart (T7), Bottom left: Center-aligned rectangles (T8), Right: Treemap (T9).

In the original experiment, Cleveland & McGill gave each subject a packet with all fifty charts on individual sheets. Lengthy tasks are ill-suited to Mechanical Turk; they are more susceptible to "gaming" since the reward is higher, and subjects cannot save drafts, raising the possibility of lost data due to session timeout or connectivity error. We instead assigned each chart as an individual task. Since the vast majority (95%) of subjects accepted all tasks in sequence, the experiment adhered to the original within-subjects format.

Results
To analyze responses, we replicated Cleveland & McGill's data exploration, using their log absolute error measure of accuracy: log2(|judged percent - true percent| + 1/8). We first computed the midmeans of log absolute errors[1] for each chart (Figure 3). The new results are similar (though not identical) to the originals: the rough shape and ranking of judgment types by accuracy (T1-5) are preserved, supporting the validity of the crowdsourced study.

Next we computed the log absolute error means and 95% confidence intervals for each judgment type using bootstrapping (c.f., [7]). The ranking of types by accuracy is consistent between the two experiments (Figure 4). Types 1 and 2 are closer in the crowdsourced study; this may be a result of a smaller display mitigating the effect of distance. Types 4 and 5 are more accurate than in the original study, but position encoding still significantly outperformed length encoding.

We also introduced two new judgment types to evaluate angle and circular area encodings. Cleveland & McGill conducted a separate position-angle experiment; however, they used a different task format, making it difficult to compare…

[1] The midmean (the mean of the middle two quartiles) is a robust measure less susceptible to outliers. A log scale is used to measure relative proportional error and the 1/8 term is included to handle zero-valued differences.

[Heer & Bostock, 2010]
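To make the accuracy measure above concrete, here is a small sketch of the log absolute error and a simple midmean (the study's exact quartile handling may differ):

// Cleveland & McGill's log absolute error: log2(|judged percent - true percent| + 1/8)
const logAbsError = (judged, actual) => Math.log2(Math.abs(judged - actual) + 1 / 8);

// Midmean: the mean of the middle two quartiles of the sorted values
function midmean(values) {
  const v = values.slice().sort((a, b) => a - b);
  const mid = v.slice(Math.floor(v.length / 4), Math.ceil(3 * v.length / 4));
  return mid.reduce((sum, d) => sum + d, 0) / mid.length;
}

// Example: judging the smaller value as 40% of the larger when it is really 33%
logAbsError(40, 33);   // about 2.83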

Page 26:

Test % difference in length between elements

[Excerpt and stimuli repeated from Page 25]

[Heer & Bostock, 2010]

Answer: Left is ~5.6x longer than Right

Page 27:

Test % difference in length between elements

[Excerpt and stimuli repeated from Page 25]

[Heer & Bostock, 2010]

Page 28:

Test % difference in length between elements

[Excerpt and stimuli repeated from Page 25]

[Heer & Bostock, 2010]

Page 29:

Test % difference in length between elements

[Excerpt and stimuli repeated from Page 25]

[Modified from Heer & Bostock, 2010]

Page 30:

Test % difference in length between elements

[Excerpt and stimuli repeated from Page 25]

[Modified from Heer & Bostock, 2010]

Answer: Right is 4x larger than Left

Page 31:

Test % difference in area between elements

[Excerpt and stimuli repeated from Page 25]

[Heer & Bostock, 2010]

Page 32: Data Visualization (DSC 530/CIS 568)dkoop/dsc530-2019sp/lectures/lecture09.pdfD3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life

Test % difference in area between elements

�32

esting set of perceptual tasks, we replicated Cleveland &McGill’s [7] classic study (Exp. 1A) of proportionality es-timates across spatial encodings (position, length, angle),and Stone & Bartram’s [30] alpha contrast experiment (Exp.2), involving transparency (luminance) adjustment of chartgrid lines. Our second goal was to conduct additional ex-periments that demonstrate the use of Mechanical Turk forgenerating new insights. We studied rectangular area judg-ments (Exp. 1B), following the methodology of Cleveland &McGill to enable comparison, and then investigated optimalchart heights and gridline spacing (Exp. 3). Our third goalwas to analyze data from across our experiments to character-ize the use of Mechanical Turk as an experimental platform.

In the following four sections, we describe our experimentsand focus on details specific to visualization. Results of amore general nature are visited in our performance and costanalysis; for example, we delay discussion of response timeresults. Our experiments were initially launched with a lim-ited number of assignments (typically 3) to serve as a pilot.Upon completion of the trial assignments and verification ofthe results, the number of assignments was increased.

EXPERIMENT 1A: PROPORTIONAL JUDGMENTWe first replicated Cleveland & McGill’s seminal study [7]on Mechanical Turk. Their study was among the first to rankvisual variables empirically by their effectiveness for con-veying quantitative values. It also has influenced the designof automated presentation techniques [21, 22] and been suc-cessfully extended by others (e.g., [36]). As such, it is a nat-ural experiment to replicate to assess crowdsourcing.

MethodSeven judgment types, each corresponding to a visual en-coding (such as position or angle) were tested. The first fivecorrespond to Cleveland & McGill’s original position-lengthexperiment; types 1 through 3 use position encoding along acommon scale (Figure 1), while 4 and 5 use length encoding.Type 6 uses angle (as a pie chart) and type 7 uses circulararea (as a bubble chart, see Figure 2).

Ten charts were constructed at a resolution of 380⇥380 pix-els, for a total of 70 trials (HITs). We mimicked the number,values and aesthetics of the original charts as closely as pos-sible. For each chart, N=50 subjects were instructed first toidentify the smaller of two marked values, and then “makea quick visual judgment” to estimate what percentage thesmaller was of the larger. The first question served broadly toverify responses; only 14 out of 3,481 were incorrect (0.4%).Subjects were paid $0.05 per judgment.

To participate in the experiment, subjects first had to com-plete a qualification test consisting of two labeled examplecharts and three test charts. The test questions had the sameformat as the experiment trials, but with multiple choicerather than free text responses; only one choice was cor-rect, while the others were grossly wrong. The qualificationthus did not filter inaccurate subjects—which would bias theresponses—but ensured that subjects understood the instruc-tions. A pilot run of the experiment omitted this qualificationand over 10% of the responses were unusable. We discussthis observation in more detail later in the paper.

0

100

A B0

100

A B0

100

A B

Figure 1: Stimuli for judgment tasks T1, T2 & T3. Sub-jects estimated percent differences between elements.

A

B

B

A

A B

Figure 2: Area judgment stimuli. Top left: Bubblechart (T7), Bottom left: Center-aligned rectangles (T8),Right: Treemap (T9).

In the original experiment, Cleveland & McGill gave eachsubject a packet with all fifty charts on individual sheets.Lengthy tasks are ill-suited to Mechanical Turk; they aremore susceptible to “gaming” since the reward is higher, andsubjects cannot save drafts, raising the possibility of lost datadue to session timeout or connectivity error. We instead as-signed each chart as an individual task. Since the vast ma-jority (95%) of subjects accepted all tasks in sequence, theexperiment adhered to the original within-subjects format.

Results
To analyze responses, we replicated Cleveland & McGill's data exploration, using their log absolute error measure of accuracy: log2(|judged percent − true percent| + 1/8). We first computed the midmeans of log absolute errors¹ for each chart (Figure 3). The new results are similar (though not identical) to the originals: the rough shape and ranking of judgment types by accuracy (T1–5) are preserved, supporting the validity of the crowdsourced study.
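(A minimal JavaScript sketch, not the authors' code, of this error measure and the midmean used to summarize it; quartile boundaries are handled approximately.)

// Cleveland & McGill's log absolute error for one response.
function logAbsError(judgedPercent, truePercent) {
  return Math.log2(Math.abs(judgedPercent - truePercent) + 1 / 8);
}

// Midmean: the mean of the middle two quartiles of the errors for one chart.
function midmean(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const lo = Math.floor(sorted.length / 4);      // start of the 2nd quartile (approx.)
  const hi = Math.ceil(sorted.length * 3 / 4);   // end of the 3rd quartile (approx.)
  const middle = sorted.slice(lo, hi);
  return middle.reduce((sum, v) => sum + v, 0) / middle.length;
}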

Next we computed the log absolute error means and 95% confidence intervals for each judgment type using bootstrapping (cf. [7]). The ranking of types by accuracy is consistent between the two experiments (Figure 4). Types 1 and 2 are closer in the crowdsourced study; this may be a result of a smaller display mitigating the effect of distance. Types 4 and 5 are more accurate than in the original study, but position encoding still significantly outperformed length encoding.
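(A rough sketch of the percentile bootstrap implied here; the resample count of 1000 is an illustrative choice, not the paper's setting.)

// Percentile bootstrap for a 95% confidence interval of the mean log error.
function bootstrapCI(values, nResamples = 1000) {
  const means = [];
  for (let i = 0; i < nResamples; i++) {
    let sum = 0;
    for (let j = 0; j < values.length; j++) {
      sum += values[Math.floor(Math.random() * values.length)];  // resample with replacement
    }
    means.push(sum / values.length);
  }
  means.sort((a, b) => a - b);
  return [means[Math.floor(0.025 * nResamples)], means[Math.floor(0.975 * nResamples)]];
}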

We also introduced two new judgment types to evaluate angle and circular area encodings. Cleveland & McGill conducted a separate position-angle experiment; however, they used a different task format, making it difficult to compare…

¹The midmean (the mean of the middle two quartiles) is a robust measure less susceptible to outliers. A log scale is used to measure relative proportional error, and the 1/8 term is included to handle zero-valued differences.

[Heer & Bostock, 2010]D. Koop, DSC 530, Spring 2019

Answer: A is ~2.25x larger (in area) than B
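(For intuition, and assuming the marks are circles sized by radius as in the bubble-chart stimuli, the area ratio to be judged is the square of the radius ratio; a quick sketch:)

// Area ratio between two circles, given their radii.
function circleAreaRatio(radiusA, radiusB) {
  return (radiusA * radiusA) / (radiusB * radiusB);
}
// circleAreaRatio(1.5, 1) === 2.25, i.e. a 1.5x radius difference reads as ~2.25x in area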

Page 33: Data Visualization (DSC 530/CIS 568)

Test % difference in area between elements

�33


[Heer & Bostock, 2010]D. Koop, DSC 530, Spring 2019

Page 34: Data Visualization (DSC 530/CIS 568)

Test % difference in area between elements

�34


[Heer & Bostock, 2010]D. Koop, DSC 530, Spring 2019

Answer: B is ~6.1x larger (in area) than A

Page 35: Data Visualization (DSC 530/CIS 568)

Test % difference in area between elements

�35


[Heer & Bostock, 2010]D. Koop, DSC 530, Spring 2019

Page 36: Data Visualization (DSC 530/CIS 568)

Test % difference in area between elements

�36


[Heer & Bostock, 2010]D. Koop, DSC 530, Spring 2019

Answer: B is ~2.5x larger (in area) than A

Page 37: Data Visualization (DSC 530/CIS 568)

Cleveland & McGill Experiments

�37

[Figure 4. Graphs from position-length experiment: five chart types (Type 1–Type 5), each on a 0–100 scale with marked values A and B.]

tracted by perceiving position along a scale, in this case the horizontal axis. The y values can be perceived in a similar manner.

The real power of a Cartesian graph, however, does not derive only from one's ability to perceive the x and y values separately but, rather, from one's ability to understand the relationship of x and y. For example, in Figure 7 we see that the relationship is nonlinear and see the nature of that nonlinearity. The elementary task that enables us to do this is perception of direction. Each pair of points on the plot, (xi, yi) and (xj, yj), with xi ≠ xj, has an associated slope

(yj − yi) / (xj − xi).

The eye-brain system is capable of extracting such a slope by perceiving the direction of the line segment joining (xi, yi) and (xj, yj). We conjecture that the perception of these slopes allows the eye-brain system to imagine a smooth curve through the points, which is then used to judge the pattern. For example, in Figure 7 one can perceive that the slopes for pairs of points on the left side of the plot are greater than those on the right side of the plot, which is what enables one to judge that the relationship is nonlinear.
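(A tiny sketch of the slope the eye-brain system extracts for each pair of points:)

// Slope (direction) of the segment joining (xi, yi) and (xj, yj); assumes xi !== xj.
function slope(xi, yi, xj, yj) {
  return (yj - yi) / (xj - xi);
}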

That the elementary task of judging directions on a Cartesian graph is vital for understanding the relationship of x and y is demonstrated in Figure 8. The same x and y values are shown by paired bars. As with the Cartesian

[Figure 5. Statistical map with shading: 1978 murder rates per 100,000 population, shown with five representative shadings.]

[Figure 2. Sample distribution function of 1978 murder rate (murder rate on the horizontal axis, roughly 5–15).]

judging position along a common scale, which in this case is the horizontal scale.

Bar Charts

Figures 3 and 4 contain bar charts that were shown to subjects in perceptual experiments. The few noticeable peculiarities are there for purposes of the experiments, described in a later section.

Judging position is a task used to extract the values of the data in the bar chart in the right panel of Figure 3. But now the graphical elements used to portray the data (the bars) also change in length and area. We conjecture that the primary elementary task is judging position along a common scale, but judgments of area and length probably also play a role.

Pie Charts

The left panel of Figure 3 is a pie chart, one of the most commonly used graphs for showing the relative sizes of the parts of a whole. For this graph we conjecture that the primary elementary visual task for extracting the numerical information is perception of angle, but the areas and arc lengths of the pie slices are variable and probably are also involved in judging the data.

Divided Bar Charts

Figure 4 has three divided bar charts (Types 2, 4, and 5). For each of the three, the totals of A and B can be compared by perceiving position along the scale. Position judgments can also be used to compare the two bottom divisions in each case; for Type 2 the bottom divisions are marked with dots. All other values must be compared by the elementary task of perceiving different bar lengths; examples are the two divisions marked with dots in Type 4 and the two marked in Type 5.

Statistical Maps With Shading

A chart frequently used to portray information as a function of geographical location is a statistical map with shading, such as Figure 5 (from Gale and Halperin 1982), which shows the murder data of Figure 2. Values of a real variable are encoded by filling in geographical regions using any one of many techniques that produce gray-scale shadings. In Figure 5 the technique illustrated uses grids drawn with different spacings; the data are not proportional to the grid spacing but, rather, to a complicated function of spacing. We conjecture that the primary elementary task used to extract the data in this case is the perception of shading, but judging the sizes of the squares formed by the grids probably also plays a role, particularly for the large squares.

Curve-Difference Charts

Another class of commonly used graphs is curve-difference charts: Two or more curves are drawn on the graph, and vertical differences between some of the curves encode real variables that are to be extracted. One type of curve-difference chart is a divided, or aggregate, line chart (Monkhouse and Wilkinson 1963), which is typically used to show how parts of a whole change through time.

Figure 6 is a curve-difference chart. The original was drawn by William Playfair; because our photograph of the original was of poor quality, we had the figure redrafted, trying to keep as close to the original as possible. The two curves portray exports from England to the East Indies and imports to England from the East Indies. The vertical distances between the two curves, which encode the export-import imbalance, are highlighted. The quantitative information about imports and exports is extracted by perceiving position along a common scale, and the information about the imbalances is extracted by perceiving length, that is, vertical distance between the two curves.

Cartesian Graphs and Why They Work

Figure 7 is a Cartesian graph of paired values of two variables, x and y. The values of x can be visually ex-

[Figure 3. Graphs from position-angle experiment: a pie chart (left) and a bar chart (right) showing values labeled A–E.]

[Cleveland & McGill, 1984]D. Koop, DSC 530, Spring 2019

Page 38: Data Visualization (DSC 530/CIS 568)

Heer & Bostock Experiments
• Rerun Cleveland & McGill's experiment using Mechanical Turk
• … with more tests

�38

[Heer & Bostock, 2010]D. Koop, DSC 530, Spring 2019


Page 39: Data Visualization (DSC 530/CIS 568)

Results Summary

�39

[Figure: results summary comparing Cleveland & McGill's results with the crowdsourced results; each panel plots log error (roughly 1.0–3.0) for positions, rectangular areas (aligned or in a treemap), angles, and circular areas.]

[Munzner (ill. Maguire) based on Heer & Bostock, 2014]D. Koop, DSC 530, Spring 2019

Page 40: Data Visualization (DSC 530/CIS 568)

Psychophysics
• How do we perceive changes in stimuli?
• The Psychophysical Power Law [Stevens, 1975]: all sensory channels follow a power function based on stimulus intensity (S = I^n)
• Length is fairly accurate
• Magnified vs. compressed sensations

�40

[Munzner (ill. Maguire), 2014]D. Koop, DSC 530, Spring 2019
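(A small sketch of Stevens' power law; the channel exponents below are rough, commonly cited values used only for illustration, not definitive measurements.)

// Perceived sensation S = I^n for a stimulus intensity I.
const exponents = { length: 1.0, area: 0.7, brightness: 0.5 };  // illustrative values

function perceivedMagnitude(intensity, channel) {
  return Math.pow(intensity, exponents[channel]);
}
// Doubling an area (I: 1 -> 2) is perceived as only ~1.6x larger, since 2^0.7 ≈ 1.62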

Page 41: Data Visualization (DSC 530/CIS 568)

Ranking Channels by Effectiveness

�41

Channels: Expressiveness Types and Effectiveness Ranks
Magnitude Channels (Ordered Attributes), ranked by effectiveness:
• Position on common scale
• Position on unaligned scale
• Length (1D size)
• Tilt/angle
• Area (2D size)
• Depth (3D position)
• Color luminance
• Color saturation
• Curvature
• Volume (3D size)
Identity Channels (Categorical Attributes), ranked by effectiveness:
• Spatial region
• Color hue
• Motion
• Shape

[Munzner (ill. Maguire), 2014]D. Koop, DSC 530, Spring 2019
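(A hedged sketch of how this ranking might be used when assigning channels to attributes; the lookup simply restates the ordering above and is not from Munzner's text.)

const channelRanking = {
  ordered: ['position on common scale', 'position on unaligned scale', 'length (1D size)',
            'tilt/angle', 'area (2D size)', 'depth (3D position)', 'color luminance',
            'color saturation', 'curvature', 'volume (3D size)'],
  categorical: ['spatial region', 'color hue', 'motion', 'shape']
};

// Most effective channel of the right expressiveness type that is not already used.
function bestChannel(attributeType, used = []) {
  return channelRanking[attributeType].find(c => !used.includes(c));
}
// bestChannel('ordered', ['position on common scale']) -> 'position on unaligned scale'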

Page 42: Data Visualization (DSC 530/CIS 568)

Discriminability

�42

[Figure: a VTK-based workflow graph with many module nodes (PythonSource, vtkDataSetReader, vtkDataSetMapper, vtkActor, vtkLODActor, vtkRenderer, vtkScalarBarActor, vtkLookupTable, various vtkImage* filters, vtkPolyDataMapper, vtkCamera, and others).]

[Koop et al., 2013]D. Koop, DSC 530, Spring 2019

What is problematic here?

Page 43: Data Visualization (DSC 530/CIS 568)

Discriminability
• Can someone tell the difference?
• How many values (bins) can be used so that a person can tell the difference?
• Example: Line width
  - Matching a particular width with a legend
  - Comparing two widths

�43D. Koop, DSC 530, Spring 2019
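(A minimal sketch of the binning question; the 20% width step is an illustrative assumption, not a measured discriminability threshold.)

// Estimate how many line widths stay distinguishable between a min and max width,
// assuming adjacent widths must differ by at least the given ratio.
function distinguishableWidths(minWidth, maxWidth, ratio = 1.2) {
  const widths = [];
  for (let w = minWidth; w <= maxWidth; w *= ratio) {
    widths.push(Number(w.toFixed(2)));
  }
  return widths;
}
// distinguishableWidths(1, 8) yields only about 12 usable widths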

Page 44: Data Visualization (DSC 530/CIS 568)

Separability
• Cannot treat all channels as independent!
• Separable means each individual channel can be distinguished
• Integral means the channels are perceived together

�44

[Munzner (ill. Maguire) based on Ware, 2014]D. Koop, DSC 530, Spring 2019

[Figure: channel pairs ordered from fully separable to integral: position + hue (fully separable), size + hue (some interference), width + height (some/significant interference), red + green (major interference).]
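(A hedged D3 v5 sketch, assuming an existing svg element and a data array of {value, category} objects: encode one attribute with horizontal position and another with color hue, two largely separable channels.)

const x = d3.scaleLinear().domain([0, 100]).range([0, 300]);
const color = d3.scaleOrdinal(d3.schemeCategory10);

d3.select('svg')
  .selectAll('circle')
  .data(data)                      // data is assumed to exist: [{value, category}, ...]
  .join('circle')
  .attr('cx', d => x(d.value))     // position encodes the quantitative value
  .attr('cy', 50)
  .attr('r', 5)
  .attr('fill', d => color(d.category));  // hue encodes the categorical value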

Page 45: Data Visualization (DSC 530/CIS 568)

Separable or Integral?

�45

[GOOD]D. Koop, DSC 530, Spring 2019

Page 46: Data Visualization (DSC 530/CIS 568)

Separable or Integral?

�45

[GOOD]D. Koop, DSC 530, Spring 2019

Page 47: Data Visualization (DSC 530/CIS 568)

Visual Popout

�46

[C. G. Healey]D. Koop, DSC 530, Spring 2019

Page 48: Data Visualization (DSC 530/CIS 568)

Visual Popout: Parallel Lines Require Search…

�47

[Munzner (ill. Maguire), 2014]D. Koop, DSC 530, Spring 2019

Page 49: Data Visualization (DSC 530/CIS 568)

Relative vs. Absolute Judgments
• Weber's Law:
  - We judge based on relative, not absolute, differences
  - The amount of perceived difference is relative to the object's magnitude!

�48

[Munzner (ill. Maguire), 2014]D. Koop, DSC 530, Spring 2019

[Figure: pairs of bars A and B shown unframed and aligned, framed and unaligned, and unframed and unaligned, illustrating how framing and alignment change the ease of relative length judgments.]
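(A minimal sketch of Weber's law as a rule of thumb; the Weber fraction below is an illustrative assumption, not a measured threshold.)

// Two lengths are noticeably different only if their relative difference exceeds
// a roughly constant fraction of the smaller magnitude (delta I / I >= k).
function isNoticeable(lengthA, lengthB, weberFraction = 0.02) {
  const base = Math.min(lengthA, lengthB);
  return Math.abs(lengthA - lengthB) / base >= weberFraction;
}
// isNoticeable(100, 101) -> false; isNoticeable(100, 105) -> true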

Page 50: Data Visualization (DSC 530/CIS 568)

Luminance Perception

�49

[E. H. Adelson, 1995]D. Koop, DSC 530, Spring 2019

Page 51: Data Visualization (DSC 530/CIS 568)

Luminance Perception

�50

[E. H. Adelson, 1995]D. Koop, DSC 530, Spring 2019