
Just Another Pretty Visualization?

The What, Where, When, Why and How of Evaluating Visualizations

Jim Foley
College of Computing
Georgia Institute of Technology
Atlanta, Georgia USA

Talk originally presented at

Opening Convocation
UNC-Charlotte Visualization Center
May 2006

Visualization

• Interactive visual representation of data to provide a user with information and insights

• Two subfields
  Scientific visualization
  Information visualization


Scientific Visualization

• Data typically from physics/chemistry/biology

• Data has an inherent geometry
• Examples

Air flow, molecules, weather


Information Visualization

• Data typically abstract
  Census, financial, business, …

• Must create a geometrical representation

Examples in this talk are taken from Information Visualization

The evaluation methodologies are appropriate for both SciVis and InfoVis

Evaluation of Visualizations

Roadmap

• The challenges of evaluation
• Subjective evaluation
• Cognition + perception - based evaluation
• Experimental evaluation
• Long-term Evaluation
• Conclusions

The Challenges of Evaluation

• The challenges of evaluation
• Subjective evaluation
• Evaluation based on model of cognition & perception
• Experimental evaluation
• Long-term Evaluation
• Conclusions

Evaluation Criteria

• Look pretty? Be effective?
• Extract numbers?
  Accuracy or speed or avoid errors or …
• “The purpose of computing is insight, not numbers” - Richard Hamming
  How do we measure insights? Are some insights better than others?
• “Detecting the Expected – Discovering the Unexpected™” - Jim Thomas
  If we don’t know what is expected to be there, how do we know if we have found it?

Challenge 1

The Problem of Good, Better, Best

• There is no “proof of optimality”
• Even if we find all info that we seek – could be faster with another viz
• Maybe another insight was there?
• Settle for “Good” or “Better” rather than “Best”

Challenge 2

Effectiveness Depends on …

• The data
• The user
  Domain knowledge
  Tool knowledge
  Right brain - left brain

Challenge 3

The Tool vs. the Visualization

• Usability ≠ Usefulness
• If the viz is hard to create, the user may not use it
  Maybe the viz is perfect for the task, but the tool gets in the way
• Focus here is NOT on usability of the viz tool
  Assume it is easy to learn and use
  => Apply ALL usability tools/methods!

Challenge 4

Compare Best Apple to Best Orange

• Comparing two visualizations => each must be as good as possible
• A great implementation of a low-effectiveness visualization may perform better than a poor implementation of a high-effectiveness visualization

Challenge 5

The Default Trap

• Users tend to stick with default settings and visualizations, even though others would be better
  Example: Kobsa, An empirical evaluation of three commercial information visualization systems, InfoVis 2001
  Seen in other domains as well

Challenge 6

Purpose of Evaluation: Insight, not Numbers

• Good to know that A is better than B

• But MUCH better to know WHY A is better than B

=> Ground experiments in theory

=> Get inside users’ heads

Challenge 7

Subjective Evaluation

• The challenges of evaluation
• Subjective evaluation
• Evaluation based on model of cognition & perception
• Experimental evaluation
• Long-term Evaluation
• Conclusions

Subjective Evaluation

• Really subjective - “I like it, so it must be GREAT”
• Less subjective - “All my friends like it, so it must be REALLY GREAT”
• Tufte’s “minimize chart junk” etc.
• Shneiderman’s “Overview first, zoom and filter, details on demand”
• But
  Need more goal-oriented criteria
  Need HCI/InfoVis experts, not our friends

Potential Criteria

• Amar and Stasko, A Knowledge Task-Based Framework for Design and Evaluation of Information Visualizations, Best Paper, InfoVis 2004.

• Rationale-based Tasks
  Expose uncertainty
  Concretize relationships
  Formulate cause and effect

• Worldview-based Tasks
  Determination of domain parameters
  Multivariate explanation
  Confirm hypotheses

Pros/Cons (+/-)

+ Fast
+ Has a rationale basis
- Still based on subjective (but informed) judgments

Cognition & Perception - Based

• The challenges of evaluation
• Subjective evaluation
• Evaluation based on model of cognition & perception
• Experimental evaluation
• Long-term Evaluation
• Conclusions

Evaluation based on model of cognition & perception

• Lohse, A Cognitive Model for the Perception and Understanding of Graphics, CHI 1991

• In the spirit of GOMS and ACT
• Given one of three query types
  Read: Did Ford’s market share exceed 18% in 1983?
  Compare: In 1983, did Honda have more USA market share than Nissan?
  Trend: From 1979 to 1982, did GM’s market share increase?

Lohse’s Model Accounts for

• Eye movements (saccades)
• Word recognition
• Retrieve into working memory
• Compare two units in working memory
• Etc.

“Semantic Traces” on three content-equivalent visualizations, to answer the question “In 1983, did Nissan have more USA market share than Toyota?” (the figure also showed working-memory contents)

The model explained 58% of the variation in a verification experiment. (A minimal sketch of this kind of additive operator model appears below.)
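To give a concrete feel for this style of model, here is a minimal Python sketch of an additive operator model. The operator names and the millisecond durations are illustrative assumptions, not Lohse's published parameters; the only point is that a predicted answer time is the sum of the elementary perceptual and memory operations along an assumed scanning sequence.

```python
# Sketch of a GOMS/Lohse-style additive timing model.
# Operator durations (milliseconds) are placeholders, NOT Lohse's values.
OPERATOR_MS = {
    "saccade": 230,          # move the eyes to a new fixation point
    "recognize_word": 250,   # recognize a label or legend entry
    "retrieve_to_wm": 1200,  # bring a value into working memory
    "compare_in_wm": 400,    # compare two items held in working memory
}

def predicted_time_ms(scan_sequence):
    """Sum the operator times along an assumed scanning sequence."""
    return sum(OPERATOR_MS[op] for op in scan_sequence)

# Hypothetical trace for a Compare query on a bar chart:
# locate the 1983 label, read each of the two bars, compare the values.
trace = ["saccade", "recognize_word",
         "saccade", "retrieve_to_wm",
         "saccade", "retrieve_to_wm",
         "compare_in_wm"]
print(predicted_time_ms(trace), "ms predicted")
```

Because content-equivalent visualizations imply different scanning sequences, the model yields different predicted times, which is what makes it useful for comparing designs before running an experiment.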

Pros/Cons (+/-)

- Focus is facts (read, compare, trend), not insights
+ But facts can lead to insights
+ Useful for comparing designs
+ Faster than experiments!
- Assumes the scanning sequence can be determined


Application

http://www.perceptualedge.com/example2.php

Experimental Evaluation

• The challenges of evaluation
• Subjective evaluation
• Evaluation based on model of cognition & perception
• Experimental evaluation
• Long-term Evaluation
• Conclusions

Experimental Comparison of Tree Visualization Systems

• Kobsa, User Experiments with Tree Visualization Systems, InfoVis 2004
• Compares 5 tree visualization systems + the Windows file browser
• Quantitative - measure the results
• Qualitative - to (partially) understand the results

The systems: Treemap 3.2, Tree Viewer, Hyperbolic Browser, BeamTrees, SequoiaView

Experimental Design

• One data set - a subset of the eBay item taxonomy, 5799 nodes
• Six systems
  5 viz systems + the Windows file browser
• Between subjects
  48 subjects, each uses one system => 8 subjects per system
• Each subject performs 15 tasks
  9 tasks concern directory structure
    What is the maximum depth of the hierarchy?
    Is the tree balanced or unbalanced?
  6 tasks concern file or directory attributes
    Find the file with name = xxxx
    Find the name of the largest file
(A sketch of the between-subjects assignment appears below.)
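As a rough illustration of the between-subjects setup, a minimal sketch follows: 48 hypothetical participants are randomly assigned to the six systems, eight per system. The system names come from the slides; the subject-ID scheme and the random seed are assumptions.

```python
import random

SYSTEMS = ["Treemap 3.2", "SequoiaView", "BeamTrees",
           "Hyperbolic Browser", "Tree Viewer", "Windows File Browser"]

def assign_between_subjects(n_subjects=48, systems=SYSTEMS, seed=0):
    """Randomly assign subjects so that each system gets the same number (8 here)."""
    assert n_subjects % len(systems) == 0
    per_system = n_subjects // len(systems)
    slots = [s for s in systems for _ in range(per_system)]   # 8 slots per system
    random.Random(seed).shuffle(slots)
    return {f"S{i + 1:02d}": system for i, system in enumerate(slots)}

assignment = assign_between_subjects()
print(assignment["S01"], "...", assignment["S48"])
```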

Results - Correct Answers

Systems (abbreviations): Treemap (TM), SequoiaView (SV), BeamTrees (BT), Hyperbolic Browser (HB), Tree Viewer (TV), Windows File Browser (WFB)

BT << TM, SV, HB, WFB (significant at the 1% level)
TV << TM, HB, WFB (significant at the 1% level)
TV < SV (significant at the 5% level)

(A sketch of such pairwise comparisons appears below.)
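The “<<” notation summarizes pairwise comparisons at the stated significance levels. The sketch below shows one way such a summary could be produced; the per-subject accuracy scores are fabricated, and the independent two-sample t-test is an assumed choice of test (the slide does not say which procedure Kobsa used).

```python
from itertools import combinations
from scipy import stats

# Fabricated per-subject accuracy (fraction of the 15 tasks answered correctly),
# 8 subjects per system; only three systems shown to keep the example short.
accuracy = {
    "TM":  [0.80, 0.73, 0.87, 0.80, 0.67, 0.93, 0.80, 0.73],
    "BT":  [0.40, 0.47, 0.33, 0.53, 0.40, 0.47, 0.33, 0.40],
    "WFB": [0.73, 0.80, 0.67, 0.87, 0.73, 0.80, 0.67, 0.73],
}

for a, b in combinations(accuracy, 2):
    t, p = stats.ttest_ind(accuracy[a], accuracy[b])   # independent two-sample t-test
    if p >= 0.05:
        continue
    worse, better = (a, b) if t < 0 else (b, a)
    level = "1%" if p < 0.01 else "5%"
    print(f"{worse} {'<<' if p < 0.01 else '<'} {better}  (p = {p:.4f}, {level} level)")
```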

Results - Effect of Task Type

(Charts compared structure-related vs. attribute-related tasks for BeamTrees and Tree Viewer.)

Effect of task type statistically significant for Tree Viewer at 0.0001 and for BeamTrees at 0.001

Results - Task Times

• Range for all tasks: 101 seconds (Windows File Browser) and 106 (Treemap) up to 188 (BeamTrees)
• Structure-related tasks: ~165 seconds (SequoiaView and Windows File Browser) up to 265 (Tree Viewer)
• Attribute-related tasks: ~80 seconds (Treemap and Windows File Browser) up to 155 (BeamTrees)

Typical Qualitative Results

• Treemap
  Color coding and filtering heavily used
  Needs “Detail on Demand”
• Hyperbolic Browser
  Cartesian distance ≠ distance from root
  Tree depth is not what HB was intended for!!
• Lots of usability problems
• Findings come from analyzing videos of users

Pros/Cons

+ Nicely designed experiment
- Very different tool functionalities

Detecting the Expected – Discovering the Unexpected™
+ Good job measuring “Detecting the Expected”
- Does not try to measure “Discovering the Unexpected” (i.e., insights)

Experimental Evaluation for INSIGHT

• Saraiya, North and Davis, An Evaluation of Microarray Visualization Tools for Biological Insight, InfoVis 2004

• Purpose of visualizations is insight => count insights
• Domain: visualization of gene expression arrays
• Compare: five software tools

The Tools

TimeSearcher
Hierarchical Clustering Explorer
Spotfire®
GeneSpring®
Cluster/Treeview, aka Clusterview (not shown)

Experimental Design

• Five visualization tools
• Three data sets
  Time series: viral infection - 1060 genes x 5 time points
  Comparison: 861 genes x 3 related viral infections
  Comparison: 170 genes x (48 with lupus + 42 healthy)
• Three classes of users
  Domain experts (10)
  Domain novices (11)
  Software developers (9)
• One user, two tools => 30 trials
  Mixed design – within/between subjects

Tool Capabilities

(The slide tabulated each tool’s capabilities, such as Overview + Detail and Dynamic Queries.)

Measure INSIGHTS

• Users were instructed to list some questions they might want to ask, and then to use the tool until they could learn nothing more
• For each insight, record / assess
  The actual insight
  Time to reach the insight
  Importance of the insight
  Correctness of the insight
  Sought-for vs. serendipitous insight
  Etc.
(A sketch of one such insight record appears below.)
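A minimal sketch of what one recorded insight might look like as a data record follows. The field names mirror the bullets above; the 1-to-5 importance scale, the aggregation rule, and the example values are assumptions, not the coding scheme Saraiya et al. actually used.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    """One insight reported by a participant (fields mirror the slide's checklist)."""
    text: str                 # the actual insight, in the participant's own words
    minutes_to_reach: float   # time from session start until the insight was stated
    importance: int           # domain value, on an assumed 1 (trivial) to 5 (major) scale
    correct: bool             # judged against the underlying biology
    serendipitous: bool       # True if unexpected, False if the user was looking for it

def total_insight_value(insights):
    """One possible aggregate: sum the importance of the correct insights."""
    return sum(i.importance for i in insights if i.correct)

session = [
    Insight("Gene cluster X is up-regulated at 24h", 4.6, 4, True, False),
    Insight("Two samples appear to be mislabeled", 12.0, 3, True, True),
]
print(total_insight_value(session))
```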

Results (Statistically Sig.)

• Insight value
  Spotfire® (66) > GeneSpring® (40)
• Users’ perception of how much they learned
  Spotfire® > Hierarchical Clustering Explorer, Clusterview
• Time to first insight
  Clusterview (4.6) < all others (7 to 16) :-)
  GeneSpring® (16) > all others (4.6 to 14) :-(

Other Results & Insights

• User background did not have a statistically significant effect
• Poor UI design in some cases
  Discouraged/prevented users from using the most effective visualization
  Slowed down use of an otherwise powerful tool – too many clicks
• Clustering is a poor default view - it biases users toward certain points of view
• Tool - dataset interaction (in some cases)

Pros/Cons (+/-)

+ Very important methodology! Gets at users’ real objectives!!!!
+ Detecting the Expected – Discovering the Unexpected™
+ Measures “Detecting the Expected”
+ Measures “Discovering the Unexpected” (i.e., insights)
- Wide variation in tool capabilities
  TimeSearcher did really well with the time-series data - but not with other types of data
- Not enough users to get many statistically significant results
- Only 10 domain experts
- Short-term use – just a few hours
- Users not 100% motivated

Implications

• Study real users doing real work with “seeded” data
  A known set of predetermined insights
  Also called “ground truth”
• Can approximate this with run-off competitions using “seeded” data
(A sketch of scoring found insights against seeded ground truth appears below.)
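As an illustration of scoring against “seeded” data, the sketch below checks a participant’s reported insights against a predefined ground-truth set and counts how many were recovered. The keyword-matching heuristic and every string in it are invented for the example; a real study would rely on human judges.

```python
# Hypothetical ground truth seeded into the data set before the study.
GROUND_TRUTH = {
    "cluster_A_upregulated": {"cluster", "a", "up-regulated"},
    "outlier_sample_17":     {"sample", "17", "outlier"},
    "weekly_cycle":          {"weekly", "cycle"},
}

def recovered(ground_truth, reported_insights):
    """Crude keyword match: a seeded insight counts as found if all of its keywords
    appear in some reported insight. Real studies would use expert judges instead."""
    found = set()
    for key, keywords in ground_truth.items():
        for text in reported_insights:
            if keywords <= set(text.lower().split()):
                found.add(key)
                break
    return found

reports = ["cluster a is strongly up-regulated after infection",
           "sample 17 looks like an outlier"]
hits = recovered(GROUND_TRUTH, reports)
print(f"recovered {len(hits)} of {len(GROUND_TRUTH)} seeded insights: {sorted(hits)}")
```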

Long-term Evaluation

• The challenges of evaluation
• Subjective evaluation
• Evaluation based on model of cognition & perception
• Experimental evaluation
• Long-term Evaluation
• Conclusions

Long-term Evaluation

• Hard to know what really works for exploratory data analysis
• Real people doing real work
  Real people need results for professional success
  Highly motivated users!!

Long-term Studies

• Perer & Shneiderman, Integrating Statistics and Visualization: Case Studies of Gaining Clarity during Exploratory Data Analysis, CHI ’08

• Domain: Social network analysis – political influence, etc

• Problem: Network visualizations by themselves are not so useful
  Need statistics about the networks
• SocialAction InfoVis system
  Combines visualization and statistics
• Question: Does SocialAction improve researchers’ capabilities?

For another such study, see Saraiya, North, et al., An Insight-Based Longitudinal Study of Visual Analytics, IEEE TVCG 2006

SocialAction

Methodology – for each user

• Interview (1 hour)
  Screen participants
  Understand their goals
• Training (2 hours)
  With generic data
• Early use (2-4 weeks)
  With the user’s own data
  Weekly visits
  Requested (and feasible) software enhancements
  Provide help
• Mature use (2-4 weeks)
  No software improvements
  Weekly visits
  Provide help
• Exit interview (1 hour)
  How did the system impact the research?
  How well were goals met?
(The phases are restated as structured data below.)
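Purely as a structured restatement of the protocol above, a sketch (phase names and durations are from the slide; the data layout is an assumption):

```python
# The long-term study protocol, phase by phase, as simple structured data.
STUDY_PHASES = [
    {"phase": "Interview",      "duration": "1 hour",
     "activities": ["Screen participants", "Understand their goals"]},
    {"phase": "Training",       "duration": "2 hours",
     "activities": ["Work with generic data"]},
    {"phase": "Early use",      "duration": "2-4 weeks",
     "activities": ["Work with the user's own data", "Weekly visits",
                    "Requested (and feasible) software enhancements", "Provide help"]},
    {"phase": "Mature use",     "duration": "2-4 weeks",
     "activities": ["No software improvements", "Weekly visits", "Provide help"]},
    {"phase": "Exit interview", "duration": "1 hour",
     "activities": ["How did the system impact the research?", "How well were goals met?"]},
]

for p in STUDY_PHASES:
    print(f"{p['phase']:<15} {p['duration']:<9} {len(p['activities'])} activities")
```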

One User: A Political Analyst

Study US Senate power structures, based on voting blocs

Political Analyst

• The analyst had tried 3 other network analysis tools without gaining the desired insights
  NetDraw, ManyEyes, and KrackPlot
• Found interesting patterns using the capability to rank all nodes, visualize the outcome, and then filter out the unimportant ones
  The betweenness centrality statistic helped find “centers of gravity”
  Found geographic alliances

Findings

• With all four users
  Integration of network statistics with visualizations improved their ability to find insights
• User feedback helped improve SocialAction

Pros/Cons (+/-)

+ Detecting the Expected – Discovering the Unexpected™
+ Measures “Detecting the Expected”
+ Measures “Discovering the Unexpected” (i.e., insights) – by the users’ successes
- Does not compare two tools
  Focus is on how well a tool helps real users do real work – very authentic, not a lab study
- Not enough users to get statistically significant results
+ Users 110% motivated

Conclusions

• Apply UI & graphic design skills
• The methods are not mutually exclusive
• Use evaluation as part of the development cycle
• Understand why users do what they do
• Be sure you know what you’re measuring
• The purpose of evaluations is insight, not numbers!

Thank You

Questions and (maybe) Answers
