THE ART OF SEEING: VISUAL PERCEPTION IN DESIGN AND EVALUATION OF

NON-PHOTOREALISTIC RENDERING

BY ANTHONY SANTELLA

A Dissertation submitted to the

Graduate School—New Brunswick

Rutgers, The State University of New Jersey

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

Graduate Program in Computer Science

Written under the direction of

Doug DeCarlo

and approved by

New Brunswick, New Jersey

May, 2005

ABSTRACT OF THE DISSERTATION

The Art of Seeing: Visual Perception in Design and

Evaluation of Non-Photorealistic Rendering

by Anthony Santella

Dissertation Director: Doug DeCarlo

Visual displays such as art and illustration benefit from concise presentation of in-

formation. We present several approaches for simplifying photographs to create such

concise, artistically abstracted images. The difficulty of abstraction lies in selecting

what is important. These approaches apply models of human vision, models of image

structure, and new methods of interaction to select important content. Important loca-

tions are identified from eye movement recordings. Using a perceptual model, features

are then preserved where the viewer looked, and removed elsewhere. Several visual

styles using this method are presented. The perceptual motivation for these techniques

makes predictions about how they should affect viewers. In this context, we validate

our approach using experiments that measure eye movements over these images. Re-

sults also provide some interesting insights into artistic abstraction and human visual

perception.

Acknowledgements

Thanks go to the many people whose help and support were essential in making this

work possible. None of this would have happened without my advisor Doug DeCarlo.

Thanks go also to my other committee members: Adam Finkelstein, Eileen Kowler,

Casimir Kulikowski and Peter Meer for their advice and encouragement at various (in

some cases many) stages of this process.

Thanks go also to the many friends and family members who have supported and

kept me sane through this long process. I wouldn’t have survived it without my parents

and brothers Nick and Dennis. Special thanks go to Bethany Weber. Thanks also to

Jim Housell, all the old NYU crowd, the grad group at St. Peters and all the supportive

souls in the CS Department, RuCCS and the VILLAGE.

Finally, thanks go to Phillip Greenspun for photos used in several renderings that

appear in chapters 7 and 9, as well as models Marybeth Thomas, Adeline Yeo and

Franco Figliozzi. Special thanks to Georgio Dellachiesa for looking equally thoughtful

in countless illustrative examples.

Table of Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1. Inspirations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1. Artistic Practice . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.2. Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.1.3. Computer Graphics . . . . . . . . . . . . . . . . . . . . . . . 7

1.2. Our Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2. Abstraction in Computer Graphics . . . . . . . . . . . . . . . . . . . . 11

2.1. Manual Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2. Automatic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3. Level Of Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3. Human Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1. Eye Movements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1.1. Eye Movement Control . . . . . . . . . . . . . . . . . . . . . 19

3.1.2. Salience Models . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2. Eye Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3. Limits of Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3.1. Models of Sensitivity . . . . . . . . . . . . . . . . . . . . . . 24

3.3.2. Sensitivity Away from the Visual Center . . . . . . . . . . . . 26

3.3.3. Applicability to Natural Imagery . . . . . . . . . . . . . . . . 26

4. Vision and Image Processing . . . . . . . . . . . . . . . . . . . . . . . 30

4.1. Image Structure Features and Representation . . . . . . . . . . . . . 30

4.2. Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.3. Edge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5. Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.1. Eye tracking as Interaction . . . . . . . . . . . . . . . . . . . . . . . 38

5.2. Using Visibility for Abstraction . . . . . . . . . . . . . . . . . . . . . 40

6. Painterly Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.1. Image Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.2. Applying the Limits of Vision . . . . . . . . . . . . . . . . . . . . . 43

6.3. Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6.4. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

7. Colored Drawings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

7.1. Feature Representation . . . . . . . . . . . . . . . . . . . . . . . . . 50

7.1.1. Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 50

7.1.2. Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

7.2. Perceptual Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

7.3. Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

7.4. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

8. Photorealistic Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . 64

8.1. Image Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

8.2. Measuring Importance . . . . . . . . . . . . . . . . . . . . . . . . . 65

8.3. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 67

9. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

9.1. Evaluation of NPR . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

9.1.1. Analysis of Eye Movement Data . . . . . . . . . . . . . . . . 75

9.2. Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

9.2.1. Stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

9.2.2. Subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

9.2.3. Physical Setup . . . . . . . . . . . . . . . . . . . . . . . . . 78

9.2.4. Calibration and Presentation . . . . . . . . . . . . . . . . . . 79

9.3. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

9.3.1. Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 80

9.3.2. Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . 82

9.4. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

9.4.1. Quantitative Results . . . . . . . . . . . . . . . . . . . . . . 86

9.4.2. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

9.5. Evaluation Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 92

10. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

10.1. Image Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

10.1.1. Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 95

10.1.2. Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

10.2. Perceptual Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

10.3. Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

11. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

Curriculum Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

List of Figures

1.1. (a) Henri de Toulouse-Lautrec’s “Moulin Rouge—La Goulue” (Litho-

graphic print in four colors, 1891). (b) Odd Nerdrum’s “Self-portrait

as Baby” (Oil, 2000). Artists control detail as well as other features

such as color and texture to focus a viewer on important features and

create a mood. La Goulue’s swirling under-dress is a highly detailed

focal point of the image, and contributes to the picture’s air of reck-

less excitement. Artists have a fair amount of latitude in how they

allocate detail to create an effect. Nerdrum renders his eyes (usually

one of the most prominent features in a portrait) in a sfumato style

that makes them almost nonexistent. Detail is instead allocated to the

child’s prophetic gesture. These choices change a common baby pic-

ture into something mysterious and unsettling. . . . . . . . . . . . . 4

1.2. Judith Schaechter’s “Corona Borealis” (Stained glass, 2001). Skillful

artists use the formal properties and constraints of a medium for

expressive purposes. The high dynamic range provided by transmit-

ted light and the heavy black outlines of the lead caming that holds

the glass together are used to set the figure off from the background

creating a powerful image of joy in isolation. . . . . . . . . . . . . . 5

2.1. Direct placement of strokes. Complete control of abstraction is pos-

sible when a user provides actual strokes that are rendered in a given

style. Reproduced from [Durand et al., 2001]. . . . . . . . . . . . . 11

2.2. Manual annotation for textural indication. Important edges on a 3D

model are marked and have texture rendered near them, while it is

omitted in the interior. Reproduced from [Winkenbach and Salesin,

1994]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3. Manual local importance images. Hand painted images can indicate

important areas to be rendered in greater detail or fidelity. Reproduced

from [Hertzmann, 2001] . . . . . . . . . . . . . . . . . . . . . . . . 12

2.4. (a) Original image. (b) Corresponding salience map [Itti et al., 1998]. (c)

corresponding salience map [Itti and Koch, 2000]. Salience methods

pick out potentially important areas on the basis of contrast in some

space (not limited to intensity). The two methods pictured here differ in

the method of normalization used to enhance contrast between salient

and nonsalient regions. . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1. Patterns of eye movements of a single subject over an image when

given different instructions. Note (1) free observation which shows

fixations that are relatively dispersed yet still focused on relevant ar-

eas. Contrast it with (3) where the viewer is instructed to estimate the

figures’ ages. Reproduced from [Yarbus, 1967]. . . . . . . . . . . . 18

3.2. Similar effects to [Yarbus, 1967] are easily (even unintentionally) achieved

when using eye tracking for interaction. Circles are fixations, their di-

ameter is proportional to duration. The first viewer was instructed to

find the important subject matter in the image. The second viewer was

told to ‘just look at the image’. The viewer assumed, from prior experience

in perceptual experiments, that he was going to be later asked

detailed questions about the contents of the scene. This resulted in a

much more diffuse pattern of viewing. . . . . . . . . . . . . . . . . . 19

3.3. Log-log plot of contrast sensitivity from equation (3.2). This function

is used to define a threshold between visible and invisible features. . 25

3.4. Cortical Magnification describes the drop-off of visual sensitivity with

angular distance from the visual center. . . . . . . . . . . . . . . . . 27

4.1. (a) Scale space of one dimensional signal. Features disappear through

scale space but no new features appear. (b) Plot of inflection points of

another one dimensional signal through scale space. Reproduced from

[Witkin 1983] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.2. Interval tree for 1D signal illustrating decomposition of the signal into

a hierarchy. Reproduced from [Witkin 1983]. . . . . . . . . . . . . . 33

5.1. (a) Computing eccentricities with respect to a particular fixation at p.

(b) A simple attention model defined as a piecewise-linear function for

determining the scaling factor a_i for fixation f_i based on its duration

t_i. Very brief fixations (below t_min) are ignored, with a ramping up (at

t_max) to a maximum level of a_max. . . . . . . . . . . . . . . . . . 40

6.1. Painterly rendering results. The first column shows the fixations made

by a viewer. Circles are fixations, size is proportional to duration, the

bar at the lower left is the diameter that corresponds to one second. The

second column illustrates the painterly renderings built based on that

fixation data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.2. Detail in background adjacent to important features can be inappro-

priately emphasized. The main subject has a halo of detailed shutter

slats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.3. Sampling strokes from an anisotropic scale space avoids giving the

image an overall blurred look, but produces a somewhat jagged look in

background areas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.4. Color and contrast manipulation. Side by side comparison of rendering

with and without color and contrast manipulation (precise stroke

placement varies between the two images due to randomness). . . . . 48

7.1. Slices through several successive levels of a hierarchical segmentation

tree generated using our method. . . . . . . . . . . . . . . . . . . . . 51

7.2. Line drawing style results. . . . . . . . . . . . . . . . . . . . . . . . 60

7.3. Stylistic decisions. Lines in isolation (a) are largely uninteresting. Un-

smoothed regions (b) can look jagged. Smoothed regions (c) have a

somewhat vague and bloated look without the black edges superimposed. 61

7.4. Renderings with uniform high and low detail. . . . . . . . . . . . . . 62

7.5. Several derivative styles of the same line drawing transformation. (a)

Fully colored, (b) color comic, (c) black and white comic . . . . . . . 62

8.1. Mean shift filtering tends to create images that no longer look like pho-

tographs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

8.2. Photo abstraction results . . . . . . . . . . . . . . . . . . . . . . . . 68

8.3. Photo in (a) is abstracted using fixations in (b) in a variety of differ-

ent styles. (c) Painterly rendering, (d) line drawing, (e) locally disor-

dered [Koenderink and van Doorn, 1999], (f) blurred, (g) anisotropi-

cally blurred. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

8.4. (a) Detail of our approach, (b) the same algorithm using an importance

map where total dwell is measured locally. Notice in (b) the leaking of

detail to the wood texture from the object on the desk. Here differences

are relatively subtle; but in general it is preferable to allocate detail in

a way that respects region boundaries. . . . . . . . . . . . . . . . . . 70

8.5. The range of abstraction possible with this technique is limited. With

greater abstraction the scene begins to appear foggy. In some sense it

no longer looks like the same scene. . . . . . . . . . . . . . . . . . . 71

9.1. Example stimuli. Detail points in white are from eye tracking, black

detail points are from an automatic salience algorithm. . . . . . . . . 76

9.2. Illustration of data analysis, per image condition. Each colored collec-

tion of points is a cluster. Ellipses mark 99 % of variance. Large black

dots are detail points. We measure the number of clusters, distance

between clusters and nearest detail point, and distance between detail

points and nearest cluster. . . . . . . . . . . . . . . . . . . . . . . . 80

9.3. Statistical significance is achieved for number of clusters over a wide

range of clustering scales. The magnitude of the effect decreases, but

its significance remains quite constant over a wide interval. Our results

do not hinge on the scale value selected. . . . . . . . . . . . . . 82

9.4. Average results for all analyses per image. . . . . . . . . . . . . . . 84

9.5. Average results for all analyses per viewer. . . . . . . . . . . . . . . 85

9.6. Original photo and high detail NPR image with viewers’ filtered eye

tracking data. Though we found no global effect across these image

types, there are sometimes significantly different viewing patterns, as

can be seen here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

10.1. A rendering from our line drawing system (b), can be compared to

an alternate locally varying segmentation (c). This segmentation more

closely follows the shape of shading contours. . . . . . . . . . . . . 96

10.2. Locally varying segmentation cannot replace a segmentation hierar-

chy. Another example of a locally varying segmentation controlled by

a perceptual model (c), compared to a rendering from our line drawing

system. Note fine detail in the brick preserved near the subject’s head

in (c). This is a consequence of the threshold varying continuously as

a function of distance from the fixations on the face. . . . . . . . . . 97

10.3. A rendering from our line drawing system demonstrates how long but

unimportant edges can be inappropriately emphasized. Also, promi-

nent lower frequency edges like creases in clothing are detected in

fragments and filtered out because edges are detected at only one scale. 100

10.4. Attempting technical illustration of mechanical parts pushes our image

analysis techniques close to (if not over) their limits. . . . . . . . . . 103

Chapter 1

Introduction

In all eras and visual styles, artists control the amount of detail in the images they

create, both locally and globally. This is not just a technique to limit the effort in-

volved in rendering a scene. It makes a definite statement about what is important

and streamlines understanding. Our goal is to largely automate this artistic abstraction

in computer renderings. The hope is to remove detail in a meaningful way, while

automating individual decisions about what features to include. Eye tracking allows

the capture of what a viewer looks at and indirectly, what they find important. We

demonstrate that this information alone is sufficient to control detail in an image-based

rendering, and change the way successive viewers look at the resulting image. Our

method is grounded in the mechanisms and nature of vision—how we see and un-

derstand the world. This is an intuitive idea, if often overlooked. Artists must first

be viewers [Ruskin, 1858] and viewers ultimately consume the resulting images. So,

vision must be central in the design of algorithms for creating imagery.

Vision appears simple and effortless. Because under most circumstances it requires

no conscious effort or exertion, it seems like a trivial operation, something that just

happens, as if the light falling on the eye made one see in the same way it warms a stone.

But sight is the product of an extraordinarily developed and complicated visual system.

In seeing we are all experts, and experts make things seem easy. Without any effort we

can navigate and act in the world and recognize objects even under difficult conditions.

The abilities of our sight outreach even our awareness of them. Experiments have

shown that the eyes of radiologists searching for tumors linger longer over tumors

that they fail to notice and report [Mello-Thoms et al., 2002]. The limited success of

attempts to mimic these human abilities in computer vision systems highlights both the

difficulty of the computations involved, and our phenomenal success at them.

The apparent ease with which we see slips when our vision is stressed: struggling

to keep a written page in focus as we fall asleep, searching for a loved one’s face in the

shifting crowd of an airport. At these times we become conscious of sight as a struggle

to organize and make sense of the world. This struggle has continual victories, but also

failures. An old friend waving to us on the street is passed by; a typo makes its way into

an important document. The apparent ease of vision also masks our limitations. We

miss much, and are easily overloaded. Sometimes our failures are engineered: a cam-

ouflaged soldier, the proverbial fine print. More often, however, they are accidental.

Some information was present, or presented, and we failed to notice it.

Well-designed displays of visual information ensure we don’t miss anything important

by careful arrangement and manipulation. A wide variety of techniques are used

to make meaning clear. Detail is put just where it is important, shapes can be changed

or removed, colors and textures enhanced or suppressed. Paintings, sketches, technical

illustrations, and even the most apparently photorealistic of art—all products of the hu-

man hand—have been simplified and manipulated for ease of understanding. Reality

is complicated and messy. Rather than realism, what is more often desired is verisimil-

itude. We want the appearance of reality which has been organized and structured to

make its meaning clearer, if necessarily more limited than the infinite complexity of

reality.

Achieving this kind of clarity has always been the job of artists and designers who

make subjective, but not arbitrary, decisions about what is important, and how to con-

vey it. The ubiquity of digital media creates a need for automation in achieving this

kind of good design. The goal is not to replace the artist who creates a carefully crafted

one-off display, but instead to create a potentially vast number of adaptive displays,

tailored to particular situations and viewers. This information would otherwise be dis-

played in some less well-designed manner, laying more of a cognitive burden on the

user. It has been argued, in fact, that avoiding this burden is one of the primary characteristics

of powerful art [Zeki, 1999]. If good design can be formalized, this will

enhance understanding and aid effective communication, as well as improve our own

understanding of the workings of visual communication. This thesis presents some

initial steps toward this goal.

1.1 Inspirations

There are many techniques proposed by various artists, and perhaps even more theory

proposed by various researchers and critics on how to achieve good visual design. Yet

it remains imperfectly understood in all of the fields where it has been studied. Because

of this, a successful practical approach must necessarily draw on elements from many

areas of practice and theory. If a practical system is designed to be as general as

possible, its creation can improve understanding of what visual clarity means, and how

it relates to communication. It can also provide a framework in which to unify concepts

and techniques from many fields.

1.1.1 Artistic Practice

One important source of inspiration for this work is artistic practice and practical the-

ory. Artists have always had strong motivation to capture the attention and interest of

uninterested, sometimes hostile viewers. Much ingenuity has been applied to creating

images that are as gripping and clearly communicative as possible. Careful observation

of such images can yield interesting insights (see Figure 1.1). Similarly, artists have

throughout history given advice on the practice of their craft. Theorists and art histo-

rians have tried to make generalizations and analyze techniques [Ruskin, 1857, Gom-

brich et al., 1970, Graham, 1970, Arnheim, 1988]. This is true in graphic design as

well as fine art. Classical texts like Tufte [1990] try to explore the qualities of good

and bad presentations of information and make generalizations from carefully chosen

examples.

However, these instructions and recommendations are often difficult to apply. They

Figure 1.1: (a) Henri de Toulouse-Lautrec’s “Moulin Rouge—La Goulue” (Lithographic print in four colors, 1891). (b) Odd Nerdrum’s “Self-portrait as Baby” (Oil, 2000). Artists control detail as well as other features such as color and texture to focus a viewer on important features and create a mood. La Goulue’s swirling under-dress is a highly detailed focal point of the image, and contributes to the picture’s air of reckless excitement. Artists have a fair amount of latitude in how they allocate detail to create an effect. Nerdrum renders his eyes (usually one of the most prominent features in a portrait) in a sfumato style that makes them almost nonexistent. Detail is instead allocated to the child’s prophetic gesture. These choices change a common baby picture into something mysterious and unsettling.

are sometimes limited in scope, providing specific instructions for a particular narrow

problem. More often, guidelines are too broad and vague in their application. They

rely for their functioning on the judgment of the artist. The advice of artists and

designers often comes in the form of heuristics, rules of thumb to be taken with a grain

of salt, kept in the back of one’s mind, and applied when the moment seems right.

Becoming an expert in a visual field is often a question of cultivating, through practice

and observation, an instinctive sense of when to apply such rules, and conversely when

to break them.

Figure 1.2: Judith Schaechter’s “Corona Borealis” (Stained glass, 2001). Skillful artists use the formal properties and constraints of a medium for expressive purposes. The high dynamic range provided by transmitted light and the heavy black outlines of the lead caming that holds the glass together are used to set the figure off from the background, creating a powerful image of joy in isolation.

1.1.2 Psychology

A somewhat different approach is to study good design with the methodologies of psy-

chology, psychophysics and neuroscience. This is in essence an attempt to understand

good design from first principles: the functioning of the human mind and visual sys-

tem. Visual perception obviously mediates all information that passes from a display

to a user. So, as a form of visual communication, art must be constrained by the laws

of psychology and the visual system [Arnheim, 1988, Zeki, 1999, Ramachandran and

Hirstein, 1999]. This is an attractive idea. By understanding the strengths and weak-

nesses of the process that allows us to see, it should be possible to maximize use of the

limited cognitive bandwidth between a display and viewer.

This is perhaps not so far from what artists have done all along. One could view

every daub of paint, every pen stroke as an informal experiment in vision. Artists test

their actions against the evidence of their own visual systems, and make predictions

about how they will affect others. Formal attempts to understand perception and art

are simply more conscious, more systematic, and more interested in understanding the

creative process itself than making a statement through it. A number of psychologists

have speculated on this, and pointed to specific examples from art history [Arnheim,

1988, Leyton, 1992, Zeki, 1999, Ramachandran and Hirstein, 1999]. Studies have indeed

found empirical evidence of perceptual effects resulting from artistic style or composition

[Ryan and Schwartz, 1956, Locher, 1996].

Like most attempts to do anything complicated from first principles, looking at art

and design using cognition is hard. There is much that has been understood about the

visual system, but also much that is not. The more basic and low level the area of

visual function is, the more we know about it, and the less useful that information is

for design. Much, for example, is known about the physical mechanism of how we perceive

color; substantially less is known about how we parse shapes out of a background

and assemble them into objects. It’s not surprising that many researchers looking at art

from a cognitive standpoint consider primarily 20th century painters, like Mondrian,

Kandinsky, or even Picasso at his more abstract, who themselves were largely con-

cerned with the purely formal aspects of pictorial space rather than the semantics of

subject matter. The semantic aspects of vision which reference the rest of the world

and its non-visual aspects are ill understood, so little cognitive research can be brought

to bear on the semantics of art.

Given the limited basic knowledge, general theories of how art functions cogni-

tively are, almost of necessity, rather vague in their application. Ramachandran [1999]

for example, suggests that all art is guided by the peak shift principle. This principle,

found in a number of situations in psychology, says that if a response is trained to some

stimulus, the greatest, or peak, response will be found with a stimulus that is greater than

the one used in training. A depiction functions by emphasizing the features that nor-

mally let one know what it is. In this view all art is a form of caricature. However,

this does not tell us the qualities of a successful caricature. In another example, Leyton

[1992] argues that art maximally encodes a causal history that can be read by viewers.

Good art should contain as much information in the form of asymmetry as possible to

stimulate viewers, but not too much, which will disturb them. Though a reasonable

sounding standard, this only hints at what the correct level of complexity is.

The application of psychology to design is difficult. However, we do not need to

build a system directly on these principles. Inspired by them, we can apply knowledge

from low-level vision and computer graphics techniques to build practical systems.

1.1.3 Computer Graphics

A large body of work in computer graphics ignores all these difficulties and sets out

to create attractive synthetic art and illustration. Attempts at algorithmic definitions of

good design surface in a number of areas in computer science, graphics, scientific visualization,

document layout, human-computer interaction, and interface design. Concerns

of effective art-like visual communication have particularly come to the forefront

in the realm of non-photorealistic rendering, or NPR. This area is perhaps excessively

broad. It includes almost any part of graphics that aims to create images that are not an

imitation of reality. It includes things as diverse as computer generation of geometrical

patterns, instructional diagrams and impressionist paintings. NPR images run a gamut

between the purely ornamental and those designed to convey very specific information.

A large area of research in NPR has been the production of many, often quite impres-

sive, phenomenological models for rendering in various traditional media and styles.

There is however an increasing interest in NPR as not just a way to imitate traditional

visual styles, but also as a set of techniques for trying to display visual information in

a concise and abstract way.

The link between concise presentation and imitating traditional artistic styles is not

accidental. Almost all the visual styles of traditional media, line drawings, wood-block

prints, comics, expressionist or impressionist paintings, pencil sketches, necessarily

discard vast amounts of information as a direct consequence of their visual style. There

is, for example, no color or shading in a pure line drawing. However, these images still

carry the essential content that the artist (and viewer) requires of them. Skillful artists

can use the properties and constraints of a medium to enhance the expressiveness of

a work (see Figure 1.2). A brief time spent working with photo filters in a program

like Adobe Photoshop suggests that computer implementations of these styles capture

some of the effects of traditional media, but often in a way that does not adapt to

particular situations with an artist’s flexibility. Artists ultimately can judge their results

as they go. Applying a technique in a blanket manner is often less satisfactory. What

is acceptable as reality in a photograph can look fussy and crowded as a painting.

1.2 Our Goal

Though today’s algorithms cannot model the general intelligence of an artist, we argue

that carefully designed systems can make use of minimal user interaction to create

much more expressive images. Specifically, we look at modulation of local detail, an

important cue used in traditional art and visualization. Including detail only where it

is needed focuses viewer interest and can help clarify the point of an image. As well

as being a feature of art and illustration, applications in visualization could benefit

from this. It would allow the computer to hand-craft displays for clarity and efficient

understanding in a particular situation.

This work does not directly address specific visualization applications. Rather than

exploring visualization directly, art remains the focus, and this thesis remains firmly in

the realm of artistic NPR. Our hope, however, is that insights gained in this way should

be applicable to a number of areas in visualization. Art is a particularly good place to

explore the link between cognition and design of displays. Specific applications tend

to distract with their own implementation details and domain constraints. Radiology,

for example, is a domain where complexity and high stakes greatly constrain practical

applications. Art encourages a wider view, in which it is easier to look at general

techniques and patterns that are widely useful. Similarly, in evaluation, validation of

a particular system is of limited interest, while evaluation of more general techniques

can provide insights into cognition and be more widely relevant.

Grounding our work in knowledge of visual perception also helps focus attention

away from application engineering and towards general concepts. We are interested in

interactively efficient methods for achieving expressive NPR images. Knowledge of

visual perception suggests that by exploiting the visual system we can reserve human

effort for just the hardest parts of the process of crafting images, and pass the major-

ity of the work over to a computer. For a computer application, the hardest part of

abstraction is deciding what is important. This is not hard for people, since it is done

instinctively. Deciding what to paint a picture of is the easy part for an artist. It is the

mechanics of turning that intention into an image that takes training, time and effort.

This leads us to a simple, minimally interactive method for controlling detail via eye

tracking. As we will soon see, vision research leads us to believe that where people

look indicates importance. Such areas should be portrayed in detail. Conversely, what

viewers don’t look at is unimportant to them and can be removed or de-emphasized.

The same insights about vision that lead to this methodology also lead us to quantitative

methods for evaluation. If our approach is successful, increased interest in areas

highlighted with detail should be reflected in eye movements. This methodology holds

the promise of images that are carefully crafted for understanding on sound principles,

and can be formally evaluated for effectiveness. Such images and techniques can in

turn serve as a tool for further investigating human vision in a way targeted toward the

questions that are important for crafting images. With more information, even better

techniques and images can be built.

In this thesis we begin in Chapter 2 by laying out the basic problem of control-

ling detail in NPR imagery, and look at the range of techniques that have been used

to address it. In Chapters 3 and 4 we then review the basic background in human and

computer vision underlying our approach to this problem. The nature of vision leads

us to an approach of capturing the intentionality central to design via eye tracking.

Information about where people look alone is sufficient to control detail in a directed

way, allowing us to craft semi-automatic NPR images with much of the attractive and

engaging intentionality of completely hand made art. The basic nature of this interaction

is described in Chapter 5. In Chapters 6, 7 and 8 we then present several systems

for creating NPR renderings built on this idea, and discuss their strengths and weaknesses.

An evaluation of one of these systems is presented in Chapter 9, which not only

validates the general approach but gives some interesting insights into abstraction and

human vision. Finally, in Chapter 10 we discuss some directions for future research.

Chapter 2

Abstraction in Computer Graphics

In any work of art all parts of the picture plane do not receive equal attention from the

artist. Critical areas are more detailed, while others are left relatively abstract. This is

the case even in quite realistic styles, and in technical illustration. Such effects have not

been ignored in computer graphics and NPR. Local control of detail has been addressed

in several visual styles. Whatever the rendering techniques used, important areas can

be identified and depicted with greater detail, or emphasis on fidelity. Deciding what is

important is difficult to do automatically. Two broad approaches to selecting important

areas can be characterized: manual user annotation, and simple heuristics.

Figure 2.1: Direct placement of strokes. Complete control of abstraction is possible when a user provides actual strokes that are rendered in a given style. Reproduced from [Durand et al., 2001].

2.1 Manual Annotation

At one extreme, near complete control of detail can remain in the hands of a user.

This provides many expressive possibilities at the expense of much interaction. At its

Figure 2.2: Manual annotation for textural indication. Important edges on a 3D model are marked and have texture rendered near them, while it is omitted in the interior. Reproduced from [Winkenbach and Salesin, 1994].

Figure 2.3: Manual local importance images. Hand painted images can indicate important areas to be rendered in greater detail or fidelity. Reproduced from [Hertzmann, 2001].

furthest extreme the computer becomes merely a digital paintbrush the user directly

manipulates [Baxter et al., 2001]. A number of intermediate approaches exist that aid

the user in the technicalities of creating an image while still giving them complete

control over detail. The earliest work creating a painting-like appearance, or painterly

rendering effect [Haeberli, 1990] took this approach. A user places strokes entirely

by hand, their color being sampled from an underlying source image. The approach

is in effect a form of tracing, where the user ultimately remains in control of stroke

placement and size while, like a traditional media artist, making their own decisions

about which details are important as they go. A similar kind of interaction has been

used [Durand et al., 2001] in generating pencil renderings (see Figure 2.1). The user

places strokes which are shaded and shaped automatically to create a final drawing.

The same stroke based interactive methods are applicable in 3D [Kalnins et al., 2002].

One step distant from actually drawing strokes, it is also possible to indicate increased

importance for some areas of a rendering using an importance map, where

higher intensity indicates the need for more attention or detail in that area. For exam-

ple in a painterly rendering framework [Hertzmann, 2001], a hand drawn importance

map was used to indicate that a source image should be more closely approximated in

certain locations (see Figure 2.3). Similarly, in 3D, hand drawn lines have been used

[Winkenbach and Salesin, 1994] to indicate locations near which textural detail should

be included (see Figure 2.2). In another painterly rendering application [Gooch and

Willemsen, 2002] rectangles to be painted in greater detail could be drawn by hand.

Various digital versions of other media, such as pen and ink [Salisbury et al., 1994]

and watercolor [Curtis et al., 1997] have been developed that provide the user with a

significant control over the detail present in different areas. Such approaches can yield

attractive results, but require careful attention on the part of a user.

Figure 2.4: (a) Original image. (b) Corresponding salience map [Itti et al., 1998]. (c) Corresponding salience map [Itti and Koch, 2000]. Salience methods pick out potentially important areas on the basis of contrast in some space (not limited to intensity). The two methods pictured here differ in the method of normalization used to enhance contrast between salient and nonsalient regions.

2.2 Automatic Methods

More common in NPR have been purely automatic methods. Automatic methods also

run a gamut, from approaches that process an image in a completely local, uniform

manner to those that automatically extract some quantity from an image as a proxy for

importance. Uniform approaches perform some (not necessarily local) operation uniformly

across an image, and have been used extensively in painterly rendering [Hertzmann, 1998,

Litwinowicz, 1997, Shiraishi and Yamaguchi, 2000]. A global effect provides

users with only limited control. Rather than being truly uniform, some of these

approaches make a (largely implicit) simple assumption that some low level features

are important and worth preserving. Automatic painterly rendering methods for ex-

ample, largely assume strong high frequency features are important and should be

preserved in a rendering. In fact, painterly techniques vary largely in their method

for respecting these boundaries: aligning strokes perpendicular to the image gradi-

ent [Haeberli, 1990], terminating strokes at edges [Litwinowicz, 1997], or drawing in

a coarse-to-fine fashion [Hertzmann, 1998, Shiraishi and Yamaguchi, 2000, Hays and

Essa, 2004]. Similarly, automatic line drawing approaches (both 2D and 3D) assume

the importance of all lines that meet certain purely geometrical definitions, occluding

contours, creases [Saito and Takahashi, 1990, Interrante, 1996, Markosian et al., 1997],

and suggestive contours [DeCarlo et al., 2003]. Such techniques can create attractive

images, but lack the selective omission which gives art much of its expressive power.

The kind of omission commonly used in depicting specific objects can sometimes

be explicitly stated. In drawing trees, for example [Kowalski et al., 1999, Deussen and

Strothotte, 2000], detail can be omitted in the center of the tree, especially as the

tree is drawn smaller. Though this may be an accurate characterization of a particular

common style of depiction, it is not generally applicable to any subject.

For general images, there are relatively few options for automatically selecting

important areas. Some attempts have been made to predict importance using various

image analysis techniques. In 3D, image pyramids have been applied to omit detail in

the interior of a shape [Grabli et al., 2004]. In 2D, drawing on vision research, some

approaches have attempted to use salience measures to capture importance. Salience

measures are a guess at the ability of a feature to capture interest based on its low level

properties [Itti et al., 1998, Itti and Koch, 2000]. Similarly motivated salience measures

have been applied to attempt to predict features worth preserving in painterly rendering

[Collomosse and Hall, 2003]. Because faces are often an important component of

images, detecting them also provides a useful (though not always reliable) automatic

cue for what areas are important. Face detection has been used alongside salience

methods in other areas of graphics loosely related to NPR where identifying important

features is useful, such as automatic cropping [Chen et al., 2002, Suh et al., 2003] and

recomposing of photographs [Setlur et al., 2004].

2.3 Level Of Detail

An area of computer graphics left out in the above discussion has dealt with many of

these same issues. Various adaptive rendering and level of detail (LOD) schemes have

used the visibility or potential interest of features to skip computations that are unlikely

to be noticed. This is different from our goal. We are interested in detail modulation for

stylistic and expressive reasons. Level of detail seeks to control the computational cost

of rendering through approximation, not abstraction. Though both are concerned with

simplification, LOD and various other corner cutting is usually meant to be invisible,

or nearly so, while expressive abstraction is meant to be seen and indeed have a strong

effect on the way a viewer looks at an image. Though the goals are different, some

of the methodologies overlap. The goal of imperceptible omission has encouraged

researchers to look at perceptually motivated methods. Salience measures have been

applied to concentrate computation on noticeable areas [Yee et al., 2001, Cater et al.,

2003]. In addition, a variety of low level perceptual models have been applied to try

to quantify the visibility of features and guarantee that simplification is invisible, or

minimize visibility. We adopt several of these metrics in our own efforts. One of

our contributions can be seen as applying and expanding perceptual models originally

adopted in LOD to create expressive artistic abstraction.

Both perceptually motivated LOD methods and the methods we present in this

thesis use models of vision to identify expendable areas of an image. It is the functional

definition of an expendable area that differs between the two. In the following chapter

we present the relevant background in human vision necessary for understanding why

such areas exist, and how they may be identified.

Chapter 3

Human Vision

A background in human vision is essential in computationally defining artistic abstraction.

We have extraordinarily complex abilities to analyze images, but these abilities have

weaknesses and strengths. Level of detail simplification methods seek to exploit the

limits of vision to cut corners in an unnoticeable way. In contrast, we hope to use the

related strengths of the visual system to improve visual design, clarifying content and

making things that need to pop out, pop out. Our interactive technique uses eye movements

and the limits of vision to indirectly measure the importance of features. Some

background will clarify the motivation for this approach.

3.1 Eye Movements

The human eye is maximally sensitive over a relatively small central area called the

macula. This area of relatively high resolution is approximately 5 degrees across, while

the most sensitive region (the fovea) is only 1.3 degrees (from a total visual angle of

about 160 degrees) [Wandell, 1995]. Sensitivity rapidly degrades outside of this central

region. Our perception of uniform detail throughout space is a result of continually

switching the point at which our eyes are looking (the point of regard or POR).
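
To give a sense of scale, the angular sizes quoted above can be converted to extents on a display. The short sketch below does this conversion. The viewing distance (60 cm) and display density (38 pixels per centimeter) are assumed values chosen only for illustration, and the function name is hypothetical; none of them come from this thesis.

```python
import math

def visual_angle_to_pixels(angle_deg, distance_cm, pixels_per_cm):
    """On-screen extent (in pixels) subtended by angle_deg,
    for a region centered on the line of sight."""
    size_cm = 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)
    return size_cm * pixels_per_cm

# Assumed viewing setup (illustrative values, not from the text).
DISTANCE_CM = 60.0
PIXELS_PER_CM = 38.0

fovea_px = visual_angle_to_pixels(1.3, DISTANCE_CM, PIXELS_PER_CM)
macula_px = visual_angle_to_pixels(5.0, DISTANCE_CM, PIXELS_PER_CM)
print(f"fovea: about {fovea_px:.0f} px, macula: about {macula_px:.0f} px")
```

Under these assumptions the most sensitive region covers only a few dozen pixels of a typical display, which is why a viewer must move the eyes repeatedly to take in a whole image.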

This process involves two important types of eye motions: fixations, relatively long

periods spent looking at a particular spot, and saccades, very rapid changes of eye position.

These are not the only kinds of motion of which the eye is capable. In smooth

pursuit the eye follows a moving object, and even when fixated the eye continually

makes very small jittery motions. Fixations and saccades however are the most signif-

icant motions when viewing static scenery. Saccades can be initiated consciously, but

for the most part occur naturally as we explore a scene. Though fixating on a location

Figure 3.1: Patterns of eye movements of a single subject over an image when given different instructions. Note (1) free observation which shows fixations that are relatively dispersed yet still focused on relevant areas. Contrast it with (3) where the viewer is instructed to estimate the figures’ ages. Reproduced from [Yarbus, 1967].

is not identical to attending it, for the most part an attended location is fixated, (i.e. if

we pay attention to something, we strongly tend to look at it directly) [Underwood and

Radach, 1998].
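
The distinction between fixations and saccades can be made concrete with a simple velocity threshold applied to recorded gaze samples: slow intervals are labeled as fixation, fast ones as saccade. The sketch below is offered only as an illustration of that distinction. It is not a method described in this chapter, and the 30 degrees per second threshold, the 60 Hz sampling rate, the sample data and the function name are all assumptions.

```python
import math

def classify_intervals(points_deg, sample_rate_hz, velocity_threshold=30.0):
    """Label each interval between consecutive gaze samples as
    'fixation' or 'saccade' by its angular speed (degrees/second).

    points_deg: list of (x, y) gaze positions in degrees of visual angle.
    velocity_threshold: assumed cutoff, in degrees per second.
    """
    labels = []
    for (x0, y0), (x1, y1) in zip(points_deg, points_deg[1:]):
        speed = math.hypot(x1 - x0, y1 - y0) * sample_rate_hz
        labels.append("saccade" if speed > velocity_threshold else "fixation")
    return labels

# Hypothetical samples: small jitter, one rapid jump, then jitter again.
samples = [(0.0, 0.0), (0.01, 0.0), (0.02, 0.01), (5.0, 3.0), (5.01, 3.0)]
labels = classify_intervals(samples, sample_rate_hz=60.0)
print(labels)
```

A practical system would also need to merge consecutive fixation intervals, discard very short ones, and cope with tracker noise and blinks, none of which is attempted here.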

Figure 3.2: Similar effects to [Yarbus, 1967] are easily (even unintentionally) achieved when using eye tracking for interaction. Circles are fixations, their diameter is proportional to duration. The first viewer was instructed to find the important subject matter in the image. The second viewer was told to ’just look at the image’. The viewer assumed, from prior experience in perceptual experiments, that he was going to be later asked detailed questions about the contents of the scene. This resulted in a much more diffuse pattern of viewing.

3.1.1 Eye Movement Control

Qualitatively, a great deal is known about fixations. Eye movements are highly goal

directed. Viewers don’t just look around at random. Instead, they fixate meaningful

parts of images [Mackworth and Morandi, 1967, Underwood and Radach, 1998, Hen-

derson and Hollingworth, 1998], and fixation duration is related to processing [Just

and Carpenter, 1976, Henderson and Hollingworth, 1998]. Viewing is highly influ-

enced by task. The classic example of this [Yarbus, 1967] showed that viewers ex-

amining the same image, with different tasks to perform, showed drastically differ-

ent patterns of viewing, in which they focused on the features relevant to their task

(see Figure 3.1). Given the same task, the motions of a particular viewer over an

image at different viewings can be quite different, yet the overall distribution of fix-

ations remains similar [Yarbus, 1967]. In real activities, actions, even those thought

of as automatic, are usually preceded by (largely unperceived) fixations of relevant

features [Land et al., 1999]. These effects have been noted from some of the earliest

research in the field [Yarbus, 1967], but the mechanisms involved remain for the most

part informally understood.

In general, understanding of most higher-level aspects of eye movement control

is largely qualitative. In limited domains such as reading, attempts have been made

to formulate mathematical models of viewing behavior. For complex natural scenes,

much less is known [Henderson and Hollingworth, 1998]. Clearly any information

used in guiding eye movements must come from the scene. Likewise, the process of

selecting a new location to view must be guided in part by low frequency information

gathered from the periphery during earlier fixations. A matter of debate is whether low-

level visual information gained like this is a direct control of behavior or whether it is

primarily used when integrated into a higher level understanding. The precise factors

involved in control and planning of eye movements are an active and highly debated

topic [Kowler, 1990].

3.1.2 Salience Models

Much effort has gone into attempts to identify purely low-level image measurements

that can account for a significant amount of viewing behavior. Clearly it would be inter-

esting if what appears to be a highly complex behavior requiring general understanding

could be modeled or at least reasonably predicted by a simple approach. Results have

been mixed. Fixation locations do not correlate very well over time with the presence

of simple low level image features such as areas of high contrast, junctions, etc.

[Underwood and Radach, 1998].

More complex models have been formulated, such as the salience methods mentioned

earlier. All measure contrast in one sense or another. In general, salience methods

embody the assumption that unusual features are likely to be important and looked

at. Choice of feature space, and scale of measurement and comparison differ. One

popular approach [Itti et al., 1998, Itti and Koch, 2000] uses center surround filters to

measure local contrast in color, orientation and intensity to model general viewing be-

havior. [Rosenholtz, 2001] uses a probabilistic framework to measure the probability of

a feature given a Gaussian model of color or velocity in the surround. This was used to

predict visual search performance. A related salience framework was proposed [Walker

et al., 1998] to select unique image locations to match for image alignment. This ap-

proach used kernel estimation to measure the rarity of local differential features in the

global image wide distribution of those features.
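
The common idea behind these measures, that a location is salient when it differs from its surroundings, can be illustrated with a deliberately crude sketch. The code below compares each pixel's intensity with the mean of its immediate neighbors. It is a single-scale, intensity-only stand-in and does not reproduce the multi-scale, multi-feature filters or the normalization of any of the cited models; the function name, neighborhood radius and test image are assumptions.

```python
def center_surround_contrast(image, radius=1):
    """Absolute difference between each pixel and the mean of its
    neighbors within `radius` (the pixel itself is excluded).

    image: list of rows of float intensities.
    """
    height, width = len(image), len(image[0])
    result = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    if (yy, xx) != (y, x):
                        total += image[yy][xx]
                        count += 1
            # A 1x1 image has no neighbors; treat it as zero contrast.
            surround = total / count if count else image[y][x]
            result[y][x] = abs(image[y][x] - surround)
    return result

# Hypothetical test image: a single bright pixel on a dark field.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
salience = center_surround_contrast(image)
```

In this toy image the isolated bright pixel receives the highest score and uniform areas score zero, which is the behavior the assumption "unusual features are likely to be important" calls for.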

These approaches share the same basic idea but vary in what they attempt to model.

This raises the question of what one is really trying to capture with salience. One can

look at salience as simply a quantitative method of deciding whether something is

present in a particular location in the visual field. In this context, salience doesn’t actu-

ally state the location is important, just that it might be because something is there. It

seems quite plausible that a measure like this plays a role in perception. However, more

is usually claimed for salience, for example that it predicts most of viewing behavior

or the valuable content in an image.

Salience would seem to have some additional predictive power because in a wide

class of images the semantically important subject does contrast with the rest of the

scene. Relatively few people take pictures of their family members dressed in camou-

flage and lurking in the bushes. Nobody takes a picture of a leaf of grass in a field. The

tendency of meaningful features to be visually prominent is by no means universal. It

is also unclear if this is really a property of the world, or a property of pictures people

take, but it does seem to underlie some of the success of salience as an engineering tool

in graphics.

Salience models have also been used to model viewing in narrower domains where

their applicability is more clear. The presence or absence of pop out effects in search

for example [Rosenholtz, 1999, Rosenholtz, 2001] is effectively modeled by simple

salience models that measure how distracting a distracter actually is.

Debate about how useful salience is in understanding general viewing is ongoing.

Some optimistically state that salience predictions correlate well with real eye motions

of subjects free viewing images [Privitera and Stark, 2000, Parkhurst et al., 2002]. Others

are more doubtful and claim that when measured more carefully and in the context

of a goal driven activity, the correlation is quite poor [Land et al., 1999, Turano et al.,

2003]. This mismatch in experimental results fits the intuition that visually promi-

nent, ’eye catching’ features might be more correlated with idle exploration of a scene,

and much less related to eye movements made during a task. In spite of this contro-

versy, salience methods are quite popular and have seen a fair amount of application

in computer graphics. They show some correlation with visually prominent features

and are fairly simple to implement. Code for some is publicly available. Clearly both

semantics and low-level features play a part in eye movements. Further investigation

is necessary to clarify the contributions to viewing behavior of salience and scene se-

mantics. Though they seem unable to model important aspects of viewing behavior,

salience models may provide important measures of visual prominence.

3.2 Eye Tracking

Much of the knowledge above about human eye motion has been gained through the

use of eye-tracking. A system measures a viewer’s eye in one of several manners

and records the point where it is looking, termed the point of regard or POR. One

common approach involves a video camera and an infrared light source. The relative

positions of the pupil and corneal reflection in the resulting image are used to calculate

point of regard [Duchowski, 2000]. These systems are reasonably reliable and accurate

and improve with each generation, though they are still subject to drift over time and

variability between viewers. The same technology is used in producing units that sit

in front of a fixed display, and in head mounted units for use in more general scenes.

Video based trackers have the virtue of not interfering directly with a viewer, making

them useful as both a natural interactive method and a research tool.

Outside of research in human vision, eye-trackers have seen increasing use as a

mode of human computer interaction. Eye tracking has also enabled the use of eye movements

as a gauge of cognitive activity for psychological investigations and for evaluation of

visual displays.

Eye position has been used as a cursor for selection tasks in a GUI [Sibert and Jacob,

2000]. It has also been used to indicate a user’s attention to others in a videoconferencing

environment [Vertegaal, 1999]. Another class of use, related to ours, uses

POR to control simplifying images or scenes for efficiency purposes. Knowing where

a user looks enables pruning of information that is not perceptible, and need not be

transmitted in a video stream [Duchowski, 2000]. Similarly, unexamined content need

not be rendered in a 3D environment. In practice, few current systems that make use

of such simplification actually use eye tracking, presumably because of limited availability;

head tracking is typically used instead [Reddy, 2001].

On the whole, eye tracking has been found more useful in interaction where it

serves as an indirect measure of user interest. Eye movements are not under full vol-

untary control. Because of this, when viewers attempt to explicitly point with their

eyes the result tends to lack control and suffer from the so called “Midas Touch” prob-

lem [Jacob, 1993] where struggling to control eye position, like a cursor, based on

visual feedback creates even more uncontrolled looking, touching on many irrelevant

or undesirable locations.

The same involuntary link of eye movement to thought processes that makes eye

tracking a bad mouse has made it useful as an indirect measure of interest and cognitive

activity. Eye tracking has been used to evaluate the effectiveness of informational

displays including application interfaces [Crowe and Narayanan, 2000], web

pages [Goldberg et al., 2002], and air traffic control systems [Mulligan, 2002]. As

mentioned earlier, eye movements may even reveal information that viewers are trying

to report, but cannot, because it is not consciously available. Experiments have shown

that professional radiologists examining slides look longer at locations where tumors

are present, even when they fail to identify and report them [Mello-Thoms et al., 2002].

In the future, this might hold the promise of computer assisted technologies to avoid

such mistakes. Several consulting companies currently sell evaluation services using

eye tracking to graphic design houses and web content creators among others [footnote 1].

3.3 Limits of Vision

Eye movements are related to the resolutional limitations of the eye. At any of the fix-

ations with which a viewer explores a scene, the most detailed information is received

only in the fovea, but lower frequency information is received throughout the visual

field. These limits on sensitivity within the visual field are not a weakness of the visual

system. On the contrary, they are part of our ability to efficiently process wide fields

of view and integrate information across eye movements and changes in viewpoint.

3.3.1 Models of Sensitivity

Quantitative models of visual acuity and contrast sensitivity have been developed to

model sensitivity to stimuli with different properties. Models of acuity predict whether

an observer can detect a black feature of a particular size on a white background. Con-

trast sensitivity measures an observer’s ability to discriminate a repeating pattern of a

particular contrast and frequency from a uniform gray field. The drop-off in these sensitivities

away from the visual center is modeled as a function of eccentricity, location

relative to the point of fixation.

Contrast sensitivity has been studied extensively in a variety of conditions usually

using monochromatic sinusoidal gratings (smoothly varying, repeating patterns of light

and dark bands). This sensitivity declines sharply with eccentricity [Kelly, 1984, Mannos

and Sakrison, 1974, Koenderink et al., 1978]. Contrast threshold is defined as the

[Footnote 1: http://www.eyetools.com, http://www.factone.com, http://www.veridicalresearch.com]

(unitless) contrast value (0 to 1 with 1 being maximal contrast) at which a grating and

uniform gray become indistinguishable. Contrast sensitivity is the reciprocal of this

value.

Figure 3.3: Log-log plot of contrast sensitivity from equation (3.2), with frequency (cycles/degree) on the horizontal axis and inverse contrast on the vertical axis. The curve separates a region labeled visible from a region labeled invisible. This function is used to define a threshold between visible and invisible features.

Many researchers have empirically studied human contrast sensitivity and several

have developed mathematical models from their data. Researchers in computer science

have also used existing data and models in applications. Different aspects of a stimulus

are important in different situations. Fitting models to data collected from different

viewers under different circumstances gives somewhat different results. Two examples

are given here to illustrate the form these mathematical models take.

Kelly [1984] developed a mathematical model for the contrast sensitivity curve (at

the center of the visual field) including appropriate scaling factors describing the effects

of velocity (v) as well as frequency (f in cycles/degree) of a grating on sensitivity.

$$A(f, v) = \left(6.1 + 7.3\left(\log_{10}(v/3)\right)^{3}\right) v f^{2} e^{-2f(v+2)/4.59} \qquad (3.1)$$

Mannos and Sakrison [1974] fit a mathematical model appropriate to still imagery

to results of prior empirical studies for use as a metric in evaluating image compression.

$$A(f) = s_{\max}\, 2.6\,(0.0192 + 0.144f)\, e^{-(0.144f)^{1.1}} \qquad (3.2)$$

where $s_{\max}$ is the peak contrast sensitivity (this is around 400, but varies from

person to person).
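
As a rough illustration of how equation (3.2) can serve as a visibility threshold, the sketch below evaluates the function with the constants as printed above and tests whether a grating of a given contrast exceeds the contrast threshold, taken as the reciprocal of sensitivity as defined in the preceding section. The function names and the sample contrasts and frequencies are assumptions made for this example.

```python
import math

def contrast_sensitivity(f, s_max=400.0):
    """Equation (3.2) with the constants as printed in the text.

    f: spatial frequency in cycles/degree (must be non-negative).
    s_max: peak contrast sensitivity (about 400, varies by person).
    """
    return s_max * 2.6 * (0.0192 + 0.144 * f) * math.exp(-((0.144 * f) ** 1.1))

def is_visible(contrast, f, s_max=400.0):
    """A grating is treated as visible when its contrast (0 to 1)
    exceeds the threshold 1 / A(f)."""
    return contrast * contrast_sensitivity(f, s_max) > 1.0

# Hypothetical gratings with 1% contrast at a mid and a high frequency.
mid_visible = is_visible(0.01, 6.25)
high_visible = is_visible(0.01, 60.0)
print(mid_visible, high_visible)
```

With these constants the curve peaks near 6 cycles/degree at a value close to s_max, so a low contrast grating is detectable there but not at very high frequencies, which matches the hump-like shape plotted in Figure 3.3.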

3.3.2 Sensitivity Away from the Visual Center

A number of researchers have explored how sensitivity varies with eccentricity [Kelly,

1984,Rovamo and Virsu, 1979]. At larger eccentricities (expressed in degrees of visual

angle) the contrast sensitivity function is multiplied by another function which models

the drop-off of sensitivity in the visual periphery. This function is termed the cortical

magnification factor. It is not radially symmetric, but drops off faster vertically than

horizontally. It can be approximated [Rovamo and Virsu, 1979] with separate formulas

for decrease in sensitivity in four areas. For simplicity a bound from the most sensitive

area can be used in estimating visibility [Reddy, 2001,Reddy, 1997].

$$M(e) = \frac{1}{1 + 0.29e + 0.000012e^{3}} \qquad (3.3)$$

The cubic term can usually be ignored, as its contribution in the range of eccentricities

normal in a screen display is negligible [Reddy, 1997]. The contrast sensitivity is then

M(e) ·A( f ).
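
The combined model can be sketched in a few lines. The code below evaluates equation (3.3), with and without the cubic term, and multiplies it by the still-image sensitivity of equation (3.2) as described above. The constants are those printed in the text; the function names and the sample eccentricities are assumptions made for illustration.

```python
import math

def contrast_sensitivity(f, s_max=400.0):
    """Equation (3.2); f in cycles/degree, non-negative."""
    return s_max * 2.6 * (0.0192 + 0.144 * f) * math.exp(-((0.144 * f) ** 1.1))

def cortical_magnification(e, include_cubic=True):
    """Equation (3.3); e is eccentricity in degrees of visual angle."""
    cubic = 0.000012 * e ** 3 if include_cubic else 0.0
    return 1.0 / (1.0 + 0.29 * e + cubic)

def sensitivity_at(f, e, s_max=400.0):
    """Contrast sensitivity at eccentricity e, taken as M(e) * A(f)."""
    return cortical_magnification(e) * contrast_sensitivity(f, s_max)

# Effect of dropping the cubic term at an assumed eccentricity of 20 degrees.
with_cubic = cortical_magnification(20.0)
without_cubic = cortical_magnification(20.0, include_cubic=False)
relative_error = abs(with_cubic - without_cubic) / without_cubic
print(f"relative error from ignoring the cubic term: {relative_error:.3f}")
```

At 20 degrees the two versions differ by under two percent, which is consistent with treating the cubic term as negligible over the eccentricities found on a screen display.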

3.3.3 Applicability to Natural Imagery

Some caution is necessary in applying these models derived from simple monochro-

matic repeating patterns to complex natural imagery. Though these models have been

applied with good results in graphics [Reddy, 2001], our goal of creating visible ab-

straction rather than conservative level of detail is more ambitious, and more likely to

stress the models involved.

[Plot: cortical magnification (vertical axis, 0 to 1) against eccentricity in degrees (horizontal axis, 0 to 100).]

Figure 3.4: Cortical magnification describes the drop-off of visual sensitivity with angular distance from the visual center.

How to measure contrast is relatively obvious in gratings: there are only two extrema,

so a single contrast exists for the entire grating. Between two regions in a scene,

the meaning of contrast is less clear. Regions are neither uniform in color nor uni-

formly varying. No strong perceptually motivated approach to this problem appears to

have been formulated. Lillesaeter [1993] attempts to address this by defining a contrast

between a nonuniform figure and ground. This contrast measure is a weighted aver-

age of the contrast between the region and background and the integral of the contrast

along the edge of the region. This is demonstrated to provide more intuitive results

than simpler alternatives on regions with flat colors. Issues related to sampling in real

images are not addressed. Measuring contrast in a color image presents another prob-

lem. Contrast in colored gratings has been studied, and much work has been done in

general on color perception. However, there does not appear to be a simple general

contrast sensitivity model defined in color space [Regan, 2000]. Adapting a luminance

based model therefore remains a plausible course of action in designing a model for a

practical application.

Applying the notion of visibility for a grating to a non-repeating pattern of regions

also presents problems. The hump-like shape of the contrast sensitivity curve tells us

something counterintuitive if the size of an area is treated as proportional to an inverse

frequency [Reddy, 2001]. Very low frequencies are much less visible than some higher

ones at a given contrast. This is because detectability of a grating is related to the

density of receptive fields of corresponding size. There are upper bounds on the size of

human receptive fields. Intuitively, a large slowly varying sine wave may be difficult

to see.

This has been less of a concern in previous work where judgments were being made

mostly about high frequency parts of the curve [Reddy, 2001], but will be noticeable

when visibly abstracting images.

It can be argued [Reddy, 1997] that natural images, at least in places (and certainly

the uniform color regions that we will ultimately use in rendering) more closely resem-

ble square wave, rather than sine, gratings. Since a square wave can be approximated

by the sum of an infinite sequence of sine waves, and sensitivity to combined sinu-

soidal patterns is closely related to that of the independent components [Campbell and

Robson, 1968], one might think the visibility for low frequency square waves would

be higher than that for equal frequency sine waves. The actual relation has been stud-

ied empirically [Campbell and Robson, 1968] and confirms this intuition. For square

waves at frequencies below about 1 cycle/degree sensitivity levels off rather than drop-

ping. A theoretical derivation of the difference is presented in [Campbell and Robson,

1968]. It matches some but not all features of the empirical data.

These concerns remind us that when applying these models to real images they

cannot serve as an accurate absolute perceptual measure of visibility. Rather, they

provide a plausible relative sense of the visibility of different features. The absolute

contrast or acuity threshold at which a feature becomes visible is not necessary for our

application. What is important is the relative ordering of feature visibility, which allows

us to create a prioritization. It is necessary to model visual sensitivity only up to the

level where results correspond to our intuitions about this prioritization.

To apply these models in actual scenes, we need to decide on a definition of the

features whose visibility we are judging with these methods. For example, these mod-

els have been used in 3D level of detail [Reddy, 1997] to avoid rendering invisible

features. In this context the obvious choice of feature is a polygon which may or may

not be included in the rendering. For images the choice is less clear, as image prop-

erties can be measured in an unstructured, local way or an image can be partitioned

into a more structured representation. We review some of the possibilities for image

representation in the following chapter.

Chapter 4

Vision and Image Processing

4.1 Image Structure Features and Representation

Figure 4.1: (a) Scale space of a one-dimensional signal. Features disappear through scale space but no new features appear. (b) Plot of inflection points of another one-dimensional signal through scale space. Reproduced from [Witkin, 1983].

Image representation and processing is a large field of relevance in both human and

computer vision. We concentrate on some basic concepts relevant to the task of simpli-

fying images. Scale space theory provides a way of characterizing the different scales

of information present in an image and making correspondences between features at

different scales. Segmentation divides an image into distinct regions, enabling an ex-

plicit, non-local representation of image content. Edge detection provides a measure

of the prominent boundaries in an image.

An important unifying concept in image analysis is that the same image data can

be represented in many forms. In any of these, certain information in the image is

explicit and other information is less easy to access [Marr, 1982]. The appropriate

information and representation are task dependent. A variety of representations with different

properties are available. With the exception of 3D techniques, NPR applications have

largely used low-level representations, often functioning locally on the original image

itself. However, human artistic processes operate on richer representations. Ruskin,

one of the 19th century’s most prominent art historians and theorists, famously argued

that in teaching art technique, the most important lesson was teaching the student to

see [Ruskin, 1858]. There seems to be an assumption in image based NPR that seeing

is simply capturing a bitmap representation of the scene, and that it can be considered

accomplished in the presence of a source photo. Human vision however is much more

than simply capturing an image. If a computer is to produce artistic renderings that

capture some of the expressiveness of real art, especially in highly abstracted styles,

some higher level representation is necessary, analogous to that created in the artist’s

head as she understands the scene before her and begins to paint. The better suited

this representation is to the task, the easier it should be to drastically simplify an image

while retaining its important features.

The lowest level representation is the image itself, analogous to the retinal image.

This is the starting point of any further representation, making explicit the light inten-

sities at each pixel. There is structure here that can be more explicitly represented in

other ways. Information in the image exists over a variety of scales, small and large

features, making up parts and whole objects in the scene.

One common way to come to terms with the multiple scales of information in an

image is through its scale space. From a single image, a three dimensional stack of

images is generated in which each contains progressively coarser scale information.

Again, this representation has an analogue in human vision where neurons have recep-

tive fields of different sizes, in effect generating a multi scale representation from the

retinal image.

Scale space has come to refer to such a space of increasingly simple images gener-

ated by a range of processes. Generically this can be thought of as a stack of images

with decreasing information contained at each level as scale increases. This stack is

in theory continuous, in practice sampled at some discrete interval. Starting with the

original image, detail is progressively lost until a uniform color is all that remains (see

Figure 4.1).

A number of constructions for such a space have been developed. Perhaps the sim-

plest approach creates something like an image pyramid, successively downsampling

the image so it is more coarsely pixelated. This approach has a problem in that de-

tailed, high frequency information (the edges between the new larger pixels) may have

been introduced which was not in the original image. This is the problem of spurious

resolution [Koenderink, 1984]. New information has been hallucinated into existence

by imposing a coarser grid structure on the data. Convolution with a Gaussian kernel

(blurring) generates a space that avoids this problem [Witkin, 1983, Koenderink, 1984].

In fact, this blurring has been proven [Koenderink, 1984] to be the unique way to generate

a scale space which is both uniform, or uncommitted (i.e., the process is uniform

across image space and through the scale dimension), and also avoids spurious resolution.

Information disappears but cannot be created. In one dimension, this ensures

that any feature will only disappear as scale increases. In two dimensions, new features

(maxima, for example) can appear. However, in both cases clear judgments can be made

about what features exist at what range of scales.

That the process of blurring is uniform is an advantage in that filtering can be

applied to any signal; one doesn’t need to have a model of which important features

are present. A disadvantage is that coarser features are more coarsely located: the

blurring process that reveals them distorts their spatial extent.

If you know what you’re looking for, there is no reason why the blurring operation

must be uniform or uncommitted. A number of nonuniform or nonlinear scale spaces

have been formulated which do not introduce false content but remove information

selectively in certain locations. One of the best known of such methods is anisotropic

diffusion [Perona and Malik, 1990]. Here the rate of diffusion is not uniform but rather

a decreasing function of the magnitude of the gradient at any position. This results in

an edge preserving blurring which removes low contrast detail while preserving strong

edges. This has the advantage that edges are better preserved in their initial location

until the point at which they disappear. Niessen et al. [1997] compare this and several

other nonlinear methods in the context of segmentation. Nonlinear methods perform

well but are significantly more expensive.
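To make the idea of edge-preserving blurring concrete, here is a minimal one-dimensional sketch in the spirit of anisotropic diffusion. The exponential conductance function, the parameter names (kappa, step, iterations) and their default values are assumptions chosen for illustration; they are not taken from the systems described in this work.

```python
import math

def perona_malik_1d(signal, iterations=50, kappa=0.1, step=0.2):
    """Explicit 1D diffusion whose rate falls as the local gradient grows.

    kappa sets the gradient magnitude treated as an edge; step must stay
    at or below 0.5 for this explicit scheme to remain stable.
    """
    u = list(signal)
    n = len(u)
    for _ in range(iterations):
        # Flux between each pair of neighbors: conductance times gradient.
        flux = []
        for i in range(n - 1):
            grad = u[i + 1] - u[i]
            conductance = math.exp(-((grad / kappa) ** 2))
            flux.append(conductance * grad)
        # Each sample changes by the net flux; the ends have no outside neighbor.
        updated = []
        for i in range(n):
            right = flux[i] if i < n - 1 else 0.0
            left = flux[i - 1] if i > 0 else 0.0
            updated.append(u[i] + step * (right - left))
        u = updated
    return u

# A unit step edge with a small high-frequency ripple on both sides.
noisy_step = [0.02 * (-1) ** i for i in range(20)] + \
             [1.0 + 0.02 * (-1) ** i for i in range(20)]
smoothed = perona_malik_1d(noisy_step)
```

In this example the low contrast ripple is smoothed away while the large step between the two halves stays in place, because the conductance at the step is effectively zero.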

A practical application must sample the continuous scale space at some discrete

intervals. One would like to sample sufficiently finely to capture interesting events,

the order of disappearance of different features, but not more densely than need be.

Looking at the linear scale space, Koenderink [1984] derives an appropriate sampling

as logarithmic along the scale axis corresponding to a uniform sampling in the scale

parameter t, the standard deviation of the Gaussian kernel used in blurring. This is

intuitive. At small scales many tiny regions are merging quite often, requiring dense

sampling. At higher scales, there are fewer regions, fewer events to capture, and much

less dense sampling in t is required. The issue is the same for nonlinear spaces.

Relating scales in different spaces is not straightforward. Some attempt at doing this has

been made in [Niessen, 1997].

Figure 4.2: Interval tree for a 1D signal illustrating decomposition of the signal into a hierarchy. Reproduced from [Witkin, 1983].

While a scale space such as this begins to capture structural relations of features

across scales, this is still largely an implicit representation. To make this explicit,

features at different scales need to be directly related to each other. Witkin [1983]

addresses this problem in 1D signals. In the scale space of a one-dimensional signal,

new features never appear at coarser scales. So, any features found at a coarse scale

can (if the sampling is dense enough) be traced directly back to their fine scale origin.

This allows localization of features found at a coarse level. Witkin demonstrates this,

choosing as features the zero crossings of the second derivative, i.e., inflection points in the

signal (Figure 4.1).
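The behavior of inflection points through a linear scale space can be sketched in a few lines of Python. This is an illustrative toy, not Witkin's procedure: the periodic test signal, the kernel truncated at four standard deviations, and the particular scales sampled are all assumptions made for the example.

```python
import math

def gaussian_kernel(sigma):
    """Sampled Gaussian, truncated at four standard deviations and normalized."""
    radius = int(math.ceil(4 * sigma))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_periodic(signal, sigma):
    """Convolve a periodic signal with the Gaussian, wrapping at the ends."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    return [sum(w * signal[(i + k - radius) % n] for k, w in enumerate(kernel))
            for i in range(n)]

def count_inflections(signal):
    """Count sign changes of the discrete second derivative around the loop."""
    n = len(signal)
    second = [signal[i - 1] - 2 * signal[i] + signal[(i + 1) % n] for i in range(n)]
    return sum(1 for i in range(n) if second[i] * second[(i + 1) % n] < 0)

# A coarse wave with a finer wave superimposed; the quarter-sample offset
# keeps zero crossings from landing exactly on a sample.
n = 128
signal = [math.sin(2 * math.pi * (i + 0.25) / n)
          + 0.3 * math.sin(16 * math.pi * (i + 0.25) / n) for i in range(n)]
sigmas = [1, 2, 4, 8, 16]
counts = [count_inflections(blur_periodic(signal, s)) for s in sigmas]
```

The counts shrink as the scale grows: the many inflection points contributed by the fine wave vanish, leaving only those of the coarse wave, and none are created along the way.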

Similarly, using these correspondences across scale it is also possible to create

a structure that captures the relationship between all features at all scales. Intervals

between two zero crossings (which again correspond to sections of the signal between

two inflection points) disappear in only one way. Two successive zero crossings merge

together, with the result that three intervals, the one between the crossings and those on

either side, merge into one. These three intervals can be made children of the resulting

interval to create an interval tree which characterizes the structure of the signal at all

scales. Witkin observes that those intervals which have longer persistence through

scale space appear to be those identified by human observers as subjectively salient or

important in the signal.

Extending this nice analytical derivation to a practical application in 2D is not

trivial. In 2D, features such as maxima, or curves defined by inflection points, may split

into two at coarser scales. Koenderink [1984] suggests the use of equiluminance curves

in the image as a 2D equivalent to Witkin’s intervals. Generic equiluminance curves

form a single closed curve. There are two singularities: extrema where the curve is

just a point, and saddle points where the curve forms multiple loops which intersect at

one point. Each loop may contain other saddle points and has to contain at least one

extremum [Koenderink and van Doorn, 1979]. The nesting of these saddle points gives

the structure of the image regions. Though new saddle points may appear inside a loop,

centermost saddle points must disappear before outer ones. Because of this, the saddle

points present at all scales can be represented as a tree. Such a structure is difficult to

calculate in practice. It is not obvious how to find these saddle points efficiently or if

they provide a subjectively intuitive partitioning of the image. In addition it’s not clear

how color could be handled. In a naive approach, each band would produce its own

surface with its own saddle points, resulting in 3 separate scale space trees that would

need to be unified in some way.

4.2 Segmentation

The process described above of dividing up a signal based on the intervals between

features is a particular approach to the general problem of segmentation. This problem

again occurs in both computer and human vision. Segmentation makes explicit the

association (or disassociation) between different areas of an image. It produces an ex-

plicit representation of parts of the image that are associated with each other, assigning

each pixel to one, usually connected, group or region. These regions should be uniform

by some measure. Separate regions, at least the adjoining ones, should be markedly

different. How people do this, parsing shapes and objects from the background, is only

partially understood. In computer vision, a tremendous variety of methods have been

devised to define similarity measures for this using color, gray scale intensity, texture,

etc. This segmentation is usually a partitioning of an image at a single scale. However,

it is sometimes desirable to define a segmentation over a range of scales.

Scale space has been considered in segmentation. It is typically used to make segmentations

produced with other methods more robust. Niessen et al. [1997] link pixels

with neighbors that have similar color in both the spatial and scale dimensions

to create a hierarchy. The end product is a single flat segmentation taking its set of

regions from a coarse scale and their spatial extent from a fine scale. A similar approach

is taken in Bangham et al. [1998]. Here, the desire is to create a hierarchical

segmentation tree that describes the image over a variety of scales. An alternate ap-

proach [Ahuja, 1996] creates a multi-scale representation without explicitly generating

a scale space.

Each of these methods computes a hierarchical representation of image structure.

However, there is no clear relation between the hierarchy and the theoretical hierarchy

induced by scale space. This is not a major concern; scale space structure is attractive

because of its simple formal definition, but is not the single correct answer in any

meaningful sense for a given practical application. Hierarchical representations are

not general purpose; desirable properties depend on the application. For the purposes

of image abstraction, an important question is whether each subtree in the structure

represents some coherent area or region. This is guaranteed in some geometric sense

by scale space proper, since nodes occur in the tree only when features disappear. In

contrast, methods for building a hierarchy that iteratively merge regions may have

many intermediate nodes that consist of fragments of regions that have not yet all

merged together. Such hierarchical representations are harder to use directly for tasks

that require meaningful regions, such as image abstraction.

Scale space captures the structure of intensities in the image, not the structure of

what is pictured. In some cases these may correspond closely (e.g., eyes, nose, and

mouth in a head), but this is not necessarily the case. A hierarchy that corresponds to

actual objects in the image would allow abstraction on an object by object basis. This

is a difficult problem. Some attempts have been made to rearrange subtrees based on

color (so for example a hole in an object would be associated with the background,

not the object) [Bangham et al., 1998]. Any general solution would need to draw on

analogues of high-level vision tasks that do not currently exist.

4.3 Edge Detection

Edges are another important image feature. A region is a uniform area. An edge is

the boundary or discontinuity between uniform regions. Edges are important in human

vision; they are one of the low level features built into the visual system. Edges are

commonly detected using derivative filters. Discontinuities produce a filter response

that can be processed to extract edges as chains of positions. Like regions, edges

themselves exist at a number of scales. Edges at different scales are usually detected

by using derivative filters of different widths or, equivalently, by first convolving the

image with a blurring kernel. A popular procedure for performing the detection and

thresholding is the Canny edge detector [Shapiro and Stockman, 2001]. An interesting

modification of this procedure adds a measure of how closely the local image

resembles an edge model [Meer and Georgescu, 2001]. This allows faint edges to be

captured while false alarm responses from texture can be suppressed. As with regions,

larger scale features are detected with less spatial resolution. Nonlinear anisotropic dif-

fusion [Perona and Malik, 1990] was originally proposed as a better method of blurring

images for coarse scale edge detection since it removes fine scale detail while better

preserving the shape and position of high contrast low frequency edges.
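The role of filter width in edge detection can be illustrated with a small one-dimensional sketch. It applies a derivative-of-Gaussian filter and keeps local maxima of the response above a threshold. This is only the filtering and peak-picking idea, not the Canny detector, and the function names, border handling and threshold values are assumptions made for the example.

```python
import math

def derivative_of_gaussian(sigma):
    """Sampled first derivative of a Gaussian, truncated at four sigma."""
    radius = int(math.ceil(4 * sigma))
    return [-k / sigma ** 2 * math.exp(-0.5 * (k / sigma) ** 2)
            for k in range(-radius, radius + 1)]

def edge_response(signal, sigma):
    """Convolve the signal with the derivative filter, replicating the borders."""
    kernel = derivative_of_gaussian(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    response = []
    for i in range(n):
        acc = 0.0
        for idx, w in enumerate(kernel):
            j = min(max(i - (idx - radius), 0), n - 1)
            acc += w * signal[j]
        response.append(acc)
    return response

def detect_edges(signal, sigma, threshold):
    """Indices where the absolute response is a local maximum above threshold."""
    r = [abs(v) for v in edge_response(signal, sigma)]
    return [i for i in range(1, len(r) - 1)
            if r[i] > threshold and r[i] > r[i - 1] and r[i] >= r[i + 1]]

step = [0.0] * 30 + [1.0] * 30               # one large-scale edge
bump = [0.0] * 20 + [1.0] * 2 + [0.0] * 20   # a feature two samples wide
```

With a narrow filter the two sides of the small bump are each reported as an edge, while a wide filter with the same threshold reports nothing there: the feature has dropped out at the coarser scale.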

Edges and regions are related to each other. A region identifies a homogeneous

area. An edge indicates a break between two areas. Since edges and regions are closely

related but their identification usually draws on different measures, each feature can be

used to improve results from the other [Christoudias et al., 2002]. These two features,

regions and edges, provide a fairly complete representation of image content, one that

knowledge of human vision suggests is biologically important. These features are the

building blocks on which our rendering techniques will be built.

Chapter 5

Our Approach

The ideas and techniques above suggest a particular path for achieving minimally inter-

active abstraction in computer renderings. Eye tracking can serve as a bridge between

the computer and a user’s interests and intentions. As the evidence discussed above

suggests, locations of a viewer’s fixations can be reasonably interpreted as identifying

content that is important to that viewer. Preserving detail where the viewer looked

and removing it elsewhere should create a sensible NPR image that captures what the

viewer found important. The nature of abstraction in art suggests this is worthwhile

because it may help future viewers see the main point of the image more easily. It may

even be able to steer successive viewers into the same patterns of viewing as the original

viewer, encouraging a particular interpretation of the image.

5.1 Eye tracking as Interaction

Eye tracking is used in our work as a minimal form of interaction to answer the ques-

tion of what is important in an image. Because eye movements are informed by seman-

tic features and fixations are economically placed, they provide a guide to important

features that require preservation. Because they are natural and effortless, they are a

desirable modality for interaction. Abstracting an image becomes as simple as looking

at the image, an action that requires no conscious effort.

Interaction in all of our systems proceeds in the same way. A viewer simply looks

at a photograph for a set period of time, usually around five seconds, while their eye

movements are monitored by an eye tracker. The recording is then processed to identify

fixations, and discard saccades. The fixations found are taken to represent important

features and detail in these locations will be preserved.
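The text above does not say how fixations are separated from saccades. One common option, shown here purely as an assumption, is a dispersion-threshold scheme: a run of samples counts as a fixation if it lasts long enough and stays within a small spatial spread. The sample format (time in milliseconds, x, y in pixels) and the threshold values are hypothetical.

```python
def dispersion(window):
    """Spread of a window of (t, x, y) samples: x range plus y range."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def detect_fixations(samples, max_dispersion, min_duration):
    """Return (x, y, duration) fixations from time-sorted (t, x, y) samples."""
    fixations = []
    n = len(samples)
    i = 0
    while i < n:
        # Find the shortest window starting at i that spans min_duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break
        if dispersion(samples[i:j + 1]) <= max_dispersion:
            # Grow the window while the samples stay tightly clustered.
            while j + 1 < n and dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            cx = sum(s[1] for s in window) / len(window)
            cy = sum(s[2] for s in window) / len(window)
            fixations.append((cx, cy, window[-1][0] - window[0][0]))
            i = j + 1
        else:
            # Samples here belong to a saccade; slide past them.
            i += 1
    return fixations

# Synthetic 100 Hz recording: a fixation, a three-sample saccade, a fixation.
samples = [(10 * k, 100 + k % 2, 100) for k in range(30)]
samples += [(300, 175, 150), (310, 250, 200), (320, 325, 250)]
samples += [(10 * k, 400, 300 + k % 2) for k in range(33, 63)]
fixations = detect_fixations(samples, max_dispersion=20, min_duration=100)
```

Each returned fixation carries a position and a duration, which is the information the rest of the approach relies on.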

An apparent contradiction is worth noting here. There seem to be two possibilities

concerning this interaction. Either the viewer will look at distracting image features,

and these will be included in the rendering, making fixation data useless for determin-

ing importance. Or, the viewer will not look at supposedly distracting elements, in

which case removing the information seems to be of negligible value. This is related to

the basic question of whether abstracting images is itself worthwhile, or if we should

just let people do the abstraction in their head.

There are several responses to this. Firstly, there is a great deal of content like

texturing on a wall or detail in grass that is not directly examined but is certainly visible.

This kind of information will be removed. Artists certainly seem to manipulate this

kind of detail in a purposeful way, suggesting it is worth doing. In a particular style,

the prominence of small features could be emphasized inappropriately without this kind

of omission. An ink drawing of a field of grass with each leaf depicted in silhouette

would be distracting.

Even on occasions when the eyes fixate low level visual distracters, or regions of

high contrast or brightness, which don’t say anything about the image meaning, these

fixations can be expected to be shorter in duration [Underwood and Radach, 1998]. So

we can still hope to remove such distracters despite their being looked at, and avoid the

distraction for future viewers.

This suggests it will be important to have a model of attention that takes into ac-

count the length of a fixation. Considering fixation duration will allow brief fixations

to be discounted as distraction, while giving more weight to longer fixations at which

it can be assumed more processing occurs. The simple model we use for this is shown

in Figure 5.1 (b). This model is a piecewise linear function in which fixations below a

certain duration have no effect, ramping relatively quickly up to a maximal weight at

fixations of at least a particular duration. This is a very simple model considering the

complexity of visual cognition. More sophisticated attention models may be useful.

Various information might be useful such as the time course and grouping of fixations.
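The piecewise-linear attention model described above is simple enough to state directly in code. In this sketch the parameter names follow the symbols of Figure 5.1 (b); no particular values for t_min, t_max or a_max are implied by the text, so any values passed in are the caller's assumption.

```python
def attention_weight(duration, t_min, t_max, a_max=1.0):
    """Scaling factor a_i for a fixation of the given duration.

    Fixations no longer than t_min are ignored, fixations of at least t_max
    receive the full weight a_max, and the weight ramps linearly in between.
    """
    if duration <= t_min:
        return 0.0
    if duration >= t_max:
        return a_max
    return a_max * (duration - t_min) / (t_max - t_min)
```

For example, with hypothetical thresholds of 0.1 and 0.4 seconds, a fixation of 0.25 seconds falls halfway up the ramp and receives half of a_max.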

[Diagram: (a) labels p, (x_i, y_i), d_i(p), e_i(p), fixation, display; (b) axes t_i and a_i, with t_min, t_max and a_max marked.]

Figure 5.1: (a) Computing eccentricities with respect to a particular fixation at p. (b) A simple attention model defined as a piecewise-linear function for determining the scaling factor a_i for fixation f_i based on its duration t_i. Very brief fixations (below t_min) are ignored, with a ramping up (at t_max) to a maximum level of a_max.

5.2 Using Visibility for Abstraction

To apply fixation data to simplification, we need to link fixations to individual decisions

about what content to include. The models of visual sensitivity discussed above allow

us to decide what features are visible. However, a difficulty exists when applying

these models to image simplification. These models all define a boundary between

perceptible and imperceptible features. In our application the goal is to remove features

in a way that is perceptible but makes sense to the viewer and preserves meaning. On

a well positioned monitor, nearly everything should be visible, down to nearly a pixel

resolution in the course of a brief viewing, so an accurate model could tell us to include

everything.

In order to accomplish abstraction, the hope is that these models can be scaled back

by some global constant, representative of a particular amount of abstraction. This will

allow us to remove visible detail in a way that still reflects the relative perceptibility of

features given a particular viewer’s eye movements. By making our algorithm interpret

the viewer as having, in a sense, worse and worse eyesight, we can force it to remove

visible information in a way that is still perceptually motivated. To the best of our

knowledge this is the first work to attempt to formalize abstraction in this manner.

This constant scaling factor can be seen either as a separate scaling factor, indicating

the degree of abstraction, or as part of the attention model mentioned above. When

folded into the attention model (Figure 5.1 (b)) as a_max, it can be seen as representing

a certain background amount of detail that is not interesting even when a location is

fixated.

In general, a framework for accomplishing abstraction using this methodology includes

three main choices: first, a system of image analysis to represent the image content

in a meaningful way; second, a method of indicating which of this content is actually

important; third, a style in which to render the important content. In successive chapters

we present several systems that share the same basic interactive technique of eye

tracking described above but vary in the particulars of these decisions.

Chapter 6

Painterly Rendering

Our first system for creating abstracted NPR images does so in a painterly style. Painterly

rendering creates an image from a succession of individual strokes, mimicking the

methodology of a human painter. The intuition behind our system is that in most paint-

ing styles, painters use fewer and larger strokes to model secondary subject matter such

as background figures or scenery. This system is relatively simple, and served as an

initial proof of concept for our interactive technique. The model of image content is

simple and unstructured, the perceptual model equally minimal.

6.1 Image Structure

Our approach to painterly rendering [Santella and DeCarlo, 2002] has no explicit

model of visual form; it uses a simple scale space representation. This is possible because

painterly rendering is a somewhat forgiving style in which imprecision in stroke

placement typically is not distracting.

Our approach follows an existing methodology [Hertzmann, 1998], in which curved

strokes of different widths are painted with colors taken from the appropriate level of

a scale space of blurred versions of the original image. Our only model of image

contents is this scale space of images, and the corresponding image gradient at each

scale. This information is used in a standard algorithm to generate candidate strokes.

These strokes are the features to which we apply our perceptual model. The features

we consider are therefore not image features properly speaking, but strokes that exist

only in the rendering, not in the original image.

6.2 Applying the Limits of Vision

Given a choice of feature, we now need to make judgments about what features to

include. To do this we need to pick a model of visual sensitivity and decide how to

apply it to our system. The simplest model we could use would be an acuity model that

modulates brush size. This corresponds to considering each brush stroke in isolation

as a mark of maximal contrast with its background. Such a model is a fairly large

oversimplification, but provides an interesting starting point.

Simple acuity models like this have been used in graphics before [Reddy, 1997].

In that work, Reddy fit a function to the threshold frequencies provided by a contrast

sensitivity model for maximal contrast features at varying velocities and eccentricity.

This provides an acuity model that is simple to apply. It takes an input speed and

eccentricity and outputs a threshold frequency. Though simple to apply and based on

psychophysical data, this model is not useful for our purposes. Because the model is

crafted to be highly conservative, a fairly large central region, 5.79 degrees in width,

is assigned a maximal acuity. A conservative estimate like this may be desirable for

imperceptible LOD. Where the goal is to remove unattended information, this model is

overly conservative. The circular region where detail is maximal can be highly visible

and distracting in an abstracted rendering.

We would prefer a function closer in shape to the actual drop-off in visual sen-

sitivity. A similar model that provides more intuitive results is equally simple. The

maximum frequency humanly visible is assigned to just the center of the visual field

(G = 60 cycles/degree). From there, sensitivity drops off following the standard

cortical magnification factor, Equation (3.3). This produces a continuous degradation of

detail from the center of vision purely on a frequency basis. This value is scaled by the

simple attention model described in Section 5.1 to produce a final frequency threshold.

Each fixation defines a potentially different threshold at each point in the image. The

highest threshold (usually corresponding to the closest fixation) is used as the actual

threshold at point p, termed f_max(p).

We model a brush stroke with width D as a half cycle in a grating and compare

the resulting frequency f = 1/(2D) to the cutoff provided by our perceptual model. The

stroke is included if it is lower in frequency than the cutoff.
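The decision just described can be sketched as follows. The sketch assumes that each fixation has already been given an attention weight a_i, that stroke widths are expressed in degrees of visual angle, and that eccentricity can be approximated from on-screen distance and viewing distance with a simple arctangent. The display parameters (pixels_per_cm, viewing_distance_cm) and function names are hypothetical.

```python
import math

G_MAX = 60.0  # maximum visible frequency at the center of vision, cycles/degree

def cortical_magnification(e):
    """Equation (3.3): sensitivity drop-off at eccentricity e (degrees)."""
    return 1.0 / (1.0 + 0.29 * e + 0.000012 * e ** 3)

def eccentricity_degrees(point, fixation_xy, pixels_per_cm, viewing_distance_cm):
    """Approximate angle between a fixation and a point on the display."""
    d_px = math.hypot(point[0] - fixation_xy[0], point[1] - fixation_xy[1])
    return math.degrees(math.atan2(d_px / pixels_per_cm, viewing_distance_cm))

def frequency_threshold(point, fixations, pixels_per_cm, viewing_distance_cm):
    """f_max(p): the highest threshold any fixation (x, y, a) yields at point p."""
    thresholds = [
        a * G_MAX * cortical_magnification(
            eccentricity_degrees(point, (x, y), pixels_per_cm, viewing_distance_cm))
        for (x, y, a) in fixations
    ]
    return max(thresholds, default=0.0)

def stroke_is_included(width_degrees, f_max):
    """A stroke of width D is a half cycle, so its frequency is 1 / (2 D)."""
    return 1.0 / (2.0 * width_degrees) < f_max
```

With a single fully weighted fixation, a fine stroke passes the test at the fixated point and fails it a few degrees away, which is what produces the fall-off in detail away from where the viewer looked.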

6.3 Rendering

To actually produce candidate brush strokes our approach uses a standard algorithm

[Hertzmann, 1998] to approximate the underlying image with constant color spline

strokes that generally run perpendicular to the image gradient. These strokes originate

at points on a randomly jittered grid. An extended stroke is created by successively

lengthening the stroke by a set step size in the direction perpendicular to the image

gradient. When the color difference between the start and end points of the stroke

crosses a threshold, or the stroke reaches a maximal length, the stroke terminates (see

[Hertzmann, 1998] for further details).

Our method varies in a few particulars. When strokes are created by moving perpendicular

to the image gradient, they can be excessively curved and worm-like in

appearance. Hertzmann’s strokes are B-splines, which can meander to a sometimes

excessive extent. Real paint strokes do not tend to do this. Even in paintings by artists

that one thinks of as using very salient curving strokes, for example van Gogh, compound

curves are made up of multiple gently curving strokes. In response to this, our

maximal stroke length is shorter and we use a single Bézier curve for each stroke, using

a subset of the calculated control points. This produces somewhat more natural, sim-

ple curves. However, even these can curve more sharply than is normal with real paint

strokes.

Strokes are painted from coarse to fine with the entire image being covered with

the coarsest strokes used. For finer scale strokes, a choice is made at each stroke origin

point whether a stroke of that size is necessary based on the perceptual model. Only

necessary strokes are generated.

Figure 6.1: Painterly rendering results. The first column shows the fixations made by a viewer. Circles are fixations, size is proportional to duration, and the bar at the lower left is the diameter that corresponds to one second. The second column illustrates the painterly renderings built based on that fixation data.

In addition to removing detail, artists also use color as a vehicle for abstraction.

Vibrant colors and high contrast can enhance the importance of a feature or make it

easier to see. Muted color and contrast can de-emphasize unimportant items. Stroke

colors can be adjusted to achieve this [Haeberli, 1990]. Our perceptual model provides

a means of deciding where to make these adjustments. For instance, lowering the

contrast in unviewed regions makes them even less noticeable; raising it emphasizes

viewed objects.

Since color contrast is not well understood [Regan, 2000] we use a simple approach

to adjust colors. Though where we apply these manipulations is controlled by

our perceptual model, the extent of these manipulations was simply picked by experimentation.

We start by defining a function of location $u(p)$ ranging from 0 (where the

user did not look at point $p$) to 1 (where the user fixated $p$ for a sufficiently long period

of time). This is defined as the ratio between the perceptual threshold at point $p$, and

the maximal threshold possible:

$$u(p) = \frac{f_{\max}(p)}{a_{\max}\, G} \qquad (6.1)$$

We then adjust color locally for each stroke based on this function.

• Contrast enhancement: Contrast is enhanced by extrapolating from a blurred version of the image at $p$ out beyond the original pixel value. The amount of extrapolation changes linearly with $u(p)$, being $c_{\min}$ when $u = 0$ and $c_{\max}$ when $u = 1$ (a $c_{\min}$ and $c_{\max}$ of 1 would produce no change). $c_{\min}$ and $c_{\max}$ are global style parameters for controlling the type of contrast change. For example, choosing $[c_{\min}, c_{\max}]$ to be $[0, 2]$ raises contrast where the user looked, and lowers contrast where they didn’t. (Default: $[c_{\min}, c_{\max}] = [0, 2]$.)

• Saturation enhancement: Colors can also be enhanced; colors are intensified in important regions and de-saturated in background areas. The transformation proceeds the same as with contrast, now specified using $[s_{\min}, s_{\max}]$, and extrapolating between the original pixel value and its corresponding luminance value. As an example, choosing $[s_{\min}, s_{\max}]$ to be $[0, 1]$ just desaturates the unattended portions of the image. (Default: $[s_{\min}, s_{\max}] = [0, 1.2]$.)
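
The two adjustments above can be sketched for a single stroke color as follows. This is a minimal sketch rather than the code used for the figures: the order of the two steps, the Rec. 601 luminance weights, the final clamp to [0, 1], and the function names are all assumptions made for illustration.

```python
def lerp(a, b, t):
    """Linear interpolation between a (t = 0) and b (t = 1)."""
    return a + (b - a) * t


def importance(f_max_p, a_max, G=60.0):
    """u(p) as in equation (6.1); the clamp to [0, 1] is an added safeguard."""
    return max(0.0, min(1.0, f_max_p / (a_max * G)))


def adjust_color(rgb, blurred_rgb, u, c_range=(0.0, 2.0), s_range=(0.0, 1.2)):
    """Adjust one stroke color; channels are floats in [0, 1]."""
    c = lerp(c_range[0], c_range[1], u)
    s = lerp(s_range[0], s_range[1], u)
    # Contrast: extrapolate from the blurred value through the original value.
    contrasted = [b + c * (v - b) for v, b in zip(rgb, blurred_rgb)]
    # Saturation: extrapolate from the luminance (gray) value through the color.
    # Rec. 601 luma weights are an assumption; the text does not name a formula.
    lum = 0.299 * contrasted[0] + 0.587 * contrasted[1] + 0.114 * contrasted[2]
    saturated = [lum + s * (v - lum) for v in contrasted]
    return [max(0.0, min(1.0, v)) for v in saturated]
```

With the default ranges, a color at an unviewed point (u = 0) collapses to the gray level of the blurred image, while a fully attended point (u = 1) has its contrast doubled and its saturation raised by a factor of 1.2.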

Figure 6.2: Detail in background adjacent to important features can be inappropriately emphasized. The main subject has a halo of detailed shutter slats.

6.4 Results

Results from this technique capture some of the abstraction present in paintings (see

Figure 6.1). Focal objects are emphasized with tighter rendering and more intense color and

contrast. Background features are accordingly de-emphasized. All this is done with

virtually no effort on the part of the user. In contrast, hand painting strokes or even

painting a detail map to control stroke size would require greater effort.

This painterly rendering framework has some limitations. Of course, neither the

placement nor the individual appearance of paint strokes seriously mimics real paint. Realistic

paint strokes were never a goal here. How to accomplish this to various degrees

of approximation is fairly well understood. Placement of strokes is a more interesting

shortcoming of this approach. Using few strokes is a major part of painterly abstraction.

In our renderings, despite throwing out small strokes in most places, too many

strokes of too small a size are used to approximate any given part of the image. Other

painterly rendering methods exist which could be used to carefully place strokes while

retaining our underlying methods for choosing detail levels [Shiraishi and Yamaguchi,

2000].

Figure 6.3: Sampling strokes from an anisotropic scale space avoids giving the image an overall blurred look, but produces a somewhat jagged look in background areas.

Figure 6.4: Color and contrast manipulation. Side by side comparison of renderings with and without color and contrast manipulation (precise stroke placement varies between the two images due to randomness).

Aside from the limitations of the rendering techniques we’ve appropriated, our ap-

proach to abstraction has some inherent limitations of its own. Since there is no explicit

model of image structure, detail can only be modulated in a continuously varying way

across the image. Stroke placement is designed to respect edges; however, the size of

strokes does not. Detail spreads from important locations to neighboring areas, creating

distracting haloing artifacts (see Figure 6.2).

In addition, sampling coarse stroke colors from blurred versions of the image

tends to give results an overall blurry look, especially when downsampled. An artist

wouldn’t actually blur colors and shapes in this way, but would preserve more of a region’s

original appearance while removing detail. Abstraction in paintings, even when

objects are rendered indistinctly, is not just blurring. High frequency information is

not blurred out. Rather, small elements are removed completely. One way to accom-

plish this would be to sample strokes from an anisotropic scale space that blurs out

detail while still retaining sharp edges. An example of a painterly rendering that does

this is shown in Figure 6.3. Stroke detail has been modulated by the same perceptual

model. Color has been left unchanged. This image does preserve more of the original

image structure. However, because intensities have not been blurred together in the

background, imperfections in placement of coarse strokes are emphasized. In back-

ground areas strokes appear jagged, their orientations varying excessively. A more

global strategy for orienting [Hays and Essa, 2004] and placing [Gooch and Willem-

sen, 2002] strokes along with blending might help overcome these difficulties.

Chapter 7

Colored Drawings

Some of the limitations of a purely local model of image structure are addressed by our

second NPR system. This approach utilizes a richer model of image structure to create

images in a line drawing style with dark edge strokes and regions of flat color [DeCarlo

and Santella, 2002]. This style resembles ink and colored wash drawings or lithograph

prints such as those of Toulouse-Lautrec. However, it is something of a minimal style.

Images are rendered with the primitives of our new image representation itself. The

more structured image representation used here allows us to create a more simplified

visual style. It also allows us to control detail across the image in a discontinuous way.

This means we can keep detail from leaking from important objects to unimportant

surrounding regions.

7.1 Feature Representation

The primary image representation underlying this visual style is a hierarchical seg-

mentation that approximates the scale space structure of the image. As noted above,

analytical calculation of this structure is problematic. Our choice of approximation was

motivated by the desire for reasonable efficiency and also by a requirement that regions

at each level of the tree should, as much as possible, be reasonable areas to draw. Any

of them might wind up in a rendering with some given distribution of detail.

7.1.1 Segmentation

Our approach is to use a robust mean shift segmentation independently at each level of

a linear scale space of blurred images (which are downsampled for efficiency). We use

publicly available code to accomplish this (http://www.caip.rutgers.edu/riul).

Figure 7.1: Slices through several successive levels of a hierarchical segmentation tree generated using our method.

To avoid artifacts from downsampling, a dense image pyramid is created. Each

level of the pyramid is smaller than its predecessor by a factor of square root of two.

Each level is segmented. Once each of these independent segmentations has been gen-

erated, the regions from each separate level are then linked to a parent region chosen

from the next coarser segmentation. This is done using a simple method that chooses a

parent region with maximal overlap, using color information to disambiguate difficult

choices (see [DeCarlo and Santella, 2002]). This is not a particularly robust method

of assignment. We are, however, conservative, putting off merging if there is no parent

that is a reasonably good match. These ill-matching regions are propagated up to the

parent level (implying that the tree is not necessarily complete). Because the segmen-

tations on each level are themselves quite good, difficult situations do not often occur.

This approach allows leveraging existing segmentation methods and implementations,

and is flexible enough to incorporate alternate segmentation methods or alternate scale

spaces, like anisotropic diffusion.
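
The overlap-based linking step can be sketched as follows. This is only an illustration of the maximal-overlap rule with a conservative cutoff: it assumes both label maps have been resampled to the same pixel grid, it omits the color-based disambiguation described in [DeCarlo and Santella, 2002], and `min_overlap` is a hypothetical parameter.

```python
from collections import Counter, defaultdict


def link_parents(child_labels, parent_labels, min_overlap=0.5):
    """Map each child region id to a parent region id, or None if no good match.

    child_labels, parent_labels: 2D lists of region ids on the same pixel grid.
    min_overlap: hypothetical cutoff on the fraction of a child's pixels that
        the best parent must cover before the link is accepted.
    """
    overlap = defaultdict(Counter)
    for child_row, parent_row in zip(child_labels, parent_labels):
        for c, p in zip(child_row, parent_row):
            overlap[c][p] += 1  # count shared pixels per (child, parent) pair
    parents = {}
    for c, counts in overlap.items():
        best, shared = counts.most_common(1)[0]
        area = sum(counts.values())
        # Conservative: leave the child unlinked so it can be carried up a level.
        parents[c] = best if shared / area >= min_overlap else None
    return parents
```

A child mapped to `None` corresponds to an ill-matching region whose merge is put off; in this sketch the caller would copy it unchanged into the next coarser level.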

Inherently, the way different colored regions corresponding to different objects

merge together at very coarse scales is somewhat unpredictable and unstable. An en-

gineering decision was to simply not sample very coarse scales. The coarsest scale

segmentation was selected to still contain a fair number of regions. Coarser scales with

only a few regions are not usually useful for rendering images. For all images (most

at 1024x768 resolution) 9 levels of downsampling were used. All the coarsest scale

regions were simply set as children of the tree root. Figure 7.1 illustrates some of the

slices through one of these trees.
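
For a sense of scale, the level sizes implied by a factor of square root of two can be computed directly. The sketch below assumes that the nine levels are counted after the full-resolution image and that sizes are rounded to the nearest pixel; neither detail is stated in the text.

```python
def pyramid_sizes(width, height, levels):
    """Sizes of `levels` successive downsamplings, each smaller by sqrt(2)."""
    return [
        (round(width / 2 ** (k / 2)), round(height / 2 ** (k / 2)))
        for k in range(1, levels + 1)
    ]
```

Under these assumptions, `pyramid_sizes(1024, 768, 9)` passes through 512x384 at the second level and 64x48 at the eighth, so every second level halves the resolution.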

Though good, these segmentations are neither perfect on each level nor in decisions

made across scale. Certain features consistently present difficulties. Textures tend to

be over-segmented. Smoothly varying regions are broken into a number of patchy,

roughly constant color areas. Better segmentation techniques could be incorporated

into our method as they materialize.

A separate concern is the extent to which an image region hierarchy is appropriate

for abstraction, since image structure does not directly correspond to object structure.

Our system makes no attempt to turn a scale space structure into one that represents

the structure of actual scene objects [Bangham et al., 1998]. This is a difficult problem,

though there are some interesting possibilities for applying image understanding

techniques [Torralba et al., 2004] to add a top down component to the bottom up image

structure generated here.

For the time being we have addressed this problem by creating a simple interface

that allows interactive editing of the segmentation tree. A user can take a segmentation

tree as a starting point, and manually assign children areas to different parents, or split

and merge regions. With a fair amount of effort this could allow an object hierarchy

to be built. More realistically, occasional segmentation errors can be corrected fairly

easily. On the whole, however, we have not found this hand editing necessary. It was not

used in any of the published results in DeCarlo and Santella [2002]. In creating a larger

set of 50 renderings for [Santella and DeCarlo, 2004a] it was used to correct a few

prominent segmentation errors in a handful of the images. Most segmentation errors

that violate object boundaries appear high in the segmentation tree. These incorrect

regions are rendered only when the area is highly abstracted. In these abstracted areas

violations of object boundaries are usually not particularly distracting.

7.1.2 Edges

Our initial implementation tried rendering dark edges using a subset of the edge borders

of the regions. This did not on the whole provide sufficiently clean and flexible edges.

Edges extended beyond where desired. Because only average color for each region

was stored, significant region boundaries were difficult to distinguish from boundaries

due to smooth shading variation. Because of this a separate representation of edges

was used. Edges were detected with a robust variant of the Canny edge detector [Meer

and Georgescu, 2001]. Edges were detected on only one scale. This presents some

limitations but was enough to capture a reasonable set of important edges.

The combination of edges and regions provides a reasonable model of image con-

tent. The structure of this model will allow us to abstract images by more intelligently

removing detail in a coherent region-based manner.

7.2 Perceptual Model

Our richer model of image structure provides us the opportunity to use richer models

of visual perception to interpret eye tracking data in the context of our image. Though

useful, a perceptual model based only on frequency like that used in our painterly

rendering system is limited.

For example, in the output of our painterly system simplified parts of the image are

still painted with quite a large number of strokes. Most of these strokes are quite similar

in color. If larger strokes were used in all background areas, those background areas

with content would be completely obliterated. As it is, the long uniformly colored

strokes introduce detail, exaggerating the local color variation in mostly blank areas

(this is akin to the problem of aliasing or spurious resolution). This is in sharp contrast

to skilled examples of real painting where coarsely rendered forms of near uniform

color are blocked in with just a few very large strokes. The problem is that strokes

are selected based purely on size. It would be desirable to remove features based on

contrast as well as size.

This system continues to use a frequency model like that in [Santella and DeCarlo,

2002] for judging line strokes. Here, frequency is taken as proportional to the length of

the stroke, rather than its width. This decision is somewhat arbitrary, though it results

in the intuitive behavior of shorter lines being more easily filtered out. Since strokes

are rendered in our system with a width proportional to their length, one could look at

the perceptual model as measuring the prominence of the feature being drawn rather

than the original edge.

We can use a contrast sensitivity model to judge the visibility of regions that have

both a size and color. Of the available contrast sensitivity models, Equation (3.2) seems

appropriate because it is derived from a variety of experimental data and has been

used in a computational setting on real images. This is the model we use [Santella and

DeCarlo, 2002].

As mentioned earlier, a perceptual model by itself will remove little content. Most

of the features we are interested in are visible. Some kind of global scaling down of

relative visibility is necessary to create an interesting amount of abstraction. Several

possibilities for scaling the model exist. Scaling can be applied to frequency making

smaller regions progressively less visible and removing them from the image as was

done for painterly strokes. Scaling in frequency is fairly intuitive in some respects.

Given a desired smallest feature size, it is simple to derive a scaling factor that will

include features of this frequency foveally, and degrade size further at larger eccentric-

ities [Santella and DeCarlo, 2002].

Scaling frequency in a contrast sensitivity model is a bit problematic. The contrast

sensitivity function has a hump-like shape (see Figure 3.3). Visibility degrades at both

extremes of frequency. This produces unintuitive behavior. When scaling up frequen-

cies, a region can become more visible as it becomes smaller, before becoming less

visible. Several approaches to solving this are possible. As mentioned above, there is a

different pattern of contrast sensitivity for square wave gratings. A square wave model

might be more appropriate for our region features since, like a square wave grating,

they have a sharp (high frequency) boundary. A simple mathematical model for the

square wave contrast sensitivity function does not seem to be available but an approx-

imation built by analyzing the frequency spectrum of the square wave signal has been

derived [Campbell and Robson, 1968]. Using this model is a bit complicated. An al-

ternative is to simply use the maximal sensitivity for regions larger than the frequency

corresponding to the peak sensitivity. This replaces the low frequency slope of the

function with a horizontal line at peak sensitivity. This could be considered a first level

of approximation of the square wave model, which very roughly states that visibility is

largely governed by the most visible frequency component in a square grating.

Another possible approach for scaling to remove content is to scale only in the

contrast domain. After calculating a contrast threshold value this can be scaled before

being compared to the actual contrast of a region. This behaves in a more intuitive

manner, as more scaling always removes more content.

From our experimentation, scaling in the contrast domain seems to be the more

useful option for scaling region visibility. It corresponds more to our intuitive sense of

which detail should be removed first, as a fairly wide range of frequency regions are

desired in a final image. Removing large low contrast regions using frequency scaling

also removes the few desired high frequency regions. Contrast scaling preferentially

removes lower contrast regions, which looks better. This may be due in part to our

segmentation method. Because it breaks shading into many patchy regions, there is a

desire to get rid of these sometimes large but low contrast shading regions faster than

proportionally smaller but higher contrast features.

Ultimately, we combined both of these approaches [DeCarlo and Santella, 2002] in

our final system. Scaling is applied only to contrast and region frequencies are capped

at a minimum value to keep the visibility of low frequency regions from degrading

too much. This is an approximate patch to problems in the applicability of the current

contrast sensitivity model. The most perceptually realistic approach is still an open

question.
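
The combined rule can be written in a few lines. This is a sketch of the decision as described in this section, not the system's code: `threshold_fn` is a hypothetical stand-in for the contrast threshold derived from equation (3.2), and `scale` and `f_min` are the global contrast scaling and the minimum frequency cap.

```python
def region_visible(contrast, frequency, eccentricity, threshold_fn, scale, f_min):
    """Return True if a region survives abstraction.

    threshold_fn(frequency, eccentricity): hypothetical stand-in for the
        contrast threshold of the sensitivity model.
    scale: global factor applied to the threshold; larger values remove more.
    f_min: frequencies below this are treated as f_min, so large regions do
        not lose visibility on the low-frequency side of the hump.
    """
    capped = max(frequency, f_min)
    return contrast > scale * threshold_fn(capped, eccentricity)
```

Because scaling touches only the threshold, increasing `scale` can never bring a removed region back, which matches the observation that more scaling in the contrast domain always removes more content.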

Once a contrast threshold has been calculated, the contrast of a particular feature

needs to be measured for comparison to that threshold. As mentioned above, contrast

models are derived over gratings in which there is one relatively obvious way to measure

contrast. Our regions are more complex: they include color and are non-uniform

(in the initial image).

One approach is to measure contrast between the average color of parent and child

regions in the tree. This could be thought of as capturing whether the next branching

represents a significant change. This works reasonably, but can be susceptible to chance

features in the segmentation tree. For example, two pairs of black and white regions

might merge to form two gray regions. These then merge to form a single gray region.

The difference between parent and child region colors on this level would be minor.

An approach that seems more successful is to use the average of the contrasts be-

tween a region’s color and those of adjoining regions in the same level of the tree. In

this case it would be possible to take a cue from [Lillestaeter, 1993] and measure con-

trast between regions as a weighted average of contrast between their average colors

and the contrast across their shared edge in the initial image. The correct scale at which

to measure contrast across the edge is however unclear. We choose ultimately to mea-

sure contrast between a region and adjoining regions on that level using only region

color. The contrast for the region is the average of contrast with each of its neighbors

weighted by extent of the border they share.

Since we are not aware of any simple color contrast framework, we take a simple

approach to measuring contrast between individual pairs of region colors. We use a

slight variation of the Michelson contrast: $\frac{\|c_1 - c_2\|}{\|c_1\| + \|c_2\|}$. Each color exists in the perceptually

uniform L*u*v* color space, so the measure provides a steady increase with

increasing perceptual differences in color. This formula sensibly reduces to the stan-

dard Michelson contrast in monochromatic cases. More sophisticated models of color

perception for regions may help the system work better on a broader class of images.
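
The contrast measure and its border-weighted average can be sketched as follows. Colors are assumed to be L*u*v* triples already, and the function names and the `(color, border_length)` input format are hypothetical choices for this illustration.

```python
import math


def color_contrast(c1, c2):
    """Michelson-like contrast ||c1 - c2|| / (||c1|| + ||c2||) for color triples."""
    num = math.dist(c1, c2)
    den = math.hypot(*c1) + math.hypot(*c2)
    return num / den if den > 0 else 0.0


def region_contrast(color, neighbors):
    """Average contrast with adjoining regions, weighted by shared border length.

    neighbors: list of (neighbor_color, shared_border_length) pairs taken
        from the same level of the segmentation tree.
    """
    total = sum(length for _, length in neighbors)
    if total == 0:
        return 0.0
    weighted = sum(color_contrast(color, c) * length for c, length in neighbors)
    return weighted / total
```

For two achromatic colors with lightness 80 and 20 the measure gives 60/100 = 0.6, the standard Michelson contrast for those values.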

7.3 Rendering

For regions, we choose to emphasize abstraction by removing even more information

than was directly specified by the (scaled) perceptual model. Instead of applying the

perceptual model to regions uniformly across the image, we divide the image into fore-

ground regions where the standard perceptual model is used, and background regions

where a more aggressive version of the model is used. Unfixated background areas are

identified. In them, a high constant eccentricity is used in place of the actual distance

to the nearest fixation, removing more detail in these regions.

Foreground regions are regions that appear to have been examined by the viewer. A

region is considered examined if a fixation rests in its bounding circle, or if the region

is in a subtree that is approximately fovea sized and centered on a fixation. Foreground

regions could be identified by searching the segmentation tree from its leaves upward.

In practice, it is simplest to descend the tree, applying the default model to regions

that contain fixations. When a subtree that matches a fovea is identified, the standard

model is applied throughout this subtree. When a subtree does not contain a fixation,

the background model is used. This avoids having to touch the majority of elements in

the tree that will not be rendered.
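
The top-down traversal can be sketched as below. This is a loose illustration under several assumptions: each region carries a bounding circle, a fixated region whose radius is at most `fovea_radius` is treated as the fovea-sized subtree, and `split(region, foreground)` is a hypothetical predicate that wraps the perceptual model (standard when `foreground` is true, background otherwise) and decides whether the region's children are drawn in its place.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    center: tuple   # (x, y) of the bounding circle
    radius: float   # radius of the bounding circle
    children: list = field(default_factory=list)


def has_fixation(region, fixations):
    """True if any fixation (x, y) rests in the region's bounding circle."""
    cx, cy = region.center
    return any(math.hypot(x - cx, y - cy) <= region.radius for x, y in fixations)


def select_regions(region, fixations, split, fovea_radius, in_fovea=False):
    """Return the regions to render, i.e. the leaves of the trimmed tree."""
    examined = in_fovea or has_fixation(region, fixations)
    # Once a fixated, fovea-sized subtree is found, everything below it is foreground.
    in_fovea = in_fovea or (examined and region.radius <= fovea_radius)
    if not region.children or not split(region, examined):
        return [region]  # stop here; nothing below this node is visited
    leaves = []
    for child in region.children:
        leaves.extend(select_regions(child, fixations, split, fovea_radius, in_fovea))
    return leaves
```

Subtrees without fixations are cut off as soon as the background model declines to split them, so most of the tree is never visited.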

Applying the model produces a trimmed tree. The leaves of this smaller tree are

the regions that will be rendered into an image. The set of selected regions and lines

are then rendered by drawing the regions as areas of flat color and drawing the edges

in black on top of them. This is a simple style and there are relatively few explicit style

choices.

One important stylistic choice is region smoothing. Before rendering, regions are

smoothed by an amount proportional to their size. This removes high frequency detail

along the border of large regions giving the regions a smooth, organic look. Smooth-

ing is needed because the spatial extent of a region is taken from the union of its

child leaves at the lowest level of segmentation. Without smoothing the region border

would contain inappropriate distracting detail. Edges are smoothed by a small constant

amount. The resulting misalignment between highly smoothed region boundaries in

abstracted areas and the corresponding edges adds to the ’sketchy’ look of results.

Edges are filtered in several ways based on their length, in order to eliminate clutter

from the many fragmentary edges resulting from edge detection. Very short edges of

only a few pixels are directly thrown out. Somewhat longer edges (by default less than

15 pixels in length) are drawn only if near the border of a region and included by the

acuity model. Longer edges are compared only against the acuity threshold.

Selected edges are drawn with a width proportional to their length. Edge width

interpolates between a minimal and maximal thickness (3 and 10 pixels) for edges

between 15 and 500 pixels in length. Thickness is constant outside this range. Edges

also taper to points at either end. This is a rough approximation of the appearance of

many lines in traditional media illustration. Thickness being proportional to length is

a debatable choice. It occasionally produces odd results, but does succeed in adding

some variation to line weight, while capturing the sense that long lines are usually more

important.
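
The length-based edge rules in this section reduce to two small functions. This is a sketch only: the text says width "interpolates" without naming the curve, so linear interpolation is an assumption, and `tiny_len` is a hypothetical value for "only a few pixels".

```python
def keep_edge(length, near_region_border, passes_acuity, tiny_len=5, short_len=15):
    """Decide whether an edge is drawn, based on its length in pixels."""
    if length < tiny_len:
        return False  # fragments of only a few pixels are thrown out
    if length < short_len:
        return near_region_border and passes_acuity
    return passes_acuity  # longer edges face only the acuity threshold


def edge_width(length, min_len=15, max_len=500, min_width=3.0, max_width=10.0):
    """Line width in pixels, interpolated in edge length and constant outside the range."""
    t = (length - min_len) / (max_len - min_len)
    t = max(0.0, min(1.0, t))
    return min_width + t * (max_width - min_width)
```

An edge halfway through the range (257.5 pixels long) is drawn 6.5 pixels wide, and anything longer than 500 pixels stays at 10.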

Rendering lines tapered at either end serves the additional purpose of disguising the

sometimes broken nature of automatically detected edges. Without this, the fact that a

single edge is often broken or dotted in many places tends to be highly distracting. Our

lines make no attempt to simulate the fine grained look of real brush or pen and ink

lines, something which has been done [Gooch and Gooch, 2001] and could be applied

if desired.

7.4 Results

Some results from this system can be seen in Figure 7.2. The visual style of flat colored

regions and dark lines is attractive. The abstraction provided by eye tracking seems to

succeed in highlighting important areas. How important this abstraction is can be seen

by comparing the results with abstraction to those in Figure 7.4. Images with uniformly

high detail look excessively busy, while those with uniform low detail appear overly

simplified. Abstraction is vital in producing clear images with a distribution of detail

that is not distracting.

Uniformly detailed images are created by removing the cortical magnification fac-

tor from the perceptual model. The constant scaling factor on contrast now provides

a single global control for simplification. This model is applied uniformly across the

image. Regions are still removed based on contrast sensitivity; low contrast and small

regions are eliminated first. However, the effect of this uniform simplification is not

nearly as successful.

Figure 7.2: Line drawing style results.

Figure 7.3: Stylistic decisions. Lines in isolation (a) are largely uninteresting. Unsmoothed regions (b) can look jagged. Smoothed regions (c) have a somewhat vague and bloated look without the black edges superimposed.

Figure 7.4: Renderings with uniform high and low detail.

The additional scaling factors on the region and edge perceptual models provide

detail sliders for the user. These preserve the relative distribution of detail between

examined and unexamined locations while reducing or increasing the overall amount

of detail. Most images pictured use similar settings for these values. Though there is

some variability from image to image in what global amount of detail looks best, a

single scaling factor usually looks acceptable on most images. Tweaking for the very

best look involves searching only a small area of parameter space.

Figure 7.5: Several derivative styles of the same line drawing transformation. (a) Fully colored, (b) color comic, (c) black and white comic.

Results confirm our intuition that regions and edges are complementary features.

Both are necessary to create comprehensible images. Each feature in isolation fails to

convey the scene. This is illustrated in Figure 7.3. Edges in isolation, in part because

of broken outlines, fail to convey the sense of a scene made up of solid objects. Regions

in isolation do clearly make up a scene, but without edges their smoothed borders are

distracting. As in the ink and colored wash styles commonly used in illustration, dark

edges add a kind of definition that is difficult to achieve with color alone.

Regions and edges make up our rendering style, but they can also be considered

building blocks of many other styles. Figure 7.5 illustrates some trivial derivative

styles, a color comic book look created by thresholding dark regions to black, and

a black and white comic look created by fully thresholding the image. More interest-

ing styles can also be built from the same building blocks. A natural possibility that we

have not attempted is a painterly style taking advantage of this structured image model.

Watercolor would be a particularly interesting possibility [Curtis et al., 1997]. Regions

could provide areas of color to fill with simulated watercolor, while rather than being

explicitly rendered, strong edges could indicate locations where a hard-edge, wet-on-dry

technique should be used.

We have argued that abstraction is a quality of any visual display designed with

the purpose of clear communication. Even depictions usually considered true to life

contain similar kinds of abstraction. Photorealist painters do this with subtle manipu-

lations of tone and texture. Photographers composing studio shots do the same thing

by manipulating the physical objects present. Graphic artists touching up photographs

act similarly, editing out small distracting features. In the next chapter we present a

very simple preliminary attempt to define a semi-automatic photorealistic abstraction

using the same techniques we’ve applied to artistic rendering.

Chapter 8

Photorealistic Abstraction

Our goal is to perform abstraction in a photographic style, removing detail while pre-

serving the sense that an image is an actual photograph. This is a more challenging

goal than abstraction in an artistic style. Artistic styles provide a clear statement that

an image does not directly reflect reality, and provide a fairly free license to change

content. Viewers are much less forgiving of artifacts in an image that claims to be an

accurate depiction of reality. The approach we present here is far from a solution to

this problem but presents some interesting images, and suggests that semi-automatic photographic

abstraction is possible.

8.1 Image Structure

The technique used here is simple. Anisotropic diffusion [Perona and Malik, 1990] is

used to smooth away small, light detail while preserving strong edges. Our contribution

is using eye tracking data to locally control the amount of simplification, allowing for

meaningful photorealistic abstraction.

Figure 8.1: Mean shift filtering tends to create images that no longer look like photographs.

Mean shift filtering is an alternative space for simplification. In theory it should

provide greater control of the simplification process. Achieving a photo-like result with

it is a bit tricky since small areas quickly converge to a constant color. A high contrast

discontinuous border is then visible between that region and adjoining areas that have

converged to a different mode. Though an interesting form of image simplification,

mean shift filtered images no longer look like photographs (see Figure 8.1).

Though its definition implicitly takes into account edges, anisotropic smoothing is

defined on a flat image. In performing abstraction we’d still like to use the structure

of the image to avoid artifacts like those seen in our painterly rendering system, where

detail leaks from important features into surrounding background. To achieve this, we

combine a segmentation with eye tracking data to create a piecewise constant impor-

tance map reflecting the interest shown in each image region. This importance image

will control the amount of smoothing performed.

8.2 Measuring Importance

Though the attention model we used in the work above took dwell time into account

to tell if a fixation was really meaningful, it was not primarily intended to measure the

relative importance of different areas of the image. As long as a fairly long fixation

was present in an area, it was considered important. Relative importance was largely a

function of distance from a fixation. Here however, we want to capture a finer grained

measure of relative feature importance to control a more delicate process of abstraction.

As mentioned above, the length of fixations tells us something about how important

a feature is because they relate to time spent processing the content of a location [Just

and Carpenter, 1976]. Fixations can vary quite widely in duration. It appears from

our initial experiments that the total dwell time in two fixated locations does provide at

least a rough measure of their relative importances. With this in mind, our strategy is to

create an importance map that is brighter in important areas. This is done by coloring


a segmentation using an estimate of the total amount of time spent fixating different

parts of the image.

We begin by breaking an image into regions using a flat mean shift segmentation

of the image. Using a multiscale segmentation based on our previous work is an in-

teresting possibility, which we have so far not explored. Based on fixations, we assign

a weight, which can be considered an importance value, or empirical salience to each

region in the image. Conceptually, we wish to count the amount of time spent fixating

each region in our segmentation. However, a fixation might not actually rest within the

boundary of the region that best represents the feature being examined. No segmen-

tation is perfect, so for example a fixation may rest in a region that represents half of

the object examined. In addition drift and noise in the point of regard as measured by

the eye tracker can cause fixations to sit just over a boundary in another object entirely.

Because eye trackers are at best accurate to about a degree of visual angle (roughly

25 pixels in our setup), noise is particularly apparent when a small object is fixated.

Depending on how small the object is, the corresponding fixation has a good chance

of lying within a surrounding background region. Some smoothing of containment to

deal with this problem was implicitly built into our previous work because bounding

circles were used to calculate intersections between fixations and regions.

To explicitly deal with this we make a soft assignment between each fixation and

each region and weight each region by the sum of these values. This smooths the

containment of fixations in regions because each fixation contributes to a range of

segments near it, rather than simply the one it rests in.

To set the contribution for a fixation f_i = (x_i, y_i, t_i) to segmented region r_j we compute

the average (A) over the distances between the fixation and all points in the region. We

define a threshold distance T = 175 pixels (more generally about 7 degrees of visual

angle) and the contribution for fixation f_i to r_j’s weight is equal to (1 − A/T) if A < T,

0 otherwise. The weight for each region is the sum of the weight contributed by each

fixation. Weights are capped at a maximum viewing time of 1 second. The region is


then drawn into the importance map using this intensity.
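The soft assignment can be sketched as follows. This is a minimal illustration, not our implementation: the function names are hypothetical, the segmentation is assumed to be given as an integer label image, and the sketch assumes that each contribution (1 − A/T) is scaled by the fixation duration t_i so that the capped weight can be read as viewing time in seconds.

```python
import numpy as np

def region_weights(labels, fixations, threshold=175.0, cap=1.0):
    """Soft-assign fixations to regions.

    labels: 2-D integer array of region ids (an assumed input format).
    fixations: iterable of (x, y, t), with t the duration in seconds.
    """
    ys, xs = np.indices(labels.shape)
    weights = {}
    for region in np.unique(labels):
        mask = labels == region
        total = 0.0
        for fx, fy, ft in fixations:
            # A: average distance from the fixation to every pixel in the region.
            a = np.hypot(xs[mask] - fx, ys[mask] - fy).mean()
            if a < threshold:
                # ASSUMPTION: the contribution (1 - A/T) is scaled by duration.
                total += ft * (1.0 - a / threshold)
        # Weights are capped at a maximum viewing time (1 second in the text).
        weights[int(region)] = min(total, cap)
    return weights

def importance_map(labels, weights, cap=1.0):
    """Draw each region with a constant intensity in [0, 1]."""
    out = np.zeros(labels.shape, dtype=float)
    for region, w in weights.items():
        out[labels == region] = w / cap
    return out
```

Because A is an average over the whole region, a fixation that lands just outside a small object still contributes strongly to it, while a large background region containing the fixation receives a smaller share.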

The result of this is an image where each region in the segmentation has a constant

color that reflects the total time spent examining that region. This is fundamentally dif-

ferent from the perceptual measures used in our other approaches. It is not a measure

of visibility, but instead a measure of how much something has been looked at. It is

similar conceptually to the subject/background distinction used to render background

areas particularly abstractly in our line drawing style. While that was a binary dis-

tinction, this approach creates a relative measure of importance using fixation length.

More sophisticated ways of matching fixations and regions are possible but we have

found this sufficient for our prototype.

The resulting subject map is then used in a very straightforward way. At each point

in the image, n iterations of anisotropic diffusion are performed where n interpolates

linearly between 1 at the brightest parts of the importance map and an abstraction

parameter M at the darkest parts (M = 250 in most results shown here).
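One way to realize a per-pixel iteration count is sketched below. The interpolation follows the description above; freezing a pixel once its own count is exhausted, the rounding of n, and the compact diffusion step (with assumed kappa and step values) are illustrative choices, not a record of our implementation.

```python
import numpy as np

def iteration_map(importance, max_iter=250):
    """Importance 1 (brightest) -> 1 iteration, importance 0 (darkest) -> max_iter."""
    imp = np.clip(np.asarray(importance, dtype=float), 0.0, 1.0)
    return np.rint(1 + (1.0 - imp) * (max_iter - 1)).astype(int)

def diffuse_step(img, kappa=0.1, step=0.2):
    """One explicit Perona-Malik step (4 neighbours, replicated border)."""
    p = np.pad(img, 1, mode="edge")
    diffs = (p[:-2, 1:-1] - img, p[2:, 1:-1] - img,
             p[1:-1, :-2] - img, p[1:-1, 2:] - img)
    return img + step * sum(np.exp(-(d / kappa) ** 2) * d for d in diffs)

def abstract(img, importance, max_iter=250, kappa=0.1, step=0.2):
    """Smooth each pixel for the number of iterations its importance allows."""
    n = iteration_map(importance, max_iter)
    out = np.asarray(img, dtype=float).copy()
    for k in range(int(n.max())):
        updated = diffuse_step(out, kappa, step)
        # ASSUMPTION: a pixel stops changing once it has used up its n iterations.
        active = n > k
        out[active] = updated[active]
    return out
```

A color image could be handled by applying the same procedure to each channel, which is again an assumption of this sketch.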

8.3 Results and Discussion

Some results of this process are pictured in Figure 8.2. The abstraction is much subtler

than in other styles, but when viewed at high resolution an interesting subtle falloff

in detail, largely small scale texture, is noticeable. This captures at least a bit of the

effect that can sometimes be seen in photorealistic paintings where low contrast detail

seems to disappear, while some high contrast details, for example in reflections and

specularities, appear to be emphasized.

Figure 8.4 illustrates the importance of taking region boundaries into account in

assigning importance to image regions. If importance just varies locally based on dwell

time in the vicinity, fixated objects have a halo of detail around them.

A number of limitations to this approach are obvious. Clearly it doesn’t capture

the flexibility an artist uses in abstracting texture, and removing entire elements. Even


Figure 8.2: Photo abstraction results

Figure 8.3: Photo in (a) is abstracted using fixations in (b) in a variety of different styles. (c) Painterly rendering, (d) line drawing, (e) locally disordered [Koenderink and van Doorn, 1999], (f) blurred, (g) anisotropically blurred.

Figure 8.4: (a) Detail of our approach, (b) the same algorithm using an importance map where total dwell is measured locally. Notice in (b) the leaking of detail to the wood texture from the object on the desk. Here differences are relatively subtle; but in general it is preferable to allocate detail in a way that respects region boundaries.

for performing simple textural abstraction it is limited. Importantly, the total amount

of abstraction possible without creating disturbing artifacts is limited. When relatively

few iterations of smoothing are performed, abstraction is limited and small high con-

trast features in the least important areas remain quite distinct. In contrast, if many

iterations of smoothing are performed, blurring becomes very apparent and the image

takes on a foggy appearance that distracts from the scene (see Figure 8.5).

There is an interesting unanswered question here of what features are important

for the percept of a realistic image. What about an image makes it appear like a

photograph, as distinct from a highly finished traditional painting or a painting by a

photorealist artist? The range of contrasts is clearly one cue that has been shown to be

perceptually important for material recognition [Adelson, 2001, Fleming et al., 2003].

Figure 8.3 provides an interesting comparison of abstraction performed using a number

of different methods, which give very different impressions.

Though a principled understanding of the perception involved is ultimately nec-

essary, there are various techniques that might currently be brought to bear on this

problem. Anisotropic diffusion provides a gradient weight that controls how sensitive

to the local gradient blurring is. At one extreme, blurring is uniform. At the other


Figure 8.5: The range of abstraction possible with this technique is limited. With greater abstraction the scene begins to appear foggy. In some sense it no longer looks like the same scene.

extreme, variations in the image are so carefully respected that almost no blurring oc-

curs. This parameter could also be varied based on importance, though it is not clear

how useful this parameter would be. A similar process of filtering using a mean shift

or bilateral filtering framework might provide more control. Though mean shift filter-

ing tends to produce images that no longer look like photographs, a careful scheme for

controlling the number of iterations, color and spatial scale of filtering might overcome

this problem. Ultimately, to capture a wider variety of artistic effects a more structured

understanding of texture and grouping of scene elements is important; these are more

difficult problems.


Chapter 9

Evaluation

Though the problem of photorealistic abstraction is difficult, our results for artistic

styles suggest we have succeeded in achieving meaningful abstraction. Results look

interesting and the reduction of detail does not seem visually jarring. Often in graphics

this kind of informal impression is evaluation enough.

This is not an illegitimate viewpoint. In the context of art or entertainment, formal

evaluation may not be necessary. An appeal to visual intuitions about what looks good

can be enough. Though our methods are targeted at creating artistic images for enter-

tainment, we are interested in applying these techniques to illustration or visualization.

Because of this we would like to be able to empirically evaluate the claim that our sys-

tem can direct viewers to areas highlighted with detail. Even this does not categorically

prove the technique actually makes images easier to understand. Being able to show

this would require a visualization application, where goals and task related factors are

in play. However, showing that abstraction directs visual interest would prove a quan-

tifiable perceptual effect resulting from our technique. To establish this we perform

a user study, comparing viewing behavior over our images to viewing of the original

photographs and renderings in our style created with several different distributions of

detail.

We first motivate our choice of evaluation technique, then present the specifics of

how we conducted our experiments. Results and some implications of our findings are

then discussed. Our aim is not just to validate our system, but is instead a threefold

goal:

• Present a method of evaluation new to NPR (Section 9.2)—one based on tracking

viewers’ eye movements.


• Use this method to provide quantitative validation for our system (Section 9.4)

as well as interesting new insights into the role of detail in imagery (Section 9.5).

• Explain why this evaluation methodology is widely applicable in NPR, even

when the NPR system itself does not use eye tracking.

9.1 Evaluation of NPR

Prior methodologies used to evaluate NPR fall into one of two categories. The first

method polls a representative number of users, collecting their opinions to find out

how they respond to the system. Schumann et al. [1996] polled architects for their

impressions of sketchy and traditional CAD renderings, and based on the results, ar-

gued for the suitability of sketchy renderings for conveying the impression of tentative

or preliminary plans. Similarly, Agrawala and Stolte [2001] demonstrate the effective-

ness of their map design system using feedback from real users.

The second approach measures users’ performance at specific tasks as they use a

system (or its output). When the task depends on information gained from using the

system, performance provides a measure of how effectively the system conveys infor-

mation. An early study [Ryan and Schwartz, 1956] looked at the time required to judge

the position of features in photos and hand rendered illustrations in different styles.

Faster responses suggested that simpler illustrations were clearer. Interrante [1996]

assessed renderings of transparent surfaces using medical imaging tasks. Performance

provided a measure of how clearly the rendering method conveyed shape information.

Gooch and Willemsen [2002] tested users’ ability to walk blindly to a target location

in order to understand spatial perception in a non-photorealistic virtual environment.

Gooch et al. [2004] compared performance on learning and recognition tasks using

photographs and NPR images of faces. Heiser et al. [2004] evaluated automatic

instructional diagrams by having subjects assemble physical objects and assessing their

speed and errors. Investigations like this draw on established research methodologies


in psychology and psychophysics.

Both of these methods have their limitations. For example, the goal of imagery is

not always task related. In advertising or decorative illustration (and possibly in much

fine art) the goal is more to attract the eye than to convey information. Success is mea-

surable, but not by a natural task. Surveys have their own limitations. The information

desired may not be reliably available to subjects by introspection. In addition, both task

performance and user approval ratings assess only the quality of a system as a whole.

Neither directly says why a pattern in performance or experience occurs. To understand

this, the system needs to be systematically changed and the experiment repeated. This

process can be costly and time consuming (or impossible). Any additional information

that aids the interpretation of results is therefore highly valuable.

We evaluate only one of the several styles of rendering presented in this work, the

segmentation based line and color drawings [DeCarlo and Santella, 2002]. This system

was chosen for evaluation in large part because it is the most developed of these sys-

tems. It is also an interesting candidate for evaluation in that it performs a very clean

aggressive kind of simplification. Unlike the other methods, it removes everything

from abstracted regions leaving them completely featureless. There is also no ran-

domness in the algorithm, as opposed to the painterly renderings system where there

are random variations in stroke placement. This allows multiple, otherwise identical,

renderings for comparison to be created with different distributions of detail.

Our hope is that removing detail can enhance image understanding. Further, suc-

cessive viewers may be encouraged to examine the image in a way similar to the first

viewer, and take away a similar meaning or impression. There is no natural task in

which to evaluate this effect, because our goal is creating artistic imagery rather than

visualizations for some task. Systematic questioning of viewers might substantiate the

intuition that the images are well designed, but would not inform future work.

Here, we present an alternate evaluation methodology which draws on established


psychophysical research. This approach analyzes eye movements and provides an ob-

jective measure of cognition. It can be the basis of evaluation, or provide complemen-

tary evidence when a task or other method is available. Regardless of the context in

which the user is viewing an image, the common factor is the act of looking. This

mediates all information that passes from the display to the user. In all of this work,

this key insight has provided an easy and intuitive method for abstraction. For the same

reason, we apply eye tracking to evaluation. These choices are independent; evaluation

via eye tracking is a general methodology that can be used regardless of how imagery

is created. Our study also looks at renderings that are created without the use of eye

tracking.

9.1.1 Analysis of Eye Movement Data

Basic parsing of eye movements into fixations and saccades has already been discussed

in Section 3.1. Once individual fixations have been isolated, it is often useful to impose

more structure on the data. In looking at an image, viewers examine many different fea-

tures, some closely spaced on a single object, others more distant. A common pattern

of looking is to scan a number of different features and then return back to particularly

interesting ones. Multiple close fixations suggest interest and increased processing in

the same location. Because of this, cumulative interest in a location is often a valu-

able measurement. This was used as the basis of an importance map in our system for

photorealistic abstraction (Chapter 8). When the location of features is known, this is

often measured by counting viewing time spent within a bounding box [Salvucci and

Anderson, 2001]. When there are not predetermined features, clustering can be used

to characterize regions of interest in a data driven fashion [Privitera and Stark, 2000].

Nearby fixations are clumped together, yielding larger, spatially structured units of vi-

sual interest. The number of clusters indicates the number of regions of interest (ROI)

present, and the number of points contained in them provides a measure of cumulative


interest. This is achieved using a mean shift clustering that considers only the x,y posi-

tions of locations viewed [Santella and DeCarlo, 2004b]. In the experiment described

in the next section this will reveal important information about how viewers look at

images.

Figure 9.1: Example stimuli (panels: Photo, Detail Points, High Detail, Low Detail, Salience, Eye Tracking). Detail points in white are from eye tracking, black detail points are from an automatic salience algorithm.


9.2 Experiment

9.2.1 Stimuli

The images used in this experiment were 50 photographs, and four NPR renderings

of each photo for a total of 250 images and five conditions. Most photos were taken

from an on-line database1. Photos spanned a broad range of scenes. Images that could

not be processed successfully were avoided, such as blurry or heavily textured scenes.

Prominent human faces were also excluded, although human figures were present in

a number of the images. All NPR images were generated using the method of De-

Carlo and Santella [2002] presented in Chapter 6. The four renderings differed in how

decisions about the inclusion of detail were made.

The five conditions are pictured in Figure 9.1. They are:

Photo: This is the unmodified photograph.

High Detail: A low global threshold on contrast ensures that most detail is retained,

removing primarily areas of low contrast texture and shading.

Low Detail: A high contrast threshold is used, removing most detail throughout

the image. The resulting image is drastically simplified but still for the most part

recognizable.

Eye Tracking: Detail is modulated as in [DeCarlo and Santella, 2002], using a

prior record of a viewer’s eye movements over the image. Detail is preserved in

locations the original viewer examined (here we call these locations detail points) and

removed elsewhere. The eye tracking data was recorded from a single subject who

viewed each image for five seconds (and was instructed to simply look at the image).

1 http://philip.greenspun.com

Salience Map: Detail is modulated in the same manner as eye tracking, but the

detail points are selected automatically by a salience map algorithm [Itti et al., 1998, Itti

and Koch, 2000]2. The algorithm has a model of the passage of time. So, like fixations,

each point has an associated duration. Five seconds worth of detail points were created.

The locations viewed by people and chosen by the salience algorithm can be similar in

some cases, but in general result in renderings with noticeably different distributions

of detail.

This set of conditions represents a systematic manipulation of an image. The effects

of NPR style, detail, and abstraction are separated. Local simplification is present in

two forms: one based on a viewer, and the other on purely low level features. Because

detail is controlled by choosing the levels of a hierarchical segmentation, simplified

images consist of a subset of the features in higher detail images. The eye tracking and

salience conditions are rendered literally using a part of the tree used to render the high

detail condition, while the low detail case generally includes the least content.

9.2.2 Subjects

Data was collected from a total of 74 subjects including 50 undergraduates participat-

ing for course credit and 24 subjects (graduate and undergraduate) participating for

pay.

9.2.3 Physical Setup

All images were displayed on a 19 inch LCD display at 1240 x 960 resolution. The

screen was viewed at a distance of approximately 33.75 inches, subtending a visual

angle of approximately 25 degrees horizontally. Eye movements were monitored us-

ing an ISCAN ETL-500 table-top eye-tracker (with a RK-464 pan/tilt camera). The

movement of the pan/tilt unit introduced too much noise in practice; it was not active

during the experiment. Instead, subjects placed their heads in an optometric chin rest

to minimize head movements.
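The stated viewing geometry can be checked with a short calculation. The sketch assumes a 4:3 panel whose full width is used by the 1240 pixel image; neither assumption is stated above, and the function name is illustrative.

```python
import math

def horizontal_visual_angle(diagonal_in, distance_in, aspect=(4, 3)):
    """Horizontal angle (degrees) subtended by a flat display viewed head-on."""
    w, h = aspect
    # ASSUMPTION: 4:3 panel, so the width is 4/5 of the diagonal.
    width_in = diagonal_in * w / math.hypot(w, h)
    return math.degrees(2 * math.atan(width_in / (2 * distance_in)))

angle = horizontal_visual_angle(19, 33.75)   # close to the 25 degrees quoted above
pixels_per_degree = 1240 / angle             # a little under 50 pixels per degree
```

Under these assumptions a calibration error of 24 pixels corresponds to about half a degree of visual angle.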

2 available at http://iLab.usc.edu

9.2.4 Calibration and Presentation

Eye trackers need to be calibrated in order to map a picture of a subject’s eye to a

position in screen space. This is accomplished by having the viewer look at a series

of predetermined points. In our experiments, a nine point calibration was used. The

quality of this calibration was checked visually, and also recorded. Every 10 images,

the calibration was checked and re-calibration was performed if necessary. Recordings

were used to measure the average quality of the calibrations. Errors had a standard

deviation of approximately 24 pixels (about a half degree), which agrees with the pub-

lished sensitivity of the system. Note that this does not account for systematic drift

from small head movements during a viewing.

After calibration, subjects were instructed to look at a target in the center of the

screen and click the mouse to view the first picture when ready. On the user’s click, the

image was presented for 8 seconds, and eye movements were recorded. After this, the

target reappeared for one second. A question then appeared. The subject clicked on a

radio button to select their response, clicked again to go on, and the process repeated.

Subjects normally saw one condition of each of the 50 images. The condition

and order were randomized. While viewing the images, subjects were told to pay

attention so they could answer questions which came after each image. Questions

were divided into two types, the order of which was randomized. Questions asked the

viewer either to rate how much they liked the image on a scale of 1 to 10, or whether

they had already seen the image, or a variant of it, earlier in the experiment. Occasional

duplicate images were inserted randomly when this question was used; data for these

repeated viewings is not included in the analysis. The questions were selected to keep

the viewer’s attention from drifting, while at the same time not giving them specific

instructions which might bias the way they looked at the image.


Figure 9.2: Illustration of data analysis, per image condition. Each colored collection of points is a cluster. Ellipses mark 99% of variance. Large black dots are detail points. We measure the number of clusters, distance between clusters and nearest detail point, and distance between detail points and nearest cluster.

9.3 Analysis

9.3.1 Data Analysis

Analysis draws upon a number of established measures and techniques tailored to our

experiment, to provide complementary evidence about how stylization and abstraction

modify viewing. Some processing is common to all our analysis. First, all eye movement

data is filtered to discard point of regard samples during saccades. We then perform

clustering on the filtered samples [Santella and DeCarlo, 2004b]. The clusters are

not always meaningful, but on the whole they correspond well to features of interest in

the image. There is reason to believe the number of points contained in a cluster may

reveal how important a feature is; this is not considered here.

Our clustering method requires a scale choice. Clusters whose modes are closer

than this scale value will be collapsed together. We select a scale of 25 pixels (roughly

half a degree) for all analysis, which is about the level of tracker noise present. Results

depend on the scale choice used in the clustering process. Clearly, at coarser scales

there will be fewer clusters and a smaller difference between the condition means. We

argue below that this does not affect interpretation of our results.
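To make the role of the scale parameter concrete, the following is a minimal sketch of a generic mean shift clustering of 2-D gaze samples with a Gaussian kernel. It is not the procedure of [Santella and DeCarlo, 2004b]; the kernel, the convergence tolerance, and the function names are assumptions made for illustration.

```python
import numpy as np

def mean_shift_modes(points, scale, n_iter=100, tol=1e-3):
    """Move every point uphill on a Gaussian kernel density estimate."""
    pts = np.asarray(points, dtype=float)
    modes = pts.copy()
    for _ in range(n_iter):
        # Squared distances from each current estimate to every data point.
        d2 = ((modes[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-d2 / (2.0 * scale ** 2))
        new = (w @ pts) / w.sum(axis=1, keepdims=True)
        shift = np.abs(new - modes).max()
        modes = new
        if shift < tol:
            break
    return modes

def cluster_gaze_samples(points, scale):
    """Group samples whose modes lie closer together than the scale value."""
    modes = mean_shift_modes(points, scale)
    centers = []
    labels = np.empty(len(modes), dtype=int)
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < scale:
                labels[i] = j
                break
        else:
            # No existing center is close enough: start a new cluster.
            centers.append(m)
            labels[i] = len(centers) - 1
    return np.array(centers), labels
```

Raising the scale both widens the kernel and relaxes the merging test, so distinct groups of samples collapse into fewer clusters, which is the dependence on scale described above.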


All clustering was conducted in two ways. In the first, which we will refer to as per

viewer analysis, each viewer’s data was clustered separately. In the second analysis,

which we will refer to as per image analysis, data for all viewers of a particular image

was combined before clustering. It is reasonable to think that as one adds data from

individual viewers, the data will approach some hypothetical distribution of image fea-

ture interest [Wooding, 2002]. This second analysis may therefore provide a better

measure of aggregate effects.

Below, we describe the measurements performed using the clusters. See Figure 9.2

for an illustration of the data.

Clusters: Because clusters roughly correspond to areas examined in the image,

we would expect to find fewer clusters in the eye tracking and salience cases if they

succeed in focusing viewer interest. We might also expect uniform simplification to

reduce the number of clusters, because it reduces detail.

Distance (from data to detail points) : In the eye tracking and salience condi-

tions, we wish to measure whether interest is focused on the locations where detail is

preserved. The change in distance from each of the cluster centers to the closest detail

point between conditions tells us how effective the manipulation is in drawing interest

to these locations. If the abstraction is successful, we would expect that clusters will

be closer. This tests the system as a whole. There will be no change in distance if

our hypothesis is wrong, which would mean that varying detail does not attract more

focused interest. It is also possible that in a particular image there was no detail that

could be put in a particular location, because there was none in the original image, or

because it cannot be represented in our system’s visual style.

Distance (from detail points to data): Implicit in the choice of detail points is

the assumption that viewers should look at all of the locations. This is not captured

by the distance measure. A viewer could spend all the time looking at one detail point

yielding a zero distance. To quantify this, it is possible to measure the distance from

each detail point to the closest cluster. A high average value means the locations of


a significant number of detail points were not closely examined. This distance will

decrease in salience and eye tracking conditions if detail modulation makes people

look at high detail areas that were not normally examined.
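The three measurements can be written compactly as nearest-neighbour distances. In this sketch the function names are hypothetical and both distance measures are summarized by their mean over points, which is an assumption for the first measure.

```python
import numpy as np

def nearest_distances(sources, targets):
    """For each source point, the Euclidean distance to its closest target."""
    s = np.asarray(sources, dtype=float)[:, None, :]
    t = np.asarray(targets, dtype=float)[None, :, :]
    return np.sqrt(((s - t) ** 2).sum(axis=2)).min(axis=1)

def detail_point_measures(cluster_centers, detail_points):
    """Number of clusters and the two directed distance measures (in pixels)."""
    return {
        "n_clusters": len(cluster_centers),
        # From data to detail points: how close viewed locations are to detail.
        "data_to_detail": float(nearest_distances(cluster_centers, detail_points).mean()),
        # From detail points to data: whether every detail point was visited.
        "detail_to_data": float(nearest_distances(detail_points, cluster_centers).mean()),
    }

# A viewer who only looks at one of two detail points: the first measure is
# zero, while the second reveals the detail point that was never examined.
m = detail_point_measures([(0, 0)], [(0, 0), (30, 40)])
```

In this example the distance from data to detail points is 0, while the distance from detail points to data averages 25 pixels, because the unvisited detail point lies 50 pixels from the only cluster.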

Figure 9.3: Statistical significance is achieved for number of clusters over a wide range of clustering scales. The magnitude of the effect decreases, but its significance remains quite constant over a wide interval. Our results do not hinge on the scale value selected. (Plot: p value and effect magnitude, the difference between the high detail and eye tracking conditions, against cluster scale from 0 to 200.)

9.3.2 Statistical Analysis

Data for all subjects was clustered as discussed in Section 9.3.1. In total there are 10

eye tracking records for each of the 50 images in each of the 5 conditions for a total

of 2500 individual recordings. More data than this was gathered; a matched number of

recordings for each condition was selected randomly. As noted in Section 9.2.4, data

was recorded in blocks where one of two questions was asked. Analysis showed no

effect of the questions, and these results are based on roughly equal numbers of images

presented in blocks of each question type.

Analyses of variance (ANOVA) are used to test whether differences between conditions

are significant. These tests produce a p value: the probability that a difference at least

as large as the one measured could occur by chance. The per viewer case lends itself

naturally to statistical testing by a two-way repeated measure ANOVA. In this context a two-way ANOVA

separately tests both the contribution that the particular image and the condition make

to the results. This lets one look at the effect of a condition while factoring out the

variation among the different images. A repeated measure analysis treats each viewer’s

eye tracking record as an independent measurement, so there are 10 data points per

image and condition pair. In the per image analysis, the 10 recordings are collapsed

together and data is analyzed instead by a simple two-way ANOVA. There is now only

one data point per image and condition pair, so it is more difficult to show a statistically

significant effect. We want to know not only if some of these conditions are different

from each other, but also which pairs are different. This requires a number of tests.

When performing this kind of analysis there is a concern that, since each test is asso-

ciated with a certain probability that the results could occur by chance, there will be

an unacceptably high cumulative risk that some positive results may occur by chance.

Several approaches exist to deal with this problem. We adopt a common methodology

for minimizing this risk. One test is used to establish that all of the means are not

the same, and only if this test succeeds are pairwise tests performed. This method is

implicit in all pairwise test results reported.
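For the per image analysis, where there is one data point per image and condition pair, the F statistics of a two-way ANOVA without replication can be computed directly from sums of squares. The sketch below is a textbook formulation with illustrative names, not the statistics package used for the reported results; converting F to a p value requires the tail probability of the F distribution, which is left to a statistics library.

```python
import numpy as np

def two_way_anova_no_replication(x):
    """x[i, c] holds one measurement for image i under condition c."""
    x = np.asarray(x, dtype=float)
    n_img, n_cond = x.shape
    grand = x.mean()
    # Variation explained by the condition (columns) and by the image (rows).
    ss_cond = n_img * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_img = n_cond * ((x.mean(axis=1) - grand) ** 2).sum()
    # Whatever is left over is treated as error.
    ss_err = ((x - grand) ** 2).sum() - ss_cond - ss_img
    df_cond, df_img = n_cond - 1, n_img - 1
    df_err = df_cond * df_img
    ms_err = ss_err / df_err
    return {
        "F_condition": (ss_cond / df_cond) / ms_err,
        "F_image": (ss_img / df_img) / ms_err,
        "df": (df_cond, df_img, df_err),
    }

# Tiny worked example: two images (rows) measured under two conditions (columns).
result = two_way_anova_no_replication([[1, 2], [3, 5]])
```

For this toy table the sums of squares are 2.25 for condition, 6.25 for image, and 0.25 for error, giving F = 9 for the condition effect and F = 25 for the image effect, each with one degree of freedom for the effect and one for the error.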

9.4 Results

Figure 9.4 graphs the average results for all measures. The take-away message, quan-

tified below, is that on the whole:

• Eye tracking and salience conditions have fewer clusters than photo and uniform

detail conditions in all analyses. In the per image analysis, eye tracking has fewer

clusters than salience.

• Distance between the viewed locations and the detail points decreased as a result

of modulating detail.

• Distance between detail points and viewed locations showed no change; however

the distances for salience points were significantly higher than those for eye

tracking points.

Figure 9.4: Average results for all analyses per image (data for all viewers of an image is clustered together). Panels: Number of Clusters (conditions: high, eye, sal., photo, low); Distance from Cluster to Detail Point (high, eye, high, sal.; eye tracking and salience detail points); Distance from Detail Point to Cluster (high, eye, high, sal.; eye tracking and salience detail points).

Figure 9.5: Average results for all analyses per viewer (data for each viewing is clustered separately). Panels: Number of Clusters (conditions: high, eye, sal., photo, low); Distance from Cluster to Detail Point (high, eye, high, sal.); Distance from Detail Point to Cluster (high, eye, high, sal.).

9.4.1 Quantitative Results

Clusters: In the per viewer analysis, there was about one fewer cluster in the eye

tracking and salience conditions, compared to the others. This means each viewer

examined one fewer region on average. Analysis showed this difference was significant

(p < .001). There was no significant difference (p > .05) between the photo or uniform

detail conditions, or between eye tracking and salience.

In the per image analysis, eye tracking had about 6 fewer clusters than uniform

detail and photo conditions, while salience had about 3 fewer. Eye tracking differed

significantly from all other conditions including salience (p < .001). Salience differed

from original at p < .01, and from high and low at p < .05.

Distance (from data to detail points): Clusters in the eye tracking condition were

about 20 pixels closer to the eye tracking detail points than high detail clusters, in both

per viewer and per image analysis (p < .0001). Salience clusters were about 10 pixels

closer to salience detail points (per viewer: p < .0001, per image: p < .01). This is

not spatially very large, but it represents a consistent shift of cluster centers towards the

detail points. The magnitudes of the two shifts (10 and 20 pixels) were not significantly

different from each other. For per image analysis, distances measured to eye tracking

detail points were significantly higher (p < .01) than corresponding distances to salience

points.

Distance (from detail points to data): There was no significant change (p > .05)

for salience or eye tracking renderings in either analysis. In both analyses, however,

the distances were significantly smaller (p < .001) when measured from eye tracking

detail points than from salience detail points (a difference of about 40 in the per viewer

and 10 in the per image condition).
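The two distance measures are not symmetric, which is why they can disagree. As a minimal sketch (the function name and the toy coordinates are illustrative assumptions, not taken from the analysis software used in this study), each measure can be computed as a mean nearest-neighbor distance:

```python
import math

def mean_nearest_distance(sources, targets):
    """Average, over sources, of the distance to the closest target."""
    return sum(min(math.dist(s, t) for t in targets) for s in sources) / len(sources)

# Hypothetical pixel coordinates, for illustration only.
cluster_centers = [(100, 100), (110, 100)]   # where viewers looked
detail_points = [(100, 100), (400, 300)]     # where detail was retained

# "From data to detail points": are viewed locations near some detail point?
data_to_detail = mean_nearest_distance(cluster_centers, detail_points)

# "From detail points to data": is every detail point near some viewed location?
detail_to_data = mean_nearest_distance(detail_points, cluster_centers)
```

In this toy case both clusters sit near the first detail point, so the first measure is small, while the unvisited second detail point inflates the second measure.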

All of the two-way ANOVAs tested the significance of both the experimental condition,

and the particular image. In all tests, the effect of the image was highly significant

(p < .001). This is neither surprising nor particularly informative. It simply states that

individual images have varying numbers of interesting features and they are distributed

differently in relation to the detail points.

As mentioned above, all of this analysis used clusters created with a particular choice of scale. Figure 9.3 shows that results do not depend on this choice. The difference between mean number of clusters in the high detail and eye tracking conditions (per viewer analysis) is plotted along with the corresponding p value. Though the magnitude of the difference varies, p values show an effect of approximately equal significance over a range of scales. The effects we have shown are therefore not due to the particular scale selected.

9.4.2 Discussion

These results provide evidence that local detail modulation does change the way viewers examine an image. Eye tracking and salience renderings each had significantly fewer clusters than all uniform detail images in both per image and per viewer analysis (significance is stronger in the per viewer analysis, but that is to be expected based on the number of samples). Distances to detail points also showed an improvement for both salience and eye tracking renderings. This indicates that not only were fewer places examined, but the examined points were closer to the detail points. Distances from detail points to data did not show improvement. This indicates that though interest was concentrated by the manipulation in places with detail, it did not bring new interest to detail points that were not already interesting in the high detail renderings. Viewers look more at detailed locations when other areas have been simplified, but this did not benefit all locations equally. Rather, locations that were already somewhat interesting received increased interest. Results do not prove enhanced or facilitated understanding per se; however, this is strongly suggested by the more focused pattern of looking.

Results also indicate that although improvement can be seen with detail modulation based on both eye tracking and salience, the two behave differently. Modulation based on either produces fewer clusters of interest, and decreased distance to detail points. However, in the per image analysis, the number of clusters for the eye tracking condition was significantly lower than for the salience condition. Also, the distances measured from salience points were consistently higher than those from eye tracking points; this is further evidence that eye tracking points are more closely examined. Distance to detail points shows the opposite relationship (though more weakly) and argues against this conclusion. However, we show below that this is almost certainly due to the number of detail points, and is not meaningful.

These results fit our intuition that the locations a viewer examined will, in general, be a better predictor of future viewing than a salience model, which has no sense of the meaning of image contents. There is considerable controversy in the human vision literature about how much of eye movement behavior can be accounted for by low level feature salience. Some optimistically state that salience predictions correlate well with real eye motions [Privitera and Stark, 2000, Parkhurst et al., 2002]. Others are more doubtful and claim that when measured more carefully and in the context of a goal driven activity, the correlation is quite poor [Turano et al., 2003, Land et al., 1999]. Our results show salience points (at least those produced by the algorithm used) are less interesting in general. Abstraction does attract increased interest to salience points: people look nearer to some of them, but still at fewer of them overall.

It is not clear at first glance that distance values measured against the eye tracking and salience detail points can legitimately be compared to each other in order to judge if they are functionally equivalent. Differences may be due to the number and distribution of the two kinds of detail points, rather than their locations relative to features in the images and hence the locations of fixation data collected. One very obvious difference clouds comparison of fixation and salience detail points. The salience algorithm produces more fixations than real viewers, so there were typically more detail points in the salience case (10.9 per image on average) than in the eye tracking case (5.96 on average). This would seem to bias distances to salience detail points toward lower values, possibly bias distances from salience points toward higher values, and make comparison difficult. This is not as bad as it might seem because there is usually a fair amount of redundancy in the salience detail points. Multiple points lie close to each other, so twice as many points does not mean twice as many actual locations. Still, this complicates quantitative interpretation. In fact, we do see that distances to detail points are higher overall for fixation detail points. If interpreted as reflecting on the fixation data, this would suggest the strange idea that salience points are examined more closely.

Fortunately, some simple controls indicate that the lower distances to salience points are an unimportant artifact, while the higher distances from salience detail points are meaningful. Replacing recorded fixation data with random points allows one to test if a particular effect is due to the relationship between detail points and data, or if detail points alone drive the effect. If an effect disappears when random data is substituted for recorded fixations, it was driven by data. If it persists, the detail points drive the effect.

Replacing all fixation data for all viewers with uniform random points eliminates all effects in distances from detail points. The location of fixation data drives this difference. In contrast, when assessed using this random fixation data, distances to detail points are still significantly higher for eye tracking and lower for salience detail points. This difference is at least partly driven by the detail points themselves. The most likely cause is that there are more of them in the salience condition: more points in a confined space will mean a shorter distance to the nearest one.
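The effect of point count alone can be illustrated with a small simulation. This is only a sketch under stated assumptions: the image size, the seed, the number of random data points, and the choice to form the smaller set by keeping the first six points are all hypothetical, and the counts 11 and 6 merely approximate the reported averages of 10.9 and 5.96.

```python
import math
import random

def mean_nearest_distance(sources, targets):
    """Average, over sources, of the distance to the closest target."""
    return sum(min(math.dist(s, t) for t in targets) for s in sources) / len(sources)

rng = random.Random(0)       # fixed seed so the sketch is repeatable
WIDTH, HEIGHT = 1024, 768    # hypothetical image size in pixels

def random_points(n):
    return [(rng.uniform(0, WIDTH), rng.uniform(0, HEIGHT)) for _ in range(n)]

many_detail_points = random_points(11)       # about the salience average
few_detail_points = many_detail_points[:6]   # about the eye tracking average
random_data = random_points(1000)            # stand-in for recorded fixations

# Distances *to* detail points, measured from meaningless random data.
dist_to_many = mean_nearest_distance(random_data, many_detail_points)
dist_to_few = mean_nearest_distance(random_data, few_detail_points)
```

Because the smaller set is a subset of the larger one, the distance to the nearest of the eleven points can never exceed the distance to the nearest of the six, whatever the data. Any difference here therefore comes from the number of detail points, not from where viewers looked.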

A second control adds evidence that the higher number of detail points in the salience case is responsible for the overall lower distance to salience detail points. We can discard some detail points in the salience case so that the number of detail points is equal across conditions. We would expect effects driven by the number of points to disappear in this case. When this is done, there is no qualitative change in the magnitude of distances from detail points. Distances to salience detail points, however, become significantly higher than those to eye track points: a reversal. The main effect of our experiment (the decreasing distance resulting from detail modulation) is not affected by this. These two controls provide fairly strong evidence that eye tracking detail points really are more examined than salience points overall.

The second control could have been done before creating renderings and recording data. This would have provided a more complete match between both conditions. It would, however, have provided the salience algorithm with some extra information that is not usually available: the number of fixations a viewer made, which is a rough guide to how many important features are present.

In contrast to the changes caused by abstraction, there is little evidence that the style manipulation alone produces a significant change in viewing. There is no significant difference between the photo and high detail images in number of clusters. A qualitative comparison of fixation scatter plots in these two conditions also suggests the distribution of points in both is largely similar. There are however some large differences in the effect on individual images. In some images large areas of low contrast texture are removed by the stylization itself. In these cases, viewer interest is different between the high detail and photo conditions (see Figure 9.6 for an example). Removing prominent but low contrast texture is abstraction, but it is abstraction over which one has no control. Rather, it is built implicitly into the system (in this case into the segmentation technique). The opposite effect can also occur: the style can attract attention to less noticeable features. Notice in Figure 9.6 how drawing ripples on the water in black has attracted the eye to them. A method for quantifying when and where these effects occur is a topic for future research. These appear to be primarily low-level effects, so work on salience [Itti et al., 1998] may provide a good starting point.

Interestingly, our results also indicate the number of regions of interest in an image is not primarily driven by detail. It is surprising how much the pattern of interest in the low detail case qualitatively and quantitatively resembles the high detail. The highly detailed and highly simplified renderings have the same number of clusters while the mixed detail images, in which the number of regions lies between these extremes, have fewer. This implies it is locally increased detail that attracts the eye. Too much or too little detail everywhere leads to a broader dispersion of interest. Locally high detail seems to attract the eye; globally low detail in particular appears to produce scattered fixations. The distribution of fixations is similar overall, but groups of fixations are smaller and more scattered (clusters in the low detail case in fact represent lower total dwell times on average than in the other conditions). This pattern is suggestive of a (failed) search for interesting content. Substantiating and quantifying this is an interesting subject for future research and has direct application in designing future NPR and visualization systems. It would, for example, be interesting to see how detail relates to the time course of viewing and how behavior might vary with longer or shorter viewing times.

In summary:

• viewers look at fewer locations in images simplified using eye tracking and

salience data,

• these locations tend to be near locations where detail is locally retained,

• neither the NPR style itself nor application of uniform simplification modifies the number of locations examined, and

• this effect exists for both eye tracking and salience detail points, but there is less

interest overall in salience points.

These results might seem like exactly what one would expect, given the use of abstraction in art. This was not the only possible outcome however. One could imagine that abstraction performed in this semi-automatic manner would simply not work. Simplifying arbitrary areas of an image might confuse viewers, and they might, for example, spend all their time looking at background regions obscured by simplification, trying to figure out what is there. This bears a certain resemblance to behavior in the low detail images. But when detail is retained in some locations and removed elsewhere, viewers seem to get the point, and explore the detailed areas.

9.5 Evaluation Conclusion

These results validate our attempt to focus interest by manipulating image detail using

eye tracking data. Results also have broader implications for those designing NPR systems,

using salience maps in graphics, and designing future experimental evaluations

of NPR systems.

Our results show meaningful abstraction is important for effective NPR. Abstraction that does not carry any meaning is implicit in many NPR styles; for example, there are no shading cues in a pure line drawing. Uniform control of detail is also common in NPR systems. These are important considerations. But both were tested in this study and produced no change in the number of locations viewers examined. In contrast, meaningful abstraction clearly affected viewers in a way that supports an interpretation of enhanced understanding. Directed meaningful abstraction should be considered seriously in designing future NPR systems.

Similarly, although low level (salience map) and high level (eye track) detail points behave similarly in their ability to capture increased interest, they differ in their absolute capture of interest. The increased capture of interest seems to be a low level effect; people don’t bother looking where there isn’t anything informative. However, semantic factors are also active. The locations that interest another person are influenced by image meaning and are a better predictor than salience of where future viewers will look.

This has implications for the use of salience maps in graphics. It would be highly desirable to automatically locate places viewers will look in a number of applications, not the least of which would be automatic abstraction in NPR and visualization. Though salience points behave similarly to eye tracking points in part of our analysis, results indicate that on the whole salience is not suitable for this purpose. It can be successful in adaptive rendering applications [Yee et al., 2001], where it is only necessary that people be somewhat more likely to look at selected locations. Salience does provide information about how likely the structural qualities of a feature are to attract interest. However, we want to encourage a later viewer to get the same content from an image that an earlier, perhaps more experienced viewer examined closely. Current salience map algorithms are hardly expert viewers. This kind of application requires better predictions, motivated by semantic information that salience is generally unlikely to provide.

In addition, eye tracking may be a useful technique for evaluations of other NPR

systems. It provides an alternative to questionnaires and task performance measures.

Even when a task based method is possible, eye tracking can be useful in investigating

what features underlie the performance observed. Information is, after all, extracted

from the imagery by looking, and the large body of research available suggests that

locations examined indeed reveal the information being used to complete a task.

The experiments performed by Gooch et al. [2004] can serve to illustrate this. If users can perform a task better using an NPR drawing of a face rather than a photograph, it is valuable to see where clusters of visual interest occur in the two conditions. This information may explain performance. It could focus future experiments and inform design choices about rendering faces, without exhaustive experimental testing. Similarly, in evaluation of assembly diagrams [Agrawala et al., 2003], eye tracking can provide very specific information about how people use such instructions. For example, eye tracking could help further explain the way users interleave actions in the world and examination of the relevant part of the instructions. Eye tracking records are directly and obviously related to the imagery that evokes them. This makes them very interpretable, a desirable quality in any measurement. In turn, this guards against the danger [Kosara et al., 2003] of performing a user evaluation that ultimately doesn’t yield any useful result.

Figure 9.6: Original photo and high detail NPR image with viewers’ filtered eye tracking data. Though we found no global effect across these image types, there are sometimes significantly different viewing patterns, as can be seen here.

Our experimental results quantify our intuition that our technique can focus interest on areas highlighted with increased detail in abstracted images. Loosely interpreted, our results could even be looked at as an experimental confirmation of the widely held informal theory [Graham, 1970] that art functions in part by carefully guiding viewers’ eyes through an image. Our results also resonate with findings from the literature on the psychology of art [Locher, 1996] that suggest viewers spend more time in long fixations in the somewhat vaguely defined category of ‘well balanced’ images, while fewer long fixations occur in ‘unbalanced’ images. This convergence of work from different fields is encouraging.

Chapter 10

Future Work

We have demonstrated the effectiveness of our approach to artistic abstraction. This

success encourages investigation in a number of related areas. These include improvements

and extensions to image processing and representation techniques, better perceptual

models to control abstraction, and application of these and related techniques

to practical problems.

10.1 Image Modeling

The models of image features used in our work are drawn from the state of the art,

but this is constantly changing. As image processing and understanding techniques

improve, these can be incorporated into a richer model of image contents. Interesting

developments are possible in a number of areas.

10.1.1 Segmentation

Our model of image regions could be extended in a number of ways. Our segmentation

technique is limited by a piecewise constant color model of image regions. A segmentation

technique that could model a segment as a region of uniform texture or smooth

variation would better represent meaningful areas of the image. Once able to capture

coherent textured areas, how to abstractly render them becomes an interesting question.

Simply rendering them in a mean or median color is possible. More meaningful

textural abstraction presents an interesting challenge. A natural question to ask is what

features of a texture make it look like what it is. The ability to create NPR versions of

textures from images could be applied in 3D as well as image based NPR.

The inability of the segmenter to capture shading on smoothly varying regions is also problematic. Ideally, one would like the boundaries of regions created by shading to be clearly distinguishable from other regions, and to smoothly follow isocontours of the image. Instead, mean shift segmentation tends to produce patchy, jigsaw-puzzle-like regions (see Figure 10.1). If segmentation parameters are changed to create a coarser segmentation, the result collapses an entire area of gradient, leaving a number of small island regions of greater variation dotted throughout.

Figure 10.1: A rendering from our line drawing system (b) can be compared to an alternate locally varying segmentation (c). This segmentation more closely follows the shape of shading contours.

It also seems that, when available, fixation information itself should be able to

provide an extra guide to the segmentation process. A fixation gives a fairly strong clue

that some important fine scale feature exists in its vicinity. This information should be

of some benefit to segmentation.

We have made some initial experiments addressing these issues. We begin by creating a segmentation using an alternate segmentation technique that tends to follow isocontours [Montanvert et al., 1991]. This method iteratively merges regions based on a class label derived from color information. We also use fixation data to locally control the color threshold used in the segmentation. The contrast threshold used to decide whether two regions should merge is calculated using the same contrast sensitivity model applied in our region and line drawing system. The result of this is a single segmentation that displays locally varying resolution, with smaller regions being preserved where a viewer looked. This achieves a kind of abstraction very similar to the renderings in our colored line drawing style (see Figures 10.1 and 10.2). Note that shading region boundaries in these images follow much more natural curves.

Figure 10.2: Locally varying segmentation cannot replace a segmentation hierarchy. Another example of a locally varying segmentation controlled by a perceptual model (c), compared to a rendering from our line drawing system. Note fine detail in the brick preserved near the subject’s head in (c). This is a consequence of the threshold varying continuously as a function of distance from the fixations on the face.
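The idea of a fixation-dependent merge threshold can be sketched on a single row of gray values. This is a toy illustration, not the method of Montanvert et al. or our implementation: the linear `local_threshold` function is a hypothetical stand-in for the contrast sensitivity model, merging is greedy on mean intensity, and all parameter values are assumptions.

```python
def local_threshold(distance, base=4.0, slope=0.5):
    """Hypothetical stand-in for the perceptual model: the contrast needed
    to keep two regions separate grows with distance from a fixation."""
    return base + slope * distance

def segment_row(values, fixations):
    """Greedily merge adjacent regions whose mean difference is below the
    local threshold. Returns (start, end) pixel index pairs."""
    regions = [[i, i, float(v)] for i, v in enumerate(values)]  # start, end, sum
    while True:
        best = None
        for k in range(len(regions) - 1):
            a, b = regions[k], regions[k + 1]
            mean_a = a[2] / (a[1] - a[0] + 1)
            mean_b = b[2] / (b[1] - b[0] + 1)
            contrast = abs(mean_a - mean_b)
            boundary = a[1] + 0.5
            dist = min(abs(boundary - f) for f in fixations)
            if contrast < local_threshold(dist) and (best is None or contrast < best[0]):
                best = (contrast, k)
        if best is None:
            return [(r[0], r[1]) for r in regions]
        k = best[1]
        a, b = regions[k], regions[k + 1]
        regions[k:k + 2] = [[a[0], b[1], a[2] + b[2]]]

# A brick-like texture: blocks of 4 pixels alternating between 100 and 110.
row = ([100] * 4 + [110] * 4) * 6
segments = segment_row(row, fixations=[6])
```

With a fixation at pixel 6, the two blocks nearest the fixation survive as separate regions while the rest of the texture collapses into one region. The detail kept next to the fixation, whether or not it matters, is the same behavior seen in the brick of Figure 10.2.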

This technique has limitations as a form of abstraction. Detail is determined locally,

so we again see detail preserved near important features, like the bricks near

the figure’s head in Figure 10.2. Though currently this technique only creates a single

segmentation, it could be easily extended to create a hierarchy that would allow us to

modulate detail discontinuously. Even when creating a hierarchy, it may be useful to

segment important areas more finely. Indeed, even if our goal is not abstraction, but

rather segmentation for its own sake, the locally varying resolution of such segmentations

might be useful.

Various additional data could also be added to our segmentation. Items such as cars

and people can be identified [Torralba et al., 2004] and these labels could be added to

sets of regions in our representation. Knowing what an object is would aid more informed

abstraction. This contrasts with our system, which views everything as a collection

of blobs. Such information could also be used in automatic attempts to modify the segmentation

tree to better reflect object structure. If some set of regions at the finest level

is identified as, say, a car, the segmentation hierarchy can be modified to ensure that

these regions form a subtree that does not merge with the background until the whole

car is reduced to one region.

This style of abstracted rendering has been extended to video [Wang et al., 2004]. However, abstraction was performed largely by hand selecting groups of 3D space-time regions. More fully automatic methods would require a 3D analogue of our 2D hierarchical segmentation. Such a blob hierarchy could be created at fairly high computational expense by repeated mean shift with successively larger kernels. A careful iterative merging of regions could potentially create similar results at much less cost.

10.1.2 Edges

Our edge representation is another interesting area for improvement. Edges in our

current results are a weak point. They are detected at only one scale and tend to be

very broken in appearance, with an excessive number of small textural edges scattered

about. Very short edges are discarded. This filtering makes fine detail impossible to

capture with lines. In addition, because edges are detected at only one scale, we know

nothing about the range of frequencies an edge exists at and so are pressed into the

questionable decision of using edge length as a size measure.

Detecting edges at multiple scales is an obvious next step. There are several ways

this might be done. One would be to create a hierarchical edge representation similar

to our region hierarchy. Some work has been done on this problem. Edges have been

detected at multiple scales and correspondences made across scales to trace coarse

scale edges to their fine scale causes [Hong et al., 1982]. Such approaches seek to

achieve both of the normally conflicting goals of robust detection of features at all

scales, and fine localization of feature position. This work could be built on to represent

all edges in an image as a collection of tree structures of connected edges.

A more modern and popular approach to deal with multi-scale edges is scale selection. Only a single set of edges is detected, but the scale at which they are detected varies locally [Lindeberg, 1998]. Conceptually, one searches everywhere for edges at a range of scales and picks the scale with the maximal edge response. This approach does not consider the tracing of coarse scale features to fine scale. But this is not particularly necessary. Ideally at least, features detected at coarse scales actually exist at a coarse scale, and finer scales in those locations will only contain noise. This approach provides a more complete, continuous set of edges. It also provides important additional information: for each point on each edge there is a corresponding contrast and scale value. The availability of this information suggests the use of more interesting perceptual models in making decisions about edge inclusion.

Figure 10.3: A rendering from our line drawing system demonstrates how long but unimportant edges can be inappropriately emphasized. Also, prominent lower frequency edges like creases in clothing are detected in fragments and filtered out because edges are detected at only one scale.
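The "pick the scale with the maximal edge response" step can be sketched in one dimension. The following is an illustrative assumption-laden example, not an implementation of any particular detector: the edge is an idealized Gaussian-blurred step, the response is a scale-normalized first derivative, and the normalization exponent `gamma = 0.5` and the candidate scale set are choices made for this sketch.

```python
import math

def blurred_step(x, edge_sigma):
    """Idealized edge: a unit step blurred by a Gaussian of width edge_sigma."""
    return 0.5 * (1.0 + math.erf(x / (edge_sigma * math.sqrt(2.0))))

def gaussian_derivative(x, sigma):
    """First derivative of a unit-area Gaussian kernel."""
    g = math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))
    return -x / (sigma * sigma) * g

def normalized_edge_response(edge_sigma, sigma, gamma=0.5):
    """Scale-normalized gradient magnitude at the edge center (x = 0)."""
    radius = int(math.ceil(5 * sigma))
    # Discrete convolution of the signal with the derivative kernel at x = 0.
    response = sum(blurred_step(-k, edge_sigma) * gaussian_derivative(k, sigma)
                   for k in range(-radius, radius + 1))
    t = sigma * sigma                      # scale expressed as variance
    return (t ** (gamma / 2.0)) * abs(response)

def select_scale(edge_sigma, sigmas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Return the candidate scale with the strongest normalized response."""
    return max(sigmas, key=lambda s: normalized_edge_response(edge_sigma, s))

sharp_edge_scale = select_scale(edge_sigma=2.0)   # e.g. an object boundary
soft_edge_scale = select_scale(edge_sigma=8.0)    # e.g. a soft shadow
```

Under these assumptions the selected scale tracks the blur of the edge, so a soft shadow boundary is assigned a coarser scale than a crisp one. This per-edge scale is the extra information a perceptual model could use.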

10.2 Perceptual Models

Like representation of edges, decisions about edge inclusion are a weak point in our current approach. Currently, an acuity model uses edge length as a proxy for frequency. This succeeds in removing shorter edges in unimportant regions, but is poorly motivated perceptually and produces some unintuitive artifacts. This can be seen, for example, in Figure 10.3, where unimportant edges in the background are inappropriately included because of their great length. Edges in our system are, in fact, filtered not once but three times: first by the hysteresis threshold used in the original edge detection scheme, second by a global length threshold, and only then by our perceptual model.

If scale selection is used, we would have for each point on each edge a frequency estimate as well as a contrast measure at that scale. This would allow us to use a contrast sensitivity model to judge edge inclusion. A decision could be made at each point along each edge, or a single scale and contrast could be assigned to the whole of each candidate edge, perhaps using the median value. As we currently do with regions, we could then plug frequency into the model and receive a contrast threshold that can be compared to the measured contrast along the edge.
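A minimal sketch of such a per-edge decision follows. It is not the model used in our system: the Mannos and Sakrison contrast sensitivity function is swapped in as a convenient closed form, the `peak_threshold` constant is hypothetical, frequencies are assumed to be in cycles per degree, and the dependence on distance from fixation is omitted.

```python
import math
from statistics import median

def csf_mannos_sakrison(freq_cpd):
    """Relative contrast sensitivity at a spatial frequency (cycles/degree)."""
    return 2.6 * (0.0192 + 0.114 * freq_cpd) * math.exp(-((0.114 * freq_cpd) ** 1.1))

def contrast_threshold(freq_cpd, peak_threshold=0.02):
    """Lowest visible contrast at this frequency; peak_threshold is a
    hypothetical threshold at the frequency of best sensitivity."""
    return peak_threshold / csf_mannos_sakrison(freq_cpd)

def include_edge(point_freqs, point_contrasts):
    """Assign the median frequency and contrast to the whole edge, then
    keep the edge only if its contrast exceeds the model's threshold."""
    freq = median(point_freqs)
    contrast = median(point_contrasts)
    return contrast > contrast_threshold(freq)

# Two hypothetical edges with similar contrast but different scale.
crisp_edge_kept = include_edge([8, 8, 9], [0.05, 0.05, 0.06])
soft_shadow_kept = include_edge([0.5, 0.5, 0.4], [0.05, 0.05, 0.05])
```

With these illustrative numbers the mid-frequency edge passes while the very low frequency edge of equal contrast does not. A decision could equally be made point by point by calling `contrast_threshold` on each sample.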

Recall that in applying contrast sensitivity models to regions, some modifications

were made to avoid the unintuitive effect of very large scale regions having lower visibilities.

This was loosely justified by the properties of square wave gratings. For

multi-scale edges the unmodified model makes sense. Very coarse scale edges, such as

the edge of a soft shadow, are in fact less visually prominent and would be correctly

judged as less worthy of inclusion than somewhat higher frequency edges of similar

contrast. A model like this could take over all judgments about what constitutes significant

edges. Such an approach could be used to intuitively filter detected edges outside

of NPR. Scale selection detects a very large number of low contrast high frequency

edges. A variety of strength measures have been used to filter them out. A model

like this would provide a perceptually motivated metric, as well as a way of creating

locally varying thresholds based on viewer input. A complete approach to edge extraction

and filtering would require higher level effects like grouping and completion, but

perceptual metrics like this could be an interesting first step.

Similarly, there is room for improvement in perceptual models of region visibility.

The next step is less clear here. Better psychophysical models of color contrast sensitivity

could be applied if available. Better methods of measuring contrast between

regions and their surround would also be useful. Our current approach takes into account

only the mean color of each region. It is easy to think up examples where this

provides a poor measure of the distinguishability of two regions.

A method that measured contrast using the color histograms of two regions would

likely be an improvement. Taking into account both the interior of the regions and the

characteristics of their shared border [Lillestaeter, 1993] could distinguish object and

shading boundaries. Alternatively, we could reduce region visibility to the visibility of

boundaries between two regions. This could be done using the contrast and frequency

of a best fitting edge along the border between them. Since we are measuring not the

size of a region, but the frequency of the boundary between two regions, an unmodified

contrast sensitivity model is again appropriate. Region boundaries due to slow shading

changes would have appropriately low contrasts, low frequencies and therefore low

visibility. In another alternative, the whole range of frequencies and corresponding

contrasts present on the border could be looked at, and visibility could be based on

the most visible among these. This kind of perceptually driven scale selection might

produce some interesting effects.
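The weakness of mean color as a contrast measure, and the histogram alternative mentioned above, can be seen in a small example. This sketch uses gray values instead of color for brevity; the bin count and the choice of total variation distance are assumptions made for illustration, not a proposal for the specific measure to use.

```python
def mean_contrast(region_a, region_b):
    """Contrast as the difference of mean values (our current approach,
    reduced to gray values)."""
    return abs(sum(region_a) / len(region_a) - sum(region_b) / len(region_b))

def histogram(values, bins=8, max_value=256):
    """Normalized histogram of integer values in [0, max_value)."""
    counts = [0] * bins
    for v in values:
        counts[min(v * bins // max_value, bins - 1)] += 1
    return [c / len(values) for c in counts]

def histogram_distance(region_a, region_b, bins=8):
    """Total variation distance between histograms: 0 (same) to 1 (disjoint)."""
    ha, hb = histogram(region_a, bins), histogram(region_b, bins)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))

flat_region = [128] * 100          # uniform mid gray
striped_region = [64, 192] * 50    # dark and light stripes, same mean

by_mean = mean_contrast(flat_region, striped_region)
by_histogram = histogram_distance(flat_region, striped_region)
```

The two regions are plainly distinguishable, yet their means are identical. The mean-based measure reports no contrast while the histogram-based one reports the maximum.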

All of these additions and modifications could lead to more interesting and expressive

imagery. We have shown that the abstraction embodied in these images can

communicate what a viewer found important and provide an effective guide to future

viewers. A component that remains missing in our argument that this methodology

will be useful in visualization is a demonstration of its benefits in a practical task.

10.3 Applications

The presence of similar abstraction in many technical and practical illustrations encourages us that there are many applications of these techniques in visualization and illustration. A practical problem has been choosing a domain in which to test our method. Our approach lends itself to illustrative rather than exploratory applications, since the methodology requires that someone know what is important, so their fixations can be used to clarify the point for successive viewers. The domains where this might be most useful present some challenges for our current image analysis. Medical images, for example, tend to be low contrast, noisy and difficult to segment with general-purpose segmentation techniques. Photographs of technical apparatus such as a car engine (see Figure 10.4) present their own difficulties. Though clean man-made edges are generally easier to segment, these images are very crowded and often poorly lit. In these circumstances, segmentations fail to respect object structure in a way that can be confusing. Extra sources of information, such as sets of photos taken with flashes in different locations, have been used to ease image analysis in situations like this [Raskar et al., 2004]. Despite these technical challenges, we feel confident that these methods of abstraction will be useful for illustration in a number of domains.

Figure 10.4: Attempting technical illustration of mechanical parts pushes our image analysis techniques close to (if not over) their limits.

These applications are not limited to photo abstraction. Similar kinds of abstraction

can be performed in 3D scenes. This removes the difficulties of image analysis, though

it presents a number of new challenges. Beyond textural indication in line drawings,

abstraction in 3D scenes has received relatively little attention. Perceptual metrics like

those we present could provide an interesting basis for a general framework of 3D

abstraction.

Chapter 11

Conclusion

Our goal was to create images that capture some of the expressive omission of art.

Several kinds of such images have been presented. These methods are motivated by

artistic practice and current models of human visual perception. Such images have

been experimentally shown to create a difference in the way viewers look at images.

This suggests our method has the ability to direct a viewer's gaze, or at least focus

interest in particular areas. We therefore believe that these techniques are applicable

not only to art but also to wider problems of graphical illustration and visualization.

Rather than just a test of our system, our experiments can be seen as empirical

validation on controlled stimuli of the general idea that artists direct viewers' gaze through

detail modulation. Our success also provides an experimental confirmation of sorts for

the hypothesis [Zeki, 1999] that at least part of the appeal of great art lies in the artist's

careful control of detail, enticing the viewer with information, while not overwhelming

them with irrelevant detail. This balance serves to engage the viewer, leaving them free

to ponder an image’s meaning, without the burden of having to decipher its contents.

Detail modulation in illustration and art is a complex topic which we have only

begun to investigate. The work presented here has already inspired related approaches

from other researchers to problems in cropping [Suh et al., 2003] and fluid visualization

[Watanabe et al., 2004]. Detail modulation is only part of visual artistry, one

of the many techniques available. Color, contrast, shape, and a host of higher-level

concerns are manipulated in art and play a part in well-designed images. All of these

techniques have some cognitive motivation. Understanding this perceptual basis is

an important guide in creating effective automatic instantiations of these techniques.

Continuing investigation of the role and functioning of abstraction in its many forms,

especially through building new quantitative models, should yield new ways to create

easily understood illustration.

The work presented here suggests a number of general insights to guide future

investigation:

• An understanding of the cognitive processing involved in human understanding

of an image or stimulus is important for effective stylized and abstracted illustration.

• The importance of some user input in our system highlights the fact that current

automatic techniques cannot replace the semantic knowledge of a human viewer.

People can perform abstraction but so far computers cannot in the domain of

general images.

• The fact that eye tracking is sufficient for some level of abstraction in our con-

text makes an interesting point. It suggests that the understanding underlying

abstraction, and perhaps other artistic judgments, is not some mysterious abil-

ity of a visionary few, but a basic visual competence. Though not everyone can

draw, everyone, it seems, can control abstraction in a computer rendering.

• Eye tracking is a useful tool for understanding in this context. It is useful not only

as a minimal form of interaction, but also as a cognitive measure for evaluation

and for understanding what features are attended and hence may be critical to

processing.

• In a perceptually motivated framework, experimentation is useful not only to

evaluate or validate a final system, but also to investigate and, if possible, build

quantitative models of perception as it relates to questions of interest. Work in

psychology and cognitive science can provide a framework for understanding a

problem, as well as general methodologies. Sometimes models applicable to

a specific problem are also available. However, these do not always address

questions in the way most useful to those building applied systems. This creates

a need for a cyclical process of cognitive investigation and system engineering

to build more effective systems for visual communication.

These considerations suggest a future path for research in NPR that diverges some-

what from traditional areas of investigation, but holds the promise of a consistent intel-

lectual underpinning for an expanding field, as well, of course, as the promise of more

expressive and perhaps ultimately artistic computer-generated imagery.

References

[Adelson, 2001] Adelson (2001). On seeing stuff: the perception of materials by humans and machines. Proceedings of the SPIE, 4299:1–12.

[Agrawala et al., 2003] Agrawala, M., Phan, D., Heiser, J., Haymaker, J., Klingner, J., Hanrahan, P., and Tversky, B. (2003). Designing effective step-by-step assembly instructions. In Proceedings of ACM SIGGRAPH 2003, pages 828–837.

[Agrawala and Stolte, 2001] Agrawala, M. and Stolte, C. (2001). Rendering effective route maps: improving usability through generalization. In Proceedings of ACM SIGGRAPH 2001, pages 241–249.

[Ahuja, 1996] Ahuja, N. (1996). A transform for multiscale image segmentation by integrated edge and region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12):1211–1235.

[Arnheim, 1988] Arnheim, R. (1988). The Power of the Center. University of California Press.

[Bangham et al., 1998] Bangham, J., Hidalgo, J. R., Harvey, R., and Cawley, G. (1998). The segmentation of images via scale-space trees. In Proceedings of British Machine Vision Conference, pages 33–43.

[Baxter et al., 2001] Baxter, B., Scheib, V., and Lin, M. (2001). Dab: interactive haptic painting with 3d virtual brushes. Proceedings of ACM SIGGRAPH 2001, pages 403–421.

[Campbell and Robson, 1968] Campbell, F. and Robson, J. (1968). Application of Fourier analysis to the visibility of gratings. Journal of Physiology, 197:551–566.

[Cater et al., 2003] Cater, K., Chalmers, A., and Ward, G. (2003). Detail to attention: Exploiting visual tasks for selective rendering. In Proceedings of the Eurographics Symposium on Rendering, pages 270–280.

[Chen et al., 2002] Chen, L., Xie, X., Fan, X., Ma, W., Shang, H., and Zhou, H. (2002). A visual attention model for adapting images on small displays. MSR-TR-2002-125, Microsoft Research, Redmond, WA.

[Christoudias et al., 2002] Christoudias, C., Georgescu, B., and Meer, P. (2002). Synergism in low level vision. In Proceedings ICPR 2002, pages 150–155.

[Collomosse and Hall, 2003] Collomosse, J. P. and Hall, P. M. (2003). Genetic painting: a salience adaptive relaxation technique for painterly rendering. Technical Report CSBU2003-02, Dept. of Computer Science, University of Bath.

[Crowe and Narayanan, 2000] Crowe, E. C. and Narayanan, N. H. (2000). Comparing interfaces based on what users watch and do. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2000, pages 29–36.

[Curtis et al., 1997] Curtis, C. J., Anderson, S. E., Seims, J. E., Fleischer, K. W., and Salesin, D. H. (1997). Computer-generated watercolor. In Proceedings of ACM SIGGRAPH 97, pages 421–430.

[DeCarlo et al., 2003] DeCarlo, D., Finkelstein, A., Rusinkiewicz, S., and Santella, A. (2003). Suggestive contours for conveying shape. In Proceedings of ACM SIGGRAPH 2003.

[DeCarlo and Santella, 2002] DeCarlo, D. and Santella, A. (2002). Stylization and abstraction of photographs. In Proceedings of ACM SIGGRAPH 2002, pages 769–776.

[Deussen and Strothotte, 2000] Deussen, O. and Strothotte, T. (2000). Computer-generated pen-and-ink illustration of trees. In Proceedings of ACM SIGGRAPH 2000, pages 13–18.

[Duchowski, 2000] Duchowski, A. (2000). Acuity-matching resolution degradation through wavelet coefficient scaling. IEEE Trans. on Image Processing, 9(8):1437–1440.

[Durand et al., 2001] Durand, F., Ostromoukhov, V., Miller, M., Duranleau, F., and Dorsey, J. (2001). Decoupling strokes and high-level attributes for interactive traditional drawing. In Proceedings of the 12th Eurographics Workshop on Rendering, pages 71–82.

[Fleming et al., 2003] Fleming, R. W., Dror, O. R., and Adelson, E. (2003). Real-world illumination and the perception of surface reflectance properties. Journal of Vision, 3:347–368.

[Goldberg et al., 2002] Goldberg, J. H., Stimson, M. J., Lewenstein, M., Scott, N., and Wichansky, A. M. (2002). Eye tracking in web search tasks: design implications. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2002, pages 51–58.

[Gombrich et al., 1970] Gombrich, E. H., Hochberg, J., and Black, M. (1970). Art, Perception, and Reality. Johns Hopkins University Press.

[Gooch and Willemsen, 2002] Gooch, A. A. and Willemsen, P. (2002). Evaluating space perception in NPR immersive environments. In Proceedings of the Second International Symposium on Non-photorealistic Animation and Rendering (NPAR), pages 105–110.

[Gooch and Gooch, 2001] Gooch, B. and Gooch, A. (2001). Non-Photorealistic Rendering. A K Peters.

[Gooch et al., 2004] Gooch, B., Reinhard, E., and Gooch, A. (2004). Human facial illustration: Creation and psychophysical evaluation. ACM Transactions on Graphics, 23:27–44.

[Grabli et al., 2004] Grabli, S., Durand, F., and Sillion, F. (2004). Density measure for line-drawing simplification. In Proceedings of Pacific Graphics.

[Graham, 1970] Graham, D. (1970). Composing Pictures. Van Nostrand Reinhold.

[Haeberli, 1990] Haeberli, P. (1990). Paint by numbers: Abstract image representations. In Proceedings of ACM SIGGRAPH 90, pages 207–214.

[Hays and Essa, 2004] Hays, J. H. and Essa, I. (2004). Image and video-based painterly animation. In Proceedings of the Third International Symposium on Non-photorealistic Animation and Rendering (NPAR), pages 113–120.

[Heiser et al., 2004] Heiser, J., Phan, D., Agrawala, M., Tversky, B., and Hanrahan, P. (2004). Identification and validation of cognitive design principles for automated generation of assembly instructions. In Advanced Visual Interfaces, pages 311–319.

[Henderson and Hollingworth, 1998] Henderson, J. M. and Hollingworth, A. (1998). Eye movements during scene viewing: An overview. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, pages 269–293. Elsevier Science Ltd.

[Hertzmann, 1998] Hertzmann, A. (1998). Painterly rendering with curved brush strokes of multiple sizes. In Proceedings of ACM SIGGRAPH 98, pages 453–460.

[Hertzmann, 2001] Hertzmann, A. (2001). Paint by relaxation. In Computer Graphics International, pages 47–54.

[Hong et al., 1982] Hong, T.-H., Shneier, M., and Rosenfeld, A. (1982). Border extraction using linked edge pyramids. IEEE Transactions on Systems, Man and Cybernetics, 12:660–668.

[Interrante, 1996] Interrante, V. (1996). Illustrating Transparency: communicating the 3D shape of layered transparent surfaces via texture. PhD thesis, University of North Carolina.

[Itti and Koch, 2000] Itti, L. and Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489–1506.

[Itti et al., 1998] Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1254–1259.

[Jacob, 1993] Jacob, R. J. (1993). Eye-movement-based human-computer interaction techniques: Toward non-command interfaces. In Hartson, H. and Hix, D., editors, Advances in Human-Computer Interaction, Volume 4, pages 151–190. Ablex Publishing.

[Just and Carpenter, 1976] Just, M. A. and Carpenter, P. A. (1976). Eye fixations and cognitive processes. Cognitive Psychology, 8:441–480.

[Kalnins et al., 2002] Kalnins, R. D., Markosian, L., Meier, B. J., Kowalski, M. A., Lee, J. C., Davidson, P. L., Webb, M., Hughes, J. F., and Finkelstein, A. (2002). WYSIWYG NPR: Drawing strokes directly on 3D models. In Proceedings of ACM SIGGRAPH 2002, pages 755–762.

[Kelly, 1984] Kelly, D. (1984). Retinal inhomogeneity: I. spatiotemporal contrast sensitivity. Journal of the Optical Society of America A, 74(1):107–113.

[Koenderink and van Doorn, 1979] Koenderink, J. and van Doorn, A. (1979). The structure of two dimensional scalar fields with applications to vision. Biological Cybernetics, 30:151–158.

[Koenderink, 1984] Koenderink, J. J. (1984). What does the occluding contour tell us about solid shape? Perception, 13:321–330.

[Koenderink et al., 1978] Koenderink, J. J., M.A. Bouman, A. B. d. M., and Slappendel, S. (1978). Perimetry of contrast detection thresholds of moving spatial sine wave patterns. II. the far peripheral visual field (eccentricity 0-50). Journal of the Optical Society of America A, 68(6):850–854.

[Koenderink and van Doorn, 1999] Koenderink, J. J. and van Doorn, A. (1999). The structure of locally orderless images. International Journal of Computer Vision, 31(2/3):159–168.

[Kosara et al., 2003] Kosara, R., Healey, C., Interrante, V., Laidlaw, D., and Ware, C. (2003). User studies: Why, how and when? IEEE Computer Graphics and Applications, 23(4):20–25.

[Kowalski et al., 1999] Kowalski, M. A., Markosian, L., Northrup, J. D., Bourdev, L., Barzel, R., Holden, L. S., and Hughes, J. (1999). Art-based rendering of fur, grass, and trees. In Proceedings of ACM SIGGRAPH 99, pages 433–438.

[Kowler, 1990] Kowler, E. (1990). The role of visual and cognitive processes in the control of eye movements. In Kowler, E., editor, Eye Movements and Their Role in Visual and Cognitive Processes, pages 1–70. Elsevier Science Ltd.

[Land et al., 1999] Land, M., Mennie, N., and Rusted, J. (1999). The roles of vision and eye movements in the control of activities of daily living. Perception, 28:1311–1328.

[Leyton, 1992] Leyton, M. (1992). Symmetry, causality, mind. MIT Press.

[Lillestaeter, 1993] Lillestaeter, O. (1993). Complex contrast, a definition for structured targets and backgrounds. Journal of the Optical Society of America, 10(12):2453–2457.

[Lindeberg, 1998] Lindeberg, T. (1998). Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision, 30(2):117–154.

[Litwinowicz, 1997] Litwinowicz, P. (1997). Processing images and video for an impressionist effect. In Proceedings of ACM SIGGRAPH 97, pages 407–414.

[Locher, 1996] Locher, P. J. (1996). The contribution of eye-movement research to an understanding of the nature of pictorial balance perception: a review of the literature. Empirical Studies of the Arts, 14(2):146–163.

[Mackworth and Morandi, 1967] Mackworth, N. and Morandi, A. (1967). The gaze selects informative details within pictures. Perception and Psychophysics, 2:547–552.

[Mannos and Sakrison, 1974] Mannos, J. L. and Sakrison, D. J. (1974). The effects of a visual fidelity criterion on the encoding of images. IEEE Trans. on Information Theory, 20(4):525–536.

[Markosian et al., 1997] Markosian, L., Kowalski, M. A., Trychin, S. J., Bourdev, L. D., Goldstein, D., and Hughes, J. F. (1997). Real-time nonphotorealistic rendering. In Proceedings of ACM SIGGRAPH 97, pages 415–420.

[Marr, 1982] Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman, San Francisco.

[Meer and Georgescu, 2001] Meer, P. and Georgescu, B. (2001). Edge detection with embedded confidence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12):1351–1365.

[Mello-Thoms et al., 2002] Mello-Thoms, C., Nodine, C. F., and Kundel, H. L. (2002). What attracts the eye to the location of missed and reported breast cancers? In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2002, pages 111–117.

[Montanvert et al., 1991] Montanvert, Meer, P., and Rosenfeld, A. (1991). Hierarchical image analysis using irregular tessellations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):307–316.

[Mulligan, 2002] Mulligan, J. B. (2002). A software-based eye tracking system for the study of air traffic displays. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2002, pages 69–76.

[Niessen, 1997] Niessen, W. (1997). Nonlinear multiscale representations for image segmentation. Computer Vision and Image Understanding, 66(2):233–245.

[Parkhurst et al., 2002] Parkhurst, D., Law, K., and Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42:107–123.

[Perona and Malik, 1990] Perona, P. and Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639.

[Privitera and Stark, 2000] Privitera, C. M. and Stark, L. W. (2000). Algorithms for defining visual regions-of-interest: Comparison with eye fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):970–982.

[Ramachandran and Hirstein, 1999] Ramachandran, V. S. and Hirstein, W. (1999). The science of art. Journal of Consciousness Studies, 6(6-7).

[Raskar et al., 2004] Raskar, R., Tan, K.-H., Feris, R., Yu, J., and Turk, M. (2004). Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. In Proceedings of ACM SIGGRAPH 2004, pages 679–688.

[Reddy, 1997] Reddy, M. (1997). Perceptually Modulated Level of Detail for Virtual Environments. PhD thesis, University of Edinburgh.

[Reddy, 2001] Reddy, M. (2001). Perceptually optimized 3D graphics. IEEE Computer Graphics and Applications, 21(5):68–75.

[Regan, 2000] Regan, D. (2000). Human Perception of Objects: Early Visual Processing of Spatial Form Defined by Luminance, Color, Texture, Motion and Binocular Disparity. Sinauer.

[Rosenholtz, 1999] Rosenholtz, R. (1999). A simple saliency model predicts a number of motion popout phenomena. Vision Research, 39:3157–3163.

[Rosenholtz, 2001] Rosenholtz, R. (2001). Search asymmetries? what search asymmetries? Perception and Psychophysics, 63:476–489.

[Rovamo and Virsu, 1979] Rovamo, J. and Virsu, V. (1979). An estimation and application of the human cortical magnification factor. Experimental Brain Research, 37:495–510.

[Ruskin, 1857] Ruskin, J. (1857). The Elements of Drawing. Smith, Elder and Co.

[Ruskin, 1858] Ruskin, J. (1858). Address at the opening of the Cambridge school of art.

[Ryan and Schwartz, 1956] Ryan, T. A. and Schwartz, C. B. (1956). Speed of perception as a function of mode of representation. American Journal of Psychology, pages 60–69.

[Saito and Takahashi, 1990] Saito, T. and Takahashi, T. (1990). Comprehensible rendering of 3-D shapes. In Proceedings of ACM SIGGRAPH 90, pages 197–206.

[Salisbury et al., 1994] Salisbury, M. P., Anderson, S. E., Barzel, R., and Salesin, D. H. (1994). Interactive pen-and-ink illustration. In Proceedings of ACM SIGGRAPH 94, pages 101–108.

[Salvucci and Anderson, 2001] Salvucci, D. and Anderson, J. (2001). Automated eye-movement protocol analysis. Human-Computer Interaction, 16:39–86.

[Santella and DeCarlo, 2002] Santella, A. and DeCarlo, D. (2002). Abstracted painterly renderings using eye-tracking data. In Proceedings of the Second International Symposium on Non-photorealistic Animation and Rendering (NPAR), pages 75–82.

[Santella and DeCarlo, 2004a] Santella, A. and DeCarlo, D. (2004a). Eye tracking and visual interest: An evaluation and manifesto. In Proceedings of the Third International Symposium on Non-photorealistic Animation and Rendering (NPAR), pages 71–78.

[Santella and DeCarlo, 2004b] Santella, A. and DeCarlo, D. (2004b). Robust clustering of eye movement recordings for quantification of visual interest. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2004.

[Schumann et al., 1996] Schumann, J., Strothotte, T., and Laser, S. (1996). Assessing the effect of non-photorealistic rendering images in computer-aided design. In ACM Human Factors in Computing Systems, SIGCHI, pages 35–41.

[Setlur et al., 2004] Setlur, V., Takagi, S., Raskar, R., Gleicher, M., and Gooch, B. (2004).

[Shapiro and Stockman, 2001] Shapiro, L. and Stockman, G. (2001). Computer Vision. Prentice-Hall.

[Shiraishi and Yamaguchi, 2000] Shiraishi, M. and Yamaguchi, Y. (2000). An algorithm for automatic painterly rendering based on local source image approximation. In Proceedings of the First International Symposium on Non-photorealistic Animation and Rendering (NPAR), pages 53–58.

[Sibert and Jacob, 2000] Sibert, L. E. and Jacob, R. J. K. (2000). Evaluation of eye gaze interaction. In Proceedings CHI 2000, pages 281–288.

[Suh et al., 2003] Suh, B., Ling, H., Bederson, B. B., and Jacobs, D. W. (2003). Automatic thumbnail cropping and its effectiveness. ACM Conference on User Interface and Software Technology (UIST 2003), pages 95–104.

[Torralba et al., 2004] Torralba, A., Murphy, K., and Freeman, W. (2004). Contextual models for object detection using boosted random fields. In Adv. in Neural Information Processing Systems.

[Tufte, 1990] Tufte, E. R. (1990). Envisioning Information. Graphics Press.

[Turano et al., 2003] Turano, K. A., Geruschat, D. R., and Baker, F. H. (2003). Oculomotor strategies for the direction of gaze tested with a real-world activity. Vision Research, 43:333–346.

[Underwood and Radach, 1998] Underwood, G. and Radach, R. (1998). Eye guidance and visual information processing: Reading, visual search, picture perception and driving. In Underwood, G., editor, Eye Guidance in Reading and Scene Perception, pages 1–27. Elsevier Science Ltd.

[Vertegaal, 1999] Vertegaal, R. (1999). The gaze groupware system: Mediating joint attention in multiparty communication and collaboration. In Proceedings CHI ’99, pages 294–301.

[Walker et al., 1998] Walker, K. N., Cootes, T. F., and Taylor, C. J. (1998). Locating salient object features. In Proceedings BMVC, 2:557–567.

[Wandell, 1995] Wandell, B. A. (1995). Foundations of Vision. Sinauer Associates Inc.

[Wang et al., 2004] Wang, J., Xu, Y., Shun, H.-Y., and Cohen, M. (2004). Video tooning. In Proceedings of ACM SIGGRAPH 2004, pages 574–583.

[Watanabe et al., 2004] Watanabe, D., Mao, X., Ono, K., and Imamiya, A. (2004). Gaze-directed streamline seeding. In APGV 2004.

[Winkenbach and Salesin, 1994] Winkenbach, G. and Salesin, D. H. (1994). Computer-generated pen-and-ink illustration. In Proceedings of ACM SIGGRAPH 94, pages 91–100.

[Witkin, 1983] Witkin, A. (1983). Scale-space filtering. pages 1019–1021.

[Wooding, 2002] Wooding, D. S. (2002). Fixation maps: quantifying eye-movement traces. In Proceedings of the Eye Tracking Research and Applications (ETRA) Symposium 2002, pages 31–36.

[Yarbus, 1967] Yarbus, A. L. (1967). Eye Movements and Vision. Plenum Press.

[Yee et al., 2001] Yee, H., Pattanaik, S. N., and Greenberg, D. P. (2001). Spatio-temporal sensitivity and visual attention in dynamic environments. ACM Transactions on Graphics, 29:39–65.

[Zeki, 1999] Zeki, S. (1999). Inner Vision: An Exploration of Art and the Brain. Oxford University Press.

Curriculum Vita

Anthony Santella

2005 Ph.D. in Computer Science, Certificate in Cognitive Science from Rutgers University

1999 B.A. in Computer Science from New York University

2001-2004 Research Assistant, The VILLAGE, Department of Computer Science, Rutgers University

1999-2001 Teaching Assistant, Department of Computer Science, Rutgers University

Publications

A. Santella and D. DeCarlo, “Visual Interest and NPR: an Evaluation and Manifesto”. In Proceedings of the Third International Symposium on Non-Photorealistic Animation and Rendering (NPAR) 2004, pp 71-78

A. Santella and D. DeCarlo, “Robust Clustering of Eye Movement Recordings for Quantification of Visual Interest”. In Proceedings of the Third Eye Tracking Research and Applications (ETRA) 2004, pp 27-34

D. DeCarlo, A. Finkelstein, S. Rusinkiewicz and A. Santella, “Suggestive Contours for Conveying Shape”. In ACM Transactions on Graphics, 22(3) (SIGGRAPH 2003 Proceedings), pp 848-855

D. DeCarlo and A. Santella, “Stylization and Abstraction of Photographs”. In ACM Transactions on Graphics, 21(3) (SIGGRAPH 2002 Proceedings), pp 769-776

A. Santella and D. DeCarlo, “Abstracted Painterly Renderings Using Eye-tracking Data”. In Proceedings of the Second International Symposium on Non-Photorealistic Animation and Rendering (NPAR) 2002, pp 75-82
