Art and Visual Perception Thesis


THE ART OF SEEING: VISUAL PERCEPTION IN DESIGN AND EVALUATION OF

NON-PHOTOREALISTIC RENDERING

BY ANTHONY SANTELLA

A Dissertation submitted to the

Graduate School–New Brunswick

Rutgers, The State University of New Jersey

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

Graduate Program in Computer Science

Written under the direction of

Doug DeCarlo

and approved by

New Brunswick, New Jersey

May, 2005



ABSTRACT OF THE DISSERTATION

The Art of Seeing: Visual Perception in Design and

Evaluation of Non-Photorealistic Rendering

by Anthony Santella

Dissertation Director: Doug DeCarlo

Visual displays such as art and illustration benefit from concise presentation of information. We present several approaches for simplifying photographs to create such concise, artistically abstracted images. The difficulty of abstraction lies in selecting what is important. These approaches apply models of human vision, models of image structure, and new methods of interaction to select important content. Important locations are identified from eye movement recordings. Using a perceptual model, features are then preserved where the viewer looked, and removed elsewhere. Several visual styles using this method are presented. The perceptual motivation for these techniques makes predictions about how they should affect viewers. In this context, we validate our approach using experiments that measure eye movements over these images. Results also provide some interesting insights into artistic abstraction and human visual perception.


Acknowledgements

Thanks go to the many people whose help and support were essential in making this work possible. None of this would have happened without my advisor Doug DeCarlo.

Thanks go also to my other committee members: Adam Finkelstein, Eileen Kowler,

Casimir Kulikowski and Peter Meer for their advice and encouragement at various (in

some cases many) stages of this process.

Thanks go also to the many friends and family members who have supported and

kept me sane through this long process. I wouldn't have survived it without my parents

and brothers Nick and Dennis. Special thanks go to Bethany Weber. Thanks also to

Jim Housell, all the old NYU crowd, the grad group at St. Peter's and all the supportive

souls in the CS Department, RuCCS and the VILLAGE.

Finally, thanks go to Phillip Greenspun for photos used in several renderings that

appear in chapters 7 and 9, as well as models Marybeth Thomas, Adeline Yeo and

Franco Figliozzi. Special thanks to Georgio Dellachiesa for looking equally thoughtful

in countless illustrative examples.


Table of Contents

Abstract
Acknowledgements
List of Figures

1. Introduction
1.1. Inspirations
1.1.1. Artistic Practice
1.1.2. Psychology
1.1.3. Computer Graphics
1.2. Our Goal

2. Abstraction in Computer Graphics
2.1. Manual Annotation
2.2. Automatic Methods
2.3. Level Of Detail

3. Human Vision
3.1. Eye Movements
3.1.1. Eye Movement Control
3.1.2. Salience Models
3.2. Eye Tracking
3.3. Limits of Vision
3.3.1. Models of Sensitivity
3.3.2. Sensitivity Away from the Visual Center
3.3.3. Applicability to Natural Imagery


4. Vision and Image Processing
4.1. Image Structure Features and Representation
4.2. Segmentation
4.3. Edge Detection

5. Our Approach
5.1. Eye tracking as Interaction
5.2. Using Visibility for Abstraction

6. Painterly Rendering
6.1. Image Structure
6.2. Applying the Limits of Vision
6.3. Rendering
6.4. Results

7. Colored Drawings
7.1. Feature Representation
7.1.1. Segmentation
7.1.2. Edges
7.2. Perceptual Model
7.3. Rendering
7.4. Results

8. Photorealistic Abstraction
8.1. Image Structure
8.2. Measuring Importance
8.3. Results and Discussion

9. Evaluation
9.1. Evaluation of NPR
9.1.1. Analysis of Eye Movement Data
9.2. Experiment
9.2.1. Stimuli
9.2.2. Subjects
9.2.3. Physical Setup
9.2.4. Calibration and Presentation
9.3. Analysis
9.3.1. Data Analysis
9.3.2. Statistical Analysis
9.4. Results
9.4.1. Quantitative Results
9.4.2. Discussion
9.5. Evaluation Conclusion

10. Future Work
10.1. Image Modeling
10.1.1. Segmentation
10.1.2. Edges
10.2. Perceptual Models
10.3. Applications

11. Conclusion

References
Curriculum Vita


List of Figures

1.1. (a) Henri de Toulouse-Lautrec's Moulin Rouge: La Goulue (Lithographic print in four colors, 1891). (b) Odd Nerdrum's Self-portrait as Baby (Oil, 2000). Artists control detail as well as other features such as color and texture to focus a viewer on important features and create a mood. La Goulue's swirling under-dress is a highly detailed focal point of the image, and contributes to the picture's air of reckless excitement. Artists have a fair amount of latitude in how they allocate detail to create an effect. Nerdrum renders his eyes (usually one of the most prominent features in a portrait) in a sfumato style that makes them almost nonexistent. Detail is instead allocated to the child's prophetic gesture. These choices change a common baby picture into something mysterious and unsettling.

1.2. Judith Schaechter's Corona Borealis (Stained glass, 2001). Skillful artists use the formal properties and constraints of a medium for expressive purposes. The high dynamic range provided by transmitted light and the heavy black outlines of the lead caming that holds the glass together are used to set the figure off from the background, creating a powerful image of joy in isolation.

2.1. Direct placement of strokes. Complete control of abstraction is possible when a user provides actual strokes that are rendered in a given style. Reproduced from [Durand et al, 2001].

2.2. Manual annotation for textural indication. Important edges on a 3D model are marked and have texture rendered near them, while it is omitted in the interior. Reproduced from [Winkenbach and Salesin, 1994].


4.1. (a) Scale space of one dimensional signal. Features disappear through scale space but no new features appear. (b) Plot of inflection points of another one dimensional signal through scale space. Reproduced from [Witkin 1983].

4.2. Interval tree for 1D signal illustrating decomposition of the signal into a hierarchy. Reproduced from [Witkin 1983].

5.1. (a) Computing eccentricities with respect to a particular fixation at $p$. (b) A simple attention model defined as a piecewise-linear function for determining the scaling factor $a_i$ for fixation $f_i$ based on its duration $t_i$. Very brief fixations (below $t_{\min}$) are ignored, with a ramping up (at $t_{\max}$) to a maximum level of $a_{\max}$.

6.1. Painterly rendering results. The first column shows the fixations made by a viewer. Circles are fixations, size is proportional to duration, the bar at the lower left is the diameter that corresponds to one second. The second column illustrates the painterly renderings built based on that fixation data.

6.2. Detail in background adjacent to important features can be inappropriately emphasized. The main subject has a halo of detailed shutter slats.

6.3. Sampling strokes from an anisotropic scale space avoids giving the image an overall blurred look, but produces a somewhat jagged look in background areas.

6.4. Color and contrast manipulation. Side by side comparison of rendering with and without color and contrast manipulation (precise stroke placement varies between the two images due to randomness).

7.1. Slices through several successive levels of a hierarchical segmentation tree generated using our method.

7.2. Line drawing style results.


7.3. Stylistic decisions. Lines in isolation (a) are largely uninteresting. Unsmoothed regions (b) can look jagged. Smoothed regions (c) have a somewhat vague and bloated look without the black edges superimposed.

7.4. Renderings with uniform high and low detail.

7.5. Several derivative styles of the same line drawing transformation. (a) Fully colored, (b) color comic, (c) black and white comic.

8.1. Mean shift filtering tends to create images that no longer look like photographs.

8.2. Photo abstraction results.

8.3. Photo in (a) is abstracted using fixations in (b) in a variety of different styles. (c) Painterly rendering, (d) line drawing, (e) locally disordered [Koenderink and van Doorn, 1999], (f) blurred, (g) anisotropically blurred.

8.4. (a) Detail of our approach, (b) the same algorithm using an importance map where total dwell is measured locally. Notice in (b) the leaking of detail to the wood texture from the object on the desk. Here differences are relatively subtle; but in general it is preferable to allocate detail in a way that respects region boundaries.

8.5. The range of abstraction possible with this technique is limited. With greater abstraction the scene begins to appear foggy. In some sense it no longer looks like the same scene.

9.1. Example stimuli. Detail points in white are from eye tracking, black detail points are from an automatic salience algorithm.

9.2. Illustration of data analysis, per image condition. Each colored collection of points is a cluster. Ellipses mark 99% of variance. Large black dots are detail points. We measure the number of clusters, distance between clusters and nearest detail point, and distance between detail points and nearest cluster.


9.3. Statistical significance is achieved for number of clusters over a wide range of clustering scales. The magnitude of the effect decreases, but its significance remains quite constant over a wide interval. Our results do not hinge on the scale value selected.

9.4. Average results for all analyses per image.

9.5. Average results for all analyses per viewer.

9.6. Original photo and high detail NPR image with viewers' filtered eye tracking data. Though we found no global effect across these image types, there are sometimes significantly different viewing patterns, as can be seen here.

10.1. A rendering from our line drawing system (b), can be compared to an alternate locally varying segmentation (c). This segmentation more closely follows the shape of shading contours.

10.2. Locally varying segmentation cannot replace a segmentation hierarchy. Another example of a locally varying segmentation controlled by a perceptual model (c), compared to a rendering from our line drawing system. Note fine detail in the brick preserved near the subject's head in (c). This is a consequence of the threshold varying continuously as a function of distance from the fixations on the face.

10.3. A rendering from our line drawing system demonstrates how long but unimportant edges can be inappropriately emphasized. Also, prominent lower frequency edges like creases in clothing are detected in fragments and filtered out because edges are detected at only one scale.

10.4. Attempting technical illustration of mechanical parts pushes our image analysis techniques close to (if not over) their limits.


Chapter 1

Introduction

In all eras and visual styles, artists control the amount of detail in the images they create, both locally and globally. This is not just a technique to limit the effort involved in rendering a scene. It makes a definite statement about what is important and streamlines understanding. Our goal is to largely automate this artistic abstraction in computer renderings. The hope is to remove detail in a meaningful way, while automating individual decisions about what features to include. Eye tracking allows the capture of what a viewer looks at and, indirectly, what they find important. We demonstrate that this information alone is sufficient to control detail in an image-based rendering, and change the way successive viewers look at the resulting image. Our method is grounded in the mechanisms and nature of vision: how we see and understand the world. This is an intuitive idea, if often overlooked. Artists must first be viewers [Ruskin, 1858] and viewers ultimately consume the resulting images. So, vision must be central in the design of algorithms for creating imagery.

Vision appears simple and effortless. Because under most circumstances it requires no conscious effort or exertion, it seems like a trivial operation, something that just happens, as if the light falling on the eye made one see in the same way it warms a stone. But sight is the product of an extraordinarily developed and complicated visual system. In seeing we are all experts, and experts make things seem easy. Without any effort we can navigate and act in the world and recognize objects even under difficult conditions. The abilities of our sight outreach even our awareness of them. Experiments have shown that the eyes of radiologists searching for tumors linger longer over tumors that they fail to notice and report [Mello-Thoms et al., 2002]. The limited success of attempts to mimic these human abilities in computer vision systems highlights both the difficulty of the computations involved, and our phenomenal success at them.



The apparent ease with which we see slips when our vision is stressed: struggling to keep a written page in focus as we fall asleep, searching for a loved one's face in the shifting crowd of an airport. At these times we become conscious of sight as a struggle to organize and make sense of the world. This struggle has continual victories, but also failures. An old friend waving to us on the street is passed by, a typo makes its way into an important document. The apparent ease of vision also masks our limitations. We miss much, and are easily overloaded. Sometimes our failures are engineered: a camouflaged soldier, the proverbial fine print. More often, however, they are accidental. Some information was present, or presented, and we failed to notice it.

Well designed displays of visual information ensure we don't miss anything important by careful arrangement and manipulation. A wide variety of techniques are used to make meaning clear. Detail is put just where it is important, shapes can be changed or removed, colors and textures enhanced or suppressed. Paintings, sketches, technical illustrations, and even the most apparently photorealistic of art, all products of the human hand, have been simplified and manipulated for ease of understanding. Reality is complicated and messy. Rather than realism, what is more often desired is verisimilitude. We want the appearance of reality which has been organized and structured to make its meaning clearer, if necessarily more limited than the infinite complexity of reality.

Achieving this kind of clarity has always been the job of artists and designers who

make subjective, but not arbitrary, decisions about what is important, and how to con-

vey it. The ubiquity of digital media creates a need for automation in achieving this

kind of good design. The goal is not to replace the artist who creates a carefully crafted

one-off display, but instead to create a potentially vast number of adaptive displays,

tailored to particular situations and viewers. This information would otherwise be dis-

played in some less well-designed manner, laying more of a cognitive burden on the

user. It has been argued in fact, that avoiding this burden is one of the primary char-

acteristics of powerful art [Zeki, 1999]. If good design can be formalized, this will



enhance understanding and aid effective communication, as well as improve our own

understanding of the workings of visual communication. This thesis presents some

initial steps toward this goal.

1.1 Inspirations

There are many techniques proposed by various artists, and perhaps even more theory

proposed by various researchers and critics on how to achieve good visual design. Yet

it remains imperfectly understood in all of the fields where it has been studied. Because

of this, a successful practical approach must necessarily draw on elements from many

areas of practice and theory. If a practical system is designed to be as general as

possible, its creation can improve understanding of what visual clarity means, and how

it relates to communication. It can also provide a framework in which to unify concepts

and techniques from many fields.

1.1.1 Artistic Practice

One important source of inspiration for this work is artistic practice and practical the-

ory. Artists have always had strong motivation to capture the attention and interest of

uninterested, sometimes hostile viewers. Much ingenuity has been applied to creating

images that are as gripping and clearly communicative as possible. Careful observation

of such images can yield interesting insights (see Figure 1.1). Similarly, artists have

throughout history given advice on the practice of their craft. Theorists and art histo-

rians have tried to make generalizations and analyze techniques [Ruskin, 1857, Gom-

brich et al., 1970, Graham, 1970, Arnheim, 1988]. This is true in graphic design as

well as fine art. Classical texts like Tufte [1990] try to explore the qualities of good

and bad presentations of information and make generalizations from carefully chosen

examples.

However, these instructions and recommendations are often difficult to apply. They



Figure 1.1: (a) Henri de Toulouse-Lautrec's Moulin Rouge: La Goulue (Lithographic print in four colors, 1891). (b) Odd Nerdrum's Self-portrait as Baby (Oil, 2000). Artists control detail as well as other features such as color and texture to focus a viewer on important features and create a mood. La Goulue's swirling under-dress is a highly detailed focal point of the image, and contributes to the picture's air of reckless excitement. Artists have a fair amount of latitude in how they allocate detail to create an effect. Nerdrum renders his eyes (usually one of the most prominent features in a portrait) in a sfumato style that makes them almost nonexistent. Detail is instead allocated to the child's prophetic gesture. These choices change a common baby picture into something mysterious and unsettling.

are sometimes limited in scope, providing specific instructions for a particular narrow

problem. More often, guidelines are too broad and vague in their application. They

count for their functioning on the judgment of the artist. The advice of artists and

designers often comes in the form of heuristics, rules of thumb to be taken with a grain

of salt, kept in the back of one's mind, and applied when the moment seems right.

Becoming an expert in a visual field is often a question of cultivating, through practice

and observation, an instinctive sense of when to apply such rules, and conversely when

to break them.



Figure 1.2: Judith Schaechter's Corona Borealis (Stained glass, 2001). Skillful artists use the formal properties and constraints of a medium for expressive purposes. The high dynamic range provided by transmitted light and the heavy black outlines of the lead caming that holds the glass together are used to set the figure off from the background, creating a powerful image of joy in isolation.

1.1.2 Psychology

A somewhat different approach is to study good design with the methodologies of psy-

chology, psychophysics and neuroscience. This is in essence an attempt to understand

good design from first principles: the functioning of the human mind and visual system.

Visual perception obviously mediates all information that passes from a display

to a user. So, as a form of visual communication, art must be constrained by the laws

of psychology and the visual system [Arnheim, 1988, Zeki, 1999, Ramachandran and



Hirstein, 1999]. This is an attractive idea. By understanding the strengths and weak-

nesses of the process that allows us to see, it should be possible to maximize use of the

limited cognitive bandwidth between a display and viewer.

This is perhaps not so far from what artists have done all along. One could view

every daub of paint, every pen stroke as an informal experiment in vision. Artists test

their actions against the evidence of their own visual systems, and make predictions

about how they will affect others. Formal attempts to understand perception and art

are simply more conscious, more systematic, and more interested in understanding the

creative process itself than making a statement through it. A number of psychologists

have speculated on this, and pointed to specific examples from art history [Arnheim,

1988, Leyton, 1992, Zeki, 1999, Ramachandran and Hirstein, 1999]. Studies have indeed

found empirical evidence of perceptual effects resulting from artistic style or composition

[Ryan and Schwartz, 1956, Locher, 1996].

Like most attempts to do anything complicated from first principles, looking at art

and design using cognition is hard. There is much that has been understood about the

visual system, but also much that is not. The more basic and low level the area of

visual function is, the more we know about it, and the less useful that information is

for design. Much, for example, is known about the physical mechanism of how we perceive

color; substantially less is known about how we parse shapes out of a background

and assemble them into objects. It's not surprising that many researchers looking at art

from a cognitive standpoint consider primarily 20th century painters, like Mondrian,

Kandinsky, or even Picasso at his more abstract, who themselves were largely con-

cerned with the purely formal aspects of pictorial space rather than the semantics of

subject matter. The semantic aspects of vision which reference the rest of the world

and its non-visual aspects are ill understood, so little cognitive research can be brought

to bear on the semantics of art.

Given the limited basic knowledge, general theories of how art functions cogni-

tively are, almost of necessity, rather vague in their application. Ramachandran [1999]



for example, suggests that all art is guided by the peak shift principle. This principle,

found in a number of situations in psychology, says that if a response is trained to some

stimuli, the greatest, or peak, response will be found with a stimulus that is greater than

the one used in training. A depiction functions by emphasizing the features that normally let one know what it is. In this view all art is a form of caricature. However,

this does not tell us the qualities of a successful caricature. In another example, Leyton

[1992] argues that art maximally encodes a causal history that can be read by viewers.

Good art should contain as much information in the form of asymmetry as possible to

stimulate viewers, but not too much, which will disturb them. Though a reasonable

sounding standard, this only hints at what the correct level of complexity is.

The application of psychology to design is difficult. However, we do not need to

build a system directly on these principles. Inspired by them, we can apply knowledge

from low-level vision and computer graphics techniques to build practical systems.

1.1.3 Computer Graphics

A large body of work in computer graphics ignores all these difficulties and sets out

to create attractive synthetic art and illustration. Attempts at algorithmic definitions of

good design surface in a number of areas in computer science, graphics, scientific

visualization, document layout, human-computer interaction, and interface design.

Concerns of effective art-like visual communication have particularly come to the forefront

in the realm of non-photorealistic rendering, or NPR. This area is perhaps excessively

broad. It includes almost any part of graphics that aims to create images that are not an

imitation of reality. It includes things as diverse as computer generation of geometrical

patterns, instructional diagrams and impressionist paintings. NPR images run a gamut

between the purely ornamental and those designed to convey very specific information.

A large area of research in NPR has been the production of many, often quite impres-

sive, phenomenological models for rendering in various traditional media and styles.

There is however an increasing interest in NPR as not just a way to imitate traditional



visual styles, but also as a set of techniques for trying to display visual information in

a concise and abstract way.

The link between concise presentation and imitating traditional artistic styles is not

accidental. Almost all the visual styles of traditional media, line drawings, wood-block

prints, comics, expressionist or impressionist paintings, pencil sketches, necessarily

discard vast amounts of information as a direct consequence of their visual style. There

is, for example, no color or shading in a pure line drawing. However, these images still

carry the essential content that the artist (and viewer) requires of them. Skillful artists

can use the properties and constraints of a medium to enhance the expressiveness of

a work (see Figure 1.2). A brief time spent working with photo lters in a program

like Adobe Photoshop suggests that computer implementations of these styles capture

some of the effects of traditional media, but often in a way that does not adapt to

particular situations with an artist's flexibility. Artists ultimately can judge their results

as they go. Applying a technique in a blanket manner is often less satisfactory. What

is acceptable as reality in a photograph can look fussy and crowded as a painting.

1.2 Our Goal

Though today's algorithms cannot model the general intelligence of an artist, we argue that carefully designed systems can make use of minimal user interaction to create much more expressive images. Specifically, we look at modulation of local detail, an important cue used in traditional art and visualization. Including detail only where it is needed focuses viewer interest and can help clarify the point of an image. This modulation is a feature of art and illustration, and applications in visualization could benefit from it as well. It would allow the computer to hand-craft displays for clarity and efficient understanding in a particular situation.

This work does not directly address specific visualization applications. Rather than

exploring visualization directly, art remains the focus, and this thesis remains firmly in



the realm of artistic NPR. Our hope, however, is that insights gained in this way should

be applicable to a number of areas in visualization. Art is a particularly good place to

explore the link between cognition and design of displays. Specific applications tend

to distract with their own implementation details and domain constraints. Radiology, for example, is a domain where complexity and high stakes greatly constrain practical

applications. Art encourages a wider view, in which it is easier to look at general

techniques and patterns that are widely useful. Similarly, in evaluation, validation of

a particular system is of limited interest, while evaluation of more general techniques

can provide insights into cognition and be more widely relevant.

Grounding our work in knowledge of visual perception also helps focus attention away from application engineering and towards general concepts. We are interested in

interactively efficient methods for achieving expressive NPR images. Knowledge of

visual perception suggests that by exploiting the visual system we can reserve human

effort for just the hardest parts of the process of crafting images, and pass the major-

ity of the work over to a computer. For a computer application, the hardest part of

abstraction is deciding what is important. This is not hard for people, since it is done

instinctively. Deciding what to paint a picture of is the easy part for an artist. It is the

mechanics of turning that intention into an image that takes training, time and effort.

This leads us to a simple, minimally interactive method for controlling detail via eye

tracking. As we will soon see, vision research leads us to believe that where people

look indicates importance. Such areas should be portrayed in detail. Conversely, what

viewers don't look at is unimportant to them and can be removed or de-emphasized.

The same insights about vision that lead to this methodology also lead us to quantitative

methods for evaluation. If our approach is successful, increased interest in areas

highlighted with detail should be reflected in eye movements. This methodology holds

the promise of images that are carefully crafted for understanding on sound principles,

and can be formally evaluated for effectiveness. Such images and techniques can in

turn serve as a tool for further investigating human vision in a way targeted toward the


questions that are important for crafting images. With more information, even better

techniques and images can be built.

In this thesis we begin in Chapter 2 by laying out the basic problem of controlling

detail in NPR imagery, and look at the range of techniques that have been used to address it. In Chapters 3 and 4 we then review the basic background in human and

computer vision underlying our approach to this problem. The nature of vision leads

us to an approach of capturing the intentionality central to design via eye tracking.

Information about where people look alone is sufficient to control detail in a directed

way, allowing us to craft semi-automatic NPR images with much of the attractive and

engaging intentionality of completely hand made art. The basic nature of this interaction

is described in Chapter 5. In Chapters 6, 7 and 8 we then present several systems

for creating NPR renderings built on this idea, and discuss their strengths and weaknesses.

An evaluation of one of these systems is presented in Chapter 9, which not only

validates the general approach but gives some interesting insights into abstraction and

human vision. Finally, in Chapter 10 we discuss some directions for future research.


Chapter 2

Abstraction in Computer Graphics

In any work of art, not all parts of the picture plane receive equal attention from the

artist. Critical areas are more detailed, while others are left relatively abstract. This is

the case even in quite realistic styles, and in technical illustration. Such effects have not

been ignored in computer graphics and NPR. Local control of detail has been addressed

in several visual styles. Whatever the rendering techniques used, important areas can

be identified and depicted with greater detail, or emphasis on fidelity. Deciding what is

important is difficult to do automatically. Two broad approaches to selecting important

areas can be characterized: manual user annotation, and simple heuristics.

Figure 2.1: Direct placement of strokes. Complete control of abstraction is possible when a user provides actual strokes that are rendered in a given style. Reproduced from [Durand et al., 2001].

2.1 Manual Annotation

At one extreme, near complete control of detail can remain in the hands of a user.

This provides many expressive possibilities at the expense of much interaction. At its


Figure 2.2: Manual annotation for textural indication. Important edges on a 3D model are marked and have texture rendered near them, while it is omitted in the interior. Reproduced from [Winkenbach and Salesin, 1994].

Figure 2.3: Manual local importance images. Hand painted images can indicate important areas to be rendered in greater detail or fidelity. Reproduced from [Hertzmann, 2001].

furthest extreme the computer becomes merely a digital paintbrush the user directly

manipulates [Baxter et al., 2001]. A number of intermediate approaches exist that aid

the user in the technicalities of creating an image while still giving them complete

control over detail. The earliest work creating a painting-like appearance, or painterly

rendering effect [Haeberli, 1990] took this approach. A user places strokes entirely

by hand, their color being sampled from an underlying source image. The approach

is in effect a form of tracing, where the user ultimately remains in control of stroke

placement and size while, like a traditional media artist, making their own decisions

about which details are important as they go. A similar kind of interaction has been

used [Durand et al., 2001] in generating pencil renderings (see Figure 2.1). The user

places strokes which are shaded and shaped automatically to create a final drawing.

The same stroke based interactive methods are applicable in 3D [Kalnins et al., 2002].


One step distant from actually drawing strokes, it is also possible to indicate increased

importance for some areas of a rendering using an importance map, where

higher intensity indicates the need for more attention or detail in that area. For example,

in a painterly rendering framework [Hertzmann, 2001], a hand drawn importance map was used to indicate that a source image should be more closely approximated in

certain locations (see Figure 2.3). Similarly, in 3D [Winkenbach and Salesin, 1994],

hand drawn lines have been used to indicate locations near which textural detail should

be included (see Figure 2.2). In another painterly rendering application [Gooch and

Willemsen, 2002] rectangles to be painted in greater detail could be drawn by hand.

Various digital versions of other media, such as pen and ink [Salisbury et al., 1994]

and watercolor [Curtis et al., 1997] have been developed that provide the user with

significant control over the detail present in different areas. Such approaches can yield

attractive results, but require careful attention on the part of a user.
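
As a minimal sketch of how an importance map might drive detail, assume (hypothetically) that importance is stored as a grid of values in [0, 1] and that detail is controlled through brush radius. The function name `stroke_radius` and the radius limits below are illustrative assumptions; they are not taken from any of the systems cited above.

```python
def stroke_radius(importance, r_min=2.0, r_max=16.0):
    """Map importance in [0, 1] to a brush radius in pixels.

    Higher importance gives a smaller brush and therefore finer detail.
    r_min and r_max are hypothetical limits chosen for illustration.
    """
    t = min(1.0, max(0.0, importance))  # clamp out-of-range values
    return r_max - t * (r_max - r_min)


# A tiny hand-painted importance map: the center matters most.
importance_map = [
    [0.0, 0.2, 0.0],
    [0.2, 1.0, 0.2],
    [0.0, 0.2, 0.0],
]

# Per-cell brush radius derived from the importance map.
radius_map = [[stroke_radius(v) for v in row] for row in importance_map]
```

Here the central cell receives the smallest brush and the corners the largest. A real system would substitute its own notion of detail (stroke length, error tolerance, and so on) for the radius.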

Figure 2.4: (a) Original image. (b) Corresponding salience map [Itti et al., 1998]. (c) Corresponding salience map [Itti and Koch, 2000]. Salience methods pick out potentially important areas on the basis of contrast in some space (not limited to intensity). The two methods pictured here differ in the method of normalization used to enhance contrast between salient and nonsalient regions.

2.2 Automatic Methods

More common in NPR have been purely automatic methods. Automatic methods also

run a gamut, from approaches that process an image in a completely local, uniform

manner to those that automatically extract some quantity from an image as a proxy for


importance. Uniform approaches perform some (not necessarily local) operation uniformly

across an image, and have been used extensively in painterly rendering [Hertzmann,

1998, Litwinowicz, 1997, Shiraishi and Yamaguchi, 2000]. A global effect provides

users with only limited control. Rather than being truly uniform, some of these approaches make a (largely implicit) simple assumption that some low level features

are important and worth preserving. Automatic painterly rendering methods for ex-

ample, largely assume strong high frequency features are important and should be

preserved in a rendering. In fact, painterly techniques vary largely in their method

for respecting these boundaries: aligning strokes perpendicular to the image gradi-

ent [Haeberli, 1990], terminating strokes at edges [Litwinowicz, 1997], or drawing in

a coarse-to-fine fashion [Hertzmann, 1998, Shiraishi and Yamaguchi, 2000, Hays and

Essa, 2004]. Similarly, automatic line drawing approaches (both 2D and 3D) assume

the importance of all lines that meet certain purely geometrical definitions: occluding

contours, creases [Saito and Takahashi, 1990, Interrante, 1996, Markosian et al., 1997],

and suggestive contours [DeCarlo et al., 2003]. Such techniques can create attractive

images, but lack the selective omission which gives art much of its expressive power.

The kind of omission commonly used in depicting specific objects can sometimes be explicitly stated. In drawing trees, for example [Kowalski et al., 1999, Deussen and

Strothotte, 2000], you can avoid drawing detail in the center of the tree, especially as the

tree is drawn smaller. Though this may be an accurate characterization of a particular

common style of depiction, it is not generally applicable to any subject.

For general images, there are relatively few options for automatically selecting

important areas. Some attempts have been made to predict importance using various

image analysis techniques. In 3D, image pyramids have been applied to omit detail in

the interior of a shape [Grabli et al., 2004]. In 2D, drawing on vision research, some

approaches have attempted to use salience measures to capture importance. Salience

measures are a guess at the ability of a feature to capture interest based on its low level

properties [Itti et al., 1998, Itti and Koch, 2000]. Similarly motivated salience measures


have been applied to attempt to predict features worth preserving in painterly rendering

[Collomosse and Hall, 2003]. Because faces are often an important component of

images, detecting them also provides a useful (though not always reliable) automatic

cue for what areas are important. Face detection has been used alongside salience methods in other areas of graphics loosely related to NPR where identifying important

features is useful, such as automatic cropping [Chen et al., 2002, Suh et al., 2003] and

recomposing of photographs [Setlur et al., 2004].

2.3 Level Of Detail

An area of computer graphics left out in the above discussion has dealt with many of

these same issues. Various adaptive rendering and level of detail (LOD) schemes have

used the visibility or potential interest of features to skip computations that are unlikely

to be noticed. This is different from our goal. We are interested in detail modulation for

stylistic and expressive reasons. Level of detail seeks to control the computational cost

of rendering through approximation, not abstraction. Though both are concerned with

simplification, LOD and various other corner cutting is usually meant to be invisible,

or nearly so, while expressive abstraction is meant to be seen and indeed have a strong

effect on the way a viewer looks at an image. Though the goals are different, some

of the methodologies overlap. The goal of imperceptible omission has encouraged

researchers to look at perceptually motivated methods. Salience measures have been

applied to concentrate computation on noticeable areas [Yee et al., 2001, Cater et al.,

2003]. In addition, a variety of low level perceptual models have been applied to try

to quantify the visibility of features and guarantee that simplification is invisible, or

minimize visibility. We adopt several of these metrics in our own efforts. One of

our contributions can be seen as applying and expanding perceptual models originally

adopted in LOD to create expressive artistic abstraction.

Both perceptually motivated LOD methods and the methods we present in this


thesis use models of vision to identify expendable areas of an image. It is the functional

definition of an expendable area that differs between the two. In the following chapter

we present the relevant background in human vision necessary for understanding why

such areas exist, and how they may be identied.


Chapter 3

Human Vision

A background in human vision is essential in computationally defining artistic abstraction.

We have extraordinarily complex abilities to analyze images, and these abilities have

weaknesses and strengths. Level of detail simplification methods seek to exploit the

limits of vision to cut corners in an unnoticeable way. In contrast, we hope to use the

related strengths of the visual system to improve visual design, clarifying content and

making things that need to pop out, pop out. Our interactive technique uses eye movements

and the limits of vision to indirectly measure the importance of features. Some

background will clarify the motivation for this approach.

3.1 Eye Movements

The human eye is maximally sensitive over a relatively small central area called the

macula. This area of relatively high resolution is approximately 5 degrees across, while

the most sensitive region (the fovea) is only 1.3 degrees (from a total visual angle of

about 160 degrees) [Wandell, 1995]. Sensitivity rapidly degrades outside of this central

region. Our perception of uniform detail throughout space is a result of continually

switching the point at which our eyes are looking (the point of regard or POR).
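
To give a sense of scale, the following sketch converts these visual angles to extents on a flat display. The viewing distance and pixel density are hypothetical values chosen for illustration, not measurements from our setup, and the function name is our own.

```python
import math


def extent_on_screen(angle_deg, distance_cm):
    """Width (cm) covered on a flat screen by a visual angle centered on the line of sight."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


# Hypothetical viewing setup: 60 cm from a display with 40 pixels per cm.
distance_cm = 60.0
pixels_per_cm = 40.0

fovea_cm = extent_on_screen(1.3, distance_cm)   # most sensitive region
macula_cm = extent_on_screen(5.0, distance_cm)  # region of relatively high resolution
fovea_px = fovea_cm * pixels_per_cm
```

Under these assumed viewing conditions the fovea covers roughly 1.4 cm of the screen (about 54 pixels) and the macula about 5.2 cm, a small fraction of a typical display.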

This process involves two important types of eye motions: fixations, relatively long

periods spent looking at a particular spot, and saccades, very rapid changes of eye position.

These are not the only kinds of motion of which the eye is capable. In smooth

pursuit the eye follows a moving object, and even when fixated the eye continually

makes very small jittery motions. Fixations and saccades however are the most signif-

icant motions when viewing static scenery. Saccades can be initiated consciously, but

for the most part occur naturally as we explore a scene. Though fixating on a location


Figure 3.1: Patterns of eye movements of a single subject over an image when given different instructions. Note (1) free observation, which shows fixations that are relatively dispersed yet still focused on relevant areas. Contrast it with (3), where the viewer is instructed to estimate the figures' ages. Reproduced from [Yarbus, 1967].


is not identical to attending to it, for the most part an attended location is fixated (i.e., if

we pay attention to something, we strongly tend to look at it directly) [Underwood and

Radach, 1998].

Figure 3.2: Similar effects to [Yarbus, 1967] are easily (even unintentionally) achieved when using eye tracking for interaction. Circles are fixations; their diameter is proportional to duration. The first viewer was instructed to find the important subject matter in the image. The second viewer was told to just look at the image. The viewer assumed, from prior experience in perceptual experiments, that he was going to be later asked detailed questions about the contents of the scene. This resulted in a much more diffuse pattern of viewing.

3.1.1 Eye Movement Control

Qualitatively, a great deal is known about fixations. Eye movements are highly goal

directed. Viewers don't just look around at random. Instead, they fixate meaningful

parts of images [Mackworth and Morandi, 1967, Underwood and Radach, 1998, Henderson

and Hollingworth, 1998], and fixation duration is related to processing [Just

and Carpenter, 1976, Henderson and Hollingworth, 1998]. Viewing is highly influenced

by task. The classic example of this [Yarbus, 1967] showed that viewers examining

the same image, with different tasks to perform, showed drastically different

patterns of viewing, in which they focused on the features relevant to their task

(see Figure 3.1). Given the same task, the motions of a particular viewer over an

image at different viewings can be quite different, yet the overall distribution of fixations

remains similar [Yarbus, 1967]. In real activities, actions, even those thought


of as automatic, are usually preceded by (largely unperceived) fixations of relevant

features [Land et al., 1999]. These effects have been noted from some of the earliest

research in the field [Yarbus, 1967], but the mechanisms involved remain for the most

part informally understood.

In general, understanding of most higher-level aspects of eye movement control

is largely qualitative. In limited domains such as reading, attempts have been made

to formulate mathematical models of viewing behavior. For complex natural scenes,

much less is known [Henderson and Hollingworth, 1998]. Clearly any information

used in guiding eye movements must come from the scene. Likewise, the process of

selecting a new location to view must be guided in part by low frequency information

gathered from the periphery during earlier fixations. A matter of debate is whether low-level

visual information gained like this is a direct control of behavior or whether it is

primarily used when integrated into a higher level understanding. The precise factors

involved in control and planning of eye movements are an active and highly debated

topic [Kowler, 1990].

3.1.2 Salience Models

Much effort has gone into attempts to identify purely low-level image measurements

that can account for a significant amount of viewing behavior. Clearly it would be interesting

if what appears to be a highly complex behavior requiring general understanding

could be modeled or at least reasonably predicted by a simple approach. Results have

been mixed. Fixation locations do not correlate very well over time with the presence

of simple low level image features such as areas of high contrast, junctions, etc. [Underwood

and Radach, 1998].

More complex models have been formulated, such as the salience methods men-

tioned earlier. All measure contrast in one sense or another. In general, salience meth-

ods embody the assumption that unusual features are likely to be important and looked

at. Choice of feature space, and scale of measurement and comparison differ. One


popular approach [Itti et al., 1998, Itti and Koch, 2000] uses center surround filters to

measure local contrast in color, orientation and intensity to model general viewing be-

havior. [Rosenholtz, 2001] uses a probabilistic framework to measure the probability of

a feature given a Gaussian model of color or velocity in the surround. This was used to predict visual search performance. A related salience framework was proposed [Walker

et al., 1998] to select unique image locations to match for image alignment. This ap-

proach used kernel estimation to measure the rarity of local differential features in the

global, image-wide distribution of those features.
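
The shared idea of measuring local contrast can be illustrated with a deliberately simplified sketch. It operates on intensity only, at a single scale, and uses box averages rather than the filters of the cited models; it illustrates the center-surround idea under these assumptions and is not an implementation of [Itti et al., 1998] or any other published method. All function names are our own.

```python
def local_mean(img, r, c, radius):
    """Mean intensity in a square window around (r, c), clipped at the image border."""
    rows, cols = len(img), len(img[0])
    total, count = 0.0, 0
    for i in range(max(0, r - radius), min(rows, r + radius + 1)):
        for j in range(max(0, c - radius), min(cols, c + radius + 1)):
            total += img[i][j]
            count += 1
    return total / count


def center_surround(img, center_radius=0, surround_radius=2):
    """Absolute difference between a small 'center' mean and a larger 'surround' mean.

    The surround window is a simple box that also contains the center.
    """
    return [
        [abs(local_mean(img, r, c, center_radius)
             - local_mean(img, r, c, surround_radius))
         for c in range(len(img[0]))]
        for r in range(len(img))
    ]


# A uniform gray field with one bright pixel: the odd one out scores highest.
image = [[0.5] * 7 for _ in range(7)]
image[3][3] = 1.0
salience = center_surround(image)
peak = max((v, r, c) for r, row in enumerate(salience) for c, v in enumerate(row))
```

In this toy image the largest response falls on the bright pixel, while the uniform background far from it scores zero. This matches the reading of salience as a measure of whether something is present at a location, not of whether it matters.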

These approaches share the same basic idea but vary in what they attempt to model.

This raises the question of what one is really trying to capture with salience. One can

look at salience as simply a quantitative method of deciding whether something is

present in a particular location in the visual field. In this context, salience doesn't actually

state the location is important, just that it might be because something is there. It

seems quite plausible that a measure like this plays a role in perception. However, more

is usually claimed for salience, for example that it predicts most of viewing behavior

or the valuable content in an image.

Salience would seem to have some additional predictive power because in a wide

class of images the semantically important subject does contrast with the rest of the

scene. Relatively few people take pictures of their family members dressed in camouflage

and lurking in the bushes. Nobody takes a picture of a leaf of grass in a field. The

tendency of meaningful features to be visually prominent is by no means universal. It

is also unclear if this is really a property of the world, or a property of pictures people

take, but it does seem to underlie some of the success of salience as an engineering tool

in graphics.

Salience models have also been used to model viewing in narrower domains where

their applicability is more clear. The presence or absence of pop out effects in search

for example [Rosenholtz, 1999, Rosenholtz, 2001] is effectively modeled by simple

salience models that measure how distracting a distracter actually is.


Debate about how useful salience is in understanding general viewing is ongoing.

Some optimistically state that salience predictions correlate well with real eye motions

of subjects free viewing images [Privitera and Stark, 2000, Parkhurst et al., 2002]. Others

are more doubtful and claim that when measured more carefully and in the context of a goal driven activity, the correlation is quite poor [Land et al., 1999, Turano et al.,

2003]. This mismatch in experimental results fits the intuition that visually prominent,

eye catching features might be more correlated with idle exploration of a scene,

and much less related to eye movements made during a task. In spite of this contro-

versy, salience methods are quite popular and have seen a fair amount of application

in computer graphics. They show some correlation with visually prominent features

and are fairly simple to implement. Code for some is publicly available. Clearly both

semantics and low-level features play a part in eye movements. Further investigation

is necessary to clarify the contributions to viewing behavior of salience and scene se-

mantics. Though they seem unable to model important aspects of viewing behavior,

salience models may provide important measures of visual prominence.

3.2 Eye Tracking

Much of the knowledge above about human eye motion has been gained through the

use of eye-tracking. A system measures a viewer's eye in one of several manners

and records the point where it is looking, termed the point of regard or POR. One

common approach involves a video camera and an infrared light source. The relative

positions of the pupil and corneal reflection in the resulting image are used to calculate

point of regard [Duchowski, 2000]. These systems are reasonably reliable and accurate

and improve with each generation, though they are still subject to drift over time and

variability between viewers. The same technology is used in producing units that sit

in front of a fixed display, and in head mounted units for use in more general scenes.

Video based trackers have the virtue of not interfering directly with a viewer, making


them useful as both a natural interactive method and a research tool.

Outside of research in human vision, eye-trackers have seen increasing use as a

mode of human computer interaction. They have also enabled the use of eye movements

as a gauge of cognitive activity for psychological investigations and for evaluation of

visual displays.

Eye position has been used as a cursor for selection tasks in a GUI [Sibert and Jacob,

2000]. It has also been used to indicate a user's attention to others in a videoconferencing

environment [Vertegaal, 1999]. Another class of use, related to ours, uses

POR to control the simplification of images or scenes for efficiency purposes. Knowing where

a user looks enables pruning of information that is not perceptible, and need not be

transmitted in a video stream [Duchowski, 2000]. Similarly, unexamined content need

not be rendered in a 3D environment. In practice, few current systems that make use

of such simplification actually use eye tracking, presumably because of limited availability;

head tracking is typically used instead [Reddy, 2001].

On the whole, eye tracking has been found more useful in interaction where it

serves as an indirect measure of user interest. Eye movements are not under full vol-

untary control. Because of this, when viewers attempt to explicitly point with their

eyes the result tends to lack control and suffer from the so called Midas Touch prob-

lem [Jacob, 1993] where struggling to control eye position, like a cursor, based on

visual feedback creates even more uncontrolled looking, touching on many irrelevant

or undesirable locations.

The same involuntary link of eye movement to thought processes that makes eye

tracking a bad mouse has made it useful as an indirect measure of interest and cognitive

activity. Eye tracking has been used to evaluate the effectiveness of informational

displays including application interfaces [Crowe and Narayanan, 2000], web

pages [Goldberg et al., 2002], and air traffic control systems [Mulligan, 2002]. As

mentioned earlier, eye movements may even reveal information that viewers are trying

to report, but cannot, because it is not consciously available. Experiments have shown


that professional radiologists examining slides look longer at locations where tumors

are present, even when they fail to identify and report them [Mello-Thoms et al., 2002].

In the future, this might hold the promise of computer assisted technologies to avoid

such mistakes. Several consulting companies currently sell evaluation services using eye tracking to graphic design houses and web content creators, among others.¹

3.3 Limits of Vision

Eye movements are related to the resolution limitations of the eye. At any of the fixations

with which a viewer explores a scene, the most detailed information is received

only in the fovea, but lower frequency information is received throughout the visual field. These limits on sensitivity within the visual field are not a weakness of the visual

system. On the contrary, they are part of our ability to efficiently process wide fields

of view and integrate information across eye movements and changes in viewpoint.

3.3.1 Models of Sensitivity

Quantitative models of visual acuity and contrast sensitivity have been developed to describe sensitivity to stimuli with different properties. Models of acuity predict whether

an observer can detect a black feature of a particular size on a white background. Contrast

sensitivity measures an observer's ability to discriminate a repeating pattern of a

particular contrast and frequency from a uniform gray field. The drop-off in these sensitivities

away from the visual center is modeled as a function of eccentricity, location

relative to the point of fixation.

Contrast sensitivity has been studied extensively in a variety of conditions usually

using monochromatic sinusoidal gratings (smoothly varying, repeating patterns of light

and dark bands). This sensitivity declines sharply with eccentricity [Kelly, 1984, Mannos

and Sakrison, 1974, Koenderink et al., 1978]. Contrast threshold is defined as the

¹ http://www.eyetools.com, http://www.factone.com, http://www.veridicalresearch.com


(unitless) contrast value (0 to 1 with 1 being maximal contrast) at which a grating and

uniform gray become indistinguishable. Contrast sensitivity is the reciprocal of this

value.

[Plot: Contrast Sensitivity. Horizontal axis: frequency (cycles/degree). Vertical axis: inverse contrast. Regions labeled visible and invisible.]

Figure 3.3: Log-log plot of contrast sensitivity from equation (3.2). This function is used to define a threshold between visible and invisible features.

Many researchers have empirically studied human contrast sensitivity and several

have developed mathematical models from their data. Researchers in computer science

have also used existing data and models in applications. Different aspects of a stimulus

are important in different situations. Fitting models to data collected from different

viewers under different circumstances gives somewhat different results. Two examples

are given here to illustrate the form these mathematical models take.

Kelly [1984] developed a mathematical model for the contrast sensitivity curve (at

the center of the visual field) including appropriate scaling factors describing the effects

of velocity (v) as well as frequency (f, in cycles/degree) of a grating on sensitivity.

$$A(f, v) = \left(6.1 + 7.3\,\left|\log_{10}(v/3)\right|^{3}\right) v f^{2}\, e^{-2f(v+2)/4.59} \qquad (3.1)$$

Mannos and Sakrison [1974] fit a mathematical model appropriate to still imagery

to results of prior empirical studies for use as a metric in evaluating image compression.


$$A(f) = s_{\max} \cdot 2.6\,(0.0192 + 0.144 f)\, e^{-(0.144 f)^{1.1}} \qquad (3.2)$$

where $s_{\max}$ is the peak contrast sensitivity (this is around 400, but varies from

person to person).

3.3.2 Sensitivity Away from the Visual Center

A number of researchers have explored how sensitivity varies with eccentricity [Kelly,

1984, Rovamo and Virsu, 1979]. At larger eccentricities (expressed in degrees of visual

angle) the contrast sensitivity function is multiplied by another function which models

the drop-off of sensitivity in the visual periphery. This function is termed the cortical

magnification factor. It is not radially symmetric, but drops off faster vertically than

horizontally. It can be approximated [Rovamo and Virsu, 1979] with separate formulas

for decrease in sensitivity in four areas. For simplicity a bound from the most sensitive

area can be used in estimating visibility [Reddy, 2001, Reddy, 1997].

$$M(e) = \frac{1}{1 + 0.29\,e + 0.000012\,e^{3}} \qquad (3.3)$$

The cubic term can usually be ignored, as its contribution in the range of eccentricities

normal in a screen display is negligible [Reddy, 1997]. The contrast sensitivity is then

$M(e)\,A(f)$.
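
A minimal sketch of how equations (3.2) and (3.3) combine into a visibility test follows. It assumes the peak sensitivity of 400 quoted above as a default, non-negative frequency in cycles/degree, eccentricity in degrees, and contrast in the 0 to 1 range. The function names are illustrative, and the test applies strictly only to the grating stimuli the models were fit to.

```python
import math


def contrast_sensitivity(f, s_max=400.0):
    """Equation (3.2): sensitivity at the visual center for f >= 0 in cycles/degree."""
    return s_max * 2.6 * (0.0192 + 0.144 * f) * math.exp(-((0.144 * f) ** 1.1))


def cortical_magnification(e, use_cubic=True):
    """Equation (3.3): sensitivity falloff for eccentricity e in degrees."""
    cubic = 0.000012 * e ** 3 if use_cubic else 0.0
    return 1.0 / (1.0 + 0.29 * e + cubic)


def is_visible(contrast, f, e):
    """True if a grating of the given contrast (0..1) exceeds the contrast threshold.

    The threshold is the reciprocal of the combined sensitivity M(e) * A(f).
    """
    sensitivity = cortical_magnification(e) * contrast_sensitivity(f)
    return sensitivity > 0.0 and contrast > 1.0 / sensitivity
```

For example, under this sketch a 4 cycle/degree grating of contrast 0.02 passes the test at the visual center (`is_visible(0.02, 4.0, 0.0)`) but fails it at 30 degrees of eccentricity (`is_visible(0.02, 4.0, 30.0)`). The `use_cubic` flag makes it easy to check that dropping the cubic term changes little at small eccentricities.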

3.3.3 Applicability to Natural Imagery

Some caution is necessary in applying these models derived from simple monochro-

matic repeating patterns to complex natural imagery. Though these models have been

applied with good results in graphics [Reddy, 2001], our goal of creating visible ab-

straction rather than conservative level of detail is more ambitious, and more likely to

stress the models involved.


[Plot: Cortical Magnification. Horizontal axis: eccentricity (degrees).]

Figure 3.4: Cortical Magnification describes the drop-off of visual sensitivity with angular distance from the visual center.

How to measure contrast is relatively obvious in gratings, since there are only two extrema.

A single contrast exists for the entire grating. Between two regions in a scene

the meaning of contrast is less clear. Regions are neither uniform in color nor uni-

formly varying. No strong perceptually motivated approach to this problem appears to

have been formulated. Lillesaeter [1993] attempts to address this by defining a contrast between a nonuniform figure and ground. This contrast measure is a weighted average

of the contrast between the region and background and the integral of the contrast

along the edge of the region. This is demonstrated to provide more intuitive results

than simpler alternatives on regions with flat colors. Issues related to sampling in real

images are not addressed. Measuring contrast in a color image presents another prob-

lem. Contrast in colored gratings has been studied, and much work has been done in

general on color perception. However, there does not appear to be a simple general

contrast sensitivity model defined in color space [Regan, 2000]. Adapting a luminance

based model therefore remains a plausible course of action in designing a model for a

practical application.

Applying the notion of visibility for a grating to a non-repeating pattern of regions


also presents problems. The hump-like shape of the contrast sensitivity curve tells us

something counterintuitive if the size of an area is treated as proportional to an inverse

frequency [Reddy, 2001]. Very low frequencies are much less visible than some higher

ones at a given contrast. This is because detectability of a grating is related to the

density of receptive fields of corresponding size. There are upper bounds on the size of

human receptive fields. Intuitively, a large slowly varying sine wave may be difficult

to see.

This has been less of a concern in previous work where judgments were being made

mostly about high frequency parts of the curve [Reddy, 2001], but will be noticeable

when visibly abstracting images.

It can be argued [Reddy, 1997] that natural images, at least in places (and certainly

the uniform color regions that we will ultimately use in rendering) more closely resem-

ble square wave, rather than sine, gratings. Since a square wave can be approximated

by the sum of an infinite sequence of sine waves, and sensitivity to combined sinusoidal

patterns is closely related to that of the independent components [Campbell and

Robson, 1968], one might think the visibility for low frequency square waves would

be higher than that for equal frequency sine waves. The actual relation has been studied

empirically [Campbell and Robson, 1968] and confirms this intuition. For square

waves at frequencies below about 1 cycle/degree sensitivity levels off rather than drop-

ping. A theoretical derivation of the difference is presented in [Campbell and Robson,

1968]. It matches some but not all features of the empirical data.
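The decomposition of a square wave into sine waves can be checked numerically. The sketch below sums the odd harmonics of the standard Fourier series for a unit-amplitude square wave and compares the result with the square wave itself. The function name, the sample grid and the number of terms are our own illustrative choices, not taken from the cited work.

```python
import numpy as np

def square_wave_partial_sum(x, frequency, n_terms):
    """Sum the first n_terms odd harmonics of a unit-amplitude square wave."""
    total = np.zeros_like(x, dtype=float)
    for k in range(n_terms):
        harmonic = 2 * k + 1  # only odd harmonics contribute
        total += np.sin(2.0 * np.pi * harmonic * frequency * x) / harmonic
    return (4.0 / np.pi) * total

x = np.linspace(0.0, 1.0, 2001)
square = np.sign(np.sin(2.0 * np.pi * x))
approx = square_wave_partial_sum(x, 1.0, 200)

# Compare away from the jumps, where any finite sum overshoots (Gibbs effect).
away_from_jumps = np.abs(np.sin(2.0 * np.pi * x)) > 0.3
max_error = np.max(np.abs(approx - square)[away_from_jumps])

# Amplitude of the fundamental relative to the square wave's own amplitude.
fundamental_amplitude = 4.0 / np.pi
```

Note that the fundamental of the series has amplitude 4/pi, larger than the amplitude of the square wave itself, and the higher harmonics fall in a more visible part of the sensitivity curve when the fundamental frequency is low. Both observations are consistent with the intuition stated above, though the sketch says nothing about the measured thresholds.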

These concerns remind us that when applying these models to real images they

cannot serve as an accurate absolute perceptual measure of visibility. Rather, they

provide a plausible relative sense of the visibility of different features. The absolute

contrast or acuity threshold at which a feature becomes visible is not necessary for our

application. What is important is the relative ordering of feature visibility, which allows

us to create a prioritization. It is necessary to model visual sensitivity only up to the

level where results correspond to our intuitions about this prioritization.



To apply these models in actual scenes, we need to decide on a definition of the

features whose visibility we are judging with these methods. For example, these mod-

els have been used in 3D level of detail [Reddy, 1997] to avoid rendering invisible

features. In this context the obvious choice of feature is a polygon which may or may

not be included in the rendering. For images the choice is less clear, as image properties

can be measured in an unstructured, local way or an image can be partitioned

into a more structured representation. We review some of the possibilities for image

representation in the following chapter.



Chapter 4

Vision and Image Processing

4.1 Image Structure Features and Representation

Figure 4.1: (a) Scale space of one dimensional signal. Features disappear through scale space but no new features appear. (b) Plot of inflection points of another one dimensional signal through scale space. Reproduced from [Witkin 1983].

Image representation and processing is a large field, relevant to both human and

computer vision. We concentrate on some basic concepts relevant to the task of simpli-

fying images. Scale space theory provides a way of characterizing the different scales

of information present in an image and making correspondences between features at

different scales. Segmentation divides an image into distinct regions, enabling an ex-

plicit, non-local representation of image content. Edge detection provides a measure

of the prominent boundaries in an image.

An important unifying concept in image analysis is that the same image data can

be represented in many forms. In any of these, certain information in the image is

explicit and other information is less easy to access [Marr, 1982]. The appropriate

information and representation are task dependent. A variety of representations with different

properties are available. With the exception of 3D techniques, NPR applications have

largely used low-level representations, often functioning locally on the original image

itself. However, human artistic processes operate on richer representations. Ruskin,



one of the 19th century's most prominent art historians and theorists, famously argued

that in teaching art technique, the most important lesson was teaching the student to

see [Ruskin, 1858]. There seems to be an assumption in image-based NPR that seeing

is simply capturing a bitmap representation of the scene, and that it can be considered

accomplished in the presence of a source photo. Human vision, however, is much more

than simply capturing an image. If a computer is to produce artistic renderings that

capture some of the expressiveness of real art, especially in highly abstracted styles,

some higher level representation is necessary, analogous to that created in the artist's

head as she understands the scene before her and begins to paint. The better suited

this representation is to the task, the easier it should be to drastically simplify an image

while retaining its important features.

The lowest level representation is the image itself, analogous to the retinal image.

This is the starting point of any further representation, making explicit the light inten-

sities at each pixel. There is structure here that can be more explicitly represented in

other ways. Information in the image exists over a variety of scales, small and large

features, making up parts and whole objects in the scene.

One common way to come to terms with the multiple scales of information in an

image is through its scale space. From a single image, a three dimensional stack of

images is generated in which each contains progressively coarser scale information.

Again, this representation has an analogue in human vision, where neurons have receptive

fields of different sizes, in effect generating a multi-scale representation from the

retinal image.

Scale space has come to refer to such a space of increasingly simple images gener-

ated by a range of processes. Generically this can be thought of as a stack of images

with decreasing information contained at each level as scale increases. This stack is

in theory continuous, in practice sampled at some discrete interval. Starting with the

original image, detail is progressively lost until a uniform color is all that remains (see

Figure 4.1).



A number of constructions for such a space have been developed. Perhaps the sim-

plest approach creates something like an image pyramid, successively downsampling

the image so it is more coarsely pixelated. This approach has a problem in that detailed,

high frequency information (the edges between the new larger pixels) may have

been introduced which was not in the original image. This is the problem of spurious

resolution [Koenderink, 1984]. New information has been hallucinated into existence

by imposing a coarser grid structure on the data. Convolution with a Gaussian kernel

(blurring) generates a space that avoids this problem [Witkin, 1983; Koenderink, 1984].

In fact, this blurring has been proven [Koenderink, 1984] to be the unique way to generate

a scale space which both is uniform, or uncommitted (i.e., the process is uniform

across image space and through the scale dimension), and avoids spurious resolution.

Information disappears but cannot be created. In one dimension, this ensures

that any feature will only disappear as scale increases. In two dimensions new features,

maxima for example, can appear. However, in both cases clear judgments can be made

about what features exist at what range of scales.
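A one dimensional Gaussian scale space can be sketched in a few lines. The code below blurs a signal at increasing kernel widths and counts the local extrema that survive at each level. It is a minimal illustration under our own assumptions: a sampled Gaussian truncated at four standard deviations, reflected borders, and a hypothetical random test signal. A sampled, truncated kernel only approximates the continuous theory, so the sketch illustrates the loss of features without proving that none are created.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Sampled, normalized Gaussian truncated at four standard deviations."""
    radius = int(np.ceil(4.0 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def blur(signal, sigma):
    """Convolve a 1D signal with a Gaussian, reflecting at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(signal, radius, mode="reflect")
    return np.convolve(padded, kernel, mode="valid")

def count_extrema(signal):
    """Count local maxima and minima as sign changes of the first difference."""
    slope = np.sign(np.diff(signal))
    slope = slope[slope != 0]
    return int(np.sum(slope[1:] != slope[:-1]))

rng = np.random.default_rng(0)
signal = np.cumsum(rng.normal(size=512))  # hypothetical noisy test signal
sigmas = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

# One row per scale: the discrete stack of progressively coarser signals.
scale_space = np.stack([blur(signal, s) for s in sigmas])
extrema_per_scale = [count_extrema(level) for level in scale_space]
```

Printing `extrema_per_scale` shows how quickly fine features vanish as the kernel widens.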

That the process of blurring is uniform is an advantage in that filtering can be

applied to any signal; one doesn't need to have a model of what the important features

present are. A disadvantage is that coarser features are more coarsely located: the

blurring process that reveals them distorts their spatial extent.

If you know what you're looking for, there is no reason why the blurring operation

must be uniform or uncommitted. A number of nonuniform or nonlinear scale spaces

have been formulated which do not introduce false content but remove information

selectively in certain locations. One of the best known of such methods is anisotropic

diffusion [Perona and Malik, 1990]. Here the rate of diffusion is not uniform but rather

decreases as the magnitude of the gradient at any position increases. This results in

an edge preserving blurring which removes low contrast detail while preserving strong

edges. This has the advantage that edges are better preserved in their initial location

until the point at which they disappear. Niessen et al. [1997] compare this and several



other nonlinear methods in the context of segmentation. Nonlinear methods perform

well but are significantly more expensive.
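The edge preserving behavior of anisotropic diffusion can be illustrated with a short explicit scheme. The sketch below uses a four-neighbour update and the conductance 1 / (1 + (d / kappa)^2), one of the edge-stopping functions proposed by Perona and Malik. The parameter values, function names and synthetic test image are our own assumptions, chosen only to make the effect visible.

```python
import numpy as np

def conductance(difference, kappa):
    """Edge-stopping function: near 1 for small differences, near 0 for large ones."""
    return 1.0 / (1.0 + (difference / kappa) ** 2)

def anisotropic_diffusion(image, n_iter=20, kappa=0.1, step=0.2):
    """Explicit 4-neighbour diffusion scheme; step <= 0.25 keeps it stable."""
    img = np.asarray(image, dtype=float).copy()
    for _ in range(n_iter):
        padded = np.pad(img, 1, mode="edge")  # replicate borders: no flux across them
        neighbour_differences = (
            padded[:-2, 1:-1] - img,  # north
            padded[2:, 1:-1] - img,   # south
            padded[1:-1, :-2] - img,  # west
            padded[1:-1, 2:] - img,   # east
        )
        img += step * sum(conductance(d, kappa) * d for d in neighbour_differences)
    return img

# Hypothetical test image: a vertical step edge with low contrast noise.
rng = np.random.default_rng(0)
clean = np.zeros((32, 64))
clean[:, 32:] = 1.0
noisy = clean + rng.normal(scale=0.02, size=clean.shape)
smoothed = anisotropic_diffusion(noisy)
```

With these settings the noise inside each half is strongly reduced while the step between the halves keeps nearly its full height, since the conductance across the edge is close to zero.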

A practical application must sample the continuous scale space at some discrete

intervals. One would like to sample sufficiently finely to capture interesting events,

the order of disappearance of different features, but not more densely than need be.

Looking at the linear scale space, Koenderink [1984] derives an appropriate sampling

as logarithmic along the scale axis corresponding to a uniform sampling in the scale

parameter t, the standard deviation of the Gaussian kernel used in blurring. This is

intuitive. At small scales many tiny regions are merging quite often, requiring dense

sampling. At higher scales, there are fewer regions, fewer events to capture, and much

less dense sampling in t is required. The issue is the same for nonlinear spaces. Re-

lating scales in different spaces is not straightforward. Some attempt at doing this has

been made in [Niessen, 1997].
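A logarithmic sampling of scale is simple to generate. The sketch below assumes the sampling is taken over the width of the Gaussian kernel, so that levels are equally spaced in the logarithm of that width. The range and the number of levels are hypothetical.

```python
import numpy as np

sigma_min, sigma_max, n_levels = 1.0, 32.0, 11  # hypothetical range and count
sigmas = np.geomspace(sigma_min, sigma_max, n_levels)

# Equal steps in log(sigma) mean a constant ratio between neighbouring scales,
# so absolute spacing is dense at fine scales and sparse at coarse ones.
log_steps = np.diff(np.log(sigmas))
absolute_steps = np.diff(sigmas)
```

The growing entries of `absolute_steps` reflect the intuition above: many events at fine scales call for dense sampling, while few events at coarse scales allow wide steps.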

Figure 4.2: Interval tree for 1D signal illustrating decomposition of the signal into a hierarchy. Reproduced from [Witkin 1983].

While a scale space such as this begins to capture structural relations of features

across scales, this is still largely an implicit representation. To make this explicit,

features at different scales need to be directly related to each other. Witkin [1983]



addresses this problem in 1D signals. In the scale space of a one dimensional signal,

new features will never appear at coarse scales. So, any features found at a coarse scale

can (if the sampling is dense enough) be traced directly back to their fine scale origin.

This allows localization of features found at a coarse level. Witkin demonstrates this,

choosing as a feature zero crossings in the second derivative, that is, inflection points in the

signal (Figure 4.1).
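The coarse-to-fine tracing described above can be sketched as follows. Inflection points are located at each scale as sign changes of the second difference, and each crossing found at the coarsest scale is followed down by stepping to the nearest crossing at the next finer scale. Nearest-neighbour matching is our own simplification and is only reliable when scale is sampled densely; the test signal and all names are hypothetical.

```python
import numpy as np

def blur(signal, sigma):
    """Gaussian blur of a 1D signal (sampled kernel, reflected borders)."""
    radius = int(np.ceil(4.0 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(np.pad(signal, radius, mode="reflect"), kernel, mode="valid")

def inflection_points(signal):
    """Approximate sample indices where the second difference changes sign."""
    second = np.diff(signal, n=2)
    changes = np.where(second[:-1] * second[1:] < 0)[0]
    return changes + 1

def trace_to_fine(crossings_by_scale):
    """Follow each coarsest-scale crossing to the nearest crossing at every finer scale."""
    paths = []
    for start in crossings_by_scale[-1]:
        path = [int(start)]
        for finer in reversed(crossings_by_scale[:-1]):
            nearest = finer[np.argmin(np.abs(finer - path[-1]))]
            path.append(int(nearest))
        paths.append(path)
    return paths

rng = np.random.default_rng(1)
t = np.arange(400)
signal = (np.sin(2 * np.pi * t / 100)
          + 0.5 * np.sin(2 * np.pi * t / 37)
          + 0.2 * rng.normal(size=400))
sigmas = [1.0, 2.0, 4.0, 8.0, 16.0]  # ordered fine to coarse
crossings_by_scale = [inflection_points(blur(signal, s)) for s in sigmas]
paths = trace_to_fine(crossings_by_scale)
```

Each entry of `paths` lists one feature's position from the coarsest scale down to the finest, so its last element is the localized fine scale position.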

Similarly, using these correspondences across scale it is also possible to create

a structure that captures the relationship between all features at all scales. Intervals

between two zero crossings (which again correspond to sections of the signal between

two inflection points) disappear in only one way. Two successive zero crossings merge

together, with the result that three intervals, the one between the crossings and those on

either side, merge into one. These three intervals can be made children of the resulting

interval to create an interval tree which characterizes the structure of the signal at all

scales. Witkin observes that those intervals which have longer persistence through

scale space appear to be those identified by human observers as subjectively salient or

important in the signal.
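The three-into-one merging rule lends itself to a small data structure. The sketch below assumes the order in which intervals vanish is already known, for example from tracking zero crossings through scale space, and shows only how the tree is assembled. The class, the function and the example intervals are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Interval:
    """A stretch of signal between two zero crossings, with the intervals it absorbed."""
    start: float
    end: float
    children: List["Interval"] = field(default_factory=list)

def merge_vanishing(intervals, index):
    """Replace intervals[index] and its two neighbours by a single parent interval.

    `index` names the interval whose bounding zero crossings have merged,
    so it must have a neighbour on each side.
    """
    if not 0 < index < len(intervals) - 1:
        raise ValueError("the vanishing interval needs a neighbour on both sides")
    left, middle, right = intervals[index - 1 : index + 2]
    parent = Interval(left.start, right.end, [left, middle, right])
    return intervals[: index - 1] + [parent] + intervals[index + 2 :]

# Hypothetical fine-scale intervals and a hypothetical order of disappearance.
level = [Interval(a, b) for a, b in [(0, 2), (2, 3), (3, 7), (7, 8), (8, 12)]]
level = merge_vanishing(level, 1)   # (2, 3) vanishes first
after_first_merge = [(iv.start, iv.end) for iv in level]
level = merge_vanishing(level, 1)   # then (7, 8) vanishes
root = level[0]
```

Recording the scale at which each merge occurs would additionally give the persistence of every interval, the quantity Witkin relates to perceived salience.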

Extending this nice analytical derivation to a practical application in 2D is not

trivial. In 2D, features such as maxima, or curves defined by inflection points, may split

into two at coarser scales. Koenderink [1984] suggests the use of equiluminance curves

in the image as a 2D equivalent to Witkin's intervals. Generic equiluminance curves

form a single closed curve. There are two singularities: extrema where the curve is

just a point, and saddle points where the curve forms multiple loops which intersect at

one point. Each loop may contain other saddle points and has to contain at least one

extremum [Koenderink and van Doorn, 1979]. The nesting of these saddle points gives

the structure of the image regions. Though new saddle points may appear inside a loop,

centermost saddle points must disappear before outer ones. Because of this the saddle

points present at all scales can be represented as a tree. Such a structure is difficult to

calculate in practice. It is not obvious how to find these saddle points efficiently or if



they provide a subjectively intuitive partitioning of the image. In addition, it is not clear

how color could be handled. In a naive approach, each band would produce its own

surface with its own saddle points, resulting in 3 separate scale space trees that would

need to be unified in some way.

4.2 Segmentation

The process described above of dividing up a signal based on the intervals between

features is a particular approach to the general problem of segmentation. This problem

again occurs in both computer and human vision. Segmentation makes explicit the

association (or disassociation) between different areas of an image. It produces an ex-

plicit representation of parts of the image that are associated with each other, assigning

each pixel to one, usually connected group or region. These regions should be uniform

by some measure. Separate regions, at least the adjoining ones, should be markedly

different. How people do this, parsing shapes and objects from the background is only

partially understood. In computer vision, a tremendous variety of methods have been

devised to define similarity measures for this using color, gray scale intensity, texture,

etc. This segmentation is usually a partitioning of an image at a single scale. However,

it is sometimes desirable to define a segmentation over a range of scales.
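As a concrete instance of assigning each pixel to one connected region, the sketch below labels a gray scale image by growing regions across neighbouring pixels of similar intensity. It is a deliberately naive illustration of single-scale segmentation, not one of the methods cited in this chapter; the function name, the tolerance and the test image are hypothetical.

```python
from collections import deque
import numpy as np

def segment_by_similarity(image, tolerance):
    """Label 4-connected regions whose neighbouring pixels differ by at most `tolerance`."""
    rows, cols = image.shape
    labels = np.full((rows, cols), -1, dtype=int)  # -1 marks unassigned pixels
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r, c] != -1:
                continue
            # Start a new region here and grow it breadth first.
            labels[r, c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if (inside and labels[ny, nx] == -1
                            and abs(image[ny, nx] - image[y, x]) <= tolerance):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels

# Hypothetical image: two flat regions of clearly different intensity.
image = np.full((4, 6), 0.1)
image[:, 3:] = 0.9
labels = segment_by_similarity(image, tolerance=0.2)
```

Because similarity is tested only between adjacent pixels, a slow ramp would be merged into a single region. This is one reason practical methods rely on more careful similarity measures.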

Scale space has been considered in segmentation. It is typically used to make segmentations

produced with other methods more robust. Niessen et al. [1997] link pixels

with neighbors that have similar color in both the spatial and scale dimensions

to create a hierarchy. The end product is a single flat segmentation taking its set of

regions from a coarse scale and their spatial extent from a fine scale. A similar approach

is taken in Bangham et al. [1998]. Here, the desire is to create a hierarchical

segmentation tree that describes the image over a variety of scales. An alternate ap-

proach [Ahuja, 1996] creates a multi-scale representation without explicitly generating

a scale space.



Each of these methods computes a hierarchical representation of image structure.

However, there is no clear relation between the hierarchy and the theoretical hierarchy

induced by scale space. This is not a major concern; scale space structure is attractive

because of its simple formal definition, but is not the single correct answer in any

meaningful sense for a given practical application. Hierarchical representations are

not general purpose; desirable properties depend on the application. For the purposes

of image abstraction, an important question is whether each subtree in the structure

represents some coherent area or region. This is guaranteed in some geometric sense

by scale space proper, since nodes occur in the tree only when features disappear. In

contrast, methods for building a hierarchy that i
