
Extraction of Visual and Motion Saliency for Automatic Video Object Extraction

1. INTRODUCTION

In imaging science, image processing is any form of signal processing for which the input is an image, such as a photograph or video frame; the output may be either an image or a set of characteristics or parameters related to the image. Most image-processing techniques treat the image as a two-dimensional signal and apply standard signal-processing techniques to it. Image processing usually refers to digital image processing, but optical and analog image processing are also possible. In its broadest sense, image processing aims to provide practical, reliable and affordable means that allow machines to cope with images while assisting people in their work. In short, it is the act of examining images in order to identify objects and judge their significance.

Image processing refers to the processing of a 2D picture by a computer. An image defined in the “real world” is considered to be a function of two real variables, for example a(x, y), with a as the amplitude (e.g. brightness) of the image at the real coordinate position (x, y). Modern digital technology has made it possible to manipulate

multi-dimensional signals with systems that range from simple digital circuits to advanced

parallel computers. The goal of this manipulation can be divided into three categories:

1) Image Processing (image in -> image out)

2) Image Analysis (image in -> measurements out)

3) Image Understanding (image in -> high-level description out)

An image may be considered to contain sub-images sometimes referred to as regions-

of-interest, ROIs, or simply regions. This concept reflects the fact that images frequently

contain collections of objects each of which can be the basis for a region. In a sophisticated

image processing system it should be possible to apply specific image processing operations


to selected regions. Thus one part of an image (region) might be processed to suppress

motion blur while another part might be processed to improve color rendition. Typically, image processing systems require that the images be available in digitized form, that is, as arrays of finite-length binary words. For digitization, the given image

is sampled on a discrete grid and each sample or pixel is quantized using a finite number of

bits. The digitized image is processed by a computer.
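As a minimal illustration of this step, the sketch below (a hypothetical NumPy example, not part of the described system) samples a grayscale image on a coarser grid and re-quantizes each sample with a smaller number of bits:

```python
import numpy as np

def digitize(image, step=4, bits=4):
    """Toy sampling + quantization: subsample on a coarse grid,
    then re-quantize each sample to the given number of bits."""
    sampled = image[::step, ::step].astype(np.float64)    # sampling on a discrete grid
    levels = 2 ** bits                                     # number of quantization levels
    quantized = np.round(sampled / 255.0 * (levels - 1))   # map [0, 255] to [0, levels-1]
    return (quantized * (255.0 / (levels - 1))).astype(np.uint8)

# Example: a random 8-bit "image" reduced to a 4-bit representation
frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
coarse = digitize(frame, step=4, bits=4)
```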

To display a digital image, it is first converted into an analog signal, which is scanned

onto a display. Closely related to image processing are computer graphics and computer

vision. In computer graphics, images are manually made from physical models of objects,

environments, and lighting, instead of being acquired (via imaging devices such as cameras)

from natural scenes, as in most animated movies.

Computer vision, on the other hand, is often considered high-level image processing

out of which a machine/computer/software intends to decipher the physical contents of an

image or a sequence of images (e.g., videos or 3D full-body magnetic resonance scans). In

modern sciences and technologies, images also gain much broader scopes due to the ever

growing importance of scientific visualization (of often large-scale complex

scientific/experimental data). Examples include microarray data in genetic research, or real-

time multi-asset portfolio trading in finance. Before an image is processed, it is converted into digital form.

Digitization includes sampling of the image and quantization of the sampled values. After the image is converted into bit information, processing is performed. The processing may be image enhancement, image restoration, or image compression.

1) Image enhancement: It refers to accentuation, or sharpening, of image features such

as boundaries, or contrast to make a graphic display more useful for display &

analysis. This process does not increase the inherent information content in data. It

includes gray level & contrast manipulation, noise reduction, edge crispening and

sharpening, filtering, interpolation and magnification, pseudo coloring, and so on.


2) Image restoration: It is concerned with filtering the observed image to minimize the

effect of degradations. Effectiveness of image restoration depends on the extent and

accuracy of the knowledge of degradation  process as well as on filter design. Image

restoration differs from image enhancement in that the latter is concerned more with extraction or accentuation of image features.

3) Image compression: It is concerned with minimizing the number of bits required to

represent an image. Applications of compression include broadcast TV, remote sensing via satellite, military communication via aircraft, radar, teleconferencing, facsimile transmission for educational and business documents, medical images that arise in computed tomography, magnetic resonance imaging and digital radiology, motion pictures, satellite images, weather maps, geological surveys and so on.

1) Text compression – CCITT GROUP3 & GROUP4

2) Still image compression – JPEG

3) Video image compression – MPEG

Digital Image Processing is a rapidly evolving field with growing applications in Science

and Engineering. Modern digital technology has made it possible to manipulate multi-

dimensional signals. Digital Image Processing has a broad spectrum of applications. They

include remote sensing data via satellite, medical image processing, radar, sonar and acoustic

image processing and robotics.

Uncompressed multimedia graphics, audio and video data require considerable

storage capacity and transmission bandwidth. Despite rapid progress in mass-storage

density, processor speeds, and digital communication system performance, demand for data

storage capacity and data-transmission bandwidth continues to outstrip the capabilities of

available technologies. This is a crippling disadvantage during transmission and storage. So

there arises a need for data compression of images.

There are several image compression techniques. Two ways of classifying compression techniques are mentioned here: 1) lossless vs. lossy compression and 2) predictive vs. transform coding. For correct diagnosis, medical images should be displayed with 100% quality. The popular JPEG image compression technique is a lossy


technique and so causes some loss in image quality. Even though the loss is not a cause of

concern for non-medical images, it makes the analysis of medical images a difficult task. So

it is not suitable for the compression of medical images.

A digital remotely sensed image is typically composed of picture elements (pixels)

located at the intersection of each row i and column j in each of the K bands of imagery.

Associated with each pixel is a number known as Digital Number (DN) or Brightness Value

(BV), that depicts the average radiance of a relatively small area within a scene.

A smaller number indicates low average radiance from the area and a higher number indicates higher radiance from the area. The size of this area affects the reproduction of detail within the scene. As the pixel size is reduced, more scene detail is preserved in the digital representation.

When the different bands of a multispectral data set are displayed in image planes other than their own, the color composite is regarded as a False Color Composite (FCC). High spectral resolution is important when producing color composites.

For a true color composite, the image data acquired in the red, green and blue spectral regions must be assigned to the red, green and blue planes of the image processor frame buffer memory. A color infrared composite (the “standard” false color composite) is displayed by placing the infrared, red and green bands in the red, green and blue frame buffer memory, respectively.

In such a composite, healthy vegetation shows up in shades of red because vegetation absorbs most of the green and red energy but reflects approximately half of the incident infrared energy. Urban areas reflect roughly equal portions of NIR, R and G, and therefore appear steel grey.

Geometric distortions manifest themselves as errors in the position of a pixel relative

to other pixels in the scene and with respect to their absolute position within some defined

map projection. If left uncorrected, these geometric distortions render any data extracted

from the image useless. This is particularly so if the information is to be compared to other

data sets, be it from another image or a GIS data set.


Distortions occur for many reasons. For instance distortions occur due to changes in

platform attitude (roll, pitch and yaw), altitude, earth rotation, earth curvature, panoramic

distortion and detector delay. Most of these distortions can be modelled mathematically and are removed before the image is delivered to the user. Changes in attitude, however, can be difficult to

account for mathematically and so a procedure called image rectification is performed.

Satellite systems are however geometrically quite stable and geometric rectification

is a simple procedure based on a mapping transformation relating real ground coordinates,

say in easting and northing, to image line and pixel coordinates.

Rectification is a process of geometrically correcting an image so that it can be

represented on a planar surface, conform to other images or conform to a map. That is, it is

the process by which geometry of an image is made planimetric. It is necessary when

accurate area, distance and direction measurements are required to be made from the

imagery. It is achieved by transforming the data from one grid system into another grid

system using a geometric transformation. Rectification is not necessary if there is no

distortion in the image. For example, if an image file is produced by scanning or digitizing a

paper map that is in the desired projection system, then that image is already planar and does

not require rectification unless there is some skew or rotation of the image.

Scanning and digitizing produce images that are planar, but do not contain any map

coordinate information. These images need only to be geo-referenced, which is a much

simpler process than rectification. In many cases, the image header can simply be updated

with new map coordinate information. This involves redefining the map coordinate of the

upper left corner of the image and the cell size (the area represented by each pixel).

Ground Control Points (GCP) are the specific pixels in the input image for which the

output map coordinates are known. By using more points than necessary to solve the

transformation equations a least squares solution may be found that minimises the sum of the

squares of the errors. Care should be exercised when selecting ground control points as their

number, quality and distribution affect the result of the rectification. Once the mapping transformation has been determined, a procedure called

resampling is employed. Resampling matches the coordinates of image pixels to their real


world coordinates and writes a new image on a pixel by pixel basis. Since the grid of pixels

in the source image rarely matches the grid for the reference image, the pixels are resampled

so that new data file values for the output file can be calculated.
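As a hedged illustration of this rectification step, the sketch below (hypothetical helper names, using NumPy) fits an affine mapping from a few ground control points by least squares and then resamples the source image by nearest-neighbour lookup; it assumes, for simplicity, that the output grid coordinates coincide with the map coordinates.

```python
import numpy as np

def fit_affine(src_xy, dst_xy):
    """Least-squares affine transform mapping src (x, y) points to dst (x, y) points."""
    n = src_xy.shape[0]
    A = np.hstack([src_xy, np.ones((n, 1))])          # homogeneous [x, y, 1]
    T, *_ = np.linalg.lstsq(A, dst_xy, rcond=None)    # 3x2 transform minimising squared error
    return T

def resample_nearest(src_img, T_out_to_src, out_shape):
    """Write the rectified image pixel by pixel via inverse mapping + nearest neighbour."""
    rows, cols = np.indices(out_shape)
    pts = np.stack([cols.ravel(), rows.ravel(), np.ones(rows.size)], axis=1)
    src_xy = pts @ T_out_to_src                        # map each output pixel back to the source
    sx = np.clip(np.round(src_xy[:, 0]).astype(int), 0, src_img.shape[1] - 1)
    sy = np.clip(np.round(src_xy[:, 1]).astype(int), 0, src_img.shape[0] - 1)
    out = np.zeros(out_shape, dtype=src_img.dtype)
    out[rows.ravel(), cols.ravel()] = src_img[sy, sx]
    return out

# Example with made-up GCPs: image (col, row) points and corresponding output-grid points
gcp_img = np.array([[10, 12], [200, 15], [20, 180], [210, 190], [110, 95]], dtype=float)
gcp_out = np.array([[0, 0], [190, 0], [0, 170], [195, 175], [95, 85]], dtype=float)
T_inv = fit_affine(gcp_out, gcp_img)                   # output grid -> source pixels
rectified = resample_nearest(np.random.rand(200, 220), T_inv, (180, 200))
```

Using more GCPs than strictly necessary makes the least-squares fit average out individual pointing errors, which is why their number, quality and distribution matter.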

Image enhancement techniques improve the quality of an image as perceived by a

human. These techniques are most useful because many satellite images when examined on

a colour display give inadequate information for image interpretation. There is no conscious

effort to improve the fidelity of the image with regard to some ideal form of the image.

There exists a wide variety of techniques for improving image quality.

The contrast stretch, density slicing, edge enhancement, and spatial filtering are the

more commonly used techniques. Image enhancement is attempted after the image is

corrected for geometric and radiometric distortions. Image enhancement methods are applied

separately to each band of a multispectral image. Digital techniques have been found to be more satisfactory than photographic techniques for image enhancement, because of the

precision and wide variety of digital processes.

Contrast generally refers to the difference in luminance or grey level values in an

image and is an important characteristic. It can be defined as the ratio of the maximum

intensity to the minimum intensity over an image. Contrast ratio has a strong bearing on the

resolving power and detectability of an image. The larger this ratio, the easier it is to interpret the image. Satellite images often lack adequate contrast and require contrast improvement.

Contrast enhancement techniques expand the range of brightness values in an image

so that the image can be efficiently displayed in a manner desired by the analyst. The density

values in a scene are literally pulled farther apart, that is, expanded over a greater range. The

effect is to increase the visual contrast between two areas of different uniform densities. This

enables the analyst to discriminate easily between areas initially having a small difference in

density.

This is the simplest contrast stretch algorithm. The grey values in the original image

and the modified image follow a linear relation in this algorithm. A density number in the

low range of the original histogram is assigned to extremely black and a value at the high

end is assigned to extremely white. The remaining pixel values are distributed linearly


between these extremes. The features or details that were obscure on the original image will

be clear in the contrast stretched image. Linear contrast stretch operation can be represented

graphically. To provide optimal contrast and color variation in color composites the small

range of grey values in each band is stretched to the full brightness range of the output or

display unit.
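A minimal sketch of such a linear contrast stretch on an 8-bit grayscale band (the 2nd and 98th percentile cut-offs are an illustrative choice, not prescribed by the text):

```python
import numpy as np

def linear_stretch(band, low_pct=2, high_pct=98):
    """Map a chosen low/high density range linearly onto the full 0-255 display range."""
    lo, hi = np.percentile(band, [low_pct, high_pct])
    stretched = (band.astype(np.float64) - lo) / max(hi - lo, 1e-6) * 255.0
    # Values at or below lo become extremely black, values at or above hi extremely white
    return np.clip(stretched, 0, 255).astype(np.uint8)
```

Applied separately to each band of a multispectral image, this expands the small range of grey values to the full brightness range of the display.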

In these methods, the input and output data values follow a non-linear

transformation. The general form of the non-linear contrast enhancement is defined by y = f

(x), where x is the input data value and y is the output data value. The non-linear contrast

enhancement techniques have been found to be useful for enhancing the colour contrast

between nearby classes and subclasses of a main class.

In this report, chapter 2 deals with a model of a visual attention system, which builds on a biologically plausible architecture at the basis of several models, and with an automatic segmentation algorithm for video frames captured by a webcam that closely approximates depth segmentation from a stereo camera.

The model of visual attention system is related to the so-called “feature integration

theory,” explaining human visual search strategies. Visual input is first decomposed into a

set of topographic feature maps. Different spatial locations then compete for saliency within

each map, such that only locations which locally stand out from their surround can persist.

All feature maps feed, in a purely bottom-up manner, into a master “saliency map,” which

topographically codes for local conspicuity over the entire visual scene.

In primates, such a map is believed to be located in the posterior parietal cortex as

well as in the various visual maps in the pulvinar nuclei of the thalamus. The model’s

saliency map is endowed with internal dynamics which generate attentional shifts.

An automatic segmentation algorithm exploits motion and its spatial context as a

powerful cue for layer separation, and the correct level of geometric rigidity is automatically

learned from training data.

The algorithm benefits from a novel, quantized motion representation (cluster

centroids of the spatiotemporal derivatives of video frames), referred to as motons. Motons


(related to textons), inspired by recent research in motion modeling and object/material

recognition are combined with shape filters to model long-range spatial correlations (shape).

These new features prove useful at capturing the visual context and filling in missing, textureless, or motionless regions.

Fused motion-shape cues are discriminatively selected by supervised learning. Key to

the technique is a classifier trained on depth-defined layer labels, such as those used in a

stereo setting, as opposed to motion-defined layer labels. Thus, to induce depth in the

absence of stereo while maintaining generalization, the classifier is forced to combine other

available cues accordingly.

Combining multiple cues addresses one of the two aforementioned requirements,

robustness, for the bilayer segmentation. To meet the other requirement, efficiency, a

straightforward way is to trade accuracy for speed if only one type of classifier is used.

However, efficiency can be achieved without sacrificing much accuracy if multiple types of classifiers, such as AdaBoost, decision trees, random forests, random ferns, and attention cascades, are available. That is, if the “strong” classifiers are viewed as a composition of “weak” learners (decision stumps), it is possible to control how the strong classifiers are

constructed to fit the limitations in evaluation time.

This approach describes a general taxonomy of classifiers that interprets these common algorithms as variants of a single tree-based classifier. The taxonomy allows the different algorithms to be compared fairly in terms of evaluation complexity (time) and the most efficient or accurate one to be selected for the application at hand.

By fusing motion-shape, color, and contrast cues with a local smoothness prior in a conditional random field model, pixel-wise binary segmentation is achieved through min-cut.

The result is a segmentation algorithm that is efficient and robust to distracting events and

that requires no initialization.

Chapter 3 proposes a robust video object extraction (VOE) framework, which

utilizes both visual and motion saliency information across video frames. The observed

saliency information allows several visual and motion cues to be inferred for learning foreground

and background models, and a conditional random field (CRF) is applied to automatically


determine the label (foreground or background) of each pixel based on the observed

models.

With the ability to preserve both spatial and temporal consistency, VOE framework

exhibits promising results on a variety of videos, and produces quantitatively and

qualitatively satisfactory performance. While the focus here is on VOE problems for single-concept videos (i.e., videos in which only one object category of interest is present), the method is able to deal with multiple object instances (of the same type) with variations in pose, scale, etc. Chapter 4 deals with the comparison between the three schemes, and chapter 5 gives a conclusion.


2. LITERATURE SURVEY

Digital Image Processing is a rapidly evolving field with growing applications in

Science and Engineering. Modern digital technology has made it possible to manipulate

multi-dimensional signals. Digital Image Processing has a broad spectrum of applications.

They include remote sensing data via satellite, medical image processing, radar, sonar and

acoustic image processing and robotics.

Uncompressed multimedia graphics, audio and video data require considerable

storage capacity and transmission bandwidth. Despite rapid progress in mass-storage

density, processor speeds, and digital communication system performance, demand for data

storage capacity and data-transmission bandwidth continues to outstrip the capabilities of

available technologies. This is a crippling disadvantage during transmission and storage. So

there arises a need for data compression of images.

There are several Image compression techniques. Two ways of classifying

compression techniques are mentioned here.

1) Lossless vs. lossy compression

2) Predictive vs. transform coding

For correct diagnosis, the medical images should be displayed with 100% quality. The

popular JPEG image compression technique is a lossy technique and so causes some loss in image quality. Even though the loss is not a cause of concern for non-medical images, it makes

the analysis of medical images a difficult task. So it is not suitable for the compression of

medical images.

The first scheme, “Model of Saliency-Based Visual Attention” [1], presents a model of a saliency-based visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system. Multiscale image features are

combined into a single topographical saliency map. A dynamical neural network then selects

attended locations in order of decreasing saliency. The system breaks down the complex

problem of scene understanding by rapidly selecting, in a computationally efficient manner,

conspicuous locations to be analyzed in detail.


The second scheme, “Bilayer Segmentation of Webcam Videos” [2], addresses the

problem of extracting a foreground layer from video chat captured by a (monocular) webcam

that closely approximates depth segmentation from a stereo camera. The foreground is

intuitively defined as a subject of video chat, not necessarily frontal, while the background is

literally anything else. Applications for this technique include background substitution,

compression, adaptive bit rate video transmission, and tracking. These applications have at

least two requirements:

1) Robust segmentation against strong distracting events, such as people moving in

the background, camera shake, or illumination change, and

2) Efficient separation for attaining live streaming speed.

Image processing can be defined as the acquisition and processing of visual information by computer. Computer representation of an image requires the equivalent of many thousands of words of data, so the massive amount of data required for images is a primary reason for the development of many subareas within the field of computer imaging, such as image compression and segmentation.

Another important aspect of computer imaging involves the ultimate “receiver” of the visual information: in some cases the human visual system, and in others the computer itself. Computer imaging can be separated into two primary categories, computer vision and image processing, which are not totally separate and distinct. The boundaries between the two are fuzzy, but this division makes it possible to explore the differences between them and to understand how they fit together. One of the major topics within the field of computer vision is image analysis. Image analysis involves the examination of the image data to facilitate solving vision problems. The image analysis process involves two other topics:

1) Feature Extraction: is the process of acquiring higher level image

information, such as shape or color information.

2) Pattern Classification: is the act of taking this higher-level information and

identifying objects within the image.


Computer vision systems are used in many types of environments, such as manufacturing systems, the medical community, law enforcement, infrared imaging, and orbiting satellites.

Image processing is computer imaging where the application involves a human being in the visual loop. In other words, the images are to be examined and acted upon by people.

The major topics within the field of image processing include:

1) Image restoration.

2) Image enhancement.

3) Image compression.

2.1 MODEL OF SALIENCY-BASED VISUAL ATTENTION

Primates have a remarkable ability to interpret complex scenes in real time, despite

the limited speed of the neuronal hardware available for such tasks. Intermediate and higher

visual processes appear to select a subset of the available sensory information before further

processing, most likely to reduce the complexity of scene analysis.

This selection appears to be implemented in the form of a spatially circumscribed

region of the visual field, the so-called “focus of attention,” which scans the scene both in a

rapid, bottom-up, saliency-driven, and task-independent manner as well as in a slower, top-

down, volition-controlled, and task-dependent manner. Models of attention include

“dynamic routing” models, in which information from only a small region of the visual field

can progress through the cortical visual hierarchy.

The attended region is selected through dynamic modifications of cortical

connectivity or through the establishment of specific temporal patterns of activity, under

both top-down (task-dependent) and bottom-up (scene-dependent) control.

The model used here builds on a second biologically plausible architecture at the

basis of several models. It is related to the so-called “feature integration theory,” explaining

human visual search strategies. Visual input is first decomposed into a set of topographic


feature maps. Different spatial locations then compete for saliency within each map, such

that only locations which locally stand out from their surround can persist.

All feature maps feed, in a purely bottom-up manner, into a master “saliency map,”

which topographically codes for local conspicuity over the entire visual scene. In primates,

such a map is believed to be located in the posterior parietal cortex as well as in the various

visual maps in the pulvinar nuclei of the thalamus.

The model’s saliency map is endowed with internal dynamics which generate

attentional shifts. This model consequently represents a complete account of bottom-up

saliency and does not require any top-down guidance to shift attention.

This framework provides a massively parallel method for the fast selection of a small

number of interesting image locations to be analyzed by more complex and time consuming

object-recognition processes. Extending this approach in “guided-search,” feedback from

higher cortical areas (e.g., knowledge about targets to be found) was used to weight the

importance of different features, such that only those with high weights could reach higher

processing levels.

Input is provided in the form of static color images, usually digitized at 640 × 480

resolution. Nine spatial scales are created using dyadic Gaussian pyramids, which

progressively low-pass filter and subsample the input image, yielding horizontal and vertical

image-reduction factors ranging from 1:1 (scale zero) to 1:256 (scale eight) in eight octaves.

Each feature is computed by a set of linear “center-surround” operations akin to

visual receptive fields: Typical visual neurons are most sensitive in a small region of the

visual space (the center), while stimuli presented in a broader, weaker antagonistic region

concentric with the center (the surround) inhibit the neuronal response.

Such an architecture, sensitive to local spatial discontinuities, is particularly well-

suited to detecting locations which stand out from their surround and is a general

computational principle in the retina, lateral geniculate nucleus, and primary visual cortex.

Center-surround is implemented in the model as the difference between fine and coarse


scales: The center is a pixel at scale c ϵ {2, 3, 4}, and the surround is the corresponding pixel

at scale s = c + δ, with δ ϵ {3, 4}.

The across-scale difference between two maps, denoted “Ө” below, is obtained by

interpolation to the finer scale and point-by-point subtraction. Using several scales not only

for c but also for δ = s - c yields truly multiscale feature extraction, by including different

size ratios between the center and surround regions (contrary to previously used fixed

ratios).

2.1.1 Extraction of Early Visual Features

With r, g, and b being the red, green, and blue channels of the input image, an

intensity image I is obtained as I = (r + g + b)/3. I is used to create a Gaussian pyramid I(σ),

where σ ɛ [0..8] is the scale. The r, g, and b channels are normalized by I in order to

decouple hue from intensity. However, because hue variations are not perceivable at very

low luminance (and hence are not salient), normalization is only applied at the locations

where I is larger than 1/10 of its maximum over the entire image (other locations yield zero

r, g, and b). Four broadly-tuned color channels are created: R = r - (g + b)/2 for red, G = g -

(r + b)/2 for green, B = b - (r + g)/2 for blue, and Y = (r + g)/2 - |r - g|/2 - b for yellow

(negative values are set to zero).
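A minimal NumPy sketch of these channel definitions, assuming a float RGB frame in [0, 1] (the 10% luminance threshold follows the text):

```python
import numpy as np

def color_channels(frame):
    """frame: H x W x 3 float array with r, g, b in [0, 1]."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    I = (r + g + b) / 3.0                                   # intensity image

    # Normalize r, g, b by I only where I exceeds 1/10 of its maximum
    mask = I > 0.1 * I.max()
    rn = np.where(mask, r / np.maximum(I, 1e-12), 0.0)
    gn = np.where(mask, g / np.maximum(I, 1e-12), 0.0)
    bn = np.where(mask, b / np.maximum(I, 1e-12), 0.0)

    # Four broadly tuned color channels; negative values are set to zero
    R = np.maximum(rn - (gn + bn) / 2.0, 0.0)
    G = np.maximum(gn - (rn + bn) / 2.0, 0.0)
    B = np.maximum(bn - (rn + gn) / 2.0, 0.0)
    Y = np.maximum((rn + gn) / 2.0 - np.abs(rn - gn) / 2.0 - bn, 0.0)
    return I, R, G, B, Y
```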

Four Gaussian pyramids R(s), G(s), B(s), and Y(s) are created from these color

channels. Center-surround differences (Ө defined previously) between a “center” fine scale c

and a “surround” coarser scale s yield the feature maps. The first set of feature maps is

concerned with intensity contrast, which, in mammals, is detected by neurons sensitive either

to dark centers on bright surrounds or to bright centers on dark surrounds.

Here, both types of sensitivities are simultaneously computed (using a rectification)

in a set of six maps I(c, s), with c ϵ {2, 3, 4} and s = c + δ, δ ϵ {3, 4}:

I(c, s) = |I(c) Ө I(s)|. (1)

A second set of maps is similarly constructed for the color channels, which, in cortex,

are represented using a so-called “color double-opponent” system: In the center of their


receptive fields, neurons are excited by one color (e.g., red) and inhibited by another (e.g.,

green), while the converse is true in the surround.

Such spatial and chromatic opponency exists for the red/green, green/red,

blue/yellow, and yellow/blue color pairs in human primary visual cortex. Accordingly, maps

RG(c,s) are created in the model to simultaneously account for red/green and green/red

double opponency (2) and BY(c, s) for blue/yellow and yellow/blue double opponency (3):

RG(c, s) = |(R(c) - G(c)) Ө (G(s) - R(s))| (2)

BY(c, s) = |(B(c) - Y(c)) Ө (Y(s) - B(s))|. (3)

Local orientation information is obtained from I using oriented Gabor pyramids

O(σ,θ), where σ ϵ [0..8] represents the scale and θ ϵ {0o, 45o, 90o, 135o} is the preferred

orientation. (Gabor filters, which are the product of a cosine grating and a 2D Gaussian

envelope, approximate the receptive field sensitivity profile (impulse response) of

orientation-selective neurons in primary visual cortex.)

Orientation feature maps, O(c, s, θ), encode, as a group, local orientation contrast

between the center and surround scales:

O(c, s, θ) = |O(c, θ) Ө O(s, θ)|. (4)

In total, 42 feature maps are computed: six for intensity, 12 for color, and 24 for

orientation.
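A sketch of the dyadic pyramid and the center-surround operation behind these feature maps; OpenCV's pyrDown/resize are an assumed implementation choice, and the input is assumed large enough (e.g., 640 × 480) for nine pyramid levels:

```python
import cv2
import numpy as np

def gaussian_pyramid(img, levels=9):
    """Dyadic Gaussian pyramid: scale 0 is the input, scale k is reduced by 1:2^k."""
    pyr = [img.astype(np.float32)]
    for _ in range(1, levels):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr, c, s):
    """Across-scale difference: interpolate the coarse map to the fine scale
    and take the point-by-point absolute difference."""
    h, w = pyr[c].shape[:2]
    surround = cv2.resize(pyr[s], (w, h), interpolation=cv2.INTER_LINEAR)
    return np.abs(pyr[c] - surround)

def intensity_feature_maps(I):
    """Six intensity maps I(c, s) with c in {2, 3, 4} and s = c + delta, delta in {3, 4}."""
    pyr = gaussian_pyramid(I)
    return {(c, c + d): center_surround(pyr, c, c + d)
            for c in (2, 3, 4) for d in (3, 4)}
```

The color maps RG(c, s) and BY(c, s) and the orientation maps O(c, s, θ) follow the same pattern, applied to the opponent color pyramids and to Gabor pyramids respectively.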

2.1.2 The Saliency Map

The purpose of the saliency map is to represent the conspicuity—or “saliency”—at

every location in the visual field by a scalar quantity and to guide the selection of attended

locations, based on the spatial distribution of saliency. A combination of the feature maps

provides bottom-up input to the saliency map, modeled as a dynamical neural network.

One difficulty in combining different feature maps is that they represent a priori not

comparable modalities, with different dynamic ranges and extraction mechanisms. Also,


because all 42 feature maps are combined, salient objects appearing strongly in only a few

maps may be masked by noise or by less-salient objects present in a larger number of maps.

In the absence of top-down supervision, a map normalization operator, N(.), is proposed,

which globally promotes maps in which a small number of strong peaks of activity

(conspicuous locations) is present, while globally suppressing maps which contain numerous

comparable peak responses. N(.) consists of

1) normalizing the values in the map to a fixed range [0..M], in order to eliminate

modality-dependent amplitude differences;

2) finding the location of the map’s global maximum M and computing the average m

of all its other local maxima; and

3) globally multiplying the map by (M – m)².

Only local maxima of activity are considered, such that N(.) compares responses associated

with meaningful “activation spots” in the map and ignores homogeneous areas. Comparing

the maximum activity in the entire map to the average overall activation measures how

different the most active location is from the average.
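A hedged sketch of N(.) under these definitions; the local maxima are found with a maximum filter, and the fixed range M = 1 and the neighborhood size are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(fmap, M=1.0, neighborhood=7):
    """Map normalization operator N(.): promote maps with a few strong peaks."""
    fmap = fmap - fmap.min()
    if fmap.max() > 0:
        fmap = fmap / fmap.max() * M                  # 1) normalize to the fixed range [0, M]

    # 2) find local maxima; average all maxima other than the global one
    local_max = (fmap == maximum_filter(fmap, size=neighborhood)) & (fmap > 0)
    peaks = fmap[local_max]
    global_max = peaks.max() if peaks.size else 0.0
    others = peaks[peaks < global_max]
    mean_other = others.mean() if others.size else 0.0

    # 3) globally multiply the map by (M - m)^2
    return fmap * (global_max - mean_other) ** 2
```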

When this difference is large, the most active location stands out, and the map is

strongly promoted. When the difference is small, the map contains nothing unique and is

suppressed. The biological motivation behind the design of N(.) is that it coarsely replicates

cortical lateral inhibition mechanisms, in which neighboring similar features inhibit each

other via specific, anatomically defined connections.

Feature maps are combined into three “conspicuity maps,” I for intensity (5), C for

color (6), and O for orientation (7), at the scale (σ = 4) of the saliency map. They are

obtained through across-scale addition, “⊕,” which consists of reduction of each map to scale four and point-by-point addition:

I = ⊕c=2..4 ⊕s=c+3..c+4 N(I(c, s)) (5)

C = ⊕c=2..4 ⊕s=c+3..c+4 [N(RG(c, s)) + N(BY(c, s))] (6)


For orientation, four intermediary maps are first created by combination of the six feature

maps for a given θ and are then combined into a single orientation conspicuity map:

O = Σθ ϵ {0°, 45°, 90°, 135°} N( ⊕c=2..4 ⊕s=c+3..c+4 N(O(c, s, θ)) ) (7)

The motivation for the creation of three separate channels, I, C, and O, and their individual

normalization is the hypothesis that similar features compete strongly for saliency, while

different modalities contribute independently to the saliency map. The three conspicuity

maps are normalized and summed into the final input S to the saliency map:

S= 1/3 (N(I) + N(C) + N(O)) (8)
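A sketch of the across-scale addition and the combination in equations (5)-(8), reusing the hypothetical normalize_map helper from the earlier sketch (the map lists are assumed to be the feature maps computed above):

```python
import cv2
import numpy as np

def across_scale_add(maps, target_shape):
    """Reduce each map to the target (scale-4) resolution and add point by point."""
    h, w = target_shape
    total = np.zeros((h, w), dtype=np.float32)
    for m in maps:
        total += cv2.resize(m, (w, h), interpolation=cv2.INTER_LINEAR)
    return total

def saliency_input(intensity_maps, rg_maps, by_maps, orientation_maps, target_shape):
    """Combine normalized feature maps into conspicuity maps and the final input S."""
    I_bar = across_scale_add([normalize_map(m) for m in intensity_maps], target_shape)
    C_bar = across_scale_add([normalize_map(rg) + normalize_map(by)
                              for rg, by in zip(rg_maps, by_maps)], target_shape)
    # Four intermediary orientation maps (one per theta), normalized and summed
    O_bar = np.zeros(target_shape, dtype=np.float32)
    for per_theta in orientation_maps:        # per_theta: the six maps for one orientation
        O_bar += normalize_map(across_scale_add([normalize_map(m) for m in per_theta],
                                                target_shape))
    return (normalize_map(I_bar) + normalize_map(C_bar) + normalize_map(O_bar)) / 3.0
```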

At any given time, the maximum of the saliency map (SM) defines the most salient image

location, to which the focus of attention (FOA) should be directed. One could now simply select the most active location as defining the point where the model should next attend. However, in a neuronally plausible implementation, the SM is modeled as a 2D layer of leaky integrate-and-fire neurons at scale four.

These model neurons consist of a single capacitance which integrates the charge

delivered by synaptic input, of a leakage conductance, and of a voltage threshold. When the

threshold is reached, a prototypical spike is generated, and the capacitive charge is shunted

to zero. The SM feeds into a biologically-plausible 2D “winner-take-all” (WTA) neural

network, at scale σ = 4, in which synaptic interactions among units ensure that only the most

active location remains, while all other locations are suppressed.

The neurons in the SM receive excitatory inputs from S and are all independent. The

potential of SM neurons at more salient locations hence increases faster (these neurons are

used as pure integrators and do not fire). Each SM neuron excites its corresponding WTA

neuron. All WTA neurons also evolve independently of each other, until one (the “winner”)

first reaches threshold and fires. This triggers three simultaneous mechanisms:

1) The FOA is shifted to the location of the winner neuron;

2) The global inhibition of the WTA is triggered and completely inhibits (resets) all

WTA neurons;


3) Local inhibition is transiently activated in the SM, in an area with the size and new

location of the FOA; this not only yields dynamical shifts of the FOA, by allowing

the next most salient location to subsequently become the winner, but it also prevents

the FOA from immediately returning to a previously-attended location.

Such an “inhibition of return” has been demonstrated in human visual psychophysics. In

order to slightly bias the model to subsequently jump to salient locations spatially close to

the currently-attended location, a small excitation is transiently activated in the SM, in a near

surround of the FOA (the “proximity preference” rule of Koch and Ullman). Since no top-down attentional component is modeled, the FOA is a simple disk whose radius is fixed

to one sixth of the smaller of the input image width or height.
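The following is a much-simplified discrete-time sketch of this winner-take-all selection with inhibition of return; the decay factor stands in for the leaky integrate-and-fire dynamics and is not one of the paper's parameter values:

```python
import numpy as np

def scan_saliency(S, n_fixations=5, foa_radius=None, decay=0.99):
    """Iteratively pick the most salient location and suppress an FOA-sized disk
    around it (inhibition of return). Returns the attended (row, col) positions."""
    sm = S.astype(np.float64).copy()
    h, w = sm.shape
    if foa_radius is None:
        foa_radius = min(h, w) // 6             # one sixth of the smaller image dimension
    yy, xx = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fixations):
        winner = np.unravel_index(np.argmax(sm), sm.shape)   # WTA: most active location
        fixations.append(winner)
        # Transient local inhibition in an FOA-sized area at the attended location
        mask = (yy - winner[0]) ** 2 + (xx - winner[1]) ** 2 <= foa_radius ** 2
        sm[mask] = 0.0
        sm *= decay                              # crude stand-in for leaky integration
    return fixations
```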

The time constants, conductances, and firing thresholds of the simulated neurons

were chosen so that the FOA jumps from one salient location to the next in approximately

30–70 ms (simulated time), and that an attended area is inhibited for approximately 500–900

ms, as has been observed psychophysically. The difference in the relative magnitude of these

delays proved sufficient to ensure thorough scanning of the image and prevented cycling

through only a limited number of locations. All parameters are fixed in the implementation,

and the system proved stable over time for all images studied.

2.2 BILAYER SEGMENTATION OF WEBCAM VIDEOS

This addresses the problem of extracting a foreground layer from video chat captured

by a (monocular) webcam that closely approximates depth segmentation from a stereo

camera. The foreground is intuitively defined as a subject of video chat, not necessarily

frontal, while the background is literally anything else. Applications for this technique

include background substitution, compression, adaptive bit rate video transmission, and

tracking.

This focuses on the common scenario of the video chat application. Image layer

extraction has been an active research area. Recent work in this area has produced

compelling, real-time algorithms, based on either stereo or motion. Some previously used

algorithms require initialization in the form of a “clean” image of the background.


Stereo-based segmentation seems to achieve the most robust results, as background

objects are correctly separated from the foreground independently from their motion versus-

stasis characteristics. This aims at achieving a similar behavior monocularly. In some

monocular systems, the static background assumption causes inaccurate segmentation in the

presence of camera shake (e.g., for a webcam mounted on a laptop screen), illumination

change, and large objects moving in the background.

Although that algorithm precludes the need for a clean background image, the

segmentation still suffers in the presence of large background motion. Moreover,

initialization is sometimes necessary in the form of global color models. However, in the

videochat application, such rigid models are not capable of accurately describing the

foreground motion. Furthermore, the approach avoids the complexities associated with optical flow computation.

The algorithm exploits motion and its spatial context as a powerful cue for layer

separation, and the correct level of geometric rigidity is automatically learned from training

data. The algorithm benefits from a novel, quantized motion representation (cluster centroids

of the spatiotemporal derivatives of video frames), referred to as motons. Motons (related to

textons), inspired by recent research in motion modeling and object/material recognition are

combined with shape filters to model long-range spatial correlations (shape).

These new features prove useful at capturing the visual context and filling in missing,

textureless, or motionless regions. Fused motion-shape cues are discriminatively selected by

supervised learning. Key to the technique is a classifier trained on depth-defined layer labels,

such as those used in a stereo setting, as opposed to motion-defined layer labels.

Thus, to induce depth in the absence of stereo while maintaining generalization, the

classifier is forced to combine other available cues accordingly. Combining multiple cues

addresses one of the two aforementioned requirements, robustness, for the bilayer

segmentation. To meet the other requirement, efficiency, a straightforward way is to trade

accuracy for speed if only one type of classifier is used.

However, efficiency can be achieved without sacrificing much accuracy if multiple types of classifiers, such as AdaBoost, decision trees, random forests, random ferns, and attention cascades, are available. That is, if the “strong” classifiers are viewed as a composition of “weak” learners (decision stumps), it is possible to control how the strong classifiers are constructed to fit the limitations in evaluation time.

This scheme describes a general taxonomy of classifiers that interprets these common algorithms as variants of a single tree-based classifier. The taxonomy allows the different algorithms to be compared fairly in terms of evaluation complexity (time) and the most efficient or accurate one to be selected for the application at hand.

By fusing motion-shape, color, and contrast cues with a local smoothness prior in a conditional random field model, pixel-wise binary segmentation is achieved through min-cut.

The result is a segmentation algorithm that is efficient and robust to distracting events and

that requires no initialization.

2.2.1 Notation

Given an input sequence of images, a frame is represented as an array z = (z1, z2, . . . , zn, . . . , zN) of pixels in the YUV color space, indexed by the pixel position n. A frame at time t is denoted z^t. Temporal derivatives are denoted ż = (ż1, ż2, . . . , żn, . . . , żN) and are computed as żn^t = | G(zn^t) – G(zn^(t–1)) | at each time t, with a Gaussian kernel G(.) at a scale of σt pixels. Spatial gradients g = (g1, g2, . . . , gn, . . . , gN), in which gn = |∇zn|, are computed by convolving the images with derivative-of-Gaussian (DoG) kernels of width σs. Here, σs = σt = 0.8 is used, approximating a Nyquist sampling filter. Spatiotemporal derivatives are computed on the Y channel only. Given Om = (g, ż), the segmentation task is to infer a binary label xn ϵ {Fg, Bg}.
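A hedged sketch of these per-pixel observations, using SciPy's Gaussian filters; the frame arrays are assumed to be the Y channel as floats:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_gradient_magnitude

SIGMA = 0.8   # sigma_s = sigma_t = 0.8, approximating a Nyquist sampling filter

def spatiotemporal_features(y_prev, y_curr, sigma=SIGMA):
    """Per-pixel temporal derivative and spatial gradient magnitude on the Y channel."""
    zdot = np.abs(gaussian_filter(y_curr, sigma) - gaussian_filter(y_prev, sigma))
    g = gaussian_gradient_magnitude(y_curr, sigma)   # |gradient| via DoG kernels
    return np.stack([g, zdot], axis=-1)              # Om = (g, zdot) for each pixel
```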

2.2.2 Motons

The two-dimensional Om is computed for all training pixels and then clustered into M clusters using expectation maximization (EM). The M resulting cluster centroids are called motons. An example with M = 10 motons is shown in Fig. 2.1.

This operation may be interpreted as building a vocabulary of motion-based visual words.

The visual words capture information about the motion and the “edgeness” of image pixels,

rather than their texture content as in textons.


Fig 2.1 Motons. Training spatiotemporal derivatives clustered into 10 cluster centroids (motons).

Different colors for different clusters.

Clustering 1) enables efficient indexing of the joint (g, ż) space while maintaining a useful correlation between g and ż and 2) reduces sensitivity to noise. Dictionaries of 6, 10, and 15 clusters were tested empirically with multiple random starts. A dictionary size of just 10 motons

has proven sufficient, and the clusters are generally stable in multiple runs. The moton

representation also yields fewer segmentation errors than the use of Om directly.

The observation is that strong edges with low temporal derivatives usually

correspond to background regions, while strong edges with high temporal derivatives are

likely to be in the foreground. Textureless regions tend to have their Fg/Bg log-likelihood

ratio (LLR) close to zero due to uncertainty. Such motion-versus-stasis discrimination

properties are retained by the quantized representation; however, they cannot sufficiently

separate the moving background from the moving foreground.

Given a dictionary of motons, each pixel in a new image can be assigned to its

closest moton by maximum likelihood (ML). Therefore, each pixel can now be replaced by

an index into the small visual dictionary. Then, a moton map can be decomposed into its M

component bands, namely “moton bands”.

Thus, there are M moton bands Ik, k = 1, . . . , M, for each video frame z. Each Ik is a

binary image, with Ik(n) indicating whether the nth pixel has been assigned to the kth moton

or not.
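A sketch of moton learning and the moton bands under these definitions; EM clustering via scikit-learn's GaussianMixture is one possible implementation choice, and M = 10 follows the text:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_motons(training_features, M=10, seed=0):
    """Cluster the 2D (g, zdot) observations of all training pixels into M motons."""
    gmm = GaussianMixture(n_components=M, random_state=seed).fit(
        training_features.reshape(-1, 2))
    return gmm       # the component means play the role of moton centroids

def moton_bands(frame_features, gmm):
    """Assign each pixel to its most likely moton and split the result into M binary bands."""
    h, w, _ = frame_features.shape
    labels = gmm.predict(frame_features.reshape(-1, 2)).reshape(h, w)
    return [(labels == k).astype(np.uint8) for k in range(gmm.n_components)]
```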


2.2.3 Shape Filters

In the video chat application, the foreground object (usually a person) moves nonrigidly yet in a structured fashion. This section briefly explains the shape filters and then shows how to capture the spatial context of motion adaptively. To detect faces using the spatial context of an image patch (detection window), an overcomplete set of Haar-like shape filters is built, as shown in Fig. 2.2.

Fig 2.2 Shape Filters (a) A shape filter composed of two rectangles (shown in a detection window centered at

a particular pixel), (b) and (c) Shape filters composed of more rectangles.

The sum of the pixel values in the black rectangles is subtracted from the sum of the pixel values in the white rectangles. The resulting value of the subtraction is the feature response used for classification. The shape filter is applied only to a gray-scale image, that is, the

“pixel value” in that paper corresponds to image intensity.

In speaker detection, similar shape filters are applied to a gray-scale image, a frame

difference, and a frame running average. In object categorization, texton layout filters

generalize the shape filters by randomly generating the coordinates of the white and black

rectangles and applying the filters to randomly selected texton channels (bands). In experiments across multiple applications, the shape filters produce simple but effective features for

classification.


In order to efficiently compute the feature value of the shape filters, integral image

processing can be used. An integral image ii(x, y, k) is the sum of the pixel value in the

rectangle from the top left corner (0,0) to the pixel (x,y) in image band k (k = 1). Therefore,

computing the sum of the pixel values for an arbitrary rectangle with its top left at (x1, y1) and its bottom right at (x2, y2) in image band k requires only three operations, ii(x2, y2, k) – ii(x1, y2, k) – ii(x2, y1, k) + ii(x1, y1, k), that is, constant time complexity for each rectangle

regardless of its size in the shape filter and band index k.

To infer Fg/Bg for a pixel n, the contextual information within a (sliding) detection

window centered at n is used. The size of the detection window is about the size of the video

frame and fixed for all the pixels. Within a detection window a shape filter is defined as a

moton-rectangle pair (k,r), with k indexing in the dictionary of motons and r indexing a

rectangular mask. First, all of the possible shape filters within a detection window are denoted as S*, and then a set of d shape filters S = {(ki, ri)}, i = 1, . . . , d, is defined by

randomly selecting moton-rectangle pairs from S*.

For each pixel position n, the associated feature ψ is computed as follows: given the moton k, center the detection window at n and count the number of pixels in Ik that fall in the offset rectangle mask r. This count is denoted vn(k, r). The feature value ψn(i, j) is obtained by simply subtracting the moton counts collected for the two shape filters (ki, ri) and (kj, rj), i.e., ψn(i, j) = vn(ki, ri) – vn(kj, rj). The moton counts vn can be computed efficiently with one integral image for every moton band Ik. Therefore, given S, by randomly selecting i, j, and k (1 ≤ i, j ≤ d, 1 ≤ k ≤ M), a total of d²·M² features can be computed at every pixel n.
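A hedged sketch of the integral-image machinery behind these counts; the function names and rectangle-offset convention are illustrative, and boundary handling is omitted for brevity:

```python
import numpy as np

def integral_image(band):
    """ii(x, y) = sum of the band over the rectangle from (0, 0) to (x, y) inclusive."""
    return band.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x1, y1, x2, y2):
    """Sum over the rectangle with top left (x1, y1) and bottom right (x2, y2),
    using a constant number of additions/subtractions regardless of rectangle size."""
    total = ii[y2, x2]
    if x1 > 0:
        total -= ii[y2, x1 - 1]
    if y1 > 0:
        total -= ii[y1 - 1, x2]
    if x1 > 0 and y1 > 0:
        total += ii[y1 - 1, x1 - 1]
    return total

def shape_filter_feature(moton_iis, n_xy, filt_i, filt_j):
    """psi_n(i, j) = v_n(k_i, r_i) - v_n(k_j, r_j); each filter is (moton index, offset rect)."""
    x, y = n_xy
    def count(filt):
        k, (dx1, dy1, dx2, dy2) = filt          # rectangle offsets relative to pixel n
        return rect_sum(moton_iis[k], x + dx1, y + dy1, x + dx2, y + dy2)
    return count(filt_i) - count(filt_j)
```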

2.2.4 The Tree-Cube Taxonomy

Common classification algorithms, such as decision trees, boosting, and random

forests, share the fact that they build “strong” classifiers from a combination of “weak”

learners, often just decision stumps. The main difference among these algorithms is the way

the weak learners are combined, and exploring the difference may lead to an accurate

classifier that is also efficient enough to fit the limitations of the evaluation time. This

section presents a useful framework for constructing strong classifiers by combining weak

learners in different ways to facilitate the analysis of accuracy and efficiency.


The three most common ways to combine weak learners are 1) hierarchically (H),

2) by averaging (A), and 3) by boosting (B), or more generally adaptive reweighting and

combining (ARCing). The origin represents the weak learner (e.g., a decision stump), and

axes H, A, and B represent those three basic combination “moves.”

1. The H-move hierarchically combines weak learners into decision trees. During

training, a new weak learner is iteratively created and attached to a leaf node, if needed,

based on information gain. Evaluation of one instance includes only the weak learners along

the corresponding decision path in the tree. It can be shown that the H-move reduces

classification bias.

2. The B-move, instead, linearly combines weak learners. After the insertion of each

weak learner, the training data are reweighted or resampled such that the last weak learner

has 50 percent accuracy in the new data distribution. Evaluation of one instance includes the

weighted sum of the outputs of all of the weak learners in the booster. Examples of the B-

move include AdaBoost and gentle boost. Boosting reduces the empirical error bound by

perturbing the training data.

3. The A-move creates strong classifiers by averaging the results of many weak

learners. Note that all of the weak learners added by the A-move solve the same problem

while those sequentially added by the H and B-moves solve problems with a different data

distribution. Thus, the main computational advantage in training is that each weak learner

can be learned independently and in parallel. When the weak learner is a random tree, A-

move gives rise to random forests. The A-move also reduces classification variance.

Paths, not vertices, along the edges of the cube in Fig. 2.3 correspond to different

combinations of weak learners and thus (unlimited) different strong classifiers. If each of the three basic moves is restricted to being used at most once, the tree-cube taxonomy produces three

order-1 algorithms (excluding the base learner itself), six order-2, and six order-3 algorithms.

Many known algorithms, such as boosting (B), decision trees (H), booster of trees (HB), and

random forests (HA), are conveniently mapped into paths through the tree cube. Also note

that the widely used attention cascade can be interpreted as a one-sided tree of boosters

(BH). The tree-cube taxonomy also makes it possible to explore new algorithms (e.g., HAB) and


compare them to other algorithms of the same order (e.g., BHA). The next question is which classifier performs best for the video segmentation application. Following the tree-cube taxonomy, the focus is on comparing three common order-2 models, HA, HB, and BA, that is, random forests (RFs) of trees, boosters of trees (BT), and ensembles of boosters (EB), to evaluate the behavior of the different moves. As a sanity check, the performance of a common order-1 model B, namely the booster of stumps (using gentle boost, denoted GB), is also evaluated. This elaborate comparison yields a number of insights into

the design of tree-based classifiers. These insights, such as bias/variance reduction,

accuracy/efficiency trade-off, the complexity of the problem, and the labeling quality of the

data, have motivated the design of an order-3 classifier, HBA.
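To make these combinations concrete, the following sketch builds the three order-2 ensembles with scikit-learn on synthetic data. It is only an illustration under stated assumptions: the estimator parameter name requires scikit-learn 1.2 or newer, AdaBoost stands in for the gentle boost variant discussed below, and the synthetic features replace the per-pixel shape filter responses used by the actual system.

from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              BaggingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the per-pixel features; not the shape filters.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

models = {
    "HA (random forest: average of trees)":
        RandomForestClassifier(n_estimators=100, random_state=0),
    "HB (booster of trees)":
        AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                           n_estimators=100, random_state=0),
    "BA (ensemble of boosters of stumps)":
        BaggingClassifier(estimator=AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50,
            random_state=0), n_estimators=10, random_state=0),
}

for name, clf in models.items():
    print(name, cross_val_score(clf, X, y, cv=3).mean())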

2.2.5 Random Forests Vs Booster Of Trees Vs Ensemble Of Boosters

The base weak learner is the widely used decision stump. A decision stump applied to the nth pixel takes the form h(n) = a · 1(ψn(i, j) > θ) + b, in which 1(·) is a 0-1 indicator function and ψn(i, j) is the response at pixel n of the shape filter indexed by (i, j).
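As a concrete illustration, a minimal numpy version of such a stump is given below; the array psi is a stand-in for the shape filter responses, which are not computed here.

import numpy as np

def stump(psi, theta, a, b):
    # Decision stump h = a * 1(psi > theta) + b, applied elementwise to an
    # array of shape-filter responses psi (one value per pixel).
    return a * (psi > theta).astype(float) + b

# Toy usage with arbitrary parameters.
psi = np.array([0.2, 1.5, -0.3, 2.1])
print(stump(psi, theta=1.0, a=1.0, b=-0.5))   # [-0.5  0.5 -0.5  0.5]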

Fig 2.3 The tree-cube taxonomy of classifiers captures many classification algorithms in a single structure. The origin is the weak learner; axis H stacks weak learners hierarchically (decision tree), axis B reweights the training data (booster of weak learners, e.g., AdaBoost), and axis A averages weak learners (forest of weak learners).


Decision Tree

When training a tree, the parameters θ, a, and b of a new decision stump are computed at each iteration, using either the least-squares error or the maximum entropy gain as the criterion, as described later. During testing, the output F(n) of a tree classifier is the output of the leaf node reached by the instance.

Gentle Boost

Out of the many variants of boosting, the focus here is on the gentle boost algorithm because of its robustness properties. For the nth pixel, a strong classifier F(n) is a linear combination of stumps, F(n) = Σ_{l=1}^{L} h_l(n). Gentle boost is employed as the B-move (in Fig. 2.3) for GB, BT, and EB.
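A compact numpy sketch of gentle boosting with weighted least-squares regression stumps is given below for illustration; it is not the authors' implementation, and the random feature subsampling mirrors the randomization described later in this section.

import numpy as np

def fit_stump_wls(X, y, w, n_feats=10):
    # Weighted least-squares regression stump h = a*1(x_j > theta) + b,
    # fitted over a small random subset of features.
    n, d = X.shape
    best = (0, 0.0, 0.0, 0.0, np.inf)              # (j, theta, a, b, error)
    for j in np.random.choice(d, size=min(n_feats, d), replace=False):
        for theta in np.unique(X[:, j])[:-1]:
            right = X[:, j] > theta
            wl, wr = w[~right].sum(), w[right].sum()
            if wl == 0 or wr == 0:
                continue
            b = np.dot(w[~right], y[~right]) / wl   # weighted mean on the left
            ab = np.dot(w[right], y[right]) / wr    # weighted mean on the right
            err = np.dot(w, (y - np.where(right, ab, b)) ** 2)
            if err < best[4]:
                best = (j, theta, ab - b, b, err)
    return best[:4]

def gentle_boost(X, y, rounds=50):
    # Gentle boost with regression stumps; y must be in {-1, +1}.
    # The strong score is F(x) = sum_l h_l(x) over the returned stumps.
    w = np.ones(len(y)) / len(y)
    stumps = []
    for _ in range(rounds):
        j, theta, a, b = fit_stump_wls(X, y, w)
        h = a * (X[:, j] > theta) + b
        stumps.append((j, theta, a, b))
        w *= np.exp(-y * h)                         # reweight the training data
        w /= w.sum()
    return stumps

def predict_score(stumps, X):
    return sum(a * (X[:, j] > theta) + b for j, theta, a, b in stumps)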

Random Forests

A forest is made of many trees, and its output F(n) is the average of the outputs of all the trees (the A-move). A random forest is an ensemble of decision trees trained with random features. In this case, each tree is trained by adding new stumps at the leaf nodes so as to achieve maximum information gain. However, unlike boosting, the training data of an RF are not reweighted for different trees. In the more complicated order-3 classifier HBA, the training data are only reweighted within each BT, while the BTs are constructed independently. RF has been applied to recognition problems such as OCR and keypoint recognition in vision.

Randomization

The tree-based classifiers are trained efficiently by optimizing each stump over only a few (1,000 in the implementation) randomly selected shape filter features. This reduces the statistical dependence between weak learners and provides increased efficiency without significantly affecting accuracy.

In all three algorithms, the classification confidence is computed by softmax

transformation as follows:

P(xn = Fg|Om) = exp(F(n))/(1 + exp(F(n))) (9)
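For a two-class problem this transformation is simply the logistic function of the strong-classifier score; a small sketch:

import numpy as np

def confidence(F):
    # P(x_n = Fg | Om) = exp(F) / (1 + exp(F)), as in (9); scores are
    # clipped only to avoid numerical overflow for extreme values.
    F = np.clip(F, -50.0, 50.0)
    return np.exp(F) / (1.0 + np.exp(F))

print(confidence(np.array([-2.0, 0.0, 3.0])))   # ~[0.12, 0.5, 0.95]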


2.2.6 Layer Segmentation

Segmentation is cast as an energy minimization problem in which the energy to be

minimized is similar, the only difference being that the stereo-match unary potential UM is

replaced by the motion-shape unary potential UMS:

U_MS(Om, x; θ) = Σ_{n=1}^{N} log(P(x_n | Om))  (10)

in which P is from (1). The CRF energy is as follows:

E(Om, z, x; θ) = γ_MS U_MS(Om, x; θ) + γ_C U_C(z, x; θ) + V(z, x; θ)  (11)

U_C is the color potential (a combination of global and pixelwise contributions), and V is the widely used contrast-sensitive spatial smoothness term. Model parameters are incorporated in θ. The relative weights γ_MS and γ_C are optimized discriminatively from the training data. The

final segmentation is inferred by a binary min-cut. No complex temporal models are used

here. Finally, because many segmentation errors are caused by strong background edges,

background edge abating, which adaptively “attenuates” background edges, could also be

exploited here if a pixelwise background model were learned on the fly.


3. EXPLORING VISUAL AND MOTION SALIENCY FOR

AUTOMATIC VIDEO OBJECT EXTRACTION

Humans can easily determine the subject of interest in a video, even though that

subject is presented in an unknown or cluttered background or even has never been seen

before. With the complex cognitive capabilities exhibited by human brains, this process can

be interpreted as simultaneous extraction of both foreground and background information

from a video.

Many researchers have been working toward closing the gap between human and

computer vision. However, without any prior knowledge on the subject of interest or training

data, it is still very challenging for computer vision algorithms to automatically extract the

foreground object of interest in a video. As a result, if one needs to design an algorithm to

automatically extract the foreground objects from a video, several tasks need to be

addressed.

1) Unknown object category and unknown number of object instances in a video.

2) Complex or unexpected motion of foreground objects due to articulated parts or

arbitrary poses.

3) Ambiguous appearance between foreground and background regions due to conditions such as similar color, low contrast, or insufficient lighting.

In practice, it is infeasible to manipulate all possible foreground object or background models beforehand. However, if representative information can be extracted from the foreground or background regions (or both) of a video, this information can be utilized to distinguish between foreground and background regions, and thus the task of foreground object extraction can be addressed. Most prior works either consider a fixed background or assume that the background exhibits dominant motion across video frames. These assumptions might not be practical for real-world applications, since they cannot generalize well to videos captured by freely moving cameras with arbitrary movements. Here, a robust video object extraction (VOE) framework is proposed, which utilizes both visual and motion saliency information across video frames. The observed saliency information allows the inference of several visual and motion cues for learning foreground


and background models, and a conditional random field (CRF) is applied to automatically determine the label (foreground or background) of each pixel based on the observed models. With the ability to preserve both spatial and temporal consistency, the VOE framework exhibits promising results on a variety of videos and produces quantitatively and qualitatively satisfactory performance. While the focus here is on VOE problems for single-concept videos (i.e., videos in which only one object category of interest is presented), the method is able to deal with multiple object instances (of the same type) with variations in pose, scale, etc.

3.1 AUTOMATIC OBJECT MODELING AND EXTRACTION

Most existing unsupervised VOE approaches assume the foreground objects to be outliers in terms of the observed motion information, so that the induced appearance, color, etc. features can be utilized for distinguishing between foreground and background regions. However, these methods cannot generalize well to videos captured by freely moving cameras, as discussed earlier. This work proposes a saliency-based VOE framework which learns saliency information in both the spatial (visual) and temporal (motion) domains. By advancing conditional random fields (CRF), the integration of the resulting features can automatically identify the foreground object without the need to treat either foreground or background as outliers.

3.1.1 Determination of Visual Saliency

To extract the visual saliency of each frame, image segmentation is performed on each video frame and color and contrast information is extracted. In this work, Turbopixels are employed for segmentation, and the resulting image segments (superpixels) are used to perform saliency detection. The use of Turbopixels produces edge-preserving superpixels of similar sizes, which achieves improved visual saliency results, as verified later. For the kth superpixel r_k, its saliency score S(r_k) is calculated as follows:

S(r_k) = Σ_{r_i≠r_k} exp(−D_s(r_k, r_i)/σ_s²) ω(r_i) D_r(r_k, r_i)
       ≈ Σ_{r_i≠r_k} exp(−D_s(r_k, r_i)/σ_s²) D_r(r_k, r_i)  (12)


where D_s is the Euclidean distance between the centroid of r_k and that of its surrounding superpixel r_i, while σ_s controls the width of the kernel. The parameter ω(r_i) is the weight of the neighboring superpixel r_i, which is proportional to the number of pixels in r_i. ω(r_i) can be treated as a constant for all superpixels due to the use of Turbopixels (with similar sizes). The last term D_r(r_k, r_i) measures the color difference between r_k and r_i, also in terms of Euclidean distance. A pixel i is considered a salient point if its saliency score satisfies S(i) > 0.8 · max(S), and the collection of the resulting salient pixels is considered as a salient point set. Since image pixels which are closer to this salient point set should be visually more significant than those which are farther away, the saliency S(i) of each pixel i is further refined as follows:

S(i) = S(i) · (1 − dist(i)/dist_max)  (13)

where S(i) is the original saliency score derived by (12), and dist(i) measures the nearest Euclidean distance to the salient point set. The dist_max in (13) is determined as the maximum distance from a pixel of interest to its nearest salient point within an image, and is thus an image-dependent constant.
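A minimal numpy/scipy sketch of (12) and (13) is given below. It assumes that per-superpixel centroids (normalized to [0, 1]) and mean colors have already been computed from some superpixel segmentation, and that the superpixel scores have been broadcast to a per-pixel map S_pix before refinement; sigma_s = 0.25 is an illustrative value, not one taken from the paper.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.ndimage import distance_transform_edt

def superpixel_saliency(centroids, mean_colors, sigma_s=0.25):
    # Region-contrast saliency following (12), with the area weight w(r_i)
    # treated as constant (similar-sized superpixels).
    Ds = cdist(centroids, centroids)            # spatial distances D_s
    Dr = cdist(mean_colors, mean_colors)        # color distances D_r
    W = np.exp(-Ds / sigma_s ** 2)
    np.fill_diagonal(W, 0.0)                    # exclude r_i == r_k
    return (W * Dr).sum(axis=1)

def refine_pixel_saliency(S_pix):
    # Refinement of (13): pixels with S > 0.8*max(S) form the salient point
    # set; the rest are down-weighted by their distance to that set.
    salient = S_pix > 0.8 * S_pix.max()
    dist = distance_transform_edt(~salient)     # distance to nearest salient pixel
    return S_pix * (1.0 - dist / (dist.max() + 1e-12))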

3.1.2 Extraction of Motion-Induced Cues

There are three steps for extracting motion-induced cues:

1) Determination of Motion Saliency: Unlike prior works which assume that either the foreground or the background exhibits dominant motion, this framework aims at extracting motion-salient regions based on the retrieved optical flow information. To detect each moving part and its corresponding pixels, dense optical-flow forward and backward propagation is performed at each frame of the video. A moving pixel q_t at frame t is determined by

q_t = q_{t,t−1} ∩ q_{t,t+1}  (14)

where q denotes the pixel pair detected by forward or backward optical flow propagation. Frames which result in a large number of moving pixels are not ignored at this stage, which makes the setting more practical for real-world videos captured by freely moving cameras.
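A sketch of this step with OpenCV's Farneback dense flow is shown below; for brevity the forward/backward intersection of (14) is approximated by requiring a sufficiently large flow magnitude in both directions, and the thresholds are illustrative rather than values from the paper.

import cv2
import numpy as np

def moving_pixel_mask(frame_prev, frame_cur, frame_next, mag_thresh=1.0):
    # A pixel is 'moving' at frame t only if it is detected by both the
    # backward (t -> t-1) and forward (t -> t+1) dense optical flows,
    # i.e., an approximation of q_t = q_{t,t-1} ∩ q_{t,t+1}.
    g_prev = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g_cur = cv2.cvtColor(frame_cur, cv2.COLOR_BGR2GRAY)
    g_next = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)

    flow_bwd = cv2.calcOpticalFlowFarneback(g_cur, g_prev, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
    flow_fwd = cv2.calcOpticalFlowFarneback(g_cur, g_next, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)

    mag_bwd = np.linalg.norm(flow_bwd, axis=2)
    mag_fwd = np.linalg.norm(flow_fwd, axis=2)
    return (mag_bwd > mag_thresh) & (mag_fwd > mag_thresh)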


After determining the moving regions, saliency scores are derived for each pixel in terms of the associated optical flow information. Inspired by visual saliency approaches, the computations in (12) and (13) are applied to the derived optical flow results to calculate the motion saliency M(i, t) for each pixel i at frame t, and the saliency score at each frame is normalized to the range [0, 1].

It is worth noting that, when the foreground object exhibits significant movements

(compared to background), its motion will be easily captured by optical flow and thus the

corresponding motion salient regions can be easily extracted. On the other hand, if the

camera is moving and thus results in remarkable background movements, the motion

saliency method will still be able to identify motion salient regions (associated with the

foreground object).

The motion saliency derived from the optical flow describes the foreground regions better than the direct use of the optical flow does. For example, in a surfing sequence the foreground object (the surfer) is significantly more salient than the moving background in terms of motion.

2) Learning of Shape Cues: Although motion saliency makes it possible to capture motion-salient regions within and across video frames, those regions might only correspond to the moving parts of the foreground object within some time interval. If one simply assumes that the foreground is near the high-motion-saliency regions, as the above method does, the entire foreground object cannot easily be identified.

Since it is typically observed that each moving part of a foreground object forms a complete sampling of the entire foreground object, part-based shape information induced by motion cues is exploited for characterizing the foreground object. To describe the motion-salient regions, the motion saliency image is converted into a binary output and the shape information is extracted from the motion-salient regions. More precisely, the aforementioned motion saliency M(i, t) is first binarized into Mask(i, t) using a threshold of 0.25. Each video frame is divided into disjoint 8 × 8 pixel patches. For each image patch, if more than 30% of its pixels have high motion saliency (i.e., a pixel value of 1 in the binarized output), the histogram of oriented gradients (HOG) descriptor is computed with 4 × 4


= 16 grids to represent its shape information. To capture scale-invariant shape information, the resolution of each frame is further downgraded and the above process repeated. The lowest resolution of the scaled image is chosen as a quarter of that of the original one. Since the use of sparse representation has been shown to be very effective in many computer vision tasks, an over-complete codebook is learned and the associated sparse representation of each HOG descriptor is determined. For a total of N HOG descriptors calculated for the above motion-salient patches {h_n, n = 1, 2, . . . , N} in a p-dimensional space, an over-complete dictionary D_{p×K} which includes K basis vectors is constructed, and the corresponding sparse coefficient α_n of each HOG descriptor is determined. The sparse coding problem can therefore be formulated as

min_{D,α} 1/N Σ_{n=1}^{N} ( 1/2 ‖h_n − Dα_n‖²₂ + λ‖α_n‖₁ )  (15)

where λ balances the sparsity of α_n against the ℓ2-norm reconstruction error. Existing sparse coding software is used to solve the above problem. Note that each codeword is illustrated by averaging the image patches with the top 15 α_n coefficients.
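The dictionary learning and sparse coding step of (15) can be sketched with scikit-learn as follows; H is a placeholder for the HOG descriptors of the motion-salient patches, and K and the penalty weights are illustrative choices. Note that scikit-learn stores the dictionary row-wise (K × p), i.e., the transpose of the D_{p×K} used above.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Placeholder for the N x p matrix of HOG descriptors of motion-salient
# 8x8 patches (p = 16 cells x 9 orientation bins = 144 here).
H = np.random.rand(500, 144)

K = 256                                   # number of codewords (illustrative)
dico = MiniBatchDictionaryLearning(n_components=K, alpha=0.15,
                                   transform_algorithm="lasso_lars",
                                   transform_alpha=0.15, random_state=0)
codes = dico.fit(H).transform(H)          # N x K sparse coefficients alpha_n
D = dico.components_                      # K x p dictionary (transpose of D_{p x K})
print(codes.shape, D.shape)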

To alleviate the possible presence of background in each codeword k, the binarized masks of the top 15 patches are combined using the corresponding weights α_n to obtain the map M_k. As a result, the moving pixels within each map (induced by motion saliency) have non-zero pixel values, while the remaining parts of that patch are considered static background and are thus zero. After obtaining the dictionary and the masks representing the shape of the foreground object, they are used to encode all image patches at each frame.

This is done to recover non-moving regions of the foreground object which do not exhibit significant motion and thus cannot be detected by motion cues. For each image patch, its sparse coefficient vector α is derived, and each entry of this vector indicates the contribution of each shape codeword. Correspondingly, the associated masks and their weight coefficients are used to calculate the final mask for each image patch. Finally, the reconstructed image at frame t using the above maps M_k can be denoted as the foreground shape likelihood X_S^t, which is calculated as follows:

X_S^t = Σ_{n∈I_t} Σ_{k=1}^{K} (α_{n,k} · M_k)  (16)


where α_{n,k} is the weight for the nth patch using the kth codeword. X_S^t serves as the likelihood of the foreground object at frame t in terms of shape information.
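A small sketch of the reconstruction in (16) is given below, assuming the sparse codes, the codeword masks M_k, and the patch locations are already available; the function name and the normalization to [0, 1] (so that the later 0.5 threshold applies) are illustrative assumptions.

import numpy as np

def shape_likelihood(codes, masks, patch_coords, frame_shape):
    # codes: (N, K) sparse coefficients alpha_{n,k} of the N patches at frame t
    # masks: (K, 8, 8) codeword masks M_k
    # patch_coords: (row, col) top-left corner of each 8x8 patch
    X = np.zeros(frame_shape, dtype=float)
    for n, (r, c) in enumerate(patch_coords):
        # sum_k alpha_{n,k} * M_k, pasted at the patch location
        X[r:r + 8, c:c + 8] += np.tensordot(codes[n], masks, axes=1)
    return X / (X.max() + 1e-12)          # normalized so it can be thresholded at 0.5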

3) Learning of Color Cues: Besides the motion-induced shape information, both foreground and background color information is extracted for improved VOE performance. According to the observation and the assumption that each moving part of the foreground object forms a complete sampling of itself, foreground or background color models cannot be constructed simply from the visual or motion saliency detection results at each individual frame; otherwise, foreground object regions which are not salient in terms of visual or motion appearance would be considered as background, and the resulting color models would not have sufficient discriminating capability. In this work, the shape likelihood X_S^t obtained from the previous step is utilized and thresholded at 0.5 to determine the candidate foreground (FS_shape) and background (BS_shape) regions.

In other words, the color information of pixels in FS_shape is considered for calculating the foreground color GMM, and that of pixels in BS_shape for deriving the background color GMM. Once these candidate foreground and background regions are determined, Gaussian mixture models (GMMs) G_C^f and G_C^b are used to model the RGB distributions of each region. The parameters of each GMM, such as the mean vectors and covariance matrices, are determined by an expectation-maximization (EM) algorithm. Finally, both foreground and background color models are integrated with visual saliency and shape likelihood into a unified framework for VOE.
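A sketch of this step with scikit-learn's EM-based GaussianMixture is given below; the number of mixture components is an illustrative choice, and fg_mask denotes the candidate foreground obtained by thresholding X_S^t at 0.5.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_models(frame_rgb, fg_mask, n_components=5):
    # Fit the foreground/background RGB GMMs (G_C^f, G_C^b) via EM.
    pix = frame_rgb.reshape(-1, 3).astype(float)
    mask = fg_mask.reshape(-1).astype(bool)
    gmm_fg = GaussianMixture(n_components=n_components,
                             covariance_type="full").fit(pix[mask])
    gmm_bg = GaussianMixture(n_components=n_components,
                             covariance_type="full").fit(pix[~mask])
    return gmm_fg, gmm_bg

# Per-pixel log-likelihoods (later used in the color energy terms):
#   log_fg = gmm_fg.score_samples(pix);  log_bg = gmm_bg.score_samples(pix)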

3.2 CONDITIONAL RANDOM FIELD FOR VOE

As discussed in Section 3.1, the proposed framework learns saliency information in both the spatial (visual) and temporal (motion) domains, and the integration of the resulting features through a conditional random field (CRF) can automatically identify the foreground object without the need to treat either foreground or background as outliers. The CRF is used as follows:


3.2.1 Feature Fusion via CRF

A conditional random field (CRF), which utilizes an undirected graph, is a powerful technique for estimating the structural information (e.g., class labels) of a set of variables from the associated observations. For video foreground object segmentation, a CRF has been applied to predict the label of each observed pixel in an image I. Pixel i in a video frame is associated with an observation z_i, while the hidden node F_i indicates its corresponding label (i.e., foreground or background).

In this framework, the label F_i is inferred from the observation z_i, while the spatial coherence between this output and the neighboring observations z_j and labels F_j is simultaneously taken into consideration. Therefore, predicting the label of an observation node is equivalent to maximizing the following posterior probability function

p(F|I, ψ) ∝ exp{−( Σ_{i∈I} ψ_i + Σ_{i∈I, j∈Neighbor} ψ_{i,j} )}  (17)

where ψ_i is the unary term which infers the likelihood of F_i from the observation z_i, and ψ_{i,j} is the pairwise term describing the relationship between the neighboring pixels z_i and z_j and that between their predicted output labels F_i and F_j. Note that the observation z can be represented by a particular feature or by a combination of multiple types of features. To solve a CRF optimization problem, one can convert the above problem into an energy minimization task, and the objective energy function E of (17) can be derived as

E = −log(p)
  = Σ_{i∈I} ψ_i + Σ_{i∈I, j∈Neighbor} ψ_{i,j}
  = E_unary + E_pairwise.  (18)

In the VOE framework, the shape energy function E_S is defined in terms of the shape likelihood X_S^t (derived by (16)) as one of the unary terms:

E_S = −w_s log(X_S^t).  (19)

In addition to shape information, visual saliency and color cues need to be incorporated into the introduced CRF framework. As discussed earlier, foreground and background color


models are derived for VOE, and thus the unary term E_C describing color information is defined as follows:

E_C = w_c (E_CF − E_CB).  (20)

Note that the foreground and background color GMMs G_C^f and G_C^b are utilized to derive the associated energy terms E_CF and E_CB, which are calculated as

E_CF = −log( Σ_{i∈I} G_C^f(i) )
E_CB = −log( Σ_{i∈I} G_C^b(i) ).

As for the visual saliency cue at frame t, the visual saliency score S_t derived earlier is converted into the following energy term E_V:

E_V = −w_v log(S_t).  (21)

Note that in the above equations, the parameters w_s, w_c, and w_v are the weights for the shape, color, and visual saliency cues, respectively. These weights control the contributions of the associated energy terms of the CRF model when performing VOE.

It is also worth noting that disregarding the background color model would limit the performance of VOE, since the use of the foreground color model alone might not be sufficient for distinguishing between foreground and background regions. The VOE framework therefore utilizes multiple types of visual and motion-salient features, and the experiments confirm the effectiveness and robustness of the approach on a variety of real-world videos.

3.2.2 Preserving Spatio-Temporal Consistency

In the same shot of a video, an object of interest can be considered as a compact space-time volume which exhibits smooth changes in location, scale, and motion across frames. Therefore, preserving spatial and temporal consistency within the extracted foreground object regions across video frames is a major challenge for VOE. Since there is no guarantee that combining multiple motion-induced features alone would address this problem, additional constraints need to be enforced in the CRF model in order to achieve this goal.

1) Spatial Continuity for VOE: When applying a pixel-level prediction process for VOE (as this and some prior VOE methods do), the spatial structure of the extracted foreground region is typically not considered during the VOE process, because the prediction made for one pixel is not related to those of its neighboring pixels. To maintain spatial consistency of the extracted foreground object, a pairwise term is added to the CRF framework. The introduced pairwise term E_{i,j} is defined as

E_{i,j} = Σ_{i∈I, j∈Neighbor} |F_i − F_j| · ( λ_1 + λ_2 exp(−‖z_i − z_j‖/β) ).  (22)

Note that β is set as the average pixel color difference over all pairs of neighboring pixels. In (22), λ_1 is a data-independent Ising prior which smooths the predicted labels, and λ_2 relaxes the tendency toward smoothness when the color observations z_i and z_j form an edge (i.e., when ‖z_i − z_j‖ is large). This pairwise term is able to produce coherent labeling results even under low contrast or blurring effects.
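The following numpy sketch evaluates the energy E = E_unary + E_pairwise of (18) for a given binary labeling on a 4-connected grid, using the contrast-sensitive pairwise term of (22); the layout of the unary array and the values of λ1 and λ2 are illustrative assumptions.

import numpy as np

def crf_energy(labels, unary, img, lam1=1.0, lam2=2.0):
    # labels: (H, W) in {0, 1}; unary: (H, W, 2) cost of assigning label 0/1
    # to each pixel (e.g., the per-pixel shape, color, and saliency energies
    # summed); img: (H, W, 3) color frame.
    H, W = labels.shape
    e_unary = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()

    img = img.astype(float)
    dif_v = np.linalg.norm(np.diff(img, axis=0), axis=-1)   # ||z_i - z_j||, vertical
    dif_h = np.linalg.norm(np.diff(img, axis=1), axis=-1)   # ||z_i - z_j||, horizontal
    beta = (dif_v.mean() + dif_h.mean()) / 2.0 + 1e-6       # avg neighboring color diff

    def pairwise(d_lab, d_img):
        # |F_i - F_j| * (lam1 + lam2 * exp(-||z_i - z_j|| / beta))
        return (np.abs(d_lab) * (lam1 + lam2 * np.exp(-d_img / beta))).sum()

    e_pair = pairwise(np.diff(labels, axis=0), dif_v) + \
             pairwise(np.diff(labels, axis=1), dif_h)
    return e_unary + e_pair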

2) Temporal Consistency for VOE: Although both visual and motion saliency information is exploited for determining the foreground object, the motion-induced features such as the shape and foreground/background color GMM models might not be able to describe well the changes of foreground objects across video frames due to issues such as motion blur, compression loss, or noise/artifacts present in the frames.

To alleviate this concern, the foreground/background shape likelihood and the CRF prediction outputs are propagated across video frames to preserve temporal continuity in the VOE results. More precisely, when constructing the foreground and background color GMM models, the corresponding pixel sets FS and BS are produced not only from the shape likelihoods FS_shape and BS_shape at the current frame; those at the previous frame (including the CRF prediction outputs F_foreground and F_background) are considered as well. In other words, the foreground and background pixel sets FS and BS at frame t + 1 are updated by

FS_{t+1} = FS_shape(t + 1) ∪ FS_shape(t) ∪ F_foreground(t)
BS_{t+1} = BS_shape(t + 1) ∪ BS_shape(t) ∪ F_background(t)  (23)

where F_foreground(t) indicates the pixels at frame t predicted as foreground, and FS_shape(t) is the set of pixels whose shape likelihood is above 0.5, as described in Section 3.1.2. Similar remarks apply to F_background(t) and BS_shape(t). Finally, by integrating (19), (20), (21), and (22), plus the introduced terms for preserving spatial and temporal information, the objective energy function (18) can be updated as

E = E_unary + E_pairwise
  = (E_S + E_CF − E_CB + E_V) + E_{i,j}
  = E_S + E_C + E_V + E_{i,j}.  (24)

To minimize (24), one can apply graph-based energy minimization techniques such as max-flow/min-cut algorithms. When the optimization process is complete, the labeling output F indicates the class label (foreground or background) of each observed pixel at each frame, and thus the VOE problem is solved accordingly.
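As an illustration of this inference step, the sketch below uses the third-party PyMaxflow library (assumed to be installed) to solve a binary min-cut over per-pixel unary costs with a simplified contrast-sensitive smoothness; it is not the authors' implementation, the per-pixel (rather than per-edge) contrast weight is a simplification, and the foreground/background polarity of the returned segments should be verified against the chosen tedge convention.

import numpy as np
import maxflow   # PyMaxflow (pip install PyMaxflow)

def segment_by_mincut(cost_fg, cost_bg, img, lam1=1.0, lam2=2.0):
    # cost_fg / cost_bg: per-pixel unary costs of labeling a pixel
    # foreground / background (sums of the shape, color, and saliency terms).
    h, w = cost_fg.shape
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))

    # Simplified contrast-sensitive smoothness: one weight per pixel derived
    # from the local gradient magnitude, applied to the 4-neighborhood.
    grad = np.linalg.norm(np.gradient(img.astype(float).mean(axis=2)), axis=0)
    beta = grad.mean() + 1e-6
    weights = lam1 + lam2 * np.exp(-grad / beta)
    g.add_grid_edges(nodes, weights=weights, symmetric=True)

    # Terminal edges carry the unary costs.
    g.add_grid_tedges(nodes, cost_fg, cost_bg)
    g.maxflow()

    # Boolean segmentation; which value corresponds to foreground depends on
    # the tedge convention above and should be verified.
    return g.get_grid_segments(nodes)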


4. COMPARISON

The saliency-based visual attention model was able to reproduce human performance on a number of pop-out tasks. When a target differed from an array of surrounding distractors by its unique orientation, color, intensity, or size, it was always the first attended location, irrespective of the number of distractors.

Conversely, when the target differed from the distractors only by a conjunction of features (e.g., it was the only red horizontal bar in a mixed array of red vertical and green horizontal bars), the search time necessary to find the target increased linearly with the number of distractors. It is difficult to evaluate the model objectively, because no objective reference is available for comparison, and observers may disagree on which locations are the most salient. Despite its simple architecture and feed-forward feature-extraction mechanisms, the model is capable of strong performance with complex natural scenes.

Without modifying the preattentive feature-extraction stages, the model cannot detect conjunctions of features. While the system immediately detects a target which differs from the surrounding distractors by its unique size, intensity, color, or orientation, it will fail at detecting targets salient for unimplemented feature types (e.g., T-junctions or line terminators, for which the existence of specific neural detectors remains controversial).

For simplicity, no recurrent mechanism has been implemented within the feature maps; hence, the model cannot reproduce phenomena like contour completion and closure, which are important for certain types of human pop-out. In addition, at present, the model does not include any magnocellular motion channel, which is known to play a strong role in human saliency.

For the bilayer segmentation of webcam videos, the lowest classification error achieved and the corresponding frame rate are reported for each of the three algorithms at their "optimal" parameter settings according to validation. RF produces the lowest errors; the speed improvement of RF is then evaluated when it is allowed to produce the same error level as GB, BT, and EB.


The size of the RF ensemble is reduced and its suboptimal classification error observed when RF evaluates at the same speed as GB, EB, and BT. In all cases, RF outperforms the other classifiers. The median segmentation error with respect to ground truth is around or below one percent. The segmentation of past frames affects the current frame only through the learned color model.

Under harsh lighting conditions, unary potentials may not be very strong; thus, the

Ising smoothness term may force the segmentation to “cut through” the shoulder and hair

regions. Similar effects may occur in stationary frames. Noise in the temporal derivatives

also affects the results. This situation can be detected by monitoring the magnitude of the

motion. Bilayer Segmentation is capable of inferring bilayer segmentation monocularly even

in the presence of distracting background motion without the need for manual initialization.

Bilayer Segmentation has accomplished the following:

1) introduced novel visual features that capture motion and motion context

efficiently,

2) provided a general understanding of tree-based classifiers, and

3) determined an efficient and accurate classifier in the form of random forests.

This confirms accurate and robust layer segmentation in videochat sequences, achieving the lowest unary classification error in the videochat application.

Although the color features can be automatically determined from the input video,

these methods still need the user to train object detectors for extracting shape or motion

features. Recently, researchers used some preliminary strokes to manually select the

foreground and background regions to train local classifiers to detect the foreground objects.

While these works produce promising results, it might not be practical for users to manually

annotate a large amount of video data.

The VOE framework was able to extract visually salient regions even for videos with low visual contrast. The VOE method automatically extracts the foreground objects of interest without any prior knowledge or the need to collect training data in advance, and it is able to handle videos captured by a freely moving camera or with complex background motion.

It will be very difficult for unsupervised VOE methods to properly detect the foreground regions even when multiple types of visual and motion-induced features are considered. Discrimination between such challenging foreground and background regions might require observing both visual and motion cues over a longer period.

Alternatively, if the video has sufficient resolution, one can consider utilizing the trajectory information of extracted local interest points to determine the candidate foreground regions. In such cases, improved VOE results can be expected.



5. CONCLUSION

The first scheme is a conceptually simple computational model for saliency-driven

focal visual attention. The biological insight guiding its architecture proved effective in reproducing some of the performance of primate visual systems. The efficiency of this

approach for target detection critically depends on the feature types implemented. The

framework presented here can consequently be easily tailored to arbitrary tasks through the

implementation of dedicated feature maps.

The second scheme ‘bilayer segmentation’ is capable of inferring bilayer

segmentation monocularly even in the presence of distracting background motion without

the need for manual initialization. It has accomplished the following:

1) introduced novel visual features that capture motion and motion context

efficiently,

2) provided a general understanding of tree-based classifiers, and

3) determined an efficient and accurate classifier in the form of random forests.

Bilayer segmentation extracts the top point of the foreground pixels and uses it as a reference point for the framing window. This result can be used for the "smart framing" of videos. On the one hand, this simple head tracker is efficient: it works for both frontal and side views, and it is not affected by background motion. On the other hand, it is heavily tailored toward the videochat application. It achieves the lowest unary classification error in the videochat application.

The third scheme, the VOE method, was shown to better model the foreground object due to the fusion of multiple types of saliency-induced features. The major advantage of this method is that it requires neither prior knowledge of the object of interest (i.e., the need to collect training data) nor interaction from the users during the segmentation process.


REFERENCES

[1] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid

scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259,

Nov. 1998.

[2] P. Yin, A. Criminisi, J. M. Winn, and I. A. Essa, “Bilayer segmentation of webcam

videos using tree-based classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1,

pp. 30–42, Jan. 2011.

[3] Wei-Te Li, Haw-Shiuan Chang, Kuo-Chin Lien, Hui-Tang Chang, and Yu-Chiang Frank Wang, "Exploring visual and motion saliency for automatic video object extraction," IEEE Trans. Image Process., vol. 22, no. 7, Jul. 2013.

[4] http://en.wikipedia.org/wiki/Image_processing


GLOSSARY

1. Saliency Map: The saliency map (SM) is a topographically arranged map that represents the visual saliency of a corresponding visual scene.

2. WTA: Winner-take-all is a computational principle applied in computational models

of neural networks by which neurons in a layer compete with each other for activation.

3. Texton: Textons refer to fundamental micro-structures in natural images and are considered the atoms of pre-attentive human visual perception.

4. CRF: Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.
