
Fusion of Differential Morphological Profiles for Multi-scale Weighted Feature Pyramid Generation in Remotely Sensed Imagery

Grant J. Scott
Center for Geospatial Intelligence, University of Missouri, Columbia, MO, USA
[email protected]

Derek T. Anderson
Dept. of Elec. & Comp. Engineering, Mississippi State University, Mississippi State, MS, USA
[email protected]

Abstract—Object recognition from remotely sensed imagery is an unsolved problem that is, in part, difficult because objects are typically observed in a variety of contexts. Of particular interest are general purpose methods for identifying important image structure robust to changes in context. In this respect, areas not related to the task at hand can be ignored, or de-emphasized, when features are extracted and subsequently used for classification. In this article, we utilize an unsupervised differential morphological profile (DMP) technique for the identification of important image structure at varying scales. The DMP levels are fused using a Choquet fuzzy integral with a bias towards identifying a general class of scale related objects via density selection. Fused DMP opening and closing results are compared, an entropic filter is used to further accentuate important regions, and an eigen-based image chip alignment method is performed. Next, we propose importance map weighted cell-structured local binary pattern (LBP) and histogram of oriented gradient (HOG) multi-scale pyramid descriptors. The final descriptor is used in support vector machine-based classification. Cross validation and held-out performance for three common aerial object classes at multiple geographical sites is discussed. Advantages of this work include higher positive detection rates, lowered false alarm rates and an improvement in robustness of object recognition.

I. INTRODUCTION

In the last decade there has been rapid growth in the number of remote sensing platforms and the amount of data collected and archived from these sensors. Remote sensing data is acquired from both airborne and spaceborne sensors, collecting data in a variety of modalities. Data from these sensors aid in a variety of human activities, such as national defense, environmental monitoring, and disaster response. Research topics include land-cover classification [1], [2], terrestrial change monitoring [3], [4], and automated mapping through algorithmic road and building segmentation [5], [6], to name a few.

One of the most challenging exploitation tasks for remote sensing data is object recognition. The variety of both natural and anthropogenic objects, the range of spatial extents of objects, the variability of context (e.g., different biomes or anthropogenic backdrops), and seasonal variations represent an extremely difficult setting for object recognition. The sheer size of remote sensing data collections, as well as that of individual captures from modern sensors, renders many advanced image processing methodologies intractable.

There are numerous approaches to object recognition; recently, approaches that rely on features robust to changes in illumination (i.e., spectral scaling) have become popular. Two such approaches are local binary patterns (LBP) [7] and histograms of oriented gradients (HOG) [8].

Traditional research regarding object recognition in images often focuses on classifying or categorizing an image based upon characteristic objects therein. In these cases, objects of interest are delineated in the image for training. These training sets may include over 20,000 regions of interest to learn 20 classes of objects [9]. The goal of training object recognition in this regard is to present as many variations of the target object as possible during training, in both perspective and context (e.g., background). In the case of satellite imagery, the perspective views are primarily top-down views of objects, subject to moderate obliqueness. However, the resolution of the objects in the imagery provides a limited amount of pixel information, the variability of the context encountered is staggering, and the size of both the individual imagery scenes and the imagery database altogether is massive. For these reasons, approaches that require thousands of training samples per object class, or tens of thousands of features per object, are intractable in the context of remote sensing.

For this work, our image database consists of high-resolution orthorectified, georeferenced commercial satellite imagery from DigitalGlobe's QuickBird sensor. This imagery is five banded (0.6m panchromatic and 2.4m multispectral), with each band having an 11-bit effective range. Using the satellite's time delay and integration and mean solar exoatmosphere irradiance metadata that accompanies the imagery, we are able to correct the per-pixel measurements to top-of-atmosphere reflectance. In our object and feature extraction, we rely solely on the panchromatic band. For our training data, we selected 134 objects from four scenes representing four different capture times of a single region, Kabul International Airport. Our training data included three classes of target objects (36 commercial jets, 25 military cargo planes, and 40 helicopters), as well as 33 other planes and 792 random image chips absent of the three target classes. Figure 1 presents some example targets from


Fig. 1. Example target images (commercial jet, helicopter, military/cargo) showing within-class variability of objects as well as the variability of the complicated imagery context.

our training and test data. The challenges of satellite imagery object recognition are observable, such as the variability of object scale, shadow effects, variations in the actual objects (e.g., the number of rotor blades of the helicopters), and the differing occurrence context. Further complicating the object recognition task is the prevalence of appendages to objects as an artifact of the surrounding imagery contents.

II. MULTI-SCALE OBJECT EXTRACTION

Automatic extraction of objects from remotely sensed imagery is a challenging task due to the high variability of the image collection environment, land-cover types, and types of objects. The differential morphological profile (DMP) [10] is able to exploit contrast edges between objects and their surrounding context to perform multi-scale object extraction. Because the DMP relies on these edges, we pre-process the imagery with a median filter prior to computing a DMP, as this filter removes noise while preserving contrast edges.

A. Differential Morphological Profile

Morphological operators are a standard tool of digital image processing that have many uses [11], including edge detection, object segmentation, object filtering, and texture segmentation, to name a few. Typically, morphological operations involve the application of a structuring element (SE) over an image $I$ to dilate ($\delta$) or erode ($\varepsilon$) the image. Furthermore, these two fundamental morphological operations can be combined to form more complex morphological operations.

Two secondary fundamental morphological operations include opening, $\gamma(I) = \delta \circ \varepsilon(I)$, and closing, $\varphi(I) = \varepsilon \circ \delta(I)$. In grayscale imagery, opening with an SE of size $n$ removes grayscale peaks from the image that are smaller than the structuring element; conversely, closing fills grayscale valleys within the image.

Geodesic grayscale morphology involves a geodesic structuring element to define the fundamental operations for dilation and erosion. Given a set of pixels, such as a disk SE, the geodesic distance between any two pixels in image $I$, $I(p)$ and $I(q)$, is the minimal path length through connected pixels of SE which joins $I(p)$ and $I(q)$. The elementary geodesic dilation and erosion of $I$ are defined as

$\delta^{SE}_1(I) = \delta_1(I) \wedge SE$   (1)

and

$\varepsilon^{SE}_1(I) = \varepsilon_1(I) \vee SE$,   (2)

where $\delta_1(I)$ and $\varepsilon_1(I)$ are the elementary dilation and erosion, respectively, of $I$ at all pixels $p$. The $\wedge$ and $\vee$ represent the infimum and supremum, respectively, for the set under SE, with the structuring element centered over $p$. For both the geodesic dilation and erosion, arbitrary geodesic distance operations, $n$, can be generated iteratively, namely

$\delta^{SE}_n(I) = \delta^{SE}_1 \circ \delta^{SE}_1 \circ \cdots \circ \delta^{SE}_1(I)$,   (3)

and

$\varepsilon^{SE}_n(I) = \varepsilon^{SE}_1 \circ \varepsilon^{SE}_1 \circ \cdots \circ \varepsilon^{SE}_1(I)$.   (4)

There exists another class of morphological operations which are a function of two images and an SE. Morphological reconstruction relies on the designation of a marker image and a mask image, coupled with an elementary SE.

Dilation by reconstruction is the iterative dilation of a marker image $J$ under image $I$ using an elementary SE until the dilation of $J$ becomes idempotent. The dilation by reconstruction is defined as

$\delta^*(J, I) = \bigvee_{n \geq 1} \left\{ \delta^I_n(J) \;\middle|\; \delta^I_n = \delta^I_{n+1} \right\}$,   (5)

where $\delta^I_n(J)$ is the dilation of $J$ by the elementary SE, constrained from above by $I$ such that $J \leq I$. As expected, the erosion by reconstruction is defined similarly as

$\varepsilon^*(J, I) = \bigwedge_{n \geq 1} \left\{ \varepsilon^I_n(J) \;\middle|\; \varepsilon^I_n = \varepsilon^I_{n+1} \right\}$,   (6)

where $\varepsilon^I_n(J)$ is the erosion of $J$ by the elementary SE, constrained from below by $I$ such that $I \leq J$.

Opening by reconstruction is defined as the standard grayscale erosion of $I$ by a structuring element SE followed by the dilation by reconstruction,

$\gamma^*_{SE}(I) = \delta^*(\varepsilon_{SE}(I), I)$,   (7)


where $\varepsilon_{SE}(I)$ becomes the marker and the image $I$ is the mask of the reconstruction. Similarly, the closing by reconstruction is defined as

$\varphi^*_{SE}(I) = \varepsilon^*(\delta_{SE}(I), I)$,   (8)

with $\delta_{SE}(I)$ as the marker of the reconstruction. It can be observed that, in the case of opening by reconstruction, $J$ is produced as the grayscale erosion of $I$ with some SE; then, by definition of grayscale erosion, $J \leq I$. Similarly, $I \leq J$ when $J$ is $\delta_{SE}(I)$ during closing by reconstruction. In both opening and closing by reconstruction, the structuring element SE used in the grayscale erosion and dilation to produce $J$, respectively, is typically larger than the elementary structuring element used during the iterative reconstructions.

As described previously, the opening and closing morphological operations remove grayscale peaks and valleys, respectively, from the image signal. Intuitively, opening by reconstruction using an SE of size $n$ filters out image objects lighter than their surrounding context and smaller than the SE. Opening and closing profiles can be defined for a set of levels $\ell \in L$ as

$\Pi_\gamma(I) = \{\Pi_{\gamma\ell} : \Pi_{\gamma\ell} = \gamma^*_{SE_\ell}(I),\ \forall \ell \in L\}$   (9)

and

$\Pi_\varphi(I) = \{\Pi_{\varphi\ell} : \Pi_{\varphi\ell} = \varphi^*_{SE_\ell}(I),\ \forall \ell \in L\}$,   (10)

respectively, where $SE_\ell$ is the $\ell$th member of a strictly increasing series of structuring element sizes, i.e., $\forall \ell \in L$, $SE_\ell < SE_{\ell+1}$. This profile produces a series of filter responses, each removing successively larger objects from the image. To generate a set of light (dark) objects, we merely need to note which level $\ell$ to $\ell+1$ transition removes the object from the opening (closing) profile. This is captured by a piecewise derivative of the opening and closing profiles, respectively:

$\Delta_\gamma(I) = \{\Delta_{\gamma\ell} : \Delta_{\gamma\ell} = |\Pi_{\gamma\ell} - \Pi_{\gamma(\ell-1)}|,\ \forall \ell \in L'\}$   (11)

and

$\Delta_\varphi(I) = \{\Delta_{\varphi\ell} : \Delta_{\varphi\ell} = |\Pi_{\varphi\ell} - \Pi_{\varphi(\ell-1)}|,\ \forall \ell \in L'\}$,   (12)

with $L'$ as the set of levels in $L$ except the first level.

In this work, we use the DMP to produce a stack of scaled object extractions from our high resolution satellite imagery. The first step of processing is the application of a 3x3 pixel (1.5x1.5m) median filter to remove spurious noise from the imagery signal while maintaining edges. We then use a set of geodesic structuring elements, having radii of 1, 3, 5, 7, 9 and 11m, to produce both the opening and closing DMP profiles with five levels. Figure 2 shows two example objects and the opening DMP for the first three transition levels.
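As a concrete illustration, the opening profile of Eqs. (7) and (9) and its DMP, Eq. (11), can be sketched with off-the-shelf morphology routines. The following is a minimal sketch assuming scikit-image, whose reconstruction function implements the iterate-until-idempotent geodesic dilation of Eq. (5); the pixel radii are hypothetical equivalents of the 1-11m SE series at a nominal 0.5m ground sample distance, not the paper's exact configuration.

import numpy as np
from scipy.ndimage import median_filter
from skimage.morphology import disk, erosion, reconstruction

def opening_dmp(image, radii=(2, 6, 10, 14, 18, 22)):
    # radii: hypothetical pixel equivalents of the 1-11m SE series
    image = median_filter(image, size=3)  # 3x3 edge-preserving pre-filter
    # Opening by reconstruction, Eq. (7), per SE level: the grayscale
    # erosion is the marker and the image itself is the mask.
    profile = [reconstruction(erosion(image, disk(r)), image, method='dilation')
               for r in radii]
    # Piecewise derivative between successive levels, Eq. (11).
    return [np.abs(profile[l] - profile[l - 1]) for l in range(1, len(profile))]

The closing DMP follows the same pattern, with the grayscale dilation as the marker and reconstruction by erosion, per Eqs. (8) and (12).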

B. Choquet Fuzzy Integral-Based DMP Stack Fusion

Much of the theory and several applications of the fuzzy integral (Sugeno and Choquet) can be found in [12], [13], [14]. To begin, we consider a finite set of $S$ sources of information, $X = \{x_1, \ldots, x_S\}$, and a function that maps $X$ into some domain, initially $[0,1]$, that represents the partial support of a hypothesis from the standpoint of each source of information. Depending on the problem domain, $X$ can be a set of experts, sensors, pattern recognition algorithms, etc. The hypothesis is usually thought of as an alternative in a decision process. Both Sugeno and Choquet integrals take partial support for the hypothesis from the standpoint of each source of information and fuse it with the verifiable or subjective worth of each subset of $X$ in a non-linear fashion. This worth is encoded into a fuzzy measure [15]. Initially, the function $h : X \to [0,1]$ and the measure $g : 2^X \to [0,1]$ take real number values in $[0,1]$. Certainly, the output range for both function and measure can be defined more generally, but it is convenient to think of them in the unit interval for confidence fusion.

Fig. 2. Opening differential morphological profile. (a) and (b): two windows of original imagery, at twice the window scale used for recognition, with a 3x3 pixel median filter applied; (c) and (d): the 1m to 3m transition level of the opening DMP generated from (a) and (b), respectively; (e) and (f): the 3m to 5m transition level; (g) and (h): the 5m to 7m transition level. Note, DMP level images have been enhanced, [0,1] range compressed, for viewability.

Formally, for a finite set $X$, a fuzzy measure is a function $g : 2^X \to [0,1]$, i.e., a real valued set function, such that

1. $g(\emptyset) = 0$ and $g(X) = 1$;
2. if $A, B \subseteq X$ and $A \subseteq B$, then $g(A) \leq g(B)$.

Note that if $X$ is an infinite set, a third condition guaranteeing continuity is required. In this article, we use the Sugeno λ-fuzzy measure [15]. The Sugeno λ-fuzzy measure builds the measure, specifically the $2^S - 1 - S$ non-singleton set values, given the densities (the measures for each singleton). Alternatively, for a small number of sources, one can define the fuzzy measure by hand or learn it from training data [16]. The Sugeno λ-fuzzy measure has the following additional property:

3. for all $A, B \subseteq X$ with $A \cap B = \emptyset$,

$g(A \cup B) = g(A) + g(B) + \lambda g(A) g(B)$   (13)

for some $\lambda > -1$. The value of $\lambda$ is found by solving

$\lambda + 1 = \prod_{i=1}^{S} (1 + \lambda g_i)$.   (14)

Given a finite set $X$, $g$ and $h$, the Choquet fuzzy integral of $h$ with respect to $g$ is

$\int h \circ g = \sum_{i=1}^{S} g(H_i)\left[h(x_{(i)}) - h(x_{(i+1)})\right]$,   (15)

where $h(x_{(S+1)}) = 0$, $H_i = \{x_{(1)}, \ldots, x_{(i)}\}$, and $X$ has been sorted so that $h(x_{(1)}) \geq h(x_{(2)}) \geq \ldots \geq h(x_{(S)})$.

In this article, the Choquet fuzzy integral is used at each pixel to fuse the DMP evidence across levels. The densities, $g_i = g(x_i)$, have been empirically specified. To fuse DMP evidence about aircraft (e.g., cargo planes, passenger planes, helicopters, etc.), DMP levels 1, 2 and 3 are the primary sources of information. This is due to scale and the respective SE applied at these levels. For example, airplane engines and helicopter blades will generally appear at DMP level 1, while wings and/or fuselage generally appear in levels 2 or 3. The densities used are $g_1 = 0.1$, $g_2 = 0.35$, $g_3 = 0.45$ and $g_4 = g_5 = g_6 = (1/3)$. The remaining $2^S - 2 - S$ fuzzy measure values are found by solving for $\lambda$ via (14), using the densities and $g(A \cup B)$ from (13).
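A minimal sketch of this fusion step, assuming NumPy/SciPy, is given below; the root-bracketing used to solve Eq. (14) is our own choice, and the example supports passed to choquet are illustrative only.

import numpy as np
from scipy.optimize import brentq

def sugeno_lambda(g):
    # Solve Eq. (14): lambda + 1 = prod_i (1 + lambda * g_i), lambda > -1.
    g = np.asarray(g, dtype=float)
    f = lambda lam: np.prod(1.0 + lam * g) - (lam + 1.0)
    if np.isclose(g.sum(), 1.0):
        return 0.0                                  # additive measure
    if g.sum() > 1.0:
        return brentq(f, -1.0 + 1e-12, -1e-12)      # lambda in (-1, 0)
    return brentq(f, 1e-12, 1e12)                   # lambda > 0

def choquet(h, g):
    # Discrete Choquet integral, Eq. (15); g(H_i) is built up via Eq. (13).
    g = np.asarray(g, dtype=float)
    lam = sugeno_lambda(g)
    order = np.argsort(h)[::-1]                     # h(x_(1)) >= ... >= h(x_(S))
    h_sorted = np.append(np.asarray(h, dtype=float)[order], 0.0)
    g_H, total = 0.0, 0.0
    for i, j in enumerate(order):
        g_H = g_H + g[j] + lam * g_H * g[j]         # grow g(H_i) recursively
        total += g_H * (h_sorted[i] - h_sorted[i + 1])
    return total

g = [0.10, 0.35, 0.45, 1/3, 1/3, 1/3]       # densities from above; sum > 1
fused = choquet([0.2, 0.6, 0.4, 0.1, 0.0, 0.1], g)  # illustrative per-pixel DMP supports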

Levels in the DMP stack can contain small values, since they are differential amounts; thus, normalization is often required. We normalize the fuzzy Choquet integral fused DMP stack image, $I_C$, by the maximum value in the window $W$:

$I_C(x, y) = \dfrac{I_C(x, y)}{\max_{(a,b) \in W} I_C(a, b)}$.   (16)

As explained in Section II-A, two DMP stacks are produced, one for opening and one for closing. We fuse the opening-based DMP stack and separately fuse the closing-based DMP stack; then one fusion result is selected from the two. For each window, we sum the pixel values of the opening and closing fusion results over a 5x5 region centered in the image chip (i.e., window), and we pick the fused result, opening or closing, with the greatest value. In future work, we will address a better method to aggregate evidence from both morphological operators together, i.e., objects across scales in which some components of the object are lighter while others are darker than their surrounding.

Next, the result is combined with the output of a local entropy image filter. We process the 3x3 median filtered grayscale image, $I_M$, with an entropy filter to produce an entropy image, $I_E$. This step is performed in order to further accentuate important regions of the objects we wish to recognize; the entropy filter helps highlight areas that more uniquely describe, and subsequently differentiate, the objects of interest. We use a 9x9 neighborhood around each pixel to compute the entropy image. The local distribution of $I_M$ is approximated using a histogram of $B$ bins, where $B = 256$ for this work. For each pixel, the calculation is

$-\sum_{i=1}^{B} p_i \log_2(p_i)$.   (17)

The entropy image $I_E$ is combined with the normalized fuzzy integral output using a t-norm, the product operator for this work,

$I_R(x, y) = I_E(x, y) \wedge I_C(x, y)$.   (18)

Next, $I_R$ is morphologically dilated with a 5x5 disk SE, which results in the final image $I_F$. Dilation is used to spread the evidence outward and ensure a buffer around the objects of interest so they can be adequately described. Figure 3 shows the various processing stages. $I_F$ becomes our importance map, used to highlight the most salient aspects of the imagery objects.
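Under the same scikit-image assumption, the importance map pipeline of Eqs. (17) and (18) plus the final dilation reduces to a few calls; scaling the entropy image by its 8-bit maximum before applying the product t-norm is our own normalization choice.

import numpy as np
from skimage.filters.rank import entropy
from skimage.morphology import disk, dilation

def importance_map(i_m, i_c):
    # i_m: 3x3 median filtered grayscale image as uint8 (rank filters
    # require integer images); i_c: fused, normalized DMP in [0,1].
    i_e = entropy(i_m, disk(4))      # local entropy over a ~9x9 disk, Eq. (17)
    i_e = i_e / 8.0                  # scale by log2(256), the 8-bit maximum
    i_r = i_e * i_c                  # product t-norm, Eq. (18)
    return dilation(i_r, disk(2))    # 5x5 disk dilation spreads evidence outward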

III. IMAGE DESCRIPTORS

In this section, we discuss two state of the art image descriptors and their extension with respect to our importance map. The local binary pattern (LBP) is a type of texture feature. It was initially proposed by Ojala et al. [7] and has been used in the computer vision realm for face recognition and specialized instances of object detection [8], [17], [18]. An advantage of the LBP feature is that it is robust to monotonic gray level changes. It is also computationally efficient, especially when converted into an integral image representation, e.g., integral LBP and integral LBP histogram images. The notation $LBP^u_{n,r}$ denotes the LBP feature that takes $n$ sample points at radius $r$, where the number of 0-1 transitions is no more than $u$; a pattern that satisfies this constraint is called a uniform pattern [7]. The value $LBP_n$ is calculated as

$LBP_n = \sum_{k=0}^{n-1} s(i_k - i_c)\, 2^k$,   (19)

where $LBP_n$ is the LBP code, $i_c$ is the center value, $i_k$ is the value of the $k$th neighbor, and the function $s(x)$ is

$s(x) = \begin{cases} 1 & \text{if } x \geq 0; \\ 0 & \text{otherwise.} \end{cases}$   (20)


Fig. 3. Important region calculation example (original imagery from Figure 2). (a) and (b): the opening DMP after fuzzy integral fusion and entropic filtering for the objects centered in Figure 2 (a) and (b), respectively; (c) and (d): the corresponding closing DMP results; (e) and (f): the resulting dilated fused results. In both instances, opening-based DMP stacks were selected.

The calculation of $LBP_{n,r}$ simply involves sampling each neighbor, $i_k$, on a circle of radius $r$. Ojala observed that there are a limited number of transitions or discontinuities in the circular presentation of the 3x3 texture patterns, and went on to remark that these uniform patterns are fundamental properties of local image texture, meaning that they provide a vast majority of all patterns: 90% for $LBP^2_{8,r}$ and 70% for $LBP^2_{16,r}$. The calculation of $LBP^u_{n,r}$, i.e., uniform LBP, constrains the number of allowed 0-1 transitions (to no more than $u$). For example, the pattern 00100100 is a non-uniform pattern for $LBP^2_{8,r}$, while it is a uniform pattern for $LBP^4_{8,r}$. Once the $LBP^u_{n,r}$ image is computed, a histogram of LBPs is calculated as

$h_{LBP}(m) = \sum_{i \in W} F\{LBP^u_{n,r}(i) = m\}$,   (21)

for $m = 0, \ldots, M-1$, where $W$ is the neighborhood, $h_{LBP}(m)$ is the $m$th histogram bin, $M$ is the number of patterns for $u$ (59 for $u = 2$), and the binary function $F$ is

$F\{A\} = \begin{cases} 1 & \text{if } A \text{ is true;} \\ 0 & \text{otherwise.} \end{cases}$   (22)

LBP histograms are generally scaled to make them comparable across different size windows. We normalize according to

$h_{LBP}(m) = \dfrac{h_{LBP}(m)}{\sum_{k=0}^{M-1} h_{LBP}(k)}$.   (23)

Many distance formulas exist for measuring dissimilarity between histograms, e.g., histogram intersection, the log-likelihood statistic, the chi-square statistic, the Bhattacharyya distance, etc.
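Assuming scikit-image once more, the uniform LBP histogram of Eqs. (21) and (23) can be sketched as follows; the 'nri_uniform' method yields the 59 distinct $LBP^2_{8,r}$ codes noted above.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(img, n=8, r=3, M=59):
    # Compute LBP^u_{n,r} codes (u = 2 via 'nri_uniform'), then bin them, Eq. (21).
    codes = local_binary_pattern(img, P=n, R=r, method='nri_uniform').astype(int)
    h = np.bincount(codes.ravel(), minlength=M).astype(float)
    return h / h.sum()               # normalization of Eq. (23)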

While the LBP histogram is useful in some settings for describing a region of interest according to a collection of binary codes, we also utilize the Edelman et al. [19] histogram of gradients (HOG) descriptor. The HOG is a low-level image descriptor well known for its inclusion in the scale invariant feature transform (SIFT) [20]. As the name implies, it describes the distribution of gradients. This is generally performed on a single channel grayscale image; however, it has been extended to the color domain [21]. For an image region of interest, e.g., a square sub-region at a given scale, the HOG is constructed, typically as a histogram of length 8. The gradient direction determines the histogram bin and the gradient magnitude is the amount added to the bin. As in [20], [21], we blend the magnitude, $\|(\partial I/\partial x, \partial I/\partial y)\|$, into the nearest and second closest bins based on the gradient's direction and $\alpha$ (see [20]), i.e.,

$h(b_r) = h(b_r) + (1 - \alpha) g_m$   (24)

and

$h(b_n) = h(b_n) + \alpha g_m$,   (25)

where $b_r = \lfloor g_0 / (360/B) \rfloor$ is the bin that the gradient resides in, $g_0 \in [0, 360)$ is the gradient direction, and $b_n$ is the next closest bin [20], [21].
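The soft-binned orientation histogram of Eqs. (24) and (25) might be sketched as below; treating $\alpha$ as the fractional offset within a bin is a simplification of the bin-center interpolation detailed in [20].

import numpy as np

def hog_cell(img, B=8):
    gy, gx = np.gradient(img.astype(float))
    g_m = np.hypot(gx, gy)                          # gradient magnitude
    g_0 = np.degrees(np.arctan2(gy, gx)) % 360.0    # direction g_0 in [0, 360)
    width = 360.0 / B
    b_r = np.floor(g_0 / width).astype(int) % B     # bin the gradient resides in
    alpha = g_0 / width - np.floor(g_0 / width)     # fractional offset (simplified)
    b_n = (b_r + 1) % B                             # next closest bin, circular
    h = np.zeros(B)
    np.add.at(h, b_r, (1.0 - alpha) * g_m)          # Eq. (24)
    np.add.at(h, b_n, alpha * g_m)                  # Eq. (25)
    return h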

A. Image Chip Alignment

In the next section, we outline a method for building cell-structured image descriptors. Cell-structured image descriptors take advantage of the spatial relation of image features in different regions, e.g., wings are symmetric about the fuselage. However, we first need to perform image chip alignment, i.e., rotate each chip to an expected orientation. Each image chip is individually aligned to a fixed orientation according to the direction of maximal variance based on its respective importance map. The weighted covariance matrix is

$C = \dfrac{\sum_{(x,y) \in W} I_F(x,y)\, \vec{v}\vec{v}^{\,T}}{\sum_{(x,y) \in W} I_F(x,y) - 1}$,   (26)

where $\vec{v} = ([x\ y] - \vec{\mu})$ and $W$ is the image chip (i.e., the NxM scanning window). From $C$, we calculate the eigenvalues, $\{\lambda_1, \lambda_2\}$, and eigenvectors, $\{\vec{e}_1, \vec{e}_2\}$. The eigenvector with the greatest eigenvalue, i.e., the direction of maximal variance, is selected as $\vec{e}_{max}$. It is ambiguous in which direction the image should be rotated, i.e., either $\vec{e}_{max}$ or $(-1)\vec{e}_{max}$. For training purposes, we duplicate the image instance and rotate according to both directions. We use an image window that is twice the size of the final sliding window to guarantee that the rotated image is fully contained. For testing purposes, we pick a rotation and assume training has accounted for this ambiguity.
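The alignment step amounts to a weighted principal axis computation, Eq. (26), followed by a rotation. A minimal sketch with NumPy/SciPy follows; the rotation sign convention is an assumption, and the oversized chip and its importance map are hypothetical inputs.

import numpy as np
from scipy.ndimage import rotate

def dominant_angle(i_f):
    # Importance-weighted covariance of pixel coordinates, Eq. (26).
    h, w = i_f.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    wsum = i_f.sum()
    vx, vy = xs - (i_f * xs).sum() / wsum, ys - (i_f * ys).sum() / wsum
    c = np.array([[(i_f * vx * vx).sum(), (i_f * vx * vy).sum()],
                  [(i_f * vx * vy).sum(), (i_f * vy * vy).sum()]]) / (wsum - 1.0)
    evals, evecs = np.linalg.eigh(c)
    e_max = evecs[:, np.argmax(evals)]   # eigenvector of maximal variance
    return np.degrees(np.arctan2(e_max[1], e_max[0]))

def align_chip(chip, i_f):
    # chip: oversized (2x) window so the rotation stays fully contained.
    # The e_max vs. -e_max ambiguity is handled in training by also using
    # the chip rotated an additional 180 degrees.
    return rotate(chip, dominant_angle(i_f), reshape=False)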

B. Cell-Structured Multi-Scale Pyramid Descriptors

A common computer vision problem arises at this point regarding the image descriptors: from which scale should the features be extracted? At one extreme, these descriptors can be used to model the surface level texture of an object of interest. At the other extreme, they can be used to model larger scale information, e.g., shape, about the objects under investigation.

We address this classical problem in the following two ways. First, we produce multiple spatial resolutions of the region of interest (e.g., 100x100 at nominal 0.5x0.5m resolution, as well as 50x50 and 25x25 for down-sampled resolutions of 1m and 2m, respectively). The three spatial resolutions of the image form a pyramid, i.e., full, half and quarter resolution per dimension. The HOG is fixed for an image chip, other than selection of the number of bins to use, but the LBP requires specification of $r$, $n$ and $u$. Our approach is to produce the three reduced images using bilinear interpolation. The HOG parameters are fixed across the three images, and so are the LBP parameters. The result is the extraction of different information at different spatial resolutions, creating a multi-scale set of features.

The second technique provides proper spatial context for features in the window. That is, a plane has two wings, and their corresponding features should spatially exist only in two regions opposite each other in the cross-plane direction. Cell-structured image descriptors are a popular tool in the computer vision realm [8], [17], [18]. We use cell-structured versions of both the LBP and the HOG. For the 100x100 image, we extract a 3x3 cell-structured descriptor (where the cells overlap by 30%); for the 50x50 image, a 2x2 cell-structured descriptor (20% overlap); and for the 25x25 image, a single cell descriptor (i.e., 1x1 cell-structured). For each cell, the LBP and HOG are calculated as already described. The final descriptor is the concatenation of all of the cells into one long feature descriptor, with each cell individually normalized to an area under the curve of 1. Figure 4 illustrates this multi-scale pyramid descriptor concept.
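The pyramid construction can be summarized in a short sketch; here describe stands in for any cell-level feature function (e.g., the lbp_histogram sketch above), and the cell-size arithmetic is our reading of the stated overlaps.

import numpy as np
from skimage.transform import resize

def pyramid_descriptor(chip, describe):
    # chip: aligned 100x100 window; three levels with 3x3, 2x2, 1x1 cell grids.
    parts = []
    for size, grid, overlap in ((100, 3, 0.3), (50, 2, 0.2), (25, 1, 0.0)):
        img = resize(chip, (size, size), order=1)   # bilinear down-sampling
        cell = int(round(size / (grid * (1 - overlap) + overlap)))
        step = int(round(cell * (1 - overlap)))     # overlapping cell stride
        for i in range(grid):
            for j in range(grid):
                y, x = i * step, j * step
                parts.append(describe(img[y:y + cell, x:x + cell]))
    return np.concatenate(parts)       # one long multi-scale feature vector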

C. Importance-Map Weighted Image Descriptors

The primary motivation and contribution of this work is the following. When performing sliding window-based object recognition, or any similar setup that is capable of unsupervised segmentation and subsequently centering a window on the resulting object, there exist two different types of information in the window. The first type of information is the object of interest (e.g., a cargo plane). The second is everything else (e.g., trees, road, people, etc.). In order to build descriptors that are robust and can be transferred between contexts, i.e., different geospatial regions, the framework should be designed to focus on the most relevant information about the object contained in a window. That is, we do not want the descriptor to explain other objects and textures. We also assume that this is a real-world problem and that we cannot simply account for this with the brute force method of collecting data from every possible scenario/context that might arise.

Fig. 4. Cell-structured multi-scale pyramid representation. Successive scales are bilinearly down-sampled to coarser spatial resolutions and cell partitioning is reduced.

Our idea is to build image descriptors based on the DMP stack fusion results described above. The fusion results can be thought of as a soft segmentation (confidence) of objects in the region of interest, specifically those of scale related importance as specified by the density selection in the fuzzy measure. We extend the LBP and HOG cell-structured image descriptors to take the importance map weights into account:

$h_{LBP}(m) = \sum_{i \in W} \beta_i F\{LBP^u_{n,r}(i) = m\}$,   (27)

and

$h(b_r) = h(b_r) + (1 - \alpha)(\beta g_m)$,   (28)

$h(b_n) = h(b_n) + \alpha(\beta g_m)$,   (29)

where $\beta_i$ is the $[0,1]$ importance map weight for pixel $i$ from image $I_F$. That is, each pixel is weighted according to its importance map weight. The resulting histograms are normalized as discussed before. The actual feature descriptor used in this work is the multi-scale pyramid descriptor using the importance weighted calculations.
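Extending the earlier LBP histogram sketch to Eq. (27) is essentially a one-line change: the per-pixel importance weights $\beta$ enter as histogram weights.

import numpy as np
from skimage.feature import local_binary_pattern

def weighted_lbp_histogram(img, beta, n=8, r=3, M=59):
    # beta: per-pixel [0,1] weights taken from the importance map I_F.
    codes = local_binary_pattern(img, P=n, R=r, method='nri_uniform').astype(int)
    h = np.bincount(codes.ravel(), weights=beta.ravel(), minlength=M)  # Eq. (27)
    return h / h.sum()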

For the HOG, we use 16 bins. This results in descriptors of length 144 for the 100x100 scale with 3x3 cells, 64 for 50x50 with 2x2 cells, and 16 for the 25x25 reduced image. For the LBP, we empirically found $r = 3$, $n = 8$ and $u = 2$, which results in lengths of 531 for 100x100 with 3x3 cells, 236 for 50x50 with 2x2 cells, and 59 for the 25x25 reduced image. In total, our LBP and HOG multi-scale pyramid descriptor is of length 1,050.

IV. EXPERIMENTAL RESULTS

Herein, classifiers are built per object class based on different formulations of the features. We do this to determine which parts of the proposed system provide the greatest benefit.


TABLE I
DATA SET OBJECT COMPOSITION

             Com. Jet   Heli.   Cargo   Other
Data set 1
  July 2009      6        7       3       6
  Nov. 2009     13       18      11       8
  Jan. 2010     10        8       5      11
  Mar. 2010      7        7       6       8
  all            0        0       0     792
Data set 2
  2275           9        2       5       6
  2489           7        0       0       4
  2493           4        0       0       3

We use the support vector machine [22] with the dot product kernel function. Two data sets are used in this work. Data set 1 consists of four different scenes, collected at different times, of Kabul International Airport; data set 2 is composed of three scenes from different locations. Table I details the object composition of the two data sets.
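For reference, the per-class classifiers can be reproduced with any linear SVM implementation; a sketch assuming scikit-learn, with hypothetical arrays of 1,050-dimensional descriptors and binary labels:

from sklearn.svm import SVC

def train_class_svm(X, y):
    # X: rows of 1,050-dim pyramid descriptors; y: 1 for the target class, else 0.
    # One such binary classifier is built per object class.
    clf = SVC(kernel='linear')   # dot product (linear) kernel, as in [22]
    return clf.fit(X, y)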

Three similar object classes are identified for analysis in this work, commercial jets, helicopters and cargo planes, along with miscellaneous planes and tiles not centered on a target object. These objects are selected due to their importance in remote sensing as well as their common features, which can make classification difficult. Data set 1 is dominated by chips not centered on a target object (792 instances). Our first experiment is N-fold cross validation. We use two folds because of the limited number of training and testing examples per object class. However, later in this section we show results for test sites held back from training. These three testing sites are different in terms of their background and intra-class object appearance. The average (over the two folds) confusion matrix is calculated for: (R1) the non-pyramid flat descriptor without importance map weighting, (R2) the non-pyramid flat descriptor with importance map weighting, (R3) the pyramid descriptor without importance map weighting, and (R4) the pyramid descriptor with importance map weighting.

For each object class, scoring is conducted with respect to target/object hits (TAs) and false alarms (FAs). Performance indicators reported are based on $TA^+$ (number of targets correctly classified as TA), $TA^-$ (number of targets incorrectly classified as FA), $FA^+$ (number of false alarms correctly classified as FA) and $FA^-$ (number of false alarms incorrectly classified as TA). These indicators are the elements of the confusion matrix. From these indicators we calculate the target detection rate (TAR) and false alarm rate (FAR) as

$TAR = \dfrac{TA^+}{TA^+ + TA^-}$ and $FAR = \dfrac{FA^-}{TA^+ + FA^-}$.

Table II summarizes the results of this testing.

Results R1 are the baseline for comparison, where no pyramid is used and every pixel is weighted equally in terms of feature importance. These results are poor: the TAR is low and the FAR is significant. Results R2 show the improvement obtained with pixel importance weighting, where per-pixel feature importance weighting helps raise the TAR and lower the FAR. Results R3 demonstrate the performance

TABLE II
2-FOLD CROSS VALIDATION RESULTS

                                         Com. Jet   Helicopter   Cargo
R1: Non-pyramid, no importance weight
  TAR                                      0.44        0.34       0.08
  FAR                                      0.27        0.49       0.96
R2: Non-pyramid, with importance weight
  TAR                                      0.53        0.38       0.21
  FAR                                      0.21        0.44       0.66
R3: Pyramid, no importance weight
  TAR                                      0.82        0.83       0.78
  FAR                                      0.11        0.28       0.38
R4: Pyramid, with importance weight
  TAR                                      0.94        0.99       0.80
  FAR                                      0.16        0.15       0.30

TABLE III
RESULTS FOR DATA SET 2

                                         Com. Jet   Helicopter   Cargo
Pyramid, with importance weight
  TA+                                        15          1          1
  TA-                                         1          1          1
  FA+                                        19         37         34
  FA-                                         5          1          4
  TAR                                      0.94        0.5        0.5
  FAR                                      0.25        0.5        0.8
Pyramid, no importance weight
  TA+                                        15          0          0
  TA-                                         6          0          2
  FA+                                        14         38         33
  FA-                                         5          2          5
  TAR                                      0.71          0          0
  FAR                                      0.25          1          1

benefit of the pyramid descriptor without importance weighting: the pyramid drastically improves the TAR and substantially lowers the FAR. Results R4 show the benefit of using both the pyramid and importance map weighting; the TAR is better for all classes, and the FAR is lower for each class except commercial jet. Together, the results in Table II show that the pyramid cell-structured importance weighted method performs best. They also highlight that the pyramid descriptor provides the most benefit, followed by importance map weighted features.

The second experiment conducts training on data set 1 and testing on data set 2 using just the pyramid features, with and without the importance map weighting. Table III summarizes the test results for sites held back from training. This data set includes targets from across the globe: Tehran, Iran, and the Denver, Colorado, USA airport. These experiments show that the commercial jet classifier generalizes the best; commercial jets were observed to be the most visually similar across the two data sets. Their TAR is essentially the same as in the cross validation results, but data set 2 has a higher FAR. However, helicopters and cargo planes do not hold up as well as commercial jets. Data set 2 is difficult, in part, because there are only five cargo and two helicopter examples. Instead of collecting and training with respect to a large set of simple and similar examples, the goal of data set 2 was to observe where the algorithm has the most difficulties. For example, helicopters are present in data set 2 that are not accounted for in training: data set 2 helicopters are spherical and have one rotor blade, while data set 1 helicopters are oblong and have multiple blades.

Cargo planes proved the most difficult. Analysis of the importance map production indicates fairly accurate findings, and visual inspection of this object class reveals that the objects are similar in appearance between data sets 1 and 2. It is our conjecture that the difficulty here resides not in the importance map or the pyramid scheme, but rather in the features. Commercial jets generally have a long fuselage and backward swept wings, and in most instances the eigen-based alignment method rotates the object to the correct direction. However, for cargo planes, the wings are generally large and can be as long as, if not longer than, the fuselage, leading to poor alignment. In light of the power of the descriptor framework and importance map weighting, the HOG and LBP features do not appear to generalize as effectively for this class as they do for the other classes.

V. CONCLUSION AND FUTURE WORK

In summary, we presented a multi-scale, pyramid-based, pixel-weighted feature importance computer vision descriptor and demonstrated its performance for the recognition of different aircraft from remotely sensed imagery. The descriptor is based on histograms of oriented gradients (HOGs) and local binary patterns (LBPs). Opening and closing differential morphological profiles (DMPs) were constructed; Choquet integral fusion of the DMP stack was performed; and entropy filtering, image chip alignment, and DMP stack selection operators were put forth. Superiority was demonstrated in comparison to flat single-scale and non-importance weighted representations. Cross validation shows encouraging results, and blind testing with a set of data held back from training indicates the areas of this approach that need refinement. The progression of the results, as reported in Table II, demonstrates that the pyramid cell-structured importance weighting performs well versus traditional approaches in the difficult problem space of recognizing objects in remote sensing imagery.

In future work, we will focus on improving the fusion of objects across the opening and closing DMP stacks. The entire fuzzy measure will also be learned. We will additionally explore the inclusion of different image features to improve performance, focusing on features that are more robust to rotation. Finally, we will explore methods to learn the contribution of each feature at each scale in order to reduce the size of the feature descriptor.

ACKNOWLEDGEMENTS

The authors would like to thank DigitalGlobe for providing QuickBird imagery from the RADII development data set for use in this paper.

REFERENCES

[1] C. Homer, C. Huang, L. Yang, B. Wylie, and M. Coan, "National land-cover database for the United States."

[2] R. Lucas, A. Rowlands, A. Brown, S. Keyworth, and P. Bunting, "Rule-based classification of multi-temporal satellite imagery for habitat and agricultural land cover mapping," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 62, no. 3, pp. 165–185, 2007.

[3] G. Xian, C. Homer, and J. Fry, "Updating the 2001 national land cover database land cover classification to 2006 by using Landsat imagery change detection methods," Remote Sensing of Environment, vol. 113, no. 6, pp. 1133–1147, 2009.

[4] O. Sjahputera, G. Scott, B. Claywell, M. Klaric, N. Hudson, J. M. Keller, and C. H. Davis, "Clustering of detected changes in high-resolution satellite imagery using a stabilized competitive agglomeration algorithm," IEEE Trans. Geoscience and Remote Sensing, to be published, 2012.

[5] X. Jin and C. H. Davis, "An integrated system for automatic road mapping from high-resolution multi-spectral satellite imagery by information fusion," Information Fusion, vol. 6, no. 4, pp. 257–273, 2005.

[6] G. Sohn and I. Dowman, "Data fusion of high-resolution satellite imagery and LiDAR data for automatic building extraction," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 62, no. 1, pp. 43–63, 2007.

[7] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, vol. 29, no. 1, pp. 51–59, 1996. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0031320395000674

[8] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005, pp. 886–893.

[9] (2011, May) PASCAL2 visual object classes challenge, 2010. [Online]. Available: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/index.html

[10] M. Pesaresi and J. A. Benediktsson, "A new approach for the morphological segmentation of high-resolution satellite imagery," IEEE Trans. Geoscience and Remote Sensing, vol. 39, pp. 309–320, 2001.

[11] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed. Prentice Hall, 2007.

[12] M. Grabisch, H. Nguyen, and E. Walker, Fundamentals of Uncertainty Calculi, with Applications to Fuzzy Inference. Kluwer Academic, Dordrecht, 1995.

[13] H. Tahani and J. Keller, "Information fusion in computer vision using the fuzzy integral," IEEE Transactions on Systems, Man, and Cybernetics, vol. 20, pp. 733–741, 1990.

[14] M. Grabisch, T. Murofushi, and M. Sugeno, Fuzzy Measures and Integrals: Theory and Applications. Physica-Verlag, Heidelberg, 2000.

[15] M. Sugeno, "Theory of fuzzy integrals and its applications," Ph.D. thesis, Tokyo Institute of Technology, 1974.

[16] A. Mendez-Vazquez, P. Gader, J. Keller, and K. Chamberlin, "Minimum classification error training for Choquet integrals with applications to landmine detection," IEEE Transactions on Fuzzy Systems, vol. 16, no. 1, pp. 225–238, Feb. 2008.

[17] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in European Conference on Computer Vision. Springer, 2006.

[18] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in IEEE 12th International Conference on Computer Vision, 2009, pp. 32–39.

[19] S. Edelman, N. Intrator, and T. Poggio, "Complex cells and object recognition," 1997.

[20] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.

[21] R. H. Luke, J. M. Keller, and J. Chamorro-Martinez, "Extending the scale invariant feature transform descriptor into the color domain," ICGST International Journal on Graphics, Vision and Image Processing (GVIP), vol. 08, pp. 35–43, 2008.

[22] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annual ACM Workshop on Computational Learning Theory (COLT), 1992, pp. 144–152.