
Scalable Remotely Sensed Image Mining Using Supervised Learning and Content-based Retrieval

Ritendra Datta, Jia Li, Ashish Parulekar, and James Z. Wang

Abstract

Automated satellite image analysis systems have traditionally been designed to accurately analyze

small land covers, typically covered by a single satellite image. It is often highly desirable to have a

real-time system which can analyze land cover spanning large regions. In this paper, we approach the

problem of large-scale satellite image mining from a content-based retrieval and semantic categorization

perspective. A two-stage architecture for automatic retrieval of satellite image patches is proposed. The

semantic categories of query patches are determined and patches from that category are ranked based

on an image similarity measure. Semantic categorization is done by a learning approach involving the

two-dimensional multi-resolution hidden Markov model (2-D MHMM). Patches that do not belong to

any trained category are handled using a support vector machine (SVM) based classifier. One issue is the image variation caused by changing sun elevation angles, which hinders robust image mining. We tackle this problem using histogram transformations. Experiments yield promising results

in modeling semantic categories within satellite images using 2-D MHMM, producing accurate and

convenient browsing. We also show that prior semantic categorization improves retrieval performance.

A system prototype has been created for demonstration purposes.

The authors are with The Pennsylvania State University, University Park, PA 16802, USA. The system prototype can be accessed at http://riemann.ist.psu.edu/remote/ .

I. INTRODUCTION

Since 1972, when the first remote sensing satellite was launched, there have been significant technological advances in optical sensing systems. While remote sensing data acquisition technology has reached new heights, automated analysis of the collected data has not evolved

at the same pace to meet the current requirements. Every day, there is a massive amount of

remotely sensed data being collected and sent by terrestrial satellites. Prohibitive costs involved

in the manual analysis of these large volumes have made it imperative to develop automated

image analysis and mining tools. An automated, real-time, content based satellite image retrieval

system capable of handling large volumes can come in handy for information mining.

In this paper, we propose a system to aid in large-scale mining of remotely sensed images. In

essence, a content-based image retrieval (CBIR) system designed specifically for satellite imagery

is proposed. We represent collections of satellite images by a database of non-overlapping

fixed-size rectangular regions, henceforth referred to as patches. The proposed system involves

supervised learning of land cover categories of interest, and using the category information to

aid querying and browsing in a CBIR framework. It is shown through experiments that (1)

two-dimensional multi-resolution hidden Markov models (2-D MHMM) are an effective way

to jointly model the spectral and spatial structure of different land cover categories in satellite

imagery, and hence are valuable tools for automatic learning from this type of imagery, (2) CBIR

is a practical approach for real-time browsing and analysis of remotely sensed imagery, and (3)

performing categorization prior to applying image retrieval techniques significantly increases

speed and precision of retrieval from large collections of satellite imagery. The novelty and

key contributions include the proposal and implementation of a flexible system architecture

that allows for easy modifications, an intuitive user interface to aid in practical analysis, and

the demonstration that prior categorization leads to a better remotely sensed image retrieval

performance. No attempt is made to compare the categorization process in our architecture with

existing classification approaches.

For a general overview, consider the interface shown in Fig. 1. With the aim of mining the

image database for crop fields of a certain kind, an image analyst can use an example patch and

our system to retrieve all patches which resemble the particular pattern. A generative classifier

screens the retrieval results to ensure that patches retrieved are from the same semantic category

as the query, in this case crop fields. In other cases, land cover of a certain type within a particular

geographic region may be of interest. The proposed system is designed to aid in the retrieval

of such patches in real-time for further analysis. Other applications of our system could be as

follows. If an analyst specializing in residential areas wishes to scrutinize all images containing significant portions of residential land cover, manually browsing through all available imagery is typically impractical when large volumes are involved. Instead, our proposed system can be employed to quickly and automatically find such images. The analyst can then proceed with the investigation on the selected set of relevant images, possibly a fraction of the original volume.

Fig. 1. Screenshot of the system interface. A crop patch query (top-left) is used to retrieve other visually similar crop regions within the database. Prior categorization of patches helps produce high-precision results.

The paper is arranged as follows. We survey work closely related to ours in Sec. II. In

Sec. III, the salient characteristics of the remotely sensed data used in our experiments and system

prototype are discussed. In Sec. IV, we elaborate on the proposed system architecture, including

the user interface for browsing and retrieval. In Sec. V, the learning-based categorization methods and the CBIR-based image mining technique are discussed. Also discussed here are methods for

image normalization for handling sun elevation variance and other atmospheric effects. We then

discuss the experimental setup and the obtained results in Sec. VII. We conclude in Sec. VIII.


II. RELATED WORK

Thus far, there have been numerous attempts at automated classification of remotely sensed

imagery. In this section, we review research that is most closely related to our work.

One of the earliest attempts at remotely sensed image classification was by Haralick et al. [14]. In more recent years, many different techniques have been proposed, some of which

include texture analysis [24], [20], Markov random fields [31], genetic algorithms [32], fuzzy set

theory [12], [38], [34], Bayesian classifiers [28], [10], decision trees [19] and neural networks

[4], [2]. Generalized orthogonal subspace projection has been found to improve classification

performance on Landsat TM images in [27]. The use of multi-source as well as multi-temporal

imagery for a data fusion approach to Landsat TM image classification has been explored in [5].

Since the growth in popularity of Support Vector Machines (SVM) as a supervised learning technique, their use in the classification of Landsat and other types of imagery has been explored, for example, in [15], [25]. Others have approached remotely sensed image classification as an application of CBIR, for example, [24], [29], [11]. Content-based image retrieval for

high resolution (��

) aerial photographs using Gabor filters has been implemented in [26],

[23]. Evaluation of remotely sensed image mining systems has been explored [9]. Assessment

of Landsat image classification performance has been discussed in [8]. Furthermore, in [30],

classification evaluation methods have been reviewed, and the �� statistic has been found

most suited for comparing classifiers. Clearly, there has been extensive work on satellite image

classification, by the remote-sensing community as well as the image analysis community.

In a recent survey [37] on satellite image classification results published in the Photogrammetric Engineering and Remote Sensing Journal, it has been reported that classification accuracies have not increased significantly over the last fifteen years. The paper reports a mean

overall pixel-wise classification accuracy across all these experimental results at �� with

a standard deviation of�� . However, there are a number of parameters that have changed

over the years or across different experimental settings with respect to the remote-sensed data,

which make it difficult to generalize easily. Some of the parameters that vary include the spatial

and spectral information in the imagery, the sensor type, the geographic location and terrain

conditions of the data sets, and the ground-truth labels of the regions on the maps based on

which the accuracies are calculated (due to their subjective and often ambiguous nature). Owing


to advances in data collection technology, the quality of available image data has improved

as well, with reduced distortions using techniques such as orthorectification. Nonetheless, this

report is indicative of the possible existence of an upper bound on the classification accuracy

achievable in this type of imagery, given the current technology. Hence, the focus of our work

is to build a convenient CBIR-based tool for mining semantically relevant images from large

collections, instead of attempting to improve upon classification accuracy.

III. REMOTELY SENSED IMAGE DATA

Before elaborating on the system design, we discuss here the salient features of the data used

in our experiments and prototype. Prior knowledge about the data helps employ an appropriate

feature extraction process and design philosophy. It also helps foresee difficulties that may arise

due to the intrinsic nature of data, such that the approach can be suitably geared toward solving

them.

Fig. 2. Bands in the satellite imagery captured by the ETM+ device on Landsat 7. (a) Band 1 (Blue). (b) Band 2 (Green). (c) Band 3 (Red). (d) Band 4 (NIR). (e) Band 5 (SWIR 1). (f) Band 7 (SWIR 2). (Courtesy: GLCF).

The Landsat satellites have been orbiting the Earth for over 30 years, and the data that they collect has been used to study the land cover, environmental resources, and man-made structures on the Earth's surface. The Enhanced Thematic Mapper Plus (ETM+) instrument on Landsat 7 is a multi-spectral remote sensing device that captures several spectral bands. Each band is sensitive to a different wavelength range of solar energy. Bands 1, 2 and 3 of the ETM+ instrument record reflected light in the visible range, and are known as the blue, green, and red bands respectively. The ETM+ sensor also records energy beyond the visible spectrum. Band 4 is known as near infrared (NIR), and bands 5 and 7 are known as short wavelength infrared (SWIR). All these bands are recorded at a spatial resolution of 30 meters. A plot of these 6 bands on a sample land cover can be seen in Fig. 2.

Each wavelength band has the potential to capture different features of the earth’s terrain.

While band 1 (blue) typically illuminates objects under clear water better than other colors,

plants do not show up brightly in this band. On the other hand, band 2 (green) reflects vegetation

brightly and hence can be used to determine the health of vegetation. Band 2 also gives excellent

contrast between clear and turbid water. Band 3 (red) reflects well from dead foliage and also

highlights urban features. Band 4 (NIR) can potentially help differentiate between various types

of vegetation. While band 5 (SWIR 1) is useful in distinguishing between various types of

vegetation, it has limited cloud penetration. This in turn helps differentiate between snow and

clouds. Band 7 (SWIR 2) is useful in detecting moisture in soil and vegetation [33]. Clearly,

the spectral dimensions have a lot of information content that can be utilized for classification.

The spatial patterns in local neighborhoods can further help improve upon classification.

Fig. 3. Tri-band Landsat 7 ETM+ image formats. (a) True color format. (b) NIR format. (c) SWIR format. (Courtesy: GLCF)

In typical color image retrieval, tri-band images (usually the RGB spectrum) are used for classification and indexing. In this work, three bands are selected from the 6 bands available in Landsat ETM+ imagery. This choice was based on a combination of two factors: (1) calculation of the Optimum Index Factor (OIF) [18] over all possible subsets of 3 bands, and (2) consultation with an experienced satellite image analyst working in a government research lab, who provided intuitions about the choice of spectral bands based on the land cover categories used.

A true color format (RGB) consists of bands 1, 2, and 3 representing their corresponding

color bands. As shown in Fig. 3 (a), it produces realistic representations of the land cover. The

problems, though, are that images produced are of low contrast, and that band 1 (blue) is usually

noisy due to its high dispersion in the atmosphere. In the SWIR format, blue color is displayed

by band 2 (green), green color by band 4 (NIR) and red color by band 5 (SWIR) (See Fig. 3 (b)).

While this format is useful in analyzing vegetation patterns, it is very sensitive to cloud cover.

A widely used format is the NIR format, in which blue color is displayed by band 2 (green),

green by band 3 (red) and red by band 4 (NIR) (See Fig. 3 (c)). With this format, vegetation

appears bright red and urban areas appear greenish blue.
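The band-to-color assignment of the NIR format can be illustrated with a minimal sketch. It assumes each band is available as a list of rows of pixel values; the function name `nir_composite` and the toy values are hypothetical and only serve the illustration.

```python
def nir_composite(band2, band3, band4):
    """Build (R, G, B) display pixels for the NIR format.

    Red is taken from band 4 (NIR), green from band 3 (red),
    and blue from band 2 (green), as described in the text.
    Each band is a list of rows with identical dimensions.
    """
    return [
        [(b4, b3, b2) for b2, b3, b4 in zip(row2, row3, row4)]
        for row2, row3, row4 in zip(band2, band3, band4)
    ]


# Toy 1 x 2 image: one row, two pixels per band (made-up values).
toy_b2 = [[1, 2]]
toy_b3 = [[3, 4]]
toy_b4 = [[5, 6]]
composite = nir_composite(toy_b2, toy_b3, toy_b4)
```

Under this mapping, a pixel with a strong band 4 response (typical of vegetation) ends up with a large red component, which is consistent with vegetation appearing bright red.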

To provide further insight into the choice of spectral bands, we computed the OIF for all 20 combinations of 3 bands out of the 6 bands (1, 2, 3, 4, 5, and 7) over a representative subset of the imagery used, consisting of all land cover categories of interest (specified later). The OIF for a set of 3 bands is computed as

$$\mathrm{OIF} = \frac{\sum_{i=1}^{3} \sigma_i}{\sum_{j=1}^{3} |r_j|} \qquad (1)$$

where $\sigma_i$ are the standard deviations of the 3 bands, and $r_j$ are the three correlation coefficients formed out of the 3 bands, taken two at a time. The measure tends to be larger when the information content for a set of 3 bands is higher. Because the training samples consist of distinct sets for each land cover category in question, computing the OIF over them ensures that the measure ranks discriminative power mainly over the categories of interest. The OIF values for all 20 triplets are shown in Table I. We find that the top three ranked formats are (3, 4, 5), (1, 4, 5), and (2, 3, 4), in that order. Band 4 appears in all of the top 10 combinations in the OIF table. Intuitively, band 3 (red) helps highlight urban features, and band 2 (green) helps reflect crop fields better, distinguishing them from other land cover classes, as discussed previously. Thus, based on both intuition and the OIF table, we find the (2, 3, 4) combination (NIR format) to be most suitable, being the third most discriminative combination according to OIF, and hence use it for all our experiments.
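The OIF ranking over all band triplets can be sketched as follows. This is an illustration under stated assumptions, not the code used for Table I: each band is assumed to be a flat list of pixel values, the population standard deviation is used (the text does not say which estimator was applied), and the names `std_dev`, `correlation`, `oif`, `rank_triplets` and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def std_dev(values):
    """Population standard deviation of a list of pixel values."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    """Pearson correlation coefficient between two equally long bands."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def oif(bands, triplet):
    """Sum of the 3 standard deviations over the sum of the 3 |correlations|."""
    numerator = sum(std_dev(bands[b]) for b in triplet)
    denominator = sum(abs(correlation(bands[a], bands[b]))
                      for a, b in combinations(triplet, 2))
    return numerator / denominator


def rank_triplets(bands):
    """Return (OIF, triplet) pairs for all 3-band subsets, best first."""
    triplets = combinations(sorted(bands), 3)
    return sorted(((oif(bands, t), t) for t in triplets), reverse=True)


# Toy data: made-up flattened pixel values for bands 1, 2, 3, 4, 5 and 7.
toy_bands = {
    1: [10, 12, 11, 13, 12, 14],
    2: [11, 13, 12, 14, 13, 15],
    3: [20, 18, 25, 17, 30, 16],
    4: [50, 90, 40, 95, 35, 99],
    5: [30, 35, 28, 60, 22, 41],
    7: [15, 16, 14, 19, 13, 18],
}
ranking = rank_triplets(toy_bands)
```

With 6 bands, `rank_triplets` enumerates the same 20 combinations as the table; the toy values above are arbitrary and do not reproduce the reported OIF figures.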

TABLE I

OPTIMUM INDEX FACTOR (OIF) FOR THE LANDSAT ETM+ BANDS OVER REGIONS OF INTEREST.

Rank   Combination   OIF
1      3, 4, 5       41.1730
2      1, 4, 5       40.4513
3      2, 3, 4       39.2354
4      3, 4, 7       37.9621
5      1, 3, 4       37.7571
6      1, 4, 7       37.7465
7      2, 4, 5       34.7594
8      4, 5, 7       33.4958
9      2, 4, 7       33.3145
10     1, 2, 4       30.3658
11     3, 5, 7       19.6789
12     1, 5, 7       18.8772
13     1, 3, 5       18.5256
14     2, 5, 7       18.3707
15     2, 3, 5       18.2690
16     1, 2, 5       16.7659
17     2, 3, 7       14.0359
18     1, 3, 7       13.8118
19     1, 2, 7       12.4120
20     1, 2, 3       10.6813

Publicly available Landsat-7 ETM+ images [33] (source of data for experiments: http://glcf.umiacs.umd.edu/data/landsat/) are used, which have a spatial resolution of 30 m. Four land cover categories of interest are considered, namely urban, residential, crop and mountain regions, although our architecture supports seamless addition of more classes through a modular training process. The choice of these particular land cover categories, which correspond roughly to some of the Level I categories in the USGS classification [1], was due to a specific application that was kept in mind at the inception of the project. Note that the specific formats and resolutions used in our experiments do not restrict the use of our approach for other formats. In particular, our 2D-MHMM and IRM based modeling and retrieval processes are not restricted to three-band imagery, and hence the whole setup can be easily modified to support a higher number of spectral bands. Our architecture is designed to allow for such additions to the system without much effort.

IV. ARCHITECTURE OF THE PROPOSED RETRIEVAL SYSTEM

Fig. 4. System architecture. Left side: The database building process. Right side: Real-time user interaction.

The generic framework of the proposed system can be divided into two parts. The off-line

processing part consists of initial data acquisition, normalization, ground-truth labeling, and

model building, while the on-line part consists of querying, browsing and retrieval. A schematic diagram of the architecture is given in Fig. 4; it follows along the lines of a learning-based CBIR system for more generic image data. Below we discuss each of these components, followed by a section on implementation details that covers the practical problems we faced while implementing this architecture and how we tackled them.

A. Off-line Indexing of Images

Although we have introduced specifics on the data we experimented on, in this section we elaborate on the details in a fairly general manner. Consider a set of $N$ satellite images $\{I_k,\ k = 1, \ldots, N\}$ of an arbitrary type (e.g., Landsat, ASTER, SRTM). A subset of the available spectra (here, bands 2, 3, and 4) for the given type of imagery (here, Landsat 7 ETM+) is used to create composite images. A raw NIR image typically has low brightness and contrast. In order to improve the visual clarity of the image and to reduce the differences in image contrast due to sun elevation angle, we attempt normalization of the images by two different methods, namely a standard digital-number (DN) to at-surface reflectance conversion procedure commonly used in the remote sensing community, and a generic normalization procedure practised in the image processing community. These procedures are elaborated upon in Sec. V-A.

The normalized images are divided into equal-sized non-overlapping rectangular patches of size $w \times h$, padding the right and bottom with zeros appropriately, to get a total of $P$ patches $\{p_1, \ldots, p_P\}$. Traditionally, satellite image classification has been done at the pixel level [31], [4]. For a typical Landsat image at 30 m resolution, class boundaries are at the pixel level. Even at this level there may sometimes be class ambiguity, due to which some authors have proposed fuzzy classification [12], [38], [3] as opposed to hard classification. Even though pixel-level classification may give an overall segmented view of the land cover, formulating a CBIR framework on top of it becomes challenging. While a system working at the patch level may have inferior land cover classification accuracy, there are distinct advantages to doing so: (1) formulating the problem in an image retrieval framework becomes convenient, (2) within a patch category such as urban regions or forests, it helps to find regions with comparable density, and (3) salient features such as specific patterns of deforestation or terrace farming within crop fields can be tracked using patch-level classification followed by heuristic search techniques. However, classification of patches instead of pixels can lead to greater ambiguity, because ground-truth inter-class boundaries are at the pixel level, or at an even finer level, especially for some of the USGS Level II, III, and IV categories [1].
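The division of an image into fixed-size, non-overlapping patches with zero padding on the right and bottom can be sketched as below. The sketch assumes a single-band image stored as a list of rows; the function name `tile_image` and the dictionary keys are hypothetical, chosen to mirror the metadata (top-left corner) that the system stores per patch.

```python
def tile_image(image, patch_w, patch_h):
    """Split a 2-D image (list of rows) into non-overlapping patches.

    The right and bottom edges are zero-padded so that every patch has
    exactly patch_h rows and patch_w columns. Each patch is returned with
    the coordinates of its top-left corner in the parent image.
    """
    height, width = len(image), len(image[0])
    padded_h = -(-height // patch_h) * patch_h  # round up to a multiple
    padded_w = -(-width // patch_w) * patch_w
    padded = [list(row) + [0] * (padded_w - width) for row in image]
    padded += [[0] * padded_w for _ in range(padded_h - height)]

    patches = []
    for top in range(0, padded_h, patch_h):
        for left in range(0, padded_w, patch_w):
            pixels = [r[left:left + patch_w] for r in padded[top:top + patch_h]]
            patches.append({"top": top, "left": left, "pixels": pixels})
    return patches


# A 3 x 5 toy image cut into 2 x 2 patches (padded to 4 x 6).
toy = [[1, 2, 3, 4, 5],
       [6, 7, 8, 9, 10],
       [11, 12, 13, 14, 15]]
patches = tile_image(toy, patch_w=2, patch_h=2)
```

For a multi-band composite, the same tiling would be applied to each band with identical offsets, so that a patch keeps all of its spectral channels.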

Fig. 5. Examples of ambiguous patches. Left: Urban and Residential. Right: Residential and Crop.

Image patch sizes of, say, $64 \times 64$ and $128 \times 128$ pixels (as used in our experiments) cover roughly 3.7 km$^2$ and 14.7 km$^2$ on the ground respectively. While these are too large an area to represent precise ground segmentation, the focus is on convenient visualization of image data. However, a strategy is still required to resolve the ambiguity that this approach inherently leads to. As shown in Fig. 5, some patches have large coverage of different categories. In our system, fuzziness is not incorporated. Instead, we consider a patch $p_i$ to belong to a category $c_i$ if a sufficiently large fraction of $p_i$ is covered by type $c_i$ (dominant category). Patches which do not belong to any of the categories $\{c_1, \ldots, c_K\}$, or those that do not have a dominant category, are given class label 0 (category unknown). Dividing the image into rectangular patches makes it convenient for training as well as browsing. Moreover, for semantic classification, a more global view of an area is helpful. For example, a few trees in a city may occupy a pixel in the image. This pixel is still ideally classified as part of an urban area rather than a forest. Thus, the granularity of patches involves a trade-off between global view and ambiguity. Zooming and panning form part of the interface to allow users flexibility in result visualization.
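The dominant-category rule for labeling a patch can be sketched as follows, assuming per-pixel ground-truth category ids are available for the patch. The function name `dominant_label` is hypothetical, and the coverage threshold is deliberately left as a required parameter: the value 0.5 used in the example is an illustrative assumption, not a figure taken from the text.

```python
from collections import Counter


def dominant_label(pixel_labels, min_fraction):
    """Return the dominant category of a patch, or 0 if there is none.

    pixel_labels: flat list of per-pixel category ids, where 0 means
    "no known category". min_fraction: share of the patch that a
    category must exceed to count as dominant.
    """
    counts = Counter(label for label in pixel_labels if label != 0)
    if not counts:
        return 0
    label, count = counts.most_common(1)[0]
    return label if count / len(pixel_labels) > min_fraction else 0


# Toy patches of 10 pixels each (category 1 = urban, 2 = residential).
mostly_urban = [1] * 7 + [2] * 3
evenly_split = [1] * 5 + [2] * 5
label_a = dominant_label(mostly_urban, min_fraction=0.5)
label_b = dominant_label(evenly_split, min_fraction=0.5)
```

The evenly split patch receives label 0, which corresponds to the ambiguous cases of Fig. 5 being treated as "category unknown" rather than being assigned fuzzily.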

For each patch $p_k$, information about the relative coordinates of its top left corner and its parent image is stored as metadata. Suppose that there is some way to manually identify $K$ semantically non-overlapping classes or categories $\{c_1, \ldots, c_K\}$ relevant to a specific application. Note that this need not be an exhaustive set of classifications. We aim to build a supervised patch classifier to help automate the categorization process. This is needed since it is impractical to manually label the large number of patches representing the remotely sensed image collection. For this purpose, a small number $T$ of patches of each semantic category $c_i$ are chosen to train 2-D MHMMs, the generative models used for the classification. Details on 2-D MHMMs are presented in Sec. V-B. Here we give a brief overview. For each semantic category, a separate 2-D MHMM is trained using visual features of the corresponding $T$ training patches, resulting in $K$ different models. Now, for each image patch in $\{p_1, \ldots, p_P\}$ in the database, the likelihood $l_i$ of the patch belonging to class $i$ is computed using the trained models. Since the system deals with a non-exhaustive set of categories, those patches that do not belong to any of the trained classes are required to be labeled as class 0, as mentioned before. It makes little sense to train another 2-D MHMM for them, since there may not exist any spatial or textural motif among them. Instead, another supervised classification is performed using Support Vector Machines (SVMs). Two sets of randomly chosen training patches are taken, $C_1$ with manual class label 0 (unknown category), and $C_2$ with any of the labels $\{1, \ldots, K\}$ (known category). The $K$ 2-D MHMM likelihood estimates for each of the samples of the two classes are used as feature vectors for training an SVM. A biased SVM classifier is used for this purpose, such that $C_2$ is predicted with high accuracy at the cost of $C_1$ being predicted with moderate accuracy. The reasons for doing so will be discussed in Sec. V-C. If an unknown patch $p_i$, whose 2-D MHMM likelihood estimation vector is $(l_1, \ldots, l_K)$, is classified as $C_1$ by this biased SVM, it is labeled as $y_i = 0$. Otherwise, its class label is assigned by the rule

$$y_i = \arg\max_k \, (l_k) \qquad (2)$$

In summary, the overall classification process is as follows:

• For a given patch $p_i$, its likelihood $l_k$, $1 \le k \le K$, for each trained model is computed.

• The trained SVM is used to classify the vector $(l_1, \ldots, l_K)$ as $C_1$ or $C_2$.

• If the result is $C_1$ then $y_i = 0$, else $y_i = \arg\max_k (l_k)$.

The class label $y_i$ is stored as metadata for $p_i$. This set of tasks can be performed entirely off-line over the collection of satellite imagery in order to generate a large annotated database of patches. We remind the readers that the metadata for each patch consists of an identifier for its parent image, its location relative to the top-left corner of the parent, and the predicted class label in $\{0, 1, \ldots, K\}$ assigned to it.
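The two-step labeling rule summarized above can be sketched in a few lines. The sketch assumes the K model likelihoods of a patch are already computed; the biased SVM is replaced by an arbitrary callable, and the toy gate below (a simple floor on the best log-likelihood) is a hypothetical stand-in, not the classifier described in Sec. V-C.

```python
def assign_label(likelihoods, is_unknown):
    """Label a patch from its K model likelihoods.

    likelihoods: list with one value per trained category model.
    is_unknown: callable standing in for the biased SVM; it returns True
    when the likelihood vector is judged to belong to no trained category.
    Returns 0 for "unknown", otherwise the 1-based index of the model
    with the highest likelihood.
    """
    if is_unknown(likelihoods):
        return 0
    best = max(range(len(likelihoods)), key=lambda k: likelihoods[k])
    return best + 1


def toy_gate(likelihoods, floor=-100.0):
    """Hypothetical gate: reject when no model explains the patch well."""
    return max(likelihoods) < floor


label_known = assign_label([-50.0, -20.0, -80.0], toy_gate)
label_unknown = assign_label([-500.0, -300.0, -450.0], toy_gate)
```

Keeping the gate as a separate callable mirrors the modular design: the K generative models and the known/unknown decision can be retrained or swapped independently.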

B. On-line Query Processing

Assume that there is an efficient indexing strategy for handling the database of patches and

associated metadata. The simplest way to represent the query is as follows. Given a query patch

the user seeks to find patches within the same semantic category, sorted by their visual similarity.

Rather than searching through the entire database, it suffices for the system to search through

only those patches that belong to the same category as the query. This helps in improving

both accuracy and retrieval speed. The underlying assumption here is that class predictions

are acceptable. As shall be seen in Sec. VII, categorization using 2-D MHMM and SVM is

fairly accurate. One problem is that patches may not always be homogeneous, i.e., a patch may

contain a mixture of different, possibly unknown categories. Our experiments revealed that the

improvement in quality of mining due to categorization is more evident in these ambiguous cases.

This is especially evident for patches partially covered by known and unknown categories.


Fig. 6. Demonstrating retrieval improvement with semantic categorization. Left: Ambiguous urban query. Center: Unwanted

retrieval result (without categorization). Right: Desirable retrieval (with categorization).

Consider the example in Fig. 6. The left patch contains urban area and water, the middle one

water and forest, and the right one urban area and forest. Suppose urban areas are a known/trained

class, while water is not. If the left patch is queried in our system without prior categorization, the

middle patch ranks higher in visual resemblance than the right patch. This happens because (1)

water is present in the first two patches and absent in the third, and (2) the shapes of the coastlines

in the first two patches are similar. This problem is avoided by using prior categorization, which

eliminates the middle patch from consideration for ranking. In some sense, the action performs

semantic filtering to improve mining quality.

The user has two different means of formulating a query:

Within-database query: In the case that an existing patch is used as the query, its semantic category label $y_i$ is already stored and hence known. Patches in the database whose semantic categories are not $y_i$ are eliminated from consideration.

External query: In the case of an external query, the patch is re-sized or trimmed to fit the standard patch dimensions $w \times h$ and adjusted for spectral encoding, if needed. The semantic category label $y_i$ of this patch is predicted using the 2-D MHMM likelihoods and the biased SVM. Again, all but the patches labeled $y_i$ are eliminated from consideration.

The remaining patches are now ranked according to their visual similarity with the query. Visual similarity is computed using the Integrated Region Matching (IRM) measure, which is fast and robust, and can handle large image volumes [35]. The top $n$ matched patches $\{p_{r_1}, \ldots, p_{r_n}\}$ are then displayed for perusal. The choice of $n$ is contingent upon the specific application. Experimentation on choosing $n$ and how precision of retrieval varies with it are discussed in Sec. VII. Note that for the purpose of retrieval, matches for query patches determined as $C_1$ (uncategorized) are also searched for only among the patches labeled $C_1$ in the database.
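The on-line step (filter by category, then rank by visual similarity) can be sketched as below. The IRM measure is not reproduced here; a plain Euclidean distance over hypothetical two-dimensional feature vectors stands in for it, and all names and toy values are assumptions made for illustration. The toy database loosely mimics the situation of Fig. 6.

```python
def euclidean(a, b):
    """Stand-in distance; the system itself uses the IRM measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def retrieve(query_features, query_label, database, top_n,
             distance=euclidean, use_category=True):
    """Return the top_n patches closest to the query.

    With use_category=True only patches whose stored label equals the
    query label are ranked (semantic filtering before similarity search).
    """
    if use_category:
        candidates = [p for p in database if p["label"] == query_label]
    else:
        candidates = list(database)
    candidates.sort(key=lambda p: distance(query_features, p["features"]))
    return candidates[:top_n]


# Label 1 = urban (trained), label 0 = unknown (e.g. water and forest).
toy_db = [
    {"id": "water_forest", "label": 0, "features": [0.9, 0.1]},
    {"id": "urban_forest", "label": 1, "features": [0.2, 0.8]},
]
query = [0.8, 0.3]  # urban patch that also contains water

without_filter = retrieve(query, 1, toy_db, top_n=1, use_category=False)
with_filter = retrieve(query, 1, toy_db, top_n=1)
```

Without the filter the visually closer but semantically unrelated patch ranks first; with the filter it is never considered, and the candidate set to be ranked is also smaller.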

C. Implementation Details and Issues

Here we provide more insight into the practical aspects of implementation, and the problems

associated with it. The system architecture is particularly designed keeping in mind scalability

concerns, that is, the ability to handle large land cover, imagery with many spectral dimensions,

and widely varying patch sizes. For this purpose, all implementation has been carried out using a

combination of the C programming language and the freely available MySQL database system.

The scalability of the image retrieval component, IRM, stems from the fact that the module first

builds a fixed width feature set out of all the images in the database. This consists of color

features, texture features, and scale/rotation invariant shape features. This is stored in a MySQL

database, associated with each image patch, along with their location and parent image details.

The use of MySQL allows random access on the set of features over all the files, and it helps

index as many images as the database system allows, which is very large in case of MySQL.

The index set is rebuilt at regular time intervals, depending upon the frequency of database

updates. Thus the most time-consuming portion of IRM based retrieval is performed only as

an intermittent background process. The real-time image ranking process involves computations

with database columns themselves, and hence is primarily performed as database actions, making

it fast and avoiding the necessity for the images to be read in repeatedly.

The 2D-MHMM and SVM training processes are time consuming, but can be performed

entirely as static processes, which is what is done in our system. These are performed on

relatively small but representative sets of images, and hence are fixed costs, making the training

process independent of the size of the database. Testing with both the 2D-MHMM and the SVM is rapid, and hence scales well to large databases. Yet, it is not necessary to perform these in real-time, since the resulting classification does not change unless the training process is re-done. Therefore, once the training of these models is completed, all image patches are categorized once. This category label is stored in the MySQL database along with the IRM features and meta-data, indexed by the image. This way, the process of prior categorization followed by retrieval is achieved simply by a database filtering operation prior to the IRM distance computation. Zooming and panning in the user interface have been sped up by the use of Haar wavelet transforms, as explained in Sec. VI.
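The idea of storing the category label with each patch and filtering on it before any distance computation can be sketched with a small relational table. The sketch uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it is self-contained; the table layout, column names and rows are hypothetical and are not the schema of the actual system.

```python
import sqlite3

# In-memory database standing in for the MySQL patch store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patches ("
    " patch_id INTEGER PRIMARY KEY,"
    " parent_image TEXT,"
    " x_offset INTEGER,"
    " y_offset INTEGER,"
    " label INTEGER)"
)
# An index on the label makes the category filter cheap on large tables.
conn.execute("CREATE INDEX idx_patches_label ON patches (label)")

rows = [
    (1, "scene_a", 0, 0, 1),
    (2, "scene_a", 64, 0, 0),
    (3, "scene_b", 0, 64, 1),
    (4, "scene_b", 64, 64, 3),
]
conn.executemany("INSERT INTO patches VALUES (?, ?, ?, ?, ?)", rows)


def candidates_for(connection, label):
    """Return ids of patches whose stored category equals the query's."""
    cursor = connection.execute(
        "SELECT patch_id FROM patches WHERE label = ? ORDER BY patch_id",
        (label,),
    )
    return [row[0] for row in cursor.fetchall()]


same_category = candidates_for(conn, 1)
```

Only the ids returned by `candidates_for` would then be passed on to the similarity ranking, so the cost of the ranking step depends on the size of one category rather than on the whole database.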

Issues arise when an external patch is given as a query. In this case, the feature extraction program

needs to be executed on the patch, followed by the 2D-MHMM based categorization, and finally

the SVM based categorization. These steps can together take approximately 3 seconds per image.

Since these programs are executed only once, this is the maximum wait time for the user for

external queries, which is typically tolerable. Secondly, even for the off-line processes,

when there are over a hundred thousand images, the 2D-MHMM categorization can take a

significant amount of time. We note that with the images divided into patches, categorization

can be entirely parallelized, with no dependencies across patches. In our case, the images are

categorized using cluster computers. Each set of patches associated with one Landsat image is

assigned to a separate node in the cluster. This way, we completed the categorization of all 12

Landsat images in one-twelfth the time. We note that categorization is essentially a static process

if the patch database is not being regularly updated. Hence this process can be run once, without

the need for it to run as a background process at regular intervals. Finally, we note that when

an external image query is provided which is significantly smaller in size than those used in the

existing database, the required re-sizing has negative effects on the feature extraction process.

In particular, visual features get distorted when scaled to a larger size, and lose detail. With the

features extracted on this scaled version, the retrieval results are not always favorable. This is

the reason external query patches of very similar or same size are preferred.
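Because patches have no dependencies on one another, the off-line categorization can be distributed by parent image. The sketch below is only illustrative: it assumes a thread pool as a stand-in for cluster nodes, and `categorize_patch` is a hypothetical placeholder classifier, not the 2D-MHMM and SVM pipeline.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def categorize_patch(patch):
    """Hypothetical stand-in classifier: thresholds one brightness value."""
    return 1 if patch["brightness"] > 0.5 else 0


def categorize_image(patches):
    """Categorize every patch that belongs to one parent image."""
    return {p["patch_id"]: categorize_patch(p) for p in patches}


def categorize_all(patches, max_workers=4):
    """Group patches by parent image and process each group independently."""
    by_image = defaultdict(list)
    for patch in patches:
        by_image[patch["parent_image"]].append(patch)
    labels = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per parent image, mirroring one image per cluster node.
        for result in pool.map(categorize_image, list(by_image.values())):
            labels.update(result)
    return labels
```

Since each task touches only its own group of patches, the groups can be processed in any order and the results merged afterwards.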

V. CATEGORIZATION AND RETRIEVAL

We now proceed to elaborate on the specific techniques used in our system. These include the

image normalization process, the 2-D MHMM based categorization process, biased SVM based

categorization, and the IRM distance for computation of visual relevance. Note that scalability

is one of the key advantages of our generative modeling and image retrieval approach. The 2D-

MHMM based categorization has been applied to a 600 image category problem in [22]. The

IRM based image retrieval system [35] has been applied to over 1 million images for real-time

retrieval. The use of multi-spectral imagery involving higher dimensions also does not pose a

restriction to either the generative modeling or retrieval components. Hence our system has the

potential for handling a large number of land cover categories, a very large number of Landsat

images, and higher dimensional spectral bands than what have been used here, without any major

scalability issues.


A. Normalization

The need for image normalization arises from the inherent sensing variations due to the sun

elevation angle and other atmospheric effects. As Landsat 7 revolves around the Earth, it

passes over a particular region at the same time of day, each day. Each column in the Universal

Transverse Mercator (UTM) grid is known as a path. Images within each UTM zone are captured

separately. Images of zones that lie on the same path are fairly consistent because the sun

elevation angle does not vary much during the time the satellite traces one path. However, as the

satellite traverses along different paths, the sun elevation angle keeps changing. With this change,

the amount of incident light on the earth’s surface changes and hence the reflectance changes

with it. As a result, the images lying on different paths appear considerably different.

Fig. 7. Satellite images from different UTM paths within the US. Left: Raw NIR image from the East Coast (path [...]). Right:

Raw NIR image from the West Coast (path [...]). (Courtesy: GLCF)

Fig. 7 shows raw NIR images from two different paths. The difference in color between

the two images is apparent. It is expected that the color features extracted from patches of the

same semantic category but from the two different paths will be considerably different. Without

normalization, most results for such a query patch will be drawn from the same path as that

of the query. This phenomenon is likely to be disadvantageous, given the typical applications

of satellite image mining. It is further noted that the NIR images in their raw form have low

brightness and contrast.

To compensate for poor contrast of the NIR images and lighting variations across UTM

paths due to sun elevation and other atmospheric effects, we explore two different normalization

procedures. These include a standard digital number (DN) to at-satellite reflectance conversion

procedure commonly used in the remote sensing community, and a generic histogram normalization

procedure practised in the image processing community.

Fig. 8. Left: Enhanced West Coast images. Center: Enhanced East Coast images. Right: Normalized West Coast images

(Reference path: East Coast). (Courtesy: GLCF)

The Landsat 7 imagery data consists of digital numbers (DN) representing intensity at each

pixel over each spectral band. To compensate for the effects mentioned above, header information

about the particular Landsat images can be utilized. One such compensation method is DN to

at-satellite reflectance conversion [16]. For each band of the multi-spectral imagery, the DN can

be converted to reflectance values $\rho$ using the following equation:

$$\rho = \frac{\pi \, L \, d^{2}}{E_{sun} \, \sin(\theta)} \qquad (3)$$

where $L = \mathit{gain} \times DN + \mathit{bias}$ is the at-satellite radiance for the pixel, $\theta$, $\mathit{gain}$, and $\mathit{bias}$ are the sun elevation,

gain and bias for the particular spectrum/image obtained from its header, $E_{sun}$ is the solar

irradiance obtained from [33], and $d$ is the normalized Sun-Earth distance over the year. By this

conversion it has been shown that the variations due to elevation and other effects over different

Landsat 7 images reduce. We apply this conversion to all bands of the training and test data,


and then observe its effect on retrieval over a subset of the test data.
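A minimal sketch of the conversion follows. It assumes the commonly used form in which radiance is gain × DN + bias and reflectance is π L d² / (E_sun sin θ), with θ the sun elevation. The function and parameter names are hypothetical, and real inputs would come from the image header and from published solar irradiance tables.

```python
import math


def dn_to_radiance(dn, gain, bias):
    """At-satellite radiance from a digital number (linear calibration)."""
    return gain * dn + bias


def radiance_to_reflectance(radiance, sun_elevation_deg, esun, earth_sun_distance):
    """At-satellite reflectance from radiance, under the assumed formula."""
    sin_elevation = math.sin(math.radians(sun_elevation_deg))
    return (math.pi * radiance * earth_sun_distance ** 2) / (esun * sin_elevation)


def dn_to_reflectance(dn, gain, bias, sun_elevation_deg, esun, earth_sun_distance):
    """Convenience wrapper chaining the two steps for one pixel of one band."""
    radiance = dn_to_radiance(dn, gain, bias)
    return radiance_to_reflectance(
        radiance, sun_elevation_deg, esun, earth_sun_distance
    )
```

A lower sun elevation gives a smaller sine and therefore a larger reflectance for the same DN, which is how the conversion compensates for darker scenes.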

The second approach is simple and more ad-hoc, and follows a procedure that is common

in the image analysis community. Prior to dividing an image into rectangular patches, [...] of

the histogram for each band (leaving out [...] on each side) is linearly stretched over the entire

available intensity range, i.e., $(0, 255)$. We note that although this improves contrast of the NIR

images, it potentially increases the difference between images originating from different paths.

To compensate for this, histogram centralization is performed before the linear stretching. More

specifically, using all images from a chosen reference path, the means $\mu_R$, $\mu_G$, and $\mu_B$ of the red, green,

and blue spectral bands respectively are computed. Then, for each image in the database, its

histograms of the $R$, $G$, and $B$ bands are shifted to be centered at $\mu_R$, $\mu_G$, and $\mu_B$ correspondingly.

After the shifting operation, linear stretching as discussed above is applied. The results of

normalization on sample images are shown in Fig. 8. We apply this normalization first on a

subset of the test images.
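The two steps of this ad-hoc normalization, centering followed by linear stretching, can be sketched for a single band as follows. This is an illustration under stated assumptions: the band is a flat list of pixel values, the fraction clipped from each tail (`tail_fraction`) is a free parameter chosen here for illustration only, and the output range 0 to 255 assumes 8-bit data. All names are hypothetical.

```python
def shift_to_mean(band, target_mean):
    """Shift all values so that the band mean equals target_mean."""
    offset = target_mean - sum(band) / len(band)
    return [value + offset for value in band]


def percentile_bounds(band, tail_fraction):
    """Values bounding the histogram once tail_fraction is cut from each side."""
    ordered = sorted(band)
    skip = int(len(ordered) * tail_fraction)
    return ordered[skip], ordered[len(ordered) - 1 - skip]


def linear_stretch(band, tail_fraction, out_min=0.0, out_max=255.0):
    """Stretch the retained part of the histogram over [out_min, out_max]."""
    low, high = percentile_bounds(band, tail_fraction)
    if high == low:
        return [out_min for _ in band]
    scale = (out_max - out_min) / (high - low)
    # Values in the discarded tails are clipped to the output range.
    return [
        min(out_max, max(out_min, out_min + (value - low) * scale))
        for value in band
    ]


def normalize_band(band, reference_mean, tail_fraction=0.01):
    """Center on the reference-path mean, then stretch."""
    return linear_stretch(shift_to_mean(band, reference_mean), tail_fraction)
```

Applying `normalize_band` to each band with that band's reference-path mean reproduces the order of operations described above, with centering always preceding the stretch.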

Having applied both normalization techniques separately on a set of two Landsat images (one

each taken from the East Coast and West Coast), retrieval is performed over the entire pooled

set of test patches in both cases. It is observed that the average precision and recall over all

categories showed better performance with the ad-hoc normalization than with the standard

DN to at-satellite reflectance conversion. While this may be due to the specific visual features

extracted in our system, we are unaware of an accurate explanation for this. Nonetheless, due

to the improved performance, we use the ad-hoc normalization procedure for the rest of our

experiments. Adapting a more standard normalization procedure into our system is part of future

work.

B. Categorization Using 2-D MHMMs

The 2-D multiresolution hidden Markov model (2-D MHMM) has been used for generic

image categorization. This section presents a brief overview of the model and its application to

categorization. For a more detailed discussion on 2-D MHMMs, please refer to [21].

Under 2-D MHMM, each image is characterized by several layers, i.e., resolutions, of feature

vectors. The feature vectors within a resolution reside on a 2-D grid. The nodes in the grid

correspond to local areas in the image at that resolution. A node can be a pixel or a block of

pixels. The feature vector extracted at a node summarizes local characteristics around a pixel or

within a block. The 2-D MHMM specifies the distribution of all the feature vectors across all the

resolutions by a spatial stochastic process. Both inter-scale and intra-scale statistical dependence

among the feature vectors are taken into account in this model. These dependencies are critical

for judging the semantic content of satellite image patches because texture or spatial structure

in these patches can be captured at a larger scale than at a block or pixel level. The inter- and

intra-scale dependencies of the feature vectors are captured by assuming hidden layers of states

at all the resolutions. The feature vectors, which are actually observed, are assumed to follow

Gaussian distributions conditioned on given states at the corresponding resolution and position.

The statistical dependence among states across scales is modeled by a Markov chain, and that

within each scale is modeled by a Markov mesh.

Fig. 9. A conceptual diagram of the 2-D MHMM based modeling process. Arrows indicate the intra-scale and inter-scale

dependencies among visual features.

For the experiments, a three-level pyramidal structure in the model was used. A schematic

diagram for this process can be found in Fig. 9. The number of states at the lowest resolution is 1, and the number of states for each of the two higher resolutions is 4. For feature extraction, 4x4

blocks are taken and the visual features are characterized by a six-dimensional feature vector.

This vector consists of three moments of the wavelet coefficients in the high frequency bands

(representing texture) and the three average color components in the LUV space. As discussed

earlier, instead of taking RGB bands, near-IR, red and green bands are taken from the satellite

image spectra. This is motivated by the fact that traditionally these bands have been visualized

on screen for manual classification as if they were RGB bands. It is thus reasonable to convert

these bands to LUV in the same way as we would convert from RGB. For more details on

the feature extraction process, readers are referred to [35]. The likelihood for an image given a

trained 2-D MHMM is computed as explained in [21]. The computed scores for an image over

all trained models are then used in the SVM classification and the eventual category prediction.

C. Separating Known (C2) and Unknown (C1) Classes Using SVM

The training of the 2-D MHMMs is performed on a non-exhaustive set of $K$ categories. Generating a training set covering all possible land-cover categories is time-consuming and

expensive, if at all possible. Hence it is preferable to limit the scope to only those classes that

are of interest. As a result, among the image patches there exist many that represent categories

outside of this set. Also, there are patches that are a blend of multiple categories, with none

dominating. In both cases, these patches should ideally be labeled $0$ ($C_1$). As mentioned, all

patches labeled $\{1, \ldots, K\}$ are considered part of $C_2$.

Using the maximum likelihood approach, we can always assign a category label between $1$

and $K$ to every patch, even if it is actually part of $C_1$ (unknown). This is not desirable, and

as explained in Sec. IV-A, neither can we train another 2-D MHMM to model patches in $C_1$.

A naive approach to solving this problem is based on the following assumption. Given a patch

which does not resemble any trained category, the likelihood estimates from all the models

tend to be low. Therefore, if all likelihood scores are below a certain threshold, the patch can be

assigned class $0$ ($C_1$). However, not surprisingly, it is found that for a given patch, the likelihood

estimates are not independent of each other. This may be due to the fact that the 2-D MHMMs

are trained on samples that have some degree of visual resemblance across categories.

To solve this problem, we employ a formal classification approach. Let the set of likelihood

estimates for a given patch $p_t$ be its feature vector $L_t = \{l_1, \ldots, l_K\}$. In the experiments, 4 classes

were considered. We can plot the 4-D feature vectors of [...] patches manually labeled as $C_1$ or $C_2$. The plots, taken two dimensions at a time, are shown in Fig. 10. Clearly, a non-linear method

can better model the class separation than thresholding or other linear methods. Classification was

attempted using Quadratic Discriminant Analysis (QDA) and Logistic Regression. The accuracy

rate with Logistic Regression turned out to be the best at approximately [...]%, with accuracy

of classifying only $C_2$ at about [...]%. Classification using SVM was then performed on the

data using the LibSVM software package [6], using the RBF kernel $k(x_i, x_j) = e^{-\gamma \| x_i - x_j \|^{2}}$. The

results were further improved, at [...]% overall accuracy and [...]% accuracy at classifying $C_2$.
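For reference, the RBF kernel in its usual form, exp(-γ ||x − y||²), can be evaluated on a pair of likelihood feature vectors as in the sketch below. The vectors and the value of γ used in any call are illustrative assumptions and are not values reported here.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(vectors, gamma):
    """Pairwise kernel values for a small list of feature vectors."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]
```

The kernel equals 1 for identical vectors and decays towards 0 as the likelihood vectors move apart, which is what lets the SVM draw a non-linear boundary in the 4-D likelihood space.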

[Fig. 10 plot area: six scatter panels, one per pair of likelihood dimensions (Dimension 1 vs. 2, 1 vs. 3, 1 vs. 4, 2 vs. 3, 2 vs. 4, and 3 vs. 4), each with legend C1, C2.]

Fig. 10. Plot of the 4-D likelihood feature vectors L for C1 (black/circles) and C2 (red/crosses). The six unordered pairs of

dimensions are shown.

When a patch is classified as $C_1$ it is removed from further consideration for retrieval. To

be on the safe side, we would rather have some $C_1$ patches classified as $C_2$

than have some valid $C_2$ patches mistakenly classified as $C_1$ and eliminated from consideration.

Therefore, the goal is to achieve higher accuracy in detecting $C_2$. This increases the search

space to some extent while eliminating a significant chunk of unwanted patches. Hence we desire

to have a biased classifier. One way to introduce weights into the SVM learning process is to

sample the training classes accordingly. The bias is introduced by sampling $C_1$ and $C_2$ in the

approximate ratio [...] for training the SVM, resulting in a total of about [...] samples (with

repetition). In this manner, we achieve high accuracy of classifying $C_2$ ([...]%) while for $C_1$ the

score is moderate ([...]%). Hence less than 4% of the patches within the $K$ trained categories

will be mistakenly eliminated from consideration. It is presumed that this is not a problem since

patches of one category in a satellite image are usually spread over a large region. It is highly

unlikely that all patches in one region will be eliminated.
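Biasing a classifier through the class proportions of its training set can be sketched as below. The sketch assumes sampling with repetition from two labeled pools; the share given to the unknown class and the total sample count are hypothetical parameters, not the ratio used in the experiments.

```python
import random


def biased_sample(unknown_pool, known_pool, unknown_share, total, seed=0):
    """Draw a training set with repetition, fixing the share of unknown patches."""
    rng = random.Random(seed)
    n_unknown = round(total * unknown_share)
    n_known = total - n_unknown
    # random.choices samples with replacement, so items may repeat.
    sample = [(item, "C1") for item in rng.choices(unknown_pool, k=n_unknown)]
    sample += [(item, "C2") for item in rng.choices(known_pool, k=n_known)]
    rng.shuffle(sample)
    return sample
```

Giving the known class the larger share pushes the decision boundary so that fewer known patches are rejected, at the cost of letting more unknown patches through.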


D. Retrieval using IRM

The Integrated Region Matching (IRM) measure [35] used in the SIMPLIcity system is

employed for image patch mining. IRM is a scalable and robust region-based image similarity

measure. IRM attempts to integrate visual features of each segmented region in the images to

provide robust region-based image matching, with low dependence on reliable segmentation. The

scheme allows multiple regions of one image to be matched with several regions of the other

image. The overall similarity measure is computed as a weighted sum of the similarity between

region pairs, with weights determined by the significance of regions.

Each image is segmented using $k$-means clustering [17] on the color component vectors, and

each generated segment is represented by a nine-dimensional feature vector summarizing

color and wavelet based texture properties. The feature vectors used include the same six texture

and color features used in 2-D MHMM, and three additional features characterizing the shape of

the segment. The matching is performed by a soft similarity measure in the following manner.

For two images $I_1$ and $I_2$, suppose they are segmented into $k_1$ and $k_2$ regions respectively. The

IRM distance between images $I_1$ and $I_2$ is then given by

$$d(I_1, I_2) = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} s_{i,j} \, d_{i,j} \quad \text{subject to} \quad \sum_{j=1}^{k_2} s_{i,j} = p_i, \; 1 \le i \le k_1 \quad \text{and} \quad \sum_{i=1}^{k_1} s_{i,j} = p'_j, \; 1 \le j \le k_2 \qquad (4)$$

where $d_{i,j}$ is the distance between the feature vectors characterizing region $i$ of image $I_1$ and

region $j$ of image $I_2$, and $s_{i,j}$ is the significance score for that region pair. More specifically,

denoting the two regions of a pair as $r_i$ and $r_j$ respectively, the values of $d_{i,j}$ are computed as follows:

$$d_{i,j} = d_t(r_i, r_j) \cdot g(d_s(r_i, r_j)) \qquad (5)$$

where $d_t(r_i, r_j)$ is the color and texture distance given by

$$d_t(r_i, r_j) = \frac{1}{6} \sum_{n=1}^{6} \left( f_n(r_i) - f_n(r_j) \right)^{2} \qquad (6)$$

with $f_1$, $f_2$, and $f_3$ being the mean L, U, and V color components of the region, and $f_4$, $f_5$ and $f_6$ being the square-roots of the second-order moments of wavelet coefficients in the HL, LH, and

HH bands respectively. The shape distance $d_s(r_i, r_j)$ is given by

$$d_s(r_i, r_j) = \frac{1}{3} \sum_{\gamma=1}^{3} \left( \frac{l(r_i, \gamma)}{L_\gamma} - \frac{l(r_j, \gamma)}{L_\gamma} \right)^{2} \qquad (7)$$

where $l(r, \gamma)$ is the normalized inertia [13], invariant to scaling and rotation, defined by

$$l(r, \gamma) = \frac{\sum_{x \in r} \| x - \bar{x} \|^{\gamma}}{|r|^{1 + \gamma/2}} \qquad (8)$$

where $|r|$ denotes the size in pixels of the region $r$, $\bar{x}$ denotes the mean of $x$, and $L_\gamma$ denotes the $\gamma$-th

order normalized inertia of spheres. The function $g(\cdot)$ in Eq. 5 is used to make the color-texture

distance coherent with the shape distance in the overall $d_{i,j}$ computation, with the following

form being found to be empirically appropriate, as detailed in [35]:

$$g(d) = \begin{cases} 1 & d \ge 0.5 \\ 0.85 & 0.2 < d \le 0.5 \\ 0.5 & d < 0.2 \end{cases} \qquad (9)$$

The significance credits $s_{i,j}$ determine how important a role each pair of regions plays in the

calculation, constrained by $p_i$ and $p'_j$, which are the significance of regions $i$ and $j$

within $I_1$ and $I_2$ respectively. To calculate the significance credit between a pair of segments, the most similar

highest priority (MSHP) principle [35] is used for assignment of region significance.
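The weighted-sum form of the IRM distance and the MSHP assignment can be illustrated with a short sketch. It assumes the greedy reading of MSHP, in which region pairs are visited in order of increasing distance and each pair receives as much of the remaining significance as both regions still allow. The region-pair distances and the region significances are taken as given inputs, and all names are hypothetical.

```python
def mshp_credits(dist, p1, p2):
    """Greedy significance credits: the most similar pair is served first."""
    remaining1, remaining2 = list(p1), list(p2)
    credits = [[0.0] * len(p2) for _ in p1]
    pairs = sorted(
        (dist[i][j], i, j) for i in range(len(p1)) for j in range(len(p2))
    )
    for _, i, j in pairs:
        amount = min(remaining1[i], remaining2[j])
        if amount > 0:
            credits[i][j] = amount
            remaining1[i] -= amount
            remaining2[j] -= amount
    return credits


def irm_distance(dist, p1, p2):
    """Sum of region-pair distances weighted by their significance credits."""
    credits = mshp_credits(dist, p1, p2)
    return sum(
        credits[i][j] * dist[i][j]
        for i in range(len(p1))
        for j in range(len(p2))
    )
```

With two regions per image, equal significances of 0.5, and distances `[[0.1, 0.9], [0.8, 0.2]]`, the two close pairs absorb all the credit, so the poorly matching pairs do not contribute to the result.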

In region-based image similarity measures, segmentation quality typically plays a major role

in the similarity scores. In the case of IRM, however, there is a high degree of robustness to poor

segmentation. Difficulty in segmentation can be particularly acute in the case of noisy remotely

sensed images, and hence robustness to unsatisfactory segmentation is important. The use of

the IRM distance for ranking patches by visual similarity helps generate high precision retrieval

in our system, as observed in our experiments. Another possible reason for the demonstrated

success of IRM in the remote sensing domain is the importance given to texture and localized

shape features in the similarity computation.

VI. USER INTERFACE

A screenshot of the user interface can be found in Fig. 11. Readers are encouraged to

experience the demonstration at the aforementioned location. Initially, a random set of patches

from among the database collection are shown. The user can then choose to mine the imagery

in three different ways:

(1) The user can click on a patch to retrieve other visually similar patches. Along with the

retrieved patches the user also has access to their metadata. This process of browsing can continue

as a chain of clicks to arrive at the required set of patches and/or parent images of interest.


Fig. 11. Interface with two possible options to query the patch database, (a) by explicit specification, or (b) by clicking on a

retrieved patch.

Fig. 12. Interface for zooming and panning, showing urban (blue), residential (yellow), and the retrieved (green) patch. Left:

Original 28.5 m resolution. Right: Zoomed out at [...] m. (Courtesy: GLCF)

(2) The user has the option of clicking the random button to display a new set of random

patches retrieved from the database. This is one way to explore the database, understand the

distribution of land cover within the database, and get an overview of the variations within the

images captured.

(3) If the user wishes to enter a query patch not in the database, this can be done as our

system supports external queries.


To add to the convenience in browsing, six levels of zooming are supported by our system.

A user can enter the zooming mode by clicking on the location coordinates specified below

a retrieved patch. This is an effective way to explore geographic regions in the proximity of

patches of interest. In other words, the zooming capability allows users to get a more global view

of regions of interest. Panning is implicitly supported by the interface as well, by clicking on

any portion of the image to move gaze to that region at the next level of zoom. A screenshot

of the zooming/panning interface can be viewed in Fig. 12. The best way to experience this

interface is through usage of our prototype.

Zooming and panning operations are optimized for real-time response in the following man-

ner. Haar wavelet transforms are used to achieve zooming, since they preserve localization

of data [36]. These transforms decompose the images into sums and differences of neighbor-

hood pixels. On a given query, the system only needs to retrieve the quantized coefficients of

the queried region for reconstruction. Since the processing for categorization and zooming is

done only once during setup, and only localized parameters are required, the response time is

considerably low. Using the metadata associated with the patches, such as precise geographic

locations or semantic categories (either manually provided or automatically predicted), more

advanced querying capabilities can be incorporated. For example, useful extensions could be

the capability to formulate queries such as “Find the closest urban area near this location” or

“Find a water source nearest to this residential area”. When accurate orthorectified imagery are

available, interface extensions to support such complex querying within our framework should

be fairly straightforward.
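The sums-and-differences decomposition used for zooming can be illustrated with one level of a 2-D Haar transform. The sketch below assumes an unnormalized averaging variant on 2 × 2 neighborhoods and an image with even dimensions, and it omits quantization. The function names are hypothetical.

```python
def haar_decompose(image):
    """One Haar level: 2x2 block averages plus three difference bands."""
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(image), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            p, q = image[r][c], image[r][c + 1]
            s, t = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((p + q + s + t) / 4.0)  # zoomed-out pixel
            h_row.append((p - q + s - t) / 4.0)  # left minus right
            v_row.append((p + q - s - t) / 4.0)  # top minus bottom
            d_row.append((p - q - s + t) / 4.0)  # diagonal difference
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag


def haar_reconstruct(approx, horiz, vert, diag):
    """Invert haar_decompose, recovering the finer-resolution image."""
    image = []
    for r in range(len(approx)):
        top, bottom = [], []
        for c in range(len(approx[0])):
            a, h, v, d = approx[r][c], horiz[r][c], vert[r][c], diag[r][c]
            top += [a + h + v + d, a - h + v - d]
            bottom += [a + h - v - d, a - h - v + d]
        image += [top, bottom]
    return image
```

The averages band alone is the view at the next zoom-out level, while the three difference bands for a queried region are enough to reconstruct that region at the finer level, so only localized coefficients need to be read.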

VII. EXPERIMENTAL RESULTS

For all our experiments, 12 Landsat 7 ETM+ multi-spectral images with 28.5 m resolution

are used. Six images from the East Coast (path 14), and six others from the West Coast (path

46) of the US are used. Since there is considerable difference in the sensed data from each path,

normalization is employed to homogenize the quality, as explained previously. The selection

of images in this manner is aimed at demonstrating the effectiveness of our system given the

varying sun elevation challenge. Our system supports four semantic categories (i.e., $K = 4$),

namely mountain, crop field, urban area, and residential area. As described in Sec. III, the

NIR bands are chosen for image representation. The pixel dimensions of each image are

[...] × [...], with geographic dimensions being approximately [...] × [...].

Ground-truth categorization is not available readily for patches. This is required for training and

testing the 2-D MHMM based categorization process, as well as for measuring the precision of

retrieval using the IRM measure. In order to build a manual categorization of the patches into the

specified classes, an expert working on satellite image analysis in a government research lab gave

two arbitrarily chosen subjects a tutorial on how to distinguish between the 4 semantic categories.

The satellite images are divided into square patches. The subjects then independently label each

test patch as one of $\{1, 2, 3, 4\}$, or $0$ in case it belongs to none of the classes or has no

dominant coverage, keeping in mind the [...]% coverage policy (Sec. IV). The final category labels

are determined by taking the overlap of the sets as it is, and in case of disagreements, randomly

choosing one of the two. With the high quality of the ETM+ images, it is not hard to visually

identify the four categories used. The overlap between these two sets from the independent

subjects is approximately [...]%. In the absence of a “gold standard”, this serves as a “silver

standard”.

The choice of what patch size to divide the images into is critical. A patch should be large

enough to encapsulate the visual features of a semantic category. At the same time it should

be small enough to include only one semantic category in most cases. Instead of arbitrarily

selecting a patch size, we explore the effectiveness of 4 different patch sizes, each covering a

different level of granularity, in our system. In particular, we explore using patch sizes 16 × 16, 32 × 32, 64 × 64, and

128 × 128. For training, we initially experiment with using sample

sizes from as low as 28, to as high as 90 per category, by testing the built models on a small

validation set of 64 × 64 image patches. We observe a clear trend, that beyond 50 training patches,

the classification accuracy does not show noticeable improvement, despite a significant increase

in computational cost. At 50 training samples itself, the categorization results are satisfactory.

Hence 50 samples of each of the four categories, and for each of the 4 patch sizes, are

used for training the 2-D MHMMs to yield models $M_1$, $M_2$, $M_3$ and $M_4$ in each case. A biased

SVM is trained using the procedure described in Sec. V-C and used in the likelihood-based class

prediction process. In order to test the effectiveness of categorization over each training size, and

to eventually choose an appropriate patch size for the system, [...] test patches in each case are

classified using the built models, and the results compared with the generated “silver standard”.

The confusion matrices for the 4 classes as well as for the class $0$ ($C_1$), for each of the 4 patch

sizes, are shown in Table II. Note that the accuracy of classifying $C_1$ patches reflects on the

model accuracy of both 2-D MHMMs and the biased SVM. A measure of accuracy often used

in the remote-sensing community to evaluate multi-class classification performance is Cohen’s

Kappa Coefficient [7], approximated by $\hat{\kappa}$ [8], defined as

$$\hat{\kappa} = \frac{N \sum_{i=1}^{K} x_{i,i} - \sum_{i=1}^{K} (x_{i+} \times x_{+i})}{N^{2} - \sum_{i=1}^{K} (x_{i+} \times x_{+i})}$$

where $K$ is the number of classes, $X$ denotes the confusion matrix obtained for this $K$-class

classification problem, $N$ is the total number of test samples, $x_{i,j}$ indicates the observation in row $i$, column $j$, $x_{i+}$ is the total of row $i$, and $x_{+i}$ is the total of column $i$. More specifically,

$$x_{i+} = \sum_{j=1}^{K} x_{i,j} \quad \text{and} \quad x_{+i} = \sum_{j=1}^{K} x_{j,i}, \qquad (10)$$

which denote the row sums and column sums of the confusion matrix respectively. We use this

measure to select a patch size that maximizes $\hat{\kappa}$, i.e., produces the best classification for the

categories in question at the corresponding level of granularity.
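As a worked illustration, the kappa estimate can be computed from a square confusion matrix as follows. The sketch assumes the standard estimate (N Σ x_ii − Σ row_i × col_i) / (N² − Σ row_i × col_i), where row_i and col_i are the totals of row i and column i. The function name and the example matrices are hypothetical.

```python
def kappa_hat(confusion):
    """Kappa estimate from a square confusion matrix given as a list of rows."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = sum(confusion[i][i] for i in range(size))
    row_totals = [sum(confusion[i]) for i in range(size)]
    col_totals = [sum(confusion[j][i] for j in range(size)) for i in range(size)]
    # Agreement expected by chance, from the products of the marginal totals.
    chance = sum(r * c for r, c in zip(row_totals, col_totals))
    return (total * diagonal - chance) / (total ** 2 - chance)
```

For the two-class matrix `[[45, 5], [10, 40]]` the raw agreement is 85 out of 100, and discounting the agreement expected from the marginals alone gives a kappa estimate of 0.7.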

When taking only classes $1$ to $4$, the $\hat{\kappa}$ coefficients are [...], [...], [...], and [...]

respectively for patch sizes 128 × 128, 64 × 64, 32 × 32, and 16 × 16. When including

class $0$ ($C_1$) as well into consideration, the corresponding $\hat{\kappa}$ values are [...], [...], [...], and [...]. While these results are overall very encouraging, and the differences across

patch sizes are not significant, the size 128 × 128 tends to produce the best categorization among

them by this measure. Moreover, this patch size was preferred by the analyst we consulted with,

for visualization purposes. Hence for building the system prototype, and for the remainder of

our experiments, patch sizes of 128 × 128 are used.

Sample results obtained when querying our system using an urban patch (Fig. 13) and a

mountain patch (Fig. 14) are shown. To analyze the improvement in speed of retrieval due to

prior categorization with 2-D MHMMs, [...] queries were made for patches from each category,

and the retrieval times were noted. In the first run, 2-D MHMM categorization was not used.

Thus, for each query patch, the entire database was searched to find similar patches. In the

second run, 2-D MHMM categorization was used to limit the search to only those patches

within the same predicted class. Table III shows the average time per query for patches of each

category for the two runs. As expected, the average time of retrieval for each category is fairly

similar because the size of the database to be searched for each patch is the same. Table III


TABLE II

CLASSIFICATION RESULTS (CONFUSION MATRIX) USING 2-D MHMM WITH 4 DIFFERENT PATCH SIZES.

128 × 128

Mountain Crop Urban Residential Others Accuracy

Mountain (1) �l�áà;Ô Ð �l� � �5Ò â²�ÃÑ Ò;ã

Crop (2) Ô â;Ð²� �áÏ �;� ��;Ô Ðl�¦Ñ Ï;Ò;ã

Urban (3) � � Ï(�áÐ ä²â Ï;Ï ä(�;Ñ Ò;Òlã

Residential (4) Ï�� Ï²â à;Ð�� Ï²� Ð;Ï�Ñ à5Òlã

Others (0) Ò²� �5ä àlà ÏlÒ �l�;�� ÐlÐ¦Ñ âl�5ã

64 × 64

Mountain Crop Urban Residential Others Accuracy

Mountain (1) �l�áàl� Ô Ô Ô �l� âl�¦Ñç�ä;ã

Crop (2) Ô �áÔ��;� à Ò�� à5Ò Ð;Ï;ã

Urban (3) Ô � Ï²àl� �là Ï²� �l�¦Ñ �5älã

Residential (4) �Ò â àlà à5ä²Ð Ï²� Ðlà¦Ñ Ô¦�áã

Others (0) �;à Òlä � �áÒ �l�áà²� Ð;ä�Ñ Òlà;ã

32 × 32

Mountain Crop Urban Residential Others Accuracy

Mountain (1) �l�;�ä Ò Ô Ï ä²� â;à�Ñ Ô;Ð5ã

Crop (2) � â;â²� ��Ô à;â �áÒlà Ð5Ï(Ñ Ð;à5ã

Urban (3) Ô �áÒ ÏlÏ²� Òlä à¦� �lÐ¦Ñ Ò;ã

Residential (4) �áà �áÏ àl� à5ä²Ô ÏlÐ ÐlÔ¦Ñ �¦�áã

Others (0) ä²� ä²� Ï²� ��à ��Ô;Ðl� Ð;Ò�Ñç�áÐ;ã

16 × 16

Mountain Crop Urban Residential Others Accuracy

Mountain (1) �l�Ï;Ï � �l� �áà Ò²Ô âlà¦Ñ Ò;ã

Crop (2) Ô �áÔlÔ;à â �5� �O�(Ï Ð;à�Ñ ÒlÐ5ã

Urban (3) Ò Ï²à ��â5ä ä� ÏlÐ �lÔ¦Ñ Ï²�5ã

Residential (4) â ��à �5Ï à5älÏ Ïlà Ð��;Ñ Ô5Òlã

Others (0) â¦� âlâ ��â �áÏ ��Ô5Ò� Ð;Ï�Ñ �5älã

shows the average times per query across categories are fairly consistent for the first run. There

is considerable improvement in retrieval speed for each category in the second run due to the

reduced search space induced by categorization. The speed difference among categories for the

second run is no longer consistent, since the number of patches varies across categories. It is

important to note that the largest number of patches belong to the $C_1$ category (uncategorized).


Fig. 13. Ordered retrieval results on an Urban query patch. Patch labels consist of (1) Parent image, (2) Local Coordinates

and, (3) IRM distance.

This shows the significance of categorization of patches into uncategorized ($C_1$) and categorized

($C_2$) using SVM. Without this, the number of patches to be searched for each query of class $C_2$ would increase because patches of $C_1$

would get distributed among the trained categories.

As mentioned in Section V-A, normalization is required for effective querying across different

paths. To establish the need for normalization and effectiveness of the scheme used, we perform

the following experiments. In the first trial, images are used without any normalization. The

retrieved patches can be grouped as (1) those that belong to the very same image as the query

patch, (2) those that belong to images from the same path, and (3) those that belong to images

from a different path. A total of [...]

queries are run for each category. The average percentages of

these groups among the top [...]

ranked patches for each query are shown in Table IV. It can be

observed that for each category, more than [...]% of the results belong to the same parent image

as the query. Also note that more than [...]%

of the results for the query belong to the same path

as the query. Thus, without normalization, the results tend to show significant bias. In the second


Fig. 14. Ordered retrieval results on a Mountain query patch. Patch labels consist of (1) Parent image, (2) Local Coordinates,

and (3) IRM distance.

trial, the normalization procedure described earlier is applied to the images before retrieval, and the experiment is repeated. The new results are reported in Table V. We note significant improvement in the retrieval, with a clear reduction in bias toward the query image and path. Exploring more robust normalization procedures to counter the bias further is a possible future direction.
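The grouping used in this experiment can be sketched as follows. This is only an illustration under stated assumptions: each patch is represented as a dictionary with hypothetical `image` and `path` keys, and the three groups are treated as mutually exclusive, so "same path" here means a different image on the same path. If "same path" is instead meant to include the query's own image, the first two percentages should be added.

```python
from collections import Counter

GROUPS = ("same image", "same path", "different path")


def origin_group(query, result):
    """Classify a retrieved patch by where it comes from relative to the query."""
    if result["image"] == query["image"]:
        return "same image"
    if result["path"] == query["path"]:
        return "same path"  # assumption: a different image on the same path
    return "different path"


def origin_distribution(query, results):
    """Return the percentage of retrieved patches falling in each group."""
    if not results:
        return {group: 0.0 for group in GROUPS}
    counts = Counter(origin_group(query, r) for r in results)
    return {group: 100.0 * counts[group] / len(results) for group in GROUPS}
```

Averaging these distributions over all queries of a category gives one row of a table such as Table IV or Table V.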

In order to assess the impact of prior categorization using 2-D MHMMs and SVM in improving

retrieval effectiveness, we perform the following experiment. Of the Q patches displayed in

response to each query, one measure to determine retrieval effectiveness is the percentage of

relevant patches in them, i.e., the precision. It is measured as follows. For each category, we

use the system to retrieve from 5 to 30 patches per query (in intervals of 5) and measure the percentage of patches retrieved that have the same manual category label as the query patch.

This is repeated�

times for each category, and the average precision is plotted against Q, as shown in Fig. 15. The most important observation is that semantic categorization using

2-D MHMM results in roughly � � to��Z��

improvement in retrieval relevance. For specific


TABLE III

COMPARISON OF RETRIEVAL TIMES WITH AND WITHOUT PRIOR 2-D MHMM CLASSIFICATION

Search Entire database Semantically relevant patches No. of patches
Mountain (1) é�êXë²ìuíÃî é�ê éÃï¦ðÃî í�ìÃðÃñ
Crop (2) é�êXë;òÃï¦î é�ê é�ì<ë;î ë²ì�óÃó�ï
Urban (3) é�êXë;ò¦ôÃî éqê éqë;î ëlðÃñ�ë
Residential (4) é�êXë²ìuðÃî é�ê é�ë5î ï(ìÃð¦ì
Others (0) é�êXë;ò¦ðÃî é�ê éuó�ôÃî ï�ëlôuò(ì

TABLE IV

AVERAGE DISTRIBUTION OF PATCH RETRIEVAL WITHOUT NORMALIZATION

õ÷öuøúùlû�ü5ý;þqÿ��ý��¥û�� Same image Same path Different path
Mountain (1) íÃé�� í�ì�� ô��
Crop (2) ë;é¦é�� ëlé¦é�� é��
Urban (3) ó�ð�� í�ë� í��
Residential (4) ðuï� í¦ô�� ì��

TABLE V

AVERAGE DISTRIBUTION OF PATCH RETRIEVAL AFTER NORMALIZATION

õ÷öuøúùlû�ü5ý;þqÿ��ý��¥û�� Same image Same path Different path
Mountain (1) ôuï� ð�ì�� ëlô��
Crop (2) ó�ð�� ð¦í�� ë¦ë�
Urban (3) òÃï� ó(ñ�� ïÃó�
Residential (4) òuó�� ó�ë�� ï�í��

requirements, these plots may be used to choose suitable values of Q. We have thus established

the effectiveness of prior categorization using 2-D MHMMs as a tool for satellite image mining,

both in terms of retrieval precision as well as in terms of retrieval speed.
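The precision measurement described above can be sketched as follows. The function names are hypothetical, and each run is assumed to be a pair consisting of the query's manual category label and the ranked list of manual labels of the retrieved patches.

```python
def precision_at_q(query_label, retrieved_labels, q):
    """Percentage of the top-q retrieved patches sharing the query's label."""
    top = retrieved_labels[:q]
    if not top:
        return 0.0
    relevant = sum(1 for label in top if label == query_label)
    return 100.0 * relevant / len(top)


def average_precision_curve(runs, q_values):
    """Average precision over all runs, for each number of retrieved patches q.

    runs: list of (query_label, ranked_retrieved_labels) pairs.
    """
    curve = {}
    for q in q_values:
        scores = [precision_at_q(label, ranked, q) for label, ranked in runs]
        curve[q] = sum(scores) / len(scores) if scores else 0.0
    return curve
```

Computing one such curve with prior categorization and one without it, for the same queries, gives the two lines compared in each panel of Fig. 15.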

It is worth noting that patches in untrained categories can also be effectively retrieved, as shown in Fig. 16, albeit with less precision. The retrieval for untrained categories is also efficient because the system searches for similar patches only among the patches labeled c_0 rather than


[Fig. 15 plots, four panels: Accuracy of retrieval for Mountain, Crop field, Urban area, and Residential area. Horizontal axis: No. of patches retrieved (Q), 5 to 30. Vertical axis: Percentage relevance, 50 to 100. Curves in each panel: IRM only, and IRM + 2-D MHMM.]

Fig. 15. Average precision of IRM based retrieval for each category, with and without 2-D MHMM categorization.

the entire database of patches. We finally comment on the 2-D MHMM training times. About

20 minutes are required to train each 2-D MHMM on a 1.7 GHz Intel Xeon machine, so building the four models required by the system takes approximately 80 minutes. However, this process

can be run off-line, and is non-recurring. The subsequent indexing process is done only once

for each image added to the database. The system performs retrieval in real-time using the fast

and robust IRM measure.

VIII. DISCUSSION AND FUTURE WORK

The proposed system uses a convenient learning based approach for large-scale browsing and

retrieval of satellite image patches. It has been shown that automatic semantic categorization of

patches using 2-D MHMM prior to retrieval improves performance in terms of speed as well

as precision of retrieval. Prior categorization reduces the search space to fewer, more relevant

patches, thereby reducing search time. Searching through only the semantically relevant patches

leads to improvement in the quality of retrieval. An SVM has been used effectively to handle patches from categories that were not trained for. Performing classification at the patch level instead of the pixel


Fig. 16. Demonstrating the effectiveness of our system on untrained categories. Shown here are ordered retrieval results on a

coastline query patch (coastline - untrained).

level in satellite images helps in building a convenient interface for browsing. Adding new

satellite images to the system is fairly straightforward and does not require re-training of the

existing models. To add a new category of interest, only a 2-D MHMM for that category needs to be trained, while the existing models can be re-used. The SVM classifier must,

however, be retrained with samples from all classes. Two different normalization approaches have

been attempted, and for this specific case, histogram transformations have been found effective

for giving the system robustness to sun elevation and other atmospheric variations. There are,

however, still some issues which have not been tackled and form part of our ongoing research.

• Square patches are used in the proposed system due to the convenience in computation, but the users may desire more flexible shapes for querying. It will be interesting to see how accuracy of retrieval varies with the size of the patches.

• The impact of using standard normalization procedures such as DN to at-satellite reflectance in place of histogram transformations on our particular system, though experimented with, has not been explored extensively, and hence remains future work.

• Only four low-level land cover categories have been considered in this work. Although

classification over a small number of categories has often been experimented with and found to be useful for applications, for example [15], [5], [38], extending our system to support more categories will be beneficial. Adapting the system to finer categories would call for higher resolution imagery such as IKONOS. Instead of the rough categories used in this work, an interesting future direction for us is to explore using higher resolution imagery and more precise land cover categories. We note that these can be attempted without making any major changes to the proposed architecture.

• An important issue related to the categories used in the experiments was the presence/absence

of noise in the form of snow and cloud covers. While developing the system, we faced this problem, especially when training on the ‘mountain’ patches, which were a mix of snow-capped mountains and low-altitude hills. To resolve this, we took advantage of the fact that 2-D MHMM based models have been found capable of learning significantly diverse image categories [22]. We hand-picked a mix of all observed variants within a land cover category into the training set. As noted, this approach seemed to work, since the accuracy for the mountain category was as high as 94.5%. This reinforces the belief that 2-D MHMM is capable of learning fairly diverse land cover categories. As for cloud covers, the Landsat images that our experiments were conducted on did not have significant areas of cloud cover. However, significant cloud cover may often be present in other imagery, and this can undermine the performance of the system. This problem can potentially be tackled by either building multiple 2-D MHMM based models for each category with and without cloud cover, or by using multi-temporal image stacks for improved categorization and retrieval. The use of multi-temporal imagery for improved classification and retrieval of patches has not been explored in this work, but is believed to have potential for such improvement. It may be possible to train different models of the same land cover categories over different periods of time. Pooling of all such patches into one database and performing retrieval may be one straightforward approach to accounting for annual weather changes (snow, clouds) in the retrieval process.


REFERENCES

[1] J. R. Anderson, E. H. Harvey, J. T. Roach, and R. E. Whitman, “A Land Use and Land Cover Classification System for Use with Remote Sensor Data,” Geological Survey Professional Paper 964, U.S. Government Printing Office, 1976.

[2] P. M. Atkinson and A. R. L. Tatnall, “Neural Networks in Remote Sensing,” International Journal of Remote Sensing,

18(4):699–709, 1997.

[3] L. Bastin, “Comparison of Fuzzy C-means Classification, Linear Mixture Modeling and MLC Probabilities as Tools for

Unmixing Coarse Pixels,” International Journal of Remote Sensing, 18(17):3629-3648, 1997.

[4] H. Bischof, W. Schneider, and A. J. Pinz, “Multispectral Classification of Landsat-images using Neural Networks,” IEEE

Trans. on Geoscience and Remote Sensing, 30(3):482–490, 1992.

[5] L. Bruzzone, D. F. Prieto, and S. B. Serpico, “A Neural-Statistical Approach to Multitemporal and Multisource Remote-Sensing Image Classification,” IEEE Trans. on Geoscience and Remote Sensing, 37(3):1350–1359, 1999.

[6] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for SVM,” 2001. Software available from: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] J. Cohen, “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20:37-46, 1960.

[8] R. G. Congalton, “A Review of Assessing the Accuracy of Classifications of Remotely Sensed Data,” Remote Sensing of

Environment, 37:35–46, 1991.

[9] H. Daschiel and M. Datcu, “Information Mining in Remote Sensing Image Archives: System Evaluation,” IEEE Trans. on Geoscience and Remote Sensing, 43(1):188–199, 2005.

[10] M. Datcu, H. Daschiel, A. Pelizzari, M. Quartulli, A. Galoppo, A. Colapicchioni, M. Pastori, K. Seidel, P. G. Marchetti, and S. D’Elia, “Information Mining in Remote Sensing Image Archives: System Concepts,” IEEE Trans. on Geoscience and Remote Sensing, 41(12):2923–2936, 2003.

[11] R. Datta, J. Li, and J. Z. Wang, “Content-Based Image Retrieval - A Survey on the Approaches and Trends of the New

Age,” Proc. ACM International Workshop on Multimedia Information Retrieval, ACM Multimedia, 2005.

[12] G. M. Foody, “Approaches for the Production and Evaluation of Fuzzy Land Cover Classifications from Remotely-sensed

Data,” International Journal of Remote Sensing, 17(7):1317–1340, 1996.

[13] A. Gersho, “Asymptotically Optimum Block Quantization,” IEEE Trans. on Information Theory, 25(4):373–380, 1979.

[14] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural Features for Image Classification,” IEEE Trans. on Systems, Man and Cybernetics, 3(6):610–621, 1973.

[15] L. Hermes, D. Frieauff, J. Puzicha, and J.M. Buhmann, “Support Vector Machines for Land Usage Classification in Landsat

TM Imagery,” Proc. Geoscience and Remote Sensing Symposium, 1999.

[16] C. Huang, L. Yang, C. Homer, B. Wylie, J. Vogelman, and T. DeFelice, “At-satellite Reflectance: A First-order

Normalization of Landsat 7 ETM+ images,” USGS Technical Report, 2001.

[17] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.

[18] J. R. Jensen, “Introductory Digital Image Processing: A Remote Sensing Perspective,” Prentice Hall, 1995.

[19] A. S. Kumar and K. L. Majumder, “Information Fusion in Tree Classifiers,” International Journal of Remote Sensing, 22(5):861–869, 2001.

[20] C.-S. Li and V. Castelli, “Deriving Texture Feature Set for Content-based Retrieval of Satellite Image Database,” IEEE

ICIP, 1997.

[21] J. Li, R. M. Gray, and R. A. Olshen, “Multiresolution Image Classification by Hierarchical Modeling with Two Dimensional Hidden Markov Models,” IEEE Trans. on Information Theory, 46(5):1826–1841, 2000.


[22] J. Li and J. Z. Wang, “Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach,” IEEE Transactions

on Pattern Analysis and Machine Intelligence, 25(9):1075–1088, 2003.

[23] W.-Y. Ma and B. S. Manjunath, “A Texture Thesaurus for Browsing Large Aerial Photographs,” Journal of the American

Society for Information Science, 49(7):633–648, 1998.

[24] B. S. Manjunath and W.-Y. Ma, “Texture Features for Browsing and Retrieval of Image Data,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(8):837–842, 1996.

[25] F. Melgani and L. Bruzzone, “Classification of Hyperspectral Remote Sensing Images with Support Vector Machines,” IEEE Trans. on Geoscience and Remote Sensing, 42(8):1778–1790, 2004.

[26] S. Newsam, L. Wang, S. Bhagavathy, and B. S. Manjunath, “Using Texture to Analyze and Manage Large Collections of Remote Sensed Image and Video Data,” Journal of Applied Optics: Information Processing, 43(2):210–217, 2004.

[27] H. Ren and C.-I. Chang, “A Generalized Orthogonal Subspace Projection Approach to Unsupervised Multispectral Image Classification,” IEEE Trans. on Geoscience and Remote Sensing, 38(6):2515–2528, 2000.

[28] M. Schröder, H. Rehrauer, K. Seidel, and M. Datcu, “Interactive Learning and Probabilistic Retrieval in Remote Sensing Image Archives,” IEEE Trans. on Geoscience and Remote Sensing, 38(5):2288–2298, 2000.

[29] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-Based Image Retrieval at the End of the

Early Years,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.

[30] P. C. Smits, S. G. Dellepiane, and R. A. Schowengerdt, “Quality Assessment of Image Classification Algorithms for Land-cover Mapping: A Review and a Proposal for a Cost-based Approach,” International Journal of Remote Sensing, 20(8):1461–1486, 1999.

[31] A. H. S. Solberg, T. Taxt, and A. K. Jain, “A Markov Random Field Model for Classification of Multisource Satellite

Imagery,” IEEE Trans. on Geoscience and Remote Sensing, 34(1):100–113, 1996.

[32] B. C. K. Tso and P. M. Mather, “Classification of Multisource Remote Sensing Imagery using a Genetic Algorithm and

Markov Random Fields,” IEEE Trans. on Geoscience and Remote Sensing, 37(3):1255–1260, 1999.

[33] U.S. Geological Survey, “Landsat (Sensor: ETM+),” EROS Data Center, Sioux Falls, SD. Available from:

http://glcf.umiacs.umd.edu/data/landsat/.

[34] F. Wang, “Fuzzy supervised classification of remote sensing images,” IEEE Transactions on Geoscience and Remote

Sensing, 28(2):194–201, 1990.

[35] J. Z. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries,” IEEE

Trans. on Pattern Analysis and Machine Intelligence, 23(9):947–963, 2001.

[36] J. Z. Wang, J. Nguyen, K. Lo, C. Law, and D. Regula, “Multiresolution Browsing of Pathology Images Using Wavelets,”

Journal of the American Medical Informatics Association, Proc. of AMIA Annual Symposium, 1999:340–344, 1999.

[37] G. G. Wilkinson, “Results and Implications of a Study of Fifteen Years of Satellite Image Classification Experiments,”

IEEE Trans. on Geoscience and Remote Sensing, 43(3):433–440, 2005.

[38] J. Zhang and G. M. Foody, “A Fuzzy Classification of Sub-urban Land Cover from Remotely Sensed Imagery,” International

Journal of Remote Sensing, 19(14):2721–2738, 1998.
