Satellite Image Scene Classification based on Deeper ConvNet with Context Aggregation
Yingbin Zheng (a), Bingfei Fu (a), Weiyuan Shao (a), Jian Pu (b), Hao Ye (a)
(a) Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai, China
(b) East China Normal University, Shanghai, China
Abstract
Scene classification is a fundamental problem in understanding high-resolution remote sensing imagery. Recently, convolutional neural networks (ConvNets) have achieved remarkable performance in different tasks, and significant efforts have been made to develop various representations for satellite image scene classification. In this paper, we present a novel representation based on a deeper ConvNet with context aggregation. The proposed two-pathway ResNet (ResNet-TP) architecture adopts ResNet [1] as its backbone, and the two pathways allow the network to model both local details and regional context. The ResNet-TP based representation is generated by global average pooling on the last convolutional layers of both pathways. Experiments on two scene classification datasets, UCM Land Use and NWPU-RESISC45, show that the proposed mechanism achieves promising improvements over state-of-the-art methods.
Keywords: High-resolution remote sensing imagery, scene classification,
convolutional neural network, ConvNet, residual learning, context aggregation.
1. Introduction
With the growing deployment of remote sensing instruments, satellite image scene classification, or scene classification from high-resolution remote sensing imagery, has drawn attention from the remote sensing community for its potential applications in various problems such as environmental monitoring and agriculture. Multiple challenges exist in producing accurate scene classification results.
Large intra-class variation within the same scene class is a common issue. Moreover, the semantic gap between scene semantics and image features can further increase the difficulty of robust classification. Thus, the design of suitable representations of satellite images to deal with these challenges is of fundamental importance.
Great progress has been achieved in recent years with representations based on convolutional neural networks (ConvNets), which have led to breakthroughs in a number of computer vision problems such as image classification. Typical ConvNets, including AlexNet [2], SPP-net [3], VGG [4], and GoogleNet [5], have also been applied to the task of satellite image scene classification. As satellite image datasets are orders of magnitude smaller than image classification datasets (e.g., ImageNet [6]) and may not be sufficient to train robust deep models, these ConvNet based methods usually employ off-the-shelf pre-trained deep networks (e.g., in [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]). The activations of the layers, or their fusion, are taken as the visual representation and fed to the scene classifiers. Evaluations on the benchmarks show that deep learning based features often outperform earlier handcrafted features.
The number of stacked layers in most current deep networks for satellite images is relatively small. For example, [14, 11, 12, 13] design classification systems based on the 7-layer architecture of AlexNet [2] or its replication CaffeNet [20], and [14, 15, 16, 17] employ the 16-layer VGG architecture [4]. Recent evidence suggests that deeper convolutional networks are more flexible and powerful, with high modeling capacity for image classification [1, 21]. Some previous works (e.g., [18, 8]) employ ResNet [1] as one of their basic models. However, the effectiveness of these deeper models, and how their performance depends on the number of layers, has not yet been fully explored for remote sensing images.
In this work, we focus on the problem of deeper convolutional networks and introduce an image representation for satellite images built upon a novel deeper architecture, which adopts Residual Networks (ResNet) [1] as its backbone. The two-pathway ResNet (ResNet-TP for short) is proposed, and Fig. 1 illustrates the pipeline.
Figure 1: Pipeline of the proposed framework with the two-pathway ResNet (ResNet-TP). The network is pre-trained on the ImageNet database (Phase 1). Phase 2 produces the fine-tuned network using the satellite image dataset. A given satellite image goes through the network, and the representation is generated by global average pooling on the last convolutional layers (Phase 3).
The proposed structure aims to aggregate contextual information to enhance feature discrimination. The input images go through two paths of convolutional operations after the conv4_x layers: one path follows the default building block settings, while the other incorporates dilation within the convolutional layers to expand the receptive field, which has been shown to be effective in many tasks such as semantic segmentation [22, 23], video action detection [24], RGB-D scene classification [25], and DNA modeling [26]. Training deeper ConvNets is usually more difficult and carries a higher risk of overfitting, especially on relatively small remote sensing datasets. Therefore, we employ a transfer learning strategy to reuse the parameters learned from an image classification dataset. The idea of constructing contextual representations has been explored in several previous remote sensing works, e.g., [13] and [7]. These approaches apply a single spatial pyramid pooling [3] to the feature maps of the last convolutional layer, which are usually tiny after the progressive resolution reduction of the preceding operations.
Figure 2: The building block with the deeper residual function F. IN and OUT denote the number of input and output planes, respectively. Top: the basic building block. Bottom: the bottleneck building block.
ResNet-TP is designed with contextual pathways before the last convolutional layers, and is thus able to alleviate the loss of spatial acuity caused by tiny feature maps. To evaluate the proposed framework, we report evaluations on the recent NWPU-RESISC45 dataset [16] and the UC Merced (UCM) Land Use dataset [27]. Our representation is compared with several recent approaches and achieves state-of-the-art performance.
2. Methodology
2.1. ResNet Architecture
We begin with a brief review of ResNet and residual learning, which addresses the training of deeper neural networks and was the foundation of the winning entries in the ILSVRC & COCO 2015 competitions for ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation [1]. First, downsampling is performed by one 7 × 7 convolutional layer and one max-pooling layer (over a 2 × 2 pixel window), each with stride 2. The main component used to construct the architecture is the stacked convolutional layers with shortcut connections. Such a building block is defined as
H(x) = F(x, {W_i}) + W_s x,    (1)
where x and H(x) are the input and output of the building block, F(·) is the residual mapping function to be learned, W_i are the parameters of the convolutional layers, and W_s is a linear projection matrix that ensures the dimensions
Table 1: Configuration of the groups in the ResNet-TP architecture, assuming an input image of size 224 × 224. Basic(IN, OUT) and Bottleneck(IN, OUT) denote the basic and bottleneck building blocks with IN input planes and OUT output planes (see Fig. 2). '×n_i' indicates stacking n_i blocks, where [n2, n3, n4, n5] = [2, 2, 2, 2] for 18 layers, [3, 4, 6, 3] for 34/50 layers, and [3, 4, 23, 3] for 101 layers.

Group        | Block, 18/34 layer     | Block, 50/101 layer      | Output size, dilation
conv1+pool1  | [7×7, 64]; max pooling | [7×7, 64]; max pooling   | 56×56, 1
conv2_x      | Basic(64,64)×n2        | Bottleneck(64,256)×n2    | 56×56, 1
conv3_x      | Basic(128,128)×n3      | Bottleneck(128,512)×n3   | 28×28, 1
conv4_x      | Basic(256,256)×n4      | Bottleneck(256,1024)×n4  | 14×14, 1
conv5_2_x    | Basic(512,512)×n5      | Bottleneck(512,2048)×n5  | 14×14, 2
conv5_1_x    | Basic(512,512)×n5      | Bottleneck(512,2048)×n5  | 7×7, 1
matching of x and F (W_s is the identity matrix when they have the same dimension). The operation F(·) + W_s x is performed by a shortcut connection and element-wise addition. There are usually two or three layers within one building block; two typical building blocks are shown in Fig. 2, where the basic building block is used for the 18/34-layer networks and the bottleneck building block for the 50/101/152-layer networks in [1]. The convolution is performed with a stride of 2 after a few building blocks to reduce the resolution of the feature maps. Unlike previous architectures such as AlexNet [2] and VGG [4], ResNet has no hidden fully-connected (FC) layers; it ends with a global average pooling and an N-way FC layer with softmax (N is the number of classes). We refer the reader to [1] for more details.
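The authors' code is not provided; the following is a minimal PyTorch sketch of Eq. (1) for the basic building block (Fig. 2, top). The class and argument names are ours, and a dilation argument is included here only because it will be needed for the context pathway of Sect. 2.2.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic residual block: H(x) = F(x, {W_i}) + W_s x, cf. Eq. (1)."""

    def __init__(self, in_planes, out_planes, stride=1, dilation=1):
        super().__init__()
        # Residual mapping F: two 3x3 convolutions with batch norm and ReLU.
        # padding == dilation keeps the spatial size unchanged for stride 1.
        self.f = nn.Sequential(
            nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_planes, out_planes, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(out_planes),
        )
        # W_s: identity when the dimensions of x and F(x) match,
        # otherwise a 1x1 linear projection.
        if stride != 1 or in_planes != out_planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, out_planes, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_planes),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Shortcut connection and element-wise addition.
        return self.relu(self.f(x) + self.shortcut(x))
```

The bottleneck variant (Fig. 2, bottom) replaces the two 3 × 3 convolutions with a 1 × 1, 3 × 3, 1 × 1 stack, as in [1].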
2.2. Context Aggregation
We now elaborate the construction of ResNet-TP. The architecture of the network is summarized in Table 1. In general, the network contains six groups of layers or building blocks. Group conv1+pool1 consists of the 7×7 convolutional layer and the max-pooling layer, and conv2_x to conv4_x consist of stacks of building blocks. All their configurations follow the generic design presented in Sect. 2.1, and differ only in the depth of the blocks. For an input image of 224 × 224 pixels, group conv4_x has an output stride of 16 and thus a feature map size of 14 × 14.
We introduce groups with dilated convolutional layers and construct the two-pathway architecture, which consists of two streams: a pathway with normal building blocks (conv5_1_x) and another with larger receptive fields (conv5_2_x). The dilation is applied to the 3 × 3 convolutional layer in the building block. Let x be the input feature map and w the filter weights associated with the dilated convolutional layer; the output y at position p = (p1, p2) is defined as

y(p) = Σ_{δ∈G_d} w(δ) · x(p + δ),    (2)

where G_d = {(−d,−d), (−d, 0), . . . , (0, d), (d, d)} is the grid of offsets for the 3 × 3 filters and d is the dilation. We set the dilation to d = 2 for conv5_2_x; the layers in conv5_1_x can be considered a special case with d = 1. The motivation for this architectural design is that we would like the prediction to be influenced by two aspects: the visual details of the region around each pixel of the feature map as well as its larger context. In fact, ResNet-TP degenerates to the standard ResNet when conv5_2_x and its subsequent layers are removed. Finally, we connect the last convolutional hidden layers of both pathways to a global average pooling, followed by an FC layer with softmax, to predict the labels.
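To make the two-pathway construction concrete, below is a sketch of how the conv5 groups could be assembled on top of conv4_x, reusing the BasicBlock sketch from Sect. 2.1 (18/34-layer setting of Table 1). The module name and interface are illustrative, as no reference implementation has been released.

```python
import torch
import torch.nn as nn

class TwoPathwayHead(nn.Module):
    """conv5_1_x and conv5_2_x pathways of ResNet-TP (Table 1, basic blocks)."""

    def __init__(self, in_planes=256, planes=512, n5=2, num_classes=45):
        super().__init__()
        # conv5_1_x: default blocks, stride 2 -> 7x7 feature maps (d = 1).
        self.conv5_1 = nn.Sequential(
            BasicBlock(in_planes, planes, stride=2),
            *[BasicBlock(planes, planes) for _ in range(n5 - 1)])
        # conv5_2_x: dilation d = 2, stride 1 -> 14x14 maps, larger context.
        self.conv5_2 = nn.Sequential(
            BasicBlock(in_planes, planes, dilation=2),
            *[BasicBlock(planes, planes, dilation=2) for _ in range(n5 - 1)])
        self.fc = nn.Linear(2 * planes, num_classes)

    def forward(self, x):
        # x: conv4_x output, e.g. (B, 256, 14, 14) for a 224x224 input.
        # Global average pooling on the last conv layer of each pathway;
        # the concatenation is the ResNet-TP representation (Phase 3).
        p1 = self.conv5_1(x).mean(dim=(2, 3))
        p2 = self.conv5_2(x).mean(dim=(2, 3))
        feat = torch.cat([p1, p2], dim=1)
        return self.fc(feat), feat
```

Keeping stride 1 in conv5_2_x preserves the 14 × 14 resolution, so the dilated pathway gains its larger receptive field without further loss of spatial acuity.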
2.3. Model Training and Implementation Details
The ResNet-TP architecture has a large number of parameters to train. A typical remote sensing dataset contains thousands of high-resolution satellite images, far fewer than the image classification datasets used for training state-of-the-art deep learning models. Following previous works [14, 7, 8], training of ResNet-TP is based on a transfer learning strategy, and Fig. 1 illustrates the overall framework of the proposed ResNet-TP based scene representation.
The whole training procedure, as well as the feature extraction, is carried out with the open-source PyTorch library and an Nvidia Titan X (Pascal) GPU. The first phase obtains a pre-trained model using the ImageNet database [6]. During this process, since the network with only the conv5_1_x pathway has the same structure as the original ResNet, we initialize the weights of conv1 to conv5_1_x from the existing PyTorch ResNet models¹. Directly updating the model from this initialization leads to a performance drop, as the parameters of conv5_2_x are randomly initialized. On the other hand, training the model from scratch is time-consuming, since ImageNet contains millions of images. We therefore make a compromise: the weights of conv5_2_x and its subsequent layers are learned from the network with only the conv5_2_x pathway, with conv1 to conv4_x frozen². We compared this pre-training strategy with a model trained from scratch under ResNet-TP-18 and found that the two achieve similar performance on the ImageNet validation set, while the former is much faster to train.
In the fine-tuning phase, due to GPU memory limitations we only fine-tune the building block groups after conv3_x, using the training satellite images and their labels. We use randomly rotated, mirrored, and scaled images for data augmentation during fine-tuning. Finally, the representation is obtained from the global average pooling of both pathways, and a linear SVM classifier with the default setting C = 1 is used for a fair comparison with previous works [18, 7, 16, 17].
¹ The pre-trained models are available at https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
² Stochastic gradient descent (SGD) is used with a batch size of 64 and a momentum of 0.9. The learning rate is initially set to 0.01 and is divided by 10 every 30 epochs.
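As an illustration of this fine-tuning setup (Phase 2), the following is a minimal sketch assuming a hypothetical ResNetTP module whose top-level children are named after the groups in Table 1. The optimizer settings follow footnote 2 (stated there for the pre-training phase and reused here for illustration); the augmentation ranges are not specified in the paper and are chosen purely as an example.

```python
import torch
import torchvision.transforms as T

# Data augmentation: randomly rotated, mirrored, and scaled images (Sect. 2.3).
# The exact ranges are not given in the paper; these values are illustrative.
train_transform = T.Compose([
    T.RandomRotation(degrees=90),
    T.RandomHorizontalFlip(),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.ToTensor(),
])

model = ResNetTP(depth=101, num_classes=45)  # hypothetical constructor

# Only the building block groups after conv3_x are fine-tuned;
# earlier groups stay frozen.
for name, param in model.named_parameters():
    if name.startswith(('conv1', 'conv2', 'conv3')):
        param.requires_grad = False

# Footnote 2: SGD with batch size 64 (set on the DataLoader) and momentum
# 0.9; the learning rate starts at 0.01 and is divided by 10 every 30 epochs.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```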
Figure 3: Scene categories from UCM Land Use: (1st row) agricultural, airplane, baseball
diamond, beach, buildings, chaparral, dense residential; (2nd row) forest, freeway, golf course,
harbor, intersection, medium residential, mobile home park; (3rd row) overpass, parking lot,
river, runway, sparse residential, storage tanks, tennis court.
3. Experiments
To evaluate the effectiveness of the proposed method, we compare it with several state-of-the-art approaches on two remote sensing scene classification datasets: the widely used 21-class UCM Land Use dataset [27] and the recently proposed 45-class NWPU-RESISC45 dataset [16].
3.1. UCM Land Use
The UCM Land Use dataset contains 2100 aerial scene images extracted from United States Geological Survey (USGS) national maps. Each land use class is composed of 100 images with a spatial resolution of 1 ft per pixel and a size of 256 × 256 pixels. The scene categories and sample images are illustrated in Fig. 3. In our experiments, the images are warped to a size of 224 × 224, and a subset with a fixed number of images per class (ranging from 5 to 90) is randomly selected as the training set. To test the stability of the proposed representation, each experiment is repeated 20 times to obtain convincing results, and the mean accuracy and standard deviation are adopted as the evaluation indicators.
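A sketch of this protocol, assuming a hypothetical extract_features helper that returns the global-average-pooled ResNet-TP representation of each image (the classifier is the linear SVM with C = 1 from Sect. 2.3):

```python
import numpy as np
from sklearn.svm import LinearSVC

def evaluate(features, labels, n_train_per_class=80, n_runs=20, seed=0):
    """Mean accuracy and standard deviation over repeated random splits."""
    rng = np.random.RandomState(seed)
    accuracies = []
    for _ in range(n_runs):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            # Randomly pick a fixed number of training images per class;
            # the remaining images of the class go to the test set.
            idx = rng.permutation(np.where(labels == c)[0])
            train_idx.extend(idx[:n_train_per_class])
            test_idx.extend(idx[n_train_per_class:])
        clf = LinearSVC(C=1.0)  # default setting, as in Sect. 2.3
        clf.fit(features[train_idx], labels[train_idx])
        accuracies.append(clf.score(features[test_idx], labels[test_idx]))
    return 100 * np.mean(accuracies), 100 * np.std(accuracies)
```

Varying n_train_per_class from 5 to 90 reproduces the settings behind Table 2 and Fig. 5.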
We first evaluate the overall results of the proposed representation and compare them with several state-of-the-art approaches. Table 2 summarizes the overall accuracies and standard deviations over all classes of the UCM Land Use dataset³.
Figure 4: Scene categories from the NWPU-RESISC45 dataset: (1st row) airplane, airport,
baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland; (2nd
row) cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track
field, harbor; (3rd row) industrial area, intersection, island, lake, meadow, medium residential,
mobile home park, mountain, overpass; (4th row) palace, parking lot, railway, railway station,
rectangular farmland, river, roundabout, runway, sea ice; (5th row) ship, snowberg, sparse
residential, stadium, storage tank, tennis court, terrace, thermal power station, wetland.
As can be seen from the table, the ResNet-TP based representation shows very competitive performance for different numbers of training images, and is significantly better than the other representations when the training images are limited. ResNet152 EMR [18] is also a ResNet based representation, but it is generated by combining information from multiple layers with a larger input image size (320 × 320). When its input image size is set to 224 × 224 pixels, its classification accuracy is 98.38%, which is inferior to ours. We believe that the ResNet-TP based representation is also complementary to these mixed-resolution methods, since they focus on different levels of information, which will be examined in future work.
³ Here we only compare results from a single representation. We believe that fusion with other features would lead to better performance; however, this is not within the scope of this work.
Table 2: Overall accuracies and standard deviations (%) of the proposed method and the state of the art on the UCM dataset, for different numbers of training images per class. '-' indicates that the result is not available in the corresponding paper.

Method              | 5 images      | 50 images     | 80 images
SSEP [28]           | 65.34 ± 2.01  | -             | -
MKL [29]            | 64.78 ± 1.62  | 88.68 ± 1.10  | 91.26 ± 1.17
SPP-net MKL [7]     | 75.33 ± 1.86  | 95.72 ± 0.50  | 96.38 ± 0.92
AlexNet-SPP-SS [9]  | -             | -             | 96.67 ± 0.94
VGG-16 [10]         | -             | 94.14 ± 0.69  | 95.21 ± 1.20
ResNet50 [8]        | -             | -             | 98.50 ± 1.40
ResNet152 EMR [18]  | -             | -             | 98.90
ResNet-TP-101       | 78.12 ± 2.28  | 98.03 ± 0.39  | 98.71 ± 0.49
Network Depth and Pathways: Recall that ResNet-TP is constructed by stacking building blocks, with the different block settings shown in Table 1. In Table 3 we list the classification accuracy as a function of network depth and pathway. Using the ResNet-TP-18 model, we already obtain an accuracy of 98.26%, and adding more layers further boosts the accuracy. We also analyze the influence of the pathways. The network with only the dilated convolution stream (conv5_2_x and its subsequent layers) achieves higher performance than that with only conv5_1_x under the 18/101-layer settings, and lower under the 34/50-layer settings. We conjecture that a single pathway may not be the one to which the network responds with optimal confidence, and that context aggregation over multiple pathways increases robustness.
Number of Training Images: We also evaluate the effect of the number of training images on the representation. We observe significant performance gains when the number of training images increases from 10 to 50, after which the performance tends to saturate. Another observation is that the result of ResNet-TP-50 is similar to the accuracy of ResNet-TP-101 in most of the comparisons, indicating that computation could be saved by using ResNet-TP-50 with only a marginal performance drop.
Table 3: Evaluation on the UCM dataset with different pathways of the ResNet-TP models. conv5_1_x and conv5_2_x indicate the networks with only one stream of ResNet-TP. The number of training images is 80.
Model          | Full          | conv5_1_x     | conv5_2_x
ResNet-TP-18   | 98.26 ± 0.74  | 97.80 ± 0.77  | 97.83 ± 0.90
ResNet-TP-34   | 98.43 ± 0.68  | 98.37 ± 0.54  | 98.06 ± 0.73
ResNet-TP-50   | 98.56 ± 0.53  | 98.49 ± 0.63  | 98.15 ± 0.62
ResNet-TP-101  | 98.71 ± 0.49  | 98.50 ± 0.59  | 98.57 ± 0.52
Figure 5: Evaluation of the ResNet-TP models with different numbers of training images on the UCM dataset.
3.2. NWPU-RESISC45
The NWPU-RESISC45 dataset contains 31500 remote sensing images extracted from Google Earth, covering more than 100 countries and regions. Each scene class is composed of 700 images, with spatial resolutions varying from about 30 m to 0.2 m per pixel. Sample images and the scene categories are shown in Fig. 4, and we again warp the images to a size of 224 × 224. We follow the official train/test split strategy with two training ratios, i.e., 10% (10% for training and 90% for testing) and 20% (20% for training and 80% for testing). We repeat the evaluations ten times under each training ratio by randomly splitting the dataset, and again report the mean accuracy and standard deviation.
Table 4: Overall accuracies and standard deviations (%) of the proposed method and the state of the art under different training ratios on the NWPU-RESISC45 dataset. The results of the pre-trained and fine-tuned ConvNets are reported in [16].

Feature               | Network        | 10%           | 20%
Pre-trained ConvNet   | AlexNet        | 81.22 ± 0.19  | 85.16 ± 0.18
                      | GoogleNet      | 87.15 ± 0.45  | 90.36 ± 0.18
                      | VGG-16         | 82.57 ± 0.12  | 86.02 ± 0.18
Fine-tuned ConvNet    | AlexNet        | 81.22 ± 0.19  | 85.16 ± 0.18
                      | GoogleNet      | 87.15 ± 0.45  | 90.36 ± 0.18
                      | VGG-16         | 82.57 ± 0.12  | 86.02 ± 0.18
BoCF [17]             | AlexNet        | 55.22 ± 0.39  | 59.22 ± 0.18
                      | GoogleNet      | 78.92 ± 0.17  | 80.97 ± 0.17
                      | VGG-16         | 82.65 ± 0.31  | 84.32 ± 0.17
Triplet networks [19] |                | -             | 92.33 ± 0.20
This work             | ResNet-TP-18   | 87.79 ± 0.28  | 91.03 ± 0.26
                      | ResNet-TP-101  | 90.70 ± 0.18  | 93.47 ± 0.26
We compare the ResNet-TP based representation with several baselines and state-of-the-art approaches. The first group contains several well-known baseline descriptors: pre-trained and fine-tuned AlexNet [2], GoogleNet [5], and VGG-16 [4]. Table 4 shows that the proposed representation outperforms all the baseline descriptors as well as the state-of-the-art approaches shown in the middle part of Table 4, including the Bag of Convolutional Features (BoCF) [17] and the very recent triplet networks [19]. In Fig. 6, we report the confusion matrices and the detailed classification accuracy for each scene label under training ratios of 10% and 20%.
Network and Training Ratio: We study the performance of different network settings and training ratios; results are given in Fig. 7. Similar to the observation on the UCM dataset, adding layers to the ResNet-TP architecture and aggregating context via the two pathways boost the classification accuracy.
Figure 6: Confusion matrices under the training ratio of 10% (top) and 20% (bottom) by
using ResNet-TP-101 on the NWPU-RESISC45 dataset.
In addition, we observe that increasing the number of training images (from 70 to 140 per class) leads to significant performance gains, which differs slightly from the saturation observed on UCM Land Use, probably due to the scene variation and data diversity of the NWPU-RESISC45 dataset.
Pre-Trained vs. Fine-Tuned: Our last experiment evaluates an alternative method for generating the ResNet-TP model. While the fine-tuned method follows the full pipeline of Fig. 1, the pre-trained approach consists only of Phases 1 and 3 in the figure, with the model parameters learned directly from ImageNet.
Figure 7: Evaluation of the ResNet-TP parameters and components with different training ratios on the NWPU-RESISC45 dataset.
Figure 8: Comparison of the pre-trained and fine-tuned ResNet-TP models on the NWPU-
RESISC45 dataset. TR indicates training ratio.
The comparison between the curves in Fig. 8 verifies that, for both training ratios, fine-tuning the network is important for obtaining a more discriminant representation. We also find that the fine-tuned method outperforms the pre-trained method even when its training set is half the size of that used by the pre-trained method.
4. Conclusion
In this work, we have introduced ResNet-TP, a deeper convolutional network with context aggregation that generates a discriminant representation for satellite image scene classification. Through empirical scene classification experiments, we have shown that the proposed ResNet-TP based representation is more effective than previous deep features, producing very competitive results on the UCM Land Use and NWPU-RESISC45 datasets. For future work, we plan to incorporate multiple scales and multiple layers into the ResNet-TP based representation, and also to explore the performance benefits of combining this representation with other features.
References
[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recog-
nition, in: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2016, pp. 770–778.
[2] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with
deep convolutional neural networks, in: Advances in Neural Information
Processing Systems (NIPS), 2012, pp. 1097–1105.
[3] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep con-
volutional networks for visual recognition, IEEE Transactions on Pattern
Analysis and Machine Intelligence 37 (9) (2015) 1904–1916.
[4] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-
scale image recognition, in: International Conference on Learning Repre-
sentations (ICLR), 2015.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: A large-
scale hierarchical image database, in: IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), 2009, pp. 248–255.
[7] Q. Liu, R. Hang, H. Song, Z. Li, Learning multiscale deep features for
high-resolution satellite image scene classification, IEEE Transactions on
Geoscience and Remote Sensing PP (99) (2017) 1–10.
[8] G. J. Scott, M. R. England, W. A. Starms, R. A. Marcum, C. H. Davis,
Training deep convolutional neural networks for land-cover classification
of high-resolution imagery, IEEE Geoscience and Remote Sensing Letters
14 (4) (2017) 549–553.
[9] X. Han, Y. Zhong, L. Cao, L. Zhang, Pre-trained alexnet architecture with
pyramid pooling and supervision for high spatial resolution remote sensing
image scene classification, Remote Sensing 9 (8) (2017) 848.
[10] G. S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, X. Lu, Aid: A
benchmark data set for performance evaluation of aerial scene classification,
IEEE Transactions on Geoscience and Remote Sensing 55 (7) (2017) 3965–
3981.
[11] F. P. S. Luus, B. P. Salmon, F. van den Bergh, B. T. J. Maharaj, Multi-
view deep learning for land-use classification, IEEE Geoscience and Remote
Sensing Letters 12 (12) (2015) 2448–2452.
[12] B. Zhao, B. Huang, Y. Zhong, Transfer learning with fully pretrained deep
convolution networks for land-use classification, IEEE Geoscience and Re-
mote Sensing Letters 14 (9) (2017) 1436–1440.
[13] X. Han, Y. Zhong, L. Cao, L. Zhang, Pre-trained alexnet architecture with
pyramid pooling and supervision for high spatial resolution remote sensing
image scene classification, Remote Sensing 9 (8) (2017) 848.
[14] F. Hu, G.-S. Xia, J. Hu, L. Zhang, Transferring deep convolutional neu-
ral networks for the scene classification of high-resolution remote sensing
imagery, Remote Sensing 7 (11) (2015) 14680–14707.
[15] K. Nogueira, O. A. Penatti, J. A. dos Santos, Towards better exploiting
convolutional neural networks for remote sensing scene classification, Pat-
tern Recognition 61 (2017) 539 – 556.
[16] G. Cheng, J. Han, X. Lu, Remote sensing image scene classification: Bench-
mark and state of the art, Proceedings of the IEEE 105 (10) (2017) 1865–
1883.
[17] G. Cheng, Z. Li, X. Yao, L. Guo, Z. Wei, Remote sensing image scene
classification using bag of convolutional features, IEEE Geoscience and
Remote Sensing Letters 14 (10) (2017) 1735–1739.
[18] G. Wang, B. Fan, S. Xiang, C. Pan, Aggregating rich hierarchical features
for scene classification in remote sensing imagery, IEEE Journal of Selected
Topics in Applied Earth Observations and Remote Sensing 10 (9) (2017)
4104–4115.
[19] Y. Liu, C. Huang, Scene classification via triplet networks, IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing PP (99)
(2017) 1–18.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast fea-
ture embedding, in: ACM International Conference on Multimedia, 2014,
pp. 675–678.
[21] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely con-
nected convolutional networks, in: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2017.
[22] F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions,
in: International Conference on Learning Representations (ICLR), 2016.
[23] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convo-
lution, and fully connected crfs, arXiv preprint arXiv:1606.00915.
[24] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[25] Y. Zheng, H. Ye, L. Wang, J. Pu, Learning multiviewpoint context-aware
representation for rgb-d scene classification, IEEE Signal Processing Letters
25 (1) (2018) 30–34.
[26] A. Gupta, A. M. Rush, Dilated convolutions for modeling long-distance
genomic dependencies, arXiv preprint arXiv:1710.01278.
[27] Y. Yang, S. D. Newsam, Bag-of-visual-words and spatial extensions for
land-use classification, in: Proceedings of the 18th SIGSPATIAL Interna-
tional Conference on Advances in Geographic Information Systems, 2010,
pp. 270–279.
[28] W. Yang, X. Yin, G. S. Xia, Learning high-level features for satellite im-
age classification with limited labeled samples, IEEE Transactions on Geo-
science and Remote Sensing 53 (8) (2015) 4472–4482.
[29] C. Cusano, P. Napoletano, R. Schettini, Remote sensing image classification
exploiting multiple kernel learning, IEEE Geoscience and Remote Sensing
Letters 12 (11) (2015) 2331–2335.