
Satellite Image Scene Classification based on Deeper ConvNet with Context Aggregation

Yingbin Zheng a, Bingfei Fu a, Weiyuan Shao a, Jian Pu b, Hao Ye a

a Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai, China
b East China Normal University, Shanghai, China

Abstract

Scene classification is a fundamental problem in understanding high-resolution remote sensing imagery. Recently, convolutional neural networks (ConvNets) have achieved remarkable performance in different tasks, and significant effort has been made to develop various representations for satellite image scene classification. In this paper, we present a novel representation based on a deeper ConvNet with context aggregation. The proposed two-pathway ResNet (ResNet-TP) architecture adopts ResNet [1] as its backbone, and the two pathways allow the network to model both local details and regional context. The ResNet-TP based representation is generated by global average pooling on the last convolutional layers of both pathways. Experiments on two scene classification datasets, UCM Land Use and NWPU-RESISC45, show that the proposed mechanism achieves promising improvements over state-of-the-art methods.

Keywords: High-resolution remote sensing imagery, scene classification, convolutional neural network, ConvNet, residual learning, context aggregation.

1. Introduction

With the growing deployment of remote sensing instruments, satellite image scene classification, or scene classification from high-resolution remote sensing imagery, has drawn attention from the remote sensing community for its potential applications in various problems such as environmental monitoring and agriculture. Multiple challenges exist in producing accurate scene classification results.


Large intra-class variation within the same scene class is a common issue. Moreover, the semantic gap between the scene semantics and the image features can further increase the difficulty of robust classification. Thus, designing suitable representations of satellite images to deal with these challenges is of fundamental importance.

Great progress has been achieved in recent years with representations based on convolutional neural networks (ConvNets), which have led to breakthroughs in a number of computer vision problems such as image classification. Typical ConvNets, including AlexNet [2], SPP-net [3], VGG [4], and GoogleNet [5], have also been applied to the task of satellite image scene classification. As the number of images in satellite image datasets is orders of magnitude smaller than that of image classification datasets (e.g., ImageNet [6]) and may not be sufficient to train robust deep models, these ConvNet based methods usually employ off-the-shelf pre-trained deep networks (e.g., in [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]). The activations of the layers or their fusion are taken as the visual representation and fed to the scene classifiers. Evaluations on the benchmarks show that deep learning based features often outperform previous handcrafted features.

The number of stacked layers in most current deep networks for satellite images is relatively small. For example, [14, 11, 12, 13] design classification systems based on the 7-layer architecture of AlexNet [2] or its replication CaffeNet [20], and [14, 15, 16, 17] employ the 16-layer VGG architecture [4]. Recent evidence suggests that deeper convolutional networks are more flexible and powerful, with high modeling capacity for image classification [1, 21]. Some previous works (e.g., [18, 8]) employ ResNet [1] as one of their basic models. However, the effectiveness of these deeper models, and how their performance depends on the number of layers, has not been fully explored for remote sensing images.

In this work, we focus on the problem of deeper convolutional networks and introduce an image representation built upon a novel deeper architecture for satellite images, which adopts the Residual Network (ResNet) [1] as backbone. The two-pathway ResNet (ResNet-TP for short) is proposed, and Fig. 1 illustrates the pipeline.



Figure 1: Pipeline of the proposed framework with the two-pathway ResNet (ResNet-TP). The network is pre-trained using the ImageNet database (Phase 1). Phase 2 produces the fine-tuned network with the satellite image dataset. A given satellite image goes through the network, and the representation is generated from the global average pooling on the last convolutional layers (Phase 3).

The proposed structure aims to aggregate contextual information to enhance feature discrimination. The input images go through two paths of convolutional operations after the conv4_x layers: one path follows the default building-block settings, and the other incorporates dilation within the convolutional layers to expand the receptive field, which has been shown to be effective in many tasks such as semantic segmentation [22, 23], video action detection [24], RGB-D scene classification [25], and DNA modeling [26]. Training a deeper ConvNet is usually more difficult and may lead to a higher risk of overfitting, especially when using a relatively small remote sensing dataset. Therefore, we employ a transfer learning strategy to reuse the parameters learned from an image classification dataset. The idea of constructing contextual representations has also been taken up in several previous remote sensing works, e.g., [13] and [7]. These approaches use a single spatial pyramid pooling [3] on the feature maps of the last convolutional layer, which is usually tiny after the progressive resolution reduction of the preceding operations.



Figure 2: The building block with the deeper residual function F. IN and OUT denote the numbers of in-planes and out-planes, respectively. Top: the basic building block. Bottom: the bottleneck building block.

ResNet-TP is designed with contextual pathways before the last convolutional layers and is able to alleviate the loss of spatial acuity caused by tiny feature maps. To evaluate the proposed framework, we report evaluations on the recent NWPU-RESISC45 dataset [16] and the UC Merced (UCM) Land Use dataset [27]. Our representation is compared with several recent approaches and achieves state-of-the-art performance.

2. Methodology

2.1. ResNet Architecture

We begin with a brief review of ResNet and the residual learning that addresses the training issue of deeper neural networks, which was the foundation of the winning entries of the ILSVRC & COCO 2015 competitions for ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation [1]. First, downsampling is performed directly by one 7×7 convolutional layer and one max-pooling (over a 2×2 pixel window), each with stride 2. The main component used to construct the architecture is the stack of convolutional layers with shortcut connections. Such a building block is defined as

H(x) = F(x, {W_i}) + W_s x,    (1)

where x and H(x) are the input and output of the building block, F(·) is the residual mapping function to be learned, W_i are the parameters of the convolutional layers, and W_s is a linear projection matrix that matches the dimensions of x and F (W_s is set to the identity matrix when they have the same dimension).


Table 1: Configuration of the groups in the ResNet-TP architecture. Suppose the input image is of size 224×224. Basic(IN, OUT) and Bottleneck(IN, OUT) denote the basic and bottleneck building blocks with IN in-planes and OUT out-planes (see Fig. 2). '×ni' indicates stacking ni blocks, where [n2, n3, n4, n5] = [2, 2, 2, 2] for the 18-layer network, [3, 4, 6, 3] for the 34/50-layer networks, and [3, 4, 23, 3] for the 101-layer network.

Group        | Block (18/34 layer)     | Block (50/101 layer)      | Output size, dilation
conv1+pool1  | [7×7, 64]; max pooling  | [7×7, 64]; max pooling    | 56×56, 1
conv2_x      | Basic(64,64)×n2         | Bottleneck(64,256)×n2     | 56×56, 1
conv3_x      | Basic(128,128)×n3       | Bottleneck(128,512)×n3    | 28×28, 1
conv4_x      | Basic(256,256)×n4       | Bottleneck(256,1024)×n4   | 14×14, 1
conv5_2_x    | Basic(512,512)×n5       | Bottleneck(512,2048)×n5   | 14×14, 2
conv5_1_x    | Basic(512,512)×n5       | Bottleneck(512,2048)×n5   | 7×7, 1

The operation F(·) + W_s x is performed by a shortcut connection and element-wise addition. There are usually two or three layers within one building block, and two typical building blocks are shown in Fig. 2: the basic building block is used for the 18/34-layer networks and the bottleneck building block for the 50/101/152-layer networks in [1]. The convolution is performed with a stride of 2 after a few building blocks to reduce the resolution of the feature maps. Unlike previous architectures such as AlexNet [2] and VGG [4], ResNet has no hidden fully-connected (FC) layers; it ends with a global average pooling layer and then an N-way FC layer with softmax (N is the number of classes). We refer the reader to [1] for more details.
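To make Eq. (1) concrete, the following is a minimal PyTorch sketch of a basic building block; the layer ordering, the batch normalization, and the 1×1 convolution standing in for W_s follow the common ResNet recipe and are assumptions rather than a verbatim copy of the implementation used in this work.

import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic residual building block of Eq. (1): H(x) = F(x, {W_i}) + W_s x."""

    def __init__(self, in_planes, out_planes, stride=1, dilation=1):
        super().__init__()
        # F(x): two 3x3 convolutions with batch normalization and ReLU.
        self.conv1 = nn.Conv2d(in_planes, out_planes, 3, stride=stride,
                               padding=dilation, dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm2d(out_planes)
        self.conv2 = nn.Conv2d(out_planes, out_planes, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(out_planes)
        self.relu = nn.ReLU(inplace=True)
        # W_s: identity when the dimensions match, otherwise a 1x1 projection.
        if stride != 1 or in_planes != out_planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, out_planes, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_planes))
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # shortcut connection and element-wise addition
        return self.relu(out)

A bottleneck block differs only in F, which uses a 1×1, 3×3, 1×1 stack that reduces and then restores the channel dimension (Fig. 2, bottom).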

2.2. Context Aggregation

We now elaborate on the construction of ResNet-TP. The architecture of the network is summarized in Table 1. In general, the network contains six groups of layers or building blocks. Group conv1+pool1 consists of the 7×7 convolutional layer and the max-pooling, and conv2_x to conv4_x are stacks of building blocks. All of their configurations follow the generic design presented in Sect. 2.1 and differ only in the depth of the blocks. For an input image of 224×224 pixels, group conv4_x has an output stride of 16, and thus its feature map size is 14×14.
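As a quick cross-reference to Table 1, the short Python note below records how the network depth selects the block counts [n2, n3, n4, n5] and the resolution that reaches the conv5 groups; the dictionary name is ours.

# Block counts [n2, n3, n4, n5] per ResNet-TP depth, as listed in Table 1.
BLOCK_COUNTS = {18: [2, 2, 2, 2], 34: [3, 4, 6, 3], 50: [3, 4, 6, 3], 101: [3, 4, 23, 3]}

# For a 224x224 input, the output strides after conv1+pool1, conv2_x, conv3_x, and
# conv4_x are 4, 4, 8, and 16, so a 14x14 feature map enters the two conv5 pathways.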

We introduce a group with dilated convolutional layers and construct the two-pathway architecture. This architecture consists of two streams: a pathway with normal building blocks (conv5_1_x) and another with larger receptive fields (conv5_2_x). The dilation is applied to the 3×3 convolutional layer in the building block. Let x be the input feature map and w the filter weights associated with the dilated convolutional layer; the output y at position p = (p1, p2) is defined as

y(p) = Σ_{d∈G_d} w(d) · x(p + d),    (2)

where G_d = {(−d, −d), (−d, 0), ..., (0, d), (d, d)} is the grid of the 3×3 filter and d is the dilation. We set the dilation d = 2 for conv5_2_x; the layers in conv5_1_x can be considered a special case with d = 1. The motivation for this architectural design is that we would like the prediction to be influenced by two aspects: the visual details of the region around each pixel of the feature map as well as its larger context. In fact, ResNet-TP degenerates to the standard ResNet when conv5_2_x and its subsequent layers are removed. Finally, we connect the last convolutional layers of both pathways to a global average pooling followed by an FC layer with softmax to predict the labels.
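Building on the BasicBlock sketch in Sect. 2.1, the snippet below illustrates one way the two conv5 pathways and the context-aggregating head could be assembled; the module and argument names are ours, and the channel sizes follow the 18/34-layer column of Table 1.

import torch
import torch.nn as nn

class ResNetTPHead(nn.Module):
    """Two-pathway conv5 head: conv5_1_x (stride 2, d=1) and conv5_2_x (stride 1, d=2).
    Reuses the BasicBlock sketch from Sect. 2.1."""

    def __init__(self, in_planes=256, planes=512, n5=2, num_classes=45):
        super().__init__()
        # Pathway 1: default blocks, 14x14 -> 7x7 for a 224x224 input.
        blocks1 = [BasicBlock(in_planes, planes, stride=2)]
        blocks1 += [BasicBlock(planes, planes) for _ in range(n5 - 1)]
        self.conv5_1_x = nn.Sequential(*blocks1)
        # Pathway 2: dilation 2 with stride 1, so the 14x14 resolution and the
        # larger context are preserved.
        blocks2 = [BasicBlock(in_planes, planes, stride=1, dilation=2)]
        blocks2 += [BasicBlock(planes, planes, dilation=2) for _ in range(n5 - 1)]
        self.conv5_2_x = nn.Sequential(*blocks2)
        self.pool = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.fc = nn.Linear(2 * planes, num_classes)  # softmax applied in the loss

    def forward(self, x):                  # x: conv4_x feature map (e.g. 256 x 14 x 14)
        f1 = self.pool(self.conv5_1_x(x)).flatten(1)
        f2 = self.pool(self.conv5_2_x(x)).flatten(1)
        feat = torch.cat([f1, f2], dim=1)  # the ResNet-TP representation
        return self.fc(feat), feat

Concatenating the two pooled vectors gives a 2×512-dimensional representation for the basic-block variants (2×2048 for the bottleneck variants of Table 1).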

2.3. Model Training and Implementation Details

The ResNet-TP architecture has a large number of parameters to train. A typical remote sensing dataset contains thousands of high-resolution satellite images, which is far fewer than the image classification datasets used to train state-of-the-art deep learning models. Following previous works [14, 7, 8], training of ResNet-TP is based on a transfer learning strategy, and Fig. 1 illustrates the overall framework of the proposed ResNet-TP based scene representation.

The whole training procedure as well as the feature extraction are carried out with the open-source PyTorch library and an Nvidia Titan X (Pascal) GPU. The first phase obtains a pre-trained model using the ImageNet database [6]. During this process, since the network with only the conv5_1_x pathway has the same structure as the original ResNet, we set the weights of conv1 to conv5_1_x from the existing PyTorch ResNet models¹. Directly updating the model from this initialization leads to a performance drop, as the parameters of conv5_2_x are randomly initialized. On the other hand, training the model from scratch is time-consuming, since ImageNet contains millions of images. Here we make a compromise by learning the weights of conv5_2_x and its subsequent layers from the network with only the conv5_2_x pathway while freezing conv1 to conv4_x². We compare this pre-training strategy with a model trained from scratch under ResNet-TP-18 and find that they achieve similar performance on the ImageNet validation set, while the former trains much faster.

For the fine-tuning phase, we only fine-tune the building-block groups after conv3_x using the training satellite images and their labels, due to the limitation of GPU memory. We apply randomly rotated, mirrored, or scaled images for data augmentation during fine-tuning. Finally, the representation is obtained from the global average pooling of both pathways, and a linear SVM classifier with the default setting C = 1 is used for a fair comparison with previous works [18, 7, 16, 17].

¹ The download link can be found at https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
² Stochastic gradient descent (SGD) is used with a batch size of 64 and a momentum of 0.9. The learning rate is initially set to 0.01 and is divided by 10 every 30 epochs.
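A rough sketch of Phases 2 and 3 under these settings is given below; the specific augmentation parameters, the parameter-group names, and the assumption that model is a ResNet-TP assembled as in Sect. 2.2 are ours and are not prescribed by the text.

import torch
from torchvision import transforms

# Data augmentation during fine-tuning: random rotation, mirroring, and scaling.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=90),
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# Fine-tune only the building-block groups after conv3_x (GPU-memory limit);
# `model` is assumed to expose the group names used in Table 1.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("conv4_x", "conv5_1_x", "conv5_2_x", "fc"))

# SGD with momentum 0.9 (batch size 64 in the data loader); the learning rate
# starts at 0.01 and is divided by 10 every 30 epochs, as in footnote 2.
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                            lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

After fine-tuning, the concatenated global-average-pooled activations (Phase 3) are fed to the linear SVM described above.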


Figure 3: Scene categories from UCM Land Use: (1st row) agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential; (2nd row) forest, freeway, golf course, harbor, intersection, medium residential, mobile home park; (3rd row) overpass, parking lot, river, runway, sparse residential, storage tanks, tennis court.

3. Experiments

To evaluate the effectiveness of the proposed method, we compare it with several state-of-the-art approaches on two remote sensing scene classification datasets: the widely used 21-class UCM Land Use dataset [27] and the recently proposed 45-class NWPU-RESISC45 dataset [16].

3.1. UCM Land Use

The UCM Land Use dataset contains 2100 aerial scene images extracted from United States Geological Survey (USGS) national maps. Each land use class is composed of 100 images with a spatial resolution of 1 ft and a size of 256×256 pixels. The scene categories and sample images are illustrated in Fig. 3. In our experiments, the images are warped to a size of 224×224, and a subset with a fixed number of images per class (ranging from 5 to 90) is randomly selected as the training set. To test the stability of the proposed representation, the experiments are executed 20 times, and the mean accuracy and standard deviation are adopted as the evaluation indicators.
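For reference, a minimal sketch of this evaluation protocol is given below, assuming the ResNet-TP features have already been extracted for all 2100 images; the stratified splitter is our choice for keeping the number of training images per class fixed.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedShuffleSplit

def evaluate(features, labels, n_train_per_class=80, n_runs=20, seed=0):
    """Mean accuracy and standard deviation over repeated random splits."""
    n_classes = len(np.unique(labels))
    splitter = StratifiedShuffleSplit(n_splits=n_runs,
                                      train_size=n_train_per_class * n_classes,
                                      random_state=seed)
    accs = []
    for train_idx, test_idx in splitter.split(features, labels):
        clf = LinearSVC(C=1).fit(features[train_idx], labels[train_idx])
        accs.append(clf.score(features[test_idx], labels[test_idx]))
    return np.mean(accs), np.std(accs)

Calling evaluate(features, labels, n_train_per_class=80), for instance, mirrors the 80-image setting reported in Table 2.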

We first evaluate the overall results of the proposed representation and compare them with several state-of-the-art approaches. Table 2 summarizes the overall accuracy and standard deviation over all classes of the UCM Land Use dataset³.


Figure 4: Scene categories from the NWPU-RESISC45 dataset: (1st row) airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland; (2nd row) cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor; (3rd row) industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass; (4th row) palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice; (5th row) ship, snowberg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station, wetland.

As can be seen from the table, the ResNet-TP based representation shows very competitive performance with different numbers of training images and is significantly better than the other representations when the training images are limited. The approach ResNet152_EMR [18] is also a ResNet based representation, but it is generated by combining information from multiple layers with a larger input image size (320×320). When its input image size is set to 224×224 pixels, its classification accuracy is 98.38%, which is inferior to ours. We believe that the ResNet-TP based representation is also complementary to these mixed-resolution methods, since they focus on different levels of information; this will be examined in future work.

³ Here we only compare results from a single representation. We believe that fusion with other features would lead to better performance; however, this is not within the scope of this work.


Table 2: Overall accuracies and standard deviations (%) of the proposed method and the state of the art on the UCM dataset. '-' indicates that the result is not available in the corresponding paper.

Method               | 5 images      | 50 images     | 80 images
SSEP [28]            | 65.34 ± 2.01  | -             | -
MKL [29]             | 64.78 ± 1.62  | 88.68 ± 1.10  | 91.26 ± 1.17
SPP-net MKL [7]      | 75.33 ± 1.86  | 95.72 ± 0.50  | 96.38 ± 0.92
AlexNet-SPP-SS [9]   | -             | -             | 96.67 ± 0.94
VGG-16 [10]          | -             | 94.14 ± 0.69  | 95.21 ± 1.20
ResNet50 [8]         | -             | -             | 98.50 ± 1.40
ResNet152_EMR [18]   | -             | -             | 98.90
ResNet-TP-101        | 78.12 ± 2.28  | 98.03 ± 0.39  | 98.71 ± 0.49

Network Depth and Pathways: Recall that ResNet-TP is constructed by stacking building blocks, based on the different block settings shown in Table 1. In Table 3 we list the classification accuracy versus the network depth and pathway. Using the ResNet-TP-18 model, we can already obtain an accuracy of 98.26%, and adding more layers further boosts the accuracy. We also analyze the influence of the pathways. The network with only the stream of dilated convolutions (conv5_2_x and its subsequent layers) achieves higher performance than that with only conv5_1_x under the 18- and 101-layer settings, and lower performance under the 34- and 50-layer settings. We conjecture that a single pathway may not be the one at which the network responds with optimal confidence, and that context aggregation with multiple pathways increases robustness.

Number of Training Images: We also evaluate the effect of the number of training images on the representation. We observe significant performance gains when the number of training images increases from 10 to 50 per class, after which the performance tends to saturate. Another observation is that the result of ResNet-TP-50 is similar to the accuracy of ResNet-TP-101 in most of the comparisons, indicating that computation could be saved by using ResNet-TP-50 with only a marginal performance drop.


Table 3: Evaluation on the UCM dataset with different pathways of the ResNet-TP models. conv5_1_x and conv5_2_x indicate the networks with only the corresponding single stream of ResNet-TP. The number of training images is 80.

Model          | Full          | conv5_1_x only | conv5_2_x only
ResNet-TP-18   | 98.26 ± 0.74  | 97.80 ± 0.77   | 97.83 ± 0.90
ResNet-TP-34   | 98.43 ± 0.68  | 98.37 ± 0.54   | 98.06 ± 0.73
ResNet-TP-50   | 98.56 ± 0.53  | 98.49 ± 0.63   | 98.15 ± 0.62
ResNet-TP-101  | 98.71 ± 0.49  | 98.50 ± 0.59   | 98.57 ± 0.52


Figure 5: Evaluation of the ResNet-TP models with different numbers of training images on the UCM dataset.

3.2. NWPU-RESISC45

The NWPU-RESISC45 dataset contains 31500 remote sensing images extracted from Google Earth, covering more than 100 countries and regions. Each scene class is composed of 700 images with spatial resolutions varying from about 30 m to 0.2 m per pixel. Sample images and the scene categories are shown in Fig. 4, and we again warp the images to a size of 224×224. We follow the official train/test split strategy with two training ratios, i.e., 10% (10% for training and 90% for testing) and 20% (20% for training and 80% for testing). We repeat the evaluations ten times under each training ratio by randomly splitting the dataset and report the mean accuracy and standard deviation.


Table 4: Overall accuracies and standard deviations (%) of the proposed method and the state of the art under different training ratios on the NWPU-RESISC45 dataset. The results of the pre-trained and fine-tuned ConvNets are reported in [16].

Feature               | Network        | 10%           | 20%
Pre-trained ConvNet   | AlexNet        | 81.22 ± 0.19  | 85.16 ± 0.18
Pre-trained ConvNet   | GoogleNet      | 87.15 ± 0.45  | 90.36 ± 0.18
Pre-trained ConvNet   | VGG-16         | 82.57 ± 0.12  | 86.02 ± 0.18
Fine-tuned ConvNet    | AlexNet        | 81.22 ± 0.19  | 85.16 ± 0.18
Fine-tuned ConvNet    | GoogleNet      | 87.15 ± 0.45  | 90.36 ± 0.18
Fine-tuned ConvNet    | VGG-16         | 82.57 ± 0.12  | 86.02 ± 0.18
BoCF [17]             | AlexNet        | 55.22 ± 0.39  | 59.22 ± 0.18
BoCF [17]             | GoogleNet      | 78.92 ± 0.17  | 80.97 ± 0.17
BoCF [17]             | VGG-16         | 82.65 ± 0.31  | 84.32 ± 0.17
Triplet networks [19] | -              | -             | 92.33 ± 0.20
This work             | ResNet-TP-18   | 87.79 ± 0.28  | 91.03 ± 0.26
This work             | ResNet-TP-101  | 90.70 ± 0.18  | 93.47 ± 0.26

We compare the ResNet-TP based representation with several baselines and state-of-the-art approaches. Among them, the first group contains several well-known baseline descriptors, including pre-trained or fine-tuned AlexNet [2], GoogleNet [5], and VGG-16 [4]. Table 4 shows that the proposed representation outperforms all the baseline descriptors as well as the state-of-the-art approaches shown in the middle part of Table 4, including the Bag of Convolutional Features (BoCF) [17] and a very recent triplet network [19]. In Fig. 6, we report the confusion matrix and the per-class classification accuracy using a training ratio of 20%.

Network and Training Ratio: We study the performance of different network settings and training ratios; the results are given in Fig. 7.



Figure 6: Confusion matrices under the training ratios of 10% (top) and 20% (bottom) using ResNet-TP-101 on the NWPU-RESISC45 dataset.

Similar to the observation on the UCM dataset, adding layers to the ResNet-TP architecture and aggregating context with the two pathways boost the classification accuracy. In addition, we observe that increasing the number of training images (from 70 to 140 per class) leads to significant performance gains, which differs slightly from the saturation seen on UCM Land Use, probably due to the scene variation and data diversity of the NWPU-RESISC45 dataset.

Pre-Trained vs. Fine-Tuned: Our last experiment evaluates an alternative method for ResNet-TP model generation. While the fine-tuned method follows the pipeline of Fig. 1, the pre-trained approach consists only of Phases 1 and 3 in the figure, with the model parameters learned directly from ImageNet.


Figure 7: Evaluation of the ResNet-TP parameters and components with different training ratios on the NWPU-RESISC45 dataset.


Figure 8: Comparison of the pre-trained and fine-tuned ResNet-TP models on the NWPU-RESISC45 dataset. TR indicates the training ratio.

The comparison between the curves in Fig. 8 verifies that, for both training ratios, using the fine-tuned network is important for obtaining a more discriminative representation. We also find that the fine-tuned method outperforms the pre-trained method even when its training set is only half the size of that used for the pre-trained one.

4. Conclusion

In this work, we have introduced ResNet-TP, a deeper convolutional network with context aggregation that generates a discriminative representation for satellite image scene classification. Through empirical scene classification experiments, we have shown that the proposed ResNet-TP based representation is more effective than previous deep features, yielding very competitive results on the UCM Land Use and NWPU-RESISC45 datasets. For future work, we plan to incorporate multiple scales and multiple layers into the ResNet-TP based representation, and to explore the performance benefits of combining this representation with other features.

References

[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[2] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097-1105.
[3] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9) (2015) 1904-1916.
[4] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations (ICLR), 2015.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255.
[7] Q. Liu, R. Hang, H. Song, Z. Li, Learning multiscale deep features for high-resolution satellite image scene classification, IEEE Transactions on Geoscience and Remote Sensing PP (99) (2017) 1-10.
[8] G. J. Scott, M. R. England, W. A. Starms, R. A. Marcum, C. H. Davis, Training deep convolutional neural networks for land-cover classification of high-resolution imagery, IEEE Geoscience and Remote Sensing Letters 14 (4) (2017) 549-553.
[9] X. Han, Y. Zhong, L. Cao, L. Zhang, Pre-trained AlexNet architecture with pyramid pooling and supervision for high spatial resolution remote sensing image scene classification, Remote Sensing 9 (8) (2017) 848.
[10] G. S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, X. Lu, AID: A benchmark data set for performance evaluation of aerial scene classification, IEEE Transactions on Geoscience and Remote Sensing 55 (7) (2017) 3965-3981.
[11] F. P. S. Luus, B. P. Salmon, F. van den Bergh, B. T. J. Maharaj, Multiview deep learning for land-use classification, IEEE Geoscience and Remote Sensing Letters 12 (12) (2015) 2448-2452.
[12] B. Zhao, B. Huang, Y. Zhong, Transfer learning with fully pretrained deep convolution networks for land-use classification, IEEE Geoscience and Remote Sensing Letters 14 (9) (2017) 1436-1440.
[13] X. Han, Y. Zhong, L. Cao, L. Zhang, Pre-trained AlexNet architecture with pyramid pooling and supervision for high spatial resolution remote sensing image scene classification, Remote Sensing 9 (8).
[14] F. Hu, G.-S. Xia, J. Hu, L. Zhang, Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery, Remote Sensing 7 (11) (2015) 14680-14707.
[15] K. Nogueira, O. A. Penatti, J. A. dos Santos, Towards better exploiting convolutional neural networks for remote sensing scene classification, Pattern Recognition 61 (2017) 539-556.
[16] G. Cheng, J. Han, X. Lu, Remote sensing image scene classification: Benchmark and state of the art, Proceedings of the IEEE 105 (10) (2017) 1865-1883.
[17] G. Cheng, Z. Li, X. Yao, L. Guo, Z. Wei, Remote sensing image scene classification using bag of convolutional features, IEEE Geoscience and Remote Sensing Letters 14 (10) (2017) 1735-1739.
[18] G. Wang, B. Fan, S. Xiang, C. Pan, Aggregating rich hierarchical features for scene classification in remote sensing imagery, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (9) (2017) 4104-4115.
[19] Y. Liu, C. Huang, Scene classification via triplet networks, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing PP (99) (2017) 1-18.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: ACM International Conference on Multimedia, 2014, pp. 675-678.
[21] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[22] F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, in: International Conference on Learning Representations (ICLR), 2016.
[23] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, arXiv preprint arXiv:1606.00915.
[24] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: IEEE International Conference on Computer Vision (ICCV), 2017.
[25] Y. Zheng, H. Ye, L. Wang, J. Pu, Learning multiviewpoint context-aware representation for RGB-D scene classification, IEEE Signal Processing Letters 25 (1) (2018) 30-34.
[26] A. Gupta, A. M. Rush, Dilated convolutions for modeling long-distance genomic dependencies, arXiv preprint arXiv:1710.01278.
[27] Y. Yang, S. D. Newsam, Bag-of-visual-words and spatial extensions for land-use classification, in: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010, pp. 270-279.
[28] W. Yang, X. Yin, G. S. Xia, Learning high-level features for satellite image classification with limited labeled samples, IEEE Transactions on Geoscience and Remote Sensing 53 (8) (2015) 4472-4482.
[29] C. Cusano, P. Napoletano, R. Schettini, Remote sensing image classification exploiting multiple kernel learning, IEEE Geoscience and Remote Sensing Letters 12 (11) (2015) 2331-2335.