
Ground Control to Major Tom: the importance of field surveys in remotely sensed data analysis

Ian W. Bolliger, University of California, Berkeley, Berkeley, CA, [email protected]

Tamma Carleton, University of California, [email protected]

Solomon Hsiang, University of California, [email protected]

Jonathan Kadish, University of California, [email protected]

Jonathan Proctor, University of California, [email protected]

Benjamin Recht, University of California, [email protected]

Esther Rolf, University of California, esther [email protected]

Vaishaal Shankar, University of California, [email protected]

1. BACKGROUND

Recent increases in the availability of remotely sensed data have created a remarkable opportunity to measure fundamental indicators of human well-being at a global scale. These new data sources, such as satellite-derived climate variables and remote sensing imagery, are particularly valuable in locations where direct on-site data are rarely collected, such as developing countries and regions affected by conflict. Increasingly high-resolution imagery provides rich information on a host of landscape characteristics that may be correlated with various social and economic outcomes of policy and research interest. A growing set of approaches is being developed to condense data contained within these images into meaningful and comparable metrics for prediction of socioeconomic and development indicators. These advances open a window to the construction and analysis of globally comprehensive datasets that shed light on factors of living conditions previously outside the reach of standard data collection efforts. Such approaches for transforming imagery into meaningful social data have been designed to predict outcomes such as: socioeconomic class in Lima, Peru [TATCZ11]; quality of life in cities in Indiana [LW07, JGBH05] and Georgia [LF97]; and most recently, household consumption and assets in five African countries [JBX+16].

These existing remote sensing approaches are highly specialized; researchers develop application-specific techniques to predict socioeconomic outcomes in a location of interest, relying on available observational data. This generates two key limitations. First, ground truth data are often sparse in places where remote sensing tools are of unique value. The predictive skill of these techniques is constrained by available data, and returns on increases to the quantity or quality of observational data remain unknown. Second, findings are heavily dependent on the method used to transform imagery into usable image features for prediction. For example, remote sensing imagery applications often treat the pixel as the observational unit, but recent advances in machine learning algorithms and computational power have enabled information on the spatial structure of an entire image to be extracted [LZZ+14].

Bloomberg Data for Good Exchange Conference. 24-Sep-2017, Chicago, IL, USA.

In this project, we build a modular, scalable system that can collect, store, and process millions of satellite images. We test the relative importance of both of the key limitations constraining the prevailing literature by applying this system to a data-rich environment. To overcome classic data availability concerns, and to quantify their implications in an economically meaningful context, we operate in a data-rich environment and work with an outcome variable directly correlated with key indicators of socioeconomic well-being. We collect public records of sale prices of homes within the United States, and then gradually degrade our rich sample in a range of different ways which mimic the sampling strategies employed in actual survey-based datasets. Pairing each house with a corresponding set of satellite images, we use image-based features to predict housing prices within each of these degraded samples. To generalize beyond any given featurization methodology, our system contains an independent featurization module, which can be interchanged with any preferred image classification tool. We test this system under five realistic sampling strategies, and with three image classification techniques. Our initial findings demonstrate that while satellite imagery can be used to predict housing prices with considerable accuracy, the size and nature of the ground truth sample are a fundamental determinant of the usefulness of imagery for this category of socioeconomic prediction. We quantify the returns to improving the distribution and size of observed data, and show that the image classification method is a second-order concern. Our results provide clear guidance for the development of adaptive sampling strategies in data-sparse locations where satellite-based metrics may be integrated with standard survey data, while also suggesting that advances from image classification techniques for satellite imagery could be further augmented by more robust sampling strategies.

arXiv:1710.09342v2 [cs.CY] 4 Apr 2018

https://creativecommons.org/licenses/by/4.0/

2. OUR APPROACH

Wealth estimation is a widely applicable and generally difficult problem in remote sensing [JBX+16, HSW12]. To execute this task in a data-rich setting, we use a dataset of all home sales in Arizona from 2010 to 2016, pulling georeferenced data from public records and pairing each home with associated aerial satellite imagery. We link home prices with images by associating every satellite image with the average log price of homes which fall within the image. To capture meaningful variation in housing prices while keeping the model simple, we classify a satellite image into one of three regions of the Arizona home price distribution (see Fig. 1).
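The labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `label_tiles`, the tile identifiers, and the quantile cutoffs `low_q` and `high_q` are all assumptions, since the text does not state where the three regions of the price distribution are cut.

```python
import numpy as np

def label_tiles(tile_ids, prices, low_q=0.25, high_q=0.75):
    """Average the log sale price per image tile, then bin tiles into 3 classes.

    low_q and high_q are hypothetical cutoffs chosen for illustration only.
    """
    tile_ids = np.asarray(tile_ids)
    log_prices = np.log(np.asarray(prices, dtype=float))

    # Group homes by the tile they fall in and average their log prices.
    unique_tiles, inverse = np.unique(tile_ids, return_inverse=True)
    sums = np.bincount(inverse, weights=log_prices)
    counts = np.bincount(inverse)
    mean_log_price = sums / counts

    # Class 0 = low, 1 = middle, 2 = high region of the distribution.
    lo, hi = np.quantile(mean_log_price, [low_q, high_q])
    labels = np.digitize(mean_log_price, [lo, hi])
    return unique_tiles, mean_log_price, labels

tiles, mean_lp, labels = label_tiles(
    ["a", "a", "b", "c", "d"], [100, 100, 200, 300, 400]
)
```

Averaging in log space keeps a single very expensive home from dominating the label of its tile, which is presumably one reason the text averages log prices rather than raw prices.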

We simulate real-world data collection processes that are common in data-scarce regions by sub-sampling our dataset of all home prices under various sampling strategies (see Fig. 2(m:p)). These sampling strategies are designed to reflect how actual surveys are conducted. For example, the Demographic and Health Survey (DHS) and the Living Standards Measurement Survey (LSMS) both survey individuals using a cluster-based survey approach (we call this “cluster” sampling). Administrative borders with differential data collection efforts on either side often generate samples with abrupt geographic boundaries (we call this “latitudinally stratified” or “longitudinally stratified” sampling). Using each of these sampling strategies, we then compare prediction performance of models trained on each sub-sampled dataset. The strategies we consider for this work are:

1. Uniform at random (UAR) sampling. Samples are collected uniformly at random from the distribution of homes in the area of interest (AOI).

2. Cluster sampling. Samples are collected from a pre-specified number of clusters geographically spread out in the AOI.

3. Latitudinally stratified sampling. Samples are collected north or south of a particular latitude line, but uniform across longitudes in the AOI.

4. Longitudinally stratified sampling. Samples are collected west or east of a particular longitude line, but uniform across latitudes in the AOI.
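The four strategies listed above can be sketched with NumPy index sampling. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the cluster construction (random homes as centers, then the homes nearest to any center) is only one plausible reading, since the text does not specify how clusters are formed. Distances are computed in raw degrees for simplicity, which is not a true geographic distance.

```python
import numpy as np

def sample_uar(n_homes, n, rng):
    """Uniform at random: every home is equally likely to be drawn."""
    return rng.choice(n_homes, size=n, replace=False)

def sample_cluster(lat, lon, n, rng, n_clusters=4):
    """Hypothetical cluster scheme: keep the n homes closest to any center."""
    centers = rng.choice(len(lat), size=n_clusters, replace=False)
    d2 = (lat[:, None] - lat[centers]) ** 2 + (lon[:, None] - lon[centers]) ** 2
    return np.argsort(d2.min(axis=1))[:n]

def sample_stratified(coord, n, rng, cutoff, keep_below=True):
    """Latitudinal (coord=lat) or longitudinal (coord=lon) stratification.

    Homes on one side of the cutoff line are eligible; draws are uniform
    among them. Requires n to be no larger than the eligible pool.
    """
    mask = coord < cutoff if keep_below else coord >= cutoff
    eligible = np.flatnonzero(mask)
    return rng.choice(eligible, size=n, replace=False)

rng = np.random.default_rng(0)
lat = rng.uniform(31.0, 37.0, size=1000)     # synthetic coordinates
lon = rng.uniform(-115.0, -109.0, size=1000)

idx_uar = sample_uar(len(lat), 50, rng)
idx_cluster = sample_cluster(lat, lon, 50, rng, n_clusters=4)
idx_south = sample_stratified(lat, 50, rng, cutoff=34.0)
```

Each function returns indices into the full dataset, so the same featurized images can be reused across strategies and only the training subset changes.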

Image classification is a well-studied problem in computer vision, and the current state-of-the-art algorithms for this task are Convolutional Neural Networks (CNNs). To generalize our findings beyond a single classification technique, we use three distinct baseline approaches:

1. GIST descriptor, a low-dimensional representation of images specialized for “scene” descriptions [OT01].

2. Pre-trained CNN (the VGG network), trained on a data set of 1.2 million natural images from 1000 categories. The penultimate layer of this CNN provides a succinct representation of images that works well for many classification tasks [SZ14, HAE16, OBLS14].

3. Random CNN (Coates-Ng). Random convolutional networks, for which the filters are sampled from a pre-specified distribution, tend to work very well for certain tasks [CN12, MSP+17].
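To make the third approach concrete, the following is a minimal single-layer sketch of random convolutional features in NumPy. It is not the authors' pipeline: the standard Gaussian filter distribution, the patch size, the ReLU nonlinearity, and global average pooling are all assumptions made for illustration, and the function name `random_conv_features` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def random_conv_features(images, n_filters=16, patch=6, seed=0):
    """Map images of shape (N, H, W, C) to features of shape (N, n_filters).

    Filters are drawn once from a standard normal distribution (assumed),
    never trained, and shared across all images.
    """
    rng = np.random.default_rng(seed)
    n, h, w, c = images.shape
    filters = rng.standard_normal((n_filters, patch, patch, c))

    # All patch x patch x C windows: (N, H-p+1, W-p+1, 1, p, p, C).
    windows = sliding_window_view(images, (patch, patch, c), axis=(1, 2, 3))
    windows = windows[:, :, :, 0]

    # Correlate every window with every filter, then apply ReLU.
    responses = np.einsum("nxyijk,fijk->nxyf", windows, filters)
    responses = np.maximum(responses, 0.0)

    # Global average pooling gives one number per filter per image.
    return responses.mean(axis=(1, 2))

imgs = np.random.default_rng(1).random((3, 16, 16, 3))  # synthetic images
feats = random_conv_features(imgs)
```

Because the filters depend only on the seed, the featurization needs no training pass and can be applied to each image independently.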

Though neural networks are computationally intensive to train, computing the output of pre-trained or random neural networks is relatively efficient. Thus, restricting our attention to pre-trained and random networks allows us to easily process millions of images in the cloud. Using PyWren [JVSR17], a serverless framework for Python, we can easily scale up our embarrassingly parallel featurization by using AWS Lambda and AWS EC2. With this framework we can process 1 million images in less than 1 hour, which allows us to analyze problems at large spatial extents and fine resolution.
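The featurization is embarrassingly parallel because each batch of images can be processed without reference to any other. The sketch below illustrates that map-over-batches pattern with the Python standard library as a local stand-in; it is not PyWren code and does not touch AWS. The worker `featurize_batch` is a hypothetical placeholder that computes a trivial feature.

```python
from concurrent.futures import ThreadPoolExecutor

def featurize_batch(batch):
    """Hypothetical worker: one toy feature (mean pixel value) per image."""
    return [sum(img) / len(img) for img in batch]

def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_featurize(images, batch_size=2, workers=4):
    """Apply the worker to every batch independently and flatten the results."""
    batches = chunk(images, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so features line up with images.
        results = list(pool.map(featurize_batch, batches))
    return [feature for batch in results for feature in batch]

features = parallel_featurize([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

In a serverless setting the same structure would apply with each batch dispatched to a separate remote function invocation instead of a local thread.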

After featurizing, we solve an L2-regularized linear regression problem to obtain model predictions. Performance is measured with respect to a held-out data set sampled UAR from the state of Arizona. We report performance in terms of mean average precision (MAP), a metric that accounts for both type 1 and type 2 errors in predictions. This metric also corresponds to the area under the precision-recall curve, and is commonly used in the machine learning literature.
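The prediction and scoring steps can be sketched as below. Several details are assumptions rather than statements from the text: regressing onto one-hot class indicators, the regularization strength `lam`, and the particular estimator of average precision (precision averaged over the ranks of the positive examples, one common way to approximate the area under the precision-recall curve).

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Closed-form L2-regularized least squares: (X'X + lam*I)^-1 X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive appears."""
    order = np.argsort(-scores, kind="stable")
    hits = np.asarray(y_true)[order].astype(float)
    if hits.sum() == 0:
        return float("nan")
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(labels, score_matrix, n_classes=3):
    """Average the one-vs-all average precision over classes."""
    aps = [
        average_precision(labels == c, score_matrix[:, c])
        for c in range(n_classes)
    ]
    return float(np.mean(aps))

# Toy check on synthetic, perfectly separable features.
labels = np.array([0, 1, 2, 0, 1, 2])
X = np.eye(3)[labels]          # features equal to one-hot labels
W = ridge_fit(X, np.eye(3)[labels], lam=1.0)
toy_map = mean_average_precision(labels, X @ W)
toy_ap = average_precision(np.array([1, 0, 1, 0]), np.array([0.9, 0.8, 0.7, 0.1]))
```

Scoring each class one-vs-all and then averaging gives rare classes the same weight as common ones, which matters when the tails of the price distribution are the classes of interest.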

3. INITIAL RESULTS

The results of using the sampling schemes and classification techniques defined above are shown in Fig. 2, where each sampling scheme is compared with UAR sampling. The curves show mean average precision as a function of sample size.¹

¹ Between the submission and the final version of this paper we have changed the data to better reflect the actual task at hand. We made sure every point in our test set is at least 100 meters from any point in the training set. This prevents any overlap between satellite images in the test and train sets. This change causes an overall decrease in predictive power, but maintains all claims about sampling strategy in the submission version of the paper.

Figure 1: (a) Distribution of home sale prices in Arizona. Examples of satellite images which are featurized to predict average sale price, representing houses from: (b) the poorest σ of the housing price distribution (class 0); (c) the middle of the housing price distribution (class 1); and (d) the richest σ of the housing price distribution (class 2).

Results obtained from UAR sampling (shown in blue in Figs. 2(a:l)) are consistently better than those obtained by the other sampling strategies. The dashed blue lines represent the performance of the model trained on 5,000 UAR-sampled training points. In many comparisons, other sampling strategies require many times more samples to reach the 5,000-point performance of UAR. Some strategies (e.g., latitudinal) perform nearly as well as UAR at low sample numbers, yet additional data provide a smaller marginal benefit, resulting in larger performance differences at larger sample sizes. In some cases, e.g., cluster sampling with 200 clusters using VGG features, we require more than 40,000 examples (more than 8x the UAR examples) to reach the performance of using 5,000 UAR-sampled data points. In fact, the leveling off of the curves (exemplified by the cluster sampling data with 4 clusters) indicates that there may be fundamental limitations to the prediction power we can achieve with certain strategies, and furthermore, that these asymptotes are significantly lower than the prediction power obtainable from UAR-sampled data.
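The "how many more samples" comparison above amounts to reading a learning curve at a target score. The sketch below shows one way to do that by linear interpolation. The helper name `samples_to_reach` is hypothetical, and the curve values are synthetic placeholders invented for illustration; they are not read off Fig. 2.

```python
import numpy as np

def samples_to_reach(target, sizes, scores):
    """Smallest sample size at which a learning curve reaches `target`.

    Interpolates linearly between measured points and returns infinity
    if the curve plateaus below the target.
    """
    sizes = np.asarray(sizes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if scores.max() < target:
        return float("inf")
    k = int(np.argmax(scores >= target))  # first point at or above target
    if k == 0:
        return float(sizes[0])
    frac = (target - scores[k - 1]) / (scores[k] - scores[k - 1])
    return float(sizes[k - 1] + frac * (sizes[k] - sizes[k - 1]))

# Synthetic curves, for illustration only.
sizes = [5_000, 20_000, 40_000, 60_000]
rising = [0.30, 0.36, 0.40, 0.42]
plateau = [0.30, 0.33, 0.35, 0.35]

needed = samples_to_reach(0.41, sizes, rising)
ratio = needed / 5_000            # multiples of the 5,000-sample UAR budget
never = samples_to_reach(0.41, sizes, plateau)
```

A return value of infinity corresponds to the plateau case described in the text, where no amount of additional data from that strategy reaches the UAR benchmark.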

Fig. 3 displays prediction performance broken down by class label, where 100,000 data points are sampled according to each sampling scheme. As expected, UAR outperforms all other sampling schemes, with cluster sampling with 4 centers performing worst. Within sampling schemes, the consistent discrepancy in performance for the middle class as opposed to the low and high classes can be explained as an effect of class imbalance in the linear solve.

One striking feature of this breakdown of performance is that most of the variation between sampling schemes occurs within the tails of the distribution (the ‘Low’ and ‘High’ classes). One compelling explanation for this trend is that sampling schemes which expose a greater variety of data result in higher predictive power in these smaller label classes. This would explain why UAR performs best, followed by schemes that sample over a wide spatial range (clusters with 200 centers, longitudinal, and latitudinal), and lastly followed by the scheme which has the most limited range of spatially distributed examples (clusters with 4 centers).

In sum, these results imply that the sampling scheme is an important factor to consider in imagery prediction tasks, especially when making predictions over subpopulations.

4. DISCUSSION

In most machine learning problems, including image classification tasks, data are treated as if generated from an independent and identically distributed process. However, in reality, most remote sensing tasks use observations that are generated from non-random, non-uniform processes, such as cluster sampling or other forms of geographic stratification. In locations where remote sensing is uniquely valuable, cluster sampling is particularly common. For example, the DHS serves as the predominant source of health data covering developing countries, and conducts surveys using censuses of village clusters, as do the World Bank’s LSMS surveys. Non-uniform sampling is generally unavoidable in remote sensing applications: researchers can only train models using limited available ground observations.

By sub-sampling a dataset of all home sale prices in Arizona, we quantify the loss in predictive power suffered under non-uniform sampling schemes. Our results highlight the substantial gains from uniform random sampling, regardless of the featurization applied to the raw satellite imagery. Moreover, we demonstrate that gains resulting from increased data collection, while large, are often dwarfed by gains realizable from modifying sampling design. These preliminary findings suggest that the pragmatic benefits of cluster sampling or other geographic stratification methods (such as minimizing transportation costs) should be carefully weighed against the substantial performance costs of forgoing uniform sampling. With further testing of this model both within and outside the United States, our approach has the potential to directly inform the development of adaptive sampling strategies that can effectively combine small amounts of ground-based observation with spatially comprehensive, high-resolution satellite data.

[Figure 2 image: grid of panels (a) through (p). Row labels: GIST descriptor; VGG bottleneck features; Random convolutional neural network. Column titles: Cluster sampling with 4 clusters; Cluster sampling with 200 clusters; Latitudinally stratified; Longitudinally stratified. Panels (a) through (l) plot MAP against sample size (0 to 100,000) for UAR and for the column's sampling strategy; an annotation reads "0.41: UAR score for 5,000 samples". Panels (m) through (p) are maps with axes Longitude and Latitude.]

Figure 2: The effects of sampling strategy, featurization technique, and sample size on the predictive power of satellite imagery. Panels (a) through (d) use a GIST descriptor; panels (e) through (h) use VGG bottleneck features; panels (i) through (l) use a random convolutional neural network. Columns show different sampling strategies. Panels (m) through (p) show the houses that were sampled (red) and all non-sampled houses in the corpus of AZ house prices (black).

[Figure 3 image: "Class-specific auPRC by sampling scheme (Random conv net)". Vertical axis: Single class auPRC. Horizontal axis: Sampling scheme. Legend: Low (class 0); Middle (class 1); High (class 2).]

Figure 3: Performance of satellite imagery classification task using a random convolutional neural network across different sampling schemes. One vs. all binary prediction performance is compared for houses in each of the three label classes: the lowest σ of the sale price distribution (green), the highest σ (yellow), and the remaining center of the distribution (grey).

These initial findings represent a single application of a broader, scalable system for collection, storage, and aggregation of a large number of high-resolution satellite images. We plan to apply this scalable system to study the generalizability of our initial findings to other regions of the United States, and eventually to the international contexts where traditional remote sensing applications to economic development questions proliferate. In future work, we seek to create policy-relevant output that guides the development of optimal adaptive sampling strategies, where the location-specific, on-the-ground costs of implementing a given strategy are weighed against the benefits of improving predictive power.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE 1106400.

5. REFERENCES

[CN12] Adam Coates and Andrew Y Ng. Learning feature representations with k-means. In Neural networks: Tricks of the trade, pages 561–580. Springer, 2012.

[HAE16] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.

[HSW12] J Vernon Henderson, Adam Storeygard, and David N Weil. Measuring economic growth from outer space. American Economic Review, 102(2):994–1028, 2012.

[JBX+16] Neal Jean, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, August 2016.

[JGBH05] Ryan Jensen, Jay Gatrell, Jim Boulton, and Bruce Harper. Using Remote Sensing and Geographic Information Systems to Study Urban Quality of Life and Urban Forest Amenities. Ecology and Society, 9(5), January 2005.

[JVSR17] Eric Jonas, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht. Occupy the cloud: Distributed computing for the 99%. arXiv preprint arXiv:1702.04024, 2017.

[LF97] C. P. Lo and Benjamin J. Faber. Integration of landsat thematic mapper and census data for quality of life assessment. Remote Sensing of Environment, 62(2):143–157, November 1997.

[LW07] G. Li and Q. Weng. Measuring the quality of life in city of Indianapolis by integration of remote sensing and census data. International Journal of Remote Sensing, 28(2):249–267, January 2007.

[LZZ+14] Miao Li, Shuying Zang, Bing Zhang, Shanshan Li, and Changshan Wu. A Review of Remote Sensing Image Classification Techniques: the Role of Spatio-contextual Information. European Journal of Remote Sensing, 47(1):389–411, January 2014.

[MSP+17] Alyssa Morrow, Vaishaal Shankar, Devin Petersohn, Anthony Joseph, Benjamin Recht, and Nir Yosef. Convolutional kitchen sinks for transcription factor binding site prediction. arXiv preprint arXiv:1706.00125, 2017.

[OBLS14] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1717–1724, 2014.

[OT01] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3):145–175, 2001.

[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[TATCZ11] Francisco J. Tapiador, Silvania Avelar, Carlos Tavares-Correa, and Rainer Zah. Deriving fine-scale socioeconomic information of urban areas using very high-resolution satellite imagery. International Journal of Remote Sensing, 32(21):6437–6456, November 2011.
