Open Data Innovation: Building on Open Data Sets for Innovative Applications

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Jed Sundwall

Open Data Technical Business Development Manager

jed@amazon.com

Agenda

• Open data on AWS overview

– Why open data matters to AWS

• Landsat on AWS

– The newest AWS public dataset

• Frank Warmerdam from Planet Labs

– Open data in the geospatial world

• What’s next for open data on AWS

Open Data on AWS

What is open data?

Open data is data that can be used by anyone for any purpose for free.

Many of our customers, such as Esri, the Weather Company, and the

Climate Corporation, rely on quality open data as much as they rely on our

computing, storage, and other web services.

Open data on AWS

Amazon Web Services provides a comprehensive toolkit for gathering,

storing, analyzing, and working with data at any scale.

Amazon Elastic MapReduce

(Amazon EMR) provides the

Apache Hadoop analytics

framework as an easy-to-use

managed service.

Amazon S3 lets you store

and retrieve any amount of

data, at any time, from

anywhere on the web.

Amazon DynamoDB is a

fully-managed NoSQL

database service that makes

it cost-effective to store and

retrieve any amount of data.

The power of open data on AWS

Making data open on AWS enables more innovation by making data

available for rapid access to our flexible and low-cost computing

resources.

Amazon EC2, Amazon EMR, Amazon Redshift, Amazon DynamoDB, AWS Lambda

The Weather Company saves $1 million per year running its

forecasting application on AWS

The Weather Company provides millions of people

with the world’s best weather forecasts,

content and data, every day.

“Using AWS, TWC can scale as necessary to handle constantly changing workloads and maintain our 11-millisecond response time.”

Bryson Koehler

EVP, CTO, CIO, The Weather Company

• Needed a cost-effective, scalable alternative to operating 13 data centers with legacy systems.

• TWC ingests, stores, and analyzes 4 GB of weather data per second from over 800 sources.

• Designed to handle more than 15 billion API calls each day, at a rate of 150,000 per second.

• Reduced its on-premises IT environment from 13 to 6 data centers.

Open data as a platform

[Diagram labels: Data Creation, Data Enrichment, Sensemaking; Data at Rest (object storage), Basic APIs, Complex APIs; Consumer applications, Algorithmic policy, Data-driven journalism, Data Catalogs, Focused data dashboards, Predictive modeling, Visualizations; Lower cost of knowledge (Efficiency)]

Open data as a platform

[Diagram labels, AWS services shown alongside the Data Enrichment and Sensemaking stages: Amazon Kinesis, Amazon EC2, AWS Data Pipeline, Amazon S3, Amazon RDS, Amazon EMR, Amazon Redshift, Amazon DynamoDB, AWS Lambda]

Moovit: Smart Public Transportation

• Mobile app turns bus and

train riders into real-time

sensors for city government

• Integrates with city back-end

systems to improve both

service and rider experience

• Powered worldwide by the

AWS cloud

Singapore government

• First government-wide national intelligent map portal

– Integrated map system that delivers location-based services and information to government agencies and citizens

– Powers over 100 government GIS websites and applications

– Reduced costs by 60%

“AWS has helped my organization

to provide better service availability

and handle higher traffic load at a

lower cost.” —Chan Chin Wai, Chief Information Officer

Singapore Land Authority

Landsat on AWS

Public datasets on AWS

To enable more innovation, AWS hosts a selection of datasets that anyone

can access for free. Data in our public datasets is available for rapid

access to our flexible and low-cost computing resources.

Earth Science

NASA Earth Exchange

(NASA NEX)

Life Sciences

1000 Genomes Project

Internet Science

Common Crawl Corpus

Landsat

The Landsat program is a joint effort

of the U.S. Geological Survey and

NASA. It is the longest running

program to gather Earth imagery

from space and is considered the

gold standard for natural resources

satellite imagery.

Landsat is big open data

It has traditionally been time-

consuming and expensive to

acquire, store, and analyze Landsat

data.

Landsat on AWS

We have committed to making up to

1 petabyte of Landsat imagery

readily available as objects on

Amazon S3.

Now, anyone can analyze Landsat

data at web scale with no significant

up-front investment of time or capital

expense.

Esri—Unlock Earth’s Secrets

Esri has created a tool to show how

ArcGIS Online can quickly visualize

Landsat data for live visualization and

analysis within the browser.

“These are not pre-generated cache

services limited to just visualization—

they are dynamic, high-performance

image services that perform on-the-

fly processing and dynamic

mosaicking of Landsat’s multi-

spectral and multi-temporal imagery.”

http://www.esri.com/landsatonaws

landsat-util

Landsat on AWS helped

Development Seed make

optimizations that make landsat-util

over 2× faster and allow for more

functionality.

https://developmentseed.org/blog/2015/03/19/aws-landsat-archive/

Landsat-live

Mapbox created Landsat-live, a map

that is constantly refreshed with the

latest satellite imagery from NASA’s

Landsat 8 satellite.

Creating a live Earth imagery

pipeline is possible because Landsat

imagery is available on Amazon S3

within hours of creation.

https://www.mapbox.com/blog/landsat-live-live/

MATLAB—Landsat8 Data Explorer

MathWorks created a freely downloadable MATLAB-based tool for accessing, processing, and visualizing Landsat 8 data.

The tool allows MATLAB users to

find Landsat 8 scenes, analyze

them, and combine them with other

sources of GIS data for new

visualizations.

http://blogs.mathworks.com/steve/2015/03/19/matlab-landsat-8-aws/

Frank Warmerdam

Geospatial Software Developer, Planet Labs

frank@planet.com

www.planet.com

• Geospatial software developer for 20 years

• PCI, independent consultant, Google, Planet Labs

• OSGeo/open source/open data

• Not really very “sciency”

• Working on the “data pipeline” team

Frank Warmerdam

Why not?

• Costly to collect

• Hard to control

Why?

• Open datasets are an enabler for innovation

• Making data open ensures optimum utilization

• Geodata (images, maps) are common heritage

– More like science than art or literature

Open geo data—why?

• Free by default!

• TIGER/Line

– National roads from US Census Bureau

– Base of many commercial roadmaps (Google, etc.)

• NAIP

– 2 m resolution air photos of continental US

– Base for many commercial image maps

• Landsat

– 30 m images of the world for 30+ years

• NASA/NOAA/USGS

– Science data

– Weather data

– Geological data

Open geo data—USA

We consume:

• Landsat8 PAN

• Landsat8 RGB

• NAIP

• CGIAR DEM

• SRTM90

• SRTM30

• NED

• Open Street Map

• NOAA cloud predictions

Open geo data @ Planet Labs

• Format conversions

• Slow servers (e.g., USGS, lots of 503s)

• Incomplete mirrors (e.g., missing Landsat updates)

• Dynamic datasets require constant monitoring

• Storage is costly

• Such a waste of bandwidth!

Why not share one copy?

Ingesting is a hassle

• Mapbox loaded recent NAIP data on Amazon S3

• Offered to Mark Korver at Amazon Web Services

• Mark put in an AWS-funded Amazon S3 bucket

• Available with “requester pays” for network egress

Planet Labs attaches to this NAIP data

• Reference from the foreign Amazon S3 bucket

• Need to sign all requests (for requester pays)

• /vsicurl/ works (used to get footprint cheaply)

• Succeeded in building 4.7m NAIP mosaic of CONUS!

One example: NAIP
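The requester-pays point above can be sketched in Python. This is a minimal illustration, not the code Planet Labs used: it assumes the boto3 library and configured AWS credentials, and the bucket name and key below are hypothetical because the slides do not name the NAIP bucket. The `RequestPayer="requester"` argument tells S3 that the caller accepts the transfer charges; boto3 signs the request with the caller's credentials.

```python
def requester_pays_request(bucket: str, key: str) -> dict:
    """Build get_object arguments that acknowledge requester-pays charges."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def fetch_object(bucket: str, key: str) -> bytes:
    """Download one object; needs boto3 and AWS credentials to sign the request."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_request(bucket, key))
    return response["Body"].read()


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    print(requester_pays_request("example-naip-bucket", "example/tile.tif"))
```

Without the requester-pays acknowledgement, S3 rejects reads from a requester-pays bucket, which is why every request has to be signed.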

• AWS provides up to 1 PB of S3 storage

• AWS provides free network egress

• Mapbox (Charlie and Amit) provides USGS pull library and expertise

• Planet Labs writes ingestion scripts

• Planet Labs provides Amazon EC2 workers for ingestion

• Updating every two hours

• All scenes from January 2015 on

• Selective backfill from 2014 and earlier

Landsat on AWS

• TAR files split into internally compressed TIF

• External overviews

• Simple HTTP access (no auth)

• /vsicurl/ capable (with caveats)

• _MTL file (soon) also available as .json

• .csv scene list in root of bucket

http://github.com/landsat-pds

https://s3-us-west-2.amazonaws.com/landsat-pds/L8/index.html

Landsat on AWS—organization
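Because access is plain HTTP with no authentication, a band's URL can be built from its scene ID alone. The sketch below assumes the layout `L8/<path>/<row>/<scene ID>/<scene ID>_B<band>.TIF`, inferred from the example URLs in these slides, and assumes the scene ID encodes path, row, year, and day of year at fixed positions. The function names are illustrative.

```python
from datetime import date, timedelta

# Base URL as it appears in the slides.
BASE_URL = "https://s3-us-west-2.amazonaws.com/landsat-pds/L8"


def parse_scene_id(scene_id: str) -> dict:
    """Split a scene ID such as LC81830182014347LGN00 into its fields."""
    if len(scene_id) != 21 or not scene_id.startswith("L"):
        raise ValueError(f"unexpected scene ID: {scene_id!r}")
    year = int(scene_id[9:13])
    day_of_year = int(scene_id[13:16])
    return {
        "path": scene_id[3:6],   # WRS path, zero-padded
        "row": scene_id[6:9],    # WRS row, zero-padded
        "acquired": date(year, 1, 1) + timedelta(days=day_of_year - 1),
    }


def band_url(scene_id: str, band: int) -> str:
    """Return the HTTP URL of one band's GeoTIFF under the assumed layout."""
    fields = parse_scene_id(scene_id)
    return (
        f"{BASE_URL}/{fields['path']}/{fields['row']}/"
        f"{scene_id}/{scene_id}_B{band}.TIF"
    )


if __name__ == "__main__":
    print(parse_scene_id("LC81830182014347LGN00"))
    print(band_url("LC81830182014347LGN00", 11))
```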

• Easy access to desired bands

• Tiling and overviews potentially support

mapping/viewing applications efficiently

• HTTP/VSI Curl support for the win

• Reduce load on USGS

• Mount Amazon S3 bucket via file system on Amazon

EC2 instance

• Open to collaboration and layered tools

Landsat on AWS—advantages
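As a rough illustration of the /vsicurl/ support mentioned above: GDAL can open a remote GeoTIFF over HTTP when the URL is prefixed with `/vsicurl/`, fetching only the byte ranges it needs. The sketch assumes the GDAL Python bindings (`osgeo`) are installed; the helper names are illustrative, and the URL to pass in is whatever band URL you want to read.

```python
def vsicurl_path(url: str) -> str:
    """Prefix an HTTP(S) URL so GDAL reads it through its /vsicurl/ handler."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("expected an http(s) URL")
    return "/vsicurl/" + url


def raster_size(url: str) -> tuple:
    """Open a remote GeoTIFF and return (width, height); requires GDAL."""
    from osgeo import gdal  # lazy import: GDAL is an optional dependency here

    dataset = gdal.Open(vsicurl_path(url))
    if dataset is None:
        raise RuntimeError(f"GDAL could not open {url}")
    return dataset.RasterXSize, dataset.RasterYSize


if __name__ == "__main__":
    print(vsicurl_path("https://example.com/scene_B4.TIF"))
```

Reading only the header and overviews this way is what makes it cheap to inspect a scene without downloading the full file.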

• Ingest into our system “via reference to Amazon S3”

• Successfully used for mosaicking, etc.

• We now track L8 PDS hourly

Landsat on AWS—Planet Labs
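One way to track the bucket on a schedule is to compare the scene list (.csv in the bucket root) against the scenes already ingested. The sketch below is an assumption about how such tracking could work, not Planet Labs' pipeline; the column name `entityId` and the sample rows are hypothetical.

```python
import csv
import io


def new_scene_ids(scene_list_csv: str, already_ingested: set,
                  id_column: str = "entityId") -> list:
    """Return scene IDs from the CSV text that have not been ingested yet."""
    reader = csv.DictReader(io.StringIO(scene_list_csv))
    return [
        row[id_column]
        for row in reader
        if row[id_column] not in already_ingested
    ]


if __name__ == "__main__":
    # Hypothetical scene list content, for illustration only.
    sample = (
        "entityId,cloudCover\n"
        "LC81830182014347LGN00,0.92\n"
        "LC80010012015001LGN00,12.5\n"
    )
    print(new_scene_ids(sample, {"LC81830182014347LGN00"}))
```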

{
  "DATA_TYPE": "L1GT",
  "MTL_link": "https://s3-us-west-2.amazonaws.com/landsat-pds/L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_MTL.txt",
  "cloud_cover": {
    "cloud_mask_link": "https://storage.planet-labs.com/v0/scenes/landsat8_qa/LC81830182014347LGN00_BQA.TIF",
    "estimated": "0.92"
  },
  "derived_from": {
    "input_params": {
      "ARGS": "--next-run"
    },
    "job_url": "https://jobs.planet-labs.com/v0/programs/l8_aws_process/jobs/26996483",
    "process": "l8_aws_process"
  },
  "footprint": {...},
  "index_link": "https://s3-us-west-2.amazonaws.com/landsat-pds/L8/183/018/LC81830182014347LGN00/index.html",
  "pass_at": "2014-12-13 00:00:00",
  "remote_info": {
    "backend": "s3_remote",
    "s3_bucket": "landsat-pds",
    "s3_path": "L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_B11.TIF"
  }
}

Landsat on AWS—Planet Labs
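The record above stores a reference to the imagery rather than a copy. A minimal sketch of resolving that reference follows; it uses a trimmed copy of the record (the elided `footprint` is left out) and assumes the same S3 endpoint that appears in the record's links. The function name is illustrative.

```python
import json

# Trimmed copy of the record from the slide; "footprint" is omitted.
RECORD = """
{
  "DATA_TYPE": "L1GT",
  "cloud_cover": {"estimated": "0.92"},
  "pass_at": "2014-12-13 00:00:00",
  "remote_info": {
    "backend": "s3_remote",
    "s3_bucket": "landsat-pds",
    "s3_path": "L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_B11.TIF"
  }
}
"""


def remote_url(record: dict,
               endpoint: str = "https://s3-us-west-2.amazonaws.com") -> str:
    """Turn a record's remote_info into an HTTP URL for the referenced object."""
    info = record["remote_info"]
    if info["backend"] != "s3_remote":
        raise ValueError(f"unsupported backend: {info['backend']!r}")
    return f"{endpoint}/{info['s3_bucket']}/{info['s3_path']}"


if __name__ == "__main__":
    record = json.loads(RECORD)
    print(remote_url(record))
    # The estimate is stored as a string, so convert before comparing.
    print(float(record["cloud_cover"]["estimated"]))
```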

• Promote use in the community

• Divert existing USGS pullers to this

• Promote integrations

– Development Seed’s landsat-util and libra viewer

– Additional catalog interfaces

– Web map view onto data

• Showcase derivative works (mosaics, etc.)

• More “operators”

Landsat on AWS—future steps

• This is the future!

• Public access (HTTP)

• Preserve source data (pixels and metadata)

• Organize for efficient use

• Keep up to date

• Amazon S3 -> anyone can spin up Amazon EC2 nearby

Other datasets:

• Elevation (Stamen project)

• Planet Labs public datasets (more soonish)

• …

Cloud hosted raw geodata

Landsat on AWS as a platform

What’s Next…

What’s next

• More open data

– If you rely on open data for your work, we want to hear from you

• More services and features

– AWS JavaScript S3 Explorer: a simple JavaScript application for displaying the contents of an Amazon S3 bucket in the browser.

– https://github.com/awslabs/aws-js-s3-explorer

– Roughly 90–95% of our roadmap is driven by what our

customers tell us matters, so tell us at opendata@amazon.com
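For readers curious what a bucket explorer does under the hood, the sketch below builds an S3 list request for a public bucket and parses the XML response into folders and files. It is written in Python for illustration and is not the code of aws-js-s3-explorer; the sample response is made up, and only the XML namespace and element names follow the S3 REST listing format.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(endpoint: str, bucket: str, prefix: str = "") -> str:
    """Build a list request that groups keys into folders using '/'."""
    query = urlencode({"delimiter": "/", "prefix": prefix})
    return f"{endpoint}/{bucket}/?{query}"


def parse_listing(xml_text) -> dict:
    """Extract folder prefixes and object keys from a ListBucketResult."""
    root = ET.fromstring(xml_text)
    folders = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    files = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return {"folders": folders, "files": files}


# Made-up response body, for illustration only.
SAMPLE = (
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    "<Name>landsat-pds</Name><Prefix>L8/</Prefix>"
    "<CommonPrefixes><Prefix>L8/183/</Prefix></CommonPrefixes>"
    "<Contents><Key>L8/index.html</Key></Contents>"
    "</ListBucketResult>"
)

if __name__ == "__main__":
    print(listing_url("https://s3-us-west-2.amazonaws.com", "landsat-pds", "L8/"))
    print(parse_listing(SAMPLE))
```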

SAN FRANCISCO

Jed Sundwall

jed@amazon.com
