©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Open Data Innovation Building on Open Data Sets for Innovative Applications Jed Sundwall Open Data Technical Business Development Manager [email protected]

Open Data Innovation: Building on Open Data Sets for Innovative Applications


Page 1: Open Data Innovation: Building on Open Data Sets for Innovative Applications

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Open Data Innovation: Building on Open Data Sets for Innovative Applications

Jed Sundwall

Open Data Technical Business Development Manager

[email protected]

Page 2: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Agenda

• Open data on AWS overview

– Why open data matters to AWS

• Landsat on AWS

– The newest AWS public dataset

• Frank Warmerdam from Planet Labs

– Open data in the geospatial world

• What’s next for open data on AWS

Page 3: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Open Data on AWS

Page 4: Open Data Innovation: Building on Open Data Sets for Innovative Applications

What is open data?

Open data is data that can be used by anyone for any purpose for free.

Many of our customers, such as Esri, the Weather Company, and the

Climate Corporation, rely on quality open data as much as they rely on our

computing, storage, and other web services.

Page 5: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Open data on AWS

Amazon Web Services provides a comprehensive toolkit for gathering,

storing, analyzing, and working with data at any scale.

Amazon Elastic MapReduce

(Amazon EMR) provides the

Apache Hadoop analytics

framework as an easy-to-use

managed service.

Amazon S3 lets you store

and retrieve any amount of

data, at any time, from

anywhere on the web.

Amazon DynamoDB is a

fully-managed NoSQL

database service that makes

it cost-effective to store and

retrieve any amount of data.

Page 6: Open Data Innovation: Building on Open Data Sets for Innovative Applications

The power of open data on AWS

Making data open on AWS enables more innovation by making data

available for rapid access to our flexible and low-cost computing

resources.

Amazon EC2

Amazon EMR

Amazon Redshift

Amazon DynamoDB

AWS Lambda

Page 7: Open Data Innovation: Building on Open Data Sets for Innovative Applications

The Weather Company saves $1 million per year running its

forecasting application on AWS

The Weather Company provides millions of people

with the world’s best weather forecasts,

content and data, every day.

“Using AWS, TWC can scale as necessary to handle constantly changing workloads and maintain our 11-millisecond response time.”

Bryson Koehler

EVP, CTO, CIO, The Weather Company

• Needed a cost-effective, scalable alternative to operating 13 data centers with legacy systems.

• TWC ingests, stores, and analyzes 4 GB of weather data per second from over 800 sources.

• Designed to handle more than 15 billion API calls each day, at a rate of 150,000 per second.

• Reduced its on-premises IT environment from 13 to 6 data centers.

Page 8: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Open data as a platform

[Diagram. Labels: Data Enrichment; Sensemaking; Data at Rest (object storage); Basic APIs; Complex APIs; Consumer applications; Algorithmic policy; Data-driven journalism; Data catalogs; Focused data dashboards; Predictive modeling; Visualizations; Lower cost of knowledge (efficiency)]

Page 9: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Open data as a platform

[Diagram. Labels: Data Creation; Data Enrichment; Sensemaking; Data at Rest (object storage); Basic APIs; Complex APIs; Consumer applications; Algorithmic policy; Data-driven journalism; Data catalogs; Focused data dashboards; Predictive modeling; Visualizations; Efficiency]

Page 10: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Open data as a platform

[Diagram. Labels: Data Enrichment; Sensemaking; Amazon Kinesis; Amazon EC2; Amazon EC2; AWS Data Pipeline; Amazon S3; Amazon RDS; Amazon EMR; Amazon Redshift; Amazon DynamoDB; AWS Lambda]

Page 11: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Moovit: Smart Public Transportation

• Mobile app turns bus and

train riders into real-time

sensors for city government

• Integrates with city back-end

systems to improve both

service and rider experience

• Powered worldwide by the

AWS cloud

Page 12: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• First government-wide national intelligent map portal

– Integrated map system for government agencies to deliver location-based services and information to government agencies and citizens

– Powers over 100 government GIS websites and applications

– Reduced costs by 60%

Singapore government

“AWS has helped my organization

to provide better service availability

and handle higher traffic load at a

lower cost.” —Chan Chin Wai, Chief Information Officer

Singapore Land Authority

Page 13: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Landsat on AWS

Page 14: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Public datasets on AWS

To enable more innovation, AWS hosts a selection of datasets that anyone

can access for free. Data in our public datasets is available for rapid

access to our flexible and low-cost computing resources.

Earth Science

NASA Earth Exchange

(NASA NEX)

Life Sciences

1000 Genomes Project

Internet Science

Common Crawl Corpus

Page 15: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Landsat

The Landsat program is a joint effort

of the U.S. Geological Survey and

NASA. It is the longest running

program to gather Earth imagery

from space and is considered the

gold standard for natural resources

satellite imagery.

Page 16: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Landsat is big open data

The Landsat program is a joint effort

of the U.S. Geological Survey and

NASA. It is the longest running

program to gather Earth imagery

from space and is considered the

gold standard for natural resources

satellite imagery.

It has traditionally been time-consuming and expensive to acquire, store, and analyze Landsat data.

Page 17: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Landsat on AWS

We have committed to making up to

1 petabyte of Landsat imagery

readily available as objects on

Amazon S3.

Now, anyone can analyze Landsat

data at web scale with no significant

up-front investment of time or capital

expense.

Page 18: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Esri—Unlock Earth’s Secrets

Esri has created a tool to show how ArcGIS Online can quickly render Landsat data for live visualization and analysis within the browser.

“These are not pre-generated cache

services limited to just visualization—

they are dynamic, high-performance image services that perform on-the-fly processing and dynamic mosaicking of Landsat’s multi-spectral and multi-temporal imagery.”

http://www.esri.com/landsatonaws

Page 19: Open Data Innovation: Building on Open Data Sets for Innovative Applications

landsat-util

Landsat on AWS helped

Development Seed make

optimizations that make landsat-util

over 2× faster and allow for more

functionality.

https://developmentseed.org/blog/2015/03/19/aws-landsat-archive/

Page 20: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Landsat-live

Mapbox created Landsat-live, a map

that is constantly refreshed with the

latest satellite imagery from NASA’s

Landsat 8 satellite.

Creating a live Earth imagery

pipeline is possible because Landsat

imagery is available on Amazon S3

within hours of creation.

https://www.mapbox.com/blog/landsat-live-live/

Page 21: Open Data Innovation: Building on Open Data Sets for Innovative Applications

MATLAB—Landsat8 Data Explorer

MathWorks created a freely downloadable MATLAB-based tool for accessing, processing, and visualizing Landsat 8 data.

The tool allows MATLAB users to

find Landsat 8 scenes, analyze

them, and combine them with other

sources of GIS data for new

visualizations.

http://blogs.mathworks.com/steve/2015/03/19/matlab-landsat-8-aws/

Page 22: Open Data Innovation: Building on Open Data Sets for Innovative Applications


Frank Warmerdam

Geospatial Software Developer, Planet Labs

[email protected]

www.planet.com

Page 23: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• Geospatial software developer for 20 years

• PCI, independent consultant, Google, Planet Labs

• OSGeo/open source/open data

• Not really very “sciency”

• Working on the “data pipeline” team

Frank Warmerdam

Page 24: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Why not?

• Costly to collect

• Hard to control

Why?

• Open datasets are an enabler for innovation

• Making data open ensures optimum utilization

• Geodata (images, maps) are common heritage

– More like science than art or literature

Open geo data—why?

Page 25: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• Free by default!

• TIGER/Line

– National roads from US Census Bureau

– Base of many commercial roadmaps (Google, etc.)

• NAIP

– 2 m resolution air photos of continental US

– Base for many commercial image maps

• Landsat

– 30 m images of the world for 30+ years

• NASA/NOAA/USGS

– Science data

– Weather data

– Geological data

Open geo data—USA

Page 26: Open Data Innovation: Building on Open Data Sets for Innovative Applications

We consume:

• Landsat8 PAN

• Landsat8 RGB

• NAIP

• CGIAR DEM

• SRTM90

• SRTM30

• NED

• Open Street Map

• NOAA cloud predictions

Open geo data @ Planet Labs

Page 27: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• Format conversions

• Slow servers (e.g., USGS, lots of 503s)

• Incomplete mirrors (e.g., missing Landsat updates)

• Dynamic datasets require constant monitoring

• Storage is costly

• Such a waste of bandwidth!

Why not share one copy?

Ingesting is a hassle
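One common way to cope with the "lots of 503s" problem above is to retry with exponential backoff. The following is only a minimal sketch of that general technique, not the ingestion code used by any of the organizations mentioned; the function name `fetch_with_retry` and its parameters are hypothetical, and the `opener` and `sleep` arguments exist only so the logic can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=5, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, retrying on HTTP 503 with exponential backoff.

    Hypothetical helper for illustration: waits base_delay, then
    2 * base_delay, 4 * base_delay, ... between attempts.
    """
    for attempt in range(attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Retry only on "service unavailable"; re-raise any other
            # HTTP error, and give up after the final attempt.
            if err.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Even with retries, every consumer still downloads and stores its own copy, which is the waste the slide argues a single shared copy would avoid.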

Page 28: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• Mapbox loaded recent NAIP data on Amazon S3

• Offered to Mark Korver at Amazon Web Services

• Mark put it in an AWS-funded Amazon S3 bucket

• Available with “requester pays” for network egress

Planet Labs attaches to this NAIP data

• Reference from the foreign Amazon S3 bucket

• Need to sign all requests (for requester pays)

• /vsicurl/ works (used to get footprint cheaply)

• Succeeded in building 4.7 m NAIP mosaic of CONUS!

One example: NAIP

Page 29: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• AWS provides up to 1 PB of S3 storage

• AWS provides free network egress

• MapBox (Charlie and Amit) provides USGS pull library

and expertise

• Planet Labs writes ingestion scripts

• Planet Labs provides Amazon EC2 workers for ingestion

• Updating every two hours

• All scenes from January 2015 on

• Selective backfill from 2014 and earlier

Landsat on AWS

Page 30: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• TAR files split into internally compressed TIF

• External overviews

• Simple HTTP access (no auth)

• /vsicurl/ capable (with caveats)

• _MTL file (soon) also available as .json

• .csv scene list in root of bucket

http://github.com/landsat-pds

https://s3-us-west-2.amazonaws.com/landsat-pds/L8/index.html

Landsat on AWS—organization
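The ".csv scene list in root of bucket" mentioned above lends itself to simple filtering before any imagery is downloaded. The sketch below shows the idea with Python's standard `csv` module. The column names (`entityId`, `acquisitionDate`, `cloudCover`, `path`, `row`) and all sample values are assumptions made for illustration and are not taken from the slides; check the actual file's header before relying on them.

```python
import csv
import io

# Hypothetical excerpt of a scene list. Column names and values are
# invented for illustration; only the first scene ID appears in the slides.
SAMPLE_SCENE_LIST = """\
entityId,acquisitionDate,cloudCover,path,row
LC81830182014347LGN00,2014-12-13,92.0,183,18
LC80440342015030LGN00,2015-01-30,3.5,44,34
"""


def select_scenes(csv_text, path, row, max_cloud):
    """Return scene IDs for one WRS path/row at or below a cloud-cover limit."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        rec["entityId"]
        for rec in reader
        if int(rec["path"]) == path
        and int(rec["row"]) == row
        and float(rec["cloudCover"]) <= max_cloud
    ]


if __name__ == "__main__":
    print(select_scenes(SAMPLE_SCENE_LIST, path=44, row=34, max_cloud=10.0))
```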

Page 31: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• Easy access to desired bands

• Tiling and overviews potentially support

mapping/viewing applications efficiently

• HTTP/VSI Curl support for the win

• Reduce load on USGS

• Mount Amazon S3 bucket via file system on Amazon

EC2 instance

• Open to collaboration and layered tools

Landsat on AWS—advantages
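"Easy access to desired bands" follows from the bucket layout: each band is a separate object whose URL can be derived from the scene ID. The sketch below builds such URLs using the layout visible in the example links elsewhere in this deck (`L8/<path>/<row>/<scene>/<scene>_B<band>.TIF`). The helper names are hypothetical, the layout is as presented in 2015 and may have changed since, and actually opening the resulting `/vsicurl/` path requires GDAL, which is not used here.

```python
# Base URL as it appears in the links on these slides.
BUCKET_URL = "https://s3-us-west-2.amazonaws.com/landsat-pds"


def scene_prefix(scene_id):
    """Return the key prefix for a scene, e.g. 'L8/183/018/<scene_id>'.

    Assumes the scene ID has the form 'LC8PPPRRR...', where PPP is the
    WRS path and RRR is the WRS row.
    """
    path, row = scene_id[3:6], scene_id[6:9]
    return f"L8/{path}/{row}/{scene_id}"


def band_url(scene_id, band):
    """Return the HTTP URL of a single band GeoTIFF for a scene."""
    return f"{BUCKET_URL}/{scene_prefix(scene_id)}/{scene_id}_B{band}.TIF"


def vsicurl_path(url):
    """Prefix a URL so GDAL reads it through its /vsicurl/ virtual file system."""
    return "/vsicurl/" + url


if __name__ == "__main__":
    print(vsicurl_path(band_url("LC81830182014347LGN00", 11)))
```

Because only the requested byte ranges of a band are fetched through `/vsicurl/`, a tool can read a footprint or an overview without downloading the whole scene.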

Page 32: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• Ingest into our system “via reference to Amazon S3”

• Successfully used for mosaicking, etc.

• We now track L8 PDS hourly

Landsat on AWS—Planet Labs

Page 33: Open Data Innovation: Building on Open Data Sets for Innovative Applications

{
  "DATA_TYPE": "L1GT",
  "MTL_link": "https://s3-us-west-2.amazonaws.com/landsat-pds/L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_MTL.txt",
  "cloud_cover": {
    "cloud_mask_link": "https://storage.planet-labs.com/v0/scenes/landsat8_qa/LC81830182014347LGN00_BQA.TIF",
    "estimated": "0.92"
  },
  "derived_from": {
    "input_params": {
      "ARGS": "--next-run"
    },
    "job_url": "https://jobs.planet-labs.com/v0/programs/l8_aws_process/jobs/26996483",
    "process": "l8_aws_process"
  },
  "footprint": {...},
  "index_link": "https://s3-us-west-2.amazonaws.com/landsat-pds/L8/183/018/LC81830182014347LGN00/index.html",
  "pass_at": "2014-12-13 00:00:00",
  "remote_info": {
    "backend": "s3_remote",
    "s3_bucket": "landsat-pds",
    "s3_path": "L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_B11.TIF"
  }
}

Landsat on AWS—Planet Labs
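A record like the one above keeps only a reference to the object on Amazon S3, so a consumer can recover the scene ID from `s3_path` and cross-check it against the other fields. The sketch below does this for a trimmed copy of the record (the elided `footprint` and several other fields are omitted). It assumes that characters 10 to 16 of the scene ID encode the acquisition year and day of year, which is not stated on the slides but agrees with the `pass_at` value; the function names are hypothetical.

```python
import json
from datetime import datetime

# Trimmed copy of the record shown on the slide.
RECORD = json.loads("""
{
  "pass_at": "2014-12-13 00:00:00",
  "remote_info": {
    "backend": "s3_remote",
    "s3_bucket": "landsat-pds",
    "s3_path": "L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_B11.TIF"
  }
}
""")


def scene_id_from_path(s3_path):
    """Return the scene ID, assumed to be the fourth component of the key."""
    return s3_path.split("/")[3]


def acquisition_date(scene_id):
    """Parse the assumed year + day-of-year portion of a scene ID."""
    return datetime.strptime(scene_id[9:16], "%Y%j").date()


if __name__ == "__main__":
    scene = scene_id_from_path(RECORD["remote_info"]["s3_path"])
    pass_at = datetime.strptime(RECORD["pass_at"], "%Y-%m-%d %H:%M:%S").date()
    print(scene, acquisition_date(scene), acquisition_date(scene) == pass_at)
```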

Page 34: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• Promote use in the community

• Divert existing USGS pullers to this

• Promote integrations

– Development Seed’s landsat-util and libra viewer

– Additional catalog interfaces

– Web map view onto data

• Showcase derivative works (mosaics, etc.)

• More “operators”

Landsat on AWS—future steps

Page 35: Open Data Innovation: Building on Open Data Sets for Innovative Applications

• This is the future!

• Public access (HTTP)

• Preserve source data (pixels and metadata)

• Organize for efficient use

• Keep up to date

• Amazon S3 -> anyone can spin up Amazon EC2 nearby

Other datasets:

• Elevation (Stamen project)

• Planet Labs public datasets (more soonish)

• …

Cloud hosted raw geodata

Page 36: Open Data Innovation: Building on Open Data Sets for Innovative Applications

Landsat on AWS as a platform

Page 37: Open Data Innovation: Building on Open Data Sets for Innovative Applications

What’s Next…

Page 38: Open Data Innovation: Building on Open Data Sets for Innovative Applications

What’s next

• More open data

– If you rely on open data for your work, we want to hear from you

• More services and features

– AWS JavaScript S3 Explorer: a simple JavaScript application for displaying the contents of an Amazon S3 bucket in the browser.

– https://github.com/awslabs/aws-js-s3-explorer

– Roughly 90–95% of our roadmap is driven by what our customers tell us matters, so tell us at [email protected]

Page 39: Open Data Innovation: Building on Open Data Sets for Innovative Applications

SAN FRANCISCO

Jed Sundwall

[email protected]