Data Processing without Servers | AWS Public Sector Summit 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jed Sundwall, Global Open Data Lead
June 21, 2016

Data Processing Without Servers: Serverless Processing of Landsat 8 Imagery

Using AWS Lambda with Landsat on AWS

What is Landsat?

Landsat

The Landsat program is a joint effort of the U.S. Geological Survey and NASA. It is the longest running program to gather Earth imagery from space and is considered the gold standard for natural resources satellite imagery.

Landsat—not just pretty pictures

Landsat scenes are made up of multiple files, each of which includes data about different kinds of light reflected off of Earth.

Each pixel of each Landsat 8 file represents a 12-bit measurement of light reflected off a 30 m × 30 m patch of our planet. Each Landsat 8 scene contains about 840 million pixels and takes up about 800 MB.

We currently host over 400,000 Landsat 8 scenes and make about 700 new scenes available on Amazon S3 every day.

That’s 588 billion pixels a day.
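As a quick check, the daily figure follows directly from the two rounded numbers quoted above (about 840 million pixels per scene, about 700 new scenes per day):

```python
# Rounded figures quoted on the slides above.
PIXELS_PER_SCENE = 840_000_000   # about 840 million pixels per scene
NEW_SCENES_PER_DAY = 700         # about 700 new scenes per day

pixels_per_day = PIXELS_PER_SCENE * NEW_SCENES_PER_DAY
print(f"{pixels_per_day / 1e9:.0f} billion pixels per day")  # prints "588 billion pixels per day"
```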

RGB: visible light

Infrared: vegetation

Shortwave infrared: urban areas

Wellington, New Zealand

What does “serverless” mean?

“Serverless” is an approach to software development that eliminates the need to maintain and administer servers.

Applications are built by combining third-party APIs and services with self-created, non-server-based APIs.

AWS Lambda

Serverless compute service that runs code in response to events and automatically manages the underlying compute resources

COMPUTE SERVICE: Run code at any scale without thinking about servers

EVENT DRIVEN: Code only runs when it needs to run, charged on execution time

AWS Lambda + Landsat

Landsat on AWS

Landsat on AWS makes each band of each scene readily available as objects on Amazon S3.

Data can be accessed programmatically via HTTP and quickly deployed to any of our products for analysis and processing.

An Amazon SNS topic publishes a notification whenever a new scene is available.

Landsat TIFFs represent individual wavelengths of light, and need to be combined to be interpretable by most people.

Using image processing tools, we can combine multiple bands into one “true color” image.

Our goal is to create true color images automatically as each scene is made publicly available.

AWS Lambda, Amazon DynamoDB, Amazon S3, Amazon SNS

We can seamlessly integrate various Amazon Web Services products to create a serverless architecture that will achieve this quickly and cost-effectively.

Serverless architecture

[Architecture diagram: Landsat 8 bucket, Amazon SNS, AWS Lambda, Amazon DynamoDB, target bucket]

{
  "Records": [
    {
      "EventVersion": "1.0",
      "EventSubscriptionArn": "arn:aws:sns:EXAMPLE",
      "EventSource": "aws:sns",
      "Sns": {
        "SignatureVersion": "1",
        "Timestamp": "1970-01-01T00:00:00.000Z",
        "Signature": "EXAMPLE",
        "SigningCertUrl": "EXAMPLE",
        "MessageId": "95df01b4-ee98-5cb9-9903-4c221d41eb5e",
        "Message": "{\"Records\":[{\"eventVersion\":\"2.0\",\"eventSource\":\"aws:s3\",\"awsRegion\":\"us-west-2\",\"eventTime\":\"2016-01-16T01:36:55.014Z\",\"eventName\":\"ObjectCreated:Put\",\"userIdentity\":{\"principalId\":\"AWS:AIDAILHHXPNIKSGVUGOZK\"},\"requestParameters\":{\"sourceIPAddress\":\"52.27.39.85\"},\"responseElements\":{\"x-amz-request-id\":\"078952E6C7CC52B4\",\"x-amz-id-2\":\"Xboo1ULzd7PxY27iIaGXjUStV8TmG52JAbiWQpiRJWuRqfaBhLcc0XMUKNmXgd5fbIfRd1IcrgE=\"},\"s3\":{\"s3SchemaVersion\":\"1.0\",\"configurationId\":\"NewHTML\",\"bucket\":{\"name\":\"landsat-pds\",\"ownerIdentity\":{\"principalId\":\"A3LZTVCZQ87CNW\"},\"arn\":\"arn:aws:s3:::landsat-pds\"},\"object\":{\"key\":\"L8/169/060/LC81690602016015LGN00/index.html\",\"size\":3780,\"eTag\":\"736e4e5a36cb8a1c6cbfc58659126ff1\",\"sequencer\":\"0056999EB6F8BDBB8D\"}}}]}",
        "Type": "Notification",
        "UnsubscribeUrl": "EXAMPLE",
        "TopicArn": "arn:aws:sns:EXAMPLE",
        "Subject": "TestInvoke"
      }
    }
  ]
}

An Amazon SNS topic publishes a notification whenever a new scene is available.

This is what a notification looks like. It’s a JavaScript Object Notation (JSON) object.

Programmatic access to data

L8/169/060/LC81690602016015LGN00/index.html
→ LC81690602016015LGN00_B1.TIF
→ LC81690602016015LGN00_B2.TIF
→ LC81690602016015LGN00_B3.TIF
…
→ LC81690602016015LGN00_MTL.txt

The notification has given us everything we need to find the data for our task. AWS Lambda can do all of this automatically.
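A minimal sketch of how a function might pull the scene location out of such a notification. The helper names (`scene_from_sns_event`, `band_key`, `mtl_key`) are hypothetical, and the example event is trimmed to only the fields the sketch reads; the key layout is taken from the notification shown earlier.

```python
import json

def scene_from_sns_event(event):
    """Extract bucket, prefix, path, row, and scene ID from an SNS-wrapped S3 event."""
    # The SNS "Message" field is itself a JSON string holding the S3 event.
    s3_event = json.loads(event["Records"][0]["Sns"]["Message"])
    s3 = s3_event["Records"][0]["s3"]
    key = s3["object"]["key"]  # L8/<path>/<row>/<scene id>/index.html
    _, path, row, scene_id, _ = key.split("/")
    return {
        "bucket": s3["bucket"]["name"],
        "prefix": key.rsplit("/", 1)[0],
        "path": path,
        "row": row,
        "scene_id": scene_id,
    }

def band_key(scene, band):
    """S3 key of one band's GeoTIFF, following the naming shown above."""
    return f'{scene["prefix"]}/{scene["scene_id"]}_B{band}.TIF'

def mtl_key(scene):
    """S3 key of the scene's MTL metadata text file."""
    return f'{scene["prefix"]}/{scene["scene_id"]}_MTL.txt'

# Trimmed-down version of the notification shown earlier (only the fields used here).
example_event = {"Records": [{"Sns": {"Message": json.dumps({"Records": [{"s3": {
    "bucket": {"name": "landsat-pds"},
    "object": {"key": "L8/169/060/LC81690602016015LGN00/index.html"},
}}]})}}]}

scene = scene_from_sns_event(example_event)
print(scene["scene_id"], band_key(scene, 4), mtl_key(scene))
```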

Serverless architecture

The SNS message object is available to the Lambda function on execution.

From this object, we obtain the base Landsat scene information (Path, Row, Scene ID), as well as the location of the MTL text file containing the detailed metadata for the scene.

Native JSON

Next, the Lambda function retrieves the text file containing the scene metadata.

The metadata is parsed and converted to JSON.
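The slides do not show the parser itself. As a rough sketch, assuming the MTL file uses `KEY = VALUE` lines nested inside `GROUP = ...` / `END_GROUP = ...` markers, the conversion could look like the following. The function names are hypothetical and the sample text is a shortened illustration with made-up values, not a real metadata file.

```python
import json

def _convert(value):
    """Turn a raw MTL value into a str, int, or float."""
    if value.startswith('"') and value.endswith('"'):
        return value[1:-1]
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # dates and other unquoted tokens stay as strings

def parse_mtl(text):
    """Parse MTL text into nested dicts, one dict per GROUP."""
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "END":
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "GROUP":
            group = {}
            stack[-1][value] = group
            stack.append(group)
        elif key == "END_GROUP":
            stack.pop()
        else:
            stack[-1][key] = _convert(value)
    return root

# Shortened, illustrative sample (values are made up).
SAMPLE_MTL = """\
GROUP = L1_METADATA_FILE
  GROUP = METADATA_FILE_INFO
    LANDSAT_SCENE_ID = "LC81690602016015LGN00"
  END_GROUP = METADATA_FILE_INFO
  GROUP = PRODUCT_METADATA
    WRS_PATH = 169
    WRS_ROW = 60
  END_GROUP = PRODUCT_METADATA
  GROUP = IMAGE_ATTRIBUTES
    CLOUD_COVER = 12.5
  END_GROUP = IMAGE_ATTRIBUTES
END_GROUP = L1_METADATA_FILE
END
"""

metadata = parse_mtl(SAMPLE_MTL)
print(json.dumps(metadata, indent=2))
```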

Having the metadata available in JSON will allow for much easier storage of the metadata in DynamoDB.
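A sketch of the storage step, assuming the function uses boto3 and a table keyed on a `scene_id` attribute (the attribute name and both function names are hypothetical). One detail worth noting: boto3 does not accept Python floats for DynamoDB numbers, so they are converted to `Decimal` first.

```python
from decimal import Decimal

def to_dynamodb(value):
    """Recursively convert floats to Decimal, which boto3 requires for DynamoDB numbers."""
    if isinstance(value, float):
        return Decimal(str(value))
    if isinstance(value, dict):
        return {k: to_dynamodb(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_dynamodb(v) for v in value]
    return value

def store_scene_metadata(table, scene_id, metadata):
    """Write one item per scene. `table` is expected to be a boto3 DynamoDB Table resource."""
    item = {"scene_id": scene_id, "metadata": to_dynamodb(metadata)}
    table.put_item(Item=item)
    return item
```

With real credentials, `table` would come from something like `boto3.resource("dynamodb").Table("<table name>")`; the table name is left open here because the slides do not give one.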

After storing the scene metadata, the function then invokes an additional fleet of Lambda functions.

Each function downloads the .TIF for one of the three bands needed to generate a true color image, converts it to a .JPG, and uploads it back to S3 to make it available to the parent Lambda function.
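The fan-out could be sketched as below, assuming one worker invocation per band and asynchronous (`"Event"`) invocation; the slides do not say which invocation type was used. For Landsat 8, true color uses bands 4, 3, and 2 (red, green, blue). The payload field names, the target key layout, and the function names are hypothetical.

```python
import json

TRUE_COLOR_BANDS = (4, 3, 2)  # Landsat 8 red, green, blue

def band_payloads(bucket, prefix, scene_id, target_bucket):
    """Build one worker payload per true color band."""
    return [
        {
            "source_bucket": bucket,
            "source_key": f"{prefix}/{scene_id}_B{band}.TIF",
            "target_bucket": target_bucket,
            "target_key": f"{scene_id}/B{band}.jpg",
        }
        for band in TRUE_COLOR_BANDS
    ]

def fan_out(lambda_client, worker_name, payloads):
    """Invoke the worker function once per payload without waiting for results."""
    for payload in payloads:
        lambda_client.invoke(
            FunctionName=worker_name,
            InvocationType="Event",  # asynchronous invocation
            Payload=json.dumps(payload).encode("utf-8"),
        )
```

Here `lambda_client` would be `boto3.client("lambda")`. With asynchronous invocation the parent has to check S3 for the converted bands; a synchronous `"RequestResponse"` invocation is the alternative.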

Lambda functions natively include the open source image processing library ImageMagick.

We call this library to retrieve the three compressed .JPG bands, assemble them into a single .JPG, and then make color/contrast adjustments.
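The exact ImageMagick options used are not shown in the slides. As one possible sketch, the `convert` command line tool can merge three grayscale images into the red, green, and blue channels of one image with `-combine`, and `-normalize` is a simple stand-in for the color/contrast adjustment. The function names and file paths are hypothetical.

```python
import subprocess

def combine_command(red_jpg, green_jpg, blue_jpg, out_jpg):
    """Build an ImageMagick command that merges three band images into one RGB image."""
    return [
        "convert",
        red_jpg, green_jpg, blue_jpg,
        "-combine",      # use the three grayscale inputs as the R, G, B channels
        "-normalize",    # stretch contrast (stand-in for the adjustments mentioned above)
        out_jpg,
    ]

def make_true_color(red_jpg, green_jpg, blue_jpg, out_jpg):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(combine_command(red_jpg, green_jpg, blue_jpg, out_jpg), check=True)

print(" ".join(combine_command("/tmp/B4.jpg", "/tmp/B3.jpg", "/tmp/B2.jpg", "/tmp/rgb.jpg")))
```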

The parent Lambda function uploads the converted bands and the processed true color image to S3.

We can then make these finished .JPGs publicly available, or available only to a specific application, depending on the use case.

Thank you!

Jed Sundwall, Global Open Data Lead – jed@amazon.com
Ryan Opsitos, Solutions Architect – opsitosr@amazon.com

Matthew Hanson, Development Seed, @geoskeptic

June 21, 2016

OSM-STATS: Gamification for Humanitarian Mapping

OpenStreetMap

Open map data
  Roads, rivers, buildings (e.g., hospitals)

Crowd-sourced mapping platform
  Users create vectors from satellite imagery
  OSM tasking manager identifies critical areas

Missing Maps

An initiative to map out areas most in need
  Humanitarian response
  Third-world regions with poor coverage

Organize mapathons
  Events where groups of volunteers focus on a region

Website of statistics from mapathons
  Keep track of contributions by hashtags users include in commits

OSM-Stats

Website of statistics by users and hashtags
  Track different groups, different mapathons
  Offer a reward mechanism to encourage contributions

Users earn badges for different statistics
  e.g., km of roads, # of buildings

Leaderboards for users and hashtags
  Produce stats in real time for added fun at mapathons

missingmaps.org

OSM infrastructure

Commits (changesets) by users published every minute
  Include metadata, but not geometries
  http://planet.osm.org/replication/changesets/

Geometries made available by minute via ‘overpass’
  http://overpass-api.de/

OSM-Stats Architecture

planet-stream

Node app
  Streams metadata and geometries from sources
  Combine them using Redis
  Push augmented changesets to Amazon Kinesis stream

Docker container running on Amazon EC2

osm-stats-workers

AWS Lambda with Node v4.3.2

Event mapping to Amazon Kinesis stream

Calculates metrics from each changeset
  Geometry calculations from vector data
  Determination of countries edited
  Ancillary data: user, editor used

Add to Amazon RDS database
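To illustrate the kind of geometry calculation involved (for example, the "km of roads" badge statistic), here is a sketch that measures the length of a road from its coordinates using the haversine formula. This is an illustration under stated assumptions, not the project's actual code: the workers run on Node, and the function names and the `(lon, lat)` input format are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def line_length_km(coords):
    """Total length of a line given as a list of (lon, lat) pairs."""
    return sum(haversine_km(*a, *b) for a, b in zip(coords, coords[1:]))

# One degree of longitude along the equator is roughly 111 km.
print(round(line_length_km([(0.0, 0.0), (1.0, 0.0)]), 1))
```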

Deployment

Use Python script and boto3

Deploy database
  Create Amazon RDS and osm-stats database, with inbound rules
  Migrate and populate

Create Amazon Kinesis stream

Create AWS Lambda
  Create with appropriate permissions: Amazon Kinesis, Amazon RDS security group pair
  Create event mapping

Deploy Amazon EC2
  Create instance, create security groups
  Use fabric to upload .env file (with URLs and names of above services), Dockerfiles
  docker-compose up -d: starts pushing to stream as soon as augmented changesets are created
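The deployment script itself is not shown. As a rough sketch of the stream and event mapping steps with boto3, assuming a single shard, a batch size of 100, and processing from the oldest record, it might look like this. The function name, stream name, and Lambda function name are placeholders; the Lambda function is assumed to exist already.

```python
def create_stream_and_mapping(kinesis, lambda_client, stream_name, function_name):
    """Create a Kinesis stream and map it to an existing Lambda function.

    `kinesis` and `lambda_client` are boto3 clients, e.g. boto3.client("kinesis")
    and boto3.client("lambda").
    """
    kinesis.create_stream(StreamName=stream_name, ShardCount=1)
    # Wait until the stream is active before asking for its ARN.
    kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
    description = kinesis.describe_stream(StreamName=stream_name)
    stream_arn = description["StreamDescription"]["StreamARN"]

    lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        StartingPosition="TRIM_HORIZON",  # start from the oldest available record
        BatchSize=100,
    )
    return stream_arn
```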

Why Lambda and Amazon Kinesis?

Microservices architecture
  Smaller replaceable components
  Easier to scale pieces

Lambda provides low-cost solution at scale
  Activity can vary from a few to 100 changesets/min

Amazon Kinesis stream allows flexible input for historical processing

Lambda Invocations and Durations

Plots using librato

Lambda lessons

Local testing framework would have been useful

Lambda logs take some work
  aws-cli: combined with Python or Bash scripts can be useful to parse logs
  awslogs: Amazon CloudWatch logs for Humans (https://github.com/jorgebastida/awslogs)

Error handling
  Lambda function design should handle all errors; don’t let it return a failure
  Include top-level catch to catch any errors, log, and return success

Database connections using Knex
  Database pools and Lambda container reuse (pool min=0!)
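The error handling advice above can be sketched as follows. The actual workers are written for Node; this Python version only illustrates the pattern. One likely reason for the advice is that with a Kinesis event source, a failed invocation causes Lambda to retry the same batch, which holds up the rest of the shard. `process_changeset` is a hypothetical placeholder for the real metric calculation.

```python
import base64
import json
import logging

logger = logging.getLogger(__name__)

def process_changeset(changeset):
    """Hypothetical placeholder for the per-changeset metric calculation."""
    if "id" not in changeset:
        raise ValueError("changeset has no id")
    return changeset["id"]

def handler(event, context):
    """Process a batch of Kinesis records; log errors but never fail the invocation."""
    processed, failed = 0, 0
    try:
        for record in event["Records"]:
            try:
                # Kinesis record payloads arrive base64-encoded.
                data = base64.b64decode(record["kinesis"]["data"])
                process_changeset(json.loads(data))
                processed += 1
            except Exception:
                logger.exception("failed to process record")
                failed += 1
    except Exception:
        # Top-level catch: anything unexpected is logged, and we still return success.
        logger.exception("unexpected error while handling event")
    return {"processed": processed, "failed": failed}
```

The trade-off is that a record that fails is dropped rather than retried, so the logs become the only trace of it.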

Lambda Security and VPCs

Initially configured closed RDS with Lambda access
  Paired security groups for RDS and Lambda

As part of VPC, Lambda is in a bubble
  osm-stats-workers makes requests elsewhere
  OSM API for tasking manager data

Ended up opening up RDS to the world
  Security groups also seem to cause intermittent pool errors