© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Martin Holste, FireEye October 2015 CMP403 AWS Lambda Simplifying Big Data Workloads

(CMP403) AWS Lambda: Simplifying Big Data Workloads

Page 1: (CMP403) AWS Lambda: Simplifying Big Data Workloads

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Martin Holste, FireEye

October 2015

CMP403

AWS Lambda: Simplifying Big Data Workloads

Page 2: (CMP403) AWS Lambda: Simplifying Big Data Workloads

What to Expect from the Session

This is a deep dive on general computing uses for AWS Lambda.

• You will understand what makes Lambda a big deal for big data.

• You will not learn about asynchronously triggered workloads (see related sessions for that).

• You will see interactive, data-driven user experiences that work with minimal ops overhead and at any scale.

Page 3: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Problem: Big data, little time

At FireEye, one of the ways we protect customers is by analyzing mountains of event data to find “evil.”

Some of it we have online in indexes, some of it we have in cold storage on Amazon S3.

We needed to be able to take advantage of the rich history in our archived data without hurting our user experience.

Page 4: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Our app creates questions and finds answers

[Diagram: the app turns Questions into Answers. Labels: Amazon EMR triggers investigations; AWS Lambda provides context; Lambda-driven search and analytics; EMR analytic output; EC2-based proprietary detection; EC2-based indexed search.]

Page 5: (CMP403) AWS Lambda: Simplifying Big Data Workloads

What analysis are we doing?

Amazon EMR: scheduled jobs that process all data for anomaly detection:

• K-means

• Linear regression

• Geographic time-lining

AWS Lambda: free-form searching to drive ad hoc:

• Reports

• Visualizations

• Analytical statistics (clustering, correlation, linear regression, etc.)

Page 6: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Visualize search results analytically

User-defined analytics based on ad hoc features of the search result set draw attention to otherwise uninteresting facets of the data.

Page 7: (CMP403) AWS Lambda: Simplifying Big Data Workloads

How big is our Big?

For an average customer:

Average security event size is about 3 KB. At 20k events/sec that is roughly 60 MB/sec, which is about 5 TB/day.

One week is about 35 TB, or 12 billion events.
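The sizing above can be checked with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, and decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) are an assumption.

```javascript
// Back-of-the-envelope sizing for an average customer (inputs from the slide).
const eventBytes = 3000;     // ~3 KB per security event
const eventsPerSec = 20000;  // ~20k events/sec

const secondsPerDay = 24 * 60 * 60;
const bytesPerSec = eventBytes * eventsPerSec;           // 60,000,000 bytes/sec
const tbPerDay = (bytesPerSec * secondsPerDay) / 1e12;   // about 5.18 TB/day
const tbPerWeek = tbPerDay * 7;                          // about 36.3 TB/week
const eventsPerWeek = eventsPerSec * secondsPerDay * 7;  // 12,096,000,000 events

console.log((bytesPerSec / 1e6) + ' MB/sec');
console.log(tbPerDay.toFixed(2) + ' TB/day');
console.log(tbPerWeek.toFixed(1) + ' TB/week');
console.log((eventsPerWeek / 1e9).toFixed(1) + ' billion events/week');
```

The slide's 35 TB figure follows from rounding the daily volume to 5 TB before multiplying by seven.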

Page 8: (CMP403) AWS Lambda: Simplifying Big Data Workloads

How long does this take?

A single process downloads, decompresses, greps, and processes at about 35k events/sec (105 MB/sec).

To process a week of data:

Processes    Time
1            ~4 days
10           ~6 hours
100          ~1 hour
1000         ~5 minutes
10000        seconds

[Chart: "Scale". Horizontal axis: number of processes (1, 10, 100, 1000); vertical axis: 0 to 400,000.]
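A rough model of how processing time shrinks with process count, assuming perfectly linear scaling with no startup, network, or aggregation overhead. That assumption is an idealisation, and the function name `estimateSeconds` is hypothetical.

```javascript
// Inputs taken from the slides: one week of events, single-process throughput.
const eventsPerWeek = 12e9;
const eventsPerSecPerProcess = 35e3;

// Ideal wall-clock seconds to scan a week of data with `processes` workers.
function estimateSeconds(processes) {
  return eventsPerWeek / (eventsPerSecPerProcess * processes);
}

[1, 10, 100, 1000, 10000].forEach(function (n) {
  const secs = estimateSeconds(n);
  console.log(n + ' processes: ' + Math.round(secs) + ' s (' +
    (secs / 3600).toFixed(2) + ' h)');
});
```

Under this model one process needs roughly four days and 10,000 processes roughly half a minute. The table's intermediate rows are the speaker's rounded figures and need not match an ideal model exactly.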

Page 9: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Lambda FTW

What if you could spin up 10k processes in 100 ms?

Standard map-reduce pattern without the startup time or hassle of map-reduce frameworks.

Write your simple worker code, and let cascading Lambda functions handle the heavy lifting.

Page 10: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Lambda cascade

AWS Big Data blog: “Building Scalable and Responsive Big Data Interfaces with AWS Lambda”

Page 11: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Code components

Basic web app: handles UI request, invokes cascade functions, streams results.

Cascade function: invokes workers, aggregates and returns results. Can be made recursive.

Worker function: performs atomic work, returns results to invoker.

Page 12: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Basic web app

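    // Stream of S3 keys matching the search (class from the speaker's app)
    var listStream = new S3KeyListStream(searchParams);
    // Stream that hands keys to Lambda (class from the speaker's app)
    var lambdaStream = new LambdaStream(maxWorkers);

    listStream
      // end: false keeps lambdaStream open after the key list runs out
      .pipe(lambdaStream, { end: false })
      .pipe(serverSentStream)
      .pipe(httpResponse);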

Page 13: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Basic web app key points

• Batched async execution within an async pipeline is very unintuitive.

• The trick is to pipe with end: false and call end() manually in the pipeline code once all work is done.

• The pipeline will naturally queue up batches to stay under the configured Lambda provisioning limits.
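The end: false trick can be sketched with Node's core streams alone. Everything here is a stand-in rather than the speaker's code: `source` plays the key lister, `worker` plays the Lambda-invoking stream, and a timer simulates an asynchronous invocation. It assumes a Node version that provides `Readable.from`.

```javascript
const { Readable, Transform } = require('stream');

let pending = 0;         // work items still in flight
let sourceDone = false;  // set once the key list is exhausted

const worker = new Transform({
  objectMode: true,
  transform(key, _encoding, callback) {
    pending++;
    callback();  // accept the next key right away; the result arrives later
    setTimeout(() => {  // simulated asynchronous invocation
      this.push('result for ' + key);
      pending--;
      maybeEnd();
    }, 10);
  }
});

// End the worker stream only when no more keys will come and nothing is in flight.
function maybeEnd() {
  if (sourceDone && pending === 0) worker.end();
}

const source = Readable.from(['key-1', 'key-2', 'key-3']);
source.on('end', () => { sourceDone = true; maybeEnd(); });

// Without end: false, pipe() would end `worker` as soon as `source` ends,
// while results are still outstanding.
source.pipe(worker, { end: false });

const results = [];
worker.on('data', (r) => results.push(r));
const finished = new Promise((resolve) => {
  worker.on('end', () => resolve(results));
});
finished.then((r) => console.log(r));
```

The design choice is that the stream doing the asynchronous work, not the upstream source, decides when the pipeline is finished.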

Page 14: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Lambda cascade function

    // Chop our given list of keys up into batches
    var batches = [];
    var batch = [];
    for (var i = 0, len = allKeys.length; i < len; i++) {
      batch.push(allKeys[i]);
      if (batch.length >= batchSize) {
        batches.push(batch.slice());
        batch = [];
      }
    }
    // Keep the final, partially filled batch so no keys are dropped
    if (batch.length > 0) {
      batches.push(batch);
    }

Page 15: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Lambda cascade function (continued)

    // Invoke each batch in parallel, returning aggregated result when all are finished.
    // Requires: var async = require('async');
    async.map(batches, invoke, function (err, results) {
      if (err) {
        context.fail('async.map error: ' + err.toString());
        return;
      }
      context.succeed(results);
    });

Page 16: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Lambda cascade function key points

• The nature of the data and workload will dictate the correct batch sizes to give a cascade function. You need to avoid running out of memory when aggregating results.

• 100:1 seems to work well: a good balance between low cascade overhead and manageable intermediate result size.
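A recursive cascade can be sketched as follows. This is illustrative only: `invokeWorker` is a hypothetical stub standing in for a real Lambda invocation, and reading "100:1" as "at most 100 children per cascade function" is an assumption.

```javascript
const FAN_OUT = 100;  // assumed reading of "100:1": up to 100 children per cascade

// Hypothetical stand-in for invoking a worker Lambda: reports how many keys it saw.
function invokeWorker(batch) {
  return Promise.resolve({ keys: batch.length });
}

// Split `items` into at most `parts` contiguous chunks.
function chunk(items, parts) {
  const size = Math.ceil(items.length / parts);
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Invoke workers directly when the batches fit under one cascade;
// otherwise split the batch list and recurse into child cascades.
function cascade(batches) {
  const calls = batches.length <= FAN_OUT
    ? batches.map(invokeWorker)
    : chunk(batches, FAN_OUT).map(cascade);
  // Aggregate child results into one small summary object.
  return Promise.all(calls).then((results) =>
    results.reduce((sum, r) => ({ keys: sum.keys + r.keys }), { keys: 0 }));
}

// 250 batches of 4 keys each: one extra cascade level is needed.
const demoBatches = Array.from({ length: 250 }, () => ['a', 'b', 'c', 'd']);
const demo = cascade(demoBatches);
demo.then((total) => console.log(total));
```

Each level returns only an aggregated summary, which keeps the intermediate result held in memory by any one function small.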

Page 17: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Worker function

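    var zlib = require('zlib');
    var eventstream = require('event-stream');

    // Split the decompressed stream into lines; process() handles each line
    // and cb() runs once the object has been fully read.
    var lineSplitter = eventstream.split();
    lineSplitter.on('data', process).on('end', cb);

    // Create our pipeline
    s3.getObject({
      Bucket: srcBucket,
      Key: srcKey
    })
      .createReadStream()
      .pipe(zlib.createGunzip())
      .pipe(lineSplitter);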

Page 18: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Worker function key points

• Use the full 1.5 GB of memory.

• Download Amazon S3 keys concurrently.

• 5 concurrent downloads seems to be the magic number for files in the 2-3 MB range.

• Use a faster decompression algorithm like LZ4 high-compression, which is up to 32x faster than zlib.

• Make sure warnings and failures percolate up with results.
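Capping concurrent downloads at five can be sketched with a small promise-based limiter. This is a minimal sketch, not the speaker's implementation: `mapLimit` and `fakeDownload` are hypothetical names, and a timer stands in for the S3 request.

```javascript
// Run `task` over `items` with at most `limit` tasks in flight at once.
function mapLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;
  function runOne() {
    if (next >= items.length) return Promise.resolve();
    const i = next++;
    return task(items[i]).then((r) => {
      results[i] = r;
      return runOne();  // pick up the next item as soon as a slot frees
    });
  }
  const runners = [];
  for (let n = 0; n < Math.min(limit, items.length); n++) {
    runners.push(runOne());
  }
  return Promise.all(runners).then(() => results);
}

// Hypothetical stand-in for an S3 download; tracks peak concurrency.
let inFlight = 0;
let peak = 0;
function fakeDownload(key) {
  inFlight++;
  peak = Math.max(peak, inFlight);
  return new Promise((resolve) => {
    setTimeout(() => {
      inFlight--;
      resolve(key.length);
    }, 5);
  });
}

const keys = Array.from({ length: 12 }, (_, i) => 'logs/part-' + i + '.gz');
const done = mapLimit(keys, 5, fakeDownload);
done.then((sizes) => {
  console.log(sizes.length + ' downloads, peak concurrency ' + peak);
});
```

The right limit depends on object size and available memory, so the value of 5 should be treated as a starting point to measure against.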

Page 19: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Non–Amazon S3–sourced workloads

Lambda can source from anything:

Amazon DynamoDB

Amazon RDS

Amazon Kinesis

Amazon EC2 endpoints

The Internet

Page 20: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Example Twitter App

Page 21: (CMP403) AWS Lambda: Simplifying Big Data Workloads

How do my followers feel about _____

1. Enter a keyword in the UI.

2. A Lambda worker executes for each follower.

3. Sentiment is reviewed (positive/negative/neutral).

4. Results are aggregated.
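The aggregation step can be sketched as a simple tally. The shape of the worker output (one label string per follower) and the function name `aggregateSentiment` are assumptions made for illustration.

```javascript
// Tally per-follower sentiment labels returned by the workers.
// Labels outside the three expected values are ignored.
function aggregateSentiment(labels) {
  const totals = { positive: 0, negative: 0, neutral: 0 };
  labels.forEach((label) => {
    if (Object.prototype.hasOwnProperty.call(totals, label)) {
      totals[label]++;
    }
  });
  return totals;
}

const sample = ['positive', 'neutral', 'negative', 'positive'];
console.log(aggregateSentiment(sample));
```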

Page 22: (CMP403) AWS Lambda: Simplifying Big Data Workloads
Page 23: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Streaming Results

Page 24: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Progressive results

Thirty seconds is an eternity in UX time.

Go beyond a progress bar: return streaming, progressive results.

Show something meaningful in 3-5 seconds and the final result in 30.

Graphically represent the updating data.
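One way to deliver progressive results is to send the running totals as Server-Sent Events each time a batch of worker results arrives. This is a sketch under assumptions: the helper names `toSseMessage` and `mergeCounts`, the event name "partial", and the count-per-key payload shape are all hypothetical.

```javascript
// Format one update as a Server-Sent Events message:
// an "event:" line, a "data:" line, then a blank line as terminator.
function toSseMessage(eventName, payload) {
  return 'event: ' + eventName + '\n' +
    'data: ' + JSON.stringify(payload) + '\n\n';
}

// Fold a partial result into the running totals.
function mergeCounts(running, partial) {
  Object.keys(partial).forEach((k) => {
    running[k] = (running[k] || 0) + partial[k];
  });
  return running;
}

// Simulate two batches of worker results arriving over time.
const running = {};
const messages = [{ a: 1 }, { a: 2, b: 1 }].map((partial) =>
  toSseMessage('partial', mergeCounts(running, partial)));

messages.forEach((m) => process.stdout.write(m));
```

In a browser, an `EventSource` listener for the "partial" event could redraw the chart on each message, so the user sees the result mature instead of waiting for the final answer.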

Page 25: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Mechanical sympathy

Visualizing the result stream as it matures communicates the magnitude of the work being performed and shows value.

Page 26: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Lambda Use Cases

Page 27: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Lambda is the future (and past)

It demonstrates the essence of AWS: capability through simplicity.

These things are no longer needed:

• Servers

• Operating systems

• Networking

Dev effort focuses only on core competencies, not infrastructure.

Page 28: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Dev advantages

• If the code works once, it works at any scale.

• Unit and integration testing are easy (no cluster setup required).

• Any failures are due to faulty code or bad input, which are caught by good unit tests.

Page 29: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Beyond containers

• No patching: all upgrades are core competency updates.

• No instance monitoring, only app monitoring.

• Goes beyond containers: devs have an ultra-consistent environment.

Page 30: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Remember mainframes?

1970s: Mainframes offer an attractive operating model but unattractive graphical capabilities.

1990s: PCs take over by bringing the compute to the people for a rich, graphical experience.

2010s: Ubiquitous mobile broadband centralizes the compute again by allowing the best of both worlds.

Page 31: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Related Sessions

ARC308 - The Serverless Company Using AWS Lambda: Streamlining Architecture with AWS

CMP301 - AWS Lambda: Event-Driven Code in the Cloud

Page 32: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Remember to complete your evaluations!

Page 33: (CMP403) AWS Lambda: Simplifying Big Data Workloads

Thank you!