© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nate Slater, AWS Solutions Architect
October 2015
BDT313
Amazon DynamoDB for Big Data: A Hands-on Look at Using Amazon
DynamoDB for Big Data Workloads
What to Expect from the Session
• A focus on the “how,” not the “what”:
• We look at fully functional implementations of several big data
architectures.
• We show how AWS services abstract much of the complexity of big
data without sacrificing power and scale.
• We demonstrate how combinations of services from the AWS data
ecosystem can be used to create feature-rich systems for analyzing
data.
What Is “Big Data”?
• Like many technology catchphrases, “big data” tends to be defined
in many different ways.
• Most definitions mention two primary characteristics:
• Size
• Velocity
Characteristics of Big Data
• The quantity of data is increasing at a rapid rate.
• Raw data from a variety of sources is increasingly being used to
answer key business questions:
• Log files
• How are your applications being used and who is using them?
• Application performance monitoring
• To what extent are poorly performing apps affecting my business?
• Application metrics
• How will users respond to this new feature?
• Security
• Who has access to my infrastructure, what do they have access to, and
how are they accessing it? Is this a threat?
Characteristics of Big Data
• The growth in data volume means the flow of data is moving at an
ever-faster rate:
• MB/s is the norm; GB/s is increasingly common.
• The number of connected users is growing at an amazing rate:
• Estimates of 75 billion connected devices by 2020.
• 10^5 or 10^6 transactions per second are not uncommon in big data
applications.
The “Sweet Spot” of Big Data
[Diagram: DynamoDB sits at the “sweet spot” where size, velocity, and structure intersect]
Transactional Data Processing
DynamoDB is well-suited for transactional processing:
• High concurrency
• Strong consistency
• Atomic updates of single items
• Conditional updates for de-duplication and optimistic concurrency
(sketched below)
• Support for both key/value and JSON document schemas
• Capable of handling large table sizes with low-latency data access
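
A minimal boto3 sketch of the conditional-write patterns mentioned above; the table, key, and attribute names are hypothetical, not taken from the talk:

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("AudioMetadata")  # hypothetical table

    # De-dupe: the put succeeds only if no item with this key exists yet.
    try:
        table.put_item(
            Item={"TrackId": "track-123", "Title": "My Song", "Version": 1},
            ConditionExpression="attribute_not_exists(TrackId)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise  # duplicates are silently ignored; anything else is a real error

    # Optimistic concurrency: the update succeeds only if the version we read
    # earlier is still current; otherwise another writer got there first.
    try:
        table.update_item(
            Key={"TrackId": "track-123"},
            UpdateExpression="SET Title = :t, Version = Version + :one",
            ConditionExpression="Version = :expected",
            ExpressionAttributeValues={":t": "New Title", ":one": 1, ":expected": 1},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            pass  # stale read: re-read the item and retry
        else:
            raise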
Demo 1: Store and Index Metadata for Objects
Stored in Amazon S3
Demo 1: Use Case
We have a large number of digital audio files stored in
Amazon S3 and we want to make them searchable:
• Use DynamoDB as the primary data store for the
metadata.
• Index and query the metadata using Elasticsearch.
Demo 1: Steps to Implement
1. Create a Lambda function that reads the metadata from the ID3 tag
and inserts it into a DynamoDB table.
2. Enable S3 notifications on the S3 bucket storing the audio files.
3. Enable streams on the DynamoDB table.
4. Create a second Lambda function that takes the metadata in
DynamoDB and indexes it using Elasticsearch.
5. Enable the stream as the event source for the Lambda function.
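
A minimal sketch of the two Lambda functions from steps 1 and 4, in Python. The table name, Elasticsearch endpoint, index path, and the ID3-parsing helper are assumptions for illustration, not the code from the session:

    import urllib.parse

    import boto3
    import requests  # assumed to be bundled with the deployment package

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("AudioMetadata")      # hypothetical table name
    ES_ENDPOINT = "https://search.example.com"   # hypothetical Elasticsearch endpoint

    def read_id3_tags(bucket, key):
        # Placeholder for the demo's ID3 parsing (e.g., with a library such as
        # mutagen); here we derive minimal metadata from the object key.
        return {"Title": key.rsplit("/", 1)[-1]}

    def s3_notification_handler(event, context):
        """Step 1: triggered by S3 notifications; writes ID3 metadata to DynamoDB."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            item = read_id3_tags(bucket, key)
            item["TrackId"] = key  # use the object key as the primary key
            table.put_item(Item=item)

    def stream_handler(event, context):
        """Step 4: triggered by the DynamoDB stream; indexes items in Elasticsearch."""
        for record in event["Records"]:
            if record["eventName"] not in ("INSERT", "MODIFY"):
                continue
            image = record["dynamodb"]["NewImage"]  # DynamoDB-typed attribute values
            doc = {k: next(iter(v.values())) for k, v in image.items()}  # naive untyping
            doc_id = urllib.parse.quote_plus(doc["TrackId"])
            requests.put(f"{ES_ENDPOINT}/tracks/_doc/{doc_id}", json=doc)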
Demo 1: Key Takeaways
1. DynamoDB + Elasticsearch = Durable, scalable, highly available
database with rich query capabilities.
2. Use Lambda functions to respond to events in both DynamoDB
streams and Amazon S3 without having to manage any underlying
compute infrastructure.
Demo 2: Execute Queries Against Multiple Data
Sources Using DynamoDB and Hive
Demo 2: Use Case
We want to enrich our audio file metadata stored in
DynamoDB with additional data from the Million Song
dataset:
• The Million Song dataset is stored in text files.
• ID3 tag metadata is stored in DynamoDB.
• Use Amazon EMR with Hive to join the two datasets
together in a query.
Demo 2: Steps to Implement
1. Spin up an Amazon EMR cluster with Hive.
2. Create an external Hive table using the
DynamoDBStorageHandler.
3. Create an external Hive table using the Amazon S3 location of the
text files containing the Million Song project metadata.
4. Create and run a Hive query that joins the two external tables
together and writes the joined results out to Amazon S3.
5. Load the results from Amazon S3 into DynamoDB.
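
A minimal sketch of steps 2–4, with the HiveQL embedded as a Python string and submitted to a running cluster via boto3. The table names, column mappings, cluster ID, and S3 paths are hypothetical, and the real Million Song files contain many more columns than shown:

    import boto3

    HIVE_SCRIPT = r"""
    -- Step 2: external table over the DynamoDB metadata table.
    CREATE EXTERNAL TABLE ddb_metadata (track_id STRING, artist STRING, title STRING)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
      "dynamodb.table.name" = "AudioMetadata",
      "dynamodb.column.mapping" = "track_id:TrackId,artist:Artist,title:Title");

    -- Step 3: external table over the Million Song text files in S3.
    CREATE EXTERNAL TABLE million_song (artist STRING, title STRING, year INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3://demo-bucket/million-song/';

    -- Step 4: join the two sources and write the result back to S3.
    INSERT OVERWRITE DIRECTORY 's3://demo-bucket/joined/'
    SELECT d.track_id, d.artist, d.title, m.year
    FROM ddb_metadata d JOIN million_song m
      ON (d.artist = m.artist AND d.title = m.title);
    """

    # Stage the script in S3, then submit it as a step to the EMR cluster.
    s3 = boto3.client("s3")
    s3.put_object(Bucket="demo-bucket", Key="scripts/join.q",
                  Body=HIVE_SCRIPT.encode("utf-8"))

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
        Steps=[{
            "Name": "Join DynamoDB metadata with Million Song data",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", "s3://demo-bucket/scripts/join.q"],
            },
        }],
    )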
Demo 2: Key Takeaways
1. Use Amazon EMR to quickly provision a Hadoop cluster with Hive
and to tear it down when done.
2. Using Hive with DynamoDB allows items in DynamoDB tables to be
queried and joined with data from a variety of sources.
Demo 3: Store and Analyze Sensor Data with
DynamoDB and Amazon Redshift
Demo 3: Use Case
A large number of sensors are taking readings at regular intervals. You
need to aggregate the data from each reading into a data warehouse
for analysis:
• Use Amazon Kinesis to ingest the raw sensor data.
• Store the sensor readings in DynamoDB for fast access and real-
time dashboards.
• Store raw sensor readings in Amazon S3 for durability and backup.
• Load the data from Amazon S3 into Amazon Redshift using AWS
Lambda.
Demo 3: Steps to Implement
1. Create two Lambda functions to read data from the Amazon
Kinesis stream.
2. Enable the Amazon Kinesis stream as an event source for each
Lambda function.
3. Write data into DynamoDB in one of the Lambda functions.
4. Write data into Amazon S3 in the other Lambda function.
5. Use the aws-lambda-redshift-loader to load the data in Amazon S3
into Amazon Redshift in batches.
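
A minimal sketch of the two Kinesis-triggered functions from steps 1–4; the table name, bucket name, payload schema, and key layout are assumptions. Step 5 is handled by the awslabs aws-lambda-redshift-loader project, which watches the S3 prefix and issues batched COPY commands into Amazon Redshift:

    import base64
    import json

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("SensorReadings")  # hypothetical table name
    s3 = boto3.client("s3")
    BUCKET = "sensor-archive"                 # hypothetical bucket name

    def dynamodb_writer(event, context):
        """Steps 1-3: store each reading in DynamoDB for real-time dashboards."""
        for record in event["Records"]:
            reading = json.loads(base64.b64decode(record["kinesis"]["data"]))
            table.put_item(Item={
                "SensorId": reading["sensor_id"],   # assumed partition key
                "Timestamp": reading["timestamp"],  # assumed sort key
                "Value": str(reading["value"]),     # strings avoid float/Decimal issues
            })

    def s3_writer(event, context):
        """Steps 1, 2, and 4: archive the raw batch to S3 for the Redshift loader."""
        lines = [base64.b64decode(r["kinesis"]["data"]) for r in event["Records"]]
        first_seq = event["Records"][0]["kinesis"]["sequenceNumber"]
        s3.put_object(Bucket=BUCKET,
                      Key=f"raw/{first_seq}.json",
                      Body=b"\n".join(lines))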
Demo 3: Key Takeaways
1. Amazon Kinesis + Lambda + DynamoDB = Scalable, durable,
highly available solution for sensor data ingestion with very low
operational overhead.
2. DynamoDB is well-suited for near-real-time queries of recent sensor
data readings.
3. Amazon Redshift is well-suited for deeper analysis of sensor data
readings spanning longer time horizons and very large numbers of
records.
4. Using Lambda to load data into Amazon Redshift provides a way to
perform ETL at frequent intervals.
Summary
• The versatility of DynamoDB makes it a cornerstone component of
many data architectures.
• “Big data” solutions usually involve a number of different tools for
storage, processing, and analysis.
• The AWS ecosystem offers a rich and powerful set of services that
make it possible to build scalable and durable “big data”
architectures with ease.
Remember to complete
your evaluations!
Thank you!