© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nate Slater, AWS Solutions Architect
October 2015
BDT313
Amazon DynamoDB for Big Data: A Hands-on Look at Using Amazon
DynamoDB for Big Data Workloads
What to Expect from the Session
• A focus on the “how,” not the “what”:
• We look at fully functional implementations of several big data
architectures.
• We show how AWS services abstract much of the complexity of big
data without sacrificing power and scale.
• We demonstrate how combinations of services from the AWS data
ecosystem can be used to create feature-rich systems for analyzing
data.
What Is “Big Data”?
• Like many technology catchphrases, “big data” tends to be defined
in many different ways.
• Most definitions mention two primary characteristics:
• Size
• Velocity
Characteristics of Big Data
• The quantity of data is increasing at a rapid rate.
• Raw data from a variety of sources is increasingly being used to
answer key business questions:
• Log files
• How are your applications being used and who is using them?
• Application performance monitoring
• To what extent are poorly performing apps affecting my business?
• Application metrics
• How will users respond to this new feature?
• Security
• Who has access to my infrastructure, what do they have access to, and
how are they accessing it? Is this a threat?
Characteristics of Big Data
• The growth in data volume means the flow of data is moving at an
ever-faster rate:
• MB/s is the norm; GB/s is increasingly common.
• The number of connected users is growing at an amazing rate:
• Estimates of 75 billion connected devices by 2020.
• 10^5 or 10^6 transactions per second are not uncommon in big data
applications.
The “Sweet Spot” of Big Data
[Diagram: DynamoDB sits at the “sweet spot” where size, velocity, and structure intersect]
Transactional Data Processing
DynamoDB is well-suited for transactional processing:
• High concurrency
• Strong consistency
• Atomic updates of single items
• Conditional updates for de-duplication and optimistic concurrency
(sketched below)
• Support for both key/value and JSON document schemas
• Capable of handling large table sizes with low-latency data access
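
A minimal boto3 sketch of the conditional-write patterns mentioned above; the table, key, and attribute names are hypothetical, not taken from the talk:

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("AudioMetadata")  # hypothetical table

    # De-dupe: the put succeeds only if no item with this key exists yet.
    try:
        table.put_item(
            Item={"TrackId": "track-123", "Title": "My Song", "Version": 1},
            ConditionExpression="attribute_not_exists(TrackId)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise  # duplicates are silently ignored; anything else is a real error

    # Optimistic concurrency: the update succeeds only if the version we read
    # earlier is still current; otherwise another writer got there first.
    try:
        table.update_item(
            Key={"TrackId": "track-123"},
            UpdateExpression="SET Title = :t, Version = Version + :one",
            ConditionExpression="Version = :expected",
            ExpressionAttributeValues={":t": "New Title", ":one": 1, ":expected": 1},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            pass  # stale read: re-read the item and retry
        else:
            raise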
Demo 1: Store and Index Metadata for Objects
Stored in Amazon S3
Demo 1: Use Case
We have a large number of digital audio files stored in
Amazon S3 and we want to make them searchable:
• Use DynamoDB as the primary data store for the
metadata.
• Index and query the metadata using Elasticsearch.
Demo 1: Steps to Implement
1. Create a Lambda function that reads the metadata from the ID3 tag
and inserts it into a DynamoDB table.
2. Enable S3 notifications on the S3 bucket storing the audio files.
3. Enable streams on the DynamoDB table.
4. Create a second Lambda function that takes the metadata in
DynamoDB and indexes it using Elasticsearch.
5. Enable the stream as the event source for the Lambda function.
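
A minimal sketch of the two Lambda functions from steps 1 and 4, in Python. The table name, Elasticsearch endpoint, index path, and the ID3-parsing helper are assumptions for illustration, not the code from the session:

    import urllib.parse

    import boto3
    import requests  # assumed to be bundled with the deployment package

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("AudioMetadata")      # hypothetical table name
    ES_ENDPOINT = "https://search.example.com"   # hypothetical Elasticsearch endpoint

    def read_id3_tags(bucket, key):
        # Placeholder for the demo's ID3 parsing (e.g., with a library such as
        # mutagen); here we derive minimal metadata from the object key.
        return {"Title": key.rsplit("/", 1)[-1]}

    def s3_notification_handler(event, context):
        """Step 1: triggered by S3 notifications; writes ID3 metadata to DynamoDB."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            item = read_id3_tags(bucket, key)
            item["TrackId"] = key  # use the object key as the primary key
            table.put_item(Item=item)

    def stream_handler(event, context):
        """Step 4: triggered by the DynamoDB stream; indexes items in Elasticsearch."""
        for record in event["Records"]:
            if record["eventName"] not in ("INSERT", "MODIFY"):
                continue
            image = record["dynamodb"]["NewImage"]  # DynamoDB-typed attribute values
            doc = {k: next(iter(v.values())) for k, v in image.items()}  # naive untyping
            doc_id = urllib.parse.quote_plus(doc["TrackId"])
            requests.put(f"{ES_ENDPOINT}/tracks/_doc/{doc_id}", json=doc)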
Demo 1: Key Takeaways
1. DynamoDB + Elasticsearch = Durable, scalable, highly available
database with rich query capabilities.
2. Use Lambda functions to respond to events in both DynamoDB
streams and Amazon S3 without having to manage any underlying
compute infrastructure.
Demo 2: Execute Queries Against Multiple Data
Sources Using DynamoDB and Hive
Demo 2: Use Case
We want to enrich our audio file metadata stored in
DynamoDB with additional data from the Million Song
dataset:
• The Million Song dataset is stored in text files.
• ID3 tag metadata is stored in DynamoDB.
• Use Amazon EMR with Hive to join the two datasets
together in a query.
Demo 2: Steps to Implement
1. Spin up an Amazon EMR cluster with Hive.
2. Create an external Hive table using the
DynamoDBStorageHandler.
3. Create an external Hive table using the Amazon S3 location of the
text files containing the Million Song project metadata.
4. Create and run a Hive query that joins the two external tables
together and writes the joined results out to Amazon S3.
5. Load the results from Amazon S3 into DynamoDB.
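
A minimal sketch of steps 2–4, with the HiveQL embedded as a Python string and submitted to a running cluster via boto3. The table names, column mappings, cluster ID, and S3 paths are hypothetical, and the real Million Song files contain many more columns than shown:

    import boto3

    HIVE_SCRIPT = r"""
    -- Step 2: external table over the DynamoDB metadata table.
    CREATE EXTERNAL TABLE ddb_metadata (track_id STRING, artist STRING, title STRING)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
      "dynamodb.table.name" = "AudioMetadata",
      "dynamodb.column.mapping" = "track_id:TrackId,artist:Artist,title:Title");

    -- Step 3: external table over the Million Song text files in S3.
    CREATE EXTERNAL TABLE million_song (artist STRING, title STRING, year INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3://demo-bucket/million-song/';

    -- Step 4: join the two sources and write the result back to S3.
    INSERT OVERWRITE DIRECTORY 's3://demo-bucket/joined/'
    SELECT d.track_id, d.artist, d.title, m.year
    FROM ddb_metadata d JOIN million_song m
      ON (d.artist = m.artist AND d.title = m.title);
    """

    # Stage the script in S3, then submit it as a step to the EMR cluster.
    s3 = boto3.client("s3")
    s3.put_object(Bucket="demo-bucket", Key="scripts/join.q",
                  Body=HIVE_SCRIPT.encode("utf-8"))

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
        Steps=[{
            "Name": "Join DynamoDB metadata with Million Song data",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", "s3://demo-bucket/scripts/join.q"],
            },
        }],
    )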
Demo 2: Key Takeaways
1. Use Amazon EMR to quickly provision a Hadoop cluster with Hive
and to tear it down when done.
2. Using Hive with DynamoDB allows items in DynamoDB tables to be
queried and joined with data from a variety of sources.
Demo 3: Store and Analyze Sensor Data with
DynamoDB and Amazon Redshift
Demo 3: Use Case
A large number of sensors are taking readings at regular intervals. You
need to aggregate the data from each reading into a data warehouse
for analysis:
• Use Amazon Kinesis to ingest the raw sensor data.
• Store the sensor readings in DynamoDB for fast access and real-
time dashboards.
• Store raw sensor readings in Amazon S3 for durability and backup.
• Load the data from Amazon S3 into Amazon Redshift using AWS
Lambda.
Demo 3: Steps to Implement
1. Create two Lambda functions to read data from the Amazon
Kinesis stream.
2. Enable the Amazon Kinesis stream as an event source for each
Lambda function.
3. Write data into DynamoDB in one of the Lambda functions.
4. Write data into Amazon S3 in the other Lambda function.
5. Use the aws-lambda-redshift-loader to load the data in Amazon S3
into Amazon Redshift in batches.
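
A minimal sketch of the two Kinesis-triggered functions from steps 1–4; the table name, bucket name, payload schema, and key layout are assumptions. Step 5 is handled by the awslabs aws-lambda-redshift-loader project, which watches the S3 prefix and issues batched COPY commands into Amazon Redshift:

    import base64
    import json

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("SensorReadings")  # hypothetical table name
    s3 = boto3.client("s3")
    BUCKET = "sensor-archive"                 # hypothetical bucket name

    def dynamodb_writer(event, context):
        """Steps 1-3: store each reading in DynamoDB for real-time dashboards."""
        for record in event["Records"]:
            reading = json.loads(base64.b64decode(record["kinesis"]["data"]))
            table.put_item(Item={
                "SensorId": reading["sensor_id"],   # assumed partition key
                "Timestamp": reading["timestamp"],  # assumed sort key
                "Value": str(reading["value"]),     # strings avoid float/Decimal issues
            })

    def s3_writer(event, context):
        """Steps 1, 2, and 4: archive the raw batch to S3 for the Redshift loader."""
        lines = [base64.b64decode(r["kinesis"]["data"]) for r in event["Records"]]
        first_seq = event["Records"][0]["kinesis"]["sequenceNumber"]
        s3.put_object(Bucket=BUCKET,
                      Key=f"raw/{first_seq}.json",
                      Body=b"\n".join(lines))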
Demo 3: Key Takeaways
1. Amazon Kinesis + Lambda + DynamoDB = Scalable, durable,
highly available solution for sensor data ingestion with very low
operational overhead.
2. DynamoDB is well-suited for near-real-time queries of recent sensor
data readings.
3. Amazon Redshift is well-suited for deeper analysis of sensor data
readings spanning longer time horizons and very large numbers of
records.
4. Using Lambda to load data into Amazon Redshift provides a way to
perform ETL at frequent intervals.
Summary
• The versatility of DynamoDB makes it a cornerstone component of
many data architectures.
• “Big data” solutions usually involve a number of different tools for
storage, processing, and analysis.
• The AWS ecosystem offers a rich and powerful set of services that
make it possible to build scalable and durable “big data”
architectures with ease.
Remember to complete
your evaluations!
Thank you!