Introduction To Elastic MapReduce at WHUG


Elastic MapReduce presentation given at the 2nd meeting of the Warsaw Hadoop User Group. Also watch the demonstration at www.youtube.com/watch?v=Azwilbn8GCs (it shows how to create a Hadoop cluster on Amazon Elastic MapReduce with Karmasphere Studio for EMR, a plugin for Eclipse, to launch big calculations quickly and easily).


Possible real-world situation

● We have big data and/or very long, embarrassingly parallel computation

● Our data may grow fast
● We want to start and try Hadoop asap
● We do not have our own infrastructure
● We do not have Hadoop administrators
● We have limited funds

Possible solution

Amazon Elastic MapReduce (EMR)
● Hadoop framework running on the web-scale infrastructure of Amazon

EMR Benefits

Elastic (scalable)
● Use one, a hundred, or even thousands of instances to process even petabytes of data
● Modify the number of instances while the job flow is running
● Start computation within minutes

EMR Benefits

Easy to use
● No configuration necessary
○ Do not worry about setting up hardware and networking, or running, managing and tuning the performance of a Hadoop cluster
● Easy-to-use tools and plugins available
○ AWS Web Management Console
○ Command Line Tools by Amazon
○ Amazon EMR API, SDK, Libraries
○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio for EMR)

EMR Benefits

Reliable
● Built on Amazon's highly available and battle-tested infrastructure
● Provisions new nodes to replace those that fail
● Used by e.g.:

EMR Benefits

Cost effective
● Pay for what you use (for each started hour)
● Choose from various instance types to meet your requirements
● Possibility to reserve instances for 1 or 3 years to pay less per hour

EMR Overview

Amazon Elastic MapReduce (Amazon EMR) works in conjunction with:
● Amazon EC2 to rent computing instances (with Hadoop installed)
● Amazon S3 to store input and output data, scripts/applications and logs
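
For illustration only: the two services can be tied together programmatically. Below is a minimal sketch using the boto Python library (not shown in the presentation); the credentials, bucket names and script paths are made-up placeholders.

from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

# Credentials, bucket names and script paths below are placeholders.
conn = EmrConnection('<aws-access-key-id>', '<aws-secret-access-key>')

step = StreamingStep(
    name='Word count step',
    mapper='s3://my-bucket/scripts/mapper.py',
    reducer='s3://my-bucket/scripts/reducer.py',
    input='s3://my-bucket/input/',
    output='s3://my-bucket/output/')

# Rents EC2 instances with Hadoop installed, runs the step and
# writes the logs to the given S3 location.
jobflow_id = conn.run_jobflow(
    name='WHUG demo job flow',
    log_uri='s3://my-bucket/logs/',
    num_instances=4,
    master_instance_type='m1.small',
    slave_instance_type='m1.small',
    steps=[step])

print(jobflow_id)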

EMR Architectural Overview

* image from the Internet

EC2 Instance Types

* image from Big Data University, Course: "Hadoop and the Amazon Cloud"

EMR Pricing - "On-demand" instances

Standard Family Instances (US East Region) http://aws.amazon.com/elasticmapreduce/pricing/

EC2 & S3 Pricing - Real-world example

The New York Times wanted to host all public domain articles from 1851 to 1922.
● 11 million articles
● 4 TB of raw TIFF image input data converted to 1.5 TB of PDF documents
● 100 EC2 instances rented
● < 24 hours of computation
● $240 paid (not including storage & bandwidth)
● 1 employee assigned to this task

EC2 & S3 Pricing - Real-world example

How much did they pay for storage and bandwidth?

S3 Pricing

http://aws.amazon.com/s3/pricing/

EC2 & S3 Pricing Calculator

Simple Monthly Calculator: http://calculator.s3.amazonaws.com/calc5.html

AWS Free Usage Tier (Per Month)

Available for free to new AWS customers for 12 months following the AWS sign-up date, e.g.:
● 750 hours of Amazon EC2 Micro Instance usage
○ 613 MB of memory and 32-bit or 64-bit platform
● 5 GB of Amazon S3 standard storage, 20,000 Get and 2,000 Put Requests
● 15 GB of bandwidth out aggregated across all AWS services

http://aws.amazon.com/free/

EMR - Support for Hadoop Ecosystem

Develop and run MapReduce applications using:
● Java
● Streaming (e.g. Ruby, Perl, Python, PHP, R, or C++), as sketched below
● Pig
● Hive

HBase can be easily installed using a set of EC2 scripts.
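
As an illustration of the Streaming option (not taken from the slides), here is a minimal word-count mapper and reducer in Python that read lines from standard input and write tab-separated key/value pairs to standard output; on EMR they would be referenced as a streaming step.

# mapper.py - emits "<word>\t1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

# reducer.py - sums the counts per word (Hadoop sorts the input by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))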

EMR - Featured Users

* logos from http://aws.amazon.com/elasticmapreduce/

EMR - Case Study - Yelp

● helps people connect with great local businesses
● lets them share reviews and insights
● as of November 2010:
○ 39 million monthly unique visitors
○ 14 million reviews posted in total

EMR - Case Study - Yelp

● uses S3 to store daily logs (~100GB/day) and photos

● uses EMR to power features like
○ People who viewed this also viewed
○ Review highlights
○ Autocomplete in search box
○ Top searches

● implements jobs in Python and uses their own open-source library, mrjob, to run them on EMR

mrjob - WordCount example

from mrjob.job import MRJob

class MRWordCounter(MRJob):

    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()

mrjob - run on EMR

$ python wordcount.py --ec2-instance-type c1.medium --num-ec2-instances 10 -r emr < 's3://input-bucket/*.txt' > output

Demo

Million Song Dataset

● Contains detailed acoustic and contextual data for one million popular songs

● ~300 GB of data
● Publicly available

○ for download: http://www.infochimps.com/collections/million-songs

○ for processing using EMR: http://tbmmsd.s3.amazonaws.com/

Million Song Dataset

Contains data such as:
● Song's title, year and hotness
● Song's tempo, duration, danceability, energy, loudness, segment count, preview (URL to an mp3 file) and so on
● Artist's name and hotness

Million Song Dataset - Song's density

Song's density* can be defined as the average number of notes or atomic sounds (called segments) per second in a song.

density = segmentCnt / duration

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
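
A quick worked example of the formula in plain Python (the numbers are made up, not from the dataset):

def density(segment_cnt, duration_seconds):
    # average number of segments per second
    return float(segment_cnt) / duration_seconds

# e.g. a 240-second song with 600 segments:
print(density(600, 240.0))  # -> 2.5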

Million Song Dataset - Task*

Simple music recommendation system
● Calculate density for each song
● Find hot songs with similar density

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ

Million Song Dataset - MapReduce

Input data
● 339 files
● Each file contains ~3 000 songs
● Each song is represented by one line in the input file
● Fields are separated by a tab character

Million Song Dataset - MapReduce

Mapper
● Reads the song's data from each line of input text
● Calculates the song's density
● Emits the song's density as key with some other details as value

<line_offset, song_data> -> <density, (artist_name, song_title, song_url)>

public void map(LongWritable key, Text value,
    OutputCollector<FloatWritable, TripleTextWritable> output, Reporter reporter)
    throws IOException {
  song.parseLine(value.toString());
  if (song.tempo > 0 && song.duration > 0) {
    // calculate density
    float density = ((float) song.segmentCnt) / song.duration;
    densityWritable.set(density);
    songWritable.set(song.artistName, song.title, song.preview);
    output.collect(densityWritable, songWritable);
  }
}

Million Song Dataset - MapReduce

Reducer
● Identity Reducer
● Each Reducer gets density values from a different range: [i, i+1)*,** (see the sketch below)

<density, [(artist_name, song_title, song_url)]> -> <density, (artist_name, song_title, song_url)>

* thanks to a custom Partitioner
** not optimal partitioning (partitions are not balanced)
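
The custom Partitioner itself is not included in the slides; the sketch below only illustrates the described idea in Python (the demo code is Java): every key in [i, i+1) is routed to the same reducer, so each reducer receives one contiguous density range.

import math

def partition(density, num_partitions):
    # all densities in [i, i+1) map to the same partition; partitions
    # are not balanced because most songs fall into a few intervals
    return int(math.floor(density)) % num_partitions

# e.g. with 10 reducers, densities 2.1 and 2.9 both go to reducer 2:
print(partition(2.1, 10), partition(2.9, 10))  # -> 2 2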

Demo - used software

● Karmasphere Studio for EMR (Eclipse plugin)
○ graphical environment that supports the complete lifecycle of developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop jobs (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)

Demo - used software

● Karmasphere Studio for EMR (Eclipse plugin)

images from: http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html

Video

Please watch the video on the WHUG channel on YouTube: http://www.youtube.com/watch?v=Azwilbn8GCs

Thank you!

Join us! whug.org
