Introduction To Elastic MapReduce at WHUG

Possible real-world situation

● We have big data and/or a very long, embarrassingly parallel computation
● Our data may grow fast
● We want to start trying Hadoop as soon as possible
● We do not have our own infrastructure
● We do not have Hadoop administrators
● We have limited funds

Possible solution

Amazon Elastic MapReduce (EMR)
● Hadoop framework running on Amazon's web-scale infrastructure

EMR Benefits

Elastic (scalable)
● Use one, a hundred, or even thousands of instances to process even petabytes of data
● Modify the number of instances while the job flow is running
● Start computation within minutes

EMR Benefits

Easy to use
● No configuration necessary
  ○ No need to worry about setting up hardware and networking, or about running, managing and tuning the performance of a Hadoop cluster
● Easy-to-use tools and plugins available
  ○ AWS Web Management Console
  ○ Command Line Tools by Amazon
  ○ Amazon EMR API, SDK, Libraries
  ○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio for EMR)

EMR Benefits

Reliable
● Built on Amazon's highly available and battle-tested infrastructure
● Provisions new nodes to replace those that fail
● Used by e.g.:

EMR Benefits

Cost effective
● Pay for what you use (for each started hour)
● Choose from various instance types to meet your requirements
● Possibility to reserve instances for 1 or 3 years to pay less per hour

EMR Overview

Amazon Elastic MapReduce (Amazon EMR) works in conjunction with
● Amazon EC2 to rent computing instances (with Hadoop installed)
● Amazon S3 to store input and output data, scripts/applications and logs

EMR Architectural Overview

* image from the Internet

EC2 Instance Types

* image from Big Data University, Course: "Hadoop and the Amazon Cloud"

EMR Pricing - "On-demand" instances

Standard Family Instances (US East Region) http://aws.amazon.com/elasticmapreduce/pricing/

EC2 & S3 Pricing - Real-world example

The New York Times wanted to host all public domain articles from 1851 to 1922.
● 11 million articles
● 4 TB of raw TIFF image input data converted to 1.5 TB of PDF documents
● 100 EC2 instances rented
● < 24 hours of computation
● $240 paid (not including storage & bandwidth)
● 1 employee assigned to this task
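
The $240 figure is consistent with billing for each started hour. A minimal sketch of that arithmetic, assuming an on-demand rate of $0.10 per instance-hour (the slides do not state the rate; this value is an assumption chosen because it reproduces the total) and a hypothetical runtime of 23.5 hours standing in for "< 24 hours":

```python
import math


def emr_compute_cost(instances, hours_run, rate_per_hour):
    """Estimate compute cost when every started hour is billed in full."""
    billed_hours = math.ceil(hours_run)  # a partly used hour counts as a whole hour
    return instances * billed_hours * rate_per_hour


# Assumed rate: $0.10 per instance-hour (not stated in the slides).
# 23.5 is a hypothetical runtime standing in for "< 24 hours".
cost = emr_compute_cost(instances=100, hours_run=23.5, rate_per_hour=0.10)
print(f"${cost:.2f}")
```

Storage and bandwidth are billed separately and are not covered by this estimate.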

EC2 & S3 Pricing - Real-world example

How much did they pay for storage and bandwidth?

S3 Pricing

http://aws.amazon.com/s3/pricing/

EC2 & S3 Pricing Calculator

Simple Monthly Calculator: http://calculator.s3.amazonaws.com/calc5.html

AWS Free Usage Tier (Per Month)

Available for free to new AWS customers for 12 months following the AWS sign-up date, e.g.:
● 750 hours of Amazon EC2 Micro Instance usage
  ○ 613 MB of memory and 32-bit or 64-bit platform
● 5 GB of Amazon S3 standard storage, 20,000 Get and 2,000 Put Requests
● 15 GB of bandwidth out aggregated across all AWS services

http://aws.amazon.com/free/

EMR - Support for Hadoop Ecosystem

Develop and run MapReduce applications using:
● Java
● Streaming (e.g. Ruby, Perl, Python, PHP, R, or C++)
● Pig
● Hive

HBase can be easily installed using a set of EC2 scripts

EMR - Featured Users

* logos from http://aws.amazon.com/elasticmapreduce/

EMR - Case Study - Yelp

● helps people connect with great local businesses
● helps people share reviews and insights
● as of November 2010:
  ○ 39 million monthly unique visitors
  ○ in total, 14 million reviews posted



● uses S3 to store daily logs (~100GB/day) and photos

● uses EMR to power features like
  ○ People who viewed this also viewed
  ○ Review highlights
  ○ Autocomplete in search box
  ○ Top searches
● implements jobs in Python and uses its own open-source library, mrjob, to run them on EMR

mrjob - WordCount example

from mrjob.job import MRJob


class MRWordCounter(MRJob):
    def mapper(self, key, line):
        # Emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # Sum all the 1s emitted for this word
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()
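
To see how the mapper and reducer fit together without a cluster, the same logic can be simulated in one process. This is a minimal sketch in plain Python that does not use mrjob; `run_local` and the sample lines are hypothetical names and data introduced only for illustration.

```python
from collections import defaultdict


def mapper(line):
    # Same logic as MRWordCounter.mapper: emit (word, 1) for every word
    for word in line.split():
        yield word, 1


def reducer(word, occurrences):
    # Same logic as MRWordCounter.reducer: sum the counts for one word
    yield word, sum(occurrences)


def run_local(lines):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)  # the "shuffle" step groups values by key
    results = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            results[key] = total
    return results


counts = run_local(["to be or not to be", "to do"])
print(counts)
```

On a real cluster the grouping step is done by Hadoop between the map and reduce phases; here a dictionary plays that role.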

mrjob - run on EMR

$ python wordcount.py -r emr --ec2-instance-type c1.medium --num-ec2-instances 10 's3://input-bucket/*.txt' > output

Demo

Million Song Dataset

● Contains detailed acoustic and contextual data for one million popular songs

● ~300 GB of data
● Publicly available
  ○ for download: http://www.infochimps.com/collections/million-songs
  ○ for processing using EMR: http://tbmmsd.s3.amazonaws.com/

Million Song Dataset

Contains data such as:
● Song's title, year and hotness
● Song's tempo, duration, danceability, energy, loudness, segments count, preview (URL to mp3 file) and so on

● Artist's name and hotness

Million Song Dataset - Song's density

Song's density* can be defined as the average number of notes or atomic sounds (called segments) per second in a song.

density = segmentCnt / duration

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
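
As a quick worked example of the density formula, here is a minimal sketch; the function name and the sample numbers are hypothetical and not taken from the dataset.

```python
def song_density(segment_count, duration_seconds):
    """Average number of segments per second of audio."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return segment_count / duration_seconds


# Hypothetical song: 900 segments over 240 seconds
print(song_density(900, 240))  # 3.75
```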

Million Song Dataset - Task*

Simple music recommendation system
● Calculate density for each song
● Find hot songs with similar density

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
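
One possible reading of "hot songs with similar density" is sketched below. The tolerance, the hotness threshold, the record layout and the sample songs are all assumptions made for illustration; the slides do not define them.

```python
def similar_hot_songs(songs, target_density, tolerance=0.5, min_hotness=0.7, limit=3):
    """Return titles of hot songs whose density is close to target_density."""
    candidates = [
        s for s in songs
        if s["hotness"] >= min_hotness
        and abs(s["density"] - target_density) <= tolerance
    ]
    # Closest density first
    candidates.sort(key=lambda s: abs(s["density"] - target_density))
    return [s["title"] for s in candidates[:limit]]


# Made-up records, not taken from the Million Song Dataset
songs = [
    {"title": "A", "density": 3.1, "hotness": 0.9},
    {"title": "B", "density": 3.4, "hotness": 0.4},   # close, but not hot enough
    {"title": "C", "density": 5.0, "hotness": 0.95},  # hot, but density too different
    {"title": "D", "density": 2.8, "hotness": 0.8},
]
print(similar_hot_songs(songs, target_density=3.0))
```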

Million Song Dataset - MapReduce

Input data
● 339 files
● Each file contains ~3,000 songs
● Each song is represented by one line in the input file
● Fields are separated by a tab character


Mapper
● Reads song's data from each line of input text
● Calculates song's density
● Emits song's density as key with some other details as value

<line_offset, song_data> -> <density, (artist_name, song_title, song_url)>

public void map(LongWritable key, Text value,
        OutputCollector<FloatWritable, TripleTextWritable> output,
        Reporter reporter) throws IOException {

    // parse one tab-separated line of song data
    song.parseLine(value.toString());

    // skip songs with missing tempo or duration (also avoids division by zero)
    if (song.tempo > 0 && song.duration > 0) {
        // calculate density
        float density = ((float) song.segmentCnt) / song.duration;

        denstyWritable.set(density);
        songWritable.set(song.artistName, song.title, song.preview);

        output.collect(denstyWritable, songWritable);
    }
}
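
For readers following the mrjob examples, the same mapper logic can be sketched in Python. The column order in `FIELDS` is hypothetical (the slides do not give the real layout of the input files), and the sample lines are made up.

```python
# Hypothetical column order; the real dataset layout is not given in the slides.
FIELDS = ["artist_name", "title", "preview", "tempo", "duration", "segment_count"]


def map_song(line):
    """Yield (density, (artist_name, title, preview)) for one tab-separated line."""
    song = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    tempo = float(song["tempo"])
    duration = float(song["duration"])
    # Same guard as the Java mapper: skip songs with missing tempo or duration
    if tempo > 0 and duration > 0:
        density = int(song["segment_count"]) / duration
        yield density, (song["artist_name"], song["title"], song["preview"])


# Made-up input lines
lines = [
    "Artist A\tSong One\thttp://example.com/one.mp3\t120.0\t200.0\t800",
    "Artist B\tSong Two\thttp://example.com/two.mp3\t0.0\t180.0\t500",  # dropped: tempo is 0
]
for line in lines:
    for density, details in map_song(line):
        print(density, details)
```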


Reducer
● Identity Reducer
● Each Reducer gets density values from a different range: <i, i+1)*,**

<density, [(artist_name, song_title, song_url)]> -> <density, (artist_name, song_title, song_url)>

* thanks to a custom Partitioner
** not optimal partitioning (partitions are not balanced)
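
The range-based partitioning can be sketched as a plain function. This is an illustration of the idea in Python, not the partitioner used in the demo; clamping out-of-range densities into the last reducer is an assumption.

```python
import math


def density_partition(density, num_reducers):
    """Send densities in <i, i+1) to reducer i; clamp overflow into the last reducer."""
    bucket = math.floor(density)
    return max(0, min(bucket, num_reducers - 1))


for d in [0.4, 3.0, 3.99, 12.7]:
    print(d, "->", density_partition(d, num_reducers=10))
```

Because each reducer owns a fixed density range, a reducer whose range holds many songs receives far more records than the others, which is why the partitions are not balanced.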

Demo - used software

● Karmasphere Studio for EMR (Eclipse plugin)
  ○ graphical environment that supports the complete lifecycle for developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop Jobs (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)


Demo - used software

● Karmasphere Studio for EMR (Eclipse plugin)

images from: http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html


Video

Please watch the video on the WHUG channel on YouTube: http://www.youtube.com/watch?v=Azwilbn8GCs

Thank you!

Join us! whug.org
