
Introduction To Elastic MapReduce at WHUG


DESCRIPTION

Elastic MapReduce presentation given at the 2nd meeting of the Warsaw Hadoop User Group. See also the demonstration at www.youtube.com/watch?v=Azwilbn8GCs, which shows how to create a Hadoop cluster on Amazon Elastic MapReduce with Karmasphere Studio for EMR (a plugin for Eclipse) to launch big calculations quickly and easily.


Page 1: Introduction To Elastic MapReduce at WHUG
Page 2: Introduction To Elastic MapReduce at WHUG

Possible real-world situation

● We have big data and/or very long, embarrassingly parallel computation
● Our data may grow fast
● We want to start and try Hadoop asap
● We do not have our own infrastructure
● We do not have Hadoop administrators
● We have limited funds

Page 3: Introduction To Elastic MapReduce at WHUG

Possible solution

Amazon Elastic MapReduce (EMR)
● Hadoop framework running on the web-scale infrastructure of Amazon

Page 4: Introduction To Elastic MapReduce at WHUG

EMR Benefits

Elastic (scalable)
● Use one, a hundred, or even thousands of instances to process even petabytes of data
● Modify the number of instances while the job flow is running
● Start computation within minutes

Page 5: Introduction To Elastic MapReduce at WHUG

EMR Benefits

Easy to use
● No configuration necessary
  ○ Do not worry about setting up hardware and networking, or about running, managing and tuning the performance of a Hadoop cluster
● Easy-to-use tools and plugins available
  ○ AWS Web Management Console
  ○ Command Line Tools by Amazon
  ○ Amazon EMR API, SDK, Libraries
  ○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio for EMR)

Page 6: Introduction To Elastic MapReduce at WHUG

EMR Benefits

Reliable
● Built on Amazon's highly available and battle-tested infrastructure
● Provision new nodes to replace those that fail
● Used by e.g.:

Page 7: Introduction To Elastic MapReduce at WHUG

EMR Benefits

Cost effective
● Pay for what you use (for each started hour)
● Choose among various instance types to meet your requirements
● Possibility to reserve instances for 1 or 3 years to pay less per hour

Page 8: Introduction To Elastic MapReduce at WHUG

EMR Overview

Amazon Elastic MapReduce (Amazon EMR) works in conjunction with
● Amazon EC2 to rent computing instances (with Hadoop installed)
● Amazon S3 to store input and output data, scripts/applications and logs

Page 9: Introduction To Elastic MapReduce at WHUG

EMR Architectural Overview

* image from the Internet

Page 10: Introduction To Elastic MapReduce at WHUG

EC2 Instance Types

* image from Big Data University, Course: "Hadoop and the Amazon Cloud"

Page 11: Introduction To Elastic MapReduce at WHUG

EMR Pricing - "On-demand" instances

Standard Family Instances (US East Region) http://aws.amazon.com/elasticmapreduce/pricing/

Page 12: Introduction To Elastic MapReduce at WHUG

EC2 & S3 Pricing - Real-world example

New York Times wanted to host all public domain articles from 1851 to 1922.
● 11 million articles
● 4 TB of raw image TIFF input data converted to 1.5 TB of PDF documents
● 100 EC2 Instances rented
● < 24 hours of computation
● $240 paid (not including storage & bandwidth)
● 1 employee assigned to this task
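As a rough check on these figures, the sketch below applies the "pay for each started hour" rule to the job. The hourly rate is not stated on the slide; $0.10 per instance-hour is an assumption, chosen because it is the rate implied by $240 for 100 instances billed for 24 hours each. The function name and the 23.5-hour duration are illustrative only.

```python
import math

def ec2_cost(instances, hours_used, rate_per_hour):
    """Cost when every started hour is billed in full for each instance."""
    billed_hours = math.ceil(hours_used)
    return instances * billed_hours * rate_per_hour

# Assumption: $0.10 per instance-hour (implied by the slide's totals, not stated on it).
ASSUMED_RATE = 0.10

# "< 24 hours" still bills 24 full hours per instance; 23.5 is an illustrative duration.
cost = ec2_cost(instances=100, hours_used=23.5, rate_per_hour=ASSUMED_RATE)
print(f"${cost:.2f}")  # prints $240.00
```

Under this assumption the compute bill matches the slide, which also shows why finishing just under a whole number of hours matters when billing is per started hour.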

Page 13: Introduction To Elastic MapReduce at WHUG
Page 14: Introduction To Elastic MapReduce at WHUG

EC2 & S3 Pricing - Real-world example

How much did they pay for storage and bandwidth?

Page 15: Introduction To Elastic MapReduce at WHUG

S3 Pricing

http://aws.amazon.com/s3/pricing/

Page 16: Introduction To Elastic MapReduce at WHUG

EC2 & S3 Pricing Calculator

Simple Monthly Calculator: http://calculator.s3.amazonaws.com/calc5.html

Page 17: Introduction To Elastic MapReduce at WHUG

AWS Free Usage Tier (Per Month)

Available for free to new AWS customers for 12 months following the AWS sign-up date, e.g.:
● 750 hours of Amazon EC2 Micro Instance usage
  ○ 613 MB of memory and 32-bit or 64-bit platform
● 5 GB of Amazon S3 standard storage, 20,000 Get and 2,000 Put Requests
● 15 GB of bandwidth out aggregated across all AWS services

http://aws.amazon.com/free/

Page 18: Introduction To Elastic MapReduce at WHUG

EMR - Support for Hadoop Ecosystem

Develop and run MapReduce applications using:
● Java
● Streaming (e.g. Ruby, Perl, Python, PHP, R, or C++)
● Pig
● Hive

HBase can be easily installed using a set of EC2 scripts

Page 19: Introduction To Elastic MapReduce at WHUG

EMR - Featured Users

* logos from http://aws.amazon.com/elasticmapreduce/

Page 20: Introduction To Elastic MapReduce at WHUG

EMR - Case Study - Yelp

● help people connect with great local businesses
● share reviews and insights
● as of November 2010:
  ○ 39 million monthly unique visitors
  ○ in total, 14 million reviews posted

Page 21: Introduction To Elastic MapReduce at WHUG

EMR - Case Study - Yelp

Page 22: Introduction To Elastic MapReduce at WHUG

EMR - Case Study - Yelp

● uses S3 to store daily logs (~100GB/day) and photos
● uses EMR to power features like
  ○ People who viewed this also viewed
  ○ Review highlights
  ○ Autocomplete in search box
  ○ Top searches
● implements jobs in Python and uses their own open-source library, mrjob, to run them on EMR

Page 23: Introduction To Elastic MapReduce at WHUG

mrjob - WordCount example

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
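To see what the framework does around these two methods, here is a minimal pure-Python sketch of the map, shuffle and reduce phases run locally on a list of lines. It does not use mrjob; `run_wordcount` and the sample lines are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

def mapper(_, line):
    # Same logic as MRWordCounter.mapper: emit (word, 1) for every word.
    for word in line.split():
        yield word, 1

def reducer(word, occurrences):
    # Same logic as MRWordCounter.reducer: sum the counts for one word.
    yield word, sum(occurrences)

def run_wordcount(lines):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, count in mapper(offset, line):
            groups[word].append(count)
    results = {}
    for word in sorted(groups):
        for key, total in reducer(word, groups[word]):
            results[key] = total
    return results

counts = run_wordcount(["hadoop on emr", "emr runs hadoop jobs"])
print(counts)
```

On a real cluster the grouping step is done by Hadoop across machines; only the mapper and reducer bodies are the job author's responsibility.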

Page 24: Introduction To Elastic MapReduce at WHUG

mrjob - run on EMR

$ python wordcount.py -r emr --ec2-instance-type c1.medium --num-ec2-instances 10 's3://input-bucket/*.txt' > output

Page 25: Introduction To Elastic MapReduce at WHUG

Demo

Page 26: Introduction To Elastic MapReduce at WHUG

Million Song Dataset

● Contains detailed acoustic and contextual data for one million popular songs
● ~300 GB of data
● Publicly available
  ○ for download: http://www.infochimps.com/collections/million-songs
  ○ for processing using EMR: http://tbmmsd.s3.amazonaws.com/

Page 27: Introduction To Elastic MapReduce at WHUG

Million Song Dataset

Contains data such as:
● Song's title, year and hotness
● Song's tempo, duration, danceability, energy, loudness, segments count, preview (URL to mp3 file) and so on
● Artist's name and hotness

Page 28: Introduction To Elastic MapReduce at WHUG

Million Song Dataset - Song's density

Song's density* can be defined as the average number of notes or atomic sounds (called segments) per second in a song.

density = segmentCnt / duration

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
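The formula can be written as a small function. This is only a sketch: the function name and the sample numbers are made up for illustration, and the guard against non-positive durations is an added assumption.

```python
def song_density(segment_count, duration_seconds):
    """Average number of segments per second of audio."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return segment_count / duration_seconds

# Illustrative values only: 900 segments in a 240-second song.
density = song_density(900, 240.0)
print(density)  # 3.75
```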

Page 29: Introduction To Elastic MapReduce at WHUG

Million Song Dataset - Task*

Simple music recommendation system
● Calculate density for each song
● Find hot songs with similar density

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
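One way to sketch the second step on a single machine: keep the hot songs sorted by density and use binary search to pull out those within a tolerance of a query density. The hotness threshold, the tolerance, the tuple layout and the sample songs are all hypothetical; the talk itself solves the task with MapReduce.

```python
from bisect import bisect_left, bisect_right

def similar_hot_songs(songs, query_density, tolerance=0.25, min_hotness=0.5):
    """Return titles of hot songs whose density is within `tolerance` of the query.

    `songs` is a list of (density, hotness, title) tuples.
    """
    # Keep only hot songs and sort them by density.
    hot = sorted((d, t) for d, h, t in songs if h >= min_hotness)
    densities = [d for d, _ in hot]
    # Binary search for the window [query - tolerance, query + tolerance].
    lo = bisect_left(densities, query_density - tolerance)
    hi = bisect_right(densities, query_density + tolerance)
    return [title for _, title in hot[lo:hi]]

# Made-up sample data: (density, hotness, title).
sample = [
    (3.75, 0.9, "Song A"),
    (3.60, 0.8, "Song B"),
    (5.10, 0.7, "Song C"),
    (3.70, 0.1, "Song D"),  # similar density but not hot
]
matches = similar_hot_songs(sample, query_density=3.7)
print(matches)
```

Sorting by density is also what makes the MapReduce version work: once density is the key, songs with similar density end up next to each other in the output.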

Page 30: Introduction To Elastic MapReduce at WHUG

Million Song Dataset - MapReduce

Input data
● 339 files
● Each file contains ~3,000 songs
● Each song is represented by one line in the input file
● Fields are separated by a tab character

Page 31: Introduction To Elastic MapReduce at WHUG

Million Song Dataset - MapReduce

Mapper
● Reads song's data from each line of input text
● Calculates song's density
● Emits song's density as key with some other details as value

<line_offset, song_data> -> <density, (artist_name, song_title, song_url)>

Page 32: Introduction To Elastic MapReduce at WHUG

public void map(LongWritable key, Text value,
        OutputCollector<FloatWritable, TripleTextWritable> output,
        Reporter reporter) throws IOException {

    song.parseLine(value.toString());
    if (song.tempo > 0 && song.duration > 0) {
        // calculate density
        float density = ((float) song.segmentCnt) / song.duration;

        denstyWritable.set(density);
        songWritable.set(song.artistName, song.title, song.preview);

        output.collect(denstyWritable, songWritable);
    }
}

Page 33: Introduction To Elastic MapReduce at WHUG

Million Song Dataset - MapReduce

Reducer
● Identity Reducer
● Each Reducer gets density values from a different range: [i, i+1)*,**

<density, [(artist_name, song_title, song_url)]> -> <density, (artist_name, song_title, song_url)>

* thanks to a custom Partitioner
** not optimal partitioning (partitions are not balanced)
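The range partitioning described on this slide can be sketched as follows. The talk's actual Partitioner is not shown, so this is only an assumption about its behaviour: a density in the half-open range from i to i+1 goes to reducer i, and anything beyond the last reducer is clamped into it. The function name and sample densities are made up.

```python
from collections import Counter

def density_partition(density, num_reducers):
    """Map a density in [i, i+1) to reducer i, clamped to the valid range."""
    index = int(density)  # floor for non-negative values
    return max(0, min(index, num_reducers - 1))

# Made-up densities, clustered to show why such partitions end up unbalanced.
densities = [0.4, 2.1, 2.5, 2.9, 3.2, 3.8, 3.9, 7.6, 12.0]
load = Counter(density_partition(d, num_reducers=8) for d in densities)
print(sorted(load.items()))
```

With real data most songs fall into a few neighbouring density ranges, so a handful of reducers receive most of the records while others sit idle, which is the imbalance the footnote refers to.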

Page 34: Introduction To Elastic MapReduce at WHUG

Demo - used software

● Karmasphere Studio for EMR (Eclipse plugin)
  ○ graphical environment that supports the complete lifecycle for developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop Jobs (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)

Page 35: Introduction To Elastic MapReduce at WHUG

Demo - used software

● Karmasphere Studio for EMR (Eclipse plugin)

images from: http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html

Page 36: Introduction To Elastic MapReduce at WHUG

Video

Page 37: Introduction To Elastic MapReduce at WHUG

Please watch the video on the WHUG channel on YouTube: http://www.youtube.com/watch?v=Azwilbn8GCs

Page 38: Introduction To Elastic MapReduce at WHUG

Thank you!

Page 39: Introduction To Elastic MapReduce at WHUG

Join us! whug.org