Possible real-world situation
● We have big data and/or very long, embarrassingly parallel computations
● Our data may grow fast
● We want to start and try Hadoop asap
● We do not have our own infrastructure
● We do not have Hadoop administrators
● We have limited funds
Possible solution
Amazon Elastic MapReduce (EMR)
● Hadoop framework running on the web-scale infrastructure of Amazon
EMR Benefits
Elastic (scalable)
● Use one, a hundred, or even thousands of instances to process even petabytes of data
● Modify the number of instances while the job flow is running
● Start computation within minutes
EMR Benefits
Easy to use
● No configuration necessary
○ Do not worry about setting up hardware and networking, or running, managing and tuning the performance of a Hadoop cluster
● Easy-to-use tools and plugins available
○ AWS Web Management Console
○ Command Line Tools by Amazon
○ Amazon EMR API, SDK, Libraries
○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio for EMR)
EMR Benefits
Reliable
● Built on Amazon's highly available and battle-tested infrastructure
● Provisions new nodes to replace those that fail
● Used by e.g.:
EMR Benefits
Cost effective
● Pay for what you use (for each started hour)
● Choose from various instance types that meet your requirements
● Possibility to reserve instances for 1 or 3 years to pay less per hour
EMR Overview
Amazon Elastic MapReduce (Amazon EMR) works in conjunction with
● Amazon EC2 to rent computing instances (with Hadoop installed)
● Amazon S3 to store input and output data, scripts/applications and logs
EMR Architectural Overview
* image from the Internet
EC2 Instance Types
* image from Big Data University, Course: "Hadoop and the Amazon Cloud"
EMR Pricing - "On-demand" instances
Standard Family Instances (US East Region) http://aws.amazon.com/elasticmapreduce/pricing/
EC2 & S3 Pricing - Real-world example
The New York Times wanted to host all public-domain articles from 1851 to 1922:
● 11 million articles
● 4 TB of raw TIFF image input data converted to 1.5 TB of PDF documents
● 100 EC2 instances rented
● < 24 hours of computation
● $240 paid (not including storage & bandwidth)
● 1 employee assigned to this task
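A quick sanity check of these numbers: the stated total implies an on-demand rate of roughly $0.10 per instance-hour (the rate here is inferred from the slide, not an official price):

```python
# Back-of-envelope check of the New York Times example.
# The hourly rate is inferred from the $240 total, not from an AWS price list.
instances = 100
hours = 24
rate_per_instance_hour = 0.10  # USD, implied by the stated total

total = instances * hours * rate_per_instance_hour
print(round(total, 2))  # matches the $240 quoted on the slide
```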
EC2 & S3 Pricing - Real-world example
How much did they pay for storage and bandwidth?
EC2 & S3 Pricing Calculator
Simple Monthly Calculator: http://calculator.s3.amazonaws.com/calc5.html
AWS Free Usage Tier (Per Month)
Available for free to new AWS customers for 12 months following the AWS sign-up date, e.g.:
● 750 hours of Amazon EC2 Micro Instance usage
○ 613 MB of memory and 32-bit or 64-bit platform
● 5 GB of Amazon S3 standard storage, 20,000 Get and 2,000 Put Requests
● 15 GB of bandwidth out aggregated across all AWS services
http://aws.amazon.com/free/
EMR - Support for Hadoop Ecosystem
Develop and run MapReduce applications using:
● Java
● Streaming (e.g. Ruby, Perl, Python, PHP, R, or C++)
● Pig
● Hive
● HBase can be easily installed using a set of EC2 scripts
EMR - Featured Users * logos from http://aws.amazon.com/elasticmapreduce/
EMR - Case Study - Yelp
● helps people connect with great local businesses
● shares reviews and insights
● as of November 2010:
○ 39 million monthly unique visitors
○ in total, 14 million reviews posted
EMR - Case Study - Yelp
● uses S3 to store daily logs (~100 GB/day) and photos
● uses EMR to power features like
○ People who viewed this also viewed
○ Review highlights
○ Autocomplete in search box
○ Top searches
● implements jobs in Python and uses their own open-source library, mrjob, to run them on EMR
mrjob - WordCount example
from mrjob.job import MRJob
class MRWordCounter(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
mrjob - run on EMR
$ python wordcount.py --ec2-instance-type c1.medium --num-ec2-instances 10 -r emr 's3://input-bucket/*.txt' > output
Demo
Million Song Dataset
● Contains detailed acoustic and contextual data for one million popular songs
● ~300 GB of data
● Publicly available
○ for download: http://www.infochimps.com/collections/million-songs
○ for processing using EMR: http://tbmmsd.s3.amazonaws.com/
Million Song Dataset
Contains data such as:
● Song's title, year and hotness
● Song's tempo, duration, danceability, energy, loudness, segment count, preview (URL to an mp3 file) and so on
● Artist's name and hotness
Million Song Dataset - Song's density
A song's density* can be defined as the average number of notes or atomic sounds (called segments) per second in the song.

density = segmentCnt / duration

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
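The formula above is easy to express directly in code; a minimal Python sketch (the function and argument names are illustrative):

```python
def density(segment_count, duration_seconds):
    """Average number of segments (atomic sounds) per second of a song."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return segment_count / duration_seconds

# e.g. a 200-second song with 800 segments has 4 segments per second
print(density(800, 200))
```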
Million Song Dataset - Task*
Simple music recommendation system
● Calculate density for each song
● Find hot songs with similar density

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
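Stripped of the MapReduce plumbing, the recommendation idea can be sketched in plain Python: given a seed song's density, rank other hot songs by how close their density is. All data and the hotness threshold below are made up for illustration:

```python
def similar_by_density(seed_density, songs, k=3):
    """Return the k hot songs whose density is closest to seed_density.

    songs: list of (artist, title, density, hotness) tuples.
    The 0.5 hotness cutoff is an arbitrary illustrative threshold.
    """
    hot = [s for s in songs if s[3] > 0.5]             # keep only "hot" songs
    hot.sort(key=lambda s: abs(s[2] - seed_density))   # closest density first
    return hot[:k]

# Invented example catalog: (artist, title, density, hotness)
catalog = [
    ("Artist A", "Song A", 3.9, 0.8),
    ("Artist B", "Song B", 1.2, 0.9),
    ("Artist C", "Song C", 4.1, 0.2),   # close density, but not hot
    ("Artist D", "Song D", 4.2, 0.7),
]
print(similar_by_density(4.0, catalog, k=2))
```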
Million Song Dataset - MapReduce
Input data
● 339 files
● Each file contains ~3,000 songs
● Each song is represented by one line in the input file
● Fields are separated by a tab character
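Parsing such a tab-separated line is straightforward; a hypothetical sketch (the real files have many more fields, and the field order here is invented):

```python
# Hypothetical 4-field layout: artist \t title \t duration \t segment_count
line = "Radiohead\tKarma Police\t261.0\t522"

artist, title, duration, segment_cnt = line.rstrip("\n").split("\t")
print(artist, float(segment_cnt) / float(duration))  # artist and density
```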
Million Song Dataset - MapReduce
Mapper
● Reads a song's data from each line of input text
● Calculates the song's density
● Emits the song's density as key with some other details as value

<line_offset, song_data> -> <density, (artist_name, song_title, song_url)>
public void map(LongWritable key, Text value,
        OutputCollector<FloatWritable, TripleTextWritable> output,
        Reporter reporter) throws IOException {
    song.parseLine(value.toString());
    if (song.tempo > 0 && song.duration > 0) {
        // calculate density
        float density = ((float) song.segmentCnt) / song.duration;
        densityWritable.set(density);
        songWritable.set(song.artistName, song.title, song.preview);
        output.collect(densityWritable, songWritable);
    }
}
Million Song Dataset - MapReduce
Reducer
● Identity Reducer
● Each Reducer gets density values from a different range: [i, i+1)*,**

<density, [(artist_name, song_title, song_url)]> -> <density, (artist_name, song_title, song_url)>

* thanks to a custom Partitioner
** not optimal partitioning (partitions are not balanced)
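The idea behind such a custom Partitioner can be sketched as bucketing each density key by its integer part, clamped to the number of reducers. This is an illustrative reconstruction in Python, not the original Java code; it also shows why partitions are not balanced (all densities above the reducer count pile into the last bucket):

```python
def partition(density, num_reducers):
    """Route keys in [i, i+1) to reducer i, so each reducer receives a
    contiguous range of densities. Densities beyond the reducer count are
    clamped into the last reducer, which is why partitions are unbalanced."""
    bucket = int(density)                 # integer part -> range index
    return min(bucket, num_reducers - 1)  # clamp overflow into last reducer

print(partition(3.7, 8))   # densities in [3, 4) go to reducer 3
print(partition(42.0, 8))  # clamped into the last reducer
```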
Demo - used software
● Karmasphere Studio for EMR (Eclipse plugin)
○ a graphical environment that supports the complete lifecycle of developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop jobs (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)
Demo - used software
● Karmasphere Studio for EMR (Eclipse plugin)
images from: http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html
Video
Please watch the video on the WHUG channel on YouTube: http://www.youtube.com/watch?v=Azwilbn8GCs
Thank you!
Join us! whug.org