
Hadoop: The Hadoop Java Software Framework


DESCRIPTION

Storage and computation are getting cheaper AND easily accessible on demand in the cloud. We now collect and store some really large data sets, e.g. user activity logs, genome sequencing, sensor data, etc. Hadoop and the ecosystem of projects built around it present simple and easy-to-use tools for storing and analyzing such large data collections on commodity hardware.

Topics covered:
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce jobs (using Hadoop Streaming).
* Introduce Pig Latin, an easy-to-use data processing language.

Speaker profile: Mahesh Reddy is an entrepreneur, chasing dreams. He works on large-scale crawling and extraction of structured data from the web. He is a graduate from IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as a Research Engineer/Tech Lead on search and advertising products.


Page 1
Page 2

Hadoop: Playing with data, at scale

If you have a lot of data to process… what should you know?

Mahesh Tiyyagura

25th November, Bangalore

Page 3

Mahesh Tiyyagura

Email: [email protected] http://www.twitter.com/tmahesh

Works on large-scale crawling and extraction of structured data from the web. Used Hadoop at Yahoo! to run machine learning algorithms and analyze click logs.

Page 4

Hadoop
•  Massively scalable storage and batch data processing system

•  It's all about scale…
–  Scaling hardware infrastructure (horizontal scaling)
–  Scaling operations and maintenance (handling failures)
–  Scaling developer productivity (keep it simple)

Page 5

Numbers you should know…
•  You can store, say, 10TB of data per node
•  1 disk: 75MB/sec (sequential read)
•  Say you want to process 200GB of data
•  That's ~1 hour just to read the data!! (see the back-of-envelope sketch below)
•  Processing data (CPU) is much, much faster (say, 10x)
•  To remove the bottleneck, we need to read data in parallel
•  Read from 100 disks in parallel: 7.5GB/sec!!
•  Insight: Move computation, NOT data

•  Oh! BTW, data should not (and cannot) reside on only one node
•  In a 1000-node cluster, you can expect ~10 failures per week
•  For peace of mind, reliability should be handled by software

Hadoop is designed to address these issues.
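A quick back-of-envelope check of the numbers above (the 75MB/sec disk throughput and 200GB data size are the figures from the slide; everything else is plain arithmetic):

```python
# Back-of-envelope: why a single disk is the bottleneck.
DISK_MB_PER_SEC = 75   # sequential read throughput of one disk (from the slide)
DATA_GB = 200          # data to process (from the slide)

single_disk_secs = DATA_GB * 1024 / DISK_MB_PER_SEC
print(f"1 disk:    {single_disk_secs / 60:.0f} minutes just to read the data")  # ~46 min, roughly an hour

disks = 100
print(f"{disks} disks: {single_disk_secs / disks:.0f} seconds "
      f"({disks * DISK_MB_PER_SEC / 1024:.1f} GB/sec aggregate)")               # ~27 sec at ~7.3 GB/sec
```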

Page 6

The Platform, in brief…
•  HDFS: Storing data

–  Data split into multiple blocks across nodes
–  Replication protects data from failures
–  A master node orchestrates the read/write requests (without being a bottleneck!!)
–  Scales linearly… 4TB of raw disk translates to ~1TB of storage (tunable; see the capacity sketch under the HDFS slide below)

•  MapReduce (MR): Processing data
–  A beautiful abstraction; asks the user to implement just 2 functions (Map and Reduce)
–  You don't need any knowledge of network IO, node failures, checkpoints, distributed what??
–  Most data processing jobs can be mapped into the MapReduce abstraction
–  Data is processed locally, in parallel. Reliability is implicit.
–  A giant merge-sort infrastructure does the magic

Will revisit this slide. Some things are better understood in retrospect.

Page 7

HDFS
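To ground the platform description above, here is a rough HDFS capacity sketch. The 64MB block size and 3x replication are assumptions (the common defaults in Hadoop clusters of that era, both tunable), not figures from the slides:

```python
# Rough HDFS math, assuming 64MB blocks and 3x replication (both configurable).
BLOCK_MB = 64      # assumed default block size
REPLICATION = 3    # assumed default replication factor

file_gb = 200
blocks = file_gb * 1024 // BLOCK_MB
print(f"A {file_gb}GB file becomes {blocks} blocks, "
      f"each stored on {REPLICATION} different nodes")       # 3200 blocks spread across the cluster

raw_tb = 4
print(f"{raw_tb}TB of raw disk / {REPLICATION}x replication "
      f"= {raw_tb / REPLICATION:.2f}TB usable")              # ~1.33TB; ~1TB after other overhead
```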

Page 8

MR: Programming Model
•  Map function: (key, value) -> (key1, value1) list
•  Reduce function: (key1, value1 list) -> (key1, output)

•  Examples:
–  map(k, v) -> emit(k, v.toUpper())
–  map(k, v) -> foreach c in v; do emit(k, c); done
–  reduce(k, vals) -> foreach v in vals; do sum += v; done; emit(k, sum)
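To make the abstraction concrete, here is a minimal, purely local Python sketch of the same idea. The run_local helper is only an illustration of what the framework does between map and reduce (grouping values by key); it is not part of Hadoop:

```python
from collections import defaultdict

# Map and Reduce as plain functions, mirroring the pseudocode above.
def map_upper(k, v):
    yield k, v.upper()              # map(k, v) -> emit(k, v.toUpper())

def reduce_sum(k, vals):
    yield k, sum(vals)              # reduce(k, vals) -> emit(k, sum)

def run_local(mapper, reducer, records):
    """Toy stand-in for the framework: apply map, group by key, apply reduce."""
    grouped = defaultdict(list)
    for k, v in records:
        for mk, mv in mapper(k, v):
            grouped[mk].append(mv)  # in a real cluster this is the shuffle/sort phase
    for rk, rvals in grouped.items():
        yield from reducer(rk, rvals)

print(list(map_upper("k1", "hello")))                               # [('k1', 'HELLO')]
records = [("a", 1), ("b", 1), ("a", 1)]
print(list(run_local(lambda k, v: [(k, v)], reduce_sum, records)))  # [('a', 2), ('b', 1)]
```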

Page 9

MAPREDUCE

Page 10

Thinking in MapReduce…
•  Word Count example

–  map(docid, text) -> foreach word in text.split(); do emit(word, 1); done
–  reduce(word, counts list) -> foreach count in counts; do sum += count; done; emit(word, sum)

•  Document search index (inverted index)
–  map(docid, html) -> foreach term in getTerms(html); do emit(term, docid); done
–  reduce(term, docid list) -> emit(term, docid list)
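As a small, local illustration of the inverted-index job, here is a Python sketch; get_terms is a stand-in tokenizer (not something Hadoop provides), and the dictionary plays the role of the framework's shuffle:

```python
from collections import defaultdict

def get_terms(html):
    # Stand-in tokenizer: a real job would strip markup and normalize terms.
    return html.lower().split()

def map_index(docid, html):
    for term in get_terms(html):
        yield term, docid                  # emit(term, docid)

def reduce_index(term, docids):
    yield term, sorted(set(docids))        # emit(term, docid list)

docs = {"doc1": "hadoop stores data", "doc2": "hadoop processes data"}
postings = defaultdict(list)
for docid, html in docs.items():
    for term, d in map_index(docid, html):
        postings[term].append(d)           # the framework's shuffle groups these by term
index = dict(pair for term, ds in postings.items() for pair in reduce_index(term, ds))
print(index["hadoop"])                     # ['doc1', 'doc2']
print(index["stores"])                     # ['doc1']
```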

Page 11

Thinking in MapReduce…

•  All the anchor text to a page
–  map(docid, html) -> foreach link in getLinks(html); do emit(link, anchorText); done
–  reduce(link, anchorText list) -> emit(link, anchorText list)

•  Image resize
–  map(imgid, image) -> emit(imgid, image.resize())
–  No need for reduce

Page 12

Hadoop Streaming Demo
•  Conceptually: cat someInputFile | shellMapper.sh | sort | shellReducer.sh > someOutputFile

•  Each line in the input file is written to stdin of shellMapper.sh
•  Each line on stdout of shellMapper.sh is split into key and value (before the first tab is the key, the rest is the value)
•  The key/value pairs, sorted by key, are fed line by line into stdin of shellReducer.sh
•  Each line on stdout of shellReducer.sh is written to the output file
(A Python mapper/reducer sketch following this protocol appears after the demo note below.)

•  hadoop jar hadoop-streaming.jar \
     -input myInputDirs \
     -output myOutputDir \
     -mapper /bin/cat \
     -reducer /bin/wc

•  More Info: http://hadoop.apache.org/common/docs/r0.18.2/streaming.html

•  Will share the terminal session now… for DEMO
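For the word-count job from the earlier slide, the two scripts might look like the following Python sketch. The file names wordcount_mapper.py and wordcount_reducer.py are illustrative; they follow the tab-separated stdin/stdout protocol described above and would be passed to the streaming jar with -mapper and -reducer (and shipped to the cluster with -file) in place of /bin/cat and /bin/wc:

```python
#!/usr/bin/env python
# wordcount_mapper.py -- reads raw text lines on stdin, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# wordcount_reducer.py -- stdin arrives sorted by key, so counts for a word are contiguous.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")   # flush the previous word's total
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")           # flush the last word
```

A quick local sanity check mirrors the pipe analogy above: cat someInputFile | python wordcount_mapper.py | sort | python wordcount_reducer.py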

Page 13

Brief intro to PIG
•  Ad-hoc data analysis
•  An abstraction over MapReduce
•  Think of it as a stdlib for MapReduce
•  Supports common data processing operators (join, group by)
•  A high-level language for data processing

PIG demo – switch to terminal

•  Also try… HIVE, which exposes a SQL-like interface over HDFS data

HIVE demo – switch to terminal

Page 14