
Hadoop: The Hadoop Java Software Framework


DESCRIPTION

Storage and computation are getting cheaper AND easily accessible on demand in the cloud. We now collect and store some really large data sets, e.g. user activity logs, genome sequencing, sensor data, etc. Hadoop and the ecosystem of projects built around it present simple and easy-to-use tools for storing and analyzing such large data collections on commodity hardware.

Topics covered:
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce jobs (using Hadoop Streaming).
* Introduce Pig Latin, an easy-to-use data processing language.

Speaker profile: Mahesh Reddy is an entrepreneur, chasing dreams. He works on large-scale crawling and extraction of structured data from the web. He is a graduate from IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as a Research Engineer/Tech Lead on search and advertising products.


Page 1
Page 2

Hadoop: Playing with data, at scale

If you have a lot of data to process… what should you know?

Mahesh Tiyyagura

25th November, Bangalore

Page 3

Mahesh Tiyyagura

Email: [email protected] http://www.twitter.com/tmahesh

Works on large-scale crawling and extraction of structured data from the web. Used Hadoop at Yahoo! to run machine learning algorithms and analyze click logs.

Page 4

Hadoop
•  Massively scalable storage and batch data processing system

•  It's all about scale…
–  Scaling hardware infrastructure (horizontal scaling)
–  Scaling operations and maintenance (handling failures)
–  Scaling developer productivity (keep it simple)

Page 5

Numbers you should know…
•  You can store, say, 10TB of data per node
•  1 disk: 75MB/sec (sequential read)
•  Say you want to process 200GB of data
•  That's ~1 hour just to read the data!! (see the back-of-envelope sketch below)
•  Processing data (CPU) is much, much faster (say, 10x)
•  To remove the bottleneck, we need to read data in parallel
•  Read from 100 disks in parallel: 7.5GB/sec!!
•  Insight: Move computation, NOT data

•  Oh! BTW, data should not (and cannot) reside on only one node
•  In a 1000-node cluster, you can expect ~10 failures per week
•  For peace of mind, reliability should be handled by software

Hadoop is designed to address these issues.
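A quick back-of-envelope check of the numbers above (the 75MB/sec disk throughput and 200GB data size are the figures from the slide; everything else is plain arithmetic):

```python
# Back-of-envelope: why a single disk is the bottleneck.
DISK_MB_PER_SEC = 75   # sequential read throughput of one disk (from the slide)
DATA_GB = 200          # data to process (from the slide)

single_disk_secs = DATA_GB * 1024 / DISK_MB_PER_SEC
print(f"1 disk:    {single_disk_secs / 60:.0f} minutes just to read the data")  # ~46 min, roughly an hour

disks = 100
print(f"{disks} disks: {single_disk_secs / disks:.0f} seconds "
      f"({disks * DISK_MB_PER_SEC / 1024:.1f} GB/sec aggregate)")               # ~27 sec at ~7.3 GB/sec
```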

Page 6

The Platform, in brief…
•  HDFS: Storing data

–  Data split into multiple blocks across nodes
–  Replication protects data from failures
–  A master node orchestrates the read/write requests (without being a bottleneck!!)
–  Scales linearly… 4TB of raw disk translates to ~1TB of storage (tunable; see the capacity sketch under the HDFS slide below)

•  MapReduce (MR): Processing data
–  A beautiful abstraction; asks the user to implement just 2 functions (Map and Reduce)
–  You don't need any knowledge of network IO, node failures, checkpoints, distributed what??
–  Most data processing jobs can be mapped into the MapReduce abstraction
–  Data is processed locally, in parallel. Reliability is implicit.
–  A giant merge-sort infrastructure does the magic

Will revisit this slide. Some things are better understood in retrospect.

Page 7

HDFS
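To ground the platform description above, here is a rough HDFS capacity sketch. The 64MB block size and 3x replication are assumptions (the common defaults in Hadoop clusters of that era, both tunable), not figures from the slides:

```python
# Rough HDFS math, assuming 64MB blocks and 3x replication (both configurable).
BLOCK_MB = 64      # assumed default block size
REPLICATION = 3    # assumed default replication factor

file_gb = 200
blocks = file_gb * 1024 // BLOCK_MB
print(f"A {file_gb}GB file becomes {blocks} blocks, "
      f"each stored on {REPLICATION} different nodes")       # 3200 blocks spread across the cluster

raw_tb = 4
print(f"{raw_tb}TB of raw disk / {REPLICATION}x replication "
      f"= {raw_tb / REPLICATION:.2f}TB usable")              # ~1.33TB; ~1TB after other overhead
```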

Page 8

MR: Programming Model
•  Map function: (key, value) -> (key1, value1) list
•  Reduce function: (key1, value1 list) -> (key1, output)

•  Examples:
–  map(k, v) -> emit(k, v.toUpper())
–  map(k, v) -> foreach c in v; do emit(k, c); done
–  reduce(k, vals) -> foreach v in vals; do sum += v; done; emit(k, sum)
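To make the abstraction concrete, here is a minimal, purely local Python sketch of the same idea. The run_local helper is only an illustration of what the framework does between map and reduce (grouping values by key); it is not part of Hadoop:

```python
from collections import defaultdict

# Map and Reduce as plain functions, mirroring the pseudocode above.
def map_upper(k, v):
    yield k, v.upper()              # map(k, v) -> emit(k, v.toUpper())

def reduce_sum(k, vals):
    yield k, sum(vals)              # reduce(k, vals) -> emit(k, sum)

def run_local(mapper, reducer, records):
    """Toy stand-in for the framework: apply map, group by key, apply reduce."""
    grouped = defaultdict(list)
    for k, v in records:
        for mk, mv in mapper(k, v):
            grouped[mk].append(mv)  # in a real cluster this is the shuffle/sort phase
    for rk, rvals in grouped.items():
        yield from reducer(rk, rvals)

print(list(map_upper("k1", "hello")))                               # [('k1', 'HELLO')]
records = [("a", 1), ("b", 1), ("a", 1)]
print(list(run_local(lambda k, v: [(k, v)], reduce_sum, records)))  # [('a', 2), ('b', 1)]
```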

Page 9

MAPREDUCE

Page 10

Thinking in MapReduce…
•  Word Count example

–  map(docid, text) -> foreach word in text.split(); do emit(word, 1); done
–  reduce(word, counts list) -> foreach count in counts; do sum += count; done; emit(word, sum)

•  Document search index (inverted index)
–  map(docid, html) -> foreach term in getTerms(html); do emit(term, docid); done
–  reduce(term, docid list) -> emit(term, docid list)
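As a small, local illustration of the inverted-index job, here is a Python sketch; get_terms is a stand-in tokenizer (not something Hadoop provides), and the dictionary plays the role of the framework's shuffle:

```python
from collections import defaultdict

def get_terms(html):
    # Stand-in tokenizer: a real job would strip markup and normalize terms.
    return html.lower().split()

def map_index(docid, html):
    for term in get_terms(html):
        yield term, docid                  # emit(term, docid)

def reduce_index(term, docids):
    yield term, sorted(set(docids))        # emit(term, docid list)

docs = {"doc1": "hadoop stores data", "doc2": "hadoop processes data"}
postings = defaultdict(list)
for docid, html in docs.items():
    for term, d in map_index(docid, html):
        postings[term].append(d)           # the framework's shuffle groups these by term
index = dict(pair for term, ds in postings.items() for pair in reduce_index(term, ds))
print(index["hadoop"])                     # ['doc1', 'doc2']
print(index["stores"])                     # ['doc1']
```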

Page 11

Thinking in MapReduce…

•  All the anchor text to a page
–  map(docid, html) -> foreach link in getLinks(html); do emit(link, anchorText); done
–  reduce(link, anchorText list) -> emit(link, anchorText list)

•  Image resize
–  map(imgid, image) -> emit(imgid, image.resize())
–  No need for reduce

Page 12

Hadoop Streaming Demo
•  Conceptually: cat someInputFile | shellMapper.sh | sort | shellReducer.sh > someOutputFile

•  Each line in the input file is written to stdin of shellMapper.sh
•  Each line on stdout of shellMapper.sh is split into key and value (before the first tab is the key, the rest is the value)
•  The key/value pairs, sorted by key, are fed line by line into stdin of shellReducer.sh
•  Each line on stdout of shellReducer.sh is written to the output file
(A Python mapper/reducer sketch following this protocol appears after the demo note below.)

•  hadoop jar hadoop-streaming.jar \
     -input myInputDirs \
     -output myOutputDir \
     -mapper /bin/cat \
     -reducer /bin/wc

•  More Info: http://hadoop.apache.org/common/docs/r0.18.2/streaming.html

•  Will share the terminal session now… for DEMO
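For the word-count job from the earlier slide, the two scripts might look like the following Python sketch. The file names wordcount_mapper.py and wordcount_reducer.py are illustrative; they follow the tab-separated stdin/stdout protocol described above and would be passed to the streaming jar with -mapper and -reducer (and shipped to the cluster with -file) in place of /bin/cat and /bin/wc:

```python
#!/usr/bin/env python
# wordcount_mapper.py -- reads raw text lines on stdin, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# wordcount_reducer.py -- stdin arrives sorted by key, so counts for a word are contiguous.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")   # flush the previous word's total
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")           # flush the last word
```

A quick local sanity check mirrors the pipe analogy above: cat someInputFile | python wordcount_mapper.py | sort | python wordcount_reducer.py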

Page 13

Brief intro to PIG
•  Ad-hoc data analysis
•  An abstraction over MapReduce
•  Think of it as a stdlib for MapReduce
•  Supports common data processing operators (join, group by)
•  A high-level language for data processing

PIG demo – switch to terminal

•  Also try… HIVE, which exposes a SQL-like interface over HDFS data

HIVE demo – switch to terminal

Page 14