
An introduction to Hadoop for large scale data analysis



Hadoop – Large scale data analysis

Abhijit Sharma


Big Data Trends

Unprecedented growth in
◦ Data set size – Facebook 21+ PB data warehouse, 12+ TB/day
◦ Un(semi)-structured data – logs, documents, graphs
◦ Connected data – web, tags, graphs

Relevant to enterprises – logs, social media, machine-generated data, breaking of silos

Putting Big Data to work

Data-driven org – decision support, new offerings
◦ Analytics on large data sets (FB Insights – Page, App, etc. stats)
◦ Data mining – clustering, e.g. Google News articles
◦ Search – Google

Problem characteristics and examples

Embarrassingly data-parallel problems
◦ Data chunked & distributed across the cluster
◦ Parallel processing with data locality – tasks dispatched to where the data is
◦ Horizontal/linear scaling using commodity hardware
◦ Write once, read many
◦ Examples: distributed logs – grep, # of accesses per URL; search – term vector generation, reverse links

What is Hadoop?

Open-source system for large-scale batch distributed computing on big data
◦ Map Reduce programming paradigm & framework
◦ Map Reduce infrastructure
◦ Distributed File System (HDFS)

Endorsed/used extensively by web giants – Google (whose MapReduce/GFS papers inspired it), Facebook, Yahoo!

Map Reduce - Definition

Map Reduce is a programming model, and an associated implementation, for parallel processing of large data sets

Map processes each logical record in an input split to generate a set of intermediate key/value pairs

Reduce merges all intermediate values associated with the same intermediate key

Map Reduce - Functional Programming Origins

Map: apply a function to each list member – parallelizable

[1, 2, 3].collect { it * it }
// [1, 2, 3] -> Map (square) -> [1, 4, 9]

Reduce: apply a function with an accumulator across the list members

[1, 2, 3].inject(0) { sum, item -> sum + item }
// [1, 2, 3] -> Reduce (sum) -> 6

Map & Reduce chained

[1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
// [1, 2, 3] -> Map (square) -> [1, 4, 9] -> Reduce (sum) -> 14
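For comparison, the same square-then-sum pipeline in Java streams – a minimal illustration added here, not from the original deck:

    import java.util.List;

    public class MapReduceOrigins {
        public static void main(String[] args) {
            // Map (square) each element, then Reduce (sum) with accumulator 0
            int result = List.of(1, 2, 3).stream()
                    .map(x -> x * x)          // [1, 2, 3] -> [1, 4, 9]
                    .reduce(0, Integer::sum); // [1, 4, 9] -> 14
            System.out.println(result);       // prints 14
        }
    }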


Word Count - Shell

cat * | grep | sort           | uniq -c
input | map  | shuffle & sort | reduce

Word Count - Map Reduce

Word Count - Pseudo code

mapper(filename, file-contents):
  for each word in file-contents:
    emit(word, 1)  // one count per occurrence, e.g. ("the", 1) for each occurrence of "the"

reducer(word, Iterator values):  // values iterates over the counts for a word, e.g. ("the", [1, 1, ...])
  sum = 0
  for each value in values:
    sum = sum + value
  emit(word, sum)
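The same logic against Hadoop's Java MapReduce API – a minimal sketch, assuming the newer org.apache.hadoop.mapreduce classes; the class names (WordCountMapper, WordCountReducer) are illustrative, not from the deck:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits (word, 1) for every word in its input split
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // e.g. ("the", 1) per occurrence
                }
            }
        }
    }

    // Reducer: sums all intermediate counts for the same word
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // e.g. ("the", [1, 1, ...]) -> ("the", n)
            }
            context.write(key, new IntWritable(sum));
        }
    }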


Examples – Map Reduce Definition

Word count / distributed log search for # of accesses to various URLs
◦ Map – emits (word/URL, 1) for each doc/log split
◦ Reduce – sums up the counts for a specific word/URL

Term vector generation – term -> [doc-id]
◦ Map – emits (term, doc-id) for each doc split
◦ Reduce – identity reducer – accumulates the (term, [doc-id, doc-id, ...])

Reverse links – inverting source -> target to target -> source (a sketch follows this list)
◦ Map – emits (target, source) for each doc split
◦ Reduce – identity reducer – accumulates the (target, [source, source, ...])
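Only the map side does real work in the reverse-links job. A hedged sketch, assuming an input format such as KeyValueTextInputFormat that delivers (source, target) link pairs; the class name is hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Inverts each link: (source, target) in, (target, source) out, so the
    // identity-style reducer accumulates (target, [source, source, ...])
    public class ReverseLinkMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text source, Text target, Context context)
                throws IOException, InterruptedException {
            context.write(target, source);
        }
    }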


Map Reduce – Hadoop Implementation

Hides the complexity of distributed computing
◦ Automatic parallelization of jobs
◦ Automatic data chunking & distribution (via HDFS)
◦ Data locality – MR tasks dispatched to where the data is
◦ Fault tolerance against server, storage, and network failures
◦ Network and disk transfer optimization
◦ Load balancing

Hadoop Map Reduce Architecture

HDFS Characteristics

Very large files – block size 64 MB/128 MB
Data access pattern – write once, read many (see the client sketch after this list)
Writes are large, create & append only
Reads are large & streaming
Commodity hardware
Tolerant to failure – server, storage, network
Highly available through transparent replication
Throughput is more important than latency
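The write-once/streaming-read pattern maps directly onto the HDFS client API. A minimal sketch, assuming a running cluster whose settings are on the classpath; the path is hypothetical:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/example.txt"); // hypothetical path

            // Write once: files are created (or appended to); no random writes
            try (FSDataOutputStream out = fs.create(path)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read many: large, streaming, sequential reads
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                }
            }
        }
    }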


HDFS Architecture


Thanks


Backup Slides

Map & Reduce Functions

Job Configuration
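A minimal sketch of a typical driver with the org.apache.hadoop.mapreduce API, wiring up the word-count classes from the earlier sketch (class names assumed; the original slide's exact code is not preserved in this transcript):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // sums are associative, so the reducer doubles as a combiner
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }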

Hadoop Map Reduce Components

Job Tracker – tracks MR jobs; runs on the master node

Task Tracker
◦ Runs on data nodes and tracks the Mapper and Reducer tasks assigned to its node
◦ Sends heartbeats to the Job Tracker
◦ Maintains a queue of tasks and picks up tasks from it

HDFS

Name Node
◦ Manages the file system namespace and regulates access to files by clients – stores the metadata
◦ Maintains the mapping of blocks to Data Nodes and replicas; manages replication (queried in the sketch after this list)
◦ Executes file system namespace operations such as opening, closing, and renaming files and directories

Data Node
◦ One per node; manages the local storage attached to that node
◦ Internally, a file is split into one or more blocks, and these blocks are stored in a set of Data Nodes
◦ Responsible for serving read and write requests from the file system's clients; also performs block creation, deletion, and replication upon instruction from the Name Node
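The Name Node's block-to-Data-Node mapping is visible through the client API. A small sketch, assuming a file already stored in HDFS (the path comes from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists each block of a file and the Data Nodes holding its replicas
    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }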
