Hadoop – Large scale data analysis
Abhijit Sharma
Page 1 | 04/10/2023
Big Data Trends

Unprecedented growth in:
◦ Data set size – Facebook 21+ PB data warehouse, 12+ TB/day
◦ Un(semi)-structured data – logs, documents, graphs
◦ Connected data – web, tags, graphs
Relevant to enterprises – logs, social media, machine-generated data, breaking down of data silos
Putting Big Data to work
Data-driven organization – decision support, new offerings
◦ Analytics on large data sets (e.g. FB Insights – Page and App stats)
◦ Data mining – clustering, e.g. Google News article clustering
◦ Search – Google
Problem characteristics and examples

Embarrassingly data-parallel problems
◦ Data chunked & distributed across the cluster
◦ Parallel processing with data locality – tasks dispatched where the data is
◦ Horizontal/linear scaling using commodity hardware
◦ Write once, read many
◦ Examples: distributed logs – grep, # of accesses per URL; search – term vector generation, reverse links
What is Hadoop?

Open source system for large-scale batch distributed computing on big data
◦ Map Reduce programming paradigm & framework
◦ Map Reduce infrastructure
◦ Distributed file system (HDFS)
Based on Google's MapReduce and GFS papers; used extensively by web giants such as Yahoo! and Facebook
Map Reduce - Definition

Map Reduce is a programming model, and an implementation of it, for parallel processing of large data sets.
◦ Map processes each logical record of an input split to generate a set of intermediate key/value pairs
◦ Reduce merges all intermediate values associated with the same intermediate key
Map Reduce - Functional Programming Origins

Map: apply a function to each list member – parallelizable

  [1, 2, 3].collect { it * it }
  Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9]

Reduce: apply a function with an accumulator across the list members

  [1, 2, 3].inject(0) { sum, item -> sum + item }
  Output: [1, 2, 3] -> Reduce (Sum) -> 6

Map & Reduce composed:

  [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
  Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) -> 14
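The same composition can be written with Python's built-in `map` and `functools.reduce` (a sketch added for comparison; not from the original deck):

```python
from functools import reduce

nums = [1, 2, 3]

# Map: apply the square function to every element (parallelizable in principle,
# since each element is processed independently)
squared = list(map(lambda x: x * x, nums))          # [1, 4, 9]

# Reduce: fold the list into a single value, starting the accumulator at 0
total = reduce(lambda acc, x: acc + x, squared, 0)  # 14

print(squared, total)
```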
Word Count - Shell

cat * | grep -oE '[a-z]+' | sort | uniq -c
input | map               | shuffle & sort | reduce

Word Count - Map Reduce
Word Count - Pseudo code

mapper(filename, file-contents):
  for each word in file-contents:
    emit(word, 1)                  // one count per occurrence, e.g. ("the", 1) for each occurrence of "the"

reducer(word, Iterator values):    // iterator over the counts for a word, e.g. ("the", [1, 1, ...])
  sum = 0
  for each value in values:
    sum = sum + value
  emit(word, sum)
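The pseudocode above can be simulated in a single Python process, including the shuffle step that groups intermediate pairs by key (illustrative only; real Hadoop distributes these phases across a cluster):

```python
from collections import defaultdict

def mapper(filename, contents):
    # Emit (word, 1) for every word occurrence, e.g. ("the", 1)
    for word in contents.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all intermediate values by key, e.g. ("the", [1, 1, ...])
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(word, values):
    # Sum the counts for one word
    return (word, sum(values))

docs = {"doc1": "the quick fox", "doc2": "the lazy dog"}
pairs = [p for name, text in docs.items() for p in mapper(name, text)]
counts = dict(reducer(w, vs) for w, vs in shuffle(pairs))
print(counts)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```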
Examples – Map Reduce Defn

Word count / distributed log search for # of accesses per URL
◦ Map – emits (word/URL, 1) for each doc/log split
◦ Reduce – sums the counts for a specific word/URL
Term vector generation – term -> [doc-id]
◦ Map – emits (term, doc-id) for each doc split
◦ Reduce – identity reducer – accumulates (term, [doc-id, doc-id, ...])
Reverse links – inverts source -> target to target -> source
◦ Map – emits (target, source) for each doc split
◦ Reduce – identity reducer – accumulates (target, [source, source, ...])
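The reverse-links example can be sketched the same way: the map phase inverts each (source, target) edge, and an identity-style reducer simply passes through the grouped sources (hypothetical page names, single process):

```python
from collections import defaultdict

def map_links(source, targets):
    # Invert each edge: emit (target, source)
    for target in targets:
        yield (target, source)

def identity_reducer(key, values):
    # Identity reducer: pass the grouped values through unchanged
    return (key, values)

pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
pairs = [p for src, tgts in pages.items() for p in map_links(src, tgts)]
grouped = defaultdict(list)          # the shuffle & sort step
for tgt, src in pairs:
    grouped[tgt].append(src)
reverse_links = dict(identity_reducer(k, v) for k, v in grouped.items())
print(reverse_links)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```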
Map Reduce – Hadoop Implementation

Hides the complexity of distributed computing:
◦ Automatic parallelization of jobs
◦ Automatic data chunking & distribution (via HDFS)
◦ Data locality – MR tasks dispatched where the data is
◦ Fault tolerance against server, storage and network failures
◦ Network and disk transfer optimization
◦ Load balancing
Hadoop Map Reduce Architecture
HDFS Characteristics

◦ Very large files – block size 64 MB/128 MB
◦ Data access pattern – write once, read many
◦ Writes are large, create & append only
◦ Reads are large & streaming
◦ Commodity hardware
◦ Tolerant to server, storage and network failures
◦ Highly available through transparent replication
◦ Throughput is more important than latency
HDFS Architecture
Thanks
Backup Slides
Map & Reduce Functions
Job Configuration
Hadoop Map Reduce Components

Job Tracker – tracks MR jobs; runs on the master node
Task Tracker
◦ Runs on data nodes and tracks the Mapper and Reducer tasks assigned to the node
◦ Sends heartbeats to the Job Tracker
◦ Maintains a task queue and picks up tasks from it
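The Task Tracker's queue-and-heartbeat behavior described above can be sketched as a toy class (purely illustrative; real trackers heartbeat on a timer and communicate with the Job Tracker over RPC):

```python
from collections import deque

class TaskTracker:
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # tasks assigned to this node

    def assign(self, task):
        # Job Tracker pushes a Mapper/Reducer task onto this node's queue
        self.queue.append(task)

    def heartbeat(self):
        # On each heartbeat the tracker picks up the next queued task
        # and reports its status back
        task = self.queue.popleft() if self.queue else None
        return {"node": self.name, "running": task, "pending": len(self.queue)}

tracker = TaskTracker("node-1")
tracker.assign("map-task-0")
tracker.assign("reduce-task-0")
print(tracker.heartbeat())  # {'node': 'node-1', 'running': 'map-task-0', 'pending': 1}
```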
HDFS

Name Node
◦ Manages the file system namespace and regulates access to files by clients – stores the metadata
◦ Maps blocks to Data Nodes and replicas; manages replication
◦ Executes file system namespace operations such as opening, closing, and renaming files and directories
Data Node
◦ One per node; manages the local storage attached to that node
◦ Internally, a file is split into one or more blocks, and these blocks are stored across a set of Data Nodes
◦ Serves read and write requests from the file system's clients; also performs block creation, deletion, and replication on instruction from the Name Node
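The Name Node's block-to-Data-Node bookkeeping can be illustrated with a toy placement sketch (node names and the round-robin policy are illustrative assumptions; real HDFS placement is rack-aware):

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, as in the characteristics slide
REPLICATION = 3                  # HDFS default replication factor

def split_into_blocks(file_size):
    # A file is split into fixed-size blocks; the last block may be smaller
    return [min(BLOCK_SIZE, file_size - off)
            for off in range(0, file_size, BLOCK_SIZE)]

def place_blocks(blocks, data_nodes):
    # Toy round-robin placement of each block's replicas (not rack-aware)
    nodes = itertools.cycle(data_nodes)
    return [[next(nodes) for _ in range(REPLICATION)] for _ in blocks]

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file -> 3 blocks
print([b // (1024 * 1024) for b in blocks])     # [128, 128, 44]
print(place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"]))
```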