CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu

Word Count over a Given Set of Web Pages

Input pages:
  Page 1: see bob throw
  Page 2: see spot run

Per-page word counts:
  Page 1: see 1, bob 1, throw 1
  Page 2: see 1, spot 1, run 1

Combined word counts:
  bob 1, run 1, see 2, spot 1, throw 1

Can we do word count in parallel?
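
One way to see that the answer is yes: each page can be counted on its own, and the partial counts can then be added together. The sketch below illustrates this with a thread pool; it is not how MapReduce or Hadoop is implemented, and the function names `count_words` and `parallel_word_count` are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(page: str) -> Counter:
    """Count the words of one page, independently of all other pages."""
    return Counter(page.split())


def parallel_word_count(pages: list[str]) -> dict[str, int]:
    """Count each page in its own worker, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_words, pages))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # adds up counts for words seen on several pages
    return dict(sorted(total.items()))


pages = ["see bob throw", "see spot run"]
print(parallel_word_count(pages))
# {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```

The per-page step needs no coordination between workers; only the merge step has to bring counts for the same word together.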

The MapReduce Framework (pioneered by Google)

Automatic Parallel Execution in MapReduce (Google)

Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task so that one slow task does not slow down the whole job
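
The two mechanisms named above, restarting failed tasks and running duplicate copies of a task, can be sketched in a few lines. This is a toy model under stated assumptions: threads stand in for cluster nodes, a `RuntimeError` stands in for a node failure, and the helper names (`run_with_retries`, `run_speculatively`, `flaky_square`) are hypothetical, not part of MapReduce or Hadoop.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_retries(task, arg, max_attempts=3):
    """Re-run a task that raises, as a framework would after a node failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


def run_speculatively(task, arg, copies=2):
    """Start several copies of the same task and use whichever finishes first."""
    pool = ThreadPoolExecutor(max_workers=copies)
    futures = [pool.submit(task, arg) for _ in range(copies)]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower copies
    return next(iter(done)).result()


attempts = {"count": 0}


def flaky_square(x):
    """Hypothetical task that fails on its first attempt only."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated node failure")
    return x * x


print(run_with_retries(flaky_square, 7))        # 49, after one retry
print(run_speculatively(lambda x: x + 1, 41))   # 42
```

Both mechanisms rely on tasks being safe to run more than once: a restarted or duplicated task must produce the same output as the original.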

MapReduce in Hadoop (1)

MapReduce in Hadoop (2)

MapReduce in Hadoop (3)

Data Flow in a MapReduce Program in Hadoop

• InputFormat
• Map function
• Partitioner
• Sorting & Merging
• Combiner
• Shuffling
• Merging
• Reduce function
• OutputFormat

(Figure annotation: 1:many)
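
The stages listed above can be traced in a small single-process simulation of word count. This is a simplified sketch, not Hadoop's implementation: the number of reducers, the CRC-based partitioner, and all function names are assumptions chosen for the example, and the combiner is applied exactly once per map task even though Hadoop may run it zero or more times.

```python
import heapq
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumption for this sketch


def map_fn(line):
    """Map function: emit (word, 1) for every word in one input record."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce function: sum all counts seen for one word."""
    return word, sum(counts)


def partition(key):
    """Partitioner: pick a reducer with a hash that is stable across runs."""
    return zlib.crc32(key.encode()) % NUM_REDUCERS


def combine(sorted_pairs):
    """Combiner: apply reduce_fn locally to a sorted run of (key, value) pairs."""
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(sorted_pairs, key=itemgetter(0))]


def run_map_task(split):
    """Run one map task over one input split (a list of lines)."""
    buckets = defaultdict(list)
    for line in split:                                    # InputFormat: one record per line
        for key, value in map_fn(line):                   # Map function
            buckets[partition(key)].append((key, value))  # Partitioner
    # Sorting, then Combiner, per partition
    return {p: combine(sorted(pairs)) for p, pairs in buckets.items()}


def run_job(splits):
    """Return {reducer index: sorted list of (word, count)} for the whole job."""
    map_outputs = [run_map_task(split) for split in splits]
    job_output = {}
    for p in range(NUM_REDUCERS):
        runs = [out.get(p, []) for out in map_outputs]    # Shuffling: fetch partition p
        merged = heapq.merge(*runs)                       # Merging: runs are already sorted
        job_output[p] = [reduce_fn(key, [v for _, v in group])  # Reduce function
                         for key, group in groupby(merged, key=itemgetter(0))]
    return job_output                                     # OutputFormat: one "file" per reducer


for reducer, pairs in run_job([["see bob throw"], ["see spot run"]]).items():
    print(reducer, pairs)
```

Each map task's output is split across all reducers, and each reducer pulls its partition from every map task, which is why shuffling is the stage that moves data between nodes.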

Lifecycle of a MapReduce Job

[Figure: a program made up of a Map function and a Reduce function. Run this program as a MapReduce job.]

Lifecycle of a MapReduce Job

[Figure: job timeline. Labels: Input Splits, Map Wave 1, Map Wave 2, Reduce Wave 1, Reduce Wave 2. Horizontal axis: Time.]

How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
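
Part of the question can be worked through with simple arithmetic. The sketch below assumes one map task per input split (which is how Hadoop assigns map tasks), a fixed split size, and a fixed number of task slots in the cluster; the function name `plan_job` and the example numbers are hypothetical. Memory allocation is left out because it depends on configuration parameters rather than on a formula.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b, avoiding floating-point rounding."""
    return (a + b - 1) // b


def plan_job(input_bytes, split_bytes, num_reduce_tasks, map_slots, reduce_slots):
    """Estimate task counts and waves for a job under the stated assumptions."""
    num_splits = ceil_div(input_bytes, split_bytes)
    num_map_tasks = num_splits  # one map task per input split
    return {
        "splits": num_splits,
        "map_tasks": num_map_tasks,
        "map_waves": ceil_div(num_map_tasks, map_slots),
        "reduce_tasks": num_reduce_tasks,
        "reduce_waves": ceil_div(num_reduce_tasks, reduce_slots),
    }


MIB = 1024 * 1024
plan = plan_job(
    input_bytes=1024 * MIB,   # 1 GiB of input (example value)
    split_bytes=64 * MIB,     # 64 MiB splits (example value)
    num_reduce_tasks=8,       # chosen by the user (example value)
    map_slots=8,              # cluster-wide map slots (example value)
    reduce_slots=4,           # cluster-wide reduce slots (example value)
)
print(plan)
# {'splits': 16, 'map_tasks': 16, 'map_waves': 2, 'reduce_tasks': 8, 'reduce_waves': 2}
```

With these numbers the 16 map tasks run in two waves of 8, matching the two-wave picture in the timeline.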

Job Configuration Parameters

• 190+ parameters in Hadoop
• Set manually or defaults are used
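
The "set manually or defaults are used" rule amounts to overlaying user-supplied values on a table of defaults. The sketch below models that lookup with plain dictionaries; it does not use Hadoop's configuration API. The four parameter names and default values are recalled from Hadoop 0.20-era default files and should be treated as assumptions to check against the version in use. The function name `effective_config` is hypothetical.

```python
# Assumed defaults (Hadoop 0.20-era); verify against your own installation.
DEFAULTS = {
    "mapred.reduce.tasks": "1",             # number of reduce tasks per job
    "io.sort.mb": "100",                    # map-side sort buffer size, in MB
    "io.sort.factor": "10",                 # number of streams merged at once
    "mapred.child.java.opts": "-Xmx200m",   # JVM options for each task
}


def effective_config(overrides: dict[str, str]) -> dict[str, str]:
    """Return the settings a job would see: overrides first, defaults otherwise."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"parameters not in this sketch's table: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


config = effective_config({"mapred.reduce.tasks": "8", "io.sort.mb": "200"})
for name, value in sorted(config.items()):
    print(f"{name} = {value}")
```

Only the two overridden parameters change; the other two keep their default values.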

How to sort data using Hadoop?
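
The slide poses this as a question and gives no answer. One common approach, sketched here as an assumption rather than as the lecture's answer, relies on the framework sorting keys within each reduce partition: if the partitioner assigns keys by range, concatenating the reducer outputs in order yields a globally sorted result. The boundary values below are hypothetical; in practice they are usually chosen by sampling the input.

```python
from bisect import bisect_right


def range_partition(key, boundaries):
    """Return the index of the key range that contains key."""
    return bisect_right(boundaries, key)


def mapreduce_sort(records, boundaries):
    """Sort records with an identity map, a range partitioner, and an identity reduce."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in records:  # identity map: emit each key unchanged
        partitions[range_partition(key, boundaries)].append(key)
    # Stand-in for the framework's sort of each partition before the reduce step.
    reducer_outputs = [sorted(part) for part in partitions]
    # Identity reduce; every key in partition i precedes every key in partition i + 1.
    return [key for output in reducer_outputs for key in output]


print(mapreduce_sort([5, 1, 9, 3, 7, 2], boundaries=[4, 8]))
# [1, 2, 3, 5, 7, 9]
```

With a hash partitioner each reducer's output would still be sorted, but the outputs could not simply be concatenated, because key ranges would overlap across reducers.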

Let us look at a complete example MapReduce program in Hadoop