
Page 1:

http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics

Scaling Up: Hadoop

Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray

Page 2:

How to handle data that is really large?

Really big, as in...

• Petabytes (PB; about 1,000 terabytes)

• Or beyond: exabytes, zettabytes, etc.

Do we really need to deal with such scale?

• Yes!


Page 3:

“Big Data” is Common...

Google processed 24 PB / day (2009)

Facebook adds 0.5 PB / day to its data warehouses

CERN generated 200 PB of data from “Higgs boson” experiments

Avatar’s 3D effects took 1 PB to store

So, think BIG!


http://www.theregister.co.uk/2012/11/09/facebook_open_sources_corona/
http://thenextweb.com/2010/01/01/avatar-takes-1-petabyte-storage-space-equivalent-32-year-long-mp3/
http://dl.acm.org/citation.cfm?doid=1327452.1327492

Page 4:

How to analyze such large datasets?

First thing, how to store them?

Single machine? A 16 TB SSD has been announced.

Cluster of machines?

• How many machines?

• Need to worry about machine and drive failure. Really?

• Need data backup, redundancy, recovery, etc.


3% of 100,000 hard drives fail within first 3 months

Failure Trends in a Large Disk Drive Population
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
http://arstechnica.com/gadgets/2015/08/samsung-unveils-2-5-inch-16tb-ssd-the-worlds-largest-hard-drive/

Page 5:

How to analyze such large datasets?

How to analyze them?

• What software libraries to use?
• What programming languages to learn?
• Or more generally, what framework to use?


Page 6:

Lecture based on Hadoop: The Definitive Guide

Book covers Hadoop, some Pig, some HBase, and other things.


http://goo.gl/YNCWN

Page 7:

Open-source software for reliable, scalable, distributed computing

Written in Java

Scales to thousands of machines

• Linear scalability (with good algorithm design): if you have 2 machines, your job runs twice as fast

Uses a simple programming model (MapReduce)

Fault tolerant (HDFS)

• Can recover from machine/disk failure (no need to restart computation)

http://hadoop.apache.org

Page 8:

Why learn Hadoop?

Fortune 500 companies use it

Many research groups/projects use it

Strong community support, and favored/backed by major companies, e.g., IBM, Google, Yahoo, eBay, Microsoft

It’s free, open-source

Low cost to set up (works on commodity machines)

Will be an “essential skill”, like SQL

http://strataconf.com/strata2012/public/schedule/detail/22497

Page 9:

Elephant in the room

Hadoop was created by Doug Cutting and Michael Cafarella while at Yahoo

Hadoop is named after Doug’s son’s toy elephant


Page 10:

How does Hadoop scale up computation?

Uses a master-slave architecture and a simple computation model called MapReduce (popularized by Google’s paper)

Simple explanation

1. Divide data and computation into smaller pieces; each machine works on one piece

2. Combine results to produce the final result

MapReduce: Simplified Data Processing on Large Clusters
http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf

Page 11:

How does Hadoop scale up computation?

More technically...

1. Map phase: the master node divides data and computation into smaller pieces; each machine (“mapper”) works on one piece independently, in parallel

2. Shuffle phase (automatically done for you): the master sorts and moves the mappers’ results to “reducers”

3. Reduce phase: machines (“reducers”) combine results independently, in parallel (see the sketch below)
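
To make the three phases concrete, here is a minimal sketch that simulates them in plain Java on one machine. The class name MiniMapReduce is made up, and the toy input is the word-count example from the next few slides; real Hadoop runs each phase on many machines in parallel. (Requires Java 9+ for List.of and Map.entry.)

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  public class MiniMapReduce {
    public static void main(String[] args) {
      List<String> lines = List.of(
          "Apple Orange Mango Orange Grapes Plum",
          "Apple Plum Mango Apple Apple Plum");

      // 1. Map phase: in Hadoop each line would go to a different mapper;
      //    every mapper emits a (word, 1) pair for each word it sees.
      List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
      for (String line : lines)
        for (String word : line.split(" "))
          pairs.add(Map.entry(word, 1));

      // 2. Shuffle phase: pairs are sorted and grouped by key
      //    (Hadoop does this for you, between map and reduce).
      TreeMap<String, List<Integer>> grouped = new TreeMap<>();
      for (Map.Entry<String, Integer> p : pairs)
        grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

      // 3. Reduce phase: each reducer sums the counts for the keys it owns.
      grouped.forEach((word, counts) -> System.out.println(
          word + ", " + counts.stream().mapToInt(Integer::intValue).sum()));
    }
  }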


Page 12:

An example: Find words’ frequencies among text documents

Input

• “Apple Orange Mango Orange Grapes Plum”
• “Apple Plum Mango Apple Apple Plum”

Output

• Apple, 4
• Grapes, 1
• Mango, 2
• Orange, 2
• Plum, 3

http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html

Page 13:

[Diagram: word-count data flow. The master divides the data (each machine gets one line); each machine (mapper) outputs a key-value pair per word; pairs are sorted by key (automatically done); each machine (reducer) combines all pairs for a key into one. A machine can be both a mapper and a reducer.]

Page 14:

How to implement this?

  map(String key, String value):
    // key: document id
    // value: document contents
    for each word w in value:
      emit(w, "1");
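
One way to implement this map function is with Hadoop’s Java MapReduce API; the following is a sketch based on the well-known WordCount example (the class name WordCountMapper is illustrative, and the key type assumes the default TextInputFormat, which hands the mapper one line at a time):

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // key: byte offset of the line; value: one line of the document
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // emit (w, 1)
      }
    }
  }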

Page 15:

How to implement this?

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
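
The reduce function maps just as directly onto Hadoop’s Java API. A sketch (class names illustrative; note that, unlike the pseudocode above, this version also emits the key, which is the usual Hadoop convention):

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values)
        sum += v.get();            // result += ParseInt(v)
      result.set(sum);
      context.write(key, result);  // emit (word, total count)
    }
  }

A short driver then wires the two together and submits the job; this fragment would live in a main() method, with the usual Hadoop imports (WordCount stands for whatever class holds main()):

  Job job = Job.getInstance(new Configuration(), "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(WordCountMapper.class);
  job.setReducerClass(WordCountReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);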

Page 16:

What can you use Hadoop for?

As a “Swiss Army knife.”

Works for many types of analyses/tasks (but not all of them).

What if you want to write less code?

• There are tools that make it easier to write MapReduce programs (Pig) or to query results (Hive)


Page 17:

What if a machine dies?

Replace it!

• “map” and “reduce” jobs can be redistributed to other machines

Hadoop’s HDFS (Hadoop Distributed File System) enables this


Page 18:

HDFS: Hadoop Distributed File System

A distributed file system

Built on top of the OS’s existing file system to provide redundancy and distribution

HDFS hides the complexity of distributed storage and redundancy from the programmer

In short, you don’t need to worry much about this!
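
For the curious, here is a small sketch of what “not worrying” looks like in code: HDFS is reached through the same org.apache.hadoop.fs.FileSystem API no matter where the blocks physically live. The class name and paths below are made up, and the program assumes a cluster configuration (fs.defaultFS) is on the classpath:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsCopy {
    public static void main(String[] args) throws Exception {
      // Picks up fs.defaultFS from the cluster config, e.g. hdfs://namenode:9000
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Looks like an ordinary file copy; replication across machines
      // and block placement happen behind the scenes.
      fs.copyFromLocalFile(new Path("data.txt"), new Path("/user/polo/data.txt"));
      fs.close();
    }
  }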


Page 19:

How to try Hadoop?

Hadoop can run on a single machine (e.g., your laptop)

• Takes < 30 min from setup to running

Or a “home-brew” cluster

• Research groups often connect retired computers as a small cluster

Amazon EC2 (Amazon Elastic Compute Cloud)

• You only pay for what you use, e.g., compute time and storage

• You will use it in our next assignment (tentative)

http://aws.amazon.com/ec2/