Big Data Platforms, Mihai Budiu, Oct 6 2014

Page 1: Big Data  Platforms

Big Data Platforms

Mihai Budiu, Oct 6 2014

Page 2: Big Data  Platforms

My work

• Ph.D. from Carnegie Mellon, 2003
  • Hardware synthesis
  • Reconfigurable hardware
  • Compilers and computer architecture
• Researcher at Microsoft Research Silicon Valley, 2004-2014
  • Computer security
  • Cloud computing infrastructure:
    • distributed computation platforms
    • monitoring and debugging
    • performance analysis
  • Big data analysis and visualization
  • Large scale machine learning

Page 3: Big Data  Platforms

500 Years Ago

Tycho Brahe (1546-1601)

Johannes Kepler (1571-1630)

Page 4: Big Data  Platforms

The Laws of Planetary Motion

Tycho’s measurements

Kepler’s laws

Page 5: Big Data  Platforms

The Large Hadron Collider

• 25 PB/year
• Worldwide LHC Computing Grid (WLCG): 200K computing cores

Page 6: Big Data  Platforms

Genetic Code

Page 7: Big Data  Platforms

Astronomy

Page 8: Big Data  Platforms

Weather

Page 9: Big Data  Platforms

The Webs

Internet

Facebook friends graph

Page 10: Big Data  Platforms

Big Data

Page 11: Big Data  Platforms

Big Computers

Page 12: Big Data  Platforms

Talk Outline

• Motivation
• Dryad: A distributed runtime
• DryadLINQ: A compiler for Dryad
• Tools and applications
• Sketch: A billion-row spreadsheet

Page 13: Big Data  Platforms

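Design Space

[Figure: design space of computing platforms. Axis labels: throughput (batch) vs. latency (interactive), and Internet vs. datacenter. Labeled regions: data-parallel, shared memory, Dryad, Search, HPC, Grid, Transaction, Sketch.]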

Page 14: Big Data  Platforms

Dryad

• EuroSys 2007
• Continuously deployed in Microsoft since 2006
• Execution engine of Bing analytics
• > 10^5 machines
• Many PB of data analyzed daily

Dryad painting by Evelyn de Morgan

Page 15: Big Data  Platforms

Dryad = Execution Layer

Job (application) ≈ Pipeline
Dryad ≈ Shell
Cluster ≈ Machine

Page 16: Big Data  Platforms

2-D Piping

• Unix Pipes: 1-D
  grep | sed | sort | awk | perl
• Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
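The 2-D notation attaches a replication count to each pipeline stage. As a minimal sketch of that reading, the following C# models a job as a list of stages and counts the vertices it would contain. The `Stage` record and `PipelineModel` class are hypothetical names introduced for illustration; they are not part of Dryad's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The pipeline from the slide, written as (program, copies) pairs.
var job = new List<Stage>
{
    new Stage("grep", 1000),
    new Stage("sed", 500),
    new Stage("sort", 1000),
    new Stage("awk", 500),
    new Stage("perl", 50),
};

Console.WriteLine(PipelineModel.Describe(job));      // grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
Console.WriteLine(PipelineModel.TotalVertices(job)); // 3050

// Hypothetical model (an assumption, not Dryad's API): one pipeline stage
// and the number of parallel copies that run it.
public record Stage(string Program, int Copies);

public static class PipelineModel
{
    // Renders the job in the slide's "program^copies" notation.
    public static string Describe(IEnumerable<Stage> stages) =>
        string.Join(" | ", stages.Select(s => $"{s.Program}^{s.Copies}"));

    // Each copy of a stage is one vertex, so the total is the sum of copies.
    public static int TotalVertices(IEnumerable<Stage> stages) =>
        stages.Sum(s => s.Copies);
}
```

The model deliberately ignores how copies of adjacent stages are wired together; it only makes the second dimension (copies per stage) explicit.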

Page 17: Big Data  Platforms

Virtualized 2-D Pipelines

Page 18: Big Data  Platforms

Virtualized 2-D Pipelines

Page 19: Big Data  Platforms

Virtualized 2-D Pipelines

Page 20: Big Data  Platforms

Virtualized 2-D Pipelines

Page 21: Big Data  Platforms

Virtualized 2-D Pipelines

• 2D DAG
• multi-machine
• virtualized

Page 22: Big Data  Platforms

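Dryad Job Structure

[Figure: a Dryad job drawn as a graph. Input files feed vertices (processes) running grep, sed, sort, awk and perl; vertices are grouped into stages and connected by channels; the last vertices write the output files.]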

Page 23: Big Data  Platforms

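Dryad System Architecture

[Figure: the job manager holds the job schedule and drives the cluster over the control plane, talking to the name server (NS), the scheduler (Sched) and the remote execution services (RE). Vertices (V) run on the cluster and exchange data over the data plane: files, TCP, FIFO, network.]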

Page 24: Big Data  Platforms

Staging

1. Build
2. Send .exe
3. Start manager
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution

[Figure labels: GM code, vertex code, name server, remote execution service.]

Page 25: Big Data  Platforms

Talk Outline

• Motivation
• Dryad: A distributed runtime
• DryadLINQ: A compiler for Dryad
• Tools and applications
• Sketch: A billion-row spreadsheet

Page 26: Big Data  Platforms

Distributed Collections

[Figure: a collection of .NET objects split into partitions.]

Page 27: Big Data  Platforms

LINQ + Dryad => DryadLINQ

Page 28: Big Data  Platforms

LINQ = .NET + Queries

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
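The query on this slide can be tried against an ordinary in-memory list with plain LINQ to Objects. The sketch below is only an illustration: the `Row` record, the string key type, and the bodies of `IsLegal` and `Hash` are stand-ins chosen here, since the slide declares these names without defining them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var collection = new List<Row>
{
    new Row("alice", 1),
    new Row("", 2),      // dropped by IsLegal
    new Row("bob", 3),
};

foreach (var r in QueryDemo.Run(collection))
    Console.WriteLine($"{r.KeyHash} {r.Value}");

// Hypothetical element type; the slide only shows fields named key and value.
public record Row(string Key, int Value);

public static class QueryDemo
{
    // Stand-in predicate (assumption): a key is legal when it is non-empty.
    public static bool IsLegal(string key) => !string.IsNullOrEmpty(key);

    // Stand-in "hash" (assumption): the upper-cased key, to keep output readable.
    public static string Hash(string key) => key.ToUpperInvariant();

    // Same shape as the slide's query: filter, then project to a new record.
    public static List<(string KeyHash, int Value)> Run(IEnumerable<Row> collection) =>
        (from c in collection
         where IsLegal(c.Key)
         select (KeyHash: Hash(c.Key), Value: c.Value)).ToList();
}
```

The demo prints `ALICE 1` and `BOB 3`. Nothing here is distributed; the point is only that the query text is ordinary C#.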

Page 29: Big Data  Platforms

DryadLINQ = LINQ + Dryad

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

[Figure: the C# query over collection is turned into a query plan (a Dryad job) whose vertices run C# vertex code over the data to produce results.]

Page 30: Big Data  Platforms

Language Summary

Where
Select
GroupBy
OrderBy
Aggregate
Join

Page 31: Big Data  Platforms

Very expressive

var result = input.SelectMany(r => Mapper(r))
                  .GroupBy(r => Key(r))
                  .Select(g => Reducer(g));

• Map-Reduce
• Distributed sorting
• Iterative machine-learning (EM)
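To make the Map-Reduce pattern on this slide concrete, here is a word count written with the same three operators, run locally with LINQ to Objects. The `Mapper`, `Key` and `Reducer` bodies are illustrative choices made here; the slide leaves them abstract.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var input = new[] { "big data", "big computers" };
foreach (var (word, count) in MapReduceDemo.WordCount(input))
    Console.WriteLine($"{word}: {count}");

public static class MapReduceDemo
{
    // Map: one input line produces many words.
    static IEnumerable<string> Mapper(string line) =>
        line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Grouping key: the word itself.
    static string Key(string word) => word;

    // Reduce: collapse a group of equal words into (word, count).
    static (string Word, int Count) Reducer(IGrouping<string, string> g) =>
        (g.Key, g.Count());

    // Same operator chain as on the slide.
    public static List<(string Word, int Count)> WordCount(IEnumerable<string> input) =>
        input.SelectMany(r => Mapper(r))
             .GroupBy(r => Key(r))
             .Select(g => Reducer(g))
             .ToList();
}
```

The demo prints `big: 2`, `data: 1`, `computers: 1`, in order of first appearance.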

Page 32: Big Data  Platforms

Talk Outline

• Motivation
• Dryad: A distributed runtime
• DryadLINQ: A compiler for Dryad
• Tools and applications
• Sketch: A billion-row spreadsheet

Page 33: Big Data  Platforms

Debugging DryadLINQ jobs

Page 34: Big Data  Platforms

Distributed performance counters

Page 35: Big Data  Platforms

Training Kinect

[Figure: a classifier running on the Xbox GPU maps a depth map to body parts.]

Page 36: Big Data  Platforms

Learn from Many Examples

[Figure: machine learning over many examples produces a decision tree classifier.]

Page 37: Big Data  Platforms

Talk Outline

• Motivation
• Dryad: A distributed runtime
• DryadLINQ: A compiler for Dryad
• Tools and applications
• Sketch: A billion-row spreadsheet

Page 38: Big Data  Platforms

Bandwidth hierarchy

Page 39: Big Data  Platforms

Principles

• Visualizations are bounded data displays
• All computations are sketches
• Sketch is a runtime for
  (1) running streaming (sketching) algorithms
  (2) implementing visualizations with bounded data renderings

Page 40: Big Data  Platforms

Streaming algorithms

• Sketches = randomized streaming algorithms
• Input = set of size n
• Result is the same regardless of input order
• Memory = O(log(n))
• Multi-pass
• Linear input transformations
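One way to see what "order-independent" and "mergeable" mean in practice is a bottom-k sample: keep the k values whose hashes are smallest. The sketch below is a stand-in chosen for illustration, not necessarily an algorithm Sketch uses; `BottomKSketch`, its methods, and the hash mixer are assumptions made here. Its memory is O(k) rather than the O(log n) quoted on the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var all = Enumerable.Range(0, 1000).Select(i => (long)i).ToList();

// Summarize two halves separately, then merge the summaries.
var left = BottomKSketch.Create(all.Take(400), 5);
var right = BottomKSketch.Create(all.Skip(400), 5);
var merged = BottomKSketch.Combine(new[] { left, right }, 5);

// Summarize the whole input in one pass, in a different order.
var direct = BottomKSketch.Create(all.OrderByDescending(x => x), 5);

Console.WriteLine(merged.SetEquals(direct)); // True

// Hypothetical helper, named here for illustration only.
public static class BottomKSketch
{
    // Deterministic 64-bit mixer (SplitMix64-style finalizer) used as the hash.
    static ulong Mix(long v)
    {
        unchecked
        {
            ulong x = (ulong)v + 0x9E3779B97F4A7C15UL;
            x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9UL;
            x = (x ^ (x >> 27)) * 0x94D049BB133111EBUL;
            return x ^ (x >> 31);
        }
    }

    // Streaming pass: at most k (hash, value) pairs are held at any time.
    public static SortedSet<(ulong Hash, long Value)> Create(IEnumerable<long> data, int k)
    {
        var kept = new SortedSet<(ulong Hash, long Value)>();
        foreach (var v in data) Offer(kept, (Mix(v), v), k);
        return kept;
    }

    // Merging summaries gives the same result as summarizing the union.
    public static SortedSet<(ulong Hash, long Value)> Combine(
        IEnumerable<SortedSet<(ulong Hash, long Value)>> parts, int k)
    {
        var kept = new SortedSet<(ulong Hash, long Value)>();
        foreach (var part in parts)
            foreach (var item in part) Offer(kept, item, k);
        return kept;
    }

    // Insert, then evict the pair with the largest hash if over capacity.
    static void Offer(SortedSet<(ulong Hash, long Value)> kept, (ulong Hash, long Value) item, int k)
    {
        kept.Add(item);
        if (kept.Count > k) kept.Remove(kept.Max);
    }
}
```

Because the kept set depends only on the hashes of the distinct input values, neither the arrival order nor the way the input is split across machines changes the result.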

Page 41: Big Data  Platforms

4 billion rows on 155 machines

Page 42: Big Data  Platforms

Spreadsheet operations

• Browsing/scrolling
• Filtering
  • Using predicates
  • Heavy hitters
  • Sampling
• Searching
• Sorting
• Computing new columns
• Set operations (intersection, union, etc.)
• Charting

Page 43: Big Data  Platforms

Histograms

Page 44: Big Data  Platforms

Heat Maps

Page 45: Big Data  Platforms

Sketch distributed service

[Figure: four servers, each running a Sketch service next to its own data.]

Page 46: Big Data  Platforms

DataSets = distributed objects

[Figure: on the client, the application holds a DataSet<T>; across the network, the servers hold the elements of type T.]

Page 47: Big Data  Platforms

Sketch Spreadsheet architecture

[Figure: layered architecture. Labels: GUI (spreadsheet display), spreadsheet logic, table operations, distributed objects (DataSet<Table>), and storage layer (SQL Server, CSV files, column store, Cosmos).]

Page 48: Big Data  Platforms

DataSet API

interface IDataSet<T> {
    IDataSet<S> Map<S>(Func<T,S> f);
    IDataSet<Pair<T,S>> Zip<S>(IDataSet<S> other);
    R Sketch<R>(ISketch<T,R> sketch);
}

interface ISketch<T,R> {
    R Create(T data);
    R Combine(List<R> parts);
}
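A small in-process sketch of how this API composes, assuming two implementations: a leaf that wraps one partition and an interior node that fans out to children. `LocalDataSet`, `ParallelDataSet` and `RowCountSketch` are names chosen here for illustration, `Zip` is omitted for brevity, and PLINQ stands in for whatever parallelism the real system uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var table = new ParallelDataSet<int[]>(new IDataSet<int[]>[]
{
    new LocalDataSet<int[]>(new[] { 1, 2, 3 }),
    new LocalDataSet<int[]>(new[] { 4, 5 }),
});

Console.WriteLine(table.Sketch(new RowCountSketch()));  // 5
var evens = table.Map(p => p.Where(x => x % 2 == 0).ToArray());
Console.WriteLine(evens.Sketch(new RowCountSketch()));  // 2

public interface ISketch<T, R>
{
    R Create(T data);          // summarize one partition
    R Combine(List<R> parts);  // merge partial summaries
}

public interface IDataSet<T>
{
    IDataSet<S> Map<S>(Func<T, S> f);
    R Sketch<R>(ISketch<T, R> sketch);
}

// Leaf (assumed implementation): wraps one in-memory partition.
public sealed class LocalDataSet<T> : IDataSet<T>
{
    private readonly T data;
    public LocalDataSet(T data) { this.data = data; }
    public IDataSet<S> Map<S>(Func<T, S> f) => new LocalDataSet<S>(f(data));
    public R Sketch<R>(ISketch<T, R> sketch) => sketch.Create(data);
}

// Interior node (assumed implementation): applies the operation to every
// child in parallel and, for Sketch, combines the partial results.
public sealed class ParallelDataSet<T> : IDataSet<T>
{
    private readonly List<IDataSet<T>> children;
    public ParallelDataSet(IEnumerable<IDataSet<T>> children) { this.children = children.ToList(); }

    public IDataSet<S> Map<S>(Func<T, S> f) =>
        new ParallelDataSet<S>(children.AsParallel().AsOrdered().Select(c => c.Map(f)).ToList());

    public R Sketch<R>(ISketch<T, R> sketch) =>
        sketch.Combine(children.AsParallel().AsOrdered().Select(c => c.Sketch(sketch)).ToList());
}

// Example sketch: counts rows across all partitions.
public sealed class RowCountSketch : ISketch<int[], long>
{
    public long Create(int[] data) => data.LongLength;
    public long Combine(List<long> parts) => parts.Sum();
}
```

Because `ParallelDataSet` is itself an `IDataSet`, nodes can be nested to any depth, which is how a tree of racks, servers and cores could share one interface.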

Page 49: Big Data  Platforms

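DataSet Implementations

[Figure: implementations of the dataset interface. On the client, the application and GUI hold a Parallel dataset of Proxy objects; each Proxy is a reference that crosses the network through the RMI layer (cluster parallelism). Parallel datasets aggregate per rack, Rack 0 to Rack r (rack aggregation). On each server, Server 0 to Server n, a Parallel dataset of Local datasets holds the data of type T in that server's address space (core parallelism).]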

Page 50: Big Data  Platforms

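Map(f)

[Figure: Map(f) applies f at each Local dataset, turning a Proxy / Parallel / Local tree over elements of type T into a tree of the same shape over elements of type S.]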

Page 51: Big Data  Platforms

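Sketch(s)

[Figure: s.Create turns the data T at each Local dataset into a partial result R; s.Combine merges the partial results at each Parallel dataset, and the final R returns through the Proxy.]

interface ISketch<T,R> {
    R Create(T data);
    R Combine(List<R> parts);
}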

Page 52: Big Data  Platforms

Zip

[Figure: Zip pairs two Proxy / Parallel / Local trees of the same shape, one over T and one over S, into a single tree whose Local datasets hold (T,S) pairs.]

Page 53: Big Data  Platforms

Histograms

CDF

2D histogram

Page 54: Big Data  Platforms

Computing a histogram

[Figure: timeline between the client and Server 1 to Server n. After a user click, the servers compute a data range sketch, then a histogram sketch (1D + 2D composite sketch); the client renders and displays the histogram.]
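The two-step computation (data range first, bucket counts second) can be sketched with the deck's `ISketch<T,R>` interface. This is an illustration under assumptions: `RangeSketch`, `HistogramSketch`, equal-width buckets and `double[]` partitions are choices made here, and the "servers" are just elements of a list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Two partitions standing in for the data held by two servers.
var partitions = new List<double[]>
{
    new[] { 0.0, 1.0, 2.0, 9.0 },
    new[] { 5.0, 10.0 },
};

// Pass 1: find the global range.
var rangeSketch = new RangeSketch();
var range = rangeSketch.Combine(partitions.Select(p => rangeSketch.Create(p)).ToList());

// Pass 2: count values per bucket over that range.
var histSketch = new HistogramSketch(range.Min, range.Max, 2);
var counts = histSketch.Combine(partitions.Select(p => histSketch.Create(p)).ToList());
Console.WriteLine(string.Join(",", counts)); // 3,3

public interface ISketch<T, R>
{
    R Create(T data);
    R Combine(List<R> parts);
}

// Assumed sketch: (min, max) of the data; an empty partition yields (+inf, -inf).
public sealed class RangeSketch : ISketch<double[], (double Min, double Max)>
{
    public (double Min, double Max) Create(double[] data) =>
        data.Length == 0
            ? (double.PositiveInfinity, double.NegativeInfinity)
            : (data.Min(), data.Max());

    public (double Min, double Max) Combine(List<(double Min, double Max)> parts)
    {
        double min = double.PositiveInfinity, max = double.NegativeInfinity;
        foreach (var p in parts) { min = Math.Min(min, p.Min); max = Math.Max(max, p.Max); }
        return (min, max);
    }
}

// Assumed sketch: equal-width bucket counts; partial results add element-wise.
public sealed class HistogramSketch : ISketch<double[], long[]>
{
    private readonly double min, max;
    private readonly int buckets;

    public HistogramSketch(double min, double max, int buckets)
    {
        this.min = min; this.max = max; this.buckets = buckets;
    }

    public long[] Create(double[] data)
    {
        var counts = new long[buckets];
        double width = (max - min) / buckets;
        foreach (var x in data)
        {
            if (x < min || x > max) continue;  // ignore values outside the range
            // The maximum value is clamped into the last bucket.
            int i = width > 0 ? Math.Min((int)((x - min) / width), buckets - 1) : 0;
            counts[i]++;
        }
        return counts;
    }

    public long[] Combine(List<long[]> parts)
    {
        var total = new long[buckets];
        foreach (var p in parts)
            for (int i = 0; i < buckets; i++) total[i] += p[i];
        return total;
    }
}
```

The result has a fixed size (one counter per bucket) no matter how many rows are scanned, which matches the idea of a bounded data display.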

Page 55: Big Data  Platforms

Some numbers

• Windows Server 2012 R2
• 8-core 2.1GHz AMD Opteron 2373 EE
• > 16GB RAM
• 3 x 1TB disks using RAID-0
• 155 machines
• 5 racks
• 1Gbps Ethernet

Page 56: Big Data  Platforms

Null Sketch

[Chart: time (ms), 0 to 600, against number of machines (1, 2, 4, 8, 16, 24, 32, 64, 128, 155). Two series: no aggregation network, with aggregation network.]

Page 57: Big Data  Platforms

Histogram computation

• 26M rows/machine
• Scale-out

[Chart: time (ms), 0 to 1600, against number of machines (1, 2, 4, 8, 16, 24, 32, 64, 128, 155).]

Page 58: Big Data  Platforms

Conclusions

• Big data is here to stay
• Better tools are needed
• Quest for high-level abstractions for building distributed systems
  • Execution graphs
  • Distributed collections
  • Higher-order transformations
  • Distributed stateful objects
  • Sketching algorithms

Page 59: Big Data  Platforms

Page 60: Big Data  Platforms

Data-Parallel Computation

[Figure: three software stacks compared layer by layer, with the application layer on top.]

Language     Sawzall, FlumeJava    Pig, Hive    DryadLINQ, Scope
             (Sawzall, Java)       (≈SQL)       (LINQ, SQL)
Execution    Map-Reduce            Hadoop       Dryad
Storage      GFS, BigTable         HDFS, S3     Cosmos, Azure, SQL Server