Duke University //dsg.uwaterloo.ca/seminars/notes/babu-starfish.pdfPractitioners of Big Data...

Herodotos Herodotou, Harold Lim, Fei Dong,

Shivnath Babu

Duke University

http://www.cs.duke.edu/starfish

Practitioners of Big Data Analytics

6/29/2011 Starfish 2

Data Size

Google

Yahoo!

Facebook

Journalists

Systems

researchers

Workload

Complexity

Biologists

Economists

Physicists

Counts &

Aggregates

Rollups &

Drilldowns

Statistical analysis,

Linear Algebra

Machine

learning

Text / Images / Video / Graphs

MapReduce/Hadoop Ecosystem

MapReduce Execution Engine

Distributed File System

Hadoop

Oozie Hive Pig Elastic

MapReduce

Java / Ruby /

Python Client

MADDER Principles of Big Data Analytics

Magnetic

Data-lifecycle-aware

Elastic

Robust

Magnetic

Easy to get data into the system

Make change (data/requirements) easy

Support the full spectrum of analytics

Write MapReduce programs in Java /

Python / R or use interfaces like Pig / Jaql

Data-lifecycle-

Data cycle at LinkedIn

Elastic

Adapt resources/costs to actual workloads

6/29/2011 Starfish 10

Robust Graceful degradation under undesirable

events

6/29/2011 Starfish 11

Magnetic

Elastic

Robust

Data can be opaque until run-time

of use

Hard to get good

performance

out of the box

Programs are a different beast

from SQL

Policies are nontrivial

Ease-of-Use Vs. Out-of-the-box Perf.

Starfish: MADDER + Self-Tuning

6/29/2011 Starfish 12

Dynamic instrumentation

Profile

what-if calls Relative

simulation

Mix of models &

Recursive random search

of use

Get good performance

automatically

Magnetic

Elastic

Robust

Starfish: MADDER + Self-Tuning

6/29/2011 Starfish 13

Goal: Provide good performance automatically

MapReduce Execution Engine

Distributed File System

Hadoop

Oozie Hive Pig Elastic MR Java Client …

Starfish

Analytics System

What are the Tuning Problems?

6/29/2011 Starfish 14

Job-level

MapReduce

configuration

Workflow

optimization

Workload

management

Data layout

tuning

Cluster sizing

Starfish Architecture

6/29/2011 Starfish 15

What-if

Engine

Workflow-level tuning

Workflow Optimizer

Workload-level tuning

Workload Optimizer Elastisizer

Data Manager

Metadata

Intermediate

Data Mgr.

Data Layout &

Storage Mgr.

Job Optimizer

Profiler

Job-level tuning

Sampler

6/29/2011 Starfish 16

What-if

Engine

Workflow Optimizer

Job Optimizer

Profiler

Job-level tuning

Sampler

Data Manager

Metadata

Intermediate

Data Mgr.

Data Layout &

Storage Mgr.

Map function

Reduce function

Run this program as a

MapReduce job

MapReduce Job Execution

Input Splits

job j = < program p, data d, resources r, configuration c >

Map Wave 1

Reduce Wave 1

Map Wave 2

Reduce Wave 2

Input Splits

MapReduce Job Execution

How are the number of splits, number of map and reduce

tasks, memory allocation to tasks, etc., determined?

Optimizing MapReduce Job Execution

6/29/2011 Starfish 19

Space of configuration choices include settings for:

Number of map tasks

Number of reduce tasks

Partitioning of map outputs to reduce tasks

Memory allocation to task-level buffers

Multiphase external sorting in the tasks

Whether output data from tasks should be compressed

Whether combine function should be used

Optimizing MapReduce Job Execution

190+ parameters in Hadoop

6/29/2011 Starfish 20

2-dim projection

of 13-dim surface

feature will greatly increase the utility

Hadoop has that can affect the latency and scalability of a job. For different types of jobs, different configurations will yield optimal results. For example, a job with no memory-intensive operations in the map phase but with a combine phase will want to set Hadoop's io.sort.mb quite high, to minimize the number of spills from the map.

Adding this of Pig for Hadoop users, as it will free them from needing to understand Hadoop well enough to tune it themselves for their particular jobs.

Case for Automated Hadoop Tuning (from wiki.apache.org/pig/PigJournal)

6/29/2011 Starfish 21

many configuration parameters

Starfish’s Core Approach to Tuning

Challenges: p is an arbitrary MapReduce program; c is

high-dimensional; …

6/29/2011 Starfish 22

),,,(minarg crdpFcSc

),,,( crdpFperf program p, data d,

resources r,

configuration c

Profiler

What-if Engine

Optimizer(s)

Runs MapReduce jobs to collect job

profiles (concise execution summary)

Given profile of j = <p,d,r,c>, estimates

virtual profile for job j' = <p,d’,r’,c’>

Enumerates and searches through the

optimization space efficiently

Profile of a MapReduce Job

6/29/2011 Starfish 23

split 0 map out 0 reduce

Two Map Waves One Reduce Wave

split 2 map

split 1 map split 3 map Out 1 reduce

Concise representation of program execution

Records information at the level of “task phases”

6/29/2011 Starfish 24

Memory Buffer

[Combine],

[Compress]

Serialize,

Partition

Spill Collect Map Read

Map Task Phases

split 0

6/29/2011 Starfish 25

Profile

Dataflow

Amount of data flowing though tasks & task phases

Dataflow Statistics

Statistical info about the dataflow

Execution time at level of tasks & task phases

Cost Statistics

Statistical info about the costs

Fields in a Job Profile Dataflow

Map output bytes

Number of map-side spills

Number of merge rounds

Number of records in buffer per spill

6/29/2011 Starfish 26

Read phase time in the map task

Map phase time in the map task

Collect phase time in the map task

Spill phase time in the map task

Dataflow Statistics

Map func’s selectivity (output / input)

Map output compression ratio

Combiner’s selectivity

Size of records (keys and values)

Cost Statistics

I/O cost for reading from local disk per byte

I/O cost for writing to HDFS per byte

CPU cost for executing Map func per record

CPU cost for uncompressing the input per byte

Generating Profiles Concise representation of program execution

Generated by Profiler through measurement or by the

What-if Engine through estimation

6/29/2011 Starfish 27

Generating Profiles by Measurement

Have zero overhead when off, low overhead when on

Require no modifications to Hadoop

Support unmodified MapReduce programs written in

Java/Python/Ruby/C++

Approach: Dynamic (on-demand) instrumentation

Event-condition-action rules are specified (in Java)

Leads to run-time instrumentation of Hadoop internals

Monitors task phases of MapReduce job execution

We currently use BTrace (Hadoop internals are in Java)

6/29/2011 Starfish 28

6/29/2011 Starfish 29

split 1 map

raw data

profile

reduce

profile

Use of Sampling

• Profile fewer tasks

• Execute fewer tasks

JVM = Java Virtual Machine, ECA = Event-Condition-Action

JVM JVM

Enable Profiling

ECA rules

Generating Profiles by Measurement

Overhead of Profile Measurement

Word Co-occurrence job running on a 16-node cluster of c1.medium EC2 nodes on the Amazon Cloud

6/29/2011 Starfish 30

Task Scheduler Simulator

What-if Engine

6/29/2011 Starfish 31

Job Oracle

What-if Engine

Profile

Input Data

Properties

Cluster

Resources

Configuration

Settings

Virtual Job Profile for <p, d2, r2, c2>

Properties of Hypothetical Job

Possibly Hypothetical

<p, d1, r1, c1> <d2> <r2> <c2>

What-if Questions Starfish can Answer

How will job j’s execution time change if the number of

reduce tasks is changed from 20 to 40?

What will the change in I/O be if map o/p compression

is turned on, but the input data size increases by 40%?

What will job j’s new execution time be if 5 more nodes

are added to the cluster, bringing the total to 20?

How will workload execution time & dollar cost change

if we move the production cluster from m1.xlarge nodes

to c1.medium nodes on Amazon EC2?

6/29/2011 Starfish 32

(Virtual) Profile for j'

Virtual Profile Estimation

6/29/2011 Starfish 33

Dataflow

Statistics

Dataflow

Statistics

Profile for j

Data d2

Confi-

guration

Resources

Dataflow

Statistics

Cardinality

Models

White-box Models

Statistics Relative

Black-box

Models

Given profile for job j = <p, d1, r1, c1>

Estimate profile for job j' = <p, d2, r2, c2>

Dataflow

White-box Models

Job Optimizer

6/29/2011 Starfish 34

Profile

Input Data

Properties

Cluster

Resources

Subspace Enumeration

Recursive Random Search

Job Optimizer

What-if

<p, d1, r1, c1> <d2> <r2>

Best Configuration

Settings <copt> for <p, d2, r2>

Experimental Setup

6/29/2011 Starfish 35

Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type

2 map slots & 2 reduce slots

300MB max memory per task

Cost-Based Job Optimizer Vs. Rule-Based Optimizer

Experimental Setup Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type

2 map slots & 2 reduce slots

300MB max memory per task

Cost-Based Job Optimizer Vs. Rule-Based Optimizer

6/29/2011 Starfish 36

Abbr. MapReduce Program Domain Dataset

CO Word Co-occurrence NLP 30GB,Wikipedia

WC WordCount Text Analytics 30GB, Wikipedia

TS TeraSort Business Analytics 30GB, Teragen

LG LinkGraph Graph Processing 10GB, Wikipedia (compressed)

JO Join Business Analytics 30GB, TPC-H

Job Optimizer Evaluation

6/29/2011 Starfish 37

CO WC TS LG JO

MapReduce Job

Default

Settings

Rule-Based

Optimizer

Just-in-Time

OptimizerCost-Based Job

Insights from Profiles for WordCount

6/29/2011 Starfish 38

Many, small spills

Combiner gave smaller data reduction

Better resource utilization in Mappers

Few, large spills

Combiner gave high data reduction

Combiner made Mappers CPU bound

A: Rule-Based B: Cost-Based (2x faster)

Estimates from the What-if Engine

6/29/2011 Starfish 39

True surface Estimated surface

6/29/2011 Starfish 40

What-if

Engine

Workflow Optimizer

Job Optimizer

Profiler

Job-level tuning

Sampler

Data Manager

Metadata

Intermediate

Data Mgr.

Data Layout &

Storage Mgr.

Workflow Optimization Space

6/29/2011 Starfish 41

Job-level Configuration

Dataset-level Configuration

Vertical Packing

Intra-job Inter-job

Partition Function Selection

Optimization Space

Physical Logical

Optimizations on TF-IDF Workflow

6/29/2011 Starfish 42

Logical

Optimization

… D0 <{D},{W}>

… <{D, W},{f}>

… <{D},{W, f, c}>

J3, J4

… <{W},{D, t}>

Partition:{D}

Sort: {D,W} M1

… D0 <{D},{W}>

J1, J2

… <{D},{W, f, c}>

J3, J4

… <{W},{D, t}>

Physical

Optimization

Reducers= 50 Compress = off Memory = 400 …

Reducers= 20 Compress = on Memory = 300 …

Legend

D = docname f = frequency

W = word c = count

t = TF-IDF

6/29/2011 Starfish 43

What-if

Engine

Workflow Optimizer

Job Optimizer

Profiler

Job-level tuning

Sampler

Data Manager

Metadata

Intermediate

Data Mgr.

Data Layout &

Storage Mgr.

6/29/2011 Starfish 44

Cloud enables users to provision Hadoop clusters in minutes

Can avoid the man (system administrator) in the middle

Pay only for what resources are used

0200400600800

1,0001,200

m1.small m1.large m1.xlarge c1.medium c1.xlarge

EC2 Instance Type

Multi-objective Cluster Provisioning

6/29/2011 Starfish 45

0200400600800

1,0001,200

EC2 Instance Type for Target Cluster

Actual

Predicted

EC2 Instance Type for Target Cluster

Actual

Predicted

Instance Type for Source Cluster: m1.large

Multi-objective Cluster Provisioning

More Info: www.cs.duke.edu/starfish

6/29/2011 Starfish 46

Job-level

MapReduce

configuration

Workflow

optimization

Workload

management

Data layout

tuning

Cluster sizing

Duke University //dsg.uwaterloo.ca/seminars/notes/babu-starfish.pdfPractitioners of Big Data...

Documents

Starfish fx presentation

Soble’s starfish :

Starfish CMS

Efficient(Concurrency(Supportroblkw.com/likamwa2015starfish-slides.pdf · faceDetect(img) Vision* Library Starfish Starfish Library Starfish Library. Starfish*Core Function Cache

Starfish work power_point

“Starfish” imepotea wapi

Introduction to Starfish - Everett Community College · 2020-04-15 · 1 Introduction to Starfish Welcome to Starfish, Everett Community College’s student success technology! Starfish

Super Starfish

Brine School Starfish

Welcome to Starfish - University of Montana€¦ · Web viewStudent Getting Started Guide Starfish ® Version 6.3 Welcome to Starfish ® Starfish provides you with a central location

Starfish Strategic Websites

Starfish K2 ApNote

STARFISH DISSECTION

January 2010 PONCA PUBLIC SCHOOLStarfish 7 Angelfish 8 Angelfish 11 Starfish 12 Angelfish 13 Starfish 14 Angelfish 15 Starfish 18 Starfish 19 Angelfish 20 Starfish 21 Angelfish 22

Starfish Holiday Release

Catálogo STARFISH

Starfish Getting Started - University of Findlay · Starfish%on%any%single%student.%%In%the%student%folder%you% ... Starfish%are%not%viewable%in%the%student%folder%for% ... Starfish

Starfish Troy University

Starfish: App Details

ExperienceChange - Starfish Learning