Monitoring and Debugging Dryad(LINQ) Applications with Daphne. Vilas Jagannath, Zuoning Yin, Mihai Budiu. University of Illinois, Microsoft Research SVC. International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) 2011.


Page 1: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Monitoring and Debugging Dryad(LINQ) Applications

with Daphne

Vilas Jagannath, Zuoning Yin, Mihai Budiu
University of Illinois, Microsoft Research SVC

International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) 2011

Page 2: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Programming Clusters: Marketing

Map-Reduce

Page 3: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Programming Clusters: Reality

Page 4: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Complexity Exposed

Correctness or performance bugs break the single-system abstraction

Page 5: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Outline

• Motivation
• Job structure
• The Job Object Model
• Tools for job understanding
• Conclusions

Page 6: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Data-Parallel Computation

Layer: Map-Reduce / Hadoop / Dryad
Application: Sawzall, Java / ≈SQL / LINQ, SQL
Language: Sawzall, FlumeJava / Pig, Hive / DryadLINQ, Scope
Execution: Map-Reduce / Hadoop / Dryad
Storage: GFS, BigTable / HDFS, S3 / Cosmos, Azure, HPC

Page 7: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

2-D Piping

• Unix Pipes: 1-D
  grep | sed | sort | awk | perl

• Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
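The 2-D pipe notation can be read as a list of stages, each replicated a given number of times. The following is a minimal Python sketch of that reading, assuming only the stage names and counts shown on the slide. The `Stage` class and the helper names are illustrative and are not part of Dryad.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One column of the 2-D pipeline: a program and how many copies of it run."""
    name: str
    width: int  # number of parallel instances (vertices) in this stage


# Stage names and replication counts taken from the slide.
PIPELINE = [
    Stage("grep", 1000),
    Stage("sed", 500),
    Stage("sort", 1000),
    Stage("awk", 500),
    Stage("perl", 50),
]


def vertex_names(stage):
    """Enumerate the vertices of one stage, e.g. grep[0], grep[1], ..."""
    return [f"{stage.name}[{i}]" for i in range(stage.width)]


def total_vertices(pipeline):
    """The second dimension: total process count across all stages."""
    return sum(stage.width for stage in pipeline)


if __name__ == "__main__":
    print(total_vertices(PIPELINE))
    print(vertex_names(Stage("perl", 2)))
```

A 1-D Unix pipe is the special case in which every stage has width 1.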

Page 8: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Dryad Job Structure

[Figure: job graph built from grep, sed, sort, awk, and perl vertices. Labels: Input files, Vertices (processes), Output files, Channels, Stage.]

Page 9: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Dryad System Architecture

[Figure: job manager and cluster. Labels: job schedule, control plane, data plane, Network, NS, Sched, Exec (three), V (three).]

Page 10: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

How does it work in detail?

[Figure: Localhost (IDE, Application, Compiler, Job Submission), a Firewall, and the Cluster/Cloud (Cluster Scheduler, Job Manager (JM), Vertex, Exec, Storage). The storage boxes are marked L, R, IO.]

L: Logs, IO: Input/Output, R: Resources

Page 11: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Logs – lots of them

• Job-related: plan (xml), status, resources
• Job manager: stdout.txt, stderr.txt, *.log
• Vertex: stdout.txt, *.log, *.xml, *.cmd

Page 12: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Monitoring Tools Structure

[Figure: layers, in the order listed on the slide: Cosmos, Scope, HPC v2, HPC v3; Cluster abstraction; Job Object Model; Monitoring, Profiling, Debugging; GUIs.]

Page 13: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Job Object Model

[Figure labels: Logs; JOM; Views (Job, Vertices, Plan); Tools.]
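As a rough illustration of a Job Object Model built from logs, the Python sketch below turns log lines into a `Job` with `Vertex` objects that tools can query. The log line format (`<vertex> <stage> <state>`), the class fields, and `parse_log` are all hypothetical and chosen for brevity. Only the ideas of a job, its vertices, and a vertex state come from the slides, and real Dryad logs are structured differently.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    stage: str
    state: str


@dataclass
class Job:
    name: str
    vertices: list = field(default_factory=list)

    def stages(self):
        """View: group the job's vertices by stage name."""
        by_stage = defaultdict(list)
        for vertex in self.vertices:
            by_stage[vertex.stage].append(vertex)
        return dict(by_stage)


def parse_log(job_name, lines):
    """Build a Job from lines in the hypothetical '<vertex> <stage> <state>' format.

    A later line for the same vertex overrides the earlier one, so each
    vertex ends up with its most recent state. Malformed lines are skipped.
    """
    latest = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        name, stage, state = parts
        latest[name] = Vertex(name, stage, state)
    return Job(job_name, list(latest.values()))


# Made-up sample input, for illustration only.
SAMPLE_LOG = [
    "grep[0] grep Running",
    "grep[1] grep Running",
    "grep[0] grep Completed",
    "sed[0] sed Failed",
    "malformed line",
]

if __name__ == "__main__":
    job = parse_log("demo", SAMPLE_LOG)
    for stage, members in job.stages().items():
        print(stage, [(v.name, v.state) for v in members])
```

The point of the design is that tools depend only on `Job` and `Vertex`, so a different cluster back end needs only a different parser.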

Page 14: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Outline

• Motivation
• Job structure
• The Job Object Model
• Tools for job understanding
• Conclusions

Page 15: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

The Job Browser

[Screenshot labels: Job, Stage, Vertex]

Page 16: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Job Schedule

Page 17: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Failure diagnosis

Page 18: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Diagnosis decision tree

• “Hand-made”
• Least portable tool
• Incomplete
• High-coverage
• Bug types:
  – User-level
  – System-level
  – Cluster malfunction
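A hand-made decision tree of this kind can be pictured as a short chain of checks that maps evidence about a failure to one of the three bug types. The Python sketch below is purely illustrative: the evidence fields, the order of the checks, and the `diagnose` function are assumptions made for this example and do not reproduce the rules used by the actual tool.

```python
from dataclasses import dataclass


@dataclass
class FailureEvidence:
    """Hypothetical facts a tool might extract from the logs of a failed vertex."""
    machine_unreachable: bool       # assumed signal of a cluster malfunction
    user_exception_in_stderr: bool  # assumed signal of a bug in user code
    runtime_error_in_log: bool      # assumed signal of a bug in the runtime


def diagnose(evidence):
    """Walk a fixed, hand-written sequence of checks and return a bug type.

    The tree is incomplete by design: failures that match no rule are
    reported as undiagnosed rather than guessed at.
    """
    if evidence.machine_unreachable:
        return "cluster malfunction"
    if evidence.user_exception_in_stderr:
        return "user-level bug"
    if evidence.runtime_error_in_log:
        return "system-level bug"
    return "undiagnosed"


if __name__ == "__main__":
    print(diagnose(FailureEvidence(False, True, False)))
```

Because every rule refers to specific log contents, a tree like this has to be rewritten for each cluster back end, which is consistent with the slide calling it the least portable tool.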

Page 19: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Powershell = Interactive Queries

$cluster = get-cluster X
$job = $cluster | select-AllJobs | sort-object Date | select-object -last 1 | select-DryadJob
$failed = $job.Vertices | where-object { $_.State -eq "Failed" }
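For readers less familiar with PowerShell pipelines, the same query (take the most recent job, then keep its failed vertices) can be sketched in Python. The `Job` and `Vertex` classes and the sample data are assumptions made for illustration. Only the `Vertices` collection, the `State` property, and the `"Failed"` value come from the slide.

```python
import datetime
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    state: str


@dataclass
class Job:
    name: str
    date: datetime.date
    vertices: list = field(default_factory=list)


def latest_job(jobs):
    """Counterpart of 'sort-object Date | select-object -last 1'."""
    return max(jobs, key=lambda job: job.date)


def failed_vertices(job):
    """Counterpart of 'where-object { $_.State -eq "Failed" }'."""
    return [v for v in job.vertices if v.state == "Failed"]


# Made-up sample data, for illustration only.
JOBS = [
    Job("old", datetime.date(2011, 5, 1), [Vertex("v0", "Failed")]),
    Job("new", datetime.date(2011, 5, 20),
        [Vertex("v0", "Completed"), Vertex("v1", "Failed"), Vertex("v2", "Failed")]),
]

if __name__ == "__main__":
    job = latest_job(JOBS)
    print(job.name, [v.name for v in failed_vertices(job)])
```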

Page 20: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Vertex Debugging on Client

Page 21: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Vertex Profiling on Client

Page 22: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Debugging on Cluster

Collection<T> collection;
var results = from c in collection
              where c.name.length > 10
              orderby c.age
              select c.name;

Breakpoint on: where c.name.length > 10

[Figure labels: Program, Job, Breakpoint]

Page 23: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Remote debugging

[Figure: Localhost (Visual Studio, Application, DryadLINQ, Job Submission), a Firewall, and the Cluster/Cloud (Cluster Scheduler, Job Manager (JM), Exec, Storage, Vertex 1, Vertex 2). The storage boxes are marked L, R, IO. Annotations: Breakpoint, "Breakpoint hit…", attach.]

L: Logs, IO: Input/Output, R: Resources

Page 24: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Notifications: Our Implementation

[Figure: Localhost (Visual Studio, Application, DryadLINQ, Job Submission), a Firewall, and the Cluster/Cloud (Cluster Scheduler, Job Manager (JM), Exec, Storage, Vertex 1, Vertex 2). The storage boxes are marked L, R, IO. Additional labels: Daphne, attach.]

L: Logs, IO: Input/Output, R: Resources

Page 25: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Remote debugging

Page 26: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Open Problems

• What happens when 100,000 processes hit a breakpoint?
• How to evaluate expressions in the debugger when state is distributed?
• How to do large-scale performance debugging?
• How to preserve the map between distributed state and original program state?
• How much can the illusion of a single system be preserved?

Page 27: Monitoring and Debugging  Dryad(LINQ)  Applications  with Daphne

Conclusions

• Single-machine abstractions break down in the presence of (performance/correctness) bugs

• Job Object Model insulates tools from messy details

• Design the cluster runtime to make it easy to build a JOM

• Rich interactive tools easily built on top of JOM

• Much more work needed for debugging at scale