mPlane – an Intelligent Measurement Plane for Future Network and Application Management
Grant Agreement n. 318627
WP3 overview
Marco Milanesio - EURECOM
mPlane Final Workshop, Heidelberg, Nov 30th, 2015
Outline
- Role of the Repository
- Main achievements
- Highlights
  - DBStream
  - HFSP
WP3: Role of the Repository
- Receive and store data from the probes
  - Large amounts of data
  - Data diversity
- Provide aggregated and pre-processed data to the reasoners
  - Specific for each use case
  - Generic data processing
WP3: Role of the Repository
- Store probe data in the repository
  - Interactions with WP2
  - Distributed file system and/or database layer
- Pre-process data
  - Interactions with WP4
  - Batch (MapReduce), near real-time (DBStream), and real-time (Blockmon)
- Serve data and pre-processing results to the Reasoner
  - Interactions with WP4
  - Database layer + efficient storage and indexing
WP3: Main Objectives
- Scalable algorithm design
  - Use-case driven (top-down approach)
  - Workload specification: data, I/O
- Scheduling analytic jobs
  - Resource allocation for concurrent analytic jobs
  - Target systems: batch processing engine(s), stream processing engine(s)
- Access to analytic and external data
  - Database layer tailored to probe data
  - Access control
WP3: Main Achievements
- Deliverables
  - D.3.1: Basic network data analysis
  - D.3.2: Database layer design
  - D.3.3: Algorithm and scheduler design and implementation
  - D.3.4: Final implementation of the data processing and storage layer
- Publications: 37 scientific articles
- Software: www.ict-mplane.eu/public/software
WP3: Software and tools
- Query Engines
  - Blockmon controller – stream processing
  - DBStream – flexible data stream warehouse
  - EZRepo – measurement data preprocessing for RCA
  - MATH – export of bulk data to DBStream
  - mPlane interfaces for Tstat – RI-based, to import RDDs
  - MongoDB proxy
  - repoSim – NS2-based simulator for fine-tuning
- Schedulers
  - HFSP – Hadoop Fair Sojourn Protocol
  - Schedule – cache-oblivious scheduling
WP3: Highlights
- DBStream – flexible data stream warehouse
- HFSP – Hadoop Fair Sojourn Protocol
DBStream in a Nutshell
- Stores and analyzes large amounts of network monitoring data
- A flexible and easy-to-use Data Stream Warehouse (DSW)
- Implemented as a middleware layer on top of PostgreSQL
- Receives, stores, and processes multiple data streams in parallel
DBStream
- Continuous analytics system
  - Processes and combines data from multiple sources as they are produced
  - Creates aggregations
  - Stores query results for further processing by external analysis modules
- Target: continuous network monitoring
  - Not limited to this context (smart grids, intelligent transportation systems, …)
DBStream: Architecture
(Architecture diagrams.) General database approach: data is inserted into the database, and analysis modules issue queries against it. DBStream approach: an import module feeds data into PostgreSQL, a view generation framework materializes views over it, and analysis modules query those views.
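To make the view-generation idea concrete, here is a minimal sketch, not DBStream's actual API: the flows table, its columns, and the traffic_1min output table are invented for illustration. An import module fills a base table, and a periodic job materializes a per-minute aggregate that analysis queries can later read.

```python
# Sketch of DBStream-style view generation on top of PostgreSQL.
# Assumes a hypothetical 'flows' table (ts, src_ip, bytes_up,
# bytes_down); this is NOT DBStream's real interface.
import time
import psycopg2

AGG_SQL = """
INSERT INTO traffic_1min (minute, total_up, total_down)
SELECT date_trunc('minute', ts), sum(bytes_up), sum(bytes_down)
FROM flows
WHERE ts >= to_timestamp(%s) AND ts < to_timestamp(%s)
GROUP BY 1;
"""

def generate_views(conn, start_epoch, step=60):
    """Materialize one aggregation window per iteration."""
    while True:
        with conn.cursor() as cur:
            cur.execute(AGG_SQL, (start_epoch, start_epoch + step))
        conn.commit()
        start_epoch += step
        time.sleep(step)  # wait for the next window to close
```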
Query Workload – analysis jobs
We consider 7 different analysis tasks (jobs) normally performed in traffic monitoring (the graph shows the job dependencies):
- J1: RTT stats per Orgname
- J2: Akamai stats
- J3: Top 10 Orgnames
- J4: Top 10 /24 subnets
- J5: Up/download volume per source IP
- J6: IPs active in the last hour, updated every minute (see the sketch below)
- J7: Avg. up/download in the last hour, updated every minute
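As a rough illustration of such a rolling job (the flows table and its columns are invented, not the actual mPlane schema), J6 can be phrased as plain SQL that is re-executed every minute over a one-hour sliding window:

```python
# Hypothetical SQL for J6: distinct IPs active in the last hour,
# re-run every minute. Table and column names are illustrative.
J6_SQL = """
SELECT DISTINCT src_ip
FROM flows
WHERE ts >= now() - interval '1 hour';
"""
```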
Performance comparison with Spark
Spark performance details
J6 is a “rolling query”, continuously updated
Conclusions
- A single DBStream node is up to 2.6× faster than 10 Spark nodes for specific analysis jobs
- Result projections: 446 minutes to process 4 vantage points (VPs), each covering 5 days of data, i.e. about 12 VPs in one day
  - Since one day of one VP's data takes 446 / (4 × 5) ≈ 22 minutes, DBStream can process the equivalent of about 60 VPs in real time, or 1 VP at 60 Gbit/s
- Hardware can be upgraded: more disks, SSDs? Running on top of parallel databases (e.g., Greenplum)?
HFSP in a Nutshell
- Objectives
  - A size-based scheduler that achieves both fairness and small response times
  - Scalable design: no scheduling bottlenecks
  - Small number of knobs to tune
- Challenges
  - Job size estimation while jobs make progress
  - Virtual time in a multi-processor setting
  - Lack of efficient preemption primitives
HFSP: size estimation
- Offline
  - Use history as a training set
  - Initial guess, to bootstrap the system
- Online
  - Task runtime measurements (a source of errors)
  - Reserve training slots from the resource pool
  - No updates after training
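In spirit, the online estimator extrapolates a job's total size from a handful of completed training tasks. A minimal sketch follows; the function name and the simple average are illustrative, not HFSP's exact estimator:

```python
def estimate_job_size(sample_runtimes, num_tasks):
    """Extrapolate the aggregate task runtime of a job from the
    measured runtimes of its completed training tasks."""
    avg_task_runtime = sum(sample_runtimes) / len(sample_runtimes)
    return avg_task_runtime * num_tasks

# e.g. 3 training tasks took 40s, 55s and 45s; the job has 200 tasks
size = estimate_job_size([40, 55, 45], 200)  # ~9333s of work
```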
HFSP: virtual time
- Job aging: a job progresses in virtual time even when it is not scheduled in real time
- Virtual time: max-min processor sharing (a GPS version as well); takes failures into account
- Job size: estimated aggregate sequential task runtime; map and reduce phases are treated separately
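A toy rendering of the aging step (single phase, uniform slots, equal max-min shares; deliberately far simpler than the real scheduler):

```python
def age_jobs(jobs, capacity, dt):
    """Advance virtual time by dt: the cluster capacity is shared
    max-min fairly among all pending jobs, so a job ages even when
    it holds no real slot."""
    pending = [j for j in jobs if j["virtual_remaining"] > 0]
    if not pending:
        return
    share = capacity * dt / len(pending)  # equal share per job
    for job in pending:
        job["virtual_remaining"] = max(0.0, job["virtual_remaining"] - share)
```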
HFSP: scheduling
- When a job is submitted
  - If tiny: assign a null size and fast-track it to execution
  - Else: compute an initial guess and schedule its training phase
- When a resource becomes available
  - If the training queue is not empty, schedule the job with the smallest initial guess
  - Else, assign a task from the job with the smallest virtual size
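Putting the two rules together, the decision logic amounts to the sketch below; the queue structures and the tiny-job threshold are invented for illustration:

```python
TINY_TASKS = 3  # illustrative threshold for "tiny" jobs

def on_job_submitted(job, training_queue, run_queue):
    if job["num_tasks"] <= TINY_TASKS:
        job["virtual_remaining"] = 0.0  # null size: fast track
        run_queue.append(job)
    else:
        job["virtual_remaining"] = job["initial_guess"]
        training_queue.append(job)

def on_resource_available(training_queue, run_queue):
    """Return the job whose next task should get the free slot."""
    if training_queue:
        return min(training_queue, key=lambda j: j["initial_guess"])
    if run_queue:
        return min(run_queue, key=lambda j: j["virtual_remaining"])
    return None
```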
HFSP: preemption
- HFSP is a preemptive scheduler: jobs with small sizes can “steal” resources
- Task preemption in Hadoop-like systems
  - KILLing tasks wastes work
  - WAITing for tasks yields sub-optimal allocation
- Our contribution: OS-assisted preemption (sketched below)
  - Introduces new primitives/state in the scheduler
  - Uses low-level UNIX signals (SIG*)
  - No thrashing!
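The core trick is to suspend and resume task processes instead of killing or waiting for them. A bare-bones sketch with standard UNIX signals (the real implementation manages Hadoop task processes and extra scheduler state, omitted here):

```python
import os
import signal

def suspend_task(pid):
    """Pause a task's process, keeping its work and memory state."""
    os.kill(pid, signal.SIGSTOP)

def resume_task(pid):
    """Let a previously suspended task's process continue."""
    os.kill(pid, signal.SIGCONT)
```

Suspended tasks keep their progress, so small jobs can jump ahead without the wasted work of a KILL.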
Experimental evaluation
- Setup
  - 20-node cluster (in our own private cloud)
  - 100-node AWS cluster (not shown here; results available in our papers)
- Workload: PigMix, 100 jobs
HFSP: results for “responsiveness”
- Overall benchmark: the mean sojourn time (MST) is ~30% smaller with HFSP than with FAIR
- Tiny jobs receive the same treatment under HFSP and FAIR
- Medium, large, and huge jobs consistently fare better under HFSP because they are scheduled in “sequence”
HFSP: results for “fairness”
(Plots for the DEV, TEST, and PROD workloads.)
- HFSP dominates FAIR sharing
- The “heavier” the workload, the better HFSP does compared to FAIR
- For the PROD workload, the median gap is one order of magnitude in favor of HFSP
Conclusion
- “Long live” size-based scheduling!
  - Previous “theoretical” results were too negative
  - Accurate size information is not required
- The devil is in the details
  - Under-specification can cause severe problems
  - A practical implementation is tricky on multi-processors
Thank you!
BACKUP
Size-based schedulers example
Impact of Errors
MST = mean sojourn time; MST(PS) = the MST achieved by processor sharing
Major problems with heavy-tailed job size distributions
What’s the problem?
- SRPT with errors: only one job is affected
- FSP with errors: all jobs are affected
Continuous Execution Language (CEL)
- A CEL job: multiple input time windows, an SQL query, a data schema, a single output
- Incremental queries: use past output as input; rolling set analysis
(Diagram: a job applies an SQL query to multiple inputs and produces a single output.)
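To picture what such a job looks like, here is a hypothetical rendering as a plain Python dict (CEL's concrete syntax differs; all table and field names are invented). It combines the ingredients above: input windows, an SQL body, a schema, and one output, with the previous output fed back in for the incremental case.

```python
# Hypothetical CEL-style job description; not CEL's real syntax.
cel_job = {
    "inputs": [
        {"table": "flows", "window": "1 minute"},         # fresh data
        {"table": "active_ips", "window": "last output"},  # incremental
    ],
    "schema": {"minute": "timestamp", "src_ip": "inet"},
    "query": """
        SELECT date_trunc('minute', ts) AS minute, src_ip
        FROM flows
        WHERE ts >= now() - interval '1 hour'
        GROUP BY 1, 2
    """,
    "output": "active_ips",
}
```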