Software Design Practices for Large-Scale Automation

Design for Large-Scale Automation 12/30/2015

Ongoing...

Design for large-scale, high-performance, distributed software systems for

complex algorithms such as graph, optimization, prediction, and machine

learning.

Corrections/improvements are very welcome at [email protected] (Hao Xu)

Topics

● Large-scale Automation: Why Challenging?

● Design Principles: Coping with Complexity and Physicality

● Computation Paradigms: HPC, Spark, Tensorflow

● Designs: Logical, Physical, System levels

● Distributed and Iterative Algorithms: Partition, Sync, Iteration Trade-offs

● Smart QA: Protection, Auditing, Debug codes

Design Objectives for Large-scale Automation

● Scalability (growing)

● Extensibility (evolving)

● Performance (fast)

● Maintenance (controllable)

Scalability: Name of the Game

● Electronics simulation: mandatory for simulation software to scale with

Moore’s law

● Internet Applications: systems need to be ready for next 10x user growth and

feature evolution

● Knowledge Base: bigger system improves cross referencing and hence quality

of learning new knowledge

● Deep learning: capacity of system affects quality of latent features learned

and hence the prediction capability

● Internet of Things: as the name suggests...

What makes it difficult? #1 Complexity

● Complexity is the TOP challenge for software engineering

● Usually grows with the scale of the system

○ exhibits different patterns at different scale

○ explodes with the number of software features

● The only way to handle complexity

○ “Divide and Conquer”

○ realized by various Design Principles

What makes it difficult? #2 Physicality

● Software is physical, just like humans

○ Results are stored in physical memory (RAM/ROM/Disk)

○ Computation is done in physical processing units (CPU/GPU/FPGA)

● Not feasible to build one gigantic machine that solves everything

○ System should live on machine farms

○ Data / Computation should be distributed

● Physicality complicates the design of systems

○ Data partition

○ Computation partition

Design Principles

Abstraction and Decoupling

Design Principles: The Philosophy

Design Principles for Coping with Complexity

● Abstraction (Vertical Divide & Conquer)

○ Core Abstractions

○ Hierarchization

● Decoupling (Horizontal Divide & Conquer)

○ Encapsulation

○ Layerization

Decoupling

Centerpiece of large-scale system design

Abstraction

Abstraction: Vertical Divide and Conquer

● Core Abstractions

○ the soul of large-scale systems

○ the root of abstraction hierarchy

○ higher level abstraction = better extensibility

● Hierarchization

○ simplification of system functionality graph

○ ideally mapped into tree structures (no loop)

○ the template for Object Oriented Design

○ need a balance b/w delegation & check

Decoupling: Horizontal Divide and Conquer

● Encapsulation

○ components encapsulate complex logic

○ API design for minimal interface

● Layerization

○ algorithms divided into layers

○ each layer handles a feature/algorithm

■ layer 1: Graph partition and communication

■ layer 2: Graph node property analysis

■ layer 3: User operation on Graph nodes

■ ...

The Priority of Abstractions for Project Management

● Core abstractions (1st Priority)

○ Determines functionality/scalability

● Library abstractions (2nd Priority)

○ Determines performance

● Logic abstractions (low priority)

○ Flows

○ Apps

○ Business logics

Computation Paradigms

Language level, Flow level, System level

Computation Paradigms: The Framework

Computation Paradigms

● What is a Computation Paradigm?

○ Computation abstraction at different levels

○ Offers encapsulation and parallelism at different levels

○ Crucial to choose the right computation paradigm

● Computation Paradigm at different levels

○ Language level: Python, C, Scala

○ Flow level: Imperative, Symbolic, Functional programming

○ System level: Computation-centric (HPC) or Data-centric (e.g. Spark)

Flow level: Imperative Programming

● Imperative Programming: No native abstraction

○ C++ / Python / Java

○ Computation at instruction level

○ Task-level parallelism

Flow level: Functional Programming

● Functional Programming: Data abstraction

○ Scala / MapReduce

○ Immutable, Stateless function

● Pros

○ Offers data-level parallelism

● Cons

○ Data is read-only; an update requires making another copy.

○ More memory consumption and potential performance overhead.
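
The trade-off above can be sketched in a few lines of Python. This is a minimal illustration, not taken from Scala or MapReduce; the names `square` and `with_update` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def square(x):
    # Stateless: the result depends only on the argument.
    return x * x


def with_update(data, index, value):
    # Immutable update: build a new tuple instead of mutating in place.
    return data[:index] + (value,) + data[index + 1:]


data = (1, 2, 3, 4)

# square() touches no shared state, so the elements can be mapped
# in parallel without any locking.
with ThreadPoolExecutor(max_workers=2) as pool:
    squares = tuple(pool.map(square, data))

total = reduce(lambda a, b: a + b, squares, 0)

# The "Cons" in practice: one changed element costs a full copy.
updated = with_update(data, 1, 20)
```

The original `data` is untouched after the update, which is what makes the parallel map safe and what costs the extra memory.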

Flow level: Symbolic Programming

● Symbolic Programming: Operator abstraction

○ Theano / TensorFlow

○ Operator-level parallelism

○ Graph model as base engine

● Pros

○ Offers high operator parallelism through graph propagation

● Cons

○ Not flexible for all programming tasks

○ May incur overhead when handling fine-grained operators
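
To make "operator abstraction" concrete, here is a toy operator graph in plain Python. It is a sketch of the idea only and does not use the Theano or TensorFlow APIs; `Node`, `placeholder` and `evaluate` are hypothetical names.

```python
class Node:
    """One operator in a symbolic graph; nothing is computed at build time."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.name = name

    def __add__(self, other):
        return Node(lambda a, b: a + b, (self, other))

    def __mul__(self, other):
        return Node(lambda a, b: a * b, (self, other))


def placeholder(name):
    # A leaf node whose value is supplied at evaluation time.
    return Node(None, name=name)


def evaluate(node, feed, cache=None):
    # Walk the graph once; shared sub-expressions are computed a single time.
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op is None:
        value = feed[node.name]
    else:
        value = node.op(*(evaluate(i, feed, cache) for i in node.inputs))
    cache[id(node)] = value
    return value


x = placeholder("x")
y = placeholder("y")
z = (x + y) * x          # builds the graph only
result = evaluate(z, {"x": 3, "y": 4})
```

This sketch evaluates sequentially. A real engine can schedule independent operators in parallel because the whole graph is known before execution, and the per-node bookkeeping visible here is the fine-grained overhead mentioned above.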

System level: Computation-Centric System (typical HPC)

● What is HPC

○ HPC is extreme parallel computing

○ Computation Partition

■ Communication delay aware

● Intra-node: L3/L2/L1 caches

● Inter-node interconnect: 100 Gb/s

● Inter-cluster Ethernet: 1 Gb/s + RAM-to-disk time

■ Physical architecture aware

● Register size etc


● Parallel at different levels

○ Multi-threading

○ Multi-process

○ Distributed cluster

○ Mainstream communication: MPI

● Partition based on needs of communication

○ Minimize communication

○ Algorithm partition

○ Data partition


● Exploit Heterogeneous Components

○ GPU acceleration (many small cores)

■ Model is too small; too much overhead; stays on CPU

■ Model is too large; exceeds GPU memory; do partial acceleration

■ Exchange memory with CPU through memory copy

○ FPGA (millions of gates)

○ SSD, RAID 0/1/5/10

● Disk IO

○ HDF5 parallel read/write

System level: Data-Centric System (Spark-like)

● Data partition: Physically distributed central DB

○ Serialization: boost::serialization (C++), pickle (Python)

● Scalable computation

○ Usually has a scheduler

○ Explicit scheduling: user defines computation graph nodes

○ Implicit scheduling: engine analyzes the computation graph

● Stateless

○ Good for debugging; easy to recover from failure

System level: Hybrid Architecture

● Hybrid Architecture Example: TensorFlow

○ Stochastic algorithms → use Data-centric model

■ E.g. Back propagation: Parameter Server

○ Deterministic algorithms → use Computation-centric (HPC) model

■ E.g. Common data sync among model partitions: Bulk Synchronous Parallel

Designs: The Quality

Logical Design

Objectify, Modularize, Standardize

Logical Design

● Objectify everything

○ an object can have multiple copies for parallel computing

○ avoid singleton / global / static variables

○ top level should fall through, should not execute anything
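
A minimal Python sketch of these three rules, using a hypothetical `Counter` class: state lives on instances, there are no module-level variables, and the top level only dispatches to `main()`.

```python
class Counter:
    """State lives on the instance, so each worker can own its own copy."""

    def __init__(self, start=0):
        self.value = start

    def add(self, amount):
        self.value += amount
        return self.value


def main():
    # Two independent copies: neither can corrupt the other,
    # which would not hold for a global or a singleton.
    a = Counter()
    b = Counter()
    a.add(5)
    b.add(7)
    return a.value, b.value


# Top level falls through: importing this module executes nothing.
if __name__ == "__main__":
    print(main())
```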

Logical Design

● Standardize everything

○ Base Class for any task = function(data, parameters, executor_id)

○ schema (base class) for task

○ schema for any data

○ schema for any function

○ schema for any parameter

● Benefits

○ higher level automation

○ potentially more intelligent system
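
One way to read "task = function(data, parameters, executor_id)" is a shared base class, sketched below. The class names and the `execute` driver are assumptions made for illustration; the slides do not specify them.

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Hypothetical uniform schema: every task is run(data, parameters, executor_id)."""

    @abstractmethod
    def run(self, data, parameters, executor_id):
        ...


class ScaleTask(Task):
    def run(self, data, parameters, executor_id):
        factor = parameters["factor"]
        return [x * factor for x in data]


class SumTask(Task):
    def run(self, data, parameters, executor_id):
        return sum(data) + parameters.get("offset", 0)


def execute(tasks, data, parameters, executor_id=0):
    # A generic driver can chain any tasks because they share one signature.
    for task in tasks:
        data = task.run(data, parameters, executor_id)
    return data


result = execute([ScaleTask(), SumTask()], [1, 2, 3], {"factor": 2, "offset": 1})
```

Because the driver never inspects task internals, scheduling, logging or retry logic can be added once in `execute` instead of in every task, which is the "higher level automation" benefit.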

Logical Design

● Modularize everything

○ encapsulate data by using setter / getter

○ encapsulate atomic or repeated functionality

○ #define any hard-coded number

○ factor out long functions or classes

○ build shared libraries from bottom-up

■ communication lib

■ parallel computing lib

■ debug / reporting lib

Physical Design

Code, Memory, Performance

Physical Design: Code

● Source Code

○ component level decouple by folder

○ module level decouple by file

○ variable space decouple by namespace

● Code change

○ physical change (files/folders touched) should reflect logical change

○ change scope should narrow down as development goes

○ diff management

Physical Design: Memory 1)

● Memory is the #1 factor for performance

○ Code runs in memory, not in the air

● OS Memory Handling

○ Memory allocation, fragmentation, release etc

○ TCMalloc VS. jemalloc

■ Improve allocation/fragmentation

■ Still have issues with memory release


● Interpreter Memory Handling

○ Garbage Collection

● Manual Memory Management

○ memory pooling is mandatory

○ memory lifecycle management for any large usage


● Trade-offs

○ Depends on application

■ Memory critical: TCMalloc/jemalloc

■ Memory and performance critical: manual memory management

○ HPC is memory and performance critical

■ Parallelism does not solve every problem. Single-machine performance is

still the dominant factor

■ You should know the code very well to design manual memory management

○ Spark is replacing JVM memory management through Project Tungsten
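
The memory pooling idea from this slide, sketched in Python for readability. A production pool for an HPC engine would normally be written in C or C++; `BufferPool` is a hypothetical name and the fixed buffer size is an assumption.

```python
class BufferPool:
    """Hands out fixed-size bytearrays and reuses them after release."""

    def __init__(self, buffer_size, capacity):
        self.buffer_size = buffer_size
        # Allocate everything up front, once.
        self._free = [bytearray(buffer_size) for _ in range(capacity)]
        self.allocations = capacity

    def acquire(self):
        if self._free:
            return self._free.pop()
        # Pool exhausted: fall back to a fresh allocation and count it.
        self.allocations += 1
        return bytearray(self.buffer_size)

    def release(self, buf):
        # Wipe in place so the same object can be handed out again.
        buf[:] = bytes(self.buffer_size)
        self._free.append(buf)


pool = BufferPool(buffer_size=16, capacity=2)
first = pool.acquire()
first[0] = 1
pool.release(first)
second = pool.acquire()   # the same buffer object, already zeroed
```

The `allocations` counter is the lifecycle hook: if it keeps growing during a run, the pool is undersized or buffers are not being released.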

Physical Design: Performance

● Performance Tuning

○ profiling, profiling, profiling...

○ lazy initialization / write / read

○ cache-aware design

■ cache-friendly data structure

● linked structure locality

■ cache-friendly algorithm

● read / write locality
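
Lazy initialization from the list above can be sketched with the standard library's `functools.cached_property` (Python 3.8+). The `Dataset` class and its `build_count` counter are hypothetical and exist only to show when the work happens.

```python
from functools import cached_property


class Dataset:
    def __init__(self, size):
        self.size = size
        self.build_count = 0

    @cached_property
    def index(self):
        # Built on first access only, then stored on the instance.
        self.build_count += 1
        return {i: i * i for i in range(self.size)}


d = Dataset(4)
before = d.build_count      # nothing has been built yet
value = d.index[3]          # first access triggers the build
again = d.index[2]          # second access reuses the stored result
```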

System Design

Distributed, Parallel, Resilient

System Design

● Scalable Distributed System

○ DB Service: Data and Computation decouple

○ Task/Scheduler: Computation and Execution decouple

○ Query/Queue: Producer and Consumer decouple
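
The producer/consumer decoupling can be illustrated on a single machine with the standard `queue` and `threading` modules. This is a sketch only; in a distributed system the queue would be a separate service, and the `None` sentinel is an assumption of this example.

```python
import queue
import threading


def producer(q, items):
    for item in items:
        q.put(item)          # blocks when the queue is full (back-pressure)
    q.put(None)              # sentinel: no more work


def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * 2)


def run_pipeline(items):
    q = queue.Queue(maxsize=2)   # small bound to show the two sides are decoupled
    results = []
    threads = [
        threading.Thread(target=producer, args=(q, items)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Neither side knows how the other is implemented or how fast it runs; the queue bound is the only contract between them.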

System Design

● DB Service

○ Logically Centralized

■ Parameter Server

○ Physically distributed

■ Only routing / bookkeeping service on Master

■ Master capacity is not an issue

■ Computation locality on Slaves

System Design

● Parallel Computing

○ multi-threading

■ light overhead

■ shared memory, data exchange OK

○ multi-process

■ heavy overhead

■ separated memory space, more difficult data exchange

○ distributed multiple machine

■ balance between computation VS. communication

System Design

● TensorFlow Example

○ Multi-threading: Graph Execution Engine

■ BFS

■ DFS

○ Multi-machine: Graph partition

■ Edge-cut?

■ Vertex-cut?

System Design

● Fault Tolerance

○ Monitor granularity

■ system level: module behavior

■ flow level: major steps

■ algorithm level: major checkpoints

○ Persistence granularity

■ recovery depth

■ recovery contents
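
A small sketch of persistence granularity at the flow level: the state is saved after every major step, so recovery restarts from the last completed step. The function names, the `fail_at` switch and the pickle file format are assumptions made for this illustration.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(steps, state, path, fail_at=None):
    """Run steps in order, persisting (next_step, state) after each one."""
    start = 0
    if os.path.exists(path):
        # Recovery: resume from the last persisted step instead of step 0.
        with open(path, "rb") as f:
            start, state = pickle.load(f)
    for i in range(start, len(steps)):
        if i == fail_at:
            raise RuntimeError(f"simulated failure at step {i}")
        state = steps[i](state)
        with open(path, "wb") as f:
            pickle.dump((i + 1, state), f)
    return state


def demo():
    calls = []

    def step(name, fn):
        def wrapped(state):
            calls.append(name)
            return fn(state)
        return wrapped

    steps = [
        step("load", lambda s: s + 1),
        step("scale", lambda s: s * 10),
        step("shift", lambda s: s - 3),
    ]
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
    try:
        run_with_checkpoints(steps, 0, path, fail_at=2)
    except RuntimeError:
        pass
    # Second run resumes at "shift"; "load" and "scale" are not repeated.
    return run_with_checkpoints(steps, 0, path), calls
```

Recovery depth here is one step and recovery contents are the whole state; saving less often or saving only part of the state trades recovery time against checkpoint cost.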

Distributed and Iterative Algorithms

Partition, Sync, Iterate, Global/Local Optimum

Distributed and Iterative Algorithms: The Lifeblood

Key Issues of Distributed Algorithms

● Data / Model partition

○ inference data partition; graph partition; datastore sharding

● Communication paradigm

○ Spark RDD; MPI; RPC

● Computation locality

○ locality-aware job scheduling; Yarn; Drill

● Parallel algorithm paradigm

○ Map/Reduce; Spark

● Multi-stage distributed flow

Distributed Deterministic Algorithms 1)

● What to sync?

○ what is the key information needed to stitch the pieces together

○ sync data to resemble single machine algorithm (rare but can be useful)

○ keep data local, sync results (map/reduce)

● When to sync?

○ lazy sync (e.g. Bulk Synchronous Parallel)

○ async (e.g. Parameter Server)

● Where to sync?

○ refactor algorithm by optimal sync point
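
"Keep data local, sync results" can be shown with a distributed mean, simulated here in one process. The partitions stand in for workers and the function name `bsp_mean` is hypothetical.

```python
def bsp_mean(partitions):
    # Local phase: each worker computes on its own partition, no communication.
    local = [(sum(p), len(p)) for p in partitions]

    # Sync point: only the small (sum, count) pairs cross the network,
    # never the raw data.
    total = sum(s for s, _ in local)
    count = sum(n for _, n in local)
    return total / count


mean = bsp_mean([[1, 2], [3], [4, 5, 6]])
```

What is synced is the pair (sum, count) rather than each local mean, because local means alone cannot be stitched into the correct global value when partitions differ in size.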

Distributed Deterministic Algorithms 2)

● Trade-offs

○ performance

■ computation VS. communication

○ scalability

■ need scalable communication pattern

■ avoid point-to-point communication

Distributed Approximate Algorithms 1)

● QoR loss in distributed computing

○ for many algorithms, lack of global sync leads to QoR loss

○ full global sync is very expensive in communication cost

○ carefully choose sync points to maximize Performance / QoR Loss

● Self-healing Algorithms

○ some algorithms have less dependency on global sync

○ e.g. in Stochastic Optimization

■ global sync may be postponed to allow local optima to be explored

■ however, this nice feature is data / model dependent


● Major challenges 1)

○ Trade-off on QoR?

■ approximation is inevitable, so what can be approximated?

■ not just an engineering problem

■ usually needs assessment on business impact

○ Solutions

■ for each approximation candidate, detailed profiling of QoR loss VS. Performance Gain VS. Business impact


● Major challenges 2)

○ Hard to maintain?

■ Stochastic Algorithms: hard to find deterministic behavior in probabilistic values

■ Graph algorithms: hard to trace in large-scale graph

○ Solutions

■ develop a single-machine algorithm first as the golden reference

■ detailed testing and correlation for each parallelization step

■ detailed testing to understand result/error pattern on small data
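
A sketch of the golden-reference workflow on small data: a straightforward single-machine variance is the golden, and the partitioned version is correlated against it for several partition counts. All function names and the tolerance are assumptions of this example.

```python
import math


def golden_variance(values):
    # Single-machine reference: simple and easy to trust.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def partitioned_variance(partitions):
    # Each partition reports (count, sum, sum of squares); the results are merged.
    n = sum(len(p) for p in partitions)
    s = sum(sum(p) for p in partitions)
    ss = sum(sum(v * v for v in p) for p in partitions)
    mean = s / n
    return ss / n - mean * mean


def correlate(values, num_partitions, tol=1e-9):
    # Compare the parallel result with the golden on the same input.
    parts = [values[i::num_partitions] for i in range(num_partitions)]
    return math.isclose(
        golden_variance(values),
        partitioned_variance(parts),
        rel_tol=tol,
        abs_tol=tol,
    )


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
checks = [correlate(values, k) for k in (1, 2, 3)]
```

The merged formula is less numerically robust than the golden for values with a large mean, which is exactly the kind of error pattern this correlation step is meant to expose on small data first.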

Distributed Iterative Algorithms 1)

● Many algorithms for large-scale problem are iterative

○ Simulated Annealing; Genetic Algorithm; Graph Partition; PageRank;

Expectation Maximization; Loopy Belief Propagation etc

● Two Common approaches

○ Local computation + lazy Sync

○ Global computation with graph propagation

Distributed Iterative Algorithms 2)

● Distributed environment adds another layer of complexity

○ iterations need to be tuned, or completely re-designed

○ may become harder to converge

● Tuning iterations

○ Again, where to iterate?

■ spend runtime where the QoR gain is largest

■ profiling of iterations VS. QoR gain

○ Tuning knobs for convergence

■ iteration knobs have very high impact on convergence

■ profiling of convergence parameters VS runtime VS QoR
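
PageRank, one of the iterative algorithms listed earlier, makes the convergence knobs visible. The sketch below runs on a single machine; `damping`, `tol` and `max_iter` are the knobs, and their default values are assumptions rather than recommendations.

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """Power iteration; returns (ranks, iterations used)."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for u in nodes:
                    new[u] += damping * rank[v] / n
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:      # convergence knob: stop once the change is small
            break
    return rank, iteration


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
loose_rank, loose_iters = pagerank(links, tol=1e-2)
tight_rank, tight_iters = pagerank(links, tol=1e-8)
```

Comparing `loose_iters` with `tight_iters`, and the two rank vectors with each other, is a small instance of profiling convergence parameters against runtime and QoR.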

Multi-stage Distributed Flow

● Data re-partition problem (“Shuffle” in Spark Language)

“In these distributed computation engines, the shuffle refers to the

repartitioning and aggregation of data during an all-to-all operation.

Understandably, most performance, scalability, and reliability issues that we

observe in production Spark deployments occur within the shuffle.”

http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double

Multi-stage Distributed Flow 1)

● Data re-partition problem (“Shuffle” in Spark Language)

○ unified partition VS. per-stage partition

■ per-stage partition fits algorithm better, but requires data

migration

○ global partition VS. stream partition

■ global partition fits algorithm better, but requires single machine to

hold all data for partition

■ stream partition + post-partition adjustment
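
One possible reading of "stream partition + post-partition adjustment", sketched in Python. Hash assignment and size-only rebalancing are assumptions of this example; a real adjustment step would also weigh locality or graph edges.

```python
import zlib


def stream_partition(records, k):
    """Assign each record as it arrives; no machine needs the full data set."""
    parts = [[] for _ in range(k)]
    for key in records:
        # crc32 gives a hash that is stable across runs and machines.
        parts[zlib.crc32(key.encode()) % k].append(key)
    return parts


def adjust(parts):
    """Move records from the largest to the smallest partition
    until partition sizes differ by at most one."""
    parts = [list(p) for p in parts]
    while True:
        big = max(parts, key=len)
        small = min(parts, key=len)
        if len(big) - len(small) <= 1:
            return parts
        small.append(big.pop())


records = [f"user{i}" for i in range(20)]
balanced = adjust(stream_partition(records, 3))
```

The adjustment is where data migration cost appears: every moved record is traffic that a global partition would have avoided, at the price of needing all data in one place first.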

Multi-stage Distributed Flow 2)

● Data re-partition problem (“Shuffle” in Spark Language)

○ QoR numerical dependence on the number of partitions

■ direct partitioning has a numerical stability problem

■ fine-grained partition + post-partition coarsening is better

● Solutions

○ Hard to use a standard library for a high-performance system

○ The best-performing system is customized for:

■ Data volume

■ Computation intensity

■ (Multiple-stage) Algorithm parallelism

○ Always keep a golden single-machine run, even for small input data!

Smart QA

cannot fix a bug unless you can reproduce it

cannot build a system unless you can test it

…...

Smart QA: The Guardian

Smart QA: Why

● Successful software must have good QA

○ A high level model of the system

○ Save time in debug

○ Save business in crisis

● Throughout Software Lifecycle

○ Development: test-driven development

○ Deployment: handles discrepancy b/w user env and dev env

○ Maintenance: predicts error, learns from failures, improves system

Protection Code

● Assert / Try, Except / Raise…

● Good to have:

○ Cases run through

○ Information on internal data, sometimes

● Too much of it?

○ hurts performance

● Need a balance

○ Input of external data → sanity check

○ Internal data → no checks inside the high-performance engine; system design and code

should guarantee its correctness
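
The balance described above, sketched for a hypothetical edge-list loader: external input is validated once at the boundary, and the inner loop runs without checks. The function names and the row format are assumptions of this example.

```python
def load_edges(rows):
    """Boundary: sanity-check external data and fail with a clear message."""
    edges = []
    for line_no, row in enumerate(rows, start=1):
        if len(row) != 2:
            raise ValueError(f"row {line_no}: expected 2 fields, got {len(row)}")
        try:
            src, dst = int(row[0]), int(row[1])
        except ValueError as exc:
            raise ValueError(f"row {line_no}: non-integer node id") from exc
        if src < 0 or dst < 0:
            raise ValueError(f"row {line_no}: negative node id")
        edges.append((src, dst))
    return edges


def out_degree(edges, num_nodes):
    # Engine: trusts its input, so the hot loop carries no checks.
    degree = [0] * num_nodes
    for src, _ in edges:
        degree[src] += 1
    return degree


edges = load_edges([["0", "1"], ["1", "2"], ["0", "2"]])
degrees = out_degree(edges, 3)
```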

Auditing Code

● Check correctness from another angle

○ Rule based

■ Simply add up the numbers to see whether they match

■ Use another, simpler algorithm as a rough check

○ Data driven

■ Sample intermediate data from normal runs and issue an alert when the

runtime data distribution differs
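
Both auditing styles can be sketched briefly. The conservation rule and the three-sigma drift threshold below are illustrative choices, not prescribed by the slides, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def audit_conservation(inputs, partitions, tol=1e-9):
    """Rule based: partition totals must add up to the input total."""
    return abs(sum(inputs) - sum(sum(p) for p in partitions)) <= tol


def audit_distribution(baseline, sample, max_sigma=3.0):
    """Data driven: pass only if the sample mean stays close to the
    mean observed in normal runs."""
    sigma = pstdev(baseline)
    if sigma == 0:
        return mean(sample) == mean(baseline)
    return abs(mean(sample) - mean(baseline)) <= max_sigma * sigma


baseline = [10, 11, 9, 10, 10, 12, 8]     # sampled from normal runs
ok_rule = audit_conservation([1, 2, 3, 4], [[1, 2], [3, 4]])
lost_data = audit_conservation([1, 2, 3, 4], [[1, 2], [3]])
ok_drift = audit_distribution(baseline, [10, 11, 9])
alert = not audit_distribution(baseline, [50, 52])
```

Neither check knows how the main algorithm works, which is the point of auditing from another angle: they can catch errors that the algorithm's own assertions share a blind spot for.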

Debug Code

● As important as functional code! (if not more)

● Essentially a high level abstraction on code OUTPUT

○ Not just debug

○ A reversed tree structure, with samples on key nodes

○ Grows intelligently with field practice

● Maintenance effort should decrease over time

○ Error handling/messaging system should mature through time

○ Bugs should be fixed at the root cause, not just worked around