Parallel Zero-Copy Algorithms for Fast Fourier
Transform and Conjugate Gradient using MPI
Datatypes
Torsten Hoefler, Steven Gottlieb
EuroMPI 2010, Stuttgart, Germany, Sep. 13th 2010
Quick MPI Datatype Introduction
• (de)serialize arbitrary data layouts into a message stream
– Contig., Vector, Indexed, Struct, Subarray,
even Darray (HPF-like distributed arrays)
• Recursive specification possible
– Declarative specification of data-layout
• “what” and not “how”, leaves optimization to
implementation (many unexplored possibilities!)
– Arbitrary data permutations (with Indexed); a minimal usage sketch follows
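As a concrete illustration of this declarative style (not from the slides), here is a minimal sketch that describes one column of a row-major matrix with MPI_Type_vector and sends it in place; the receiver sees a plain contiguous buffer. Array sizes, ranks, and the column index are arbitrary.

#include <mpi.h>

/* Minimal sketch (illustrative): send one column of a row-major
 * NROWS x NCOLS double matrix without manual packing. */
#define NROWS 4
#define NCOLS 8

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double a[NROWS][NCOLS];
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++) a[i][j] = i * NCOLS + j;

    MPI_Datatype column;
    /* NROWS blocks of 1 double, separated by a stride of NCOLS doubles */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD); /* column 2 */
    } else if (rank == 1) {
        double col[NROWS];                 /* arrives contiguously */
        MPI_Recv(col, NROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}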
Datatype Terminology
• Size
– Size of DDT signature (total occupied bytes)
– Important for matching (signatures must match)
• Lower Bound
– Where the DDT starts
• Allows specifying “holes” at the beginning
• Extent
– Span of the DDT (lower bound to upper bound)
• Allows interleaving DDTs, relatively “dangerous” (see the sketch below)
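The size/extent distinction is easiest to see in code. The following sketch (illustrative, not from the paper; all counts and strides arbitrary) queries both for a strided vector, then shrinks the extent with MPI_Type_create_resized so that consecutive instances of the type interleave.

#include <mpi.h>
#include <stdio.h>

/* Sketch: size vs. extent of a derived datatype. A vector of 3
 * doubles at stride 4 has size 3*8 = 24 bytes of signature, but its
 * extent spans (4*2+1)*8 = 72 bytes. Resizing the extent to one
 * double lets instances of the type interleave when count > 1. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Datatype vec, interleaved;
    MPI_Type_vector(3, 1, 4, MPI_DOUBLE, &vec);

    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(vec, &size);               /* 24 bytes */
    MPI_Type_get_extent(vec, &lb, &extent);  /* 72 bytes */
    printf("size=%d extent=%ld\n", size, (long)extent);

    /* shrink the extent to sizeof(double): count > 1 now interleaves */
    MPI_Type_create_resized(vec, 0, sizeof(double), &interleaved);
    MPI_Type_commit(&interleaved);

    MPI_Type_free(&vec);
    MPI_Type_free(&interleaved);
    MPI_Finalize();
    return 0;
}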
What is Zero Copy?
• Somewhat weak terminology
– MPI forces “remote” copy
• But:
– MPI implementations copy internally
• E.g., networking stack (TCP), packing DDTs
• Zero-copy is possible (RDMA, I/O Vectors)
– MPI applications copy too often
• E.g., manual pack, unpack or data rearrangement
• DDT can do both! (see the contrast sketched below)
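To make the "applications copy too often" point concrete, here is a hedged sketch contrasting a manual pack loop with an in-place send through a derived datatype. The strided layout, function names, and tag are invented for illustration only.

#include <mpi.h>

/* Illustrative: send every other element of a buffer of n doubles. */

/* (a) application-side copy: pack into a scratch buffer, then send */
void send_copy(double *buf, double *scratch, int n, int dst, MPI_Comm comm) {
    for (int i = 0; i < n / 2; i++) scratch[i] = buf[2 * i];
    MPI_Send(scratch, n / 2, MPI_DOUBLE, dst, 0, comm);
}

/* (b) no application copy: describe the layout and send in place,
 * leaving any internal packing or RDMA decisions to the library */
void send_ddt(double *buf, int n, int dst, MPI_Comm comm) {
    MPI_Datatype every_other;
    MPI_Type_vector(n / 2, 1, 2, MPI_DOUBLE, &every_other);
    MPI_Type_commit(&every_other);
    MPI_Send(buf, 1, every_other, dst, 0, comm);
    MPI_Type_free(&every_other);
}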
Purpose of this Paper
• Demonstrate utility of DDT in practice
– Early implementations were bad → folklore
– Some are still bad → chicken+egg problem
• Show creative use of DDTs
– Encode local transpose for FFT
• Create realistic benchmark cases
– Guide optimization of DDT implementations
2d-FFT State of the Art
2d-FFT Optimization Possibilities
1. Use DDT for pack/unpack (obvious)
– Eliminate 4 of 8 steps
• Introduce local transpose
2. Use DDT for local transpose
– After unpack
– Non-intuitive way of using DDTs
• Eliminate local transpose
The Send Datatype
1. Type_struct for complex numbers
2. Type_contiguous for blocks
3. Type_vector for stride
• Need to change extent to allow overlap (create_resized)
– Three hierarchy layers (sketched below)
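One possible rendering of this three-layer construction in C, assuming a row slab decomposition in which each rank owns N/P rows of N complex numbers and sends an (N/P)×(N/P) block to every peer during the transpose. The function name and exact parameters are illustrative, not the paper's code.

#include <mpi.h>
#include <stddef.h>

/* Sketch of the three-layer send datatype, under the row-slab
 * assumption stated above (nloc = N/P rows of N complex per rank). */
MPI_Datatype make_fft_send_type(int N, int P) {
    int nloc = N / P;

    /* 1. Type_struct for a complex number (re, im) */
    MPI_Datatype cplx;
    int          bl[2]   = {1, 1};
    MPI_Aint     disp[2] = {0, sizeof(double)};
    MPI_Datatype ty[2]   = {MPI_DOUBLE, MPI_DOUBLE};
    MPI_Type_create_struct(2, bl, disp, ty, &cplx);

    /* 2. Type_contiguous for a block: one destination's share of a row */
    MPI_Datatype block;
    MPI_Type_contiguous(nloc, cplx, &block);

    /* 3. Type_vector for the stride: one block from each local row;
     *    a full row is P blocks long */
    MPI_Datatype strided;
    MPI_Type_vector(nloc, 1, P, block, &strided);

    /* change the extent to one block so that consecutive send types
     * (one per destination) overlap/interleave in the local array */
    MPI_Aint lb, block_ext;
    MPI_Type_get_extent(block, &lb, &block_ext);
    MPI_Datatype send_type;
    MPI_Type_create_resized(strided, 0, block_ext, &send_type);
    MPI_Type_commit(&send_type);

    MPI_Type_free(&cplx);
    MPI_Type_free(&block);
    MPI_Type_free(&strided);
    return send_type;
}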
The Receive Datatype
– Type_struct (complex)
– Type_vector (no contiguous, local transpose)
• Needs to change extent (create_resized); see the sketch below
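Under the same assumptions as the send-type sketch above, the receive datatype might look as follows: a vector (no contiguous layer) that scatters each incoming run down a local column, with its extent resized to one complex number, so the local transpose happens while unpacking. Again illustrative, not the paper's code.

#include <mpi.h>
#include <stddef.h>

/* Sketch of the receive datatype (row slab, nloc = N/P). */
MPI_Datatype make_fft_recv_type(int N, int P) {
    int nloc = N / P;

    /* Type_struct for a complex number (re, im) */
    MPI_Datatype cplx;
    int          bl[2]   = {1, 1};
    MPI_Aint     disp[2] = {0, sizeof(double)};
    MPI_Datatype ty[2]   = {MPI_DOUBLE, MPI_DOUBLE};
    MPI_Type_create_struct(2, bl, disp, ty, &cplx);

    /* Type_vector, no contiguous layer: nloc complex numbers placed
     * one per local row (stride = N complex), i.e. down a column */
    MPI_Datatype col;
    MPI_Type_vector(nloc, 1, N, cplx, &col);

    /* resize the extent to one complex so the next instance of the
     * type starts in the next column */
    MPI_Datatype recv_type;
    MPI_Type_create_resized(col, 0, (MPI_Aint)(2 * sizeof(double)), &recv_type);
    MPI_Type_commit(&recv_type);

    MPI_Type_free(&cplx);
    MPI_Type_free(&col);
    return recv_type;
}

With both types, the transpose could then be a single call such as MPI_Alltoall(in, 1, send_type, out, N/P, recv_type, comm): both signatures carry (N/P)² complex numbers per peer, and no separate pack, unpack, or local-transpose step remains.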
Experimental Evaluation
• Odin @ IU
– 128 compute nodes, 2x2 Opteron 1354 2.1 GHz
– SDR InfiniBand (OFED 1.3.1).
– Open MPI 1.4.1 (openib BTL), g++ 4.1.2
• Jaguar @ ORNL
– 150152 compute nodes, 2.1 GHz Opteron
– Torus network (SeaStar).
– CNL 2.1, Cray Message Passing Toolkit 3
• All compiled with “-O3 -mtune=opteron”
Strong Scaling - Odin (8000²)
• 4 runs, report smallest time, <4% deviation
• [Plot annotations: reproducible peak at P=192; scaling stops w/o datatypes]
Strong Scaling – Jaguar (20k²)
• [Plot annotations: scaling stops w/o datatypes; DDTs increase scalability]
Negative Results
• Blue Print - Power5+ system
– POE/IBM MPI Version 5.1
– Slowdown of 10%
– Did not pass correctness checks
• Eugene - BG/P at ORNL
– Up to 40% slowdown
– Passed correctness check
Example 2: MIMD Lattice Computation
• Gain deeper insights into fundamental laws of physics
• Determine the predictions of lattice field theories (QCD & Beyond Standard Model)
• Major NSF application
• Challenge:
– High accuracy (computationally intensive) required for comparison with results from experimental programs in high energy & nuclear physics
Performance Modeling and Simulation on Blue Waters
Communication Structure
• Nearest neighbor communication
– 4d array → 8 directions
– State of the art: manual pack on send side
• Index list for each element (very expensive)
– In-situ computation on receive side
• Multiple different access patterns
– su3_vector, half_wilson_vector, and su3_matrix
– Even and odd (checkerboard layout)
– Eight directions
– 48 contig/hvector DDTs total (stored in 3d array); one such type is sketched below
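A hedged sketch of what one of these face datatypes could look like, assuming double precision and boundary sites that form regular-strided runs through the site array; MILC's actual even/odd layout and the three element types are more involved, which is why 48 distinct types are needed.

#include <mpi.h>

/* Illustrative only: replace a per-element index-list pack with a
 * derived datatype for one lattice face, assuming the boundary sites
 * form runs at a fixed byte stride through the site array. */
MPI_Datatype make_face_type(int sites_per_run, int nruns,
                            MPI_Aint run_stride_bytes) {
    MPI_Datatype elem, run, face;

    /* one su3_vector as 3 complex doubles = 6 contiguous doubles
     * (double precision assumed) */
    MPI_Type_contiguous(6, MPI_DOUBLE, &elem);

    /* a contiguous run of boundary sites */
    MPI_Type_contiguous(sites_per_run, elem, &run);

    /* hvector: nruns such runs at a fixed byte stride (byte stride,
     * since the gap need not be a multiple of the element extent) */
    MPI_Type_create_hvector(nruns, 1, run_stride_bytes, run, &face);
    MPI_Type_commit(&face);

    MPI_Type_free(&elem);
    MPI_Type_free(&run);
    return face;
}

Such a type can be handed directly to a send call, so the per-element gather loop disappears; a matching receive-side type can land the elements where the solver expects them.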
MILC Performance Model
• Designed for Blue Waters
– Predicts performance at 300,000+ cores
– Based on the Power7 MR testbed
– Models manual pack overheads
• >10% pack time, >15% for small L
Experimental Evaluation
• Weak scaling with L=4⁴ per process
– Equivalent to NSF Petascale Benchmark on Blue Waters
• Investigate Conjugate Gradient phase
– Is the dominant phase in large systems
• Performance measured in MFlop/s
– Higher is better
MILC Results - Odin
• 18% speedup!
MILC Results - Jaguar
• Nearly no speedup (even 3% decrease)
Conclusions
• MPI Datatypes allow zero-copy
– Up to a factor of 3.8 or 18% speedup!
– Requires some implementation effort
• Tool support for datatypes would be great!
– Declaration and extent tricks make it hard to debug
• Some MPI DDT implementations are slow
– Some nearly surreal
– We define benchmarks to solve the chicken+egg problem
Acknowledgments & Support
• Thanks to Bill Gropp, Jeongnim Kim, Greg Bauer
• Sponsored by
Backup Slides
2d-FFT State of the Art