Enabling Rapid Development of Parallel Tree-Search Applications: Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis
Jeffrey P. Gardner, Andrew Connolly, Cameron McBride
Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University


Page 1:

Enabling Rapid Development of Parallel Tree-Search Applications

Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis

Jeffrey P. Gardner, Andrew Connolly, Cameron McBride

Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University

Page 2:

How to turn astrophysics simulation output into scientific knowledge

Step 1: Run simulation

Step 2: Analyze simulation on workstation

Step 3: Extract meaningful scientific knowledge

Using 300 processors (circa 1995): happy scientist

Page 3:

How to turn astrophysics simulation output into scientific knowledge

Step 1: Run simulation

Step 2: Analyze simulation on server (in serial)

Step 3: Extract meaningful scientific knowledge

Using 1000 processors (circa 2000): happy scientist

Page 4:

How to turn astrophysics simulation output into scientific knowledge

Step 1: Run simulation

Step 2: Analyze simulation on ???

Using 4000+ processors (circa 2006): unhappy scientist [the analysis step is crossed out]

Page 5:

Exploring the Universe can be (Computationally) Expensive

The size of simulations is no longer limited by computational power

It is limited by the parallelizability of data analysis tools

This situation will only get worse in the future.

Page 6:

How to turn astrophysics simulation output into scientific knowledge

Step 1: Run simulation

Step 2: Analyze simulation on ???

Using ~1,000,000 cores? (circa 2012) [the analysis step is crossed out]

By 2012, we will have machines with many hundreds of thousands of cores!

Page 7:

The Challenge of Data Analysis in a Multiprocessor Universe

Parallel programs are difficult to write!
• Steep learning curve for parallel programming

Parallel programs are expensive to write!
• Lengthy development time

Parallel world is dominated by simulations:
• Code is often reused for many years by many people.
• Therefore, you can afford to invest lots of time writing the code.
• Example: GASOLINE (a cosmology N-body code) required 10 FTE-years of development.

Page 8:

The Challenge of Data Analysis in a Multiprocessor Universe

Data Analysis does not work this way:
• Rapidly changing scientific inquiries
• Less code reuse

Simulation groups do not even write their analysis code in parallel!

Data Mining paradigm mandates rapid software development!

Page 9:

How to turn observational data into scientific knowledge

Step 1: Collect data

Step 2: Analyze data on workstation

Step 3: Extract meaningfulscientific knowledge

Observe at telescope (circa 1990): happy astronomer

Page 10:

Use Sky Survey Data (circa 2005)

How to turn observational data into scientific knowledge

Step 1: Collect data

Step 2: Analyze data on ???

Sloan Digital Sky Survey (500,000 galaxies)

[The analysis step is crossed out: unhappy astronomer.]

3-point correlation function: ~200,000 node-hours of computation

Page 11:

Use Sky Survey Data (circa 2012)

How to turn observational data into scientific knowledge

Large Synoptic Survey Telescope (2,000,000 galaxies)

3-point correlation function: ~several petaflop-weeks of computation

Page 12:

Tightly-Coupled Parallelism (what this talk is about)

Data and computational domains overlap

Computational elements must communicate with one another

Examples:
• Group finding
• N-point correlation functions
• New object classification
• Density estimation

Page 13:

The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe

Build a library that is:
• Sophisticated enough to take care of all of the nasty parallel bits for you.
• Flexible enough to be used for your own particular astrophysics data analysis application.
• Scalable: scales well to thousands of processors.

Page 14:

The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe

Astrophysics uses dynamic, irregular data structures:
• Astronomy deals with point-like data in an N-dimensional parameter space.
• The most efficient methods on these kinds of data use space-partitioning trees; the most common data structure is a kd-tree (a schematic node layout is sketched below).

Goal: build a targeted library for distributed-memory kd-trees that is scalable to thousands of processing elements.
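To make the data structure concrete, here is a minimal C sketch (not from the talk, and not N tropy's actual layout) of a kd-tree node over point-like data; all type and field names are hypothetical:

    /* Hypothetical kd-tree node over point-like data in an
     * NDIM-dimensional parameter space. Each interior node splits
     * space along one dimension; leaves hold a small bucket of
     * particles. */
    #define NDIM 3

    typedef struct {
        double pos[NDIM];   /* particle coordinates */
        double mass;
    } Particle;

    typedef struct KDNode {
        double bnd_min[NDIM], bnd_max[NDIM]; /* bounding box */
        int    split_dim;                    /* splitting dimension */
        double split_val;                    /* splitting coordinate */
        struct KDNode *left, *right;         /* NULL at leaves */
        Particle *bucket;                    /* leaf particles */
        int       nbucket;
    } KDNode;

In a distributed-memory setting, the library would additionally have to name tree cells globally (e.g., by index) so that remote cells can be fetched and cached, which is what the DSM layer described later provides.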

Page 15:

Challenges for scalable parallel application development:

Things that make parallel programs difficult to write:
• Work orchestration
• Data management

Things that inhibit scalability:
• Granularity (synchronization, consistency)
• Load balancing
• Data locality

(Slide annotations: structured data, memory consistency)

Page 16:

Overview of existing paradigms: DSM

There are many existing distributed shared-memory (DSM) tools.

Compilers: UPC, Co-Array Fortran, Titanium, ZPL, Linda

Libraries: Global Arrays, TreadMarks, IVY, JIAJIA, Strings, Mirage, Munin, Quarks, CVM

Page 17:

Overview of existing paradigms: DSM

The Good: These are quite simple to use.

The Good: Can manage data locality pretty well.

The Bad: Existing DSM approaches tend not to scale very well because of fine granularity.

The Ugly: Almost none support structured data (like trees).

Page 18:

Overview of existing paradigms: DSM

There are some DSM approaches that do lend themselves to structured data: e.g. Linda (tuple-space)

The Good: Almost universally flexible.
The Bad: These tend to scale even worse than simple unstructured DSM approaches; the granularity is too fine.

Page 19:

Challenges for scalable parallel application development:

Things that make parallel programs difficult to write:
• Work orchestration
• Data management

Things that inhibit scalability:
• Granularity
• Load balancing
• Data locality

(Slide annotation: DSM)

Page 20:

Overview of existing paradigms: RMI

"Remote Method Invocation"

rmi_broadcast(…, (*myFunction));

[Diagram: a master thread works through a computational agenda, broadcasting myFunction() through the RMI layer to Proc. 0 through Proc. 3; myFunction() is coarsely grained.]
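To make the calling pattern concrete, here is a toy single-process C mock (an illustration only; the real rmi_broadcast signature is not given in the talk, so everything here is an assumption) showing how a coarse-grained function is broadcast for execution on every processor:

    /* Toy mock of the RMI broadcast pattern. A real RMI layer would
     * invoke myFunction() on remote processors; this stand-in just
     * loops over "processors" locally. */
    #include <stdio.h>

    #define NPROCS 4

    typedef void (*rmi_func_t)(int proc_id, void *arg);

    /* Hypothetical stand-in for the real broadcast call. */
    static void rmi_broadcast(void *arg, rmi_func_t fn)
    {
        for (int p = 0; p < NPROCS; p++)
            fn(p, arg);  /* remotely invoked in a real RMI layer */
    }

    static void myFunction(int proc_id, void *arg)
    {
        (void)arg;
        printf("myFunction() running on proc %d\n", proc_id);
    }

    int main(void)
    {
        rmi_broadcast(NULL, myFunction);
        return 0;
    }

The key property the slide emphasizes is that myFunction() is coarsely grained: each invocation carries enough work to amortize the cost of the remote call.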

Page 21:

RMI Performance Features

• Coarse granularity
• Thread virtualization: queue many instances of myFunction() on each physical thread; the RMI infrastructure can migrate these instances to achieve load balancing.

Page 22:

Overview of existing paradigms: RMI

RMI can be language-based (Java, CHARM++) or library-based (RPC, ARMI).

Page 23:

Challenges for scalable parallel application development:

Things that make parallel programs difficult to write:
• Work orchestration
• Data management

Things that inhibit scalability:
• Granularity
• Load balancing
• Data locality

(Slide annotation: RMI)

Page 24:

N tropy: A Library for Rapid Development of kd-tree Applications

No existing paradigm gives us everything we need.

Can we combine existing paradigms beneath a simple, yet flexible API?

Page 25:

N tropy: A Library for Rapid Development of kd-tree Applications

• Use RMI for orchestration
• Use DSM for data management
• The implementation of both is targeted towards astrophysics

Page 26:

A Simple N tropy Example: N-body Gravity Calculation

Cosmological "N-body" simulation:
• 100,000,000 particles
• 1 TB of RAM

[Figure: simulation volume, 100 million light years across, domain-decomposed across a 3x3 grid of processors (Proc 0 through Proc 8).]

Page 27:

A Simple N tropy Example: N-body Gravity Calculation

ntropy_Dynamic(…, (*myGravityFunc));

[Diagram: a master thread works through a computational agenda of particles on which to calculate the gravitational force (P1, P2, … Pn); the N tropy master RMI layer dispatches myGravityFunc() to an N tropy thread on each of Proc. 0 through Proc. 3.]
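As a hedged sketch (the actual ntropy_Dynamic signature is not shown in the talk), here is a toy serial mock of the pattern: the computational agenda is a list of particles, and each work element applies myGravityFunc() to one of them. In the real library these invocations would run on remote threads:

    /* Toy mock of the ntropy_Dynamic() work-dispatch pattern.
     * All signatures here are assumptions for illustration. */
    #include <stdio.h>

    typedef void (*work_func_t)(int particle_id);

    /* Hypothetical stand-in: dispatch one work element per particle.
     * N tropy would distribute these across processors instead. */
    static void ntropy_Dynamic(int n_particles, work_func_t fn)
    {
        for (int i = 0; i < n_particles; i++)
            fn(i);
    }

    static void myGravityFunc(int particle_id)
    {
        /* A real implementation would walk the distributed kd-tree
         * to accumulate the gravitational force on this particle. */
        printf("computing force on particle %d\n", particle_id);
    }

    int main(void)
    {
        ntropy_Dynamic(10, myGravityFunc);
        return 0;
    }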

Page 28:

A Simple N tropy Example: N-body Gravity Calculation

Cosmological "N-body" simulation:
• 100,000,000 particles
• 1 TB of RAM

To resolve the gravitational force on any single particle requires the entire dataset.

[Figure: simulation volume, 100 million light years across, domain-decomposed across a 3x3 grid of processors (Proc 0 through Proc 8).]

Page 29:

A Simple N tropy Example: N-body Gravity Calculation

[Diagram: on each of Proc. 0 through Proc. 3, myGravityFunc() runs in an N tropy thread above the RMI layer, while the N tropy DSM layer underneath presents the distributed kd-tree (nodes 0 through 14) as shared data: work above, data below.]

Page 30:

N tropy Performance Features

DSM allows performance features to be provided "under the hood":
• Interprocessor data caching for both reads and writes: fewer than 1 in 100,000 off-PE requests actually result in communication.
• Updates through the DSM interface must be commutative: a relaxed memory model allows multiple writers with no overhead, with consistency enforced through global synchronization (illustrated below).
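A small illustration (not N tropy code) of why commutativity is what makes the relaxed model safe: updates such as accumulation give the same final value in any order, so the DSM can buffer them locally per writer and merge them at the next global synchronization:

    /* Commutative updates: any permutation of these additions yields
     * the same result, which is what lets a relaxed DSM defer and
     * reorder them between synchronization points. */
    #include <stdio.h>

    int main(void)
    {
        double force = 0.0;
        /* Contributions arriving from three different writers. */
        double updates[] = { 0.5, -1.25, 2.0 };

        for (int i = 0; i < 3; i++)
            force += updates[i];

        printf("accumulated force: %g\n", force);
        return 0;
    }

A non-commutative update (say, overwriting with the latest value) would instead require ordering guarantees, and hence fine-grained synchronization.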

Page 31:

N tropy Performance Features

RMI allows further performance features:
• Thread virtualization: divide the workload into many more pieces than physical threads.
• Dynamic load balancing is achieved by migrating work elements as the computation progresses (see the sketch below).
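A toy pthreads illustration (again, not N tropy's implementation) of thread virtualization: the agenda is split into far more pieces than physical threads, and each idle thread pulls the next piece, so faster threads naturally absorb more work:

    /* Over-decomposition: NPIECES work elements shared by NTHREADS
     * physical threads via an atomic counter. Compile with -pthread. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NPIECES  64   /* many more pieces than threads */
    #define NTHREADS 4    /* physical threads */

    static atomic_int next_piece;

    static void *worker(void *arg)
    {
        int tid = *(int *)arg;
        int piece;
        /* Each thread grabs pieces until the agenda is exhausted. */
        while ((piece = atomic_fetch_add(&next_piece, 1)) < NPIECES)
            printf("thread %d processes piece %d\n", tid, piece);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        int ids[NTHREADS];

        for (int t = 0; t < NTHREADS; t++) {
            ids[t] = t;
            pthread_create(&threads[t], NULL, worker, &ids[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);
        return 0;
    }

N tropy's migration of work elements between processors generalizes this idea to distributed memory.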

Page 32:

N tropy Performance
(10 million particles, spatial 3-point correlation function, 3 to 4 Mpc)

[Scaling plot comparing three configurations: no interprocessor data cache and no load balancing; interprocessor data cache without load balancing; interprocessor data cache with load balancing.]

Page 33:

Why does the data cache make such a huge difference?

[Diagram: myGravityFunc() on Proc. 0 traversing the distributed kd-tree (nodes 0 through 14).]

Page 34:

N tropy “Meaningful” Benchmarks

The purpose of this library is to minimize development time!

Development time for:
1. Parallel N-point correlation function calculator: 2 years -> 3 months
2. Parallel Friends-of-Friends group finder: 8 months -> 3 weeks

Page 35:

Conclusions

Most approaches for parallel application development rely on providing a single paradigm in the most general possible manner.
• Many scientific problems tend not to map well onto single paradigms.
• Providing an ultra-general single paradigm inhibits scalability.

Page 36:

Conclusions

Scientists often borrow from several paradigms and implement them in a restricted and targeted manner.

Almost all current HPC programs are written in MPI ("paradigm-less"): MPI is a "lowest common denominator" upon which any paradigm can be imposed.

Page 37:

Conclusions

N tropy provides:

• Remote Method Invocation (RMI)
• Distributed Shared Memory (DSM)
• An implementation of these paradigms that is "lean and mean", targeted specifically for the problem domain.

This approach successfully enables astrophysics data analysis:
• Substantially reduces application development time
• Scales to thousands of processors

More information: go to Wikipedia and search "Ntropy".