From the latency to the throughput age

Prof. Jesús Labarta
Director, Computer Science Dept (BSC), UPC

ETP4HPC Post-H2020 HPC Vision
Frankfurt, June 24th 2018




To exascale ... and beyond


Vision

The multicore and memory revolution
– ISA leak …
– Plethora of architectures
  • Heterogeneity
  • Memory hierarchies

Complexity + variability = Divergence
– Between our mental models and actual system behavior

[Figure: Applications on top of the ISA / API interface]

The power wall made us go multicore and made the ISA interface leak: our world is shaking

What do programmers need? HOPE !!!


Vision

• … similar effect at system level/coarse grain
  • Plethora of architectures
  • Heterogeneity
  • Memory hierarchies
• New usage practices
  • Online simulation, analytics and visualization
  • Interactive supercomputing, response time
  • Value based computing
  • Urgent computing
• Important
  • Integration of concurrency and data
  • Dynamic resource sharing

[Figure labels: Simulation1, Simul2, data1, data2]

BSC vision. BDEC. Fukuoka. Feb 2014


Evolution vs. revolution

• Revolutions
  • Change of mindset (before / after)

• Do we think outside the box ?


Do we think outside the box ?

• Very strong walls in the HPC box !!!
• Sometimes we try to blow them up
• But the walls are in our mind !!!


Do we think outside the box ?

• We do (I may be exaggerating … or maybe not that much)
  • Proudly show the performance we achieve and not the code we write
  • Use variables about resources (cores, GPUs)
    • omp_get_num_threads(), …
  • Run sequences of jobs with 5K cores because each of them takes 20% less time than with 2K cores
  • Believe that overlap == changing sends to isends or using one-sided calls
  • Burn a million hours to estimate a good configuration
  • Integrate simulation, analytics, visualization in a single MPI binary
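
The 5K-versus-2K-core habit can be made concrete with a little arithmetic. The sketch below is only an illustration: the 100-minute baseline runtime and the 10,000-core machine are hypothetical numbers, and only the "20% less time" ratio comes from the slide.

```python
# Hypothetical numbers: only the "20% less time" ratio comes from the slide.
BASE_RUNTIME_MIN = 100                            # assumed runtime of one job on 2K cores
FAST_RUNTIME_MIN = BASE_RUNTIME_MIN * 80 // 100   # 20% less time on 5K cores
MACHINE_CORES = 10_000                            # assumed machine size


def core_minutes(cores, runtime_min):
    """Resources consumed by one job."""
    return cores * runtime_min


def jobs_per_hour(machine_cores, cores_per_job, runtime_min):
    """Throughput when the machine is filled with identical jobs."""
    concurrent_jobs = machine_cores // cores_per_job
    return concurrent_jobs * 60 / runtime_min


cost_2k = core_minutes(2_000, BASE_RUNTIME_MIN)
cost_5k = core_minutes(5_000, FAST_RUNTIME_MIN)
throughput_2k = jobs_per_hour(MACHINE_CORES, 2_000, BASE_RUNTIME_MIN)
throughput_5k = jobs_per_hour(MACHINE_CORES, 5_000, FAST_RUNTIME_MIN)

print(f"cost per job:  2K cores = {cost_2k}, 5K cores = {cost_5k} core-minutes")
print(f"jobs per hour: 2K cores = {throughput_2k}, 5K cores = {throughput_5k}")
```

Under these assumptions each job finishes 20% sooner on 5K cores but costs twice the core-minutes, and the machine completes half as many jobs per hour: a latency gain paid for in throughput.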


Do we think outside the box ?

• Do we ?
  • Interleave processes ?
  • Think of using MPI + OpenMP with just 1 OpenMP thread ?
  • Share nodes among jobs ?
  • Serialize (and overlap) reductions ?
  • Taskify MPI calls to allow their out-of-order execution ?
  • Spawn packing and unpacking tasks to allow for fast draining of incoming messages by the main process ?
  • Parallelize packing and unpacking of messages ? Depending on message size ?


Do we think outside the box ?

• Why?
  • Follow “recommended best practices”
  • Never thought of it ?
  • Some bad experience, never again
  • I can do it better !!!!!
  • Dazzled by performance !!


All about the mindset

• The real parallel programming revolution
  • … is in the mindset of programmers
  • From the latency to the throughput age !!!
• … and can/should be achieved productively
  • Incrementally
  • On a standard programming model/language (MPI+OpenMP, Python, …)
• Real revolution, real effort
  • Issue everywhere. At home first.
  • Shape minds vs. reshape minds


Key aspects

• Actual behavior/Performance analysis
  • Avoid flying blind !!
  • Towards insight and understanding of fundamental issues
  • For application & system developers
• Programming practices and models
  • Decouple programmer from machine
    • Programs to convey ideas to humans … that happen to be executable by machines
  • Enable productive/evolutionary/composable approaches
  • Can we avoid/contain the complexity explosion ?
  • Dynamic resource sharing


Behavior awareness

• A common language about fundamental issues

• Evolution of bottlenecks

• Methodology
• 195 studies:
  • ~25% industry
  • Awareness
  • Opportunity to improve
    • And examples of how
  • Co-design input


Behavior awareness

Tracking scaling behavior of computation regions (Strong scaling MPI+OpenMP example)


Behavior awareness

• Coupled codes
  • Multiple physics, domains
  • Compute & I/O

[Figure: EC-EARTH trace (Atmosphere and Ocean), 1600 cores, 2.5 s, 26.7 MB trace. Eff: 0.43; LB: 0.52; Comm: 0.81]
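
The three numbers in the EC-EARTH trace summary can be related through a multiplicative efficiency model, in which parallel efficiency is the product of load balance and communication efficiency. The slide does not spell the model out, so reading Eff, LB and Comm this way is an assumption, and the function name and sample timings below are hypothetical. With the slide's rounded values, 0.52 × 0.81 ≈ 0.42, which is consistent with the reported 0.43 once rounding of the inputs is allowed for.

```python
def efficiency_metrics(useful_times, runtime):
    """Return (efficiency, load_balance, comm_efficiency).

    useful_times: seconds of useful computation per process (hypothetical input)
    runtime:      wall-clock seconds of the whole run
    """
    average = sum(useful_times) / len(useful_times)
    longest = max(useful_times)
    load_balance = average / longest             # 1.0 means perfectly balanced
    comm_efficiency = longest / runtime          # 1.0 means no time lost outside computation
    efficiency = load_balance * comm_efficiency  # algebraically equal to average / runtime
    return efficiency, load_balance, comm_efficiency


# Made-up timings for four processes in a 4-second run.
eff, lb, comm = efficiency_metrics([1.0, 2.0, 3.0, 2.0], runtime=4.0)
print(f"Eff: {eff:.2f}; LB: {lb:.2f}; Comm: {comm:.2f}")

# Cross-check of the rounded values quoted on the slide.
slide_product = 0.52 * 0.81
print(f"LB x Comm from the slide: {slide_product:.2f}")
```

Splitting efficiency into factors like this is what makes a "common language" possible: a low LB points at work distribution, a low Comm at time lost outside computation.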


Vision in the programming revolution

ISA / API

Applications

Power to the runtime

PM: High-level, clean, abstract interface

General purpose

Decouple

Forget about resources

Minimal & sufficient permeability?

Intelligence & Resource management

“Reuse & expand” old architectural ideas under new constraints


Vision in the programming revolution

ISA / API

Special purpose

Must be easy to develop/maintain

Fast prototyping

Applications

Power to the runtime

PM: High-level, clean, abstract interface

DSL1 DSL2 DSL3


Integrate concurrency and data

Single mechanism
• Concurrency:
  • Dependences built from data accesses
  • Lookahead: About instantiating work
• Locality & data management
  • From data accesses

Task based parallel programming
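
A minimal sketch of how dependences can be built from data accesses, assuming each task simply declares the names of the data it reads and writes. This is a toy model and not the implementation of any particular runtime; the function names and the example task list are hypothetical.

```python
from collections import defaultdict


def build_dependences(tasks):
    """tasks: list of (name, ins, outs) in program (instantiation) order.

    Returns {task_name: set of task names it must wait for}.
    """
    last_writer = {}
    readers_since_write = defaultdict(list)
    deps = {}
    for name, ins, outs in tasks:
        preds = set()
        for datum in ins:                      # read-after-write
            if datum in last_writer:
                preds.add(last_writer[datum])
        for datum in outs:
            if datum in last_writer:           # write-after-write
                preds.add(last_writer[datum])
            preds.update(readers_since_write[datum])  # write-after-read
        deps[name] = preds
        for datum in ins:
            readers_since_write[datum].append(name)
        for datum in outs:
            last_writer[datum] = name
            readers_since_write[datum] = []
    return deps


def waves(deps):
    """Group tasks into waves; tasks in the same wave may run concurrently."""
    level = {}
    for name in deps:  # insertion order is program order, so predecessors come first
        level[name] = 1 + max((level[p] for p in deps[name]), default=-1)
    grouped = defaultdict(list)
    for name, lvl in level.items():
        grouped[lvl].append(name)
    return [grouped[lvl] for lvl in sorted(grouped)]


# Hypothetical program: the programmer only states what each task touches.
tasks = [
    ("init_a", [], ["a"]),
    ("init_b", [], ["b"]),
    ("add", ["a", "b"], ["c"]),
    ("scale_a", ["a"], ["a"]),
    ("report", ["c"], []),
]
deps = build_dependences(tasks)
schedule = waves(deps)
print(schedule)
```

The same declared accesses that order the tasks also tell a runtime which data each task needs, which is the sense in which one mechanism can serve both concurrency and locality. Looking further ahead in the task list (lookahead) exposes more tasks per wave.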


Task based parallel programming

• Some important features
  • Dependences, Lookahead
  • Taskloops
  • Nesting
  • Array sections / Regions
  • Exploiting malleability:
    • Dynamic Load Balance (DLB)
    • Within App, across apps
  • MPI+OpenMP interoperability
• Think global, specify local
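
Exploiting malleability can be pictured with a toy model of processes that share a node and lend their cores once they run out of work. This is not the API of the DLB library: the function names, the even split of lent cores, and the workloads (in core-seconds) are all assumptions, and the model ignores the cost of moving cores between processes.

```python
def static_time(work, cores):
    """Finish time when every process keeps only its own cores."""
    return max(w / c for w, c in zip(work, cores))


def lending_time(work, cores):
    """Finish time when a finished process lends its cores to the others.

    Assumes every process starts with positive work and at least one core.
    """
    remaining = [float(w) for w in work]
    speed = [float(c) for c in cores]      # cores currently driving each process
    active = [i for i, w in enumerate(remaining) if w > 0]
    elapsed = 0.0
    while active:
        # Advance to the moment the next process runs out of work.
        finish_in = {i: remaining[i] / speed[i] for i in active}
        step = min(finish_in.values())
        elapsed += step
        finished = [i for i in active if finish_in[i] == step]
        active = [i for i in active if finish_in[i] != step]
        for i in active:
            remaining[i] = max(0.0, remaining[i] - step * speed[i])
        # Freed cores are split evenly among the processes still running.
        freed = sum(speed[i] for i in finished)
        for i in finished:
            remaining[i], speed[i] = 0.0, 0.0
        for i in active:
            speed[i] += freed / len(active)
    return elapsed


# Two imbalanced processes with 4 cores each (made-up workloads).
work = [40, 120]
cores = [4, 4]
t_static = static_time(work, cores)
t_lending = lending_time(work, cores)
print(f"static: {t_static} s, with lending: {t_lending} s")
```

In this model the imbalanced pair finishes in the time the total work would take on all eight cores, without the programmer rebalancing anything in the source code.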


Towards the throughput age

• By
  • Express potential concurrency
  • Malleability
  • Dynamic resource sharing/management
• Configuration independence
  • Amount of resources is what really matters
• Side effects
  • Nx1 can be better than pure MPI !!!
  • Hope for lazy programmers


Infrastructures for new usage modes

• Persistent KVS
  • Alternative for parallel programs I/O?
  • Flexible querying: 3D indexing, Data-thinning
• Need/opportunity of clean integration of concurrency and data
  • Within one app
  • Shared communication space between multiple apps.
• Malleable/Elastic/opportunistic resource management/sharing


Impact on architecture ?

• High throughput devices
  • Long Vectors
  • Decouple Front end - Back end engines, reduce front end pressure, optimize memory throughput, explicit locality management
  • Specialized compute and data motion engines
  • Tuned numerical precision
• ISA is important
  • Decouple/hide again hardware details, reuse SW technologies (compilers, OS, …)
  • Specific instructions?
  • “limited” number of control flows
• Hierarchical Acceleration
  • Nesting
  • Homogenize heterogeneity
• Runtime aware architectures (RAA)


Age before beauty

• Behavior (insight/models) before syntax
• Detailed performance analytics before aggregated profiles
• Work instantiation and order before overhead
• Malleability before fitted rigid structure
• Possibilities before how-tos
• Elegance before one-day shine


The challenge

• Think of fundamentals, think out of the box

• Revolution: change everything so that nothing changes

• Should we: change as little as possible so that everything is different ?

• Programmers !!!!

• Develop a culture of
  • Efficiency awareness
  • Latency → throughput mindset
  • Dynamic sharing of resources

• To exascale … and before