3rd SRA-4 work-session summary charts, May 20th, Frankfurt



Research clusters for SRA-4 – the result:

Working groups:

• System Architecture
• System Hardware Components
• System Software & Management
• Programming Environment
• IO & Storage
• Math & Algorithms
• Application co-design
• Centre to edge framework

For each research cluster define:

• Relevance & impact (why chosen?)
• Maturity (time to market)
• Hurdles to overcome
• Driving competence in Europe
• Cost of research to gain significant uptake

Research domains:

• Data everywhere
• AI everywhere
• Energy efficiency
• Resilience
• Development methods and standards
• HPC and the digital continuum


How do we get to the final set of research clusters for SRA-4?

STEP1: 44 suggested cluster elements

STEP2: bundle & select

STEP3: review / correct 1st to 2nd level grouping

STEP4: review / define “anchoring” of 1st level clusters

STEP5: name “champions” for selected 1st level clusters (2..3); 1-3 champions per cluster, resp. for “5 parameters and intro”

STEP6: review “5 cluster parameters”: keep/change?

• Relevance & impact (why chosen?)
• Maturity (time to market)
• Hurdles to overcome
• Driving competence in Europe
• Cost of research to gain significant uptake

Resulting structure: cluster #1 to cluster #X (X < 10), each grouping several 2nd level clusters (x, y, z, ..)

“size”, naming?

total number?

any last minute additions?


#1, Champions: Adrian Tate (Cray), Francois Bodin (University of Rennes)

Research cluster: Development methods & standards

Cluster anchoring: Independent of specific scientific/industrial/societal problems-to-solve, holistic approaches based on co-design, i.e. an "end-to-end" design incl. modelling and integration aspects, are widely accepted as a necessary prerequisite for success.

Cluster elements:

• Co-design of application/runtime/architecture, including proof-of-concept and best practice demonstrations improving scalability and performance of applications
• Modelling of system components, data and whole system
• Full system architectures (including technology integration)
• Standards' convergence in order to achieve portability and 'technology islands'
• Performance Analysis and Programming best practices
• Performance portability and future-proofing

Why chosen (Performance portability and future-proofing): Enabling applications to perform across varying and heterogeneous architectures is essential for ensuring sustainable performance on emerging Exascale computing systems, and to prevent development investments from binding applications into a specific solution path. When applied correctly, the approach should enable optimal design choices between multiple hardware and software options. Frameworks that consist of domain-specific languages, libraries, programming and abstraction frameworks, models and toolchains have proven to provide a good practical approach to sustain the highest possible performance with complex applications on multiple computing architectures.


#2, Champions: Laurent Cargamel (Atos), Paul Carpenter (BSC) and BSC volunteer

Research cluster: Energy efficiency

Cluster anchoring: Innovation in energy efficiency is essential across all scientific and industrial use scenarios as a prerequisite for gaining the desired scalability, reducing TCO, etc.

Cluster elements:

• Heterogeneous Acceleration
  Why chosen: Reaching exascale-class performance on real world applications requires applications to be able to exploit different acceleration techniques (GPUs, FPGAs, etc.). Proving that this assumption is true, and that there are applications exploiting programming techniques that enable the use of different acceleration components, sets a proof point in the future work programmes.
• Heterogeneous computing
  Why chosen: Exascale HW will be heterogeneous. This needs to be addressed for application development and runtime.
• Scalable energy efficient solvers
• Algorithmic changes
• Auto-tuning systems
  Why chosen: The idea behind this title is to investigate what could be done to obtain an HPC system (in the sense of the continuum) that automatically improves its behaviour (a more generic goal than performance). This includes the use of AI for the HPC system, but also older techniques.


#3, Champions: Maria Perez (UPM), Benny Koren (Mellanox)

Research cluster: AI everywhere

Cluster anchoring: There will be no future for HPC infrastructure without strong support for AI, and a limited future for AI without HPC.

Cluster elements:

• Application of AI
  Why chosen: AI is one of the main technologies driving industry today. Its impact will only increase in the near future, making this a priority target for HPC tools and infrastructure.
• Learning across the continuum
  Why chosen: There is "AI for HPC" and "HPC for AI"; both ways will be needed (this will be the focus of one of the two BDEC demonstrators).
• AI & Data Analytics
• Distributed AI Network Acceleration


#4, Champions: Sai Narasimhamurthy (Seagate), Hans-Christian Hoppe (Intel), Gabriel Antoniu (Inria)

Research cluster: Data everywhere

Cluster anchoring: Understanding data centric requirements (and data logistics) is essential for "HPC in a digital continuum".

Cluster elements:

• Data centric computing
• NVM and its use as persistent memory and for persistent I/O
• Data life cycle management in distributed scientific environments
• Data sharing (data flexibility)
  Why chosen: Data sharing, or flexibility, is the ability for different users to have access to the same data and for the data to be usable from heterogeneous frameworks. It is fundamental in order to develop a federation, as well as to optimise the usage and efficiency of such a federation.
• New addressing schemes for persistent memory
• Byte addressable versus block model


#5, Champions: Marc Duranton (CEA), Gabriel Antoniu (Inria), Francois Bodin (University of Rennes)

Research cluster: HPC and the digital continuum

Cluster anchoring: link and motivation is obvious (see blueprint); industrial and scientific use cases (CoEs)

Cluster elements:

• Unified data storage and processing across the digital continuum: edge-cloud-HPC system
• HPC on the edge
• HPC in the loop
  Why chosen: This makes HPC more relevant as it integrates deeply into larger society-wide / industry-wide workflows. It will increase the value of HPC simulations by making them more available to stakeholders.
• Support workflows on heterogeneous systems
• Seamless heterogeneous architectures (and software for them)
• Privacy and Security in the edge-HPC Centre continuum
• Orchestration and mediation on resource and workflow
• Coupling of HPC & HTC; Ensembles
• Full system design of converged HPC/Cloud architectures
• HPC (or HTC) in the cloud/edge
• Containers
• Simulation and data experimentation assimilation model


#6, Champions: Manolis Marazakis (FORTH-ICS), Petar Radjkovic (BSC)

Research cluster: Resilience

Cluster anchoring: Improved system level resiliency was always on the top requirements list of users we interviewed and will stay that way in the pursuit of extreme scale computing. Included as well is algorithmic resilience.

Cluster elements:

• HPC resilience
  Why chosen: System resilience is one of the most important Exascale requirements. The European HPC community, however, lacks a strong research effort in resilience, which makes resilience one of the greatest challenges of the EU HPC initiative (ETP4HPC Strategic Research Agenda: Achieving HPC Leadership in Europe, 2013, page 42). Assuring the resiliency of large-scale HPC systems is complex and requires research and engineering effort for the analysis, development and evaluation of reliability features. An additional problem is that resilience is a vertical problem that needs holistic solutions. For all these reasons, we have to make sure that HPC system resilience is properly represented in future work programmes.
• Resilience in exascale HPC
  Why chosen: Resilience is widely recognized as a critical challenge for high performance computing (HPC) systems, as a result of the increasing complexity, both at the level of individual hardware and software components and at the level of subsystems and complete heterogeneous system configurations. At scale, we can no longer assume faults, errors, and failures to be uncommon events. Moreover, even more challenging failure modes have emerged, beyond the assumptions of the commonly assumed fail-stop model, raising concern about the integrity of computations and data at-rest and in-transit. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, is therefore essential to ensure the success of the extreme-scale HPC systems, and more broadly for data center-scale systems such as cloud infrastructure. Further challenges arise from the interplay between resiliency and energy consumption: improving resilience often relies on redundancy (replication and/or checkpointing, rollback and recovery), which consumes extra energy. Based on these observations, I am proposing a cross-cutting activity focused on resilience concerns in future exascale HPC systems.
• Adaptivity, Uncertainty Quantification
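The checkpointing, rollback and recovery mechanism named under "Resilience in exascale HPC" can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the charts: the toy computation (summing integers), the function name `run_with_checkpoints`, and the fault-injection parameter `fail_at`. The count of executed steps, including redone ones, serves only as a rough stand-in for the extra energy that redundancy costs.

```python
import copy


def run_with_checkpoints(steps, checkpoint_every, fail_at=frozenset()):
    """Sum the integers 0..steps-1 while surviving injected faults.

    Hypothetical illustration: each index in `fail_at` triggers one
    fault just before that step runs; the state is then rolled back
    to the most recent checkpoint and the lost work is redone.
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)   # initial checkpoint
    pending_faults = set(fail_at)
    executed = 0                        # all step executions, incl. redone work

    while state["step"] < steps:
        step = state["step"]
        if step in pending_faults:
            pending_faults.discard(step)       # each fault fires once
            state = copy.deepcopy(checkpoint)  # rollback and recovery
            continue
        state["total"] += step
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)  # periodic checkpoint

    return state["total"], executed


if __name__ == "__main__":
    print(run_with_checkpoints(10, 4))               # fault-free run
    print(run_with_checkpoints(10, 4, fail_at={6}))  # one fault, work redone
```

In this sketch the result is the same with or without the fault, but the faulty run executes more steps, which is the trade-off between checkpoint frequency, redone work and energy that the cluster description points to.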


Proposed topics to be moved to working groups:

Math and Algo-WG:

• Scalable Applications (Guy Lonsdale, scapos AG)

Architecture-WG:

• Infrastructure Computing (Benny Koren, Mellanox)
  The emergence of smart NICs in HPC and cloud calls for a different partitioning of the functionalities that were all running on the hosts. Some of the infrastructure services (e.g. firewalls, storage) will migrate from the main CPU into the smart NIC processing engines. This is a major change in the compute node architecture that opens the door to many innovations and new capabilities.
• Adoption of upcoming hardware architectures (Dirk Pleiter, Forschungszentrum Juelich)
• Configurable architectures (Laurent Cargamel)
• Reconfigurable computing (Xavier Martorell, Barcelona Supercomputing Center)
• Urgent Computing (CoE ChEESE)
  Applications that provide decision makers with information during critical emergencies cannot waste time waiting in job queues and need access to computational resources as soon as possible.
• Active Networking (Valeria Bartsch, Fraunhofer ITWM)
  In future, networking devices (such as Smart NICs) will allow us to prepare data (i.e. filtering, compression, aggregation) before sending it. Such approaches can reduce network congestion, and applications (such as AI) can immediately reap the benefits.
• GPU disintegration and scaling across the network (Benny Koren, Mellanox)


Structuring the SRA text contributions:

Working group overviews (NOW to September 10th)

▪ From working groups: “State of the art” & “challenges” overview, 2-3 pages max

➢ Current state of the art in the WGx-domain (e.g. “Programming Environment”)

➢ Challenges for 2021 – 2024 in the area of the WGx-domain

Research clusters (agreed upon set of 1st level clusters)

▪ From Cluster Champions: (NOW to August 1st)

❖ Intro: which topics are covered by this cluster? (refer to set of 2nd level clusters), 1 page max

❖ “5 parameters”, 1 page max (not all parameters might be relevant for every cluster):

• Relevance & impact (why chosen?)

• Maturity (time to market)

• Hurdles to overcome

• Driving competence in Europe

• Cost of research to gain significant uptake

▪ From working groups: Describe the WGx-domain facets of the cluster, 3 pages max (August 2nd – September 10th)

➢ What is the overlap of the WGx-domain with the cluster?

➢ What are the specific technical challenges?

➢ Approaches and options for solutions

➢ What should be further researched?


SRA-4 timeline end-to-end:

● March 19th : SRA-4 process communicated at General Assembly

● March 21st : Invitation to apply for SRA-4 working groups to be sent out to ETP4HPC members

● March 31st : FP9 vision document electronic version available, registration for working groups incl. suggestions for research clusters

● April 12th : Deadline for working group registration and collection of proposals for research clusters

● April 15th - April 19th : we analyse your input and set up working groups

● May 17th : SRA-4 working group leaders internal workshop during European HPC Summit week (May 13th to 17th)

● May 20th – June 14th: Kick-off calls with working groups (8 calls, set up by office)

● June 19th : SRA-4 working group leaders internal workshop during ISC 19, start 18:30, Citadines Hotel, Frankfurt

● July/August/early September: writing complete text, individual working group calls (organized by working group leaders)

● Sept. 19th : first integration of SRA-4 document, technical part

● Oct. 4th : all other doc-parts integrated, document complete (first rendition)

● Oct. 17th : work session during European Big Data Value Forum

● October/November: text refinements, reviews, corrections

● Dec. 9th week: closing SRA-4 workshop, IBM ZRL Rueschlikon

● December: language checks, document design, release


Between now and September 10th

● “Keep things flowing”: a sync call in July, August and 1st week September between all sra-wgls

● Any question in between? : contact Michael, Maike or Marcin

● Any interlock between working groups and Cluster Champions: use email connections (next page)


How to connect with work group leaders

1 - System Architecture led by Laurent Cargemel (Atos) and Estela Suarez (Juelich SC): [email protected] [email protected] [email protected]

2 - System Hardware Components led by Marc Duranton (CEA) and Benny Koren (Mellanox): [email protected] [email protected] [email protected]

3 - System Software & Management led by Pascale Rosse-Laurent (Atos), María S. Pérez-Hernández (Universidad Politécnica de Madrid) and Manolis Marazakis (FORTH): [email protected] pascale.rosse-[email protected] [email protected] [email protected]

4 - Programming Environment led by Guy Lonsdale (Scapos), Paul Carpenter (BSC) and Gabriel Antoniu (Inria): [email protected] [email protected] [email protected] gabriel.antoniu@inria.fr

5 - I/O & Storage led by Sai Narasimhamurthy (Seagate) and André Brinkmann (Universität Mainz – JGU): [email protected] sai.narasimhamurthy@seagate.com [email protected]

6 - Mathematics & Algorithms led by Dirk Pleiter (Juelich SC) and Adrian Tate (Cray): [email protected] [email protected] [email protected]

7 - Centre-to-Edge Framework led by Jens Krueger (Fraunhofer) and Hans-Christian Hoppe (Intel): [email protected] [email protected].de hans-[email protected]m

8 - Application co-design led by Erwin Laure (KTH) and Andreas Wierse (SICOS): [email protected] [email protected] [email protected]


SRA-4 working group mailing lists

• System Architecture

[email protected]

• System Hardware Components

[email protected]

• System Software and Management

[email protected]

• Programming Environment

[email protected]

• I/O & Storage

[email protected]

• Mathematics & Algorithms

[email protected]

• Application co-design

[email protected]

• Centre-to-edge-framework

[email protected]

All working group leaders only: [email protected]


Thank you

@etp4h

[email protected]

www.etp4hpc.eu


SRA-4: the increasing interplay of Simulation, AI, IoT and Analytics


SRA-4 content and structure / size


• Intro based on updated “Blueprint” document (20 out of 36 pages)

• Strategic directions (with input from RIAG – Axel Auweter, Maria Perez) (3 pages)

• Technical Research priorities 2021 – 2024 (see also next page)

• “State of the art” & “challenges” per each working group – 2-3 pages max per working group

(appr. 16-20 pages)

• Examples of relevant use cases (4 pages)

• Research clusters (agreed upon set)

Intro & “5 parameters” (1-2 pages by ‘sponsors’(tbd)). Detailed descriptions by working group (2-3 pages)

(max 50 pages)

• Upstream Technologies – focus for 2021 – 2024

• Gen. recommendations for workprogramme 2021&2022 (focus calls, large scale pilots, collaborative aspects..)

• Non-technical topics:

• HPC and HPDA in Europe/China/US/Japan – (BDEC-2 , M. Asch)

• Gap Analysis proposed (WPs in H2020) vs actual research (J.F. Lavignon)

• Contributing European organizational eco-system :

• CoE, BDEC-2, PRACE, HiPEAC, BDVA, AIOTI, ECSO, Eurolab4HPC


SRA-4 working groups and leads

• System Architecture

• Laurent Cargemel, ATOS

• Estela Suarez, JSC

• System Hardware Components

• Marc Duranton, CEA (HiPEAC)

• Benny Koren, MELLANOX

• System Software and Management

• Pascale Rosse-Laurent, ATOS

• Maria Perez, UPM (BDVA)

• Manolis Marazakis (FORTH)

• Programming Environment

• Guy Lonsdale, SCAPOS

• Paul Carpenter, BSC

• Gabriel Antoniu, INRIA (BDVA)

• I/O & Storage

• Sai Narasimhamurthy, SEAGATE

• Andre Brinkmann, JGU

• Mathematics & Algorithms

• Dirk Pleiter, JSC

• Adrian Tate, CRAY

• Application co-design

• Erwin Laure, KTH

• Andreas Wierse, SICOS

• Centre-to-edge-framework

• Jens Krueger, FRAUNHOFER

• Hans-Christian Hoppe, INTEL