PRACE WP4 – Distributed Systems Management
Riccardo Murri, CSCS – Swiss National Supercomputing Centre

PRACE WP4

• WP4 is the “Distributed Systems Management” activity

– User administration and accounting

– Distributed data management

– Trust between sites and security

– Monitoring of distributed resources

– Resource management and allocation

– Grid access

• Provides tools for:

– Consistent management of the Tier-0 systems

– Smooth interoperation of the PRACE infrastructure with the national, regional or institutional HPC services

– Seamless access for users to the PRACE infrastructure

WP4 defines the PRACE Middleware Stack

• To connect all PRACE systems in a coherent whole

– e.g., uniform interfaces for job submission and data transfer

• Yet allow users to take advantage of the diversity of PRACE systems

– Do not make all machines look equal, as they have different characteristics which can be successfully exploited by computational jobs

• Iterative process to define the middleware stack

– Second release in June, currently being deployed

– Work towards final release starts now!

Middleware selection

• HPC ecosystem integration

– Wide range of tools to adapt to the manifold needs of users

– Standards compliance a must

– Client software must be readily available at end-user sites

• Leverage DEISA experience

– DEISA has been running a distributed supercomputing infrastructure for several years

– Strong cooperation between the two projects

• User-centric view

– Assessments also based on survey and user feedback

The PRACE Infrastructure, today

• Six prototype systems

– At BSC (Spain), CEA (France), CSC (Finland), FZJ (Germany), HLRS (Germany), SARA (The Netherlands)

– Diversity of computational architectures

• Private high-speed interconnect

– 10 Gb/s dedicated links

– Shared with the DEISA supercomputing federation

• Leverage DEISA experience and tools in running it

– Star topology, with hub in Frankfurt a.M.

• Public Internet access

– Through the European GÉANT R&E network

Two-tiered structure

• All PRACE services exposed to both the private network and the Internet

• Flexibility in service setup

– Cater for diversity in system capabilities and site policies

• “Inner circle” on the private network

– Geared towards high-speed transfers and strong integration of PRACE services

• “Outer circle” on the public Internet

– Secures access to the PRACE services

– “Door nodes”: bastion hosts that act as a gateway to the PRACE private network
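
The two-tier layout can be illustrated with a small Python sketch that decides which host a client should contact. Everything named here is an assumption made up for illustration: the private address range, the host names, and the `pick_endpoint` helper are hypothetical and do not come from the PRACE setup.

```python
import ipaddress

# Hypothetical values, not real PRACE addresses or host names.
PRIVATE_NET = ipaddress.ip_network("10.0.0.0/8")
INTERNAL_HOST = "service.private.example.org"
DOOR_NODE = "door.example.org"


def pick_endpoint(client_ip: str) -> str:
    """Return the host a client should contact for a service."""
    if ipaddress.ip_address(client_ip) in PRIVATE_NET:
        # "Inner circle": reach the service directly over the private network.
        return INTERNAL_HOST
    # "Outer circle": go through the bastion host ("door node").
    return DOOR_NODE
```

A client inside the assumed private range gets the internal host; any other client is sent to the door node.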

A vision for access to PRACE systems

• X.509-based authentication and authorization

– Uniform authentication for all PRACE services

• UNICORE, SSH, GridFTP/RFT, …

– Provides encryption and confidentiality of all network communications

– Widely adopted standard

– EUGridPMA/IGTF provides trust anchor

• Globus GSI support

– Single Sign-On: seamlessly mix several Grid services

– The Grid standard
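
As an illustration of X.509-based authorization, GSI-enabled services commonly map a certificate subject (DN) to a local account through a grid-mapfile, where each line holds a quoted DN followed by an account name. The slides do not say how PRACE sites configure this, so the sketch below is only an assumption-laden example: the `parse_gridmap` helper and the sample DN are hypothetical, and entries with several comma-separated accounts are not handled.

```python
import shlex


def parse_gridmap(text: str) -> dict[str, str]:
    """Map certificate subject DNs to local account names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # handles the quoted DN, which may contain spaces
        if len(parts) != 2:
            continue  # ignore lines this simple sketch does not understand
        dn, account = parts
        mapping[dn] = account
    return mapping


# Hypothetical sample entry, for illustration only.
SAMPLE = '# comment\n"/C=CH/O=Example/CN=Jane Doe" jdoe\n'
```

Calling `parse_gridmap(SAMPLE)` yields a one-entry dictionary keyed by the subject DN.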

Data movement

• GridFTP supported at all sites

– Standard for Grid high-speed data transfers

• Supported by all major Grid infrastructures: DEISA, EGEE, TeraGrid, OSG, GridAustralia, ...

– “Door nodes” act as gateway between private and public network

– Command-line and graphical clients available for all major OSes

• Globus RFT for automated file transfer

– Unattended and reliable transfer of large files

– One server serves both the private and the public network
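
A GridFTP transfer from the command line is typically done with the Globus `globus-url-copy` client. The sketch below only builds the argument list, using the client's `-vb` (performance output) and `-p` (parallel streams) options; the helper name, the host name, and the paths are hypothetical, and actually running the command would normally require a valid proxy credential.

```python
def gridftp_copy_command(src_url: str, dst_url: str, streams: int = 4) -> list[str]:
    """Build a globus-url-copy command line for one transfer."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    return ["globus-url-copy", "-vb", "-p", str(streams), src_url, dst_url]


# Hypothetical door node and file paths, for illustration only.
cmd = gridftp_copy_command(
    "gsiftp://door.example.org/data/input.dat",
    "file:///tmp/input.dat",
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the client is installed.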

Job submission and control: command-line

• Command-line access fully supported

– Still the most popular way of accessing HPC systems

• Direct access to local batch execution and scheduling systems

– Exploit system-specific features

• SSH provides command-line access

– Well-known secure protocol

– X.509 and GSI authentication supported

– “Door node” provides gateway between private and public network

Job submission and control: UNICORE 6

• UNICORE 6

– UNIform interface to COmputing REsources

• Both command-line and Eclipse-based rich graphical client

• Extensible: use the API or embed GridBeans into the rich client to create submission interfaces tailored to specific needs

• X.509 authentication

– Workflow engine

• Can coordinate data transfer with job execution, and stage files to the target system

– Several data transfer protocols supported

• Can be used as an alternative data transfer system

• Does much more than just this!

– See http://www.unicore.eu/ for reference
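
The idea of coordinating data transfer with job execution can be sketched as a dependency graph: input files are staged in before the job runs, and results are staged out afterwards. This is not UNICORE's workflow engine or API; the step names and the `execution_order` helper are hypothetical, and the example relies only on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "stage_in_input": set(),
    "run_simulation": {"stage_in_input"},
    "stage_out_results": {"run_simulation"},
}


def execution_order(steps: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(steps).static_order())
```

For this linear chain, `execution_order(workflow)` returns the stage-in step, then the job, then the stage-out step.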

Resource monitoring

• User-level testing with INCA

– Executes tests as an unprivileged user and reports status

• e.g., to verify that a certain application is available

– Aggregates results and reports in a color-coded web page

– Different views with X.509 authorization

• Network performance testing

– Iperf data

– Shared with DEISA

• Monitoring features still actively discussed and developed
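
The aggregation step can be pictured with a minimal sketch that turns individual test results into one color per site. It does not use INCA's data model or API: the report tuples, the site and test names, and the rule that a single failure turns a site red are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical test reports: (site, test name, passed?)
reports = [
    ("site-a", "compiler-available", True),
    ("site-a", "gridftp-login", False),
    ("site-b", "compiler-available", True),
]


def summarize(results):
    """Return {site: color}; one failed test is enough to mark a site red."""
    outcomes = defaultdict(list)
    for site, _test, ok in results:
        outcomes[site].append(ok)
    return {
        site: "green" if all(oks) else "red"
        for site, oks in outcomes.items()
    }
```

With the sample data, `site-a` is reported red because one of its tests failed, while `site-b` is green.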

User management and Accounting

• Centralized user management

– One account to access all PRACE services

– Compatible with the DEISA system

• Might merge into a single database in the future

• Web client for accounting/reporting

– X.509 authorization with different authorization levels: user, project manager, site manager

– Developed by DEISA and in daily production use for years
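
The three authorization levels can be illustrated by filtering accounting records according to the caller's role. The slides name the levels but not what each one may see, so the visibility rule below (users see their own records, project managers their project, site managers their site) is an assumption, as are the `Usage` record and the `visible_records` helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical accounting record."""
    user: str
    project: str
    site: str
    core_hours: float


# Assumed mapping from authorization level to the field it is scoped by.
SCOPE_FIELD = {"user": "user", "project manager": "project", "site manager": "site"}


def visible_records(records, role, who):
    """Return the records a caller with the given role may see."""
    field = SCOPE_FIELD[role]
    return [r for r in records if getattr(r, field) == who]


records = [
    Usage("alice", "proj1", "site-a", 120.0),
    Usage("bob", "proj1", "site-b", 80.0),
    Usage("carol", "proj2", "site-a", 40.0),
]
```

For example, a project manager of `proj1` would see the records of both `alice` and `bob`, while `alice` as a plain user would see only her own.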

Thank you!

• WP4 will start working on the new release of the middleware stack shortly

– What features would you like to see added?

– What other use cases do you think should be supported on PRACE Tier-0 systems?

– What modifications would best support your use of the PRACE Tier-0 systems?

• We value your input!

References: WP4 deliverables so far

• D4.1.1 – “Requirement analysis for Tier-0 systems management”

• D4.1.2 – “Report on existing Tier-0 systems management solutions”

• D4.1.3 – “Deployment of initial software stack to selected sites”

• D4.2.2 – “Deployment of enhanced solutions”