
Page 1:

Grid Computing: dealing with GB/s dataflows

David Groep, NIKHEF
Jan Just Keijser, Nikhef, [email protected]

21 March 2011

Graphics: Real Time Monitor, Gidon Moont, Imperial College London, see http://gridportal.hep.ph.ic.ac.uk/rtm/

Page 2:

LHC Computing

Large Hadron Collider
• 'the world's largest microscope'
• 'looking at the fundamental forces of nature'
• 27 km circumference
• CERN, Genève

(Figure: zooming in from the atom to the nucleus to the quarks, at scales down to 10⁻¹⁵ m)

~ 20 PByte of data per year, ~ 60 000 modern PC-style computers

Page 3:
Page 4:
Page 5:

ATLAS Trigger Design

• Level 1 – hardware-based, online – accepts 75 kHz, latency 2.5 µs – 160 GB/s
• Level 2 – 500-processor farm – accepts 2 kHz, latency ~10 ms – 5 GB/s
• Event Filter – 1600-processor farm – accepts 200 Hz, ~1 s per event – incorporates alignment and calibration – 300 MB/s

From: The ATLAS trigger system, Srivas Prasad
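The accept rates and output bandwidths above pin down the implied event size at each level. A quick back-of-envelope check, using only the numbers from this slide:

```python
# Back-of-envelope check of the trigger numbers above:
# output bandwidth / accept rate = implied event size at each level.

levels = {
    # name: (accept rate in Hz, output bandwidth in bytes/s)
    "Level 1":      (75_000, 160e9),
    "Level 2":      (2_000,  5e9),
    "Event Filter": (200,    300e6),
}

for name, (rate_hz, bandwidth_bps) in levels.items():
    event_size_mb = bandwidth_bps / rate_hz / 1e6
    print(f"{name}: ~{event_size_mb:.1f} MB per event")

# Level 1: ~2.1 MB, Level 2: ~2.5 MB, Event Filter: ~1.5 MB --
# consistent with a raw event of a couple of megabytes.
```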

Page 6:

• Signal/Background ratio: 10⁻⁹

• Data volume: (high rate) × (large number of channels) × (4 experiments) ⇒ ~20 PetaBytes of new data per year

• Compute power: (event complexity) × (number of events) × (thousands of users) ⇒ ~60 000 processors

(Figure: a stack of CDs holding one year of LHC data would be ~20 km tall, compared with a balloon at 30 km, Concorde at 15 km, and Mt. Blanc at 4.8 km)

Page 7:
Page 8:

Scientific Compute e-Infrastructure

From: Key characteristics of SARA and BiG Grid Compute services

Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes.

Data parallelism (also known as loop-level parallelism) is a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes.
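To make the distinction concrete, here is a minimal sketch of both styles in Python, using only the standard library; detect_particles and compute_statistics are hypothetical stand-ins for real analysis steps:

```python
from concurrent.futures import ProcessPoolExecutor

def detect_particles(event):        # hypothetical per-event analysis step
    return event * 2

def compute_statistics(events):     # hypothetical independent task
    return sum(events)

def main():
    events = list(range(1000))
    with ProcessPoolExecutor() as pool:
        # Data parallelism: the same function applied to pieces of the
        # data, spread over the available processors.
        processed = list(pool.map(detect_particles, events))

        # Task parallelism: different functions running concurrently.
        stats = pool.submit(compute_statistics, processed)
        peak = pool.submit(max, processed)
        print(stats.result(), peak.result())

if __name__ == "__main__":
    main()
```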

Page 9:

What is BiG Grid?

• A collaborative effort of NBIC, NCF and Nikhef.
• Aims to set up a grid infrastructure for scientific research.
• This research infrastructure combines compute clusters and data storage with specific middleware and software, to enable research that needs more than just raw computing power or data storage.
• We aim to assist scientists from all backgrounds in exploring and using the opportunities offered by the Dutch e-science grid.

http://www.biggrid.nl

Page 10:

Nikhef (NDPF): 2500 processor cores, 2000 TByte disk, 160 Gbps network

SARA (GINA+LISA): 4800 processor cores, 1800 TByte disk, 2000 TByte tape, 160 Gbps network

RUG-CIT (Grid): 120 processor cores, 8 800 GByte disk, 10 Gbps network

Philips Research Ehv: 1600 processor cores, 100 TByte disk, 1 Gbps network

Page 11:

Grid organisation

National Grid Initiatives & European Grid Initiative
• At the national level, a grid infrastructure is offered to national and international users by the NGIs. BiG Grid is (de facto) the Dutch NGI.
• The European Grid Initiative (EGI) coordinates the efforts of the different NGIs and ensures interoperability.
• Circa 40 European NGIs, with links to South America and Taiwan.
• The headquarters of EGI is at the Science Park in Amsterdam.

Page 12:

Cross-domain and global e-Science grids

The communities that make up the grid are:
• not under single hierarchical control,
• temporarily joining forces to solve a particular problem at hand,
• bringing to the collaboration a subset of their resources,
• sharing those at their discretion, each under their own conditions.

Page 13:

Challenges: scaling up

Grid especially means scaling up:
• distributed computing on many, different computers,
• distributed storage of data,
• large amounts of data (Giga-, Tera-, Petabytes),
• large numbers of files (millions).

This gives rise to “interesting” problems:
• remote logins are not always possible on the grid,
• debugging a program is a challenge,
• regular filesystems tend to choke on millions of files (see the sketch below),
• storing data is one thing; searching and retrieving turn out to be even bigger challenges.
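One common workaround for the millions-of-files problem, sketched here under the assumption that the application controls its own directory layout: spread the files over a hashed directory tree, so that no single directory ever holds more than a few thousand entries.

```python
# Sketch: spread millions of files over a two-level directory tree
# keyed on a hash of the filename, so no single directory explodes.
import hashlib
from pathlib import Path

def shard_path(root: Path, filename: str) -> Path:
    digest = hashlib.sha1(filename.encode()).hexdigest()
    # e.g. "ab/cd/<filename>" -- at most 256 * 256 subdirectories.
    return root / digest[:2] / digest[2:4] / filename

target = shard_path(Path("/data"), "event-000123.root")
# target.parent.mkdir(parents=True, exist_ok=True)  # before writing
print(target)  # /data/xx/yy/event-000123.root, xx/yy digest-dependent
```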

Page 14:

Challenges: security

Why is security so important for an e-Science Infrastructure?

• e-Science communities are not under a single hierarchical control;

• As a grid site administrator, you are allowing relatively unknown persons to run programs on your computers;

• All of these computers are connected to the internet by an incredibly fast network.

Together, this makes the grid a potentially very dangerous service on the internet.

Page 15:

Lessons Learned: Data Management

Storing Petabytes of data is possible, but...
• Retrieving data is harder than you would expect;
• Organising such amounts of data is non-trivial;
• Applications are much smaller than the data they need to process: always bring your application to the data, if possible;
• The “data about the data” (metadata) becomes crucial:
  – location,
  – experimental conditions,
  – date and time;
• Storing the metadata in a database can be a life-saver (see the sketch below).
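As an illustration, a minimal metadata catalogue along these lines can be kept in SQLite; the table layout below (logical file name, location, conditions, timestamp) is a hypothetical example, not the schema of any particular grid catalogue:

```python
# Sketch: a minimal file-metadata catalogue in SQLite (stdlib only).
import sqlite3

con = sqlite3.connect("metadata.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS files (
        lfn        TEXT PRIMARY KEY,  -- logical file name
        location   TEXT NOT NULL,     -- storage element URL
        conditions TEXT,              -- experimental conditions
        taken_at   TEXT               -- date and time of data taking
    )""")
con.execute(
    "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
    ("run0042/event.root",
     "srm://se.example.org/data/run0042/event.root",  # hypothetical SE
     "beam=3.5TeV", "2011-03-21T12:00:00"),
)
con.commit()

# Finding data becomes a query instead of a filesystem crawl:
for row in con.execute(
        "SELECT lfn, location FROM files WHERE conditions = ?",
        ("beam=3.5TeV",)):
    print(row)
```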

Page 16:

Lessons Learned: Job efficiency

A recurring complaint heard about grid computing is low job efficiency (~94%).

It is important to know that:
• Failed jobs almost always failed due to data-access issues;
• If you remove the data-access issues, job efficiency jumps to ~99%, which is on par with cluster and cloud computing.

Mitigation strategies:
• Replicate files to multiple storage systems;
• Pre-stage data to specific compute sites;
• “Program for failure” (see the sketch below).
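“Program for failure” combines naturally with replication: instead of aborting on the first data-access error, a job can walk the list of replicas until one works. A minimal sketch, where fetch is a stand-in for whatever transfer tool the site provides (globus-url-copy is used purely as an example):

```python
# Sketch of "program for failure": try each replica of a file in turn,
# instead of letting the job die on the first data-access error.
import subprocess

def fetch(url, dest):
    # Stand-in: call an external copy tool; raises on failure.
    subprocess.run(["globus-url-copy", url, dest], check=True)

def fetch_any(replicas, dest):
    errors = []
    for url in replicas:
        try:
            fetch(url, dest)
            return url                 # first replica that works wins
        except Exception as exc:
            errors.append((url, exc))  # remember the failure, move on
    raise RuntimeError(f"all replicas failed: {errors}")
```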

Page 17:

Lessons Learned: Network bandwidth

All data taken by the LHC at CERN is replicated out to 11 Tier-1 centres around the world. BiG Grid serves as one of those Tier-1s.

We always knew we had a good network, but:

• Having a dedicated optical network (OPN) from CERN to the data storage centres (Tier-1s) turned out to be crucial;

• It turns out that the network bandwidth between storage and compute clusters is equally important (see the back-of-envelope check below).
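A back-of-envelope check shows why: the 20 PB/year figure from the earlier slides already implies a multi-Gbit/s sustained flow out of CERN, while a busy compute cluster can demand an order of magnitude more from its local storage. The per-core streaming rate below is an assumption for illustration only:

```python
# Back-of-envelope: sustained rate needed to ship 20 PB/year out of CERN,
# versus what a compute cluster pulls from its local storage.
# 20 PB/year is from the slides; the per-core rate is an assumption.

pb_per_year = 20
seconds_per_year = 365 * 24 * 3600
avg_gbit_s = pb_per_year * 1e15 * 8 / seconds_per_year / 1e9
print(f"CERN -> Tier-1s, averaged: {avg_gbit_s:.1f} Gbit/s")   # ~5 Gbit/s

# A few thousand cores each streaming a few MB/s adds up fast:
cores, mb_per_core = 2500, 5     # assumed per-core streaming rate
cluster_gbit_s = cores * mb_per_core * 8 / 1000
print(f"storage -> compute cluster: {cluster_gbit_s:.0f} Gbit/s")  # ~100
```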

Page 18:

Questions?

http://www.nikhef.nl