Caitriana Nicholson, CHEP 2006, Mumbai Caitriana Nicholson University of Glasgow Grid Data Management: Simulations of LCG 2008


Caitriana Nicholson, CHEP 2006, Mumbai

Caitriana Nicholson
University of Glasgow

Grid Data Management: Simulations of LCG 2008


Outline

• Introduction
  – What will LHC data analysis and management be like in 2008?
• The OptorSim grid simulator
• OptorSim architecture
• Experimental setup
• Results
• Conclusions


Introduction

• LHC raw data rate of ~15 PB/year
• LCG to provide data storage and computing infrastructure
• Actual analysis behaviour still unknown
  – Use simulation to investigate behaviour
  – Investigate dynamic data replication


OptorSim

• OptorSim is a grid simulator with a focus on data management
• Developed as part of EDG WP2
  – Thanks to all members of the Optimisation Team: David Cameron, Ruben Carvajal-Schiaffino, Paul Millar, Kurt Stockinger, Floriano Zini
• Based on EDG architecture
• Used to examine automated decisions about replica placement and deletion

http://edg-wp2.web.cern.ch/edg-wp2/optimization/optorsim.html


Architecture

• Sites with a Computing Element (CE) and/or Storage Element (SE)
• Replica Optimiser decides replications for its site
• Resource Broker schedules jobs
• Replica Catalogue maps logical to physical filenames
• Replica Manager controls and registers replications
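
As a rough illustration of two of the roles listed above, the sketch below models a Replica Catalogue as a mapping from logical filenames to the physical copies held at sites, and a Replica Manager that creates a copy and registers it. The class names, method names, and the `site:/filename` naming scheme are assumptions made for this example; they are not OptorSim's actual API.

```python
from collections import defaultdict


class ReplicaCatalogue:
    """Maps each logical filename (LFN) to the physical filenames (PFNs) of its replicas."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, lfn, pfn):
        self._replicas[lfn].add(pfn)

    def unregister(self, lfn, pfn):
        self._replicas[lfn].discard(pfn)

    def lookup(self, lfn):
        # Return a sorted list so that results are deterministic.
        return sorted(self._replicas.get(lfn, ()))


class ReplicaManager:
    """Controls replications and registers them in the catalogue (hypothetical interface)."""

    def __init__(self, catalogue):
        self.catalogue = catalogue

    def replicate(self, lfn, target_site):
        if not self.catalogue.lookup(lfn):
            raise KeyError(f"no existing replica of {lfn} to copy from")
        # Assumed naming scheme for physical files: "<site>:/<logical name>".
        pfn = f"{target_site}:/{lfn}"
        self.catalogue.register(lfn, pfn)
        return pfn
```

In this sketch a Replica Optimiser would be the component that decides when to call `replicate` for its own site.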


Algorithms

• Job scheduling
  – Details not covered in this talk
  – “QueueAccessCost” scheduler used in these results
• Data replication
  – No replication
  – Simple replication: “always replicate, delete existing files if necessary”
    • Least Recently Used (LRU)
    • Least Frequently Used (LFU)
  – Economic model: “replicate only if profitable”
    • Sites “buy” and “sell” files using auction mechanism
    • Files deleted if less valuable than new file
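
The difference between the two simple strategies is only in which existing file is deleted to make room. The following is a minimal sketch of that choice, assuming equally sized files (so capacity can be counted in files) and a per-file record of last access time and access count. The function names and data layout are hypothetical and do not come from OptorSim.

```python
def choose_victim(store, strategy):
    """Pick the file to delete.

    store maps filename -> (last_access_time, access_count).
    """
    if strategy == "LRU":
        # Least Recently Used: the oldest last-access time goes first.
        return min(store, key=lambda name: store[name][0])
    if strategy == "LFU":
        # Least Frequently Used: the smallest access count goes first.
        return min(store, key=lambda name: store[name][1])
    raise ValueError(f"unknown strategy: {strategy}")


def replicate(store, capacity, new_file, now, strategy):
    """Simple replication: always store the new file, deleting existing files if necessary."""
    deleted = []
    while len(store) >= capacity:
        victim = choose_victim(store, strategy)
        del store[victim]
        deleted.append(victim)
    store[new_file] = (now, 1)
    return deleted
```

The economic model would replace the unconditional `replicate` with a check that the new file is worth more than the file it displaces; that valuation is not sketched here.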


Experimental Setup - Jobs & Files

• Job types based on computing models
• “Dataset” for each experiment ~1 year’s AOD
• 2 GB files
• Placed at CERN and Tier-1s at start
• See experiment computing TDRs for more details

Job          Event size (kB)   Total no. of files   Files per job
alice-pp     50                25000                25
alice-hi     250               12500                125
atlas        100               100000               50
cms          50                37500                25
lhcb-small   75                37500                38
lhcb-big     75                37500                375
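
To give a feel for the data volumes implied by the table, the short calculation below multiplies the file counts by the 2 GB file size. It assumes decimal units (1 TB = 1000 GB), which the slides do not state; the variable and function names are illustrative only.

```python
FILE_SIZE_GB = 2  # file size quoted on the slide

# (total number of files, files per job), copied from the table
JOB_TYPES = {
    "alice-pp": (25000, 25),
    "alice-hi": (12500, 125),
    "atlas": (100000, 50),
    "cms": (37500, 25),
    "lhcb-small": (37500, 38),
    "lhcb-big": (37500, 375),
}


def dataset_size_tb(total_files):
    """Dataset size in TB, assuming 1 TB = 1000 GB."""
    return total_files * FILE_SIZE_GB / 1000


def job_input_gb(files_per_job):
    """Volume of data one job reads, in GB."""
    return files_per_job * FILE_SIZE_GB


for name, (total_files, files_per_job) in JOB_TYPES.items():
    print(f"{name}: dataset {dataset_size_tb(total_files)} TB, "
          f"{job_input_gb(files_per_job)} GB read per job")
```

Under these assumptions an atlas job, for example, reads 100 GB out of a 200 TB dataset.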


Experimental Setup - Storage Resources

• CERN & T1 site capacities from LCG TDR
• “Canonical” T2 capacity of 197 TB each (18.8 PB / 95 sites)
• Storage metric D = (average SE size) / (total dataset size)
• Memory limitations -> scale down T2 SE sizes to 500 GB
  – Allows file deletion to start quickly
  – Disadvantage of small D
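
The storage metric is a simple ratio, sketched below. The example inputs are assumptions chosen for illustration: 500 GB is the scaled-down T2 SE size (the true average would also include CERN and the T1s), and 200 TB corresponds to 100000 files of 2 GB, as for the atlas job type. The slides do not say exactly which dataset total enters D, so the resulting number should not be read as a value from the study.

```python
def storage_metric(average_se_size_gb, total_dataset_size_gb):
    """D = (average SE size) / (total dataset size), both in the same units."""
    if total_dataset_size_gb <= 0:
        raise ValueError("total dataset size must be positive")
    return average_se_size_gb / total_dataset_size_gb


# Illustrative inputs only: 500 GB SEs against a 200 TB dataset.
d_small = storage_metric(500, 200_000)

# Reducing the dataset size (as done in the results) raises D for the same SEs.
d_larger = storage_metric(500, 20_000)
```

A small D means each SE can hold only a tiny fraction of the dataset, so replicas are evicted soon after they are created.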


Experimental Setup - Computing & Network

• Most (chaotic) analysis jobs run at T2s
  – T1s not given CE, except those running LHCb jobs
  – CERN Analysis Facility with CE of 7840 kSI2k
  – T2s with averaged CE of 645 kSI2k each (61.3 MSI2k / 95 sites)
• Network based on NREN topologies
  – Sites connected to closest router
  – Default of 155 Mbps if published value not available
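
For a sense of scale, the sketch below estimates how long one 2 GB file takes to cross a link at the 155 Mbps default. It assumes decimal units (1 GB = 8e9 bits, 1 Mbps = 1e6 bits/s) and an otherwise idle link with no protocol overhead, neither of which is stated in the slides.

```python
def transfer_time_s(file_size_gb, bandwidth_mbps):
    """Idealised transfer time in seconds for one file over one uncontended link."""
    bits = file_size_gb * 8e9            # assumes 1 GB = 1e9 bytes
    return bits / (bandwidth_mbps * 1e6)  # assumes 1 Mbps = 1e6 bits/s


t = transfer_time_s(2, 155)  # roughly 103 seconds
```

Even in this best case a job that must fetch tens of files remotely spends a long time waiting on the network, which is the cost that replication aims to reduce.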


Network Topology


Parameters

• Job scheduler “QueueAccessCost”
  – Combines data location and queue information
• Sequential access pattern
• 1000 jobs per simulation
• Site policies set according to LCG Memorandum of Understanding
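
The slide says only that the scheduler combines data location and queue information. One plausible way to do that, shown below purely as an assumption, is to score each site by the cost of fetching the new job's missing files plus the same cost summed over the jobs already queued there, and to pick the cheapest site. The unit cost per remote file and all names are hypothetical; this is not the OptorSim implementation.

```python
def access_cost(job_files, site_files, remote_cost=1.0):
    """Cost of a job's file accesses at a site: local files are free, others cost remote_cost."""
    return sum(0.0 if f in site_files else remote_cost for f in job_files)


def pick_site(job_files, sites):
    """Choose the site with the lowest combined cost.

    sites maps site name -> {"files": set of local files,
                             "queue": list of file lists, one per queued job}.
    """
    def total_cost(name):
        site = sites[name]
        queued = sum(access_cost(q, site["files"]) for q in site["queue"])
        return queued + access_cost(job_files, site["files"])

    return min(sites, key=total_cost)
```

With this scoring, a site that already holds the job's files can still lose to an empty site if its queue is long enough, which is the trade-off the scheduler's name suggests.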


Evaluation Metrics

• Different grid users will have different criteria of evaluation

• Metrics used in these summary results:
  – Mean job time
    • Average time taken for job to run, from scheduling to completion
  – Effective Network Usage (ENU)
    • (File requests which use network resources) / (Total number of file requests)
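
Both metrics are simple averages and can be sketched as follows. The record formats are assumptions for this example: each job is a (scheduled, completed) time pair, and each file request is labelled "local" if it was served from the site's own SE and anything else (for instance "remote" or "replicated") if it used the network.

```python
def mean_job_time(jobs):
    """Average of (completion time - scheduling time) over all jobs."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("no jobs recorded")
    return sum(done - scheduled for scheduled, done in jobs) / len(jobs)


def effective_network_usage(requests):
    """Fraction of file requests that used network resources."""
    requests = list(requests)
    if not requests:
        raise ValueError("no file requests recorded")
    networked = sum(1 for kind in requests if kind != "local")
    return networked / len(requests)
```

For example, `effective_network_usage(["local", "local", "remote", "replicated"])` gives 0.5; a lower ENU means more requests were satisfied from local storage.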


Results: Data Replication

• Performance of algorithms measured with varying D

• D varied by reducing dataset size

• 20-25% gain in mean job time as D approaches realistic value


Results: Data Replication

• ENU shows similar gain

• Allows clearer distinction between strategies


Results: Data Replication

• Number of jobs increased to 4000

• Mean job time increases linearly

• Relative improvement as D increases will hold for higher numbers of jobs

• Realistic number of jobs is >O(10000)


Results: Site Policies

• Vary site policies:
  – All Job Types
    • Sites accept jobs from any VO
  – One Job Type
    • Sites accept jobs from one VO
  – Mixed
    • Default
• All Job Types is ~60% faster than One Job Type


Results: Site Policies

• All Job Types also gives ~25% lower ENU than other policies

• Egalitarian approach benefits all grid users


Results: Access Patterns

• Sequential access likely for many HEP applications

• Zipf-like access will also occur
  – Some files accessed frequently, many infrequently

• Replication gives performance gain of ~75% when Zipf access pattern used
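
A Zipf-like pattern can be generated by giving the file of popularity rank r a weight proportional to 1/r^alpha. The sketch below uses the standard-library `random.Random.choices` for the weighted draw; the exponent `alpha`, the seed, and the function names are assumptions for illustration and are not parameters taken from the study.

```python
import random


def zipf_weights(n_files, alpha=1.0):
    """Unnormalised Zipf-like weights: the file of rank r gets weight 1 / r**alpha."""
    return [1.0 / rank**alpha for rank in range(1, n_files + 1)]


def sample_accesses(n_files, n_accesses, alpha=1.0, seed=0):
    """Draw file indices (0 = most popular) following the Zipf-like weights."""
    rng = random.Random(seed)
    return rng.choices(range(n_files), weights=zipf_weights(n_files, alpha), k=n_accesses)


accesses = sample_accesses(n_files=100, n_accesses=1000)
```

Because a few files dominate the requests, a replica of a popular file is reused many times once it is stored locally, which is consistent with the larger gain reported for this access pattern.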


Results: Access Patterns

• ENU also ~75% lower with Zipf access

• Any Zipf-like element makes replication highly desirable

• Size of efficiency gain depends on streaming model, etc.


Conclusions

• OptorSim used to simulate LCG in 2008
• Dynamic data replication reduces running time of simulated grid jobs:
  – 20% reduction with sequential access
  – 75% reduction with Zipf-like access
  – Similar reductions in network usage
• Little difference between replication strategies
  – Simpler LRU, LFU 20-30% faster than economic model

• Site policy which allows all experiments to share resources gives most effective grid use


Replica optimiser architecture

• Access Mediator (AM) - contacts replica optimisers to locate the cheapest copies of files and makes them available locally

• Storage Broker (SB) - manages files stored in SE, trying to maximise profit for the finite amount of storage space available

• P2P Mediator (P2PM) - establishes and maintains P2P communication between grid sites
