

Maximising job throughput using Hyper-Threading

Alastair Dewhurst, Dimitrios Zilaskos, RAL Tier1

Acknowledgements: RAL Tier1 team, especially John Kelly and James Adams

RAL Tier1 Setup

The RAL Tier1 batch farm consists of nodes with multicore, Hyper-Threading-capable CPUs. Increases in the amount of memory per node, combined with experience from other sites, made Hyper-Threading an attractive option for increasing job throughput. RAL supports all LHC VOs, with the prime users being ATLAS, CMS and LHCb, and 10% of resources is devoted to non-LHC VOs. The virtual cores provided by Hyper-Threading could double the batch farm capacity; however, the amount of memory available in the batch nodes did not permit that, as LHC jobs require more than 2 GB RAM to run smoothly.

With all HT cores enabled, total job slot capacity could have doubled. In practice, memory constraints limited the increase to about 30% (see the back-of-the-envelope sketch below).
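A back-of-the-envelope sketch of this constraint, assuming illustrative node figures rather than RAL's actual specification: the number of usable job slots is the smaller of the virtual core count and the RAM-limited slot count.

# Illustrative only: node figures are assumed, not taken from the poster.
def usable_job_slots(physical_cores, ram_gb, ram_per_job_gb=2.0, ht_factor=2):
    virtual_cores = physical_cores * ht_factor      # all HT cores enabled
    ram_limited = int(ram_gb // ram_per_job_gb)     # LHC jobs need roughly 2 GB each
    return min(virtual_cores, ram_limited)

# Example: a dual 6-core node with 32 GB RAM offers 24 virtual cores, but only
# ~16 slots fit in memory, roughly a 30% gain over the 12 physical cores.
print(usable_job_slots(physical_cores=12, ram_gb=32))   # -> 16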

Method

• Each generation of batch farm hardware with Hyper-Threading capability was benchmarked with HEPSPEC, progressively increasing the number of threads up to the total number of virtual cores.
• Benchmarks at that time were conducted on Scientific Linux 5; Scientific Linux 6 benchmarks were run later, as the batch farm was due to be upgraded.
• Scientific Linux 6 performed slightly better than Scientific Linux 5, but the overall trend was identical.
• Power, temperature, disk I/O and batch server performance were closely monitored.
• The results showed a nearly linear increase in HEPSPEC score, flattening at about 14 threads for dual-CPU 4-core nodes and at about 20 threads for dual-CPU 6-core nodes.
• The sweet spots revealed were then configured in the batch farm to discover where production VO jobs perform optimally (a sketch of the benchmark sweep follows).
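A minimal sketch of the thread-count sweep described above. The HEP-SPEC06 runner and its invocation are site-specific, so run_hepspec.sh and its output format here are assumed placeholders, not the actual RAL tooling.

import multiprocessing
import subprocess

def run_hepspec(n_threads):
    # Placeholder wrapper: assumed to launch HEP-SPEC06 with n_threads parallel
    # copies and print the aggregate score as the last line of its output.
    result = subprocess.run(["./run_hepspec.sh", str(n_threads)],
                            capture_output=True, text=True, check=True)
    return float(result.stdout.strip().splitlines()[-1])

scores = {}
max_threads = multiprocessing.cpu_count()        # total virtual cores with HT on
for n in range(8, max_threads + 1):              # sweep up to the full HT count
    scores[n] = run_hepspec(n)
    print(f"{n:2d} threads -> HEPSPEC {scores[n]:.1f}")
# The sweet spot is where the curve flattens (about 14 or 20 threads,
# depending on the hardware generation).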

Figure: HEPSPEC results vs. number of threads (8 to 24) for the 2009-streamline, 2009-viglen, 2010-clustervision, 2010-viglen, 2011-viglen and 2011-dell hardware generations.

Results

• Overall, 2000 extra job slots and 9298 extra HEPSPEC were added to the batch farm using already available hardware.
• Average job time increases, as expected, but overall job throughput increased.
• Network, disk, power and temperature usage did not increase in a way that could negatively affect overall throughput or require additional intervention.
• The batch server was able to handle the extra job slots.
• Of critical importance is the sharp drop in job efficiency as job slots approach the upper Hyper-Threading limit: real-world VO jobs would suffer if we went for full batch farm HEPSPEC performance (see the efficiency sketch below).
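A minimal sketch of how the job efficiency figure can be derived, assuming efficiency is total CPU time divided by total wall-clock time taken from batch accounting records (the record layout here is illustrative, not RAL's accounting schema).

# Efficiency per configuration = sum(CPU time) / sum(wall time) over completed jobs.
def job_efficiency(jobs):
    """jobs: iterable of (cpu_seconds, wall_seconds) tuples from accounting logs."""
    total_cpu = sum(cpu for cpu, wall in jobs)
    total_wall = sum(wall for cpu, wall in jobs)
    return total_cpu / total_wall if total_wall else 0.0

# Example: three one-hour jobs; values close to 1.0 mean the slots are doing
# useful CPU work, while oversubscribed nodes drift down towards ~0.93.
sample = [(3520, 3600), (3495, 3600), (3410, 3600)]
print(round(job_efficiency(sample), 4))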

Figure: Evolution of job efficiency (0.88 to 1.0) as more HT cores are used, for 12 to 24 threads, on dell-2011 and viglen-2011 nodes.

Conclusions

• New procurements now take Hyper-Threading capabilities into account.
• For 2012, dual 8-core CPU systems go up to 32 virtual cores.
• Systems were procured with 128 GB RAM in order to exploit full Hyper-Threading capability (see the sizing check below).
• Dual Gigabit links for now, moving to single 10 Gb links as they become more cost effective.
• So far the software RAID 0 setup has proven sufficient for disk I/O.
• Performance gains so far are on par with previous generations.
• By spending a bit extra on RAM, we save more by buying fewer nodes.
• This also saves machine room space, cables and power.
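A quick sizing check behind the 128 GB choice; the ~2 GB-per-job figure is the rough requirement quoted in the setup section, and the other numbers are the 2012 node figures from the conclusions.

virtual_cores = 32          # dual 8-core CPUs with Hyper-Threading enabled
ram_gb = 128
ram_per_job_gb = 2.0        # rough per-job requirement for LHC workloads
print(ram_gb / virtual_cores)                      # 4.0 GB per virtual core
print(ram_gb // ram_per_job_gb >= virtual_cores)   # True: memory no longer caps slots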

Figure: 2012 procurement benchmarks, HEPSPEC vs. number of threads (16 to 32), for ocf-2012 and dell-2012 nodes.

Figure: Available RAM (GB) per physical core, per virtual core and per job slot for each hardware generation (2008-viglen non-HT, 2009-streamline, 2009-viglen, 2010-clustervision, 2010-viglen, 2011-viglen, 2011-dell).

Figure: Comparison of physical cores, full HT cores and job slots in the optimum setup (0 to 12000 cores).

Figure: Evolution of batch farm size (physical cores, HT cores, job slots; 0 to 16000) from the pre-2012 batch farm to the 2012-present farm.

Figure: Physical cores compared to HT cores (0 to 1400) per hardware generation (2008-viglen non-HT through 2011-dell).

Figure: HEPSPEC percentage increase with full HT (0% to 25%) per hardware generation (2009-streamline through 2011-dell).

Table: Job efficiency and job length per worker node (WN) configuration

Make     Job slots per WN   Efficiency   Avg job length (mins)   Std dev (mins)   Number of jobs
Dell            12            0.9715            297                   370             19065
Viglen          12            0.9757            320                   390             23864
Dell            14            0.9719            238                   326              6118
Viglen          14            0.9767            270                   341             11249
Dell            16            0.9859            343                   254              6550
Viglen          16            0.9850            304                   249              8756
Dell            18            0.9781            377                   390              5014
Viglen          18            0.9808            350                   391              6263
Dell            20            0.9758            318                   346             11339
Viglen          20            0.9756            260                   285             11229
Dell            22            0.9747            387                   315              6317
Viglen          22            0.9783            305                   236              6307
Dell            24            0.9257            544                   373              6650
Viglen          24            0.9311            372                   278              6713
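A short look at the efficiency drop in the table above, using the Dell values copied from it (efficiency read as CPU time over wall time, as in the earlier sketch). The relative drop at 24 slots is what makes running at the full HT limit unattractive despite the higher raw HEPSPEC.

# Dell rows from the table above: job slots per WN -> efficiency.
dell = {12: 0.9715, 14: 0.9719, 16: 0.9859, 18: 0.9781,
        20: 0.9758, 22: 0.9747, 24: 0.9257}
best = max(dell.values())
for slots, eff in sorted(dell.items()):
    print(f"{slots:2d} slots: efficiency {eff:.4f} ({100 * (eff - best) / best:+.1f}% vs best)")
# 24 slots (the full-HT limit) sits about 6% below the best configuration, and the
# table also shows average job length jumping to 544 mins from the 240-390 min
# range seen at lower slot counts.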