

Maximising job throughput using Hyper-Threading

Alastair Dewhurst, Dimitrios Zilaskos, RAL Tier1

Acknowledgements: RAL Tier1 team, especially John Kelly and James Adams

RAL Tier1 Setup

The RAL Tier1 batch farm consists of nodes with multicore, Hyper-Threading-capable CPUs. Increases in the amount of memory per node, combined with experience from other sites, made Hyper-Threading an attractive option for increasing job throughput. RAL supports all LHC VOs, with the prime users being ATLAS, CMS and LHCb, and 10% of resources is devoted to non-LHC VOs. The virtual cores provided by Hyper-Threading could double the batch farm capacity; however, the amount of memory available in the batch nodes did not permit that, as LHC jobs require more than 2 GB RAM to run smoothly.

With all HT cores enabled, total job slot capacity could have doubled. In practice, memory constraints limited the increase to about 30% (see the back-of-the-envelope sketch below).
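A back-of-the-envelope sketch of this constraint, assuming illustrative node figures rather than RAL's actual specification: the number of usable job slots is the smaller of the virtual core count and the RAM-limited slot count.

# Illustrative only: node figures are assumed, not taken from the poster.
def usable_job_slots(physical_cores, ram_gb, ram_per_job_gb=2.0, ht_factor=2):
    virtual_cores = physical_cores * ht_factor      # all HT cores enabled
    ram_limited = int(ram_gb // ram_per_job_gb)     # LHC jobs need roughly 2 GB each
    return min(virtual_cores, ram_limited)

# Example: a dual 6-core node with 32 GB RAM offers 24 virtual cores, but only
# ~16 slots fit in memory, roughly a 30% gain over the 12 physical cores.
print(usable_job_slots(physical_cores=12, ram_gb=32))   # -> 16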

Method

• Each generation of batch farm hardware with Hyper-Threading capability was benchmarked with HEPSPEC, progressively increasing the number of threads up to the total number of virtual cores.
• Benchmarks at that time were conducted on Scientific Linux 5; Scientific Linux 6 benchmarks were run later, as the batch farm was due to be upgraded.
• Scientific Linux 6 performed slightly better than Scientific Linux 5, but the overall trend was identical.
• Power, temperature, disk I/O and batch server performance were closely monitored.
• The results showed a nearly linear increase in HEPSPEC score, flattening at about 14 threads for dual-CPU 4-core nodes and at about 20 threads for dual-CPU 6-core nodes.
• The sweet spots revealed were then configured in the batch farm to discover where production VO jobs perform optimally (a sketch of the benchmark sweep follows).
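A minimal sketch of the thread-count sweep described above. The HEP-SPEC06 runner and its invocation are site-specific, so run_hepspec.sh and its output format here are assumed placeholders, not the actual RAL tooling.

import multiprocessing
import subprocess

def run_hepspec(n_threads):
    # Placeholder wrapper: assumed to launch HEP-SPEC06 with n_threads parallel
    # copies and print the aggregate score as the last line of its output.
    result = subprocess.run(["./run_hepspec.sh", str(n_threads)],
                            capture_output=True, text=True, check=True)
    return float(result.stdout.strip().splitlines()[-1])

scores = {}
max_threads = multiprocessing.cpu_count()        # total virtual cores with HT on
for n in range(8, max_threads + 1):              # sweep up to the full HT count
    scores[n] = run_hepspec(n)
    print(f"{n:2d} threads -> HEPSPEC {scores[n]:.1f}")
# The sweet spot is where the curve flattens (about 14 or 20 threads,
# depending on the hardware generation).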

Figure: HEPSPEC results vs. number of threads (8 to 24) for the 2009-streamline, 2009-viglen, 2010-clustervision, 2010-viglen, 2011-viglen and 2011-dell hardware generations.

Results

• Overall, 2000 extra job slots and 9298 extra HEPSPEC were added to the batch farm using already available hardware.
• Average job time increases, as expected, but overall job throughput increased.
• Network, disk, power and temperature usage did not increase in a way that could negatively affect overall throughput or require additional intervention.
• The batch server was able to handle the extra job slots.
• Of critical importance is the sharp drop in job efficiency as job slots approach the upper Hyper-Threading limit: real-world VO jobs would suffer if we went for full batch farm HEPSPEC performance (see the efficiency sketch below).
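A minimal sketch of how the job efficiency figure can be derived, assuming efficiency is total CPU time divided by total wall-clock time taken from batch accounting records (the record layout here is illustrative, not RAL's accounting schema).

# Efficiency per configuration = sum(CPU time) / sum(wall time) over completed jobs.
def job_efficiency(jobs):
    """jobs: iterable of (cpu_seconds, wall_seconds) tuples from accounting logs."""
    total_cpu = sum(cpu for cpu, wall in jobs)
    total_wall = sum(wall for cpu, wall in jobs)
    return total_cpu / total_wall if total_wall else 0.0

# Example: three one-hour jobs; values close to 1.0 mean the slots are doing
# useful CPU work, while oversubscribed nodes drift down towards ~0.93.
sample = [(3520, 3600), (3495, 3600), (3410, 3600)]
print(round(job_efficiency(sample), 4))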

Figure: Evolution of job efficiency (0.88 to 1.0) as more HT cores are used, for 12 to 24 threads, on dell-2011 and viglen-2011 nodes.

Conclusions

• New procurements now take Hyper-Threading capabilities into account.
• For 2012, dual 8-core CPU systems go up to 32 virtual cores.
• Systems were procured with 128 GB RAM in order to exploit full Hyper-Threading capability (see the sizing check below).
• Dual Gigabit links for now, moving to single 10 Gb links as they become more cost effective.
• So far the software RAID 0 setup has proven sufficient for disk I/O.
• Performance gains so far are on par with previous generations.
• By spending a bit extra on RAM, we save more by buying fewer nodes.
• This also saves machine room space, cables and power.
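A quick sizing check behind the 128 GB choice; the ~2 GB-per-job figure is the rough requirement quoted in the setup section, and the other numbers are the 2012 node figures from the conclusions.

virtual_cores = 32          # dual 8-core CPUs with Hyper-Threading enabled
ram_gb = 128
ram_per_job_gb = 2.0        # rough per-job requirement for LHC workloads
print(ram_gb / virtual_cores)                      # 4.0 GB per virtual core
print(ram_gb // ram_per_job_gb >= virtual_cores)   # True: memory no longer caps slots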

Figure: 2012 procurement benchmarks, HEPSPEC vs. number of threads (16 to 32), for ocf-2012 and dell-2012 nodes.

Figure: Available RAM (GB) per physical core, per virtual core and per job slot for each hardware generation (2008-viglen non-HT, 2009-streamline, 2009-viglen, 2010-clustervision, 2010-viglen, 2011-viglen, 2011-dell).

Figure: Comparison of physical cores, full HT cores and job slots in the optimum setup (0 to 12000 cores).

Figure: Evolution of batch farm size (physical cores, HT cores, job slots; 0 to 16000) from the pre-2012 batch farm to the 2012-present farm.

Figure: Physical cores compared to HT cores (0 to 1400) per hardware generation (2008-viglen non-HT through 2011-dell).

Figure: HEPSPEC percentage increase with full HT (0% to 25%) per hardware generation (2009-streamline through 2011-dell).

Table: Job efficiency and job length per worker node (WN) configuration

Make     Job slots per WN   Efficiency   Avg job length (mins)   Std dev (mins)   Number of jobs
Dell            12            0.9715            297                   370             19065
Viglen          12            0.9757            320                   390             23864
Dell            14            0.9719            238                   326              6118
Viglen          14            0.9767            270                   341             11249
Dell            16            0.9859            343                   254              6550
Viglen          16            0.9850            304                   249              8756
Dell            18            0.9781            377                   390              5014
Viglen          18            0.9808            350                   391              6263
Dell            20            0.9758            318                   346             11339
Viglen          20            0.9756            260                   285             11229
Dell            22            0.9747            387                   315              6317
Viglen          22            0.9783            305                   236              6307
Dell            24            0.9257            544                   373              6650
Viglen          24            0.9311            372                   278              6713
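A short look at the efficiency drop in the table above, using the Dell values copied from it (efficiency read as CPU time over wall time, as in the earlier sketch). The relative drop at 24 slots is what makes running at the full HT limit unattractive despite the higher raw HEPSPEC.

# Dell rows from the table above: job slots per WN -> efficiency.
dell = {12: 0.9715, 14: 0.9719, 16: 0.9859, 18: 0.9781,
        20: 0.9758, 22: 0.9747, 24: 0.9257}
best = max(dell.values())
for slots, eff in sorted(dell.items()):
    print(f"{slots:2d} slots: efficiency {eff:.4f} ({100 * (eff - best) / best:+.1f}% vs best)")
# 24 slots (the full-HT limit) sits about 6% below the best configuration, and the
# table also shows average job length jumping to 544 mins from the 240-390 min
# range seen at lower slot counts.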