Studying a 40 Tb/s data acquisition system for the LHCb experiment - Sébastien VALAT, CERN, 29/01/2018



  • Studying a 40 Tb/s data acquisition system for the LHCb experiment

    Sébastien VALAT - CERN, 29/01/2018

  • Plan

    › Context

    › Event building benchmark

    › First test at scale: hitting the wall

    › Routing and cabling

    › A word on failure recovery & storage

    › Conclusion

    Studying a 40 Tb/s DAQ - Sébastien Valat, 29/01/2018

  • Context


  • Reminder on LHC

    › Accelerator of 27 km
    › ~10,000 superconducting magnets
    › Collision energy up to 14 TeV
    › Proton-proton collisions, but also heavy ions
    › 4 BIG experiments:

    ▪ ALICE, ATLAS, CMS, LHCb


  • What is BIG?

    ATLAS


  • Big detectors as onions


  • What is a collision


  • Most collisions are… "useless"

    › A collision every 25 ns

    › Meaning 40 MHz

    › Most collisions are well-known physics

    › New physics is rare


    (Figure: process rates spanning ~10^9 Hz and ~5·10^6 Hz, through EWK at 20-100 Hz, down to ~10 Hz)

  • LHCb, an upgrade for 2018-2020

    › Update of sub-detectors
    › Removal of hardware trigger

    ▪ Currently in FPGA and analog components

    ▪ Has to answer in real time

    ▪ Hard to maintain and update

    › New status:
    ▪ Larger event rate (1 MHz to 40 MHz)

    ▪ Larger event size (50 KB to ~100 KB)

    › Much more data for DAQ & Trigger (x80 => 40 Tb/s)

    (Diagram: Detector → L0 hardware trigger (40 MHz → 1 MHz) → DAQ → Filter units (High Level Trigger → 20 kHz) → CERN long-term storage → Offline physics analysis)

  • 40 Tb/s for LHCb in 2020


  • Data flow

    › Numbers
    ▪ ~8800 optical links going out from the detector to the surface (~100 m), at up to ~4.8 Gb/s each

    ▪ ~500 readout nodes (up to 24 input links each)

    ▪ Filter farm of O(1000) nodes

    › 40 Tb/s to transport
    ▪ Can run over 512 nodes

    ▪ Need a 100 Gb/s network

    ▪ Considering net. load at 80%

    ▪ So 80 Gb/s per node

    ▪ Current estimation is 70 Gb/s

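The quoted figures are consistent with each other, which a quick back-of-the-envelope check confirms (pure arithmetic on the numbers above, nothing measured here):

```python
# Cross-check of the DAQ bandwidth figures quoted on the slide.

links = 8800        # optical links out of the detector
link_rate = 4.8     # Gb/s upper bound per link
nodes = 512         # event-building nodes
target = 40_000     # Gb/s total (40 Tb/s)

print(f"detector output: up to {links * link_rate / 1000:.1f} Tb/s")
print(f"per-node share of 40 Tb/s: {target / nodes:.1f} Gb/s")  # fits in 80% of 100 Gb/s
```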

  • Starting point : the acquisition board


    Up to 48 fibers and 100 Gb/s


  • Event building network benchmark

  • Event building network topology

    › Work with 4 units:
    ▪ Readout unit (RU) to read data from the PCI-E readout board

    ▪ Builder unit (BU) to merge data from Readout units and send to a Filter unit

    ▪ Filter unit (FU) to select the interesting data (not considered here)

    ▪ Event manager (EM) to dispatch the work over Builder units (credits)

    › RU/BU mostly do an "all-to-all"
    ▪ To aggregate the data chunks from each collision

    (Diagram: the EM distributes credits over nodes 1..500; each node hosts an RU/BU pair and feeds the FUs)
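The interplay of these four units can be sketched in a few lines. This is a toy illustration, not DAQPIPE itself; the names `ru_fragment`, `bu_build`, and `event_manager` are invented for the sketch:

```python
# Toy model of event building: the EM assigns each collision to one BU,
# which gathers the corresponding fragment from every RU (the "all-to-all"
# pattern described above).

N = 4  # toy number of RU/BU pairs (the real system targets ~500)

def ru_fragment(ru, event_id):
    """Chunk recorded by readout unit `ru` for one collision."""
    return f"ev{event_id}:ru{ru}"

def bu_build(event_id):
    """Builder unit gathers one fragment per RU and assembles the event."""
    return [ru_fragment(ru, event_id) for ru in range(N)]

def event_manager(n_events):
    """EM dispatches events round-robin over the BUs; credits (elided here)
    bound how many such gathers run concurrently."""
    return {ev: (ev % N, bu_build(ev)) for ev in range(n_events)}

events = event_manager(8)
assert all(len(frags) == N for _, frags in events.values())
```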

  • IO nodes hardware

    › Three IO boards at 100 Gb/s per node:
    ▪ PCI-40 for fiber input

    ▪ Event building network

    ▪ Output to filter farm

    › Also stresses the memory (320 Gb/s of total traffic)
    ▪ 80 in + 80 out

    (Diagram: per node, a PCI-40 input board, a DAQ network interface, and a filter network interface, each at 80 Gb/s; data is buffered in memory between the CPUs, 80 Gb/s in + 80 Gb/s out)

  • The DAQ network technologies

    › We expect to use HPC networks
    ▪ 100 Gb/s (expect running at 80 Gb/s)

    › Technologies we looked at:
    ▪ InfiniBand EDR (HDR 200 Gb/s?)

    ▪ Intel Omni-Path (version 2, 200 Gb/s?)

    ▪ 100 Gb/s Ethernet

    › Not like HPC apps
    ▪ We need to fully fill the network continuously

    ▪ In full duplex

    ▪ Bad pattern: many gathers (all-to-all)!

    › Think using a fat-tree topology


  • Fat tree


    › Fat tree, theory
    ▪ Based on switches
    ▪ More bandwidth at each level

    › Fat tree, reality
    ▪ Switch radix is currently 36 or 48
    ▪ Count cables for 512 nodes!

    (Diagram: N nodes under leaf switches, connected by spine switches into a fat tree)

  • Switch and cables


    › Example of Mellanox switch (36 ports)

    › Can use a director switch for the spine
    ▪ Provides 648 ports (Intel has up to 754)

    › Cables
    ▪ Copper: up to 3 meters, price O(150 CHF)
    ▪ Optical: up to 100 meters, price O(500 CHF)

  • Interfaces to exploit them

    › MPI
    ▪ Supports all networks
    ▪ Optimized for IB/Omni-Path
    ▪ Eases tests on HPC sites
    ▪ How to support fault tolerance?
    - We need to run 24h/24 and 7d/7

    › InfiniBand VERBS
    ▪ For IB and Ethernet (RoCE / Soft-RoCE / iWARP)
    ▪ Low level
    ▪ OK for fault tolerance

    › Libfabric
    ▪ For IB, Omni-Path, and TCP
    ▪ Low level
    ▪ Node failure and recovery support to be studied

    › TCP/UDP (we don't depend on latency)
    ▪ Issue: CPU usage and memory bandwidth


  • DAQPIPE

    › A benchmark to evaluate event building solutions
    ▪ DAQ Protocol Independent Performance Evaluator

    ▪ Provides EM/RU/BU units

    › Supports various APIs
    ▪ MPI

    ▪ Libfabric

    ▪ PSM2

    ▪ Verbs

    ▪ TCP / UDP

    ▪ RapidIO

    › Multiple protocols
    ▪ PUSH

    ▪ PULL


  • DAQPIPE

    › We need to pack events
    ▪ 100 KB / 512 = ~200 bytes per fragment

    › Three message sizes on the network
    ▪ Command: ~64 B

    ▪ Meta-data: ~10 KB

    ▪ Data: ~512 KB or 1 MB

    (Chart: osu_bibw bandwidth (Gb/s) vs message size (B), for EDR and OPA)
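The packing requirement follows directly from the numbers above; the events-per-message figure below is derived from them, not quoted on the slide:

```python
# Why events must be packed before hitting the network.

event_size = 100_000            # ~100 KB per event...
nodes = 512                     # ...spread over 512 readout nodes
fragment = event_size // nodes  # per-node fragment: ~200 B, far too small to send alone

msg_size = 512 * 1024                  # preferred data message size (~512 KB)
events_per_msg = msg_size // fragment  # thousands of events aggregated per message

print(fragment, events_per_msg)
```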

  • Communication control parameters

    › Managed communication scheduling models
    ▪ Barrel-shift ordering (with N in flight)

    ▪ Random ordering (with N in flight)

    ▪ Send all in one go

    › Message size
    ▪ best ~512 KB

    › Parallel gathers (credits)
    ▪ best ~2

    › Parallel communications per gather
    ▪ With how many nodes we communicate

    ▪ best ~8

    ▪ So in total 2*8 = 16 communications in flight

    › Processes per node, 1 or 2
    ▪ best 2

    * From Bandwidth-optimal all-to-all exchanges in fat tree networks
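The barrel-shift ordering mentioned above can be sketched as follows. This is a model of the schedule, not the actual DAQPIPE implementation:

```python
# Barrel-shift schedule: at step s, builder unit b gathers from readout unit
# (b + s) mod N, so every step is a one-to-one permutation in which no two
# BUs read from the same RU at the same time.

def barrel_schedule(n):
    """List of steps; each step is one (bu, ru) gather pair per BU."""
    return [[(b, (b + s) % n) for b in range(n)] for s in range(n)]

for step in barrel_schedule(8):
    sources = sorted(ru for _, ru in step)
    assert sources == list(range(8))  # each RU is read exactly once per step
```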

  • What we have seen here on our

    cluster with 16 nodes OPA

    (Chart: parameter scan; 87 Gb/s = 2% of the points)

  • Performance stability over

    parameters (16 nodes OPA)


  • First test at scale (512):

    hitting the wall

    From http://www.flickriver.com/photos/tags/supercomputer/interesting/

  • Mapping

    › This cluster had a 2:1 mapping

    › 15 uplinks for 32 nodes

    › We selected 15 nodes per leaf switch

    › "Problem": I don't know which nodes they gave us on each sub-switch

    (Diagram: leaf switches with uplinks to the core)

  • Summary scalability on

    Day-1 ….


  • Day-2, 2 hours before leaving

    try random scheduling

    (Chart: reached 30.37 with random scheduling; annotations: "15 min before leaving", "5 min…")

  • Final summary


  • Another test on two clusters

    (Chart: bandwidth (Gb/s) vs nodes, up to ~140; series DAQPIPE-OPA, DAQPIPE-IB, LSEB-IB, with IB and OPA curves)

  • MPI_Alltoall

    (Chart: MPI_Alltoall bandwidth (Gb/s) vs nodes, up to ~140; series Omni-Path OpenMPI, InfiniBand IntelMPI, Omni-Path IntelMPI)

  • Routing & cabling


  • Micro-benchmarks on OPA

    (Chart: bandwidth (Gb/s) vs node count (16, 32, 64, 128); series one-to-one, many-to-one, gather, gatherWithCmd, gatherWithCmdMeta)

  • Performance of one-to-one

    (Diagram: one-to-one patterns with shift = 1 and shift = 3)

    • Default one-to-one sends to the neighbor node

    • One variant is to shift by more

    • Each shift represents a step in DAQPIPE (barrel shifting)

    • Similar discussion to [1], but with experimental measurements

    [1] Bandwidth-optimal All-to-all Exchanges in Fat Tree Networks, B. Prisacari

  • One-to-one with shifts (InfiniBand)

    › Full-duplex, 16 nodes and full-duplex, 46 nodes

    (Charts: min/average/max bandwidth (Gb/s) vs shift, for 16 and 46 nodes; some shifts drop to 1/2 of the bandwidth)

  • Ideal and real topologies

    (Diagram: the real topology has holes; e.g. positions [01-10,12-18] populated, [11] missing)

  • Cabling on director switch

    Is it better ?

    { 'OmniPth00117501fa000404 L110B':
      { '1': 'node027 hfi1_0', '2': 'node030 hfi1_0', '3': 'node126 hfi1_0',
        '5': 'node036 hfi1_0', '6': 'node031 hfi1_0', '7': 'node123 hfi1_0',
        '8': 'node122 hfi1_0', '9': 'node006 hfi1_0', '10': 'node033 hfi1_0',
        '11': 'node034 hfi1_0', '12': 'node016 hfi1_0', '13': 'node004 hfi1_0',
        '14': 'node002 hfi1_0', '15': 'node125 hfi1_0' },

      'OmniPth00117501fa000404 L118A':
      { '2': 'node116 hfi1_0', '7': 'node008 hfi1_0', '11': 'node021 hfi1_0',
        '15': 'node012 hfi1_0' },

      'OmniPth00117501fa000404 L118B':
      { '2': 'node010 hfi1_0', '3': 'node009 hfi1_0', '5': 'node121 hfi1_0',
        '6': 'node022 hfi1_0', '7': 'node120 hfi1_0', '8': 'node117 hfi1_0',
        '9': 'node020 hfi1_0', '10': 'node011 hfi1_0', '13': 'node115 hfi1_0' },

    Problem: Slurm then numbers ranks by ordering node names

  • Collisions over barrel steps

    On 64 InfiniBand nodes

    (Chart: link collisions per barrel step (0-64); series Real, Clean, Real-renum; up to ~30 collisions per step)

  • Conflicts

    › The real topology implies many conflicts
    › Due to:

    ▪ the routing table

    ▪ missing nodes in leaf switches

    › 64 nodes imply 4096 communications
    ▪ 870 conflicts => 21%, 704 conflicts => 17%, 160 conflicts => 3%

    (Bar chart: link collisions on 64 nodes, clean vs real topology, random vs barrel scheduling; values 160, 559, 704, 870; annotated 21%, 17%, 7%)
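The way such conflicts arise can be illustrated with a toy model. The routing rule below (a packet takes uplink `dst % n_up`) is an assumption for the sketch, not the real routing tables:

```python
# Toy two-level fat tree: two inter-leaf sends of the same barrel step
# conflict when they take the same uplink of the same leaf switch.

from collections import Counter

def conflicts_per_step(n_nodes, leaf_size, n_up, step):
    used = Counter()
    for src in range(n_nodes):
        dst = (src + step) % n_nodes              # barrel-shift destination
        if src // leaf_size != dst // leaf_size:  # traffic leaving its leaf
            used[(src // leaf_size, dst % n_up)] += 1
    return sum(c - 1 for c in used.values() if c > 1)

# With enough uplinks barrel shifting stays conflict-free; halve the uplink
# capacity and conflicts appear, mirroring the effect discussed on the slide.
full = [conflicts_per_step(16, 4, 4, s) for s in range(8)]
half = [conflicts_per_step(16, 4, 2, s) for s in range(8)]
print(full, half)
```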

  • Conflicts on clean topology

    (Chart: conflicts vs nodes (0-200) on the clean topology, up to ~900 conflicts)

  • Conflicts on clean topology

    Percentage…

    (Chart: conflicts as a percentage of communications vs nodes (0-600); stays below ~4.5%)

  • Example on clean topology

    › Switch radix: 36
    › Run on 45 nodes (2 * 18 + 9)
    › Look at step 28 => 9 conflicts

  • Extract two conflicts to understand

    › Example, 2 conflicts of step 28
    ▪ 20 -> 37

    ▪ 29 -> 1

    › They use the same UP link

  • Access again to Marconi

    From 07/2017 (Tech A) & 02/2017 (Tech B)

    (Charts: one-to-one shift on Marconi/OPA, 64 nodes (15 per switch), min/average/max, with the average down to 1/3 of the bandwidth; one-to-one with shift on IB, full-duplex, 46 nodes, at 1/2 of the bandwidth)

  • Looking on routing tables

    • This time I "well" selected the nodes based on the real topology

    • BUT, looking at the routing tables:
    • There is no regular pattern for the uplinks

    • Each leaf switch has a different ordering of the uplinks

  • … Not the same link in use depending on the leaf switch

    (Diagram: access to the second node of the target)

  • Simulation matching

    (Charts: measured one-to-one shift on Marconi/OPA, 64 nodes (15 per switch) vs simulation on the Marconi topology; min/average/max curves match)

  • Performance improvement with

    optimized routing on OPA

    (Chart: DAQPIPE performance, Gb/s vs number of nodes (two ranks per node); series:
    DAQP-SYNC original version code, shortest path, unbalanced tree;
    DAQP-SYNC optimized version (reduced buffer size, windows), shortest path, unbalanced tree;
    DAQP-SYNC optimized version, fat tree, passthrough, balanced tree;
    DAQPIPE original benchmark (avg);
    DAQPIPE original benchmark, shortest path, unbalanced tree)

  • One-to-one scan on Lenovo

    This time with close to perfect cabling

    (Chart: one-to-one shift scan on 64 nodes; average bandwidth (Gb/s) vs shift)

  • Last results on Tech B (LENOVO)

    This time with close to perfect cabling

    (Chart: bandwidth (Gb/s) vs nodes (up to 80), barrel vs random scheduling; reaches 83 Gb/s)

  • Failure recovery


  • A word on failure-recovery

    › Support done for IB verbs
    ▪ Was "easy"

    ▪ Disconnection detection (delay of ~1 s)

    ▪ Reconnection

    › Tried for Omni-Path PSM2
    ▪ Encountered issues

    ▪ Sometimes calls exit() when a remote node stops

    ▪ This propagates errors to other nodes

    ▪ Also had to tweak to reset the connection state
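The verbs handling boils down to detecting a dead peer via a timeout, then reconnecting and retrying. A generic sketch of that loop with plain sockets (illustration only; the real code uses IB verbs, not TCP, and `send_with_recovery` is an invented name):

```python
# Detect-and-reconnect loop: deliver a payload, and on disconnection back
# off for roughly the detection delay before retrying.

import socket
import time

def send_with_recovery(peer, payload, retries=3, timeout=1.0):
    """Try to deliver `payload` to (host, port); reconnect and retry on failure."""
    for _ in range(retries):
        try:
            with socket.create_connection(peer, timeout=timeout) as s:
                s.sendall(payload)
                return True
        except OSError:          # detection: connect or send failed
            time.sleep(timeout)  # back off ~1 s, as on the slide
    return False                 # give up after `retries` attempts
```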

  • Storage


  • First filter: HLT1

    Fast reconstruction and selection, synchronous with DAQ at 30 MHz
    Output: ~1 MHz => ~1 Tb/s => 128 GB/s

    Disk buffer for HLT1-accepted events

    Second filter: HLT2
    Full reconstruction and selection, asynchronous (events from disk)
    Output: ~100 kHz => ~12 Gb/s

    Basic strategy, same as Run 2

    From Tommaso Colombo

  • Event filter: buffer storage

    The disk buffer allows exploiting LHC downtime to maximize event filter farm utilization

    Need a large buffer to absorb long LHC runs: ~100 PB for a week's worth of data

    Currently investigating both centralized and distributed solutions

    Requirements:
    ▪ Must sustain a total of ~150 GB/s input + ~150 GB/s output
    ▪ I/O pattern: 1 sequential read stream + 1 sequential write stream per filter node
    ▪ No need for a file system: an object store is enough
    ▪ Minimal redundancy: some data loss is acceptable
    ▪ Non-uniform data access cost is acceptable: filter nodes should process "local" data first
    ▪ A global namespace is desirable for ease of operation and monitoring

    Online storage: 300 GB/s, ~100 PB, 2-4 k streams
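The "~100 PB for a week" sizing is consistent with the HLT1 output rate quoted above; the days figure below is derived from those two numbers, assuming continuous writing:

```python
# Cross-check of the disk-buffer sizing against the HLT1 output rate.

hlt1_out = 128e9      # B/s written by HLT1 into the buffer
buffer_size = 100e15  # ~100 PB of disk

seconds = buffer_size / hlt1_out
print(f"buffer absorbs ~{seconds / 86400:.1f} days of continuous HLT1 output")
```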

  • Conclusion


  • Conclusion

    › We are really dependent on the cabling & routing
    ▪ An issue when testing on existing supercomputers we do not manage

    ▪ Now better understanding the issue

    › We are discussing with Intel about OPA issues
    ▪ Already got improvements validated up to 64 nodes

    ▪ Thanks to some driver patches

    › InfiniBand results are very good
    ▪ Seen 83 Gb/s on 64 nodes with non-ideal cabling

    ▪ Here we have a chance to get the bandwidth at 512 nodes

    ▪ As a last resort there will be HDR (200 Gb/s)

    › Future work
    ▪ Still have to prove 80 Gb/s at 512 nodes

  • Backup


  • Network topology


  • Memory bandwidth margins

    on 2 Xeon E5-2630 v4

    (Chart: copy bandwidth (Gb/s) vs cores (1-20), for 1 and 2 sockets)

  • Try to generate more steps

    › Random
    ▪ We randomly selected a non-conflicting step (+10%)

    › Barrel-rand:
    ▪ Move conflicts to another step (+23%)

    › Test for 64 nodes

    (Bar chart: generated schedules, required steps for 64 nodes: RANDOM 71, BARREL-RAND 79, BARREL 64)

  • Example with timings

    (Chart: DAQPIPE timings on Tech A; bandwidth (Gb/s) vs delay (us); series timings, no-timings, best)

  • Scaling ideal topology to 512 nodes

    2157 conflicts (0.8%)

    (Chart: conflicts per step, up to 9, across the steps)

  • Htopml integration

    Communication scheduling


  • Memory bandwidth

    (stream)

    (Chart: STREAM memory bandwidth (Gb/s) vs cores (1-28))



  • Fully populating the leaf switches

    › Barrel shifting shows no conflict if:
    ▪ Ideal topology

    ▪ Good routing table

    ▪ Fully populated leaf switches

    › With 36-port switches this implies running with
    ▪ 72 nodes, not 64 (+12%)

    ▪ 540 nodes, not 512 (+5%)

    › With 48-port switches:
    ▪ 96 nodes, not 64 (+50%)

    ▪ 528 nodes, not 512 (+3%)

  • Slurm numbering

    › Slurm can use a topology description to improve placement

    › It provides the list of nodes on the same switch

    › But still, inside a switch it orders ranks by node name

  • IB Single-way, 46 nodes

    (Chart: bandwidth (Gb/s) vs shift on 46 IB nodes, single-way; three unnamed series)

  • Support of AMC40

    Amc40DataSource

  • During the night

    micro-benchmarks


  • Compared to DELL cluster

    Full duplex


  • Micro-benchmarks on IB

    (Chart: bandwidth (Gb/s) vs node count (16, 32, 64, 128); series one-to-one, many-to-one, gather, gatherWithCmd, gatherWithCmdMeta)

  • Why triggering

    › We cannot store all the collisions!
    ▪ Far too much data!

    › Most collisions produce already well-known physics

    › We keep only interesting events for new physics

    › Challenge for the upgrade: need to trigger in software only
    ▪ Need to improve current software performance

    › For some costly functions
    ▪ Look at GPUs

    ▪ Look at possible CPU-embedded FPGAs

  • Stability over time (3 hours)


  • Stability over time (3 hours)


  • Last access to Marconi

    • Tried PSM2 options proposed by Intel

    • It showed some improvements

    • But we are still at 70 Gb/s, not 80…

    From 07/2017

    export PSM2_RTS_CTS_INTERLEAVE=1

    export PSM2_MQ_RNDV_HFI_WINDOW=1048576

    export TMI_CONFIG=$I_MPI_ROOT/etc64/tmi.conf

    (Chart: DAQPIPE scaling, bandwidth (Gb/s) vs nodes (up to 60); series: Default, HFI options)

  • Buffering size issue

    (Chart: bandwidth (Gb/s) vs ruStored scan (0-9000) on 16 EDR nodes (Lenovo), for 15 s, 40 s, and 120 s runs)

  • LHCb


  • RDMA

    Remote Direct Memory Access


    › Communication with Ethernet
    ▪ Traverses the whole IP stack in the OS

    ▪ Internal memory copies (use memory bandwidth & CPU)

    ▪ Higher latency

    › OS bypass in InfiniBand or Omni-Path
    ▪ RDMA: Remote Direct Memory Access

    ▪ No memory copies

    ▪ Lower latency

    ▪ CPU not involved

    (Diagram: with Ethernet, data is copied App Buf → OS → NIC; with RDMA, the NIC reads the App Buf directly, bypassing the OS)

  • DAQPIPE

    › A benchmark to evaluate event building solutions
    ▪ DAQ Protocol Independent Performance Evaluator

    ▪ Provides EM/RU/BU units

    › Supports various APIs
    ▪ MPI

    ▪ Libfabric

    ▪ PSM2

    ▪ Verbs

    ▪ TCP / UDP

    ▪ RapidIO

    › Three message sizes on the network
    ▪ Command: ~64 B

    ▪ Meta-data: ~10 KB

    ▪ Data: ~1 MB

    (Chart: osu_bibw bandwidth (Gb/s) vs message size (B), for EDR and OPA)