Studying a 40 Tb/s data
acquisition system for
the LHCb experiment
Sébastien Valat – CERN, 29/01/2018
Plan
› Context
› Event building benchmark
› First test at scale: hitting the wall
› Routing and cabling
› A word on failure recovery & storage
› Conclusion
Studying a 40 Tb/s DAQ - Sébastien Valat, 29/01/2018
Context
Reminder on LHC
› 27 km accelerator
› ~10,000 superconducting magnets
› Collision energy up to 14 TeV
› Proton-proton collisions, but also heavy ions
› 4 BIG experiments:
▪ ALICE, ATLAS, CMS, LHCb
What is BIG?
ATLAS
Big detectors as onions
What is a collision
Most collisions are… "useless"
› A collision every 25 ns
› Meaning 40 MHz
› Most collisions are well known physics
› New physics is rare
[Plot: event rates per process, spanning from ~10⁹ Hz and ~5·10⁶ Hz down to 20–100 Hz (EWK) and ~10 Hz]
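The 25 ns bunch spacing and the 40 MHz figure are two statements of the same number; a one-line sanity check (my arithmetic on the slide's values):

```python
# The 25 ns bunch spacing directly gives the collision rate.
bunch_spacing_s = 25e-9
collision_rate_hz = 1.0 / bunch_spacing_s
print(collision_rate_hz / 1e6)  # collision rate in MHz
```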
LHCb, an upgrade for 2018-2020
› Update of the sub-detectors
› Removal of the hardware trigger
▪ Currently in FPGAs and analog components
▪ Has to answer in real time
▪ Hard to maintain and update
› New status:
▪ Larger event rate (1 MHz to 40 MHz)
▪ Larger event size (50 kB to ~100 kB)
› Much more data for DAQ & trigger (x80 => 40 Tb/s)
[Diagram: Detector (40 MHz) → L0 hardware trigger (1 MHz) → DAQ → filter units (High Level Trigger, 20 kHz) → CERN long-term storage → offline physics analysis]
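The x80 factor quoted above can be checked by hand; a small sketch using only the slide's numbers (my arithmetic, nothing here is from the real DAQ code):

```python
# Check of the x80 data-volume factor: rate 1 MHz -> 40 MHz, size 50 kB -> ~100 kB.
rate_factor = 40e6 / 1e6        # x40 from removing the hardware trigger
size_factor = 100e3 / 50e3      # x2 from the larger events
total_factor = rate_factor * size_factor

# Resulting raw throughput: events/s * bytes/event * bits/byte -> Tb/s.
new_rate_tbps = 40e6 * 100e3 * 8 / 1e12
print(total_factor, new_rate_tbps)  # x80, and ~32 Tb/s (same order as 40 Tb/s)
```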
40 Tb/s for LHCb in 2020
Data flow
› Numbers:
▪ ~8800 optical links going out from the detector to the surface (~100 m), up to ~4.8 Gb/s each
▪ ~500 readout nodes (up to 24 input links each)
▪ Filter farm of O(1000) nodes
› 40 Tb/s to transport:
▪ Can run over 512 nodes
▪ Need a 100 Gb/s network
▪ Considering a network load of 80%
▪ So 80 Gb/s usable per link
▪ Current estimate is 70 Gb/s per node
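The per-node budget follows from spreading the total rate over the readout nodes; a quick check of the slide's numbers (my arithmetic):

```python
# Per-node event-building bandwidth when 40 Tb/s is spread over 512 nodes,
# compared with what a 100 Gb/s link offers at the assumed 80% load.
total_gbps = 40_000.0
nodes = 512
per_node_gbps = total_gbps / nodes   # useful traffic per node
usable_gbps = 100.0 * 0.80           # usable share of a 100 Gb/s link
print(per_node_gbps, usable_gbps)    # the margin is thin
```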
Starting point: the acquisition board
Up to 48 fibers and 100 Gb/s
Event building network benchmark
Event building network topology
› Works with 4 units:
▪ Readout unit (RU): reads data from the PCIe readout board
▪ Builder unit (BU): merges data from the readout units and sends it to a filter unit
▪ Filter unit (FU): selects the interesting data (not considered here)
▪ Event manager (EM): dispatches the work over the builder units (credits)
› RU/BU mostly perform an "all-to-all":
▪ To aggregate the data chunks from each collision
[Diagram: the EM coordinating ~500 co-located RU/BU pairs (RU-i and BU-i on node i), feeding the FUs]
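The RU/BU all-to-all can be pictured with a toy in-memory model; everything here (names, sizes) is a hypothetical illustration, not DAQPIPE code:

```python
# Toy RU/BU all-to-all: readout unit r holds one fragment of every event;
# the builder unit for event e gathers one fragment from each readout unit.
N = 4  # units in this toy (the real system targets ~500)

# fragments[r][e] = the piece of event e seen by readout unit r
fragments = [[f"ru{r}-ev{e}" for e in range(N)] for r in range(N)]

def build_event(e):
    """BU for event e gathers one fragment from each RU: the all-to-all pattern."""
    return [fragments[r][e] for r in range(N)]

events = [build_event(e) for e in range(N)]
print(events[2])  # fragments of event 2, one per readout unit
```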
IO nodes hardware
› Three IO boards at 100 Gb/s per node:
▪ PCI-40 for the fiber input
▪ Event-building network
▪ Output to the filter farm
› This also stresses the memory (320 Gb/s of total traffic):
▪ 80 Gb/s in + 80 Gb/s out on each side
[Diagram: per node, two PCI-40 inputs, DAQ-network and filter-network interfaces around the CPUs and memory buffers, each path carrying 80 Gb/s in + 80 Gb/s out]
The DAQ network technologies
› We expect to use HPC networks:
▪ 100 Gb/s (expect to run at 80 Gb/s)
› Technologies we looked at:
▪ InfiniBand EDR (HDR 200 Gb/s?)
▪ Intel Omni-Path (version 2, 200 Gb/s?)
▪ 100 Gb/s Ethernet
› Not like HPC apps:
▪ We need to fill the network completely and continuously
▪ In full duplex
▪ Bad pattern: many gathers (all-to-all)!
› Thinking of using a fat-tree topology
Fat tree
› Fat tree, in theory:
▪ Based on switches
▪ More bandwidth at each level
› Fat tree, in reality:
▪ Switch radix is currently 36 or 48
▪ Count the cables for 512 nodes!
[Diagram: two-level fat tree with N nodes, leaf and spine switches; aggregate bandwidth grows (4x) toward the spine]
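To see why the cable count matters, here is a rough count for a full-bandwidth two-level fat tree built from 36-port switches; this is my own arithmetic, assuming half of each leaf's ports face nodes and half face spines:

```python
import math

# Cable count for a 2-level full-bandwidth fat tree (half the leaf ports go
# down to nodes, half up to the spine) -- a simplified sketch.
def fat_tree_cables(nodes, radix=36):
    down = radix // 2                    # node ports per leaf switch
    leaves = math.ceil(nodes / down)     # leaf switches needed
    node_cables = nodes                  # node <-> leaf links
    uplink_cables = leaves * down        # leaf <-> spine links (full bandwidth)
    return leaves, node_cables + uplink_cables

print(fat_tree_cables(512))  # 512 nodes already mean over a thousand cables
```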
Switches and cables
› Example of a Mellanox switch (36 ports)
› Can use a director switch for the spine:
▪ Provides 648 ports (Intel has up to 754)
› Cables:
▪ Copper: up to 3 meters, price O(150 CHF)
▪ Optical: up to 100 meters, price O(500 CHF)
Interfaces to exploit them
› MPI:
▪ Supports all networks
▪ Optimized for IB/Omni-Path
▪ Eases tests on HPC sites
▪ How to support fault tolerance? We need to run 24/7
› InfiniBand verbs:
▪ For IB and Ethernet (RoCE / Soft-RoCE / iWARP)
▪ Low level
▪ OK for fault tolerance
› Libfabric:
▪ For IB, Omni-Path, and TCP
▪ Low level
▪ Node failure and recovery support to be studied
› TCP/UDP (we don't depend on latency):
▪ Issue: CPU usage and memory bandwidth
DAQPIPE
› A benchmark to evaluate event-building solutions:
▪ DAQ Protocol Independent Performance Evaluator
▪ Provides the EM/RU/BU units
› Supports various APIs:
▪ MPI
▪ Libfabric
▪ PSM2
▪ Verbs
▪ TCP / UDP
▪ RapidIO
› Multiple protocols:
▪ PUSH
▪ PULL
DAQPIPE
› We need to pack events:
▪ 100 kB / 512 = ~200 bytes per fragment
› Three message sizes on the network:
▪ Command: ~64 B
▪ Meta-data: ~10 kB
▪ Data: ~512 kB or 1 MB
[Plot: osu_bibw bidirectional bandwidth (Gb/s) vs message size (B), EDR vs OPA]
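The packing requirement can be made concrete: with ~200 B fragments per readout unit, reaching the ~512 kB message size means packing thousands of events per message. A sketch from the slide's numbers (my arithmetic):

```python
# Fragment size per readout unit, and the packing factor needed to reach the
# ~512 kB message size the benchmark found optimal.
event_size_b = 100_000
sources = 512
fragment_b = event_size_b / sources           # ~200 B per RU per event

target_msg_b = 512 * 1024
events_per_message = round(target_msg_b / fragment_b)
print(fragment_b, events_per_message)          # ~200 B; pack ~2700 events
```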
Communication control parameters
› Manages communication scheduling models:
▪ Barrel-shift ordering (with N in flight)
▪ Random ordering (with N in flight)
▪ Send all in one go
› Message size: best ~512 kB
› Parallel gathers (credits): best ~2
› Parallel communications per gather (with how many nodes we communicate at once): best ~8
▪ So in total 2*8 = 16 communications in flight
› Processes per node (1 or 2): best 2
* From Bandwidth-optimal all-to-all exchanges in fat tree networks
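Barrel-shift ordering can be sketched in a few lines: at step s, unit r targets unit (r + s) mod N, so each step is a perfect matching at the endpoints. This is an illustration only, not DAQPIPE's implementation:

```python
# Barrel-shift scheduling: at step s, readout unit r sends to builder
# (r + s) % N, so each step forms a perfect matching (one sender per receiver).
def barrel_schedule(n):
    return [[(r + s) % n for r in range(n)] for s in range(n)]

sched = barrel_schedule(4)
print(sched[1])  # step 1: RU r sends to BU r+1 (mod 4)

# Every step is conflict-free at the endpoints: all targets are distinct.
assert all(len(set(step)) == len(step) for step in sched)
```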
What we have seen on our 16-node OPA cluster
[Plot: parameter-scan results; 87 Gb/s, reached by only 2% of the points]
Performance stability over the parameters (16-node OPA cluster)
First test at scale (512):
hitting the wall
From http://www.flickriver.com/photos/tags/supercomputer/interesting/
Mapping
› This cluster had a 2:1 mapping (over-subscribed)
› 15 up-links for 32 nodes
› We selected 15 nodes per leaf switch
› "Problem": we did not know which nodes we were given on each leaf switch
[Diagram: leaf switches connected to the core switch]
Scalability summary on Day 1…
Day 2, two hours before leaving: trying random scheduling
[Plot annotations: 30.37 Gb/s, "15 min before leaving", "5 min…"]
Final summary
Another test on two clusters
[Plot: bandwidth (Gb/s) vs number of nodes (0–140) for DAQPIPE-OPA, DAQPIPE-IB, and LSEB-IB]
MPI_Alltoall
[Plot: MPI_Alltoall bandwidth (Gb/s) vs number of nodes (0–140) for Omni-Path OpenMPI, InfiniBand IntelMPI, and Omni-Path IntelMPI]
Routing & cabling
Micro-benchmarks on OPA
[Plot: bandwidth (Gb/s) vs scale (16–128 nodes) for one-to-one, many-to-one, gather, gatherWithCmd, and gatherWithCmdMeta]
Performance of one-to-one
[Diagrams: shift = 1 and shift = 3]
• By default, one-to-one sends to the neighbor node
• One variant is to shift further
• Each shift represents a step in DAQPIPE (barrel shifting)
• Similar discussion to [1], but with experimental measurements
[1] Bandwidth-optimal All-to-all Exchanges in Fat Tree Networks, B. Prisacari
29/01/2018
One-to-one with shifts
InfiniBand
› Full duplex, 16 nodes (left) and 46 nodes (right)
[Plots: bandwidth (Gb/s) vs shift, min/average/max; the minimum drops to 1/2 of the full bandwidth]
Ideal and real topologies
[Diagram: ideal vs real topology; the real leaf switches have holes, annotated "Holes [01-10,12-18]" and "[11]"]
Cabling on director switch
Is it better?
'OmniPth00117501fa000404 L110B' (all ports are hfi1_0):
  '1': node027   '2': node030   '3': node126   '5': node036
  '6': node031   '7': node123   '8': node122   '9': node006
  '10': node033  '11': node034  '12': node016  '13': node004
  '14': node002  '15': node125
'OmniPth00117501fa000404 L118A':
  '2': node116   '7': node008   '11': node021  '15': node012
'OmniPth00117501fa000404 L118B':
  '2': node010   '3': node009   '5': node121   '6': node022
  '7': node120   '8': node117   '9': node020   '10': node011
  '13': node115
Studying a 40 Tb/s DAQ - Sébastien Valat 37
Problem: Slurm then numbers the ranks by ordering node names
29/01/2018
Collisions over barrel steps
On 64 InfiniBand nodes
[Plot: collisions per step ID (0–64) for the Real, Clean, and Real-renumbered topologies]
Conflicts
› The real topology implies many conflicts
› Due to:
▪ the routing tables
▪ missing nodes in leaf switches
› 64 nodes imply 4096 communications:
▪ 870 conflicts => 21%, 704 conflicts => 17%, 160 conflicts => 3%
[Bar chart: link collisions on 64 nodes, random vs barrel scheduling; real topology: 870 (21%) and 704 (17%); clean topology: 559 and 160]
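A toy model makes these conflict counts plausible. It uses a simple deterministic up-link rule (destination mod number of up-links); the real fabric's routing tables differ, so this is only a sketch, but it reproduces the qualitative behavior: barrel shifting is conflict-free on an ideal full-bandwidth tree, and conflicts appear as soon as the up-links are oversubscribed:

```python
# Toy model of up-link conflicts in a 2-level fat tree with static routing:
# a flow leaving a leaf for destination d uses up-link d % uplinks
# (a simplified deterministic rule, not the real fabric's tables).
def count_uplink_conflicts(nodes, per_leaf, uplinks):
    conflicts = 0
    for s in range(1, nodes):                 # barrel-shift steps
        used = {}                             # (leaf, uplink) -> flow count
        for r in range(nodes):
            dst = (r + s) % nodes
            src_leaf, dst_leaf = r // per_leaf, dst // per_leaf
            if src_leaf != dst_leaf:          # only inter-leaf flows go up
                key = (src_leaf, dst % uplinks)
                used[key] = used.get(key, 0) + 1
        conflicts += sum(c - 1 for c in used.values() if c > 1)
    return conflicts

# Full bandwidth (16 up-links per 16-node leaf) vs 2:1 over-subscription.
print(count_uplink_conflicts(64, 16, 16), count_uplink_conflicts(64, 16, 8))
```

Under this rule the full-bandwidth barrel schedule shows zero up-link conflicts, while halving the up-links (the 2:1 mapping of the test cluster) immediately produces them.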
Conflicts on clean topology
[Plot: conflicts vs number of nodes (0–200) on a clean topology]
Conflicts on clean topology, as a percentage
[Plot: conflict percentage (0–4.5%) vs number of nodes (0–500)]
Example on clean topology
› Switch radix: 36
› Run on 45 nodes (2 × 18 + 9)
› Looking at step 28 => 9 conflicts
Extract two conflicts to understand
› Example: 2 conflicts of step 28
▪ 20 -> 37
▪ 29 -> 1
› They use the same up-link
Access again to Marconi
From 07/2017 (Tech A) & 02/2017 (Tech B)
[Plots: one-to-one shift scan on Marconi/OPA, 64 nodes (15 nodes per switch), min/average/max, with the minimum at 1/3 of the bandwidth; one-to-one with shift on IB (full duplex, 46 nodes), with the minimum at 1/2 of the bandwidth]
Looking at the routing tables
• This time the nodes were "well" selected, based on the real topology
• BUT, looking at the routing tables:
• There is no regular pattern for the up-links
• Each leaf switch orders its up-links differently
… a different up-link is used depending on the leaf switch
[Diagram: accessing the second node of the target]
Simulation matching
[Plots: measured one-to-one shift scan on Marconi/OPA, 64 nodes (15 nodes per switch) vs a simulation on the Marconi topology; the min/average/max curves match]
Performance improvement with optimized routing on OPA
[Plot: DAQPIPE performance (Gb/s) vs number of nodes (two ranks per node), comparing:
– DAQP-SYNC original code, shortest-path routing, unbalanced tree
– DAQP-SYNC optimized version (reduced buffer size, windows), shortest-path routing, unbalanced tree
– DAQP-SYNC optimized version, fat-tree routing, passthrough, balanced tree
– DAQPIPE original benchmark (avg)
– DAQPIPE original benchmark, shortest-path routing, unbalanced tree]
One-to-one scan on Lenovo
This time with close to perfect cabling
[Plot: average bandwidth (Gb/s) vs shift, one-to-one shift scan on 64 nodes]
Last results on Tech B (LENOVO)
This time with close to perfect cabling
[Plot: bandwidth (Gb/s) vs number of nodes, barrel vs random scheduling, reaching 83 Gb/s]
Failure recovery
A word on failure-recovery
› Support implemented for IB verbs:
▪ Was "easy"
▪ Disconnection detection (delay of ~1 s)
▪ Reconnection
› Tried for Omni-Path PSM2:
▪ Encountered issues
▪ Sometimes calls exit() when a remote node stops
▪ This propagates errors to the other nodes
▪ Also had to tweak things to reset the connection state
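The detect-and-reconnect pattern described above can be sketched generically; this sketch uses a plain TCP socket as a stand-in for an IB-verbs connection, so the function and parameters are illustrative only:

```python
import socket
import time

# Generic detect-and-reconnect sketch (TCP stand-in for an IB-verbs queue pair):
# try to send, and on a broken or refused connection wait and reconnect.
def send_with_reconnect(addr, payload, connect=socket.create_connection,
                        retry_delay=1.0, max_retries=3):
    for _ in range(max_retries):
        try:
            with connect(addr, timeout=1.0) as s:  # (re)establish the connection
                s.sendall(payload)
                return True
        except OSError:                            # peer down: wait, then retry
            time.sleep(retry_delay)
    return False                                   # give up after max_retries
```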
Storage
First filter: HLT1
– Fast reconstruction and selection
– Synchronous with the DAQ at 30 MHz
– Output: ~1 MHz, i.e. ~1 Tb/s => 128 GB/s
Disk buffer for HLT1-accepted events
Second filter: HLT2
– Full reconstruction and selection
– Asynchronous (events read from disk)
– Output: ~100 kHz => ~12 Gb/s
Basic strategy: same as Run 2
From Tommaso Colombo
Event filter: buffer storage
The disk buffer allows exploiting LHC downtime
– Maximizes event-filter-farm utilization
– Need a large buffer to absorb long LHC runs: ~100 PB for a week's worth of data
– Currently investigating both centralized and distributed solutions
Requirements:
– Must sustain a total of ~150 GB/s input + ~150 GB/s output
– I/O pattern: 1 sequential read stream + 1 sequential write stream per filter node
– No need for a file system: an object store is enough
– Minimal redundancy: some data loss is acceptable
– Non-uniform data access costs are acceptable: filter nodes should process "local" data first
– A global name-space is desirable for ease of operation and monitoring
[Summary: online storage, 300 GB/s, ~100 PB, 2–4k streams]
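The ~100 PB figure is consistent with the stated rates; a quick check (my arithmetic on the slide's numbers):

```python
# Sanity check of the ~100 PB buffer: HLT1 output rate times a week of data.
rate_gb_s = 150.0                  # ~150 GB/s written into the buffer
seconds_per_week = 7 * 24 * 3600
buffer_pb = rate_gb_s * seconds_per_week / 1e6   # GB -> PB
print(buffer_pb)                   # ~90 PB, consistent with the ~100 PB estimate
```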
Conclusion
Conclusion
› We are really dependent on the cabling & routing:
▪ An issue when testing on existing supercomputers we do not manage
▪ We now understand the issue much better
› We are discussing the OPA issues with Intel:
▪ Already got an improvement, validated up to 64 nodes
▪ Thanks to some driver patches
› The InfiniBand results are very good:
▪ Saw 83 Gb/s on 64 nodes with non-ideal cabling
▪ Here we have a chance to get the bandwidth at 512 nodes
▪ As a last resort there will be HDR (200 Gb/s)
› Future work:
▪ Still have to prove 80 Gb/s at 512 nodes
Backup
Network topology
Memory bandwidth margins on 2× Xeon E5-2630 v4
[Plot: copy bandwidth (Gb/s) vs number of cores (1–20), 1 socket vs 2 sockets]
Try to generate more steps
› Random:
▪ We randomly select a non-conflicting step (+10% steps)
› Barrel-rand:
▪ Move the conflicts to another step (+23% steps)
› Tested for 64 nodes
[Bar chart: steps required by the generated schedules: random 71, barrel-rand 79, barrel 64]
Example with timings
29/01/2018 Studying a 40 Tb/s DAQ - Sébastien Valat 62
0
10
20
30
40
50
60
70
0 100 200 300 400 500 600 700
Ban
dw
idth
(G
b/s
)
Deltay (us)
DAQPIPE timings on Tech A
timings no-timings best
Scaling ideal topology to 512 nodes
2157 conflicts (0.8%)
[Plot: conflicts per step]
Htopml integration
Communication scheduling
Memory bandwidth (STREAM)
[Plot: bandwidth (Gb/s) vs number of cores (1–28)]
Fully populating the leaf switches
› Barrel shifting shows no conflicts if:
▪ Ideal topology
▪ Good routing table
▪ Fully populated leaf switches
› With 36-port switches this implies running with:
▪ 72 nodes, not 64 (+12%)
▪ 540 nodes, not 512 (+5%)
› With 48-port switches:
▪ 96 nodes, not 64 (+50%)
▪ 528 nodes, not 512 (+3%)
Slurm numbering
› Slurm can use a topology file to improve placement
› It provides the list of nodes on the same switch
› But inside a switch it still orders the ranks by node name
IB Single-way, 46 nodes
[Plot: bandwidth (Gb/s) vs shift, three series]
Support of AMC40
Amc40DataSource
Micro-benchmarks run during the night
Compared to DELL cluster
Full duplex
Micro-benchmarks on IB
[Plot: bandwidth (Gb/s) vs scale (16–128 nodes) for one-to-one, many-to-one, gather, gatherWithCmd, and gatherWithCmdMeta]
Why triggering
› We cannot store all the collisions!
▪ Far too much data!
› Most collisions produce already well-known physics
› We keep only the events interesting for new physics
› Challenge for the upgrade: need to trigger in software only
▪ Need to improve the current software performance
› For some costly functions:
▪ Looking at GPUs
▪ Looking at possible CPU-embedded FPGAs
Stability over time (3 hours)
Last access to Marconi
• Tried the PSM options proposed by Intel
• They showed some improvements
• But we are still at 70 Gb/s, not 80…
From 07/2017
export PSM2_RTS_CTS_INTERLEAVE=1
export PSM2_MQ_RNDV_HFI_WINDOW=1048576
export TMI_CONFIG=$I_MPI_ROOT/etc64/tmi.conf
[Plot: DAQPIPE scaling, bandwidth (Gb/s) vs nodes, default vs tuned HFI options]
Buffering size issue
[Plot: ruStored scan on 16 EDR nodes (Lenovo): bandwidth (Gb/s) vs ruStored, for 15 s / 40 s / 120 s runs]
LHCb
RDMA
Remote Direct Memory Access
› Communication with Ethernet:
▪ Traverses the whole IP stack in the OS
▪ Internal memory copies (use memory bandwidth & CPU)
▪ Higher latency
› OS bypass with InfiniBand or Omni-Path:
▪ RDMA: Remote Direct Memory Access
▪ No memory copies
▪ Lower latency
▪ CPU not involved
[Diagram: with Ethernet, data is copied through OS buffers between the application and the NIC; with RDMA, the NIC reads the application buffer directly]