Achieving High Performance with TCP over 40GbE on NUMA Architectures for CMS Data Acquisition
Andrea Petrucci – CERN (PH/CMD)
The 19th Real Time Conference, 27 May 2014, Nara, Japan
The 19th Real Time Conference, 27 May 2014, Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci, CERN (PH/CMD)
Outline
New Processor Architectures
DAQ at the CMS Experiment
CMS Online Software Framework (XDAQ): architecture foundation, memory and thread management, data transmission
TCPLA – TCP Layered Architecture: motivation and architecture
Performance Tuning
Preliminary Results
Summary
New Processor Architectures
In the mid-2000s, processor frequencies stabilized and the number of cores per processor started to increase
The “golden” era for software developers ended in 2004, when Intel cancelled its high-performance uniprocessor projects: “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” (Herb Sutter, 2005)
Software designed with concurrency in mind makes more efficient use of the new processors
Processor evolution (source: Sam Naffziger, AMD)
Non-Uniform Memory Access (NUMA)
Non-Uniform Memory Access (NUMA)
The distance between processing cores and memory or I/O interrupts varies within Non-Uniform Memory Access (NUMA) architectures
[Diagram: four CPUs, each with its own memory controller and locally attached I/O devices; access latency depends on which node owns the memory]
CMS Experiment
CMS DAQ Requirements for LHC Run 2
Parameters:
Data Sources (FEDs): ~620
Trigger levels: 2
First level rate: 100 kHz
Event size: 1 to 2 MB
Readout throughput: 200 GB/s
High Level Trigger rate: 1 kHz
Storage bandwidth: 2 GB/s
CMS DAQ System for LHC Run 2
[Diagram: Readout Unit with two CPU sockets and two I/O hubs, 40 Gb/s Ethernet input and 56 Gb/s Infiniband FDR output]
CMS Online Software Framework (XDAQ)
XDAQ Framework
XDAQ is a software platform created specifically for the development of distributed data acquisition systems
Implemented in C++, developed by the CMS DAQ group
Provides platform independent services, tools for inter-process communication, configuration and control
Builds upon industrial standards, open protocols and libraries, and is designed according to the object-oriented model
For further information about XDAQ see:
J. Gutleber, S. Murray and L. Orsini, “Towards a homogeneous architecture for high-energy physics data acquisition systems”, Computer Physics Communications, vol. 153, issue 2, pp. 155–163, 2003. http://www.sciencedirect.com/science/article/pii/S0010465503001619
XDAQ Architecture Foundation

[Diagram: the Core Executive exposes a Plugin Interface hosting Application Plugins; it is driven by an XML Configuration, controlled through SOAP, HTTP and RESTful interfaces, and provides Hardware Access (VME, PCI, memory) plus common Protocols and Formats]

Uniform building blocks: one or more executives per computer contain application and service components
Memory – Memory Pools

[Diagram: CPU drawing cached buffers from a memory pool]

All memory allocation can be bound to a specific NUMA node
Memory pools allocate and cache memory using log2 best-fit allocation
No fragmentation of memory over long runs
Buffers are recycled during runs: constant retrieval time
Memory – Buffer Loaning

Buffer loaning allows zero-copy passing of data between software layers and processes
[Diagram: Step 1 – Task A allocates a reference; Step 2 – Task A loans the reference to Task B; Step 3 – Task B releases the reference]
Multithreading

[Diagram: an application thread assigns work to workloop threads, which interact with the CPU and I/O controllers]

Workloops provide easy use of threads
Work is assigned by the application
Workloops can be bound to specific CPU cores by configuration
Data Transmission – Peer Transports

[Diagram: user applications communicate logically at the protocol level; the XDAQ framework routes the data through peer transports and the NIC]

The application uses the network transparently through the XDAQ framework
The user application is network- and protocol-independent
Routing is defined by the XDAQ configuration
Connections are set up through a peer-to-peer model
TCP Layered Architecture
Motivation
Lessons learned from using the previous TCP peer transports in a real environment led to a more advanced design that copes with:
Separation of TCP socket handling from XDAQ protocols
Exploitation of a multi-threading environment
Optimisation through configurability to improve performance, e.g. grouping sockets and associating different threads with different roles
TCPLA is inspired by uDAPL (Direct Access Programming Library)
Developed by DAT collaborative http://www.datcollaborative.org/
Defines user Direct Access Programming Library (uDAPL) and kernel Direct Access Programming Library (kDAPL) APIs
TCPLA is inspired by the uDAPL specification, in particular the send/receive semantics it describes
What is the TCP Layered Architecture
TCP Layered Architecture (TCPLA) is a lightweight, transport and platform independent user-level library for handling socket processing
Provides the user with an event driven model of handling network communications
All calls to send and receive data are performed asynchronously
Two peer transports have been developed:
ptFRL is specific to reading TCP/IP streams from FEROLs in the RU machine (FEROL protocol)
ptUTCP is for general-purpose data flow (general protocols)
TCPLA Architecture Principles (I)
[Diagram: the peer transport posts pre-allocated buffers to the receive queue and filled buffers to the send queue; the completion event queue delivers “received buffer”, “sent buffer”, “connection request” and “connection established” events to the event handler]
TCPLA Architecture Principles (II)
[Diagram: receive, send and event workloops, each handling socket I/O]
Associate 1..N sockets per workloop
Performance Tuning
Performance Factors
CPU affinity
I/O Interrupt affinity
Memory affinity
TCP Custom Kernel Settings
Readout Unit Machine Affinities

[Diagram: Readout Unit with two NUMA nodes of 16 GB each. CPU socket 0 holds the even cores 0, 2, 4, 6, 8, 10, 12, 14; CPU socket 1 holds the odd cores 1, 3, 5, 7, 9, 11, 13, 15. Cores are assigned to: 40 GbE I/O interrupts, 40 GbE socket reading, Infiniband I/O interrupts, DAQ threads, DAQ memory allocation, and cores left available to the operating system. 40 Gb/s Ethernet and 56 Gb/s Infiniband attach to the I/O hubs]
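Affinity schemes like this are typically applied with standard Linux tools; a hedged sketch, in which the IRQ number, core masks, binary name and thread ID are all placeholders, not the actual CMS values:

```shell
# Pin NIC interrupts to a chosen core (IRQ number is a placeholder).
echo 2 > /proc/irq/90/smp_affinity        # mask 0x2 = core 1

# Launch a DAQ process with its threads and memory bound to NUMA node 1
# (binary name is a placeholder).
numactl --cpunodebind=1 --membind=1 ./readout_unit

# Pin an already-running thread (TID is a placeholder) to core 3.
taskset -pc 3 12345
```

Interrupt, thread, and memory affinity must agree: pinning the socket-reading threads to one node while the NIC's interrupts land on the other reintroduces the cross-node traffic the scheme is meant to avoid.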
Preliminary Results
P2P Example: Out of the Box vs Affinity
[Plot: out-of-the-box vs affinity-tuned throughput; link saturation is reached with affinity tuning]
FEROL Aggregation into One RU
Aggregation n-to-1, for example:
12 streams from 12 FEROLs, each sending fragments of 2 to 4 kB
16 streams from 8 FEROLs, each sending fragments of 1 to 2 kB
Concentrated into one 40 GbE NIC in the RU PC
Mellanox SX1024 switch with 48 ports at 10 GbE and 12 ports at 40 GbE
Reliability and congestion are handled by TCP/IP
Pause frames are enabled in the 10/40 GbE switch
DAQ Test Bed
The final performance indicator for TCPLA is the performance achieved while executing simultaneous input and output in the RU
Test bed with up to 47 FEROLs (10 GbE) as input
[Diagram: Readout Unit with two CPUs and I/O hubs, 40 Gb/s Ethernet input and 56 Gb/s Infiniband FDR output]
Test with 1 Data Source from FEROL (I)
[Plot: 40 GbE throughput]
Test with 1 Data Source from FEROL (II)
[Plot: 40 GbE throughput, showing the working range and 90% efficiency of 40 GbE at 4.5 GB/s]
Test with 2 Data Sources from FEROL
[Plot: 40 GbE throughput with two data sources, showing the working range and 90% efficiency of 40 GbE at 4.5 GB/s]
Summary
TCPLA is based on the standard socket library with optimizations for NUMA environments using the XDAQ framework
It has been developed to enable high performance for critical applications in the new CMS DAQ system by exploiting multi-core architectures
It will be used for both data flow and monitoring
Questions?
TCPLA is a Powerful Tool

Configurable options include: Send Timeout, Receive Timeout, Affinity, Single-Threaded Dispatcher, Multi-Threaded Dispatcher, Poll, Select, Subnet Scanning, Port Scanning, TCP, SDP, Blocking, Non-Blocking, Connect On Request, Polling Cycle, Datagram, Auto Connect, FRL, I2O, B2IN, I/O Queue Size, Polling Work Loop, Waiting Work Loop, Event Queue Size, Receive Buffers, Receive Block Size, Max Clients, Max FD
Backup Slides
Software Architecture
[Diagram: PeerTransport, EventHandler, InterfaceAdapter, EndPoint, PublicServicePoint and Dispatcher components, with Poll and Select strategies]
Connecting
[Sequence diagram: the InterfaceAdapter issues the connect and waits for establishment; the established event is pushed into the EventQueue and consumed by the EventHandler. Legend: thread, mutex-less (ml), asynchronous (a)]
Sending
[Sequence diagram: a buffer to send is pushed into the send queue through the InterfaceAdapter; after checking that sending is possible, the buffer is popped from the send queue and sent, and a sent event is pushed into the EventQueue and consumed by the EventHandler. Legend: thread, mutex-less (ml), asynchronous (a)]
Receiving
[Sequence diagram: the Dispatcher checks for incoming packets, gets a free buffer from the PublicServicePoint, performs the socket receive, and pushes a receive event into the EventQueue; the EventHandler consumes the event, dispatches the data as a reference, and provides a free buffer back. Legend: thread, mutex-less (ml), asynchronous (a)]