Achieving High Performance with TCP over 40GbE on NUMA Architectures for CMS Data Acquisition. Andrea Petrucci – CERN (PH/CMD). The 19th Real Time Conference, 27 May 2014, Nara, Japan


Page 1: RT2014_TCPLA_Nara_27052014-V1

Achieving High Performance with TCP over 40GbE on NUMA Architectures for CMS Data Acquisition

Andrea Petrucci – CERN (PH/CMD)

The 19th Real Time Conference, 27 May 2014, Nara, Japan

Page 2: RT2014_TCPLA_Nara_27052014-V1

The 19th Real Time Conference, 27 May 2014, Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci, CERN (PH/CMD)

Outline

New Processor Architectures

DAQ at the CMS Experiment

CMS Online Software Framework (XDAQ) Architecture Foundation Memory and Thread Managements Data Transmission

TCPLA - TCP Layered Architecture Motivation Architecture

Performance Tuning

Preliminary Results

Summary

Page 3: RT2014_TCPLA_Nara_27052014-V1


New Processor Architectures

In the mid-2000s, processor clock frequencies plateaued and the number of cores per processor started to increase

The “golden” era for software developers ended in 2004, when Intel cancelled its high-performance uniprocessor projects: “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” (Herb Sutter, 2005)

Software designed with concurrency in mind resulted in more efficient use of new processors

Processor evolution (source: Sam Naffziger, AMD)

Page 4: RT2014_TCPLA_Nara_27052014-V1


Non-Uniform Memory Access (NUMA)

The distance between processing cores and memory or I/O interrupts varies within Non-Uniform Memory Access (NUMA) architectures

[Diagram: four CPUs, each with its own memory controller and attached I/O devices; core-to-memory and core-to-I/O distances vary across the system]

Page 5: RT2014_TCPLA_Nara_27052014-V1

CMS Experiment

Page 6: RT2014_TCPLA_Nara_27052014-V1


CMS DAQ Requirements for LHC Run 2

Parameters

Data Sources (FEDs): ~620
Trigger levels: 2
First Level rate: 100 kHz
Event size: 1 to 2 MB
Readout Throughput: 200 GB/s
High Level Trigger rate: 1 kHz
Storage Bandwidth: 2 GB/s

Page 7: RT2014_TCPLA_Nara_27052014-V1


CMS DAQ System for LHC Run 2

[Diagram: Readout Unit with two CPU sockets and I/O, 40 Gb/s Ethernet input, 56 Gb/s Infiniband FDR output]

Readout Unit

Page 8: RT2014_TCPLA_Nara_27052014-V1

CMS Online Software Framework (XDAQ)

Page 9: RT2014_TCPLA_Nara_27052014-V1


XDAQ Framework

XDAQ is a software platform created specifically for the development of distributed data acquisition systems

Implemented in C++, developed by the CMS DAQ group

Provides platform independent services, tools for inter-process communication, configuration and control

Builds upon industrial standards, open protocols and libraries, and is designed according to the object-oriented model

For further information about XDAQ see:

J. Gutleber, S. Murray and L. Orsini, “Towards a homogeneous architecture for high-energy physics data acquisition systems”, Computer Physics Communications, vol. 153, issue 2, pp. 155-163, 2003. http://www.sciencedirect.com/science/article/pii/S0010465503001619

Page 10: RT2014_TCPLA_Nara_27052014-V1


XDAQ Architecture Foundation

Core Executive

Plugin Interface

Page 11: RT2014_TCPLA_Nara_27052014-V1


Application Plugin

Core Executive

Plugin Interface

XDAQ Architecture Foundation

XML Configuration

Page 12: RT2014_TCPLA_Nara_27052014-V1


XDAQ Architecture Foundation

XML Configuration

SOAP

HTTP

RESTful

Control and Interface

Application Plugin

Core Executive

Plugin Interface

Page 13: RT2014_TCPLA_Nara_27052014-V1


XDAQ Architecture Foundation

XML Configuration

SOAP

HTTP

RESTful

Control and Interface

Application Plugin

Core Executive

Plugin Interface

Hardware Access

VME, PCI, Memory

Page 14: RT2014_TCPLA_Nara_27052014-V1


XDAQ Architecture Foundation

XML Configuration

SOAP

HTTP

RESTful

Control and Interface

Application Plugin

Core Executive

Plugin Interface

Hardware Access

VME, PCI, Memory

Protocols and Formats

Uniform building blocks - One or more executives per computer contain application and service components

Page 15: RT2014_TCPLA_Nara_27052014-V1



Memory – Memory Pools

Memory Pool

Cached buffers

All memory allocation can be bound to a specific NUMA node

Memory pools allocate and cache memory using log2 best-fit allocation

No fragmentation of memory over long runs

Buffers are recycled during runs: constant time for retrieval
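The pool behaviour above can be sketched in a few lines; this is a minimal illustration of log2 best-fit allocation with buffer recycling, not the actual XDAQ allocator (the class and method names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Minimal sketch of a log2 best-fit, buffer-recycling pool.
class Log2Pool {
public:
    // Round the request up to the next power of two and hand out a
    // cached buffer if one is free, otherwise allocate a new one.
    std::vector<char>* allocate(std::size_t size) {
        std::size_t bucket = roundUpPow2(size);
        auto& freeList = free_[bucket];
        if (!freeList.empty()) {
            std::vector<char>* buf = freeList.back();
            freeList.pop_back();
            return buf;                        // recycled: constant time
        }
        return new std::vector<char>(bucket);  // first use: real allocation
    }

    // Return the buffer to the pool instead of freeing it, so long
    // runs never fragment the heap.
    void release(std::vector<char>* buf) {
        free_[buf->size()].push_back(buf);
    }

    static std::size_t roundUpPow2(std::size_t n) {
        std::size_t p = 1;
        while (p < n) p <<= 1;
        return p;
    }

private:
    std::map<std::size_t, std::vector<std::vector<char>*>> free_;
};
```

A 1500-byte request is served from the 2048-byte bucket, so a later 2000-byte request can reuse the exact same buffer.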

Page 16: RT2014_TCPLA_Nara_27052014-V1


Memory – Buffer Loaning

Buffer loaning allows zero-copy transfer of data between software layers and processes:

Step 1: Task A allocates a buffer
Step 2: Task A loans the reference to Task B
Step 3: Task B releases the reference
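The three steps can be modeled with reference counting; a minimal sketch, in which a `shared_ptr` deleter stands in for returning the buffer to its pool (names are illustrative, not the XDAQ API):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

using Buffer = std::vector<char>;

int returned = 0;  // counts buffers handed back to the "pool"

// Task A allocates: the custom deleter models returning the buffer
// to its pool when the last loaned reference is released.
std::shared_ptr<Buffer> allocate(std::size_t size) {
    return std::shared_ptr<Buffer>(new Buffer(size),
                                   [](Buffer* b) { ++returned; delete b; });
}

// Task B receives a loaned reference: it works on the same underlying
// memory, so the payload is never copied.
void taskB(std::shared_ptr<Buffer> loaned) {
    loaned->front() = 1;
}
```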

Page 17: RT2014_TCPLA_Nara_27052014-V1



Multithreading

Workloops provide easy use of threads: a workloop thread executes work assigned to it by the application thread

Workloops can be bound to run on specific CPU cores by configuration
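Pinning a workloop thread to a core can be done on Linux with `pthread_setaffinity_np`; a sketch with an illustrative helper name (the XDAQ configuration mechanism itself is not shown):

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Bind an already-running thread to one specific core (Linux-only).
bool bindToCore(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);  // allow exactly one core in the mask
    return pthread_setaffinity_np(t.native_handle(),
                                  sizeof(cpu_set_t), &set) == 0;
}
```

After binding, the scheduler may only run the thread on the chosen core, which keeps its working set in that core's cache and NUMA node.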

Page 18: RT2014_TCPLA_Nara_27052014-V1


Data Transmission – Peer Transports

[Diagram: two XDAQ hosts, each running a User Application with a Peer Transport over the NIC; logical communication between the applications at the protocol level]

The application uses the network transparently through the XDAQ framework

The user application is network and protocol independent

Routing is defined by the XDAQ configuration

Connections are set up through a peer-to-peer model

Page 19: RT2014_TCPLA_Nara_27052014-V1

TCP Layered Architecture

Page 20: RT2014_TCPLA_Nara_27052014-V1


Motivation

Lessons learned from using previous peer transports for TCP in a real environment led to a more advanced design that copes with:

Separation of TCP socket handling from XDAQ protocols

Exploitation of a multi-threading environment

Optimisation through configurability to improve performance, e.g. grouping sockets and associating different threads to different roles

Page 21: RT2014_TCPLA_Nara_27052014-V1


TCPLA Inspired by uDAPL (Direct Access Programming Library)

Developed by DAT collaborative http://www.datcollaborative.org/

Defines user Direct Access Programming Library (uDAPL) and kernel Direct Access Programming Library (kDAPL) APIs

TCPLA is inspired by uDAPL specification, in particular the send/receive semantics it describes

Page 22: RT2014_TCPLA_Nara_27052014-V1


What is the TCP Layered Architecture?

TCP Layered Architecture (TCPLA) is a lightweight, transport and platform independent user-level library for handling socket processing

Provides the user with an event driven model of handling network communications

All calls to send and receive data are performed asynchronously

Two peer transports have been developed:

ptFRL is specific to reading TCP/IP streams from FEROLs in the RU machine (FEROL protocol)

ptUTCP is for general-purpose data flow (general protocols)

Page 23: RT2014_TCPLA_Nara_27052014-V1


TCPLA Architecture Principles (I)

Send Queue: filled buffers for sending

Receive Queue: pre-allocated buffers for receiving

Completion Event Queue: received buffer, connection request, sent buffer, connection established

The Peer Transport consumes completion events through an event handler registered with the TCP Layered Architecture
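A minimal model of these queues, with hypothetical type names (not the real TCPLA API); in practice a worker thread would drain the send queue onto a socket, which is simulated synchronously here:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Completion event types mirroring the slide's event queue.
enum class EventType { BufferReceived, BufferSent,
                       ConnectionRequest, ConnectionEstablished };

struct CompletionEvent { EventType type; std::vector<char>* buffer; };

struct EndPoint {
    std::queue<std::vector<char>*> sendQueue;     // filled buffers to send
    std::queue<std::vector<char>*> receiveQueue;  // pre-allocated buffers
    std::queue<CompletionEvent>    eventQueue;    // completions to consume

    // Posting is asynchronous: it only enqueues the buffer and returns.
    void postSend(std::vector<char>* buf) { sendQueue.push(buf); }

    // A send workloop would write each buffer to the socket, then push
    // a BufferSent completion event for the peer transport to consume.
    void progress() {
        while (!sendQueue.empty()) {
            std::vector<char>* buf = sendQueue.front();
            sendQueue.pop();
            eventQueue.push({EventType::BufferSent, buf});
        }
    }
};
```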

Page 24: RT2014_TCPLA_Nara_27052014-V1


TCPLA Architecture Principles (II)

Receive Workloop, Send Workloop and Event Workloop each perform socket I/O

1..N sockets are associated per workloop
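The 1..N association can be sketched as a round-robin assignment of socket descriptors to workloops, so each workloop polls only its own subset; the function name is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign socket file descriptors to workloops round-robin, so load
// is spread evenly and each workloop owns a fixed group of sockets.
std::vector<std::vector<int>> assignSockets(const std::vector<int>& fds,
                                            std::size_t nWorkloops) {
    std::vector<std::vector<int>> groups(nWorkloops);
    for (std::size_t i = 0; i < fds.size(); ++i)
        groups[i % nWorkloops].push_back(fds[i]);
    return groups;
}
```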

Page 25: RT2014_TCPLA_Nara_27052014-V1

Performance Tuning

Page 26: RT2014_TCPLA_Nara_27052014-V1


Performance Factors

CPU affinity

I/O Interrupt affinity

Memory affinity

TCP Custom Kernel Settings
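Kernel settings are applied system-wide, but they have per-socket counterparts that can be set in code. A sketch assuming common high-throughput choices (a larger receive buffer and Nagle's algorithm disabled) rather than the actual CMS settings:

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Apply illustrative per-socket TCP tuning to an open socket.
int tuneSocket(int fd) {
    // Request a 4 MB receive buffer (the kernel clamps this to
    // net.core.rmem_max, one of the sysctl settings tuned above).
    int rcvbuf = 4 * 1024 * 1024;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf, sizeof(rcvbuf)) != 0) return -1;
    // Disable Nagle so small event fragments are sent immediately.
    int nodelay = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                   &nodelay, sizeof(nodelay)) != 0) return -1;
    return 0;
}
```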

Page 27: RT2014_TCPLA_Nara_27052014-V1


Readout Unit Machine Affinities

Affinities

40 Gb/s Ethernet

56 Gb/s Infiniband

I/O I/O

Readout Unit

CPU Socket 1 / NUMA Node 1 (16 GB): cores 15, 13, 11, 9, 7, 5, 3, 1

CPU Socket 0 / NUMA Node 0 (16 GB): cores 14, 12, 10, 8, 6, 4, 2, 0

40 GbE - I/O Interrupts

40 GbE - Socket reading

Infiniband - I/O Interrupts

DAQ threads

DAQ Memory allocation

Cores available to operating system

Page 28: RT2014_TCPLA_Nara_27052014-V1

Preliminary Results

Page 29: RT2014_TCPLA_Nara_27052014-V1


P2P Example: Out of the Box vs Affinity

[Plot: out-of-the-box vs affinity-tuned throughput; link saturation is reached with affinity]

Page 30: RT2014_TCPLA_Nara_27052014-V1


FEROL Aggregation into One RU

Aggregation n-to-1, for example:

12 streams from 12 FEROLs, each sending fragment sizes from 2 to 4 kB

16 streams from 8 FEROLs, each sending fragment sizes from 1 to 2 kB

Concentrated into one 40 GbE NIC in the RU PC

Mellanox SX 1024 switch with 48 ports at 10 GbE and 12 ports at 40 GbE

Reliability and congestion handled by TCP/IP; pause frames are enabled in the 10/40 GbE switch

Page 31: RT2014_TCPLA_Nara_27052014-V1


DAQ Test Bed

The final performance indicator for TCPLA is the performance achieved while executing simultaneous input and output in the RU

Test bed with up to 47 FEROLs (10 GbE) as input

[Diagram: Readout Unit with two CPU sockets and I/O, 40 Gb/s Ethernet input, 56 Gb/s Infiniband FDR output]

Readout Unit

Page 32: RT2014_TCPLA_Nara_27052014-V1


Test with 1 Data Source from FEROL (I)

40 GbE

Page 33: RT2014_TCPLA_Nara_27052014-V1


Test with 1 Data Source from FEROL (II)

40 GbE

working range

90% efficiency of 40 GbE at 4.5 GB/s

Page 34: RT2014_TCPLA_Nara_27052014-V1

Test with 2 Data Sources from FEROL

Page 35: RT2014_TCPLA_Nara_27052014-V1


Test with 2 Data Sources from FEROL

40 GbE

working range

90% efficiency of 40 GbE at 4.5 GB/s

Page 36: RT2014_TCPLA_Nara_27052014-V1


Summary

TCPLA is based on the standard socket library with optimizations for NUMA environments using the XDAQ framework

It has been developed to allow high performance of critical applications in the new CMS DAQ system by exploiting multi-core architectures

It will be used for both data flow and monitoring

Page 37: RT2014_TCPLA_Nara_27052014-V1


Questions ?

TCPLA is a powerful tool, with many configurable parameters: Send Timeout, Receive Timeout, Affinity, Single-Threaded Dispatcher, Multi-Threaded Dispatcher, Poll, Select, Subnet Scanning, Port Scanning, TCP, SDP, Blocking, Non-Blocking, Connect On Request, Polling Cycle, Datagram, Auto Connect, FRL, I2O, B2IN, I/O Queue Size, Polling Work Loop, Waiting Work Loop, Event Queue Size, Receive Buffers, Receive Block Size, Max Clients, Max FD

Page 38: RT2014_TCPLA_Nara_27052014-V1

Backup Slides

Page 39: RT2014_TCPLA_Nara_27052014-V1


Software Architecture

[Class diagram: PeerTransport, EventHandler, InterfaceAdapter, EndPoint, PublicServicePoint, Dispatcher (Poll / Select)]

Page 40: RT2014_TCPLA_Nara_27052014-V1


Connecting

[Sequence diagram between InterfaceAdapter, EventQueue and EventHandler; legend: thread, mutex-less, asynchronous]

Connect → wait for establishment → push established event → consume event

Page 41: RT2014_TCPLA_Nara_27052014-V1


Sending

[Sequence diagram between InterfaceAdapter, EventQueue and EventHandler; legend: thread, mutex-less, asynchronous]

Buffer to send → push into send queue → check sending is possible → pop buffer from send queue → send buffer → push sent event → consume event

Page 42: RT2014_TCPLA_Nara_27052014-V1


Receiving

[Sequence diagram between Dispatcher, PublicServicePoint, EventQueue and EventHandler; legend: thread, mutex-less, asynchronous]

Check for incoming → incoming packet → get free buffer → socket receive → dispatch as reference → push receive event → consume event → provide free buffer

Page 43: RT2014_TCPLA_Nara_27052014-V1


Readout Unit Machine Affinities

Affinities

40 Gb/s Ethernet

56 Gb/s Infiniband

I/O I/O

Readout Unit

CPU Socket 1 / NUMA Node 1 (16 GB): cores 15, 13, 11, 9, 7, 5, 3, 1

CPU Socket 0 / NUMA Node 0 (16 GB): cores 14, 12, 10, 8, 6, 4, 2, 0

40 GbE - I/O Interrupts

40 GbE - Socket reading

Infiniband - I/O Interrupts

DAQ threads

DAQ Memory allocation

Cores available to operating system