Data Recording Model at XFEL
CRISP 2nd Annual meeting
March 18-19, 2013
Djelloul Boukhelef
XFEL
Outline
• Purpose and scope
• Hardware setup
• Software architecture
• Experiments & results
  – Network
  – Storage
• Summary & outlook
Purpose and present scope
• Build a prototype of a fully featured DAQ/DM/SC system
  – Select/install adequate h/w, develop s/w, and test all system properties:
    Control, DAQ, DM, and SC systems
• Current prototype focuses on:
  – Data acquisition, pre-processing, formatting and storage
  – Assessing the performance and stability of the h/w + s/w
• Network: bandwidth (10Gbps), UDP packets loss, TCP behavior…
• Processing: concurrent read, processing, write operations, …
• Storage: performance of disk (write), concurrent IO operations, …
• Software development
  – Application architecture: processing pipeline, communication, …
  – Design for performance, robustness, scalability, flexibility, …
Hardware setup
Details were presented in the IT&DM meeting in October
• 2D detector generates ~10GB of image data per second
• Data is multiplexed on 16 channels (10GbE)
• 1GB/1.6 sec = 640MB/s per channel
• Lots of other slow data streams
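The rates quoted above are consistent with each other; a quick back-of-envelope check (the 1 GB per train per channel and the 1.6 s train period are taken from the slide):

```python
# Per-channel and aggregate data rates of the 2D detector stream.
train_bytes = 1 * 1024**3            # 1 GB of image data per train, per channel
train_period = 1.6                   # one train every 1.6 seconds

per_channel = train_bytes / train_period / 1024**2   # MB/s on one 10GbE channel
total = 16 * (train_bytes / train_period) / 1024**3  # GB/s over all 16 channels

print(f"{per_channel:.0f} MB/s per channel, {total:.1f} GB/s total")
# prints "640 MB/s per channel, 10.0 GB/s total"
```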
Software architecture
Overview
• Current prototype consists of three software components
  – Data feeder
    • Without TB board: train-builder emulator feeds the PCL with train data
    • With TB board: feeds the TB with detector data
  – PC Layer software
    • Acquire, pre-process, reduce, monitor, format, and send data to storage and SC
  – Storage service
• Device/DeviceServer model to build a distributed control system
  – PSR, flexibility, and configurability
[Diagram: data feeders 1…M feed PCL nodes 1…N over UDP via the train builder emulator; PCL nodes stream over TCP to the storage service on storage nodes 1…S]
Architecture overview

[Diagram: three layers. Train Builder layer: timer server, TB emulator (master), data feeders 1…M. PC layer: PCL nodes 1…N. Storage layer: storage nodes 1…N. UDP carries train data from the feeders to the PC layer; TCP is used for the network timer and for PC-layer-to-storage streaming. Each component is configured via XML files (group ids and rates, net-timer and PCL node lists, train metadata, data files, storage nodes, folder and naming conventions) and monitored (CPU, queues, network).]

• Data-driven model
• Configurations (xml files)
Processing pipeline
• Distributed and parallel processing
[Diagram: per-layer pipelines. Train builder: generate → build → send. PC layer: receive → process → format → send (acquisition and processing stages), with branches to online analysis and SC. Storage: receive → write. Trains T1, T2, … move through the stages concurrently.]
Pipelining and multithreading on multi-core
Data Feeder

[Diagram: the detector emulator (image data generator + detector data generator, using CImg) generates and stores images, loading them into a raw data buffer and offline data files; the train-builder emulator assembles trains via a train data queue (tokens) and image/detector data queues (pointers) into a trains buffer; the packetizer sends packets (UDP) to the PCL nodes, paced by clock ticks from the timer server (TCP) and DAQ requests.]
PC-Layer node

[Diagram: the de-packetizer receives packets (UDP) from the train builder into a trains buffer; the processing pipeline (monitoring, reduction, …) consumes a process queue (train id); the formatter consumes a format queue (train id) and produces in-memory files; the writer consumes a write queue (file name) and streams the files over TCP to online storage; statistics are collected throughout.]

Simple processing so far: checksum only; real algorithms are still needed.
Storage server

[Diagram: the reader fills free slots of a memory ring buffer from the TCP network (get free slot → fill slot → flush slot); the writer writes filled slots to the disk array; sync flushes the cache to disk]
• Memory ring buffer
  – Buffer size: 16 GB
  – Slot (chunk) size: 1 GB
• Thread pool
  – Reader: reads the data stream (TCP) and stores it into the memory buffer
  – Writer: writes filled buffer slots to disk (cache)
  – Sync: flushes the cache to disk (physical write)
• IO modes
  – Cached: issues IO commands via the OS
  – Direct: talks directly to the IO device
  – Splice: data transfer within kernel space
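The reader/writer hand-off over the ring buffer can be sketched with two queues of slots, one for free slots and one for filled slots waiting to be flushed. This is a minimal sketch with toy sizes (4 slots of 8 bytes instead of 16 × 1 GB); all names are illustrative, not the actual implementation:

```python
import queue
import threading

SLOTS = 4                               # stand-in for the 16 GB buffer / 1 GB slots
free_slots = queue.Queue()              # pool of empty slots
filled_slots = queue.Queue()            # slots waiting to be written to disk
for _ in range(SLOTS):
    free_slots.put(bytearray(8))        # toy 8-byte "slots"

written = []                            # stand-in for the disk array

def reader(n_chunks):
    # Reader: take a free slot, fill it from the "network", queue it for writing.
    for i in range(n_chunks):
        slot = free_slots.get()         # blocks if the writer falls behind
        slot[:] = bytes([i]) * 8        # pretend this came from the TCP stream
        filled_slots.put(slot)
    filled_slots.put(None)              # end-of-stream marker

def writer():
    # Writer: drain filled slots to "disk", then recycle them as free slots.
    while (slot := filled_slots.get()) is not None:
        written.append(bytes(slot))     # pretend write-to-disk
        free_slots.put(slot)

t1 = threading.Thread(target=reader, args=(10,))
t2 = threading.Thread(target=writer)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(written), "chunks written")   # prints "10 chunks written"
```

Blocking `get()` on the free-slot queue is what provides back-pressure: if the disk cannot keep up, the reader stalls instead of overwriting unflushed data.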
Network performance
Train Transfer Protocol (TTP)
• Train data format
  – Header, images & descriptors, detector data, trailer
• Train Transfer Protocol (TTP)
  – Based on UDP: designed for fast data transfer where TCP is not implemented
    or not suitable (overhead, delays)
  – Transfers are identified by unique identifiers (frame numbers)
  – Packetization: split the data block (frame) into small packets tagged with
    increasing packet numbers
  – Flags: SoF, EoF, padding
  – Packet trailer mode
Packet layout: 8 KB of data, followed by a trailer carrying the frame number (4 bytes), the packet number (3 bytes), and the SoF/EoF flags (1 byte).

Train data layout: header, image descriptors, image data, detector-specific data, trailer.
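A packetizer along these lines can be sketched from the trailer layout above: 8 KB of payload plus an 8-byte trailer (frame number, packet number, flags). The exact field encoding, flag bits, and padding handling below are assumptions for illustration, not the real TTP wire format:

```python
import struct

PAYLOAD = 8192            # 8 KB of data per packet (from the slide)
SOF, EOF = 0x1, 0x2       # flag bits: assumed encoding

def packetize(frame_id: int, data: bytes):
    """Split one frame into TTP-style packets, each with an 8-byte trailer."""
    chunks = [data[i:i + PAYLOAD] for i in range(0, len(data), PAYLOAD)] or [b""]
    packets = []
    for num, chunk in enumerate(chunks):
        flags = (SOF if num == 0 else 0) | (EOF if num == len(chunks) - 1 else 0)
        chunk = chunk.ljust(PAYLOAD, b"\0")   # pad last chunk (padding flag omitted)
        # trailer: frame number (4 bytes), packet number (3 bytes), flags (1 byte)
        trailer = struct.pack(">I", frame_id) + num.to_bytes(3, "big") + bytes([flags])
        packets.append(chunk + trailer)
    return packets

def depacketize(packets, length):
    """Reassemble a frame, ordering packets by their 3-byte packet number."""
    packets = sorted(packets, key=lambda p: int.from_bytes(p[-4:-1], "big"))
    return b"".join(p[:PAYLOAD] for p in packets)[:length]

data = bytes(range(256)) * 100                 # 25600-byte toy frame
pkts = packetize(42, data)
assert depacketize(pkts, len(data)) == data    # lossless round trip
print(len(pkts), "packets")                    # prints "4 packets"
```

Because every packet carries the frame number, the receiver can detect a lost or late packet simply by watching for gaps in the packet-number sequence of the current frame.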
Previous results (reminder)
• Two types of programs run in parallel on all machines
  – Feeders: generate train data, packetize it, and send the packets using UDP
  – Receivers: reconstruct train data from the packets and store it in memory (overwrite)
• Run length (#packets / total time):
  – Typical: 3.5×10^8 to 2.5×10^9 packets (a few hours)
  – Maximum: 5×10^9 packets (16 h 37 m)
• Time profile
  – XFEL: 10 MHz, 16 channels → send 1 train (~131100 packets) within 1.6 sec
  – Continuous: no waiting time between trains
14
1 5
2
3
4
6
7
8
PCL Node
2
PCL Node
1
Unidirectional stream
Concurrent send/receive
Djelloul Boukhelef - XFEL
Previous results (reminder)
• Network transfer rate: 1 GB train in 0.87 sec ≈ 9.9 Gbps
• CPU usage (i.e. receiver core): 40%
• Packet loss
  – A few packets (tens to hundreds) are sometimes lost per run
  – It happens only at the beginning of some runs (not per train)
  – Observed sometimes on all machines, sometimes on some machines only;
    we have also had runs with no packet loss on any machine
• Ignoring the first lost packets, which affect only the first train:
  – Typical run (3.5×10^8 packets): fewer than 3.7 out of 10000 trains affected
  – Long run (5×10^9 packets): fewer than 26 out of one million trains affected
Train switching
• In previous experiments:
  – Each feeder is configured to feed one PC layer node (one-to-one)
  – Packet loss appears at the start of a run
  – In the TB, trains are sent out through different channels every time:
    10 trains → 16 channels per second
• Question:
  – What if the feeder sends train data to a different PC layer node every time?
[Diagrams: one-to-one setup with feeders 1-2 (st101, st102) feeding PCLayer 1-2 (st104, st105) over TTP through a 10GbE switch on sub-net 1; train-switching setup with feeders 1-3 (st101-st103) feeding the same two PCLayer nodes]
Train switching
• Test configuration
  – 3 feeder nodes
    • Pre-load images from disk
    • Build train data (header, trailer, …)
    • Calculate checksum (Adler32)
    • Packetize train data (TTP)
  – 2 PC layer nodes
    • Depacketize (TTP)
    • No processing is performed
    • Format to HDF5 file
    • Stream files through TCP (splice)
  – 2 storage nodes
    • Write files to shared memory (splice)
[Diagram: the timer and train builder (st401) drive feeders 1-3 (st101-st103), which send TTP through a 10GbE switch on sub-net 1 to PCLayer 1-2 (st104, st105); the PC layer streams TCP through a second 10GbE switch on sub-net 2 to storage 1-2 (st106, st107)]
Train switching
• 3 feeders feed 2 PC layer nodes in a round-robin manner
  – Rate: 2 trains every 1.6 seconds
    • A PC layer node receives 1 train every 1.6 sec, each time from a different feeder
    • A feeder sends out 1 train every 2.4 sec, each time to a different IP address
  – The packetizer checks the send buffer (e.g. every 100 packets) to avoid
    overwriting packets that have not been sent yet
  – All feeder-to-PCLayer data transfers are done on the same sub-network
  – Train transfer time is 0.88 sec, i.e. two consecutive trains overlap for
    0.08 sec (9% of the time)
Feeder  PCL  Time (sec)
  1      1     0.0
  2      2     0.8
  3      1     1.6
  1      2     2.4
  2      1     3.2
  3      2     4.0
  1      1     4.8
  …
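The dispatch table above follows from a simple double round-robin: train i goes out from feeder (i mod 3) to PC layer node (i mod 2), launched every 0.8 s. A minimal sketch (function name hypothetical):

```python
def schedule(n_trains, feeders=3, pcls=2, interval=0.8):
    """(feeder, pcl, send time) for each train, round-robin over both layers."""
    return [(i % feeders + 1, i % pcls + 1, round(i * interval, 1))
            for i in range(n_trains)]

for feeder, pcl, t in schedule(7):
    print(feeder, pcl, t)
# Reproduces the table: each PCL node gets one train per 1.6 s,
# and each feeder sends one train per 2.4 s.
```

Since 3 and 2 are coprime, the (feeder, PCL) pairing only repeats every 6 trains, so every feeder exercises every destination.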
[Diagram: feeders 1-3 (st101-st103) feed PCLayer 1-2 (st104, st105) over TTP through a 10GbE switch on sub-net 1]
Experiment
• Total run time
  – Short runs: less than ½ hour
  – Long run: 18 hours (81657 trains, ~80 TB)
• Observations:
  – 6 trains were affected at the beginning of the run on each PC layer node;
    then both continued smoothly with no train loss until 1am
  – Around 8am, 2 more trains were affected on one PC layer node (probably due
    to the nightly update); no train loss on the other node
  – The run then continued very stably until the end
  – Network send and receive (TTP and TCP) were very stable
  – Formatting time was not stable all the time
Summary
• Results:
  – Trains sent: 81657 (27219 trains per feeder)
  – PCLayer01: received 40820, affected 8
  – PCLayer02: received 40823, affected 6
  – Train size: 1073754173 bytes = 1 GB (image data) + 12.06 KB
    (header, descriptors, detector data, and trailer)
  – # packets per train: 131202
  – Transfer time: 0.877579 sec
  – Transfer rate: 1.1395 GBytes/sec = 9.116 Gbps
• Sustainable and stable network bandwidth
Storage performance
• Cached IO: issues read/write commands via the OS kernel, which executes the
  IO; data are copied to/from the page cache.
• Zero-copy operation: splice socket and file descriptors; the data transfer is
  performed within kernel space (transparent to the application).
  – Problem using the Linux splice function with the IBM/Mellanox setup;
    couldn't figure out the reason!
• Direct IO: performs read/write directly to/from the device, bypassing the
  page cache.
  – DMA: memory alignment (512 bytes)
  – RAID: strip size (64 KB), # of disks
  – Per file vs. per partition (Linux)
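Direct IO requires transfer sizes (and buffer addresses) aligned to the 512-byte DMA boundary, and performs best when writes are also a multiple of the 64 KB RAID strip. A tiny round-up helper illustrates the arithmetic (helper name hypothetical):

```python
def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

SECTOR = 512          # DMA alignment from the slide
STRIP = 64 * 1024     # RAID strip size from the slide

assert align_up(1000, SECTOR) == 1024        # 512-aligned write size
assert align_up(1024, SECTOR) == 1024        # already aligned: unchanged
print(align_up(12_345_678, STRIP))           # prints "12386304" (189 strips)
```

The bit-mask trick works because for a power-of-two alignment, `~(alignment - 1)` clears exactly the low-order bits that would make the size misaligned.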
Direct IO vs Cached IO

[Diagram: cached IO copies data between the application buffer, the page cache / kernel buffer, and the device driver (read/write, then flush to disk); direct IO uses page remapping and goes from the application buffer through the device driver to disk, bypassing the page cache. Kernel space vs. user land and the hardware layer are marked.]
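The splice-style zero-copy path mentioned above is reachable from user space via calls like `sendfile`, which moves bytes between descriptors inside the kernel without copying through a user-space buffer. A toy Linux file-to-socket transfer (the real system streams HDF5 files over TCP; this is only a sketch of the mechanism):

```python
import os
import socket
import tempfile

# A small "data file" to transfer.
payload = b"train-data" * 1000
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

src = os.open(path, os.O_RDONLY)
rx, tx = socket.socketpair()        # stand-in for the TCP connection

# Kernel-side copy: the payload never passes through a user-space buffer.
sent = os.sendfile(tx.fileno(), src, 0, len(payload))

received = b""
while len(received) < sent:
    received += rx.recv(65536)
assert received == payload[:sent]

print(sent, "bytes sent via sendfile")
os.close(src); tx.close(); rx.close(); os.unlink(path)
```

This keeps the CPU out of the data path, which is consistent with the low per-core CPU usage reported later for the storage servers.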
[Plots: write time (sec) vs. file number. Left: file size 1 GB, time period 1.6 s, empty disk. Right: file size 2 GB, time period 3.2 s, empty disk.]
Cached IO
• Run length: 2+ hours
• Buffer size: 4 GB
• Method: read all file data into one buffer, write it once, then sync
Run configuration
• Two types of programs run in parallel on four machines
  – Sender: open the in-memory data file, stream its content using TCP, close the file
  – Receiver: read file data from the socket into a RAM buffer (16 GB), write it to disk
• Run length
  – Typical: 4.5 TB (2 hours)
  – Maximum: until the disk is full; 9 TB (4 hours) and 28 TB (12 hours)
  – Disks are cleaned before every run
• Time profile
  – 1 GB every 1.6 sec per box
[Diagram: PCL nodes 101-104 (Dell machines) stream to storage nodes 201-204 (IBM machines), which use external and internal disks]

Size (GB)    1     10    20    40
Time (sec)   1.6   16    32    64
Direct IO – external storage
• Storage extension box: 3TB 7.2Krpm 6Gbps NL SAS,RAID6
File (GB) | Network (Gbps) avg. | Storage (GB/s) avg. | Net read max/avg. (sec) | Disk write max/avg. (sec) | Overall max/avg. (sec)
    1     |        9.86         |        0.95         |      2.27 /  0.87       |       2.95 /  1.06        |      3.95 /  1.93
   10     |        9.85         |        0.97         |      9.01 /  8.72       |      15.58 / 10.30        |     16.45 / 11.17
   20     |        9.83         |        0.98         |     17.79 / 17.48       |      30.07 / 20.45        |     30.95 / 21.33
   40     |        9.61         |        0.97         |     38.81 / 35.74       |      58.06 / 41.39        |     59.06 / 42.36
Direct IO – internal storage
• Internal disks: 14×900 GB 10Krpm 6Gbps SAS, RAID6

File (GB) | Network (Gbps) avg. | Storage (GB/s) avg. | Net read max/avg. (sec) | Disk write max/avg. (sec) | Overall max/avg. (sec)
    1     |        9.86         |        1.17         |      3.42 /  0.87       |       2.47 /  0.86        |      4.23 /  1.73
   10     |        9.60         |        1.09         |      9.38 /  8.95       |      11.28 /  9.19        |     12.15 / 10.07
   20     |        9.36         |        1.06         |     19.18 / 18.36       |      22.63 / 18.84        |     23.55 / 19.75
   40     |        9.38         |        1.07         |     38.11 / 36.64       |      45.25 / 37.54        |     46.22 / 38.50
Long run experiments
Internal storage: 9 TB, 918 files, 4 hours
External storage: 28 TB, 2792 files, 12 hours

[Plots: network read, disk write, and overall time per file, with averages marked]

Buffer size: 16 GB, slot size: 1 GB, file size: 10 GB, time profile: 16 sec
Statistics from Ganglia
• Long run experiment (5:24am - 5:29pm)
  – Host: exflst201 (with external storage)
  – Disk write: 671.14 MB/s
  – Network bandwidth: 676.8 MB/s
  – CPU usage: system 5.81%, user 0.39%
    • Reader: system 49.54%, user 0.75%
    • Writer: system 7.03%, user 0.010%
[Ganglia graphs: network and disk throughput, plus CPU usage for the reader (core 0) and writer (core 1)]
Result summary
• We need to write a 1 GB data file within 1.6 s per storage box
  – Both internal and external storage configurations achieve this rate
    (1.1 GB/s and 0.97 GB/s, respectively)
  – 16 storage boxes are needed to handle the 10 GB/s train data stream
• High network bandwidth and low CPU load (stable)
• Direct IO:
  – Network read and disk write operations overlap 97% → low overall time per file
  – Application buffer: for big files, the bigger the slot size the better the
    disk IO performance (as long as DMA allows)
• To do:
  – Concurrent IO operations: write/write, write/read, file merging
  – Storage manager: file indexing, disk space management
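The 97% overlap figure can be recovered from the 40 GB row of the external-storage table: the time during which reading and writing ran concurrently is (read + write − overall), expressed as a fraction of the shorter operation. A quick check using the table's average times:

```python
# Average times for 40 GB files on external storage (values from the table).
net_read, disk_write, overall = 35.74, 41.39, 42.36   # seconds

# Time both operations ran simultaneously, as a fraction of the shorter one.
concurrent = net_read + disk_write - overall
overlap = concurrent / min(net_read, disk_write)

print(f"overlap = {overlap:.0%}")   # prints "overlap = 97%"
```

In other words, for large files the overall time per file is barely longer than the slower of the two operations, which is what makes the 1 GB / 1.6 s target reachable on a single box.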
Summary & Outlook
Summary
• First half of the slice-test hardware is configured and running
• Testing and tuning network and I/O performance using
  – System/community tools: netperf, iozone, …
  – PCL software
  – Train builder board
• TB (emulator) → PCL software (Dell)
  – Bandwidth: 9.9 Gbps (99% of the wire speed)
  – Low UDP packet loss rate: only a few packets lost at the start of runs
    (3.5×10^8 to 5×10^9 packets); less than 3.7 to 0.26 per 10^5 trains
    can be affected at most
• PC Layer (Dell) → storage boxes (IBM)
  – TCP data streaming: ~9.8 Gbps
  – Terabytes of data written to disk at 0.97 to 1.1 GB/s
Outlook
• Fully featured DAQ system
  – Data readout, pre-processing, monitoring, storage
  – Feed the system with real data and apply real algorithms
    (processing, monitoring, scientific computing)
  – Deployment, configuration and control: upload libraries, initiate devices,
    start/stop and monitor runs → device composition
• Soak and stress testing
  – Test performance (CPU, IO, network), behavior (bugs, memory leaks),
    reliability (error handling, failure), and stability of the system
    • Significant workload applied over a long period of time
  – Parallel tasks: forwarding data to online analysis or scientific computing,
    multiple streams into the same storage server, …
• Cluster file system vs. DDN vs. local storage system
• Data management: structure, access control, metadata, …
Thanks!