Data Recording Model at XFEL
CRISP 2nd Annual meeting
March 18-19, 2013
Djelloul Boukhelef
XFEL
Outline
• Purpose and scope
• Hardware setup
• Software architecture
• Experiments & results
  – Network
  – Storage
• Summary & outlook
Purpose and present scope
• Build a prototype of a fully featured DAQ/DM/SC system
  – Select/install adequate h/w, develop s/w, and test all system properties:
    Control, DAQ, DM, and SC systems
• Current prototype focuses on:
  – Data acquisition, pre-processing, formatting and storage
  – Assessing the performance and stability of the h/w + s/w
• Network: bandwidth (10Gbps), UDP packets loss, TCP behavior…
• Processing: concurrent read, processing, write operations, …
• Storage: performance of disk (write), concurrent IO operations, …
• Software development
  – Application architecture: processing pipeline, communication, …
  – Design for performance, robustness, scalability, flexibility, …
Hardware setup
Details were presented in the IT&DM meeting in October
• 2D detector generates ~10GB of image data per second
• Data is multiplexed on 16 channels (10GbE)
• 1GB/1.6 sec = 640MB/s per channel
• Lots of other slow data streams
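The rates quoted above are consistent with each other; a quick back-of-envelope check (the 1 GB per train per channel and the 1.6 s train period are taken from the slide):

```python
# Per-channel and aggregate data rates of the 2D detector stream.
train_bytes = 1 * 1024**3            # 1 GB of image data per train, per channel
train_period = 1.6                   # one train every 1.6 seconds

per_channel = train_bytes / train_period / 1024**2   # MB/s on one 10GbE channel
total = 16 * (train_bytes / train_period) / 1024**3  # GB/s over all 16 channels

print(f"{per_channel:.0f} MB/s per channel, {total:.1f} GB/s total")
# prints "640 MB/s per channel, 10.0 GB/s total"
```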
Software architecture
Overview
• Current prototype consists of three software components
  – Data feeder
    • Without TB board: train-builder emulator feeds the PCL with train data
    • With TB board: feeds the TB with detector data
  – PC Layer software
    • Acquire, pre-process, reduce, monitor, format, and send data to storage and SC
  – Storage service
• Device/DeviceServer model to build a distributed control system
  – PSR, flexibility, and configurability
[Diagram: data feeders 1…M feed PCL nodes 1…N over UDP via the train builder emulator; PCL nodes stream over TCP to the storage service on storage nodes 1…S]
Architecture overview

[Diagram: three layers. Train Builder layer: timer server, TB emulator (master), data feeders 1…M. PC layer: PCL nodes 1…N. Storage layer: storage nodes 1…N. UDP carries train data from the feeders to the PC layer; TCP is used for the network timer and for PC-layer-to-storage streaming. Each component is configured via XML files (group ids and rates, net-timer and PCL node lists, train metadata, data files, storage nodes, folder and naming conventions) and monitored (CPU, queues, network).]

• Data-driven model
• Configurations (xml files)
Processing pipeline
• Distributed and parallel processing
[Diagram: per-layer pipelines. Train builder: generate → build → send. PC layer: receive → process → format → send (acquisition and processing stages), with branches to online analysis and SC. Storage: receive → write. Trains T1, T2, … move through the stages concurrently.]
Pipelining and multithreading on multi-core
Data Feeder

[Diagram: the detector emulator (image data generator + detector data generator, using CImg) generates and stores images, loading them into a raw data buffer and offline data files; the train-builder emulator assembles trains via a train data queue (tokens) and image/detector data queues (pointers) into a trains buffer; the packetizer sends packets (UDP) to the PCL nodes, paced by clock ticks from the timer server (TCP) and DAQ requests.]
PC-Layer node

[Diagram: the de-packetizer receives packets (UDP) from the train builder into a trains buffer; the processing pipeline (monitoring, reduction, …) consumes a process queue (train id); the formatter consumes a format queue (train id) and produces in-memory files; the writer consumes a write queue (file name) and streams the files over TCP to online storage; statistics are collected throughout.]

Simple processing so far: checksum only; real algorithms are still needed.
Storage server

[Diagram: the reader fills free slots of a memory ring buffer from the TCP network (get free slot → fill slot → flush slot); the writer writes filled slots to the disk array; sync flushes the cache to disk]
• Memory ring buffer
  – Buffer size: 16 GB
  – Slot (chunk) size: 1 GB
• Thread pool
  – Reader: reads the data stream (TCP) and stores it into the memory buffer
  – Writer: writes filled buffer slots to disk (cache)
  – Sync: flushes the cache to disk (physical write)
• IO modes
  – Cached: issues IO commands via the OS
  – Direct: talks directly to the IO device
  – Splice: data transfer within kernel space
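The reader/writer hand-off over the ring buffer can be sketched with two queues of slots, one for free slots and one for filled slots waiting to be flushed. This is a minimal sketch with toy sizes (4 slots of 8 bytes instead of 16 × 1 GB); all names are illustrative, not the actual implementation:

```python
import queue
import threading

SLOTS = 4                               # stand-in for the 16 GB buffer / 1 GB slots
free_slots = queue.Queue()              # pool of empty slots
filled_slots = queue.Queue()            # slots waiting to be written to disk
for _ in range(SLOTS):
    free_slots.put(bytearray(8))        # toy 8-byte "slots"

written = []                            # stand-in for the disk array

def reader(n_chunks):
    # Reader: take a free slot, fill it from the "network", queue it for writing.
    for i in range(n_chunks):
        slot = free_slots.get()         # blocks if the writer falls behind
        slot[:] = bytes([i]) * 8        # pretend this came from the TCP stream
        filled_slots.put(slot)
    filled_slots.put(None)              # end-of-stream marker

def writer():
    # Writer: drain filled slots to "disk", then recycle them as free slots.
    while (slot := filled_slots.get()) is not None:
        written.append(bytes(slot))     # pretend write-to-disk
        free_slots.put(slot)

t1 = threading.Thread(target=reader, args=(10,))
t2 = threading.Thread(target=writer)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(written), "chunks written")   # prints "10 chunks written"
```

Blocking `get()` on the free-slot queue is what provides back-pressure: if the disk cannot keep up, the reader stalls instead of overwriting unflushed data.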
Network performance
Train Transfer Protocol (TTP)
• Train data format
  – Header, images & descriptors, detector data, trailer
• Train Transfer Protocol (TTP)
  – Based on UDP: designed for fast data transfer where TCP is not implemented
    or not suitable (overhead, delays)
  – Transfers are identified by unique identifiers (frame numbers)
  – Packetization: split the data block (frame) into small packets tagged with
    increasing packet numbers
  – Flags: SoF, EoF, padding
  – Packet trailer mode
Packet layout: 8 KB of data, followed by a trailer carrying the frame number (4 bytes), the packet number (3 bytes), and the SoF/EoF flags (1 byte).

Train data layout: header, image descriptors, image data, detector-specific data, trailer.
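A packetizer along these lines can be sketched from the trailer layout above: 8 KB of payload plus an 8-byte trailer (frame number, packet number, flags). The exact field encoding, flag bits, and padding handling below are assumptions for illustration, not the real TTP wire format:

```python
import struct

PAYLOAD = 8192            # 8 KB of data per packet (from the slide)
SOF, EOF = 0x1, 0x2       # flag bits: assumed encoding

def packetize(frame_id: int, data: bytes):
    """Split one frame into TTP-style packets, each with an 8-byte trailer."""
    chunks = [data[i:i + PAYLOAD] for i in range(0, len(data), PAYLOAD)] or [b""]
    packets = []
    for num, chunk in enumerate(chunks):
        flags = (SOF if num == 0 else 0) | (EOF if num == len(chunks) - 1 else 0)
        chunk = chunk.ljust(PAYLOAD, b"\0")   # pad last chunk (padding flag omitted)
        # trailer: frame number (4 bytes), packet number (3 bytes), flags (1 byte)
        trailer = struct.pack(">I", frame_id) + num.to_bytes(3, "big") + bytes([flags])
        packets.append(chunk + trailer)
    return packets

def depacketize(packets, length):
    """Reassemble a frame, ordering packets by their 3-byte packet number."""
    packets = sorted(packets, key=lambda p: int.from_bytes(p[-4:-1], "big"))
    return b"".join(p[:PAYLOAD] for p in packets)[:length]

data = bytes(range(256)) * 100                 # 25600-byte toy frame
pkts = packetize(42, data)
assert depacketize(pkts, len(data)) == data    # lossless round trip
print(len(pkts), "packets")                    # prints "4 packets"
```

Because every packet carries the frame number, the receiver can detect a lost or late packet simply by watching for gaps in the packet-number sequence of the current frame.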
Previous results (reminder)
• Two types of programs run in parallel on all machines
  – Feeders: generate train data, packetize it, and send the packets using UDP
  – Receivers: reconstruct train data from the packets and store it in memory (overwrite)
• Run length (#packets / total time):
  – Typical: 3.5×10^8 to 2.5×10^9 packets (a few hours)
  – Maximum: 5×10^9 packets (16 h 37 m)
• Time profile
  – XFEL: 10 MHz, 16 channels → send 1 train (~131100 packets) within 1.6 sec
  – Continuous: no waiting time between trains
14
1 5
2
3
4
6
7
8
PCL Node
2
PCL Node
1
Unidirectional stream
Concurrent send/receive
Djelloul Boukhelef - XFEL
Previous results (reminder)
• Network transfer rate: 1 GB train in 0.87 sec ≈ 9.9 Gbps
• CPU usage (i.e. receiver core): 40%
• Packet loss
  – A few packets (tens to hundreds) are sometimes lost per run
  – It happens only at the beginning of some runs (not per train)
  – Observed sometimes on all machines, sometimes on some machines only;
    we have also had runs with no packet loss on any machine
• Ignoring the first lost packets, which affect only the first train:
  – Typical run (3.5×10^8 packets): fewer than 3.7 out of 10000 trains affected
  – Long run (5×10^9 packets): fewer than 26 out of one million trains affected
Train switching
• In previous experiments:
  – Each feeder is configured to feed one PC layer node (one-to-one)
  – Packet loss appears at the start of a run
  – In the TB, trains are sent out through different channels every time:
    10 trains → 16 channels per second
• Question:
  – What if the feeder sends train data to a different PC layer node every time?
[Diagrams: one-to-one setup with feeders 1-2 (st101, st102) feeding PCLayer 1-2 (st104, st105) over TTP through a 10GbE switch on sub-net 1; train-switching setup with feeders 1-3 (st101-st103) feeding the same two PCLayer nodes]
Train switching
• Test configuration
  – 3 feeder nodes
    • Pre-load images from disk
    • Build train data (header, trailer, …)
    • Calculate checksum (Adler32)
    • Packetize train data (TTP)
  – 2 PC layer nodes
    • Depacketize (TTP)
    • No processing is performed
    • Format to HDF5 file
    • Stream files through TCP (splice)
  – 2 storage nodes
    • Write files to shared memory (splice)
[Diagram: the timer and train builder (st401) drive feeders 1-3 (st101-st103), which send TTP through a 10GbE switch on sub-net 1 to PCLayer 1-2 (st104, st105); the PC layer streams TCP through a second 10GbE switch on sub-net 2 to storage 1-2 (st106, st107)]
Train switching
• 3 feeders feed 2 PC layer nodes in a round-robin manner
  – Rate: 2 trains every 1.6 seconds
    • A PC layer node receives 1 train every 1.6 sec, each time from a different feeder
    • A feeder sends out 1 train every 2.4 sec, each time to a different IP address
  – The packetizer checks the send buffer (e.g. every 100 packets) to avoid
    overwriting packets that have not been sent yet
  – All feeder-to-PCLayer data transfers are done on the same sub-network
  – Train transfer time is 0.88 sec, i.e. two consecutive trains overlap for
    0.08 sec (9% of the time)
Feeder  PCL  Time (sec)
  1      1     0.0
  2      2     0.8
  3      1     1.6
  1      2     2.4
  2      1     3.2
  3      2     4.0
  1      1     4.8
  …
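The dispatch table above follows from a simple double round-robin: train i goes out from feeder (i mod 3) to PC layer node (i mod 2), launched every 0.8 s. A minimal sketch (function name hypothetical):

```python
def schedule(n_trains, feeders=3, pcls=2, interval=0.8):
    """(feeder, pcl, send time) for each train, round-robin over both layers."""
    return [(i % feeders + 1, i % pcls + 1, round(i * interval, 1))
            for i in range(n_trains)]

for feeder, pcl, t in schedule(7):
    print(feeder, pcl, t)
# Reproduces the table: each PCL node gets one train per 1.6 s,
# and each feeder sends one train per 2.4 s.
```

Since 3 and 2 are coprime, the (feeder, PCL) pairing only repeats every 6 trains, so every feeder exercises every destination.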
[Diagram: feeders 1-3 (st101-st103) feed PCLayer 1-2 (st104, st105) over TTP through a 10GbE switch on sub-net 1]
Experiment
• Total run time
  – Short runs: less than ½ hour
  – Long run: 18 hours (81657 trains, ~80 TB)
• Observations:
  – 6 trains were affected at the beginning of the run on each PC layer node;
    then both continued smoothly with no train loss until 1am
  – Around 8am, 2 more trains were affected on one PC layer node (probably due
    to the nightly update); no train loss on the other node
  – The run then continued very stably until the end
  – Network send and receive (TTP and TCP) were very stable
  – Formatting time was not stable all the time
Summary
• Results:
  – Trains sent: 81657 (27219 trains per feeder)
  – PCLayer01: received 40820, affected 8
  – PCLayer02: received 40823, affected 6
  – Train size: 1073754173 bytes = 1 GB (image data) + 12.06 KB
    (header, descriptors, detector data, and trailer)
  – # packets per train: 131202
  – Transfer time: 0.877579 sec
  – Transfer rate: 1.1395 GBytes/sec = 9.116 Gbps
• Sustainable and stable network bandwidth
Storage performance
• Cached IO: issues read/write commands via the OS kernel, which executes the
  IO; data are copied to/from the page cache.
• Zero-copy operation: splice socket and file descriptors; the data transfer is
  performed within kernel space (transparent to the application).
  – Problem using the Linux splice function with the IBM/Mellanox setup;
    couldn't figure out the reason!
• Direct IO: performs read/write directly to/from the device, bypassing the
  page cache.
  – DMA: memory alignment (512 bytes)
  – RAID: strip size (64 KB), # of disks
  – Per file vs. per partition (Linux)
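Direct IO requires transfer sizes (and buffer addresses) aligned to the 512-byte DMA boundary, and performs best when writes are also a multiple of the 64 KB RAID strip. A tiny round-up helper illustrates the arithmetic (helper name hypothetical):

```python
def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

SECTOR = 512          # DMA alignment from the slide
STRIP = 64 * 1024     # RAID strip size from the slide

assert align_up(1000, SECTOR) == 1024        # 512-aligned write size
assert align_up(1024, SECTOR) == 1024        # already aligned: unchanged
print(align_up(12_345_678, STRIP))           # prints "12386304" (189 strips)
```

The bit-mask trick works because for a power-of-two alignment, `~(alignment - 1)` clears exactly the low-order bits that would make the size misaligned.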
Direct IO vs Cached IO

[Diagram: cached IO copies data between the application buffer, the page cache / kernel buffer, and the device driver (read/write, then flush to disk); direct IO uses page remapping and goes from the application buffer through the device driver to disk, bypassing the page cache. Kernel space vs. user land and the hardware layer are marked.]
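The splice-style zero-copy path mentioned above is reachable from user space via calls like `sendfile`, which moves bytes between descriptors inside the kernel without copying through a user-space buffer. A toy Linux file-to-socket transfer (the real system streams HDF5 files over TCP; this is only a sketch of the mechanism):

```python
import os
import socket
import tempfile

# A small "data file" to transfer.
payload = b"train-data" * 1000
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

src = os.open(path, os.O_RDONLY)
rx, tx = socket.socketpair()        # stand-in for the TCP connection

# Kernel-side copy: the payload never passes through a user-space buffer.
sent = os.sendfile(tx.fileno(), src, 0, len(payload))

received = b""
while len(received) < sent:
    received += rx.recv(65536)
assert received == payload[:sent]

print(sent, "bytes sent via sendfile")
os.close(src); tx.close(); rx.close(); os.unlink(path)
```

This keeps the CPU out of the data path, which is consistent with the low per-core CPU usage reported later for the storage servers.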
[Plots: write time (sec) vs. file number. Left: file size 1 GB, time period 1.6 s, empty disk. Right: file size 2 GB, time period 3.2 s, empty disk.]
Cached IO
• Run length: 2+ hours
• Buffer size: 4 GB
• Method: read all file data into one buffer, write it once, then sync
Run configuration
• Two types of programs run in parallel on four machines
  – Sender: open the in-memory data file, stream its content using TCP, close the file
  – Receiver: read file data from the socket into a RAM buffer (16 GB), write it to disk
• Run length
  – Typical: 4.5 TB (2 hours)
  – Maximum: until the disk is full; 9 TB (4 hours) and 28 TB (12 hours)
  – Disks are cleaned before every run
• Time profile
  – 1 GB every 1.6 sec per box
[Diagram: PCL nodes 101-104 (Dell machines) stream to storage nodes 201-204 (IBM machines), which use external and internal disks]

Size (GB)    1     10    20    40
Time (sec)   1.6   16    32    64
Direct IO – external storage
• Storage extension box: 3TB 7.2Krpm 6Gbps NL SAS,RAID6
File (GB) | Network (Gbps) avg. | Storage (GB/s) avg. | Net read max/avg. (sec) | Disk write max/avg. (sec) | Overall max/avg. (sec)
    1     |        9.86         |        0.95         |      2.27 /  0.87       |       2.95 /  1.06        |      3.95 /  1.93
   10     |        9.85         |        0.97         |      9.01 /  8.72       |      15.58 / 10.30        |     16.45 / 11.17
   20     |        9.83         |        0.98         |     17.79 / 17.48       |      30.07 / 20.45        |     30.95 / 21.33
   40     |        9.61         |        0.97         |     38.81 / 35.74       |      58.06 / 41.39        |     59.06 / 42.36
Direct IO – internal storage
• Internal disks: 14×900 GB 10Krpm 6Gbps SAS, RAID6

File (GB) | Network (Gbps) avg. | Storage (GB/s) avg. | Net read max/avg. (sec) | Disk write max/avg. (sec) | Overall max/avg. (sec)
    1     |        9.86         |        1.17         |      3.42 /  0.87       |       2.47 /  0.86        |      4.23 /  1.73
   10     |        9.60         |        1.09         |      9.38 /  8.95       |      11.28 /  9.19        |     12.15 / 10.07
   20     |        9.36         |        1.06         |     19.18 / 18.36       |      22.63 / 18.84        |     23.55 / 19.75
   40     |        9.38         |        1.07         |     38.11 / 36.64       |      45.25 / 37.54        |     46.22 / 38.50
Long run experiments
Internal storage: 9 TB, 918 files, 4 hours
External storage: 28 TB, 2792 files, 12 hours

[Plots: network read, disk write, and overall time per file, with averages marked]

Buffer size: 16 GB, slot size: 1 GB, file size: 10 GB, time profile: 16 sec
Statistics from Ganglia
• Long run experiment (5:24am - 5:29pm)
  – Host: exflst201 (with external storage)
  – Disk write: 671.14 MB/s
  – Network bandwidth: 676.8 MB/s
  – CPU usage: system 5.81%, user 0.39%
    • Reader: system 49.54%, user 0.75%
    • Writer: system 7.03%, user 0.010%
[Ganglia graphs: network and disk throughput, plus CPU usage for the reader (core 0) and writer (core 1)]
Result summary
• We need to write a 1 GB data file within 1.6 s per storage box
  – Both internal and external storage configurations achieve this rate
    (1.1 GB/s and 0.97 GB/s, respectively)
  – 16 storage boxes are needed to handle the 10 GB/s train data stream
• High network bandwidth and low CPU load (stable)
• Direct IO:
  – Network read and disk write operations overlap 97% → low overall time per file
  – Application buffer: for big files, the bigger the slot size the better the
    disk IO performance (as long as DMA allows)
• To do:
  – Concurrent IO operations: write/write, write/read, file merging
  – Storage manager: file indexing, disk space management
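The 97% overlap figure can be recovered from the 40 GB row of the external-storage table: the time during which reading and writing ran concurrently is (read + write − overall), expressed as a fraction of the shorter operation. A quick check using the table's average times:

```python
# Average times for 40 GB files on external storage (values from the table).
net_read, disk_write, overall = 35.74, 41.39, 42.36   # seconds

# Time both operations ran simultaneously, as a fraction of the shorter one.
concurrent = net_read + disk_write - overall
overlap = concurrent / min(net_read, disk_write)

print(f"overlap = {overlap:.0%}")   # prints "overlap = 97%"
```

In other words, for large files the overall time per file is barely longer than the slower of the two operations, which is what makes the 1 GB / 1.6 s target reachable on a single box.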
Summary & Outlook
Summary
• First half of the slice-test hardware is configured and running
• Testing and tuning network and I/O performance using
  – System/community tools: netperf, iozone, …
  – PCL software
  – Train builder board
• TB (emulator) → PCL software (Dell)
  – Bandwidth: 9.9 Gbps (99% of the wire speed)
  – Low UDP packet loss rate: only a few packets lost at the start of runs
    (3.5×10^8 to 5×10^9 packets); less than 3.7 to 0.26 per 10^5 trains
    can be affected at most
• PC Layer (Dell) → storage boxes (IBM)
  – TCP data streaming: ~9.8 Gbps
  – Terabytes of data written to disk at 0.97 to 1.1 GB/s
Outlook
• Fully featured DAQ system
  – Data readout, pre-processing, monitoring, storage
  – Feed the system with real data and apply real algorithms
    (processing, monitoring, scientific computing)
  – Deployment, configuration and control: upload libraries, initiate devices,
    start/stop and monitor runs → device composition
• Soak and stress testing
  – Test performance (CPU, IO, network), behavior (bugs, memory leaks),
    reliability (error handling, failure), and stability of the system
    • Significant workload applied over a long period of time
  – Parallel tasks: forwarding data to online analysis or scientific computing,
    multiple streams into the same storage server, …
• Cluster file system vs. DDN vs. local storage system
• Data management: structure, access control, metadata, …
Thanks!