synchrotron.org.au
Big Data at the Australian Synchrotron
Professor Andrew Peele
Director, Australian Synchrotron, and ANSTO Representative in Victoria
Australian Nuclear Science and Technology Organisation
ANSTO is a public research organisation with a variety of roles for the nation.
ANSTO operates Australia’s multipurpose nuclear reactor.
• Research and Innovation
• Science and Engineering
• Commercial Businesses
• Expert advice and support to Government and international agencies
Australia’s National Research Priorities
Landmark and National Research Infrastructure
ANSTO Research Infrastructure
• OPAL multi-purpose reactor
• Australian Centre for Neutron Scattering
• Australian Synchrotron
• Centre for Accelerator Science
Radiobiology & Bioimaging
Isotope Tracing in Natural Systems
Radiotracers & Radioisotopes
Materials Development & Characterisation
Nuclear Stewardship
National Deuteration Facility
Soil and water
Environmental change and health
Food
Resources
Advanced manufacturing
Cyber security
Transport
Energy
Life-changing pharmaceutical breakthroughs
Several drugs developed following structural studies and target screening at the Australian Synchrotron are now in clinical trials.
• Venetoclax: developed by WEHI, Genentech & Abbott for treatment of Chronic Lymphocytic Leukaemia
• CSL362: developed by St Vincent’s Institute of Medical Research & CSL for treatment of Acute Myeloid Leukaemia cancer cells
• Momelotinib: developed by Gilead Sciences for treatment of Myelofibrosis and Pancreatic Cancer
• Nexvax2: developed by Monash University with ImmusanT for treatment of Celiac Disease
• Solanezumab: developed by St Vincent’s Institute for treatment of Alzheimer’s Disease
• PRMT5 inhibitors: developed by Cancer Therapeutics CRC with Merck for treatment of Melanoma, Breast Cancer
Infrastructure for researchers
[Chart: shifts requested vs shifts awarded per beamline (Far-IR, IMBL, IRM, MX1/MX2, PD, SAXS, SXR, XAS, XFM); vertical axis 0 to 900]
Merit beamtime: 80%
• Free of charge to users
• Travel and accommodation paid
• Expectation to publish
Facility time: 20%
• Including commercial access
Infrastructure for researchers
Access is peer reviewed based on merit consistent with international best-practice:
Quality of the proposal
National benefit and applications
Track record
The need for Synchrotron radiation
40% 30% 30%
Three application rounds per year
Operates 24/7 (apart from maintenance periods)
More than 5600 researcher visits per year
Around 1000 experiments
All facilities are oversubscribed. The success rate for applications is about 60%, which is about right for competition to breed excellence.
Our current 10 operational beamlines (Capacity for 30+ beamlines)
IRM Infrared Microscope
Far-IR Terahertz / Far-IR Spectroscopy
MX2 Micro-focused Crystallography
MX1 Macromolecular Crystallography
XFM X-ray Fluorescence Microscopy (4–25 keV)
IMBL Imaging and Medical Beamline (30–120 keV)
PD Powder Diffraction (4–37 keV)
XAS X-ray Absorption Spectroscopy (4–50 keV)
SAXS / WAXS Small Angle X-ray Scattering / Wide Angle X-ray Scattering (6–20 keV)
SXR Soft X-ray Spectroscopy (90–2500 eV)
Soft X-ray Imaging
Managing Big Data at the Australian Synchrotron
Dr Andreas Moll
Senior Scientific Software Engineer
Flavours of Big Data: Data volume
Imaging and Medical Beamline: ~270 TB
X-ray Fluorescence Microscopy beamline: ~146 TB
Flavours of Big Data: Single images
1 gigapixel image: 40 × 9 mm = 66667 × 15000 pixels (600 nm pixel size), raw data 250 GB, scan time 38 hrs.
Petrographic section of high grade ore from western shear zone of the Sunrise Dam gold deposit, WA
Sr:Fe:Rb map
Fisher et al., Miner. Deposita 50, 665-674 (2015)
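The numbers quoted for the gigapixel image can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming 1 GB = 10^9 bytes and that the 38 hours are spread evenly over all pixels; the variable and function names are illustrative and do not come from the beamline software.

```python
# Scan geometry and totals as quoted on the slide.
width_mm, height_mm = 40.0, 9.0
pixel_nm = 600.0
raw_gb = 250.0        # assumption: 1 GB = 1e9 bytes
scan_hours = 38.0


def pixels_along(length_mm: float, pixel_size_nm: float) -> int:
    """Number of pixels needed to cover length_mm at the given pixel size."""
    return round(length_mm * 1e6 / pixel_size_nm)


nx = pixels_along(width_mm, pixel_nm)
ny = pixels_along(height_mm, pixel_nm)
n_pixels = nx * ny

# Average time and raw data volume per pixel, including any scan overhead.
time_per_pixel_us = scan_hours * 3600 / n_pixels * 1e6
bytes_per_pixel = raw_gb * 1e9 / n_pixels

print(f"{nx} x {ny} = {n_pixels:.3e} pixels")
print(f"average time per pixel: {time_per_pixel_us:.1f} microseconds")
print(f"average raw data per pixel: {bytes_per_pixel:.0f} bytes")
```

Under these assumptions the scan averages well under a millisecond and roughly 250 bytes of raw data per pixel, which is why neither the acquisition nor the storage can be handled by simple means.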
Flavours of Big Data: Data rate
Micro Crystallography (MX2) beamline
[Figure: sample orientation and diffraction pattern]
Data acquisition took 15 minutes.
With the next iteration of the detector, acquisition will take 18 seconds and can create raw data at ~4 GB/s!
Dealing with Big Data
Scientific software
• Data management
• Workflows
• Real time analysis
• Distributed computing
• Automatic workflows for data reduction and processing
• Remote analysis tools for users
Infrastructure
• Storage
• Compute (CPU + GPU)
• Network
Big Data definition
A volume of data that is too large or too complex to process by simple means, hence requiring significant investments in IT infrastructure, workflows and tools to capture, store, transfer, analyse and visualise datasets.
Infrastructure at the Australian Synchrotron
Storage:
• Central storage: 650 TB
• Additional storage at RDS: 440 TB
• We still keep all historic user data (except IMBL)
• Official data retention period: 6 to 12 months
HPC:
MASSIVE (operated by Monash University)
• Batch system (based on SLURM)
• Remote Desktop environment
• Realtime visualisation
42 nodes, each with
• 2x6 core X5650 CPUs
• 48 GB RAM
• 2 NVIDIA M2070 GPUs
58 TB GPFS file system
Data collection and processing
Imaging and Medical Beamline
• Three experimental enclosures for various resolutions and imaging modalities
• Largest beam in the world, up to 540 x 48 mm in 3B
• High flux from the superconducting multipole wiggler
• Dedicated near-beam surgery and animal holding and preparation facilities
• All with Computed Tomography (CT) capabilities
Computed Tomography
[Diagram: X-ray beam → sample → detector]
• capture: projections (individual TIF files)
• reconstruction: slices
• visualisation and analysis
Computed Tomography
2B
Detector parameters
X pixels: 2560
Y pixels: 600
Bit depth (Ruby): 16
Single image size: 2.9 MB
Acquisition time*: 0.05 s
Raw data size
Projections: 1800
Slices: 25
Total dataset size: 132 GB
Time: 38 min
~3 - 5 GB per minute
~3 samples / 2 hours
~12 samples / shift
~36 samples per day
~14 TB raw data in a 3 day experiment
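The raw data figures on this slide follow from the detector parameters. The sketch below reproduces them under two assumptions that are not stated on the slide: the image size is given in binary megabytes (2^20 bytes), and the total is then converted to GB by dividing by 1000. The variable names are illustrative, and the scan time counts exposure only.

```python
# Detector and scan parameters as listed on the slide.
x_pixels, y_pixels = 2560, 600
bit_depth = 16
exposure_s = 0.05
projections = 1800
serial_scans = 25          # listed as "Slices" on the slide
samples_per_day = 36
experiment_days = 3

# One 16-bit projection image.
image_bytes = x_pixels * y_pixels * bit_depth // 8
image_mb = image_bytes / 1024**2             # assumption: binary megabytes

# One sample = projections x serial scans.
dataset_gb = image_mb * projections * serial_scans / 1000   # assumption: /1000
scan_min = exposure_s * projections * serial_scans / 60     # exposure time only
rate_gb_per_min = dataset_gb / scan_min

# A full three day experiment at the quoted sample throughput.
experiment_tb = dataset_gb * samples_per_day * experiment_days / 1000

print(f"single image: {image_mb:.1f} MB")
print(f"dataset: {dataset_gb:.0f} GB in {scan_min:.1f} min ({rate_gb_per_min:.1f} GB/min)")
print(f"3 day experiment: {experiment_tb:.1f} TB raw")
```

With these unit conventions the sketch lands on roughly 2.9 MB per image, 132 GB per sample and about 14 TB of raw data per experiment, matching the slide.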
Computed Tomography
1) Stitching:
Stitches together serial scans into a single projection image at each angle
• Input: 2560 x 600 px x 25 slices with 10% overlap, 1800 projections
• Output: full sample (13620 slices), 116 GB per sample
2) Reconstruction with X-tract:
Uses projections to reconstruct tomographic slices of the sample
• 1 slice (2560 x 2560 px), now 32 bit: 25 MB per slice
• 332 GB per sample (plus 8 bit (83 GB))
~60 TB total data potential for 1 experiment (3 days)!
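The per-sample sizes for stitching and reconstruction can be checked the same way. This is a sketch under stated assumptions: stitched projections stay at 16 bit, reconstructed slices are 32 bit, the 8 bit copy is a quarter of the 32 bit volume, and sizes are in binary units (the slide's "MB" and "GB" read as MiB and GiB). None of these conventions are confirmed by the slide; they are simply the ones that reproduce its numbers.

```python
MIB = 1024**2
GIB = 1024**3

width_px = 2560
stitched_rows = 13620     # slices in the full sample, from the slide
projections = 1800

# 1) Stitching: one tall 16-bit projection per angle (assumed bit depth).
stitched_gib = width_px * stitched_rows * 2 * projections / GIB

# 2) Reconstruction: one 2560 x 2560 slice per detector row, 32 bit.
slice_mib = width_px * width_px * 4 / MIB
recon_gib = slice_mib * stitched_rows / 1024
recon_8bit_gib = recon_gib / 4       # 8-bit copy is a quarter of the 32-bit volume

print(f"stitched projections: {stitched_gib:.1f} GiB per sample")
print(f"one reconstructed slice: {slice_mib:.0f} MiB")
print(f"reconstructed volume: {recon_gib:.1f} GiB (+ {recon_8bit_gib:.1f} GiB at 8 bit)")
```

The result is close to 116 GB, 25 MB, 332 GB and 83 GB respectively, so the derived data for one sample is several times larger than its raw data.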
Computed Tomography
2) Reconstruction with X-tract:
Uses projections to reconstruct tomographic slices of the sample
• 1 slice (2560 x 2560 px), now 32 bit: 25 MB per slice
• Full sample (13620 slices): 332 GB per sample (plus 8 bit (83 GB))
22 TB
~60 TB total data potential for 1 experiment (3 days)!
Online vs offline
[Diagram: the IMBL detector collects to local storage, which feeds the compute node; the user works at the beamline or connects via VNC]
• Online (during the beamtime)
• Offline (post beamtime)
• Compute (imblcompute)
• 48 CPUs, 2 GPUs, 512 GB RAM, 60 TB storage
• Run for each projection in parallel
• X-tract uses CUDA for GPU acceleration
Paradigm shift: bring the users to the data and not the data to the users
How to handle Big Data:
Remote analysis instead of data transfer (sftp, hard drives etc.)
Gigapixel image on MASSIVE
Cluster mode
Gigapixel image = 2,505 files, each 100 MB
Analysed using GeoPIXE software
Can run in Cluster mode for data sorting and extraction
How to handle Big Data:
• Partition data
• Parallelise sorting through data
Each MASSIVE remote access session provides:
• 12 CPUs
• 1 GPU
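The "partition data, then parallelise" idea can be illustrated with a small sketch. It is not how GeoPIXE's cluster mode is implemented; the file names and the `sort_chunk` function are hypothetical placeholders, and a thread pool stands in for whatever scheduler the cluster actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def sort_chunk(chunk):
    """Hypothetical per-chunk work; here it only counts the files it was given."""
    return len(chunk)


# 2,505 files as on the slide (names are made up), 12 workers as in one session.
files = [f"scan_{i:04d}.dat" for i in range(2505)]
chunks = partition(files, 12)

with ThreadPoolExecutor(max_workers=12) as pool:
    counts = list(pool.map(sort_chunk, chunks))

print(counts)
print(sum(counts), "files processed")
```

Contiguous, near-equal chunks keep each worker reading a compact part of the scan, which is one simple way to balance the load; other partitioning schemes are equally possible.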
‘Realtime’ processing and data reduction
Automatic workflows
• reduce data by averaging data, removing unwanted data, etc.
• first, quick reconstruction of ‘live data’ for quick user feedback
• full processing of the data where possible
Example MX2 beamline: Workflow for automatic data processing and protein structure determination from MX diffraction images (close to real-time)
1. single shot assessment of space group and quality metric
2. data reduction of datasets with special care for the type of experiment (chemical or protein crystallography)
What we have learned
• Design and implementation of all workflows were driven by the available infrastructure, e.g. MASSIVE and RDS services existed before the workflow
• Workflows are custom built and can’t be re-used
• Depend on external service provider
Next iteration:
• Decouple workflow and infrastructure
• Generic workflow software
• Microservice architecture
ASCI – Australian Synchrotron Computing Infrastructure
Realtime diffraction spot finding at MX2
• Uses newly developed workflow software
• Check quality of recorded data live
ASCI - Australian Synchrotron Computing Infrastructure
6 nodes, each with
• 48 CPUs
• 2 GPUs (NVIDIA GeForce GTX 1080)
• 512 GB RAM
2 PB (raw) of Ceph storage
[Diagram: users reach their analysis sessions over an HTML5 based VNC connection through a firewall and nginx (routing + security); the Infrastructure Service and Workflow Service create instances from docker images (IMBL, XFM, SAXS/WAXS, …), with automatic load balancing of docker containers]
The future of data processing
• Streaming of data instead of writing (intermediate) files to disk
• Clever file formats (structure the data in an optimal way)
  • So far: TIF, text, proprietary binary files
  • Next: HDF5
• Distributed computing
• Common workflow system (graph based, distributed)
• Microservice architecture
  • split monolithic applications into independent services
  • allows for more flexibility and scalability
• Automated metadata capture and data curation / preservation
[Diagram: graph of connected tasks]
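A graph based workflow can be sketched in a few lines: tasks are nodes, dependencies are edges, and a task runs once all of its inputs are available. The example below is only an illustration of that idea using Python's standard `graphlib` module (Python 3.9+). The task names and the `run_workflow` helper are hypothetical and are not the facility's workflow software; a real system would also distribute the tasks across nodes.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps):
    """Run tasks in dependency order.

    tasks: maps a task name to a callable.
    deps:  maps a task name to the ordered list of tasks it depends on.
    Each task is called with the results of its dependencies as arguments.
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, ())]
        results[name] = tasks[name](*inputs)
    return results


# Hypothetical four-task workflow: load frames, reduce them, count them, report.
tasks = {
    "load": lambda: [1, 2, 3, 4],
    "reduce": lambda frames: sum(frames) / len(frames),
    "count": lambda frames: len(frames),
    "report": lambda mean, n: f"{n} frames, mean {mean}",
}
deps = {
    "reduce": ["load"],
    "count": ["load"],
    "report": ["reduce", "count"],
}

results = run_workflow(tasks, deps)
print(results["report"])
```

Because the tasks only communicate through their inputs and outputs, "reduce" and "count" could run on different machines, which is what makes the graph form attractive for distributed computing.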
Summary
• Big Data requires clever storage, file formats and processing algorithms
• Bring the users to the data and not the data to the users
• The facility that provides users with the best computing environment will have a competitive edge
Send user home with information not with data
XFM - ideally suited to study bio-metals
Simultaneous access to 10+ elements; Z > 14 ~ Si
High sensitivity - sub-ppm; sub-mM; 1e-12g / s
Native contrast - no dyes or contrast agents necessary - but possible!
Quantitative
Non-destructive / minor damage
Extended penetration & DoF - study intact cells & sections
Sensitive to chemical speciation via XANES spectroscopy
[Figure: sensitivity (ppt, ppb, ppm) versus spatial resolution (0.1 μm to 100 μm) for LA-ICP-MS, LMD-LA-ICP-MS, XFM, SEM-EDX and PIXE]
EJ New Dalton Trans (2013), 42(9) pp 3210
Data
Antony van der Ent, Hugh Harris, Martin de Jonge, Peter Erskine, Rachel Mak, Jolanta Mesjasz-Przybylowicz, Wojciech Przybylowicz, Emmanuelle Montargès-Pelletier, Alban Barnabas, Guillaume Echevarria, David Paterson and Daryl Howard University of Adelaide Australian Synchrotron
The Maia Detector
1. Form a spot on a specimen
2. Collect fluorescence + scatter in 384 detector pixels and stage position signals while scanning sample
XFM @ AS: ~2 µm FWHM, ~1e10 ph / s
[Figure: sample position; (integrated) fluorescence spectrum; fitted spectrum]
Naïve Data Storage
1 Gpix image = 1 GB (pixels in image) x 2048 (spectral channels) x 384 (detector pixels) = 786 TB for one image!
[Image: Sr:Fe:Rb map]
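The 786 TB figure is a straightforward product. The sketch below reproduces it, assuming one byte per stored value and 1 TB = 10^12 bytes; both are assumptions made to match the slide.

```python
# Naive storage: a full spectrum for every image pixel and every detector element.
image_pixels = 10**9         # 1 gigapixel image
spectral_channels = 2048
detector_elements = 384
bytes_per_value = 1          # assumption: one byte per channel value

naive_bytes = image_pixels * spectral_channels * detector_elements * bytes_per_value
naive_tb = naive_bytes / 1e12

print(f"naive storage: {naive_tb:.0f} TB for one image")
```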
Event Mode Data Storage
[Figure: fitted spectrum versus energy, 0 to 20 keV]
After “training”, elemental maps are determined by performing a fit of the elemental & scatter intensities in each low-statistics single-pixel spectrum.
THIS FIT CAN BE LINEAR (but often isn’t)
Many empty channels suggest event mode data storage
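Why empty channels favour event mode can be shown with a toy example. In event mode only the detected photons are stored, one record per event, and the per-pixel spectra are rebuilt on demand. The record layout (pixel index, energy channel) and the byte sizes below are assumptions for illustration; they are not the Maia detector's actual data format.

```python
from collections import Counter


def dense_bytes(n_pixels, n_channels, bytes_per_bin=1):
    """Storage needed when every channel of every pixel is written out."""
    return n_pixels * n_channels * bytes_per_bin


def event_bytes(n_events, bytes_per_event=4):
    """Storage needed when only detected events are written (assumed 4 bytes each)."""
    return n_events * bytes_per_event


def accumulate(events):
    """Rebuild sparse per-pixel spectra from an event list of (pixel, channel)."""
    spectra = {}
    for pixel, channel in events:
        spectra.setdefault(pixel, Counter())[channel] += 1
    return spectra


# Four events spread over a 4 pixel, 2048 channel scan: almost every bin is empty.
events = [(0, 12), (0, 12), (0, 700), (3, 41)]
spectra = accumulate(events)

print(spectra[0][12], "counts in pixel 0, channel 12")
print(dense_bytes(4, 2048), "bytes dense vs", event_bytes(len(events)), "bytes as events")
```

The saving depends entirely on how sparse the spectra are; once most channels hold counts, a dense layout becomes the smaller one again.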
How many events are there?
AS brilliance: 10¹⁹ ph/s/mrad²/mm²/0.1% bw
~10¹⁵ ph/s/0.1% bw at AS front end
Event Mode Data Storage
Photons/s along the beam path at AS:
• Storage Ring: 10¹⁵
• Beamline/Mono: 10¹⁰
• Sample: 10⁷
• Detector: 10⁶
1 MB/s, 86 GB/day
1 TB/day for all AS
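The step from a detector count rate to a daily data volume is simple arithmetic. The sketch below assumes about 10^6 events per second at the detector and one byte per stored event, which is the combination that gives the quoted 1 MB/s; the actual event size is not stated on the slide.

```python
# Detected event rate and an assumed storage cost per event.
events_per_s = 1e6
bytes_per_event = 1          # assumption chosen to match the quoted 1 MB/s
seconds_per_day = 24 * 3600

rate_mb_per_s = events_per_s * bytes_per_event / 1e6
volume_gb_per_day = rate_mb_per_s * seconds_per_day / 1e3

print(f"{rate_mb_per_s:.0f} MB/s -> {volume_gb_per_day:.1f} GB/day")
```

Continuous operation at 1 MB/s therefore gives about 86 GB per day for a single detector.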
What next?
• Now: 1 TB/day for all AS
• New beamlines & new detection systems: 10 TB/day for all AS
• Future: 10 EB/day for all AS??
XFM is being used to study the sub-micron metal distribution in grains such as wheat, barley and rice.
B. K. R. Trijatmiko, et al. Scientific Reports, 6, 19792 (2016).
Big Data = Supercharging food
First International field trials in Philippines & Colombia
Concentration (µg g⁻¹ of rice):
              Iron   Zinc
Natural          2     16
Target          13     28
This Study      15     45
More than two billion people are micronutrient deficient
Wild Type Johnson Strain
B. Kyriacou, et al., J. Cereal Science, 59, 173 (2014).
Big Data = Benefits to industry
1 %
11.6 %
• Through research programs
• > 200 companies interacting with University and research institutions
• Access to researchers
• Access to grant funding
• Access to facilities
• Internal Beamline-Industry Group
Big Data = Real-life benefits
De-clogging Ink-jet printer heads for MemJet
Materials for improved solar cell efficiency
Gold in Gum Leaves
Facilitating approval of generic oncology medication for Hospira
Testing safety of zinc nanoparticles in sunscreen
Venetoclax approved by FDA to combat chronic lymphocytic leukemia
Strengthening sheep leather
Over 1,284 protein structures solved
Cultural Heritage – finding hidden artworks
Iron enriched rice variants
Over 2,800 peer reviewed papers
Over 620 student theses
Zeobond green cement
Stainless magnesium
New beamlines
BioSAXS
MX3
Micro materials characterisation
Advanced diffraction and scattering
Medium Energy XAS
Micro-CT
X-ray fluorescence nanoprobe
New beam lines = Meet demand, fill gaps
Geosciences Health / Medical Advanced materials
High energy
3D Imaging
High throughput protein structure
Small crystal capacity
Residual stress analysis
Combined spectroscopy, diffraction and imaging
New beam lines = More real-life benefits
Geosciences Health / Medical Advanced materials
Better use of resources
Better drugs
Better materials