synchrotron.org.au Big ideas + big data = real life benefits Thursday 27 October 2016


Big Data at the Australian Synchrotron

Professor Andrew Peele

Director, Australian Synchrotron, and ANSTO Representative in Victoria

Australian Nuclear Science and Technology Organisation

ANSTO is a public research organisation with a variety of roles for the nation.

ANSTO operates Australia’s multipurpose nuclear reactor.

Research and Innovation

Science and Engineering

Commercial Businesses

Expert advice and support to Government and international agencies

Australia’s National Research Priorities

Landmark and National Research Infrastructure

ANSTO Research Infrastructure

• OPAL multi-purpose reactor
• Australian Centre for Neutron Scattering
• Australian Synchrotron
• Centre for Accelerator Science

Radiobiology & Bioimaging

Isotope Tracing in Natural Systems

Radiotracers & Radioisotopes

Materials Development & Characterisation

Nuclear Stewardship

National Deuteration Facility

Soil and water

Environmental change and health

Food

Resources

Advanced manufacturing

Cyber security

Transport

Energy

Multi-site organisation

• Clayton, VIC
• Lucas Heights, NSW
• Camperdown, NSW

Life-changing pharmaceutical breakthroughs

Several drugs have been developed following structural studies and target screening at the Australian Synchrotron and are now in clinical trials:

• Venetoclax: developed by WEHI, Genentech & Abbott for treatment of Chronic Lymphocytic Leukaemia
• CSL362: developed by St Vincent’s Institute of Medical Research & CSL for treatment of Acute Myeloid Leukaemia cancer cells
• Momelotinib: developed by Gilead Sciences for treatment of Myelofibrosis and Pancreatic Cancer
• Nexvax2: developed by Monash University with ImmusanT for treatment of Celiac Disease
• Solanezumab: developed by St Vincent’s Institute for treatment of Alzheimer’s Disease
• PRMT5 inhibitors: developed by Cancer Therapeutics CRC with Merck for treatment of Melanoma, Breast Cancer

Infrastructure for researchers

[Chart: shifts requested and shifts awarded per beamline (Far-IR, IMBL, IRM, MX1/MX2, PD, SAXS, SXR, XAS, XFM); vertical axis from 0 to 900]

Merit beamtime: 80%
• Free of charge to users
• Travel and accommodation paid
• Expectation to publish

Facility time: 20%
• Including commercial access

Infrastructure for researchers

Access is peer reviewed based on merit consistent with international best-practice:

Quality of the proposal

National benefit and applications

Track record

The need for Synchrotron radiation

40% 30% 30%

Three application rounds per year

Operates 24/7 (apart from maintenance periods)

More than 5600 researcher visits per year

Around 1000 experiments

All facilities are oversubscribed. The success rate for applications is about 60%. About right for competition to breed excellence.

Our current 10 operational beamlines (Capacity for 30+ beamlines)

IRM Infrared Microscope

Far-IR Terahertz / Far-IR Spectroscopy

MX2 Micro-focused Crystallography

MX1 Macromolecular Crystallography

XFM X-ray Fluorescence Microscopy (4–25 keV)

IMBL Imaging and Medical Beamline (30–120 keV)

PD Powder Diffraction (4–37 keV)

XAS X-ray Absorption Spectroscopy (4–50 keV)

SAXS / WAXS Small Angle X-ray Scattering / Wide Angle X-ray Scattering (6–20 keV)

SXR Soft X-ray Spectroscopy (90–2500 eV)

Soft X-ray Imaging


Managing Big Data at the Australian Synchrotron

Dr Andreas Moll

Senior Scientific Software Engineer

Flavours of Big Data: Data volume

Imaging and Medical Beamline: ~270 TB

X-ray Fluorescence Microscopy beamline: ~146 TB

Flavours of Big Data: Single images

1 Gigapixel image: 40 × 9 mm = 66667 × 15000 pixels (600 nm pixel size), raw data 250 GB, scan time 38 hrs.

Petrographic section of high grade ore from western shear zone of the Sunrise Dam gold deposit, WA

Sr:Fe:Rb map

Fisher et al., Miner. Deposita 50, 665-674 (2015)
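The gigapixel figure can be checked from the numbers on the slide. The sketch below is only a back-of-the-envelope calculation; the variable names are illustrative, and the time per pixel is a plain average of the quoted 38 hour scan time, so it includes any overheads.

```python
# Figures from the slide: a 40 mm x 9 mm scan area with a 600 nm pixel size.
WIDTH_MM, HEIGHT_MM = 40, 9
PIXEL_NM = 600
NM_PER_MM = 1_000_000
SCAN_HOURS = 38


def pixels_along(length_mm: int, pixel_nm: int = PIXEL_NM) -> int:
    """Number of pixels along one scan axis, rounded to the nearest pixel."""
    return round(length_mm * NM_PER_MM / pixel_nm)


cols = pixels_along(WIDTH_MM)    # 66667
rows = pixels_along(HEIGHT_MM)   # 15000
total_pixels = cols * rows       # roughly one billion pixels

# Average time spent per pixel, in microseconds (scan time / pixel count).
avg_time_per_pixel_us = SCAN_HOURS * 3600 / total_pixels * 1e6

print(cols, rows, total_pixels)
print(f"{avg_time_per_pixel_us:.1f} microseconds per pixel on average")
```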

Flavours of Big Data: Data rate

Micro Crystallography (MX2) beamline

[Figure: sample orientation and diffraction pattern]

Data acquisition took 15 minutes

With the next iteration of the detector, acquisition will take 18 seconds and can create raw data at ~4 GB/s!
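To put the MX2 numbers in perspective, here is a short calculation of the speed-up and of the raw data volume per acquisition. It assumes the ~4 GB/s rate is sustained for the whole 18 seconds, which the slide does not state.

```python
OLD_ACQUISITION_S = 15 * 60   # acquisition time quoted on the slide
NEW_ACQUISITION_S = 18        # next detector iteration
RAW_RATE_GB_PER_S = 4         # approximate raw data rate from the slide

# How much faster one acquisition becomes.
speedup = OLD_ACQUISITION_S / NEW_ACQUISITION_S

# Raw data per acquisition, ASSUMING the rate is sustained throughout.
raw_per_acquisition_gb = NEW_ACQUISITION_S * RAW_RATE_GB_PER_S

print(f"speed-up: {speedup:.0f}x")
print(f"raw data per acquisition: ~{raw_per_acquisition_gb} GB")
```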

Dealing with Big Data

Scientific software
• Data management
• Workflows
• Real time analysis
• Distributed computing
• Automatic workflows for data reduction and processing
• Remote analysis tools for users

Infrastructure
• Storage
• Compute (CPU + GPU)
• Network

Big Data definition

A volume of data that is too large or too complex to process by simple means, hence requiring significant investments in IT infrastructure, workflows and tools to capture, store, transfer, analyse and visualise datasets.

Infrastructure at the Australian Synchrotron

Storage:
• Central storage: 650 TB
• Additional storage at RDS: 440 TB
• We still keep all historic user data (except IMBL)
• Official data retention period: 6 – 12 months

HPC:
MASSIVE (operated by Monash University)
• Batch system (based on SLURM)
• Remote Desktop environment
• Realtime visualisation

42 nodes, each with
• 2x6 core X5650 CPUs
• 48 GB RAM
• 2 NVIDIA M2070 GPUs
• 58 TB GPFS file system

Data collection and processing

Imaging and Medical Beamline

• Three experimental enclosures for various resolutions and image modalities
• Largest beam in the world, up to 540 x 48 mm in 3B
• High flux from the superconducting multipole wiggler
• Dedicated near-beam surgery and animal holding and preparation facilities
• All with Computed Tomography (CT) capabilities

Computed Tomography

X-ray beam → sample → detector → capture → projections (individual TIF files) → reconstruction → slices → visualisation and analysis

Computed Tomography

Detector parameters (2B)
X Pixels                  2560
Y Pixels                  600
Bit Depth (Ruby)          16
Single Image size (MB)    2.9
Acquisition Time* (s)     0.05

Raw data size
Projections               1800
Slices                    25
Total Dataset Size (GB)   132
Time (min)                38

~3 - 5 GB per minute
~3 samples / 2 hours
~12 samples / shift
~36 samples per day
~14 TB raw data in a 3 day experiment
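The raw data figures above follow from the detector parameters. The sketch below redoes the arithmetic; it assumes one image per projection per serial scan (the "Slices" row), and it reports sizes in both decimal and binary units because the slide does not say which it uses. The quoted 132 GB lies between the two readings.

```python
X_PIXELS, Y_PIXELS = 2560, 600
BYTES_PER_PIXEL = 16 // 8      # 16 bit depth
ACQUISITION_S = 0.05           # per image
PROJECTIONS = 1800
SERIAL_SCANS = 25              # the "Slices" row of the table

image_bytes = X_PIXELS * Y_PIXELS * BYTES_PER_PIXEL
image_mib = image_bytes / 2**20            # about 2.9, as on the slide

# ASSUMPTION: one image per projection per serial scan.
n_images = PROJECTIONS * SERIAL_SCANS
dataset_bytes = image_bytes * n_images
dataset_gb = dataset_bytes / 10**9         # decimal gigabytes
dataset_gib = dataset_bytes / 2**30        # binary gibibytes

scan_minutes = n_images * ACQUISITION_S / 60   # about 38 minutes

# Using the slide's own 132 GB per sample and 36 samples per day.
raw_tb_three_days = 132 * 36 * 3 / 1000        # about 14 TB

print(f"{image_mib:.2f} MiB per image, {n_images} images")
print(f"{dataset_gb:.1f} GB / {dataset_gib:.1f} GiB per sample")
print(f"{scan_minutes:.1f} min per sample, {raw_tb_three_days:.1f} TB in 3 days")
```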

Computed Tomography

1) Stitching:
Stitches together serial scans into a single projection image at each angle
• 2560 x 600 px x 25 slices with 10% overlap, 1800 projections
• 116 GB per sample

2) Reconstruction with X-tract:
Uses projections to reconstruct tomographic slices of the sample
• 1 Slice (2560 x 2560 px), now 32 bit! 25 MB per slice
• Full Sample (13620 slices): 332 GB per sample (plus 8 bit (83 GB))

~60 TB total data potential for 1 experiment (3 days)!
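The stitched and reconstructed sizes can be reproduced as follows. This is a sketch under two assumptions not stated on the slide: the stitched projections stay at 16 bit, and each of the 13620 rows of a stitched projection yields one reconstructed slice. With those assumptions the slide's figures match when "GB" and "MB" are read as binary units.

```python
WIDTH = 2560          # detector width in pixels
ROWS = 13620          # stitched projection height = number of slices
PROJECTIONS = 1800
MIB, GIB = 2**20, 2**30

# 1) Stitching: one tall 16 bit image per projection angle (ASSUMED 16 bit).
stitched_bytes = WIDTH * ROWS * PROJECTIONS * 2

# 2) Reconstruction: one square 32 bit slice per row.
slice_bytes = WIDTH * WIDTH * 4
recon_bytes = slice_bytes * ROWS
recon_8bit_bytes = WIDTH * WIDTH * ROWS    # optional 8 bit copy

print(f"stitched:       {stitched_bytes / GIB:.1f} GiB per sample")
print(f"one slice:      {slice_bytes / MIB:.0f} MiB")
print(f"reconstruction: {recon_bytes / GIB:.1f} GiB per sample")
print(f"8 bit copy:     {recon_8bit_bytes / GIB:.1f} GiB per sample")
```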

Computed Tomography

2) Reconstruction with X-tract:
Uses projections to reconstruct tomographic slices of the sample
• 1 Slice (2560 x 2560 px), now 32 bit! 25 MB per slice
• Full Sample (13620 slices): 332 GB per sample (plus 8 bit (83 GB))

22 TB

~60 TB total data potential for 1 experiment (3 days)!

Online vs offline

Online (during the beamtime)
• User at beamline
• IMBL Detector (collect)
• Local Storage
• Compute: imblcompute
• 48 CPUs, 2 GPUs, 512 GB RAM, 60 TB Storage
• Run for each projection in parallel
• X-tract uses CUDA for GPU acceleration

Offline (post beamtime)
• VNC

How to handle Big Data:
• Paradigm shift: bring the users to the data and not the data to the users
• Remote analysis instead of data transfer (sftp, hard drives etc.)

Remote access with Strudel


Gigapixel image on MASSIVE

How to handle Big Data:

Cluster mode

Gigapixel image = 2,505 files, each 100 MB
Analysed using GeoPIXE software
Can run in Cluster mode for data sorting and extraction
• Partition data
• Parallelise sorting through data

Each MASSIVE remote access session provides:
• 12 CPUs
• 1 GPU
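The "partition data, then sort in parallel" idea can be sketched in a few lines. This is not how GeoPIXE implements its cluster mode; the file names and the `process_chunk` worker are hypothetical stand-ins, and the worker count of 12 simply mirrors the CPUs of one remote session.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def process_chunk(chunk):
    """Hypothetical stand-in for sorting/extracting one chunk; returns files handled."""
    return len(chunk)


# Hypothetical file names for the 2,505 files of the gigapixel image.
files = [f"scan_{i:04d}.dat" for i in range(2505)]
chunks = partition(files, 12)

# Each chunk is handled by its own worker.
with ThreadPoolExecutor(max_workers=12) as pool:
    handled = list(pool.map(process_chunk, chunks))

print([len(c) for c in chunks])
print(sum(handled), "files processed")
```

Contiguous, near-equal chunks keep the workers evenly loaded when every file has the same size, as is the case here (100 MB each).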

‘Realtime’ processing and data reduction

Automatic workflows
• reduce data by averaging data, removing unwanted data, etc.
• first, quick reconstruction of ‘live data’ for quick user feedback
• full processing of the data where possible

Example MX2 beamline: Workflow for automatic data processing and protein structure determination from MX diffraction images (close to real-time)
1. single shot assessment of space group and quality metric
2. data reduction of datasets with special care for the type of experiment (chemical or protein crystallography)
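As an illustration of "reduce data by averaging", the sketch below averages consecutive groups of frames element by element. It is a generic example, not the facility's workflow code; the function name and the toy frames are made up for illustration.

```python
def average_frames(frames, group_size):
    """Average consecutive groups of frames element-wise.

    Leftover frames at the end form a smaller final group.
    """
    reduced = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        n = len(group)
        # zip(*group) walks through the same pixel position in every frame.
        reduced.append([sum(values) / n for values in zip(*group)])
    return reduced


# Four tiny two-pixel "frames" reduced to two averaged frames.
frames = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(average_frames(frames, 2))  # [[2.0, 3.0], [6.0, 7.0]]
```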

What we have learned

Design and implementation of all workflows was driven by the available infrastructure, e.g. MASSIVE and RDS services existed before the workflows
• Workflows are custom built and can’t be re-used
• Depend on external service provider

Next iteration:
• Decouple workflow and infrastructure
• Generic workflow software
• Microservice architecture

ASCI - Australian Synchrotron Computing Infrastructure

Realtime diffraction spot finding at MX2
• Uses newly developed workflow software
• Check quality of recorded data live

ASCI - Australian Synchrotron Computing Infrastructure

6 nodes, each with
• 48 CPUs
• 2 GPUs (NVIDIA GeForce GTX 1080)
• 512 GB RAM

2 PB (raw) of Ceph storage

[Architecture diagram labels: Analysis Session (x4); Infrastructure Service; Workflow Service; Routing + Security (firewall, nginx); HTML5 based VNC connection; automatic load balancing of docker containers; docker images; create instance; beamlines IMBL, XFM, SAXS/WAXS]

The future of data processing

• Streaming of data instead of writing (intermediate) files to disk
• Clever file formats (structure the data in an optimal way)
  So far: TIF, text, proprietary binary files
  Next: HDF5
• Distributed computing
• Common workflow system (graph based, distributed)
• Microservice architecture
  • split monolithic applications into independent services
  • allows for more flexibility and scalability
• Automated metadata capture and data curation / preservation
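A graph based workflow can be pictured as tasks with dependencies, where every task whose inputs are ready may run at the same time. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to group tasks into such batches. The task names are hypothetical, loosely modelled on the CT pipeline, and this is not the facility's workflow software.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "stitch": {"capture"},
    "quick_preview": {"capture"},
    "reconstruct": {"stitch"},
    "visualise": {"reconstruct"},
}


def parallel_batches(graph):
    """Return lists of tasks; tasks within one list have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()                      # also raises CycleError on circular graphs
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # tasks that could be dispatched now
        batches.append(ready)
        sorter.done(*ready)                  # mark them finished to unlock successors
    return batches


batches = parallel_batches(workflow)
for step, tasks in enumerate(batches, start=1):
    print(step, tasks)
```

In a distributed setting each batch entry would be handed to a separate worker or service instead of being marked done immediately.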


Summary


• Big Data requires clever storage, file formats and processing algorithms

• Bring the users to the data and not the data to the users

• The facility that provides users with the best computing environment will have a competitive edge

Send user home with information not with data

The Australian Synchrotron is an information pipeline

XFM - ideally suited to study bio-metals

Simultaneous access to 10+ elements; Z > 14 ~ Si

High sensitivity - sub-ppm; sub-mM; 1e-12g / s

Native contrast - no dyes or contrast agents necessary - but possible!

Quantitative

Non-destructive / minor damage

Extended penetration & DoF - study intact cells & sections

Sensitive to chemical speciation via XANES spectroscopy

[Figure: sensitivity (ppt, ppb, ppm) versus spatial resolution (0.1 μm to 100 μm) for XFM, LA-ICP-MS, LMD-LA-ICP-MS, SEM-EDX and PIXE]

EJ New, Dalton Trans (2013), 42(9), pp 3210

Data

Antony van der Ent, Hugh Harris, Martin de Jonge, Peter Erskine, Rachel Mak, Jolanta Mesjasz-Przybylowicz, Wojciech Przybylowicz, Emmanuelle Montargès-Pelletier, Alban Barnabas, Guillaume Echevarria, David Paterson and Daryl Howard University of Adelaide Australian Synchrotron

The Maia Detector

1. Form a spot on a specimen
2. Collect fluorescence + scatter in 384 detector pixels and stage position signals while scanning sample

XFM @ AS: ~2 µm FWHM ~1e10 ph / s

Sample position

Fitted spectrum

(integrated) Fluorescence spectrum

Naïve Data Storage

1 Gpix image = 1 GB (pixels in image) x 2048 (spectral channels) x 384 (detector pixels) = 786 TB for one image!
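The 786 TB figure is the size of a dense array holding one count for every combination of image pixel, spectral channel and detector pixel. The calculation below reproduces it; the one byte per count is an assumption implied by the slide's total rather than something it states.

```python
IMAGE_PIXELS = 10**9       # 1 gigapixel image
SPECTRAL_CHANNELS = 2048
DETECTOR_PIXELS = 384
BYTES_PER_COUNT = 1        # ASSUMPTION: implied by the 786 TB total

# Dense storage: one value per (image pixel, channel, detector pixel).
dense_bytes = IMAGE_PIXELS * SPECTRAL_CHANNELS * DETECTOR_PIXELS * BYTES_PER_COUNT
dense_tb = dense_bytes / 10**12

print(f"{dense_tb:.0f} TB for one image")
```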

SrFeRb

Event Mode Data Storage

[Plot: fitted spectrum, counts (log scale) versus energy, 0 to 20 keV]

After “training”, elemental maps are determined by performing a fit of the elemental & scatter intensities in each low-statistical single-pixel spectrum

THIS FIT CAN BE LINEAR (but often isn’t)

Many empty channels suggest event mode data storage
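When most channels of a single-pixel spectrum are empty, storing one record per detected photon is far more compact than storing every channel. The sketch below shows the idea on a toy spectrum; it is a generic illustration, not the Maia detector's actual event format, and the channel numbers are made up.

```python
from collections import Counter


def spectrum_to_events(spectrum):
    """List the channel index of every detected photon (one entry per count)."""
    return [channel for channel, count in enumerate(spectrum) for _ in range(count)]


def events_to_spectrum(events, n_channels):
    """Rebuild the dense per-channel counts from an event list."""
    counts = Counter(events)
    return [counts.get(channel, 0) for channel in range(n_channels)]


# A low-statistics spectrum: 2048 channels, only three photons detected.
spectrum = [0] * 2048
spectrum[640] = 2
spectrum[1410] = 1

events = spectrum_to_events(spectrum)
print(events)                       # [640, 640, 1410]
print(len(spectrum), "dense values versus", len(events), "events")
```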

How many events are there?

AS brilliance: 10^19 ph/s/mrad^2/mm^2/0.1% bw
~10^15 ph/s 0.1% bw at AS front end

Event Mode Data Storage

Photons/s at each stage
Storage Ring     10^15
Beamline/Mono    10^10
Sample           10^7
Detector         10^6

1 MB/s 86 GB/day

1 TB/day for all AS
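The daily volume follows directly from the event rate at the detector. The sketch below assumes about one byte stored per detected photon, which is what the slide's 10^6 photons/s and 1 MB/s imply together; the real event size is not given.

```python
DETECTED_PHOTONS_PER_S = 10**6
BYTES_PER_EVENT = 1          # ASSUMPTION: implied by 10^6 photons/s giving 1 MB/s
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = DETECTED_PHOTONS_PER_S * BYTES_PER_EVENT
rate_mb_per_s = bytes_per_second / 10**6
volume_gb_per_day = bytes_per_second * SECONDS_PER_DAY / 10**9

print(f"{rate_mb_per_s:.0f} MB/s, {volume_gb_per_day:.1f} GB/day")
```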

What next?

1 TB/day for all AS

New beamlines & new detection systems: 10 TB/day for all AS

Future: 10 EB/day for all AS??

XFM is being used to study the sub-micron metal distribution in grains such as wheat, barley and rice.

B. K. R. Trijatmiko, et al. Scientific Reports, 6, 19792 (2016).

Big Data = Supercharging food

First International field trials in Philippines & Colombia

              Iron    Zinc
Natural       2       16
Target        13      28
This Study    15      45
(µg g^-1 of rice)

More than two billion people are micronutrient deficient

Wild Type Johnson Strain

B. Kyriacou, et al., J. Cereal Science, 59, 173 (2014).

Big Data = Benefits to industry

1 %

11.6 %

• Through research programs
• > 200 companies interacting with University and research institutions
• Access to researchers
• Access to Grant funding
• Access to facilities
• Internal Beamline-Industry Group

Big Data = Real-life benefits

De-clogging Ink-jet printer heads for MemJet

Materials for improved solar cell efficiency

Gold in Gum Leaves

Facilitating approval of generic oncology medication for Hospira

Testing safety of zinc nanoparticles in sunscreen

Venetoclax approved by FDA to combat chronic lymphocytic leukemia

Strengthening sheep leather

Over 1,284 protein structures solved

Cultural Heritage – finding hidden artworks

Iron enriched rice variants

Over 2,800 peer reviewed papers

Over 620 student theses

Zeobond green cement

Stainless magnesium

Making the Big Data challenge even bigger

New beamlines


BioSAXS

MX3

Micro materials characterisation

Advanced diffraction and scattering

Medium Energy XAS

Micro-CT

X-ray fluorescence nanoprobe

New beam lines = Meet demand, fill gaps

Geosciences Health / Medical Advanced materials

High energy

3D Imaging

High throughput protein structure

Small crystal capacity

Residual stress analysis

Combined spectroscopy, diffraction and imaging

New beam lines = More real-life benefits

• Geosciences: Better use of resources
• Health / Medical: Better drugs
• Advanced materials: Better materials

Questions at end
