Big Scientific Data and Data Science
Professor Tony Hey
Chief Data Scientist
Rutherford Appleton Laboratory, STFC
Thousand years ago – Experimental Science
• Description of natural phenomena
Last few hundred years – Theoretical Science
• Newton’s Laws, Maxwell’s Equations…
Last few decades – Computational Science
• Simulation of complex phenomena
Today – Data-Intensive Science
• Scientists overwhelmed with data sets from many different sources
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks
e-Science and the Fourth Paradigm
eScience is the set of tools and technologies to support data federation and collaboration:
• For analysis and data mining
• For data visualization and exploration
• For scholarly communication and dissemination
With thanks to Jim Gray
Examples of Data-Intensive Science
Cosmic Dawn(First Stars and Galaxies)
Galaxy Evolution(Normal Galaxies z~2-3)
Cosmology(Dark Energy, Large Scale Structure)
Cosmic Magnetism(Origin, Evolution)
Cradle of Life(Planets, Molecules, SETI)
Testing General Relativity(Strong Regime, Gravitational Waves)
Exploration of the Unknown
Extremely broad range of science!
Data Flow through the SKA
[Diagram: data flow through SKA1-LOW and SKA1-MID · ~2 Pb/s · 8.8 Tb/s · 7.2 Tb/s · ~50 PFLOPS · ~5 Tb/s · 100 PFLOPS · 130–300 PB/yr of science data products to users]
Large data sets: satellite observations
Some Machine Learning Methods
Neural networks
K-means clustering
Principal Component Analysis
Boltzmann machines
Support Vector Machines
Hidden Markov Models
Kalman filters
Decision trees
Bayesian networks
Radial basis functions
Linear regression
Markov random fields
Random forests
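To make one of these methods concrete, here is a minimal pure-Python sketch of k-means clustering (Lloyd's algorithm) on 2-D points; a real analysis would use a library implementation such as scikit-learn's KMeans.

```python
# Toy k-means clustering: illustrative sketch only, not production code.
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 2-D points: assign each point, then re-centre."""
    random.seed(seed)
    centres = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centres[c][0]) ** 2
                                + (p[1] - centres[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centre to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centres[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centres

# Two well-separated blobs: the centres converge near (1/3, 1/3) and (31/3, 31/3)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres = sorted(kmeans(pts, 2))
print(centres)
```

Because the blobs are well separated, the algorithm recovers the two cluster means regardless of which points are drawn as the initial centres.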
The Machine Learning Revolution
• Neural networks are just one example of a Machine Learning (ML) algorithm
• Deep Neural Networks are now exciting the whole of the IT industry since they enable us to:
• Build computing systems that improve with experience
• Solve extremely hard problems
• Extract more value from Big Data
• Approach human intelligence, e.g. natural language processing
• The change in the Word Error Rate (WER) with time for the NIST “Switchboard” data.
• In 2016 Microsoft researchers achieved a word error rate (WER) of 6.3 percent, the lowest in the industry.
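WER itself is straightforward to compute: it is the word-level edit distance (substitutions + insertions + deletions) between the recognised hypothesis and the reference transcript, divided by the reference length. A small sketch:

```python
# Word Error Rate via a standard edit-distance dynamic programme.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words and first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution in six words
```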
HPDA Architectures for High Performance Data Analytics
• Distributed SQL Server cluster/cloud
• 50 servers, 1.1 PB disk, 500 CPUs
• Connected with 20 Gbit/s InfiniBand
• Linked to 1500-core compute cluster
• Extremely high-speed sequential I/O (75 GB/s)
• Balanced: Amdahl number >0.5
• Dedicated to eScience, provide public access through services
• Funded by Moore Foundation, Microsoft and Pan-STARRS
• Winner of SC08 Storage Challenge!
An Example: The JASMIN Environmental Science Super-Data Cluster
http://ndg.nerc.ac.uk
British Atmospheric Data Centre
British Oceanographic Data Centre
Simulations
Assimilation
The e-Science NERC DataGrid Project +
Centre for Environmental Data Analytics
JASMIN Super-Data Cluster infrastructure
The UK Met Office UPSCALE campaign
5 TB per day
Data conversion & compression
2.5 TB
JASMIN data transfer
HERMIT @ HLRS
Automation controller
Clear data from HPC once successfully transferred and data validated
Example Data Analysis
• Tropical cyclone tracking has become routine; 50 years of N512 data can be processed in 50 jobs in one day
• Eddy vectors; analysis we would not attempt on a server/workstation (total of 3 months of processor time and ~40 GB memory needed) completed in 24 hours in 1,600 batch jobs
• JASMIN HPDA architecture has clearly demonstrated the value of cluster computing to data processing and analysis.
M Roberts et al: Journal of Climate 28 (2), 574-596
Big Scientific Data from Large Experimental Facilities
UK Science and Technology Facilities Council (STFC)
Daresbury Laboratory, Sci-Tech Daresbury Campus, Warrington, Cheshire
Big Data and Cognitive Computing: Hartree Centre collaboration with IBM Research
Central Laser Facility
ISIS (Spallation Neutron Source)
Diamond Light Source
LHC Tier 1 computing
JASMIN Super-Data-Cluster
Rutherford Appleton Laboratory
Diamond Light Source
Science Examples
• Pharmaceutical manufacture & processing
• Casting aluminium
• Structure of the Histamine H1 receptor
• Non-destructive imaging of fossils
Detector data rates increasing faster than Moore’s Law
Data Rates at Diamond
[Chart: detector performance in MB/s, log scale from 1 to 10,000, 2007 to 2012]
Thanks to Mark Heron
Cumulative Amount of Data Generated at Diamond
[Chart: data size in PB, growing from 0 to ~6 PB]
Cryo-SXT Data
Challenges:
● Noisy data, missing wedge artifacts, missing boundaries
● Tens to hundreds of organelles per dataset
● Tedious to manually annotate
● Cell types can look different
● Few previous annotations available
● Automated techniques usually fail
Segmentation
Neuronal-like mammalian cell line; single slice
Nucleus
Cytoplasm
Data
● B24: Cryo Transmission X-ray Microscopy beamline at DLS
● Data Collection: Tilt series from ±65° with 0.5° step size
● Reconstructed volumes up to 1000x1000x600 voxels
● Voxel resolution: ~40 nm currently
● Total depth: up to 10 μm
● GOAL: Study structure and morphological changes of whole cells
3D Volume Data
Segmentation of Cryo-Soft X-ray Tomography (Cryo-SXT) data
Computer VisionLaboratory
B24 beamlineData Analysis Software Group
Nucleus
Workflow
Data Preprocessing
Data Representation
Feature Extraction
User’s Manual Segmentations
Classification
Tomographic Cell Analysis: Feature Extraction
Features are extracted from voxels to represent their appearance:
● Intensity-based filters (Gaussian Convolutions)
● Textural filters (eigenvalues of Hessian and Structure Tensor)
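As a rough illustration of the intensity-based filters, the sketch below applies a separable Gaussian convolution to a toy 2-D slice in pure Python. Production pipelines filter full 3-D volumes with optimised libraries and add the Hessian and structure-tensor eigenvalue features listed above.

```python
# Gaussian-smoothed intensity feature on a toy 2-D "slice" (illustrative only).
import math

def gaussian_kernel(sigma, radius):
    k = [math.exp(-0.5 * (x / sigma) ** 2) for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth2d(img, sigma=1.0):
    """Separable Gaussian blur: filter rows, then columns (edges clamped)."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel(sigma, radius)
    def conv_rows(a):
        out = []
        for row in a:
            n = len(row)
            out.append([sum(k[t + radius] * row[min(max(j + t, 0), n - 1)]
                            for t in range(-radius, radius + 1))
                        for j in range(n)])
        return out
    transpose = lambda a: [list(c) for c in zip(*a)]
    return transpose(conv_rows(transpose(conv_rows(img))))

# Toy slice: a single bright voxel in a 5x5 image spreads into a smooth blob
img = [[0.0] * 5 for _ in range(5)]
img[2][2] = 1.0
f = smooth2d(img, sigma=1.0)
print(round(f[2][2], 3))
```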
User Annotation + Machine Learning
Refinement
User Annotations
Predictions Refinement
Using a few user annotations as input:
● A machine learning classifier (Random Forest) is trained to discriminate between Nucleus and Cytoplasm and predict the class of each SuperVoxel
● A Markov Random Field is then used to refine the predictions
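The flavour of the refinement step can be sketched as a smoothing pass over a label grid, where each cell adopts the majority label of its neighbourhood; this toy version keeps only a Potts-style smoothness preference and omits the data term a real MRF optimisation would balance against it.

```python
# Toy neighbourhood-majority smoothing of classifier labels (illustrative
# stand-in for MRF refinement; not the actual pipeline code).
from collections import Counter

def smooth_labels(grid):
    """One pass: each cell takes the majority label of itself + 4-neighbours."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(h):
        for j in range(w):
            votes = [grid[i][j]]
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    votes.append(grid[ni][nj])
            out[i][j] = Counter(votes).most_common(1)[0][0]
    return out

# A lone "Cytoplasm" prediction inside a Nucleus region gets flipped
labels = [["N", "N", "N"],
          ["N", "C", "N"],
          ["N", "N", "N"]]
print(smooth_labels(labels))
```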
The Cryo-Electron Microscopy Revolution
• CMOS sensors have revolutionised TEM
• FALCON 1 generates 300 Mbps
• FALCON 2 will generate 180 Gbps
• Exciting scientific results on membrane protein structures already published and much more to come
With thanks to Nicola Guerrini
➢ Faster and less noisy sensors for better performance are the way forward
➢ Systems becoming widely available and will generate huge datasets
ISIS
Peak Assignment in Inelastic Neutron Scattering
• Vibrational motion of atoms is crucial for many properties of a material – e.g., how well it conducts electricity or heat
• Peaks in the INS spectrum correspond to specific atomic vibrations
• Peak assignment: what specific vibrational motions of atoms give rise to specific peaks?
INS Spectrum of crystalline benzene
S. Parker and S. Mukhopadhyay (ISIS)
Modelling & Simulation for INS Peak Assignment
Calculated INS Spectrum of crystalline benzene
• INS spectra can be computed for a given atomic structure
• Calculations allow us to see what specific vibrational motions of atoms occur, and at what frequency
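The starting point, locating candidate peaks in a measured 1-D spectrum, can be sketched as a simple local-maximum search; real INS analysis then matches these candidates against the calculated spectra described above.

```python
# Toy peak detection: local maxima above a noise threshold (illustrative only).
def find_peaks(y, threshold):
    """Indices i where y[i] exceeds the threshold and both neighbours."""
    return [i for i in range(1, len(y) - 1)
            if y[i] > threshold and y[i] > y[i - 1] and y[i] > y[i + 1]]

# Made-up spectrum: two clear peaks over a low background
spectrum = [0.1, 0.2, 1.5, 0.3, 0.2, 0.9, 2.2, 0.8, 0.1]
print(find_peaks(spectrum, threshold=0.5))  # → [2, 6]
```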
L. Liborio
Materials Workbench
K. Dymkowski
The Central Laser Facility
• National imaging facility with peer-reviewed, funded access
• Located in Research Complex at Harwell
• Cluster of microscopes and lasers and expert end-to-end multidisciplinary support
• Operations and some development funded by STFC
• Key developments funded through external grants – BBSRC, MRC
OCTOPUS Facility in the CLF
With thanks to Dan Rolfe
Example: EGFR cell signalling in cancer
• Has driven OCTOPUS single molecule developments
• User in plant cell imaging now catching up in scale of challenge
• Part of a PhD project:
• 1 experimental technique
• 50 experimental conditions
• 30 datasets for each condition
• 1000 single molecule tracks for each condition
• Multiple properties & events of interest in each track
• Comparison of just one property…
Multidimensional single molecule tracking
• Automated registration & tracking in multiple channels
• Computer vision
• Bayesian feature detection adapted from astronomical galaxy detection
• Instrumental metadata from acquisition
• Flexible specification of many instrument configurations
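A minimal sketch of the linking stage of tracking: greedily join each detected spot to its nearest unclaimed spot in the next frame within a cut-off radius. The actual OCTOPUS pipeline uses Bayesian detection and more robust linking, so this is illustrative only.

```python
# Toy frame-to-frame spot linking by nearest neighbour within a cut-off.
import math

def link(frame_a, frame_b, max_dist=2.0):
    """Return (i, j) pairs linking spots in frame_a to spots in frame_b."""
    links, taken = [], set()
    for i, p in enumerate(frame_a):
        best, best_d = None, max_dist
        for j, q in enumerate(frame_b):
            if j in taken:
                continue
            d = math.dist(p, q)
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            links.append((i, best))
            taken.add(best)   # each spot in the next frame used at most once
    return links

a = [(0.0, 0.0), (5.0, 5.0)]
b = [(5.2, 5.1), (0.1, 0.0), (9.0, 9.0)]
print(link(a, b))  # → [(0, 1), (1, 0)]; the spot at (9, 9) starts a new track
```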
Rolfe et al., Eur Biophys J, 2011
Big Scientific Data Benchmarks
How can Academia compete with Industry on Machine Learning and AI?
Companies like Facebook, Google, Amazon, Baidu and Microsoft have three key advantages over academia:
1. These companies all have many, very large, private datasets that they will never make publicly available
2. Each of these companies employs many hundreds of computer scientists with PhDs in Machine Learning and AI
3. Their researchers and developers have essentially unlimited computing power at their disposal
➢ ImageNet example for computer vision community
• ImageNet is an image dataset organized according to the WordNet hierarchy. There are more than 100,000 WordNet concepts.
• ImageNet provides 1000 images of each concept that are quality-controlled and human-annotated.
• In competitions, ImageNet offers tens of millions of sorted images for concepts in the WordNet hierarchy.
➢ The ImageNet dataset has proved very useful for advancing research in computer vision
A Particle Physics Example: Dataset for Machine Learning
The Higgs Challenge
Machine Learning winners of the Higgs Challenge
• Winner Gábor Melis, a graduate in software engineering and mathematics, developed an algorithm that is an ensemble of deep neural networks trained on random subsets of the data provided, with very little feature engineering and no physics knowledge
• Runner-up Tim Salimans, who has a PhD in Econometrics and works as a data science consultant, developed a solution he describes as a combination of a large number of boosted decision tree ensembles
• A Special High Energy Physics meets Machine Learning Award was presented to Tianqi Chen and Tong He of Team Crowwork. Their XGBoost algorithm was an excellent compromise between performance and simplicity, which could improve tools currently used in high-energy physics.
Winners of the Higgs Machine Learning Challenge: Gábor Melis and Tim Salimans (top row), Tianqi Chen and Tong He (bottom row).
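The common thread in the winning entries is ensembling: averaging the predictions of several independently trained models before thresholding. A toy sketch with made-up model outputs (the function and scores below are hypothetical, not from the Challenge):

```python
# Toy model ensembling: average per-event signal probabilities across models.
def ensemble_predict(prob_lists, threshold=0.5):
    """Average each event's probability over all models, then classify."""
    n_models = len(prob_lists)
    avg = [sum(ps) / n_models for ps in zip(*prob_lists)]
    return ["signal" if p > threshold else "background" for p in avg]

# Three hypothetical models scoring four events; averaging smooths out
# individual models' borderline mistakes (e.g. event 2's 0.6 score)
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.7],
    [0.7, 0.3, 0.3, 0.8],
]
print(ensemble_predict(model_probs))
```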
The STFC Big Scientific Data Benchmarks?
Use open Scientific Datasets for ‘Experimental’ Data Science
• The idea is to create scientific datasets that are sufficiently large and complex to provide a realistic testing ground for ML algorithms.
• These open datasets can form the basis for training academics and industry to understand which algorithm and hardware execution platform is best for finding different features in the data.
➢ Use experimental data from STFC Large Scale Facilities to create a set of scientific ‘benchmark’ datasets
➢Complement the computational benchmarks from the Hartree Centre
Experimental Data Science Datasets
• Astronomy datasets from LSST, SKA
• Particle Physics LHC datasets from ATLAS, CMS
• Large Scale Facilities datasets – DLS, ISIS, CLF and Hartree
• Environmental datasets from JASMIN data
• Fusion datasets from Culham
➢ The creation of such curated datasets will allow experimentation and training in Machine Learning technologies executed on different hardware architectures
➢Use these datasets as basis for training courses in Data Science for both academia and industry
A Fusion Big Scientific Data Benchmark?
• Filamentary plasma structures play an important role in turbulent particle transport
• Archive of 400 GB of video data from the MAST Tokamak at Culham
• Developing a synthetic training set of simulated filaments with known properties
• Promising exploration of the applicability of Machine Learning techniques
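The idea of a synthetic training set can be sketched as follows: render bright line segments with known parameters onto blank frames and keep those parameters as ground-truth labels. The geometry and noise models of the real MAST synthetic data are of course far richer; everything below is a made-up toy.

```python
# Toy synthetic "filament" frames with known ground-truth parameters.
import random

def synth_frame(size=32, n_filaments=3, seed=1):
    """Draw n_filaments tilted bright lines; return (frame, labels)."""
    random.seed(seed)
    frame = [[0.0] * size for _ in range(size)]
    labels = []
    for _ in range(n_filaments):
        col = random.randrange(size)         # filament position at the top row
        slope = random.uniform(-0.3, 0.3)    # slight tilt down the frame
        labels.append((col, slope))          # ground truth for training/evaluation
        for row in range(size):
            c = int(col + slope * row)
            if 0 <= c < size:
                frame[row][c] = 1.0
    return frame, labels

frame, labels = synth_frame()
print(len(labels), sum(map(sum, frame)))
```

A detector trained on such frames can be scored exactly, because the true filament parameters are recorded alongside each image.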
Proof of Concept: An Initial Set of Benchmarks
1. Particle Physics – LHC datasets (Particle tracking, Supersymmetric particles)
2. Astronomy – LSST, SKA simulated datasets (STFC CDTs in Data Intensive Science)
3. Diamond – Cryo-SXT dataset (e.g. Mark Basham, DLS)
4. Diamond - Cryo-EM dataset (e.g. Dave Stuart, DLS)
5. CLF – Octopus, single molecule tracking dataset (e.g. Dan Rolfe, CLF)
6. ISIS – Peak detection with noisy datasets (e.g. Anders Markvardsen, ISIS)
7. Environment – Extreme weather events, air quality satellite data (JASMIN/CEDA, RAL Space)
8. Fusion – MAST Filament Video Dataset (Rob Akers, Culham)
The Ada Lovelace Center
The Data Analysis Gap
• Complex data
  • Too big to move in some cases
  • High CPU / memory requirements
  • May need to combine data from different sources
• Complex software environments
  • Variation in users’ knowledge of HPC
  • Variation in home computing environments
  • Variation in the availability of analysis and modelling software
• Diverse science communities supported by the Facilities
  • Different analysis software requirements
➢ Users’ access to usable computing for experimental science data is a barrier to science
With thanks to Brian Matthews, SCD
The ALC - Towards a “Super-facility”?
“A network of connected facilities, software and expertiseto enable new modes of discovery”
Katie Antypas, Inder Monga, Lawrence Berkeley National Laboratory
Infrastructure + Software + Expertise
With Common Interfaces and Transparent Access
[Diagram: Data Catalogue · Petabyte Data Storage · Parallel File System · HPC (CPU+GPU) · Visualisation · Software · Data Acquisition]