Big Scientific Data and Data Science
Professor Tony Hey
Chief Data Scientist
Rutherford Appleton Laboratory, STFC
Thousand years ago – Experimental Science
• Description of natural phenomena
Last few hundred years – Theoretical Science
• Newton’s Laws, Maxwell’s Equations…
Last few decades – Computational Science
• Simulation of complex phenomena
Today – Data-Intensive Science
• Scientists overwhelmed with data sets from many different sources
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks
e-Science and the Fourth Paradigm
eScience is the set of tools and technologies to support data federation and collaboration:
• For analysis and data mining
• For data visualization and exploration
• For scholarly communication and dissemination
With thanks to Jim Gray
Examples of Data-Intensive Science
Cosmic Dawn(First Stars and Galaxies)
Galaxy Evolution(Normal Galaxies z~2-3)
Cosmology(Dark Energy, Large Scale Structure)
Cosmic Magnetism(Origin, Evolution)
Cradle of Life(Planets, Molecules, SETI)
Testing General Relativity(Strong Regime, Gravitational Waves)
Exploration of the Unknown
Extremely broad range of science!
Data Flow through the SKA
[Diagram: data flow through SKA1-LOW and SKA1-MID · ~2 Pb/s · 8.8 Tb/s · 7.2 Tb/s · ~50 PFLOPS · ~5 Tb/s · 100 PFLOPS · 130–300 PB/yr of science data products to users]
Large data sets: satellite observations
Some Machine Learning Methods
Neural networks
K-means clustering
Principal Component Analysis
Boltzmann machines
Support Vector Machines
Hidden Markov Models
Kalman filters
Decision trees
Bayesian networks
Radial basis functions
Linear regression
Markov random fields
Random forests
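To make one of these methods concrete, here is a minimal pure-Python sketch of k-means clustering (Lloyd's algorithm) on 2-D points; a real analysis would use a library implementation such as scikit-learn's KMeans.

```python
# Toy k-means clustering: illustrative sketch only, not production code.
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 2-D points: assign each point, then re-centre."""
    random.seed(seed)
    centres = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centres[c][0]) ** 2
                                + (p[1] - centres[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centre to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centres[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centres

# Two well-separated blobs: the centres converge near (1/3, 1/3) and (31/3, 31/3)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres = sorted(kmeans(pts, 2))
print(centres)
```

Because the blobs are well separated, the algorithm recovers the two cluster means regardless of which points are drawn as the initial centres.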
The Machine Learning Revolution
• Neural networks are just one example of a Machine Learning (ML) algorithm
• Deep Neural Networks are now exciting the whole of the IT industry since they enable us to:
• Build computing systems that improve with experience
• Solve extremely hard problems
• Extract more value from Big Data
• Approach human intelligence, e.g. natural language processing
• The change in the Word Error Rate (WER) with time for the NIST “Switchboard” data.
• In 2016 Microsoft researchers achieved a word error rate (WER) of 6.3 percent, the lowest in the industry.
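WER itself is straightforward to compute: it is the word-level edit distance (substitutions + insertions + deletions) between the recognised hypothesis and the reference transcript, divided by the reference length. A small sketch:

```python
# Word Error Rate via a standard edit-distance dynamic programme.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words and first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution in six words
```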
HPDA Architectures for High Performance Data Analytics
• Distributed SQL Server cluster/cloud
• 50 servers, 1.1 PB disk, 500 CPUs
• Connected with 20 Gbit/s InfiniBand
• Linked to 1500-core compute cluster
• Extremely high-speed sequential I/O (75 GB/s)
• Balanced: Amdahl number >0.5
• Dedicated to eScience, provide public access through services
• Funded by Moore Foundation, Microsoft and Pan-STARRS
• Winner of SC08 Storage Challenge!
An Example: The JASMIN Environmental Science Super-Data Cluster
http://ndg.nerc.ac.uk
British Atmospheric Data Centre
British Oceanographic Data Centre
Simulations
Assimilation
The e-Science NERC DataGrid Project +
Centre for Environmental Data Analytics
JASMIN Super-Data Cluster infrastructure
The UK Met Office UPSCALE campaign
5 TB per day
Data conversion & compression
2.5 TB
JASMIN data transfer
HERMIT @ HLRS
Automation controller
Clear data from HPC once successfully transferred and data validated
Example Data Analysis
• Tropical cyclone tracking has become routine; 50 years of N512 data can be processed in 50 jobs in one day
• Eddy vectors; analysis we would not attempt on a server/workstation (total of 3 months of processor time and ~40 GB memory needed) completed in 24 hours in 1,600 batch jobs
• JASMIN HPDA architecture has clearly demonstrated the value of cluster computing to data processing and analysis.
M Roberts et al: Journal of Climate 28 (2), 574-596
Big Scientific Data from Large Experimental Facilities
UK Science and Technology Facilities Council (STFC)
Daresbury Laboratory, Sci-Tech Daresbury Campus, Warrington, Cheshire
Big Data and Cognitive Computing: Hartree Centre collaboration with IBM Research
Central Laser Facility
ISIS (Spallation Neutron Source)
Diamond Light Source
LHC Tier 1 computing
JASMIN Super-Data-Cluster
Rutherford Appleton Laboratory
Diamond Light Source
Science Examples
• Pharmaceutical manufacture & processing
• Casting aluminium
• Structure of the Histamine H1 receptor
• Non-destructive imaging of fossils
Detector data rates increasing faster than Moore’s Law
Data Rates at Diamond
[Chart: detector performance in MB/s, log scale from 1 to 10,000, 2007 to 2012]
Thanks to Mark Heron
Cumulative Amount of Data Generated at Diamond
[Chart: data size in PB, growing from 0 to ~6 PB]
Cryo-SXT Data
Challenges:
● Noisy data, missing wedge artifacts, missing boundaries
● Tens to hundreds of organelles per dataset
● Tedious to manually annotate
● Cell types can look different
● Few previous annotations available
● Automated techniques usually fail
Segmentation
Neuronal-like mammalian cell line; single slice
Nucleus
Cytoplasm
Data
● B24: Cryo Transmission X-ray Microscopy beamline at DLS
● Data Collection: Tilt series from ±65° with 0.5° step size
● Reconstructed volumes up to 1000x1000x600 voxels
● Voxel resolution: ~40 nm currently
● Total depth: up to 10 μm
● GOAL: Study structure and morphological changes of whole cells
3D Volume Data
Segmentation of Cryo-Soft X-ray Tomography (Cryo-SXT) data
Computer VisionLaboratory
B24 beamlineData Analysis Software Group
Nucleus
Workflow
Data Preprocessing
Data Representation
Feature Extraction
User’s Manual Segmentations
Classification
Tomographic Cell Analysis: Feature Extraction
Features are extracted from voxels to represent their appearance:
● Intensity-based filters (Gaussian Convolutions)
● Textural filters (eigenvalues of Hessian and Structure Tensor)
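As a rough illustration of the intensity-based filters, the sketch below applies a separable Gaussian convolution to a toy 2-D slice in pure Python. Production pipelines filter full 3-D volumes with optimised libraries and add the Hessian and structure-tensor eigenvalue features listed above.

```python
# Gaussian-smoothed intensity feature on a toy 2-D "slice" (illustrative only).
import math

def gaussian_kernel(sigma, radius):
    k = [math.exp(-0.5 * (x / sigma) ** 2) for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth2d(img, sigma=1.0):
    """Separable Gaussian blur: filter rows, then columns (edges clamped)."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel(sigma, radius)
    def conv_rows(a):
        out = []
        for row in a:
            n = len(row)
            out.append([sum(k[t + radius] * row[min(max(j + t, 0), n - 1)]
                            for t in range(-radius, radius + 1))
                        for j in range(n)])
        return out
    transpose = lambda a: [list(c) for c in zip(*a)]
    return transpose(conv_rows(transpose(conv_rows(img))))

# Toy slice: a single bright voxel in a 5x5 image spreads into a smooth blob
img = [[0.0] * 5 for _ in range(5)]
img[2][2] = 1.0
f = smooth2d(img, sigma=1.0)
print(round(f[2][2], 3))
```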
User Annotation + Machine Learning
Refinement
User Annotations
Predictions Refinement
Using a few user annotations as input:
● A machine learning classifier (Random Forest) is trained to discriminate between Nucleus and Cytoplasm and predict the class of each SuperVoxel
● A Markov Random Field is then used to refine the predictions
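The flavour of the refinement step can be sketched as a smoothing pass over a label grid, where each cell adopts the majority label of its neighbourhood; this toy version keeps only a Potts-style smoothness preference and omits the data term a real MRF optimisation would balance against it.

```python
# Toy neighbourhood-majority smoothing of classifier labels (illustrative
# stand-in for MRF refinement; not the actual pipeline code).
from collections import Counter

def smooth_labels(grid):
    """One pass: each cell takes the majority label of itself + 4-neighbours."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(h):
        for j in range(w):
            votes = [grid[i][j]]
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    votes.append(grid[ni][nj])
            out[i][j] = Counter(votes).most_common(1)[0][0]
    return out

# A lone "Cytoplasm" prediction inside a Nucleus region gets flipped
labels = [["N", "N", "N"],
          ["N", "C", "N"],
          ["N", "N", "N"]]
print(smooth_labels(labels))
```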
The Cryo-Electron Microscopy Revolution
• CMOS sensors have revolutionised TEM
• FALCON 1 generates 300 Mbps
• FALCON 2 will generate 180 Gbps
• Exciting scientific results on membrane protein structures already published and much more to come
With thanks to Nicola Guerrini
➢ Faster and less noisy sensors for better performance are the way forward
➢ Systems becoming widely available and will generate huge datasets
ISIS
Peak Assignment in Inelastic Neutron Scattering
• Vibrational motion of atoms is crucial for many properties of a material – e.g., how well it conducts electricity or heat
• Peaks in the INS spectrum correspond to specific atomic vibrations
• Peak assignment: what specific vibrational motions of atoms give rise to specific peaks?
INS Spectrum of crystalline benzene
S. Parker and S. Mukhopadhyay (ISIS)
Modelling & Simulation for INS Peak Assignment
Calculated INS Spectrum of crystalline benzene
• INS spectra can be computed for a given atomic structure
• Calculations allow us to see what specific vibrational motions of atoms occur, and at what frequency
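The starting point, locating candidate peaks in a measured 1-D spectrum, can be sketched as a simple local-maximum search; real INS analysis then matches these candidates against the calculated spectra described above.

```python
# Toy peak detection: local maxima above a noise threshold (illustrative only).
def find_peaks(y, threshold):
    """Indices i where y[i] exceeds the threshold and both neighbours."""
    return [i for i in range(1, len(y) - 1)
            if y[i] > threshold and y[i] > y[i - 1] and y[i] > y[i + 1]]

# Made-up spectrum: two clear peaks over a low background
spectrum = [0.1, 0.2, 1.5, 0.3, 0.2, 0.9, 2.2, 0.8, 0.1]
print(find_peaks(spectrum, threshold=0.5))  # → [2, 6]
```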
L. Liborio
Materials Workbench
K. Dymkowski
The Central Laser Facility
• National imaging facility with peer-reviewed, funded access
• Located in Research Complex at Harwell
• Cluster of microscopes and lasers and expert end-to-end multidisciplinary support
• Operations and some development funded by STFC
• Key developments funded through external grants – BBSRC, MRC
OCTOPUS Facility in the CLF
With thanks to Dan Rolfe
Example: EGFR cell signalling in cancer
• Has driven OCTOPUS single molecule developments
• User in plant cell imaging now catching up in scale of challenge
• Part of a PhD project:
• 1 experimental technique
• 50 experimental conditions
• 30 datasets for each condition
• 1000 single molecule tracks for each condition
• Multiple properties & events of interest in each track
• Comparison of just one property…
Multidimensional single molecule tracking
• Automated registration & tracking in multiple channels
• Computer vision
• Bayesian feature detection adapted from astronomical galaxy detection
• Instrumental metadata from acquisition
• Flexible specification of many instrument configurations
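A minimal sketch of the linking stage of tracking: greedily join each detected spot to its nearest unclaimed spot in the next frame within a cut-off radius. The actual OCTOPUS pipeline uses Bayesian detection and more robust linking, so this is illustrative only.

```python
# Toy frame-to-frame spot linking by nearest neighbour within a cut-off.
import math

def link(frame_a, frame_b, max_dist=2.0):
    """Return (i, j) pairs linking spots in frame_a to spots in frame_b."""
    links, taken = [], set()
    for i, p in enumerate(frame_a):
        best, best_d = None, max_dist
        for j, q in enumerate(frame_b):
            if j in taken:
                continue
            d = math.dist(p, q)
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            links.append((i, best))
            taken.add(best)   # each spot in the next frame used at most once
    return links

a = [(0.0, 0.0), (5.0, 5.0)]
b = [(5.2, 5.1), (0.1, 0.0), (9.0, 9.0)]
print(link(a, b))  # → [(0, 1), (1, 0)]; the spot at (9, 9) starts a new track
```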
Rolfe et al., Eur Biophys J, 2011
Big Scientific Data Benchmarks
How can Academia compete with Industry on Machine Learning and AI?
Companies like Facebook, Google, Amazon, Baidu and Microsoft have three key advantages over academia:
1. These companies all have many, very large, private datasets that they will never make publicly available
2. Each of these companies employs many hundreds of computer scientists with PhDs in Machine Learning and AI
3. Their researchers and developers have essentially unlimited computing power at their disposal
➢ ImageNet example for computer vision community
• ImageNet is an image dataset organized according to the WordNet hierarchy. There are more than 100,000 WordNet concepts.
• ImageNet provides 1000 images of each concept that are quality-controlled and human-annotated.
• In competitions, ImageNet offers tens of millions of sorted images for concepts in the WordNet hierarchy.
➢ The ImageNet dataset has proved very useful for advancing research in computer vision
A Particle Physics Example: Dataset for Machine Learning
The Higgs Challenge
Machine Learning winners of the Higgs Challenge
• Winner Gábor Melis, a graduate in software engineering and mathematics, developed an algorithm that is an ensemble of deep neural networks trained on random subsets of the data provided, with very little feature engineering and no physics knowledge
• Runner-up Tim Salimans, who has a PhD in Econometrics and works as a data science consultant, developed a solution he describes as a combination of a large number of boosted decision tree ensembles
• A Special High Energy Physics meets Machine Learning Award was presented to Tianqi Chen and Tong He of Team Crowwork. Their XGBoost algorithm was an excellent compromise between performance and simplicity, which could improve tools currently used in high-energy physics.
Winners of the Higgs Machine Learning Challenge: Gábor Melis and Tim Salimans (top row), Tianqi Chen and Tong He (bottom row).
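The common thread in the winning entries is ensembling: averaging the predictions of several independently trained models before thresholding. A toy sketch with made-up model outputs (the function and scores below are hypothetical, not from the Challenge):

```python
# Toy model ensembling: average per-event signal probabilities across models.
def ensemble_predict(prob_lists, threshold=0.5):
    """Average each event's probability over all models, then classify."""
    n_models = len(prob_lists)
    avg = [sum(ps) / n_models for ps in zip(*prob_lists)]
    return ["signal" if p > threshold else "background" for p in avg]

# Three hypothetical models scoring four events; averaging smooths out
# individual models' borderline mistakes (e.g. event 2's 0.6 score)
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.7],
    [0.7, 0.3, 0.3, 0.8],
]
print(ensemble_predict(model_probs))
```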
The STFC Big Scientific Data Benchmarks?
Use open Scientific Datasets for ‘Experimental’ Data Science
• The idea is to create scientific datasets that are sufficiently large and complex to provide a realistic testing ground for ML algorithms.
• These open datasets can form the basis for training academics and industry to understand which algorithm and hardware execution platform is best for finding different features in the data.
➢ Use experimental data from STFC Large Scale Facilities to create a set of scientific ‘benchmark’ datasets
➢Complement the computational benchmarks from the Hartree Centre
Experimental Data Science Datasets
• Astronomy datasets from LSST, SKA
• Particle Physics LHC datasets from ATLAS, CMS
• Large Scale Facilities datasets – DLS, ISIS, CLF and Hartree
• Environmental datasets from JASMIN data
• Fusion datasets from Culham
➢ The creation of such curated datasets will allow experimentation and training in Machine Learning technologies executed on different hardware architectures
➢Use these datasets as basis for training courses in Data Science for both academia and industry
A Fusion Big Scientific Data Benchmark?
• Filamentary plasma structures play an important role in turbulent particle transport
• Archive of 400 GB of video data from the MAST Tokamak at Culham
• Developing a synthetic training set of simulated filaments with known properties
• Promising exploration of the applicability of Machine Learning techniques
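The idea of a synthetic training set can be sketched as follows: render bright line segments with known parameters onto blank frames and keep those parameters as ground-truth labels. The geometry and noise models of the real MAST synthetic data are of course far richer; everything below is a made-up toy.

```python
# Toy synthetic "filament" frames with known ground-truth parameters.
import random

def synth_frame(size=32, n_filaments=3, seed=1):
    """Draw n_filaments tilted bright lines; return (frame, labels)."""
    random.seed(seed)
    frame = [[0.0] * size for _ in range(size)]
    labels = []
    for _ in range(n_filaments):
        col = random.randrange(size)         # filament position at the top row
        slope = random.uniform(-0.3, 0.3)    # slight tilt down the frame
        labels.append((col, slope))          # ground truth for training/evaluation
        for row in range(size):
            c = int(col + slope * row)
            if 0 <= c < size:
                frame[row][c] = 1.0
    return frame, labels

frame, labels = synth_frame()
print(len(labels), sum(map(sum, frame)))
```

A detector trained on such frames can be scored exactly, because the true filament parameters are recorded alongside each image.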
Proof of Concept: An Initial Set of Benchmarks
1. Particle Physics – LHC datasets (Particle tracking, Supersymmetric particles)
2. Astronomy – LSST, SKA simulated datasets (STFC CDTs in Data Intensive Science)
3. Diamond – Cryo-SXT dataset (e.g. Mark Basham, DLS)
4. Diamond - Cryo-EM dataset (e.g. Dave Stuart, DLS)
5. CLF – Octopus, single molecule tracking dataset (e.g. Dan Rolfe, CLF)
6. ISIS – Peak detection with noisy datasets (e.g. Anders Markvardsen, ISIS)
7. Environment – Extreme weather events, air quality satellite data (JASMIN/CEDA, RAL Space)
8. Fusion – MAST Filament Video Dataset (Rob Akers, Culham)
The Ada Lovelace Center
The Data Analysis Gap
• Complex data
  • Too big to move in some cases
  • High CPU / memory requirements
  • May need to combine data from different sources
• Complex software environments
  • Variation in users’ knowledge of HPC
  • Variation in home computing environments
  • Variation in the availability of analysis and modelling software
• Diverse science communities supported by the Facilities
  • Different analysis software requirements
➢ Users’ access to usable computing for experimental science data is a barrier to science
With thanks to Brian Matthews, SCD
The ALC - Towards a “Super-facility”?
“A network of connected facilities, software and expertiseto enable new modes of discovery”
Katie Antypas, Inder Monga, Lawrence Berkeley National Laboratory
Infrastructure + Software + Expertise
With Common Interfaces and Transparent Access
[Diagram: Data Catalogue · Petabyte Data Storage · Parallel File System · HPC (CPU+GPU) · Visualisation · Software · Data Acquisition]