05/05/14 Emmanuel Gangler – Ljubljana seminar 1/27
Emmanuel Gangler – LPC – Clermont-Ferrand (France)
An era of Big Data in astronomy
* The project
* The science
* The Big Data
In short:
● A stage-4 survey:
  ● 8.4 m telescope
  ● Cerro Pachon (Chile)
  ● (Very) wide-field astronomy
  ● 9.6 deg² camera
  ● 0.2'' pixel
  ● All visible sky in 6 bands (ugrizy) (~20000 deg²)
  ● 15 s exposure, 1 visit / 3 days
  ● Over 10 years!
  ● 60 Pbytes of raw data
Observation strategy
● Main survey: 90% of time
  ● Pairs of 15-second exposures
  ● 2 visits separated by 1 h
  ● mag r~24.5 in 1 exposure
  ● r~27.5 after 10 yr (150 visits)
● Deep drilling survey: 10% of time
  ● r~26; 30 fields, 300 deg²
  ● Continuous exposures, 1 h/night
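The step from r~24.5 in one exposure to r~27.5 after 10 years can be checked with a short calculation. This is a minimal sketch, assuming background-limited noise (signal-to-noise grows as the square root of the number of stacked exposures) and two exposures per visit; the function name `coadd_depth` is made up for the illustration.

```python
import math

def coadd_depth(single_depth_mag, n_exposures):
    """Limiting magnitude after stacking n_exposures equal exposures.

    Assumption: background-limited noise, so S/N grows as sqrt(n) and the
    depth gain is 2.5*log10(sqrt(n)) = 1.25*log10(n) magnitudes.
    """
    return single_depth_mag + 1.25 * math.log10(n_exposures)

single = 24.5            # r-band depth of one exposure (from the slide)
visits = 150             # visits over 10 years (from the slide)
exposures = 2 * visits   # each visit is a pair of 15 s exposures

print(f"coadded depth: r ~ {coadd_depth(single, exposures):.1f}")
```

Under these assumptions the stack reaches roughly r~27.6, in line with the r~27.5 quoted on the slide.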
The LSST consortium
R&D Construction Operations
● Non-profit organization
● 35 institutions, with major US contribution:
● SLAC, Google...
● Non-US : Chilean republic, France/CNRS/IN2P3
● + international partners (UK, ...)
● Expect ~900 scientists involved
● NSF/DOE/Private funding : ~670 M$
● Operations: 10 years (2022-2031)
The Cerro Pachon site
The camera and telescope
● Camera is located in the telescope beam
● Constraints on
  ● Envelope (ø < 1.6 m)
  ● Mass (3 t)
  ● Heat dissipation
  ● Lifetime (10 yr) and maintenance
● Median seeing: 0.6'' → 0.2'' pixel
● Minimum pixel size 10 μm
● Plate scale → 10.3 m focal length
● Depth requirement: aperture 6.5 m → focal ratio < 1.5
● FOV 3.5° → 3.2 Gpix → 63 cm ø focal plane
● Fast slew (5°/sec) → compact design
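The chain of numbers above (pixel size and pixel scale give the focal length, the field of view gives the focal-plane size and pixel count) can be reproduced in a few lines. This is only a sketch using the small-angle approximation and a filled circular field; the variable names are illustrative.

```python
import math

ARCSEC = math.radians(1 / 3600)      # one arcsecond in radians

pixel_size_m = 10e-6                 # 10 um pixel (from the slide)
pixel_scale_rad = 0.2 * ARCSEC       # 0.2'' per pixel (from the slide)
fov_deg = 3.5                        # field-of-view diameter (from the slide)

# Plate scale: one pixel subtends pixel_size / focal_length radians.
focal_length_m = pixel_size_m / pixel_scale_rad

# Small-angle approximation for the linear size of the field.
focal_plane_diameter_m = focal_length_m * math.radians(fov_deg)

# Pixel count, assuming the whole circular field is tiled with pixels.
fov_area_deg2 = math.pi * (fov_deg / 2) ** 2
pixels_per_deg2 = (3600 / 0.2) ** 2
n_pixels = fov_area_deg2 * pixels_per_deg2

print(f"focal length   ~ {focal_length_m:.1f} m")
print(f"focal plane    ~ {100 * focal_plane_diameter_m:.0f} cm")
print(f"field area     ~ {fov_area_deg2:.1f} deg^2")
print(f"pixel count    ~ {n_pixels / 1e9:.1f} Gpix")
```

The idealized circular field gives slightly fewer pixels than the 3.2 Gpix quoted, which is expected since the real focal plane is a mosaic of square sensors rather than a perfect disc.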
The focal plane
189 CCDs
Which science will LSST address ?
Slide from Ivezic
The LSST science book
● 4 major themes
  ● Dark Energy, Dark matter
  ● Mapping Milky Way
  ● Transient optical sky
  ● Solar system
● 11 science collaborations
  ● Weak Lensing
  ● BAO
  ● Supernovae
  ● Strong lensing
  ● Galaxies
  ● AGN
  ● Milky Way and the local volume structure
  ● Stellar populations
  ● Transient/variable stars
  ● Solar system
  ● Informatics and statistics
arXiv:0912.0201
arXiv:1211.0310
Slide from Ivezic
LSST dark energy probes
● Quantitative and qualitative step:
● ~50 000 SN deep field (2013: 500 SN)
→ homogeneity test
● ~10 B galaxies (10 M DESI; 1 M BOSS)
→ Structure growth
→ Redshift tomography
→ GR consistency checks
[Figure label: BAO]
Transients
[Figure labels: GRB Orphan afterglows, GRB afterglow, ?]
● Transient detection
  ● High cadence in deep drilling field
  ● High rate of false positives
● Follow-up will be the key
  ● LSST releases the alerts within 1 min
  ● Spectroscopic
  ● Other wavelengths
● LSST is also a follow-up instrument!
Kasliwal
Prsa
Big Data: when « more is different »
Processing capacity doubles every 18 months
but data volume doubles every year!
x1000 in 10 years: same trend in astronomy
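The two doubling times can be compared directly. The following sketch simply evaluates 2^(years / doubling time) for both trends; the function name `growth_factor` is made up for the illustration.

```python
def growth_factor(years, doubling_time_years):
    """Multiplicative growth after `years` for a given doubling time."""
    return 2 ** (years / doubling_time_years)

data_growth = growth_factor(10, 1.0)      # data volume doubles every year
compute_growth = growth_factor(10, 1.5)   # processing doubles every 18 months
gap = data_growth / compute_growth

print(f"data volume: x{data_growth:.0f}")
print(f"processing : x{compute_growth:.0f}")
print(f"gap        : x{gap:.1f}")
```

Over a decade the data volume grows by about a factor 1000 while processing capacity grows by about a factor 100, so the gap widens by roughly an order of magnitude.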
Data management is a pillar of the project :
Telescope, Camera, Data Management, Outreach
“How do you turn petabytes of data into scientific knowledge?”
Kirk Borne (George Mason U.)
« The data volumes […] of LSST are so large that the limitation on our ability to do science isn't the ability to collect the data, it's the ability to understand […] the data »
Andrew Connolly (U. Washington)
LSST data flow
~ 1/1 000 000 000 of LSST data
Camera: 189 CCDs (16 Mpix each) read in parallel
→ 3.2 Gpixels!
~ 6 Gbyte / 17 seconds
→ 15 TB / night
Over 10 years!
~ 1000 visits per field
→ opens the time domain
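The data-flow figures can be cross-checked with simple arithmetic. This sketch takes the 189 CCDs of the focal-plane slide and assumes 4k x 4k sensors with 2 bytes per raw pixel; the bytes-per-pixel value is an assumption, not a number from the slides.

```python
n_ccd = 189                     # CCDs in the focal plane (focal-plane slide)
pixels_per_ccd = 4096 * 4096    # "16 Mpix", assuming 4k x 4k sensors
bytes_per_pixel = 2             # assumption: 16-bit raw samples

n_pixels = n_ccd * pixels_per_ccd
exposure_bytes = n_pixels * bytes_per_pixel

nightly_bytes = 15e12           # 15 TB / night (from the slide)
exposures_per_night = nightly_bytes / exposure_bytes

print(f"pixels per exposure : {n_pixels / 1e9:.2f} Gpix")
print(f"bytes per exposure  : {exposure_bytes / 1e9:.1f} GB")
print(f"exposures per night : {exposures_per_night:.0f}")
```

With these assumptions one exposure is a little over 6 GB, and 15 TB per night corresponds to a couple of thousand exposures.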
LSST Data in short
● Huge data flow:
  ● Images: 2 x 6.4 Gbyte / 39 seconds
  ● 15 TB/night
  ● 100 PB image archives
  ● 40×10⁹ objects (100-200 TB catalog)
  ● 5 000×10⁹ detections ''sources'' (3-5 PB catalog)
  ● 32 000×10⁹ measurements ''forced sources'' (1-2 PB catalog)
  ● Nightly transient alerts: >2×10⁶
● Big data paradigm: acquire data first, analyze them later
● Astroinformatics paradigm: characterize first, analyze later
● Data analysis is NOT part of the project
Simulation 1 CCD 4k x 4k
(arXiv:0909.3892)
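One way to read the catalog sizes above is as an average number of bytes stored per row. The sketch below only divides the quoted catalog volumes by the quoted row counts; it makes no claim about the actual schema.

```python
# (number of rows, smallest quoted size in bytes, largest quoted size in bytes)
catalogs = {
    "objects":        (40e9,    100e12, 200e12),
    "sources":        (5000e9,  3e15,   5e15),
    "forced sources": (32000e9, 1e15,   2e15),
}

def bytes_per_row(rows, total_bytes):
    """Average storage per catalog row."""
    return total_bytes / rows

for name, (rows, low, high) in catalogs.items():
    print(f"{name:15s}: {bytes_per_row(rows, low):7.0f} to "
          f"{bytes_per_row(rows, high):7.0f} bytes per row")
```

The object table comes out at a few kilobytes per row, while a forced-source measurement takes only a few tens of bytes, which is why the table with the most rows is not the largest one.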
Available data:
How to handle the problem
● Interdisciplinary field with statisticians and IT researchers
● Data access
  – Which DB model? (relational, graph, row, column, …)
  – How to parallelize access
● Data visualization
  – Interactive data exploration
  – Explore time domain, 3D
  – Explore new displays (screen walls, pads)
● Data mining
  – Machine learning: supervised/unsupervised
  – Sub-linear and approximate algorithms
Are LSST data « big » ?

                  SDSS     LSST 1 yr (~2020)   LSST 10 yr (~2030)
Raw data          14 TB    6 PB                60 PB
Archive (tape)             19 PB               270 PB
Disk (DAC)                 16 PB               90 PB
DB (baseline)     7 TB     0.5 PB              5 PB

Moore equivalent 2014: 12 TB, 1.2 TB
● 12 TB: >> usual sizes handled by a DBMS → Big
● ~6 h for a full scan at 600 MB/s
● 110 h to index 3 TB with MySQL
● System has to be distributed
● Moore's « law » isn't one ...
Answer to Schlegel 2012
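The full-scan estimate, and the case for a distributed system, follow from one division. In this sketch the 1000-node figure is borrowed from the Qserv baseline and perfect linear scaling is assumed, which a real system would not reach; `full_scan_hours` is an illustrative name.

```python
def full_scan_hours(size_bytes, throughput_bytes_per_s, n_nodes=1):
    """Time to read the whole table once, assuming ideal linear scaling."""
    return size_bytes / (throughput_bytes_per_s * n_nodes) / 3600

size = 12e12          # 12 TB table
throughput = 600e6    # 600 MB/s sequential read per node

single = full_scan_hours(size, throughput)
distributed = full_scan_hours(size, throughput, n_nodes=1000)

print(f"single node : {single:.1f} h")
print(f"1000 nodes  : {distributed * 3600:.0f} s (ideal scaling)")
```

A single node needs between 5 and 6 hours for one pass, whereas spreading the same table over many nodes brings a full scan down to seconds in the ideal case.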
Qserv
The LSST baseline
Unified user interface
− User input in SQL
− Distributed nature is hidden
− ~1000 nodes in parallel
− Fault-tolerant
− Commodity hardware
Partitioning:
● Geometry (cone searches)
● Light curves (Sources and Object together)
Limitations
● Patches are independent on the sky
● Results size
● Computation time
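To illustrate why geometric partitioning helps cone searches, here is a toy sky partitioning. It is not Qserv's actual scheme: the 10° x 10° grid, the function names and the chunk numbering are all assumptions made for this sketch. The point is only that a small cone touches a handful of chunks, so most nodes never see the query.

```python
import math

STRIPE_DEG = 10.0                     # hypothetical chunk height in declination
CHUNK_DEG = 10.0                      # hypothetical chunk width in right ascension
N_STRIPES = int(180 / STRIPE_DEG)
N_RA = int(360 / CHUNK_DEG)

def chunk_id(ra_deg, dec_deg):
    """Map a sky position to a (stripe, chunk) index pair."""
    stripe = min(int((dec_deg + 90.0) // STRIPE_DEG), N_STRIPES - 1)
    chunk = int((ra_deg % 360.0) // CHUNK_DEG)
    return stripe, chunk

def chunks_for_cone(ra_deg, dec_deg, radius_deg):
    """Chunks that may overlap a cone, using its RA/Dec bounding box."""
    dec_lo = max(dec_deg - radius_deg, -90.0)
    dec_hi = min(dec_deg + radius_deg, 90.0)
    stripe_lo, _ = chunk_id(0.0, dec_lo)
    stripe_hi, _ = chunk_id(0.0, dec_hi)

    if abs(dec_deg) + radius_deg >= 90.0:
        ra_chunks = set(range(N_RA))          # cone reaches a pole: all RA
    else:
        # Half-width in RA of the bounding box of a spherical cap.
        ra_half = math.degrees(math.asin(
            math.sin(math.radians(radius_deg)) / math.cos(math.radians(dec_deg))))
        first = math.floor((ra_deg - ra_half) / CHUNK_DEG)
        last = math.floor((ra_deg + ra_half) / CHUNK_DEG)
        ra_chunks = {c % N_RA for c in range(first, last + 1)}   # handles RA wrap

    return {(s, c) for s in range(stripe_lo, stripe_hi + 1) for c in ra_chunks}

touched = chunks_for_cone(15.0, 5.0, 1.0)
print(f"{len(touched)} chunk(s) out of {N_STRIPES * N_RA}: {sorted(touched)}")
```

A 1° cone well inside a chunk touches a single chunk; one straddling RA = 0 and Dec = 0 touches four. The bounding box is conservative, so a real system would still apply the exact distance cut on each node.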
A new approach: map/reduce
Move the computation to the data
→ High-level computing skills needed
Qserv
● Still classical approach
  − Select (extract) data
  − Run user algorithm
● Will fail when selectivity is low!
[Diagram labels: DBMS, Computation, Query]
Map/Reduce
● Write algorithm in parallel form
  − No data transfer
  − Use local CPU
● Not all tasks can be written in this form
[Diagram label: Controller]
(Cartoon from PhD comics)
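The map/reduce idea can be sketched on a single machine. In the example below each inner list stands for the data held by one node (the magnitudes are invented); the map step builds a small histogram where the data lives, and the reduce step merges the partial results, so only histograms, not rows, would travel over the network.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each list stands for the magnitudes stored on one node.
partitions = [
    [17.2, 18.9, 21.4, 21.7],
    [16.5, 21.1, 23.8],
    [18.3, 18.4, 24.9, 23.2],
]

def map_partition(magnitudes):
    """Map step, run next to the data: histogram in 1-magnitude bins."""
    return Counter(int(m) for m in magnitudes)

def merge(left, right):
    """Reduce step: combine two partial histograms."""
    return left + right

histogram = reduce(merge, map(map_partition, partitions), Counter())
print(sorted(histogram.items()))
```

A histogram fits this form because merging is associative and commutative. Tasks that need to compare arbitrary pairs of rows across partitions do not decompose this easily, which is the caveat in the last bullet.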
Data mining
Astroinformatics point of view:
Borne 2009
VO domain
Which knowledge to extract?
● Classical problems in astronomy
  – Object classification
    ● Cluster significance? (statistical/scientific)
    ● Confusion problems
    ● Efficient algorithms for
      – High-dimensional problems (> 1000 dimensions, > 10¹⁰ entries)
      – 2-point (or N-point) correlations
  – Rarity detection
    ● Rarity metric, efficient algorithms
    ● Discoveries? Anomalies (detector, software)
  – Dimensionality reduction
    ● Compact data representation
  – Measurement errors
    ● Usually neglected in existing approaches
S. G. Djorgovski
Classification examples
● Color-color plot classification
● Star-galaxy (+morphology)
● Star/QSO
● ...
D. Bard
Classification as a process
● Whole range of classifiers (ANN, SVM, KNN, LDA, QDA, GMM ...)
● Optimum depends on specific task
● ROC curve (receiver operating characteristic)
[Figure panels: KNN, SVM]
● Specific training required to be efficient
● Methods are not black boxes! → Astrostatistics
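As a concrete, deliberately small instance of "classification as a process", the sketch below trains a k-nearest-neighbour classifier on synthetic two-colour data and builds its ROC curve by sweeping the score threshold. The cluster centres, scatter and sample sizes are invented for the illustration and carry no astrophysical meaning.

```python
import random

def knn_score(train, point, k=5):
    """Fraction of the k nearest training points labelled 1 (say, 'star')."""
    nearest = sorted(
        train,
        key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
    )[:k]
    return sum(label for _, label in nearest) / k

def roc_curve(scores, labels):
    """(false positive rate, true positive rate) for each score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def sample(rng, n, centre, label):
    """Draw n synthetic (colour1, colour2) points around a made-up centre."""
    return [((rng.gauss(centre[0], 0.3), rng.gauss(centre[1], 0.3)), label)
            for _ in range(n)]

rng = random.Random(0)
train = sample(rng, 200, (0.5, 0.2), 1) + sample(rng, 200, (1.2, 0.8), 0)
test = sample(rng, 100, (0.5, 0.2), 1) + sample(rng, 100, (1.2, 0.8), 0)

scores = [knn_score(train, point) for point, _ in test]
labels = [label for _, label in test]
curve = roc_curve(scores, labels)
print(f"AUC = {auc(curve):.3f}")
```

Changing `k`, the training sample or the class overlap moves the curve, which is the sense in which the optimum depends on the task and the method cannot be used as a black box.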
Rarity detection
● Transient searches dominated by noise
● Supernova detection (http://arxiv.org/abs/1106.5491)
● 1000 detections for 1 event
● Millions of detections / night!
PTF data
Classifying transients:
● A real Big Data Mining issue
New ideas
● IT research comes with new approaches
● Tested in Big Data environments
● Yet to be applied (if it makes sense) to astronomy
● Integration of Data Mining and Databases
  – Searching for relations between variables (functional dependencies)
    ● e.g. Is (u,g,r,i,z) a predictor of 'is Star'?
      Works blindly: is (c1,c2,c5) predicting (c4)?
  – Skyline or Pareto front approach
    ● e.g. Define a partial order in the data
      Extract the extrema under this partial order
  – Graph-based searches (come from text analysis)
    ● e.g. Connect objects by their « likeness »
      Search for information contained in the « is alike » network
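The skyline (Pareto front) idea can be made concrete with a brute-force sketch. Here "better" is taken to mean "smaller on every axis", an arbitrary choice for the example, and the points are invented; the quadratic loop is only for readability and is not one of the efficient skyline algorithms from the database literature.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one.

    Assumption for this sketch: smaller values are better.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline(points):
    """Points not dominated by any other point (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

points = [(1, 9), (2, 5), (3, 6), (4, 2), (5, 4), (6, 1), (7, 7)]
print(skyline(points))   # [(1, 9), (2, 5), (4, 2), (6, 1)]
```

The dominance relation is the partial order: (3, 6) is dominated by (2, 5), while (2, 5) and (4, 2) cannot be compared, so both stay on the front.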
New ideas
● General statements about Big Data
● 1-pass (or few-pass) only algorithms
  – Read the data at most once whenever possible
● Sublinear approach:
  – Don't even read all data
  – Approximate methods: degree of approximation is under control
  – Some can go down to log-log-N!
    ● e.g. compute an approximate number of non-empty bins in an N-dimensional histogram
  – Not all problems can be approximated
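Counting non-empty histogram bins is a distinct-count problem, and one well-known one-pass approximate method for it is HyperLogLog. The slides do not name a specific algorithm, so this choice is an assumption; the sketch below is a compact textbook-style version with a linear-counting correction for small counts, applied to an invented 3-dimensional histogram.

```python
import hashlib
import math
import random

class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(repr(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")            # 64-bit hash
        index = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1     # position of the leftmost 1 bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # linear counting
        return estimate

rng = random.Random(1)
hll = HyperLogLog()
exact = set()
for _ in range(50000):
    # Bin index of a random point in a 40 x 40 x 40 histogram.
    bin_index = tuple(int(rng.random() * 40) for _ in range(3))
    hll.add(bin_index)      # one pass, fixed memory
    exact.add(bin_index)    # exact answer, kept only for comparison

print(f"exact: {len(exact)}, approximate: {hll.count():.0f}")
```

Each register only stores a rank of at most a few tens, so it needs about log2(log2 N) bits, which is where the "log-log-N" memory scaling comes from. The price is a relative error of order 1.04/sqrt(m), under 2% for the 4096 registers used here.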
Conclusions
● LSST will provide unprecedented data
● Opens up time domain
● … and a LOT of scientific opportunities
● Proper knowledge of how to use these data is needed
● Training of students
● IT cross-disciplinary field: astroinformatics
– Annual conference
– National initiatives
– EU initiatives (COST BigSkyEarth)
● Data access has to be organized
● A European center is under study
– Expressions of interest are welcome!
Thank you ...