Theme 9 - Summary Dissemination

Participants: Markus Dolensky (ICRAR), Stefan Gillard (EngineRoom.io), Kim Monks (Systemic), Robert Shen (AAL), Congming Shi (KMUST), Cheqing Jin (ECNU), Séverin Gaudet (CADC)

Theme 9 - Summary Dissemination - SHAOcenter.shao.ac.cn/CRATIV/file/Day4-Report-Theme_9.pdf · 1. Examples – Docker / VM standardisation. Modular pipeline workflows. A Science



Approach

1. Identify science data product types and SRC services
2. Identify suitable (VO) technologies
3. Dissemination strategy across sites and centres
4. SRC services vs data management environments
5. What to change? Lessons learned from precursors and pathfinders
6. Define data challenges based on above
7. Prioritise a data challenge


Data Product Types

Image Products 1: Image Cubes

• Imaging data for Continuum, as cleaned restored Taylor term images
• Residual image
• Clean component image
• Spectral line cube after continuum subtracted
• Residual spectral line image (i.e. residuals after clean applied)
• Representative Point Spread Function for observations

Image Products 2: UV-grids

• Calibrated visibilities, gridded onto grids at spatial and frequency resolution required by experiment. One grid per facet.
• Accumulated Weights at each uv cell in each grid (without additional weighting (e.g. uniform) applied).

Calibrated Visibilities

Calibrated visibility data (for example for EoR experiments) with direction-dependent calibration information, with time and frequency averaging performed as requested to reduce data volume.

LSM Catalogue

Provides a catalogue of a subset of the Global Sky Model (GSM) containing the sources relevant for the scheduling block being processed. These are the sources in the FOV, as well as, potentially, strong sources outside of the current FOV. Initially the LSM is filled from the GSM; during data processing sources found in the images are added to the LSM.

Transient Source Catalogue

Time ordered catalogue of candidate transient objects pertaining to each detection alert from the Fast Imaging pipeline.

Pulsar Timing Solutions

For each of the observed pulsars the output data from the pulsar timing section will include:

• The original input data as well as averaged versions of these data products (averaged in polarisation, frequency or time) in PSRFITS format
• The arrival time of the pulse
• The residuals from the current best-fit model for the pulsar
• An updated model of the arrival times

Transient Buffer Data

Voltage data passed through from the CSP when the transient buffer is triggered.

Sieved Pulsar and Transient Candidates

• A data cube which will be folded and dedispersed
• A single ranked list of non-imaging transient candidates from each scheduling block
• A set of diagnostics/heuristics

Science Product Catalogue

A database relating to all Science Products processed by the SDP, including associated scientific data, that can be queried and searched for the subsequent delivery of these Data Products.

Calibration Products

TBD/TBC

QA and Processing Log

For each of the above


IVOA Support

• Datalink L4 mechanics and VO context - diagram …

SKA Data Product | IVOA Data Discovery | IVOA Data Access | IVOA Data Model
Image cubes (image product) | TAP, SIAv2, DataLink | SODA | ObsCore, NDimCubeDM
UV-grids (image product) | TAP, SIAv2, DataLink | SODA | ObsCore, NDimCubeDM
Calibrated visibilities | TAP, SIAv2, DataLink | SODA | ObsCore, NDimCubeDM
GSM/LSM catalogue | TAP, DataLink | TAP | RegTAP, CharDM, STC, PhotDM, VO-DML
Pulsar timing solutions | TAP, DataLink | TAP | ObsCore, TimeSeriesDM
Sieved pulsar and transient candidates | TAP, DataLink | TAP | -
Science product catalogue | TAP, DataLink | TAP | RegTAP, STC, PhotDM, VO-DML, CatalogDM, SourceDM

*) greyed out candidate standards indicate unknowns in data products or future standard extensions
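
As a rough illustration of how the data discovery column might be exercised, the sketch below builds a synchronous TAP request that searches an ObsCore table for image cubes. The service URL `https://src.example.org/tap` is a placeholder and not a real SKA or SRC endpoint, and the helper names are hypothetical. The table `ivoa.obscore` and the columns used follow the IVOA ObsCore standard. The code only constructs the request URL; it does not contact any service.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): no SRC service URL is given in the slides.
TAP_BASE_URL = "https://src.example.org/tap"


def build_obscore_query(product_type, max_rows=10):
    """Return an ADQL query listing ObsCore records of one dataproduct_type."""
    if not product_type.isalnum():
        raise ValueError("product_type must be a plain alphanumeric token")
    return (
        f"SELECT TOP {int(max_rows)} obs_publisher_did, access_url, access_format "
        f"FROM ivoa.obscore WHERE dataproduct_type = '{product_type}'"
    )


def build_tap_sync_url(adql, base_url=TAP_BASE_URL):
    """Encode an ADQL query as a TAP synchronous (/sync) GET request."""
    params = {
        "REQUEST": "doQuery",
        "LANG": "ADQL",
        "FORMAT": "votable",
        "QUERY": adql,
    }
    return f"{base_url}/sync?{urlencode(params)}"


cube_query = build_obscore_query("cube")
cube_url = build_tap_sync_url(cube_query)
print(cube_url)
```

The `access_url` returned for each record would then typically point at a DataLink document or directly at the data, from which a SODA cutout service can be reached.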


Precursor Lessons learned: MWA Dissemination
Chen Wu, ICRAR

• File size makes a big difference
• Respect transport protocols (e.g. pipelining)
• Linux kernel tuning (e.g. net.ipv4.tcp_rmem)
• Use checksums to detect data corruption
• Benchmarking, simulation, evaluation, etc.

Transfer Perth - Boston: 20,000 km at 400 MB/s
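
The quoted figures give a feel for why kernel buffer tuning matters on such a path. The back-of-envelope sketch below estimates the bandwidth-delay product, i.e. the amount of data in flight that TCP buffers must hold. It assumes, beyond what the slide states, that 20,000 km is the one-way fibre path length and that signals travel at roughly 200,000 km/s in fibre; real routes and round-trip times will differ.

```python
PATH_KM = 20_000          # distance from the slide, assumed to be the one-way path length
RATE_MB_PER_S = 400       # sustained transfer rate from the slide
FIBRE_KM_PER_S = 200_000  # assumption: roughly two thirds of the speed of light


def round_trip_time_s(path_km, speed_km_per_s=FIBRE_KM_PER_S):
    """Propagation-only round-trip time; ignores routing and queueing delay."""
    return 2 * path_km / speed_km_per_s


def bandwidth_delay_product_mb(rate_mb_per_s, rtt_s):
    """Data in flight (MB) needed to keep the link busy for one round trip."""
    return rate_mb_per_s * rtt_s


rtt = round_trip_time_s(PATH_KM)
bdp = bandwidth_delay_product_mb(RATE_MB_PER_S, rtt)
print(f"RTT ~ {rtt:.2f} s, bandwidth-delay product ~ {bdp:.0f} MB")
```

Under these assumptions a single TCP stream would need on the order of 80 MB of receive buffer, which is the kind of value the maximum field of net.ipv4.tcp_rmem would have to accommodate. Splitting the transfer over parallel streams lowers the per-stream requirement.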

• Data access pattern changes
• Optimize at the right place
• Fine tune data transfer
• Ensure reversibility
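
The checksum lesson above can be sketched as follows. This is a minimal illustration, not the MWA archive's actual procedure: the slides do not say which algorithm was used, so SHA-256 is an assumption, and the function names are hypothetical. The file is read in chunks so that large data products need not fit in memory.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks and return the hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path, expected_hex, algorithm="sha256"):
    """Return True if the file at the destination matches the source checksum."""
    return file_checksum(path, algorithm) == expected_hex.lower()


# Small self-contained demonstration using a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"visibility data" * 1000)
    demo_path = tmp.name

sent_checksum = file_checksum(demo_path)             # computed at the source
intact = verify_transfer(demo_path, sent_checksum)   # recomputed at the destination

with open(demo_path, "ab") as handle:                # simulate corruption in transit
    handle.write(b"\x00")
corruption_detected = not verify_transfer(demo_path, sent_checksum)

os.remove(demo_path)
```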


Negative Database ND

“As a dedicated synthetic-aperture radio interferometer, the Mingantu Spectral Radioheliograph (MUSER) has entered the stage of routine observation. More than 23 million data records in a day should be effectively managed so as to provide high performance data query and retrieval for scientific data reduction. We present a novel data management technique named negative database (ND) used to implement data management system for the MUSER. Based on key-value database, ND technique made fully use of the complementation set of observational data and significantly reduced the storage volume. Meanwhile, the ND attained the similar query performance while comparing with related database management system. The principle of the ND determined that the data must have the feature of the consecutive with fixed time interval. Otherwise, we have no way to deduce the corresponding data based on a few discrete records. Therefore, the application and popularization of the ND would be limited seriously. But fortunately, the scientific data management is suitable for using ND technique. Because the scientific data, e.g., astronomical radio data, are generally time series data. These data have a stable and fixed sampling interval. The ND developed for MUSER may be capable for serving the massive data management of the Square Kilometer Array (SKA)”

Congming Shi, Hui Deng, KMUST
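
The principle described in the abstract can be illustrated with a toy sketch: when records arrive at a fixed interval, it is enough to store the start time, the interval, the expected count, and the (usually small) set of records that are absent. This is only an interpretation of the abstract, not the MUSER implementation; the class and method names are hypothetical, and an in-memory set stands in for the key-value database.

```python
from dataclasses import dataclass, field


@dataclass
class NegativeIndex:
    start: int      # timestamp of the first expected record (e.g. in ms)
    interval: int   # fixed sampling interval, same unit as start
    count: int      # number of records expected in the series
    missing: set = field(default_factory=set)  # grid positions with no record

    def _position(self, timestamp):
        """Map a timestamp to its grid position, or raise if it is off-grid."""
        offset = timestamp - self.start
        if offset < 0 or offset % self.interval != 0:
            raise ValueError("timestamp is not on the sampling grid")
        position = offset // self.interval
        if position >= self.count:
            raise ValueError("timestamp is beyond the end of the series")
        return position

    def mark_missing(self, timestamp):
        """Record that the observation expected at this time does not exist."""
        self.missing.add(self._position(timestamp))

    def has_record(self, timestamp):
        """A record exists if it is on the grid and not in the negative set."""
        try:
            return self._position(timestamp) not in self.missing
        except ValueError:
            return False

    def stored_entries(self):
        """Number of entries actually stored, instead of one per record."""
        return len(self.missing)


index = NegativeIndex(start=0, interval=25, count=1000)
index.mark_missing(50)
```

Here 1000 expected records are described by a single stored entry. As the abstract notes, this only works because the sampling interval is fixed.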


Multi-Version Data Management

• Motivation
  • Processing astronomical data often involves a number of images taken at different times. Data compression will be helpful due to high similarity.

Main idea: only reserve a few base versions + lots of delta versions

Cheqing Jin, ECNU
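
The base-plus-delta idea can be sketched with a deliberately simple encoding. It assumes, for illustration only, that each version is an equal-length sequence of values and that a delta is a mapping from changed positions to new values; the slides do not specify the actual delta format, and the function names are hypothetical.

```python
def make_delta(base, version):
    """Return {position: new_value} for every position where version differs from base."""
    if len(base) != len(version):
        raise ValueError("this sketch only handles equal-length versions")
    return {i: new for i, (old, new) in enumerate(zip(base, version)) if old != new}


def apply_delta(base, delta):
    """Rebuild a full version from its base version and its delta."""
    rebuilt = list(base)
    for position, value in delta.items():
        rebuilt[position] = value
    return rebuilt


base_image = [10, 10, 12, 11, 10, 10]
later_image = [10, 10, 12, 15, 10, 10]   # a highly similar image taken later

delta = make_delta(base_image, later_image)
restored = apply_delta(base_image, delta)
```

When consecutive images are highly similar, the delta holds far fewer values than a full copy, which is where the storage saving comes from.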


Challenges and Future work

• Challenges
  • How to balance the query performance and storage performance
  • More original versions will speed up query processing, while fewer original versions help to save space.

• Future work
  • Select base versions by using query logs.
  • Select base versions by using a Materialize Matrix that describes the difference between versions.
  • Batch-based solution: select one base version out of each K consecutive versions.

Cheqing Jin, ECNU
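
The batch-based option above might look like the sketch below. It is one possible reading of the slide, under stated assumptions: versions are equal-length sequences, the first version of each batch of K is kept in full, and every other version is stored as a delta against its batch base, so a query applies at most one delta. The class and method names are hypothetical.

```python
class BatchVersionStore:
    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.bases = {}    # version number -> full copy (one per batch of k)
        self.deltas = {}   # version number -> {position: value} against the batch base
        self.count = 0

    def append(self, version):
        """Store a new version and return its version number."""
        number = self.count
        if number % self.k == 0:
            self.bases[number] = list(version)
        else:
            base = self.bases[number - number % self.k]
            if len(base) != len(version):
                raise ValueError("this sketch only handles equal-length versions")
            self.deltas[number] = {
                i: new for i, (old, new) in enumerate(zip(base, version)) if old != new
            }
        self.count += 1
        return number

    def get(self, number):
        """Rebuild any version from its batch base plus at most one delta."""
        if not 0 <= number < self.count:
            raise IndexError("unknown version number")
        rebuilt = list(self.bases[number - number % self.k])
        for position, value in self.deltas.get(number, {}).items():
            rebuilt[position] = value
        return rebuilt


versions = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 1, 2, 0], [5, 1, 2, 0]]
store = BatchVersionStore(k=2)
for version in versions:
    store.append(version)
```

K is the knob behind the trade-off named under Challenges: a small K keeps more full copies and smaller deltas, while a large K keeps fewer full copies but lets deltas grow as versions drift away from their base.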


Sample Data Access Policy

Short Name | Strawman Requirement
Release Status | Each data product shall have a release status and the system shall support scheduled changes to this status information.
Consistent Release State | The system shall provide methods for consistently setting and updating the release status of shared data products in the context of commensal observing.
User Roles | Support an extensible list of defined user roles for the purpose of authorisation.
Data Product Types | Support an extensible set of data product types.
Access Rules | The system shall provide an operator interface that allows the creation and subsequent maintenance of access rules based on user role and product type.
Meta Access | It shall be possible to define access to metadata and data products separately.
Access Log | There shall be a logging facility keeping track of access on a per user and on a per data product basis.
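
To make the strawman requirements more concrete, the sketch below models a scheduled release status, access rules keyed on user role and product type, separate metadata and data access, and a per-user, per-product access log. All names, roles and dates are hypothetical, and the rule that a "public" product is open to everyone is an assumption rather than something the policy table states.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataProduct:
    product_id: str
    product_type: str
    release_date: date   # scheduled change of the release status

    def release_status(self, today):
        return "public" if today >= self.release_date else "proprietary"


@dataclass
class AccessPolicy:
    rules: dict = field(default_factory=dict)        # (role, product type) -> allowed targets
    access_log: list = field(default_factory=list)   # (user, product id, target, granted)

    def allow(self, role, product_type, *targets):
        """Operator call: grant a role access to 'metadata' and/or 'data'."""
        self.rules.setdefault((role, product_type), set()).update(targets)

    def check(self, user, role, product, target, today):
        """Decide on one access request and record it in the log."""
        if target not in ("metadata", "data"):
            raise ValueError("target must be 'metadata' or 'data'")
        allowed = self.rules.get((role, product.product_type), set())
        # Assumption: once a product is public, any role may access it.
        granted = product.release_status(today) == "public" or target in allowed
        self.access_log.append((user, product.product_id, target, granted))
        return granted


policy = AccessPolicy()
policy.allow("principal_investigator", "image_cube", "metadata", "data")
policy.allow("public_user", "image_cube", "metadata")

cube = DataProduct("cube-001", "image_cube", release_date=date(2030, 1, 1))
```

With these rules, a `public_user` can see the metadata of `cube-001` before the release date but can only fetch the data afterwards, and every request ends up in `access_log`.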


Example of a Service Level Agreement following ITIL V3

1. Determine Service Catalogue
  • Science Archive: storage, user portal
  • Logging: of usage
  • Authorisation, Access Control: data access policy

  Other SRC services
  • Processing environment: re- and post-processing capability
  • Backup & Disaster Recovery
  • …

2. Define generic service levels for each service
3. Determine organisational responsibilities of SKAO and SRC

Kim Monks, Systemic

Study: From Data Access Policy to an SLA


Kim Monks, Systemic

Mini Study: From Data Access Policy to an SLA contd.


Prototype Components

• Service Catalogue (SLA)
• Science Archive: MWA Public Precursor Data Sets (~1 PB)
• FAIR Data Principles (Robert)
• Workbench: Docker, Logical Graph Editor
• VO Portal with Science Product Catalogue
• Token Framework (Stefan)
• Design Goal: Low Technology Barrier for User (Stefan)

Dissemination Prototyping Scenario… between Australian SDP Preservation System and Chinese SRC


Lowering the Technology Barrier
Stefan Gillard, EngineRoom.io

[Diagram labels: SRCs; Service Catalog; PAAS; Science; Capabilities; Science Archive]


SKA FAIR Data Principles

• What
  • Findable
  • Accessible
  • Interoperable
  • Reusable

• Why
  • To enhance the ability of humans and machines to automatically find and use the data
  • To support data reuse by individuals.
  • Horizon 2020 guidelines: FAIR data management in Horizon 2020
  • NSF/ARC

Robert Shen, AAL


SKA FAIR Data Principles contd.
Robert Shen, AAL

• Findable
  • Data are assigned a globally unique identifier.
  • Data are described with rich metadata.
  • Data/metadata are searchable.

• Accessible
  • Data are retrievable by their identifier
  • Protocol allows for an authentication and authorization procedure
  • Metadata are accessible.

• Interoperable
  • Data/metadata use controlled vocabularies.
  • Data/metadata include qualified references to other data/metadata.

• Reusable
  • Data have a clear license
  • Data are associated with their provenance
  • Data meet astronomy standards
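
A few of the points above lend themselves to an automated check on a product's metadata record. The sketch below is a minimal illustration only: the field names (`identifier`, `description`, `access_url`, `vocabulary`, `license`, `provenance`) are hypothetical stand-ins, not an SKA or IVOA schema, and passing the check does not by itself make a data product FAIR.

```python
# Hypothetical mapping from each principle to metadata fields that must be non-empty.
REQUIRED_FIELDS = {
    "Findable": ("identifier", "description"),
    "Accessible": ("access_url",),
    "Interoperable": ("vocabulary",),
    "Reusable": ("license", "provenance"),
}


def fair_gaps(record):
    """Return {principle: [missing field names]} for a metadata record (a dict)."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            gaps[principle] = missing
    return gaps


example_record = {
    "identifier": "ivo://example.org/cubes?cube-001",   # made-up identifier
    "description": "Continuum image cube",
    "access_url": "https://src.example.org/datalink?ID=cube-001",  # placeholder URL
    "vocabulary": "ObsCore",
    "provenance": "pipeline run 42",
    # no "license" entry, so the Reusable check should flag it
}

gaps = fair_gaps(example_record)
```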


Lowering the Technology Barrier to Insight

Lowering the barrier to Science / Insight via
• Tools, process, standardisation and ability to customise.
• Collaboration, operational efficiency
• Education, “PlayBooks”, repeatable process
• Automation and capacity to fail early, fail fast.

Outcome:
1. Repeatable pipelines at scale.
2. Repeatable results
3. A change in ad hoc behaviour, increase in accelerated science outcomes & failures.


Lowering the barrier to insight via
• Tools, process, standardisation and ability to customise.
  1. Examples – Docker / VM standardisation. Modular pipeline workflows. A Science Insight workbench that is essentially a giant Meccano set.
  2. Playbook architecture that enables users to generate Insight faster.
  3. Reusable architecture.
• Collaboration, operational efficiency
• Education, “PlayBooks”, repeatable process
• Automation and capacity to fail early, fail fast.

Outcome:
• Repeatable pipelines at scale.
• Repeatable results

Lowering the Technology Barrier to Insight


Great in Theory but…………..

……………………………Will it scale in the real world?

Lowering the Technology Barrier to Insight


• 760 analysts, 6 countries, 2.6 PB under management - 1 central source of truth (pipelines and PAAS)
• Multiple dockerised / VM instances (over 1K instances at one time)
• Service Catalogue inventory of over 140 (COTS & custom) applications, Machine Learning and AI Libraries.
• Hybrid architecture of GPU, CPU cluster nodes from 2 GB to 2 TB of RAM, NAND FLASH, SSD, 15k SAS, SATA, LTO5
• Users RDP / XRDP to their workbench, adjacent to Data, and adjacent to Processing.
• A centralised service catalogue of options. (Meccano style)
• Cost Recovery using a “Token” Project approach. Everything has a price, either Time or Cost.
• 70% lower support costs
• Integrated knowledge management / transfer / Best Practice as standard.
• A reusable “Data Lake” that accommodates quick customised workflows and data stream inquiry.
• Currently, after 3 years, processing 40x the data, across 150% more client “Insight” engagements, with only 12% increase in FTE staff.
• Playbooks and Knowledge dissemination key for high performance between teams.
• OUTCOME: Asking Bigger Questions --- Faster

Case Study Example


• Resource consumption
• Pipeline “Playbooks”
• Reusable code base / libraries
• ML / AI workbench, i.e. www.datarobot.com
• Interactive version control
• Inline GitHub code repository
• External data output / simulation reference schema
• Pipeline / resource calculator
• DataStore / Datalake inventory


Thank You