Page 1: The Big Data Platform Initiative of the EC Joint Research

The Big Data Platform Initiative of the EC Joint Research Centre

European Commission, Joint Research Centre

Directorate I Competences, Unit I.3 Text and Data Mining EO&SS@BigData Project

Joint Research Centre (JRC)

Data analytics workshop for official statistics (daWos)

Amsterdam. 10/09/2018

URL: https://cidportal.jrc.ec.europa.eu Contact: [email protected]

Page 2: The Big Data Platform Initiative of the EC Joint Research

Outline

• Project background

• JEODPP platform concept

• Data holdings

• Services

• Outreach

• Project evolution

Page 3: The Big Data Platform Initiative of the EC Joint Research

Project background

• Explosion of digital data sources led to the big data paradigm (Volume, Velocity, and Variety of data streams).

• Earth Observation (EO) is entering the big data era thanks to the Copernicus Sentinel satellites (full, free, and open data).

• JRC task force recommended in late 2014 to start a big data pilot project on EO and Social Sensing.

• Initial state: fragmented approach hampering collaborative working and knowledge sharing.

• Project start: January 2015.

Page 4: The Big Data Platform Initiative of the EC Joint Research

Policy context

• Regulation (EU) No 377/2014 of the European Parliament and of the Council of 3 April 2014 establishing the Copernicus Programme and repealing Regulation (EU) No 911/2010. [JRC also mentioned in the proposed new space programme regulation, due to enter into force by 1 January 2021]

• Communication of the Commission on Data, information and knowledge management at the European Commission (COM(2016)6626-final)

• Communication from the Commission on the European Cloud Initiative (COM(2016) 178 final): The Commission and participating Member States should develop and deploy a large scale European HPC, data and network infrastructure, including: the establishment of a European Big Data centre, e.g. hosted by JRC for multidisciplinary data but focused on INSPIRE/GEOSS/Copernicus spatial data [COM(2016) 178 final].

• Communication from the Commission on Artificial Intelligence for Europe (COM(2018) 237 final).

Page 5: The Big Data Platform Initiative of the EC Joint Research

Project milestones

• 2015: survey of user needs and proposal of solutions addressing their needs; endorsement of the concept of JRC Earth Observation Data and Processing Platform (JEODPP)

• 2016: procurement of hardware and first batch processing service with massive runs

• 2017: release of interactive visualisation/analysis and deployment of remote desktop services

• 2018: multi-petabyte extension, development of machine learning capabilities, JIPlib release, continuously expanding user base

Page 6: The Big Data Platform Initiative of the EC Joint Research

Big geospatial data for policy

[Diagram labels: Data (Volume, Velocity, Variety) · Big data · Policy relevant information · Indicators · Decisions. Thematic domains: atmosphere, marine, land, climate, emergency, security.]

Exploit data volume, velocity, and variety to generate policy-relevant information

• Using FAIR data principles (findable, accessible, interoperable, reusable)

• With data mining competence in a shared and collaborative environment

• Relying on reproducible workflows

directives, legislation, communications, …

Earth Observation, in situ, crowdsourcing, social sensing, text data, web scraping, …

Page 7: The Big Data Platform Initiative of the EC Joint Research

JRC Big Data Platform: Conceptual representation

Page 8: The Big Data Platform Initiative of the EC Joint Research

Infrastructure

Based on commodity hardware and open-source software stack:

• Storage

    • CERN EOS distributed file system

    • Currently 5 PiB net capacity

    • 2 more PiB net for development/testing

• Processing servers (batch processing)

    • 1,400 cores over 35 nodes

    • 3 GPU servers

    • extensions including further GPU servers in late 2018
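To put the figures above in perspective, the following is a small worked calculation. It only uses the numbers quoted on the slide; the assumption that cores are spread evenly across nodes is ours, so the result is an average rather than a hardware specification.

```python
# Figures quoted on the slide.
TOTAL_CORES = 1400
NODES = 35
NET_CAPACITY_PIB = 5   # production storage (net)
DEV_CAPACITY_PIB = 2   # development/testing storage (net)

BYTES_PER_PIB = 2 ** 50   # binary prefix: 1 PiB = 2^50 bytes
BYTES_PER_PB = 10 ** 15   # decimal prefix: 1 PB = 10^15 bytes


def pib_to_pb(pib: float) -> float:
    """Convert pebibytes (binary prefix) to petabytes (decimal prefix)."""
    return pib * BYTES_PER_PIB / BYTES_PER_PB


# Average only: the slide does not say the nodes are identical.
cores_per_node = TOTAL_CORES / NODES
total_pib = NET_CAPACITY_PIB + DEV_CAPACITY_PIB

print(f"Average cores per node: {cores_per_node:.0f}")
print(f"Production storage: {pib_to_pb(NET_CAPACITY_PIB):.2f} PB")
print(f"Production + dev/test: {pib_to_pb(total_pib):.2f} PB")
```

The distinction between PiB and PB matters at this scale: a capacity expressed in binary pebibytes is roughly 12.6% larger than the same number of decimal petabytes.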

Page 9: The Big Data Platform Initiative of the EC Joint Research

JEODPP in

As of September 2017

As of September 2018

Page 10: The Big Data Platform Initiative of the EC Joint Research

Main software stack

Source: Soille et al., Future Generation Computer Systems, 2017, DOI: 10.1016/j.future.2017.11.007 (in press)

Page 11: The Big Data Platform Initiative of the EC Joint Research

JEODPP access modes [WIKI Link]

• EOS CIFS mount from desktop client (read-only)

• Netapp CIFS mount (read/write) for data transfer

• Terminal service (remote desktop) https://cidportal.jrc.ec.europa.eu/apps/terminal/

• Document & data sharing based on NextCloud https://cidportal.jrc.ec.europa.eu/apps/cloud/ (planned federation with JRCBox)

• FTPS for file transfer to EOS

• JHub https://cidportal.jrc.ec.europa.eu/jhub/ for

    • interactive visualisation and analysis

    • tailored Docker containers for development
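As an illustration of the FTPS access mode listed above, here is a minimal sketch using Python's standard `ftplib.FTP_TLS`. The host name, credentials, and remote file name are placeholders, not actual JEODPP endpoints, and the helper functions `open_session` and `upload_bytes` are hypothetical names introduced for this example.

```python
import io
from ftplib import FTP_TLS


def open_session(host: str, user: str, password: str) -> FTP_TLS:
    """Open an explicit-TLS FTP session (host and credentials are placeholders)."""
    ftps = FTP_TLS(host)        # connects to the server
    ftps.login(user, password)  # secures the control channel, then authenticates
    return ftps


def upload_bytes(ftps, payload: bytes, remote_name: str) -> None:
    """Send `payload` to the server under `remote_name` over an FTPS session."""
    ftps.prot_p()  # encrypt the data channel as well as the control channel
    ftps.storbinary(f"STOR {remote_name}", io.BytesIO(payload))


# Hypothetical usage (not executed here, since it needs a real server):
# ftps = open_session("ftps.example.org", "user", "secret")
# upload_bytes(ftps, b"...", "scene.tif")
# ftps.quit()
```

Keeping the upload step separate from session setup makes it easy to reuse one authenticated session for many files.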

Page 12: The Big Data Platform Initiative of the EC Joint Research

JEODPP current space usage

Page 13: The Big Data Platform Initiative of the EC Joint Research

Connecting storage and processing via cloud sharing services

Page 14: The Big Data Platform Initiative of the EC Joint Research

Low-level batch processing

• Running large-scale data processing tasks in a cluster environment

• Docker containers for flexible management of processing environments

    • Custom builds for different requirements

    • Facilitates upgrades of processing environment (libraries, tools)

• Run through a workload manager

    • HTCondor scheduler

    • Extensive use for large scale processing/analysis
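The combination of Docker containers and HTCondor described above can be sketched as follows. The function builds the text of an HTCondor submit description for the docker universe; the image name, executable path, and arguments in the usage example are illustrative assumptions, not the platform's actual job configuration, and `docker_submit_description` is a hypothetical helper.

```python
from typing import Sequence


def docker_submit_description(image: str, executable: str,
                              arguments: Sequence[str], n_jobs: int = 1) -> str:
    """Return an HTCondor submit description that runs a job in a Docker image."""
    lines = [
        "universe = docker",               # run each job inside a container
        f"docker_image = {image}",         # image providing the processing environment
        f"executable = {executable}",
        f"arguments = {' '.join(arguments)}",
        "output = job_$(Cluster)_$(Process).out",  # per-job stdout
        "error = job_$(Cluster)_$(Process).err",   # per-job stderr
        "log = job_$(Cluster).log",
        f"queue {n_jobs}",                 # submit n_jobs instances
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: 100 jobs, each receiving its own index via $(Process).
submit_text = docker_submit_description(
    image="jipl-dev:1.0",                  # image name assumed for illustration
    executable="/usr/bin/python3",
    arguments=["process_tile.py", "--tile-index", "$(Process)"],
    n_jobs=100,
)
print(submit_text)
```

Written to a file, such a description would typically be handed to `condor_submit`. Passing `$(Process)` as an argument is one common way to split a large run into independent per-tile jobs.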

Page 15: The Big Data Platform Initiative of the EC Joint Research

JEODPP Batch Processing System

Diverse user environments originating from different:

• libraries

• tools

• software

• versions

• distros: Debian/CentOS

Docker images are built based on user requirements

Container-based cluster management

REPOSITORY                      TAG     Info                         SIZE
jipl_S1toolbox-dev              2.0     snap 4.0                     6.269 GB
jipl_S1toolbox-dev              1.0.1   snap 2.0.2                   6.282 GB
ghsl_se2cor-dev                 1.0     snap 2.0                     4.742 GB
critech_ipython_deltares-dev    1.0     python 2.7                   6.939 GB
marsec_MCR                      1.0     MatLab run time 2015b        3.082 GB
jipl-dev                        1.0                                  3.666 GB
marsec_sumo-dev                 1.0     java 1.8                     2.842 GB
canhemon_grass-dev              1.0     debian testing, python 3.0   3.397 GB
cloudmask-download              v0_2    74994254f754, 11 weeks ago   444.8 MB
(name lost in extraction)       1.0.1                                3.421 GB
cloudmask-download              1.0.0                                3.421 GB
sentinel-download               1.0                                  3.121 GB

Page 16: The Big Data Platform Initiative of the EC Joint Research

Examples of batch processing scientific workflows on JEODPP

Page 17: The Big Data Platform Initiative of the EC Joint Research

JEODPP batch processing monitoring

Page 18: The Big Data Platform Initiative of the EC Joint Research

JEODPP Terminal Service via Web https://cidportal.jrc.ec.europa.eu/apps/terminal/

• A pool of Docker containers running next to the data

• Linux desktop environment

• Standard software installed

    • QGIS, GRASS

    • IDL/ENVI, Matlab (personalised licenses)

    • R (R, R Commander, RStudio)

    • Python, JupyterLab, Jupyter Notebook

    • Additions on request

• Relies on HTML5 and runs in Firefox, Internet Explorer, and Chrome

• For prototyping, ad hoc analysis/visualisation of products, and launching batch processing

Page 19: The Big Data Platform Initiative of the EC Joint Research

JEODPP users

• 35 use-cases

• From 16 units

• Across 8 directorates

Page 20: The Big Data Platform Initiative of the EC Joint Research

Interactive visualization and analysis with Jupyter

• Web interface to visualize and analyze any kind of data in a single document called a Jupyter notebook

• Jupyter notebooks integrate live code, equations, visualizations, and narrative text.

• Facilitate knowledge sharing, collaborative working, and reproducible workflows.

• Suitable for non-programmers by integrating GUIs based on widgets (buttons, sliders, etc.).
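To make the "single document" idea concrete: a notebook file is a JSON document holding a list of cells. The sketch below builds a two-cell notebook with the standard library only. It assumes notebook format version 4; the cell contents and the helper name `make_notebook` are illustrative.

```python
import json


def make_notebook(title: str, code: str) -> dict:
    """Build a minimal notebook (format v4) with one text cell and one code cell."""
    # Narrative text and an equation live together in a markdown cell.
    markdown = "# " + title + "\n\n" + r"Sample mean: $\bar{x} = \frac{1}{n}\sum_i x_i$"
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": markdown},
            # Live code sits next to the text; outputs are filled in when it runs.
            {"cell_type": "code", "execution_count": None, "metadata": {},
             "outputs": [], "source": code},
        ],
    }


notebook = make_notebook("Demo analysis", "values = [1, 2, 3]\nsum(values) / len(values)")
notebook_json = json.dumps(notebook, indent=1)  # text that could be saved as demo.ipynb
print(notebook_json[:80])
```

Because the whole analysis is one plain JSON file, it can be shared, versioned, and re-executed by colleagues, which is what supports the reproducibility and knowledge-sharing points above.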

Page 21: The Big Data Platform Initiative of the EC Joint Research

Jupyter ecosystem

http://jupyter.org/

Page 22: The Big Data Platform Initiative of the EC Joint Research

JupyterLab ecosystem (evolution of Jupyter)

Page 23: The Big Data Platform Initiative of the EC Joint Research

ipyleaflet

https://github.com/ellisonbg/ipyleaflet

Page 24: The Big Data Platform Initiative of the EC Joint Research

ipywidgets and bqplot

https://github.com/jupyter-widgets/ipywidgets https://github.com/bloomberg/bqplot

Page 25: The Big Data Platform Initiative of the EC Joint Research

From big data to interactive rendering and analysis

Source: FGCS, 2017, DOI: 10.1016/j.future.2017.11.007

+ in Situ data

Page 26: The Big Data Platform Initiative of the EC Joint Research

Global Human Settlement Layer with Global Surface Water Occurrence on top of Global S1 mosaic

Page 27: The Big Data Platform Initiative of the EC Joint Research

HTML export to facilitate outreach (example with ALOS DEM)

Page 28: The Big Data Platform Initiative of the EC Joint Research

Execution of arbitrary Python code in interactive mode (e.g. for MSPA)

Page 29: The Big Data Platform Initiative of the EC Joint Research

Takeaway messages

• Exponential growth of data and data sources.

• The big data paradigm is permeating all fields.

• FAIR data principles also apply to data analysis.

• Turning data into insights is a challenge; platforms that co-locate data with processing make it easier.

• Jupyter notebooks contribute to reproducible analysis as well as knowledge sharing and collaborative working.

• Importance of interactive analysis and visualisation.

• Open standards including open API are needed to avoid platform lock-in.

Page 30: The Big Data Platform Initiative of the EC Joint Research

Project evolution: Big Data Analytics (2019-2020)

• Innovative approaches (AI/machine learning) for combining large amounts of data originating from different sources

• Enabled by the JRC Big Data Platform (JEODPP)

• Initial focus on geospatial data and their combination with other data sources

• Key enabler of data and knowledge sharing across JRC and towards partners

• Link with DIAS (support to DG GROW and possible partnership with WEkEO DIAS)

• Key role of openEO H2020 project (definition of common API)

Page 31: The Big Data Platform Initiative of the EC Joint Research

Thank you for your attention!

EO&SS@BigData pilot project, Unit I.3 Text and Data Mining, Directorate I Competences

GEO-WEEK, Washington DC, Oct 2017

https://doi.org/10.1016/j.future.2017.11.007 Publication list: https://cidportal.jrc.ec.europa.eu/home/publications