Efficient and effective: can we combine both to realize high-value, open, scalable, multi-disciplinary data and compute infrastructures?


RIA-653549

Davide Salomoni, INDIGO-DataCloud Project Coordinator

[email protected]

FAIR data management, RDA National Event, Firenze, 14-15 November 2016


Efficient and Effective

14-15/11/2016 Efficient and Effective: INDIGO-DataCloud 2

Something is still missing in the Cloud world…

Source: http://goo.gl/wT8XEq



What are the main missing points?

• Open interoperation / federation across (proprietary) Cloud solutions at
  • IaaS,
  • PaaS,
  • and SaaS levels

• Managing multitenancy
  • At large scale…
  • … and in heterogeneous environments

• Dynamic and seamless elasticity
  • For both private and public cloud…
  • … and for complex or infrequent requirements

• Data management in a Cloud environment
  • Due to technical…
  • … as well as to legal problems


Filling these gaps should lead to:

• Interoperable PaaS/SaaS services addressing both public and private Cloud infrastructures.

• Migration of legacy applications to the Cloud.

• Increased focus on user-oriented, high-value solutions.

Source: https://goo.gl/cWZhKN



INDIGO-DataCloud (INtegrating Distributed data Infrastructures for Global ExplOitation)

• An H2020 project approved in the EINFRA-1-2014 call
  • 11.1M€, 30 months (from April 2015 to September 2017)

• Who: 26 European partners in 11 European countries
  • Coordination by INFN (Italian National Inst. for Nuclear Physics)
  • Including developers of distributed software, industrial partners, research institutes, universities, e-infrastructures

• What: develop an open source Cloud platform for computing and data (“DataCloud”), tailored to science.

• Where: deployable on hybrid (public or private) Cloud infrastructures

• For: multi-disciplinary scientific communities
  • E.g. structural biology, earth science, physics, bioinformatics, cultural heritage, astrophysics, life science, climatology.

• Why: to answer the technological needs of scientists seeking to easily and efficiently exploit distributed compute and data resources.


INDIGO-DataCloud Positioning

• INDIGO aims to:

1. Develop open, interoperable solutions for scientific data.

2. Support open science organizing the European data space.

3. Enable collaborations across diverse scientific communities worldwide.

• INDIGO offers its architecture, analysis, expertise and software components as a concrete step toward the definition and implementation of a European Open Science Cloud and Data Infrastructure.


[Diagram. Boxes: Scientific Users; INDIGO Advanced Components and Solutions; Publicly funded e-infrastructures (EGI, EUDAT, GEANT, PRACE, RI, etc.); Private or Commercial Clouds (Public, PCP-based, etc.); Datasets, Resources; Scientific Results. Arrow labels: Adopt, Use; Deployed on; Exploiting; To produce.]

The INDIGO Foundations


• Put Users First

• Exploit Software Development Know-how

• Fill Technology Gaps

• Validate through Concrete Use Cases

• Extend and Reuse Open Source Software

• Be Multidisciplinary, Standards-based

Put Users First

• Requirements come from research communities
  • “The proposal is oriented to support the use of different e-infrastructures by a wide-range of scientific communities, and aims to address a wide range of challenging requirements posed by leading-edge research activities” (From the DoW)

• We gathered use cases from many scientific communities.

• LifeWatch, EuroBioImaging, INSTRUCT, LBT, CTA, WeNMR, ENES, eCulture Science Gateway, ELIXIR, EMSO, DARIAH, WLCG.

• We grouped ~100 distinct requirements into 3 categories: Computational requirements, Storage requirements, Requirements on infrastructures, and associated each one with a ranking (mandatory / convenient / optional).
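The grouping described above (three categories, each requirement carrying a ranking) can be pictured as a small data structure. The following is only a minimal sketch, not the format used in Deliverable D2.1: the class names, the `select` and `ranking_counts` helpers, and all sample entries are hypothetical and exist purely for illustration. Only the category names, the ranking values, and the community names come from the slides.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The three categories named on the slide.
    COMPUTATIONAL = "Computational requirements"
    STORAGE = "Storage requirements"
    INFRASTRUCTURE = "Requirements on infrastructures"


class Ranking(Enum):
    # The three rankings named on the slide.
    MANDATORY = "mandatory"
    CONVENIENT = "convenient"
    OPTIONAL = "optional"


@dataclass(frozen=True)
class Requirement:
    community: str
    summary: str
    category: Category
    ranking: Ranking


# Hypothetical sample entries, for illustration only (not taken from D2.1).
SAMPLE = [
    Requirement("WeNMR", "run containerised batch jobs",
                Category.COMPUTATIONAL, Ranking.MANDATORY),
    Requirement("LifeWatch", "share large datasets across sites",
                Category.STORAGE, Ranking.MANDATORY),
    Requirement("ENES", "scale resources on demand",
                Category.INFRASTRUCTURE, Ranking.CONVENIENT),
    Requirement("CTA", "graphical job monitoring",
                Category.COMPUTATIONAL, Ranking.OPTIONAL),
]


def select(requirements, category=None, ranking=None):
    """Return the requirements matching the given category and/or ranking."""
    return [
        r for r in requirements
        if (category is None or r.category is category)
        and (ranking is None or r.ranking is ranking)
    ]


def ranking_counts(requirements):
    """Count how many requirements fall under each ranking."""
    return Counter(r.ranking for r in requirements)


if __name__ == "__main__":
    for r in select(SAMPLE, ranking=Ranking.MANDATORY):
        print(f"{r.community}: {r.summary} [{r.category.value}]")
    print(ranking_counts(SAMPLE))
```

Filtering on the mandatory ranking first is one plausible way to decide which requirements a release must cover before the convenient and optional ones are considered.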


From Deliverable D2.1

Translating requirements into concrete solutions: From the architecture…


This is the INDIGO-DataCloud General Architecture*

*: see details in http://arxiv.org/abs/1603.09536 or in https://www.indigo-datacloud.eu/documents-deliverables


… to the implementation…


This is our software improvement cycle and the integration / release / software quality processes

… to INDIGO Releases…

Releasing software components implementing the INDIGO architecture and providing concrete solutions to the requirements of scientific communities is the primary goal of the project.


See https://www.indigo-datacloud.eu/communication-kit


… and results.


Excerpt from an INDIGO Report detailing how scientific communities are implementing their own requirements into applications using INDIGO-DataCloud components.

From Deliverable D2.10

Four main “solution blocks”:
• Data Center Solutions
• Data / Storage Solutions
• Automated Solutions
• User-Oriented Solutions

And “common solutions”:
• Authentication and Authorization


Putting everything together:

Index of Services in INDIGO MidnightBlue


INDIGO Components and Patches Already Merged in Upstream Open Source Projects

• OpenStack (https://www.openstack.org)
  • Nova Docker
  • Heat
  • OpenID-Connect for Keystone
  • Pre-emptible instances support (under discussion)

• OpenNebula (http://opennebula.org)
  • OneDock

• Infrastructure Manager (http://www.grycap.upv.es/im/index.php)

• Clues (http://www.grycap.upv.es/clues/eng/index.php)

• Onedata (https://onedata.org)

• TOSCA adaptor for JSAGA (http://software.in2p3.fr/jsaga/dev/)

• OCCI implementation for OpenStack (https://github.com/openstack/ooi)

• Extended AWS support for rOCCI in OpenNebula. Python and Java libraries for OCCI support.

• CDMI and QoS extensions for dCache (https://www.dcache.org)

• Workflow interface extensions for Ophidia (http://ophidia.cmcc.it)

• OpenID Connect Java implementation for dCache (https://www.dcache.org)

• MitreID (https://mitreid.org/) and OpenID Connect (http://openid.net/connect/) libraries



On Data Ingestion and Data Management


• Constantly align the vision of the different research communities with current recommendations, in particular of the Research Data Alliance (RDA).

• The exploitation of INDIGO-DataCloud solutions requires a careful consideration of data management issues along the full data life cycle to prepare proper Data Management Plans (DMP).

• There is certainly the need for further work to inform the different Research Communities of current recommendations on data management, the need to carefully take them into account, and to further detail those data management needs as requirements to software developers.

• Most of the initial requirements have already been satisfied in the INDIGO MidnightBlue release! However, more work is needed in many areas.

6 INDIGO-related proposals submitted to the RDA open call for collaboration

Data Life Cycle


• Plan
  • Tool
  • Deploy and Store

• Collect
  • Store raw data
  • Manage raw data

• Curate
  • Filtering
  • Conversions

• Analyze
  • Get derived values
  • Monitor
  • Run models, etc.

• Publish
  • Findable
  • Accessible
  • Interoperable
  • Reusable

• Preserve
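The life cycle above reads as an ordered sequence of stages, each with a few associated activities. The sketch below encodes that reading as plain Python data. It assumes that each group of activities belongs to the stage named just before it on the slide (for instance, "Tool" and "Deploy and Store" under Plan), which is an interpretation of the slide layout. The `next_stage` and `activities` helpers are hypothetical names introduced here and are not part of any INDIGO-DataCloud component.

```python
# Stages and activities as read from the slide; the assignment of
# activities to stages is an assumption based on the slide order.
LIFE_CYCLE = [
    ("Plan", ["Tool", "Deploy and Store"]),
    ("Collect", ["Store raw data", "Manage raw data"]),
    ("Curate", ["Filtering", "Conversions"]),
    ("Analyze", ["Get derived values", "Monitor", "Run models, etc."]),
    ("Publish", ["Findable", "Accessible", "Interoperable", "Reusable"]),
    ("Preserve", []),
]

STAGES = [name for name, _ in LIFE_CYCLE]


def activities(stage):
    """Return the activities listed under a stage."""
    return dict(LIFE_CYCLE)[stage]


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one.

    Raises ValueError if `stage` is not a known stage name.
    """
    i = STAGES.index(stage)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None


if __name__ == "__main__":
    stage = STAGES[0]
    while stage is not None:
        print(stage, "->", ", ".join(activities(stage)) or "(no activities listed)")
        stage = next_stage(stage)
```

Keeping the stages in a list rather than a dictionary makes the ordering explicit, which matters when a Data Management Plan has to state what happens to the data at each step.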

Ingested Data in the Life Cycle scheme (vastly simplified)


For more details: D2.11, https://owncloud.indigo-datacloud.eu/index.php/s/lLNAczJNBNLmLLG


Conclusions

• It is often complicated to combine efficient and effective solutions when trying to exploit distributed data/compute resources.

• INDIGO-DataCloud has defined and developed a comprehensive open architecture to handle distributed data and workloads, extending open source products.

• It has already released a novel and rich set of components that multiple research communities are adopting for the deployment of scientific applications on hybrid Grid/Cloud infrastructures.

• INDIGO-DataCloud will now focus on consolidating its software, adding requested new features, deploying it in production e-infrastructures and addressing exploitation through concrete links to commercial companies, to other projects or organizations and to current / upcoming EU calls.

• You are all welcome to contribute and share your views and requirements!


Thank you

https://www.indigo-datacloud.eu

Better Software for Better Science.


@indigodatacloud www.indigo-datacloud.eu https://www.facebook.com/indigodatacloud/
