D3.3 Data Virtualization SDK prototype

Project Acronym: DITAS
Project Title: Data-intensive applications Improvement by moving daTA and computation in mixed cloud/fog environmentS
Project Number: 731945
Instrument: Collaborative Project
Start Date: 01/01/2017
Duration: 36 months
Thematic Priority: ICT-06-2016 Cloud Computing
Website: http://www.ditas-project.eu
Dissemination level: Public
Work Package: WP3 Data virtualization
Due Date: M30
Submission Date: 02/07/2019
Version: 1.0
Status: Final for submission
Author(s): Alexandros Psychas (ICCS), Achilleas Marinakis (ICCS),
Vrettos Moulos (ICCS), Jose Antonio Sanchez (ATOS),
Frank Pallas (TUB), Sebastian Werner (TUB), Maya
Anderson (IBM), Mattia Salnitri (POLIMI), Giovanni
Meroni (POLIMI), Monica Vitali (POLIMI)
Reviewer(s) Monica Vitali (POLIMI), Maya Anderson (IBM)
This project has received funding by the European Union’s Horizon
2020 research and innovation programme under grant agreement
No. 731945
© Main editor and other members of the DITAS consortium
Version History

| Version | Date | Comments, Changes, Status | Authors, contributors, reviewers |
|---|---|---|---|
| 0.1 | 01/05/2019 | ToC creation | Alexandros Psychas (ICCS) |
| 0.2 | 20/05/2019 | Blueprint sections 1, 5 | Achilleas Marinakis (ICCS) |
| 0.3 | 31/05/2019 | Component Architecture, DAL, Application Profiling | Alexandros Psychas (ICCS), Achilleas Marinakis (ICCS), Vrettos Moulos (ICCS), Jose Antonio Sanchez (ATOS), Frank Pallas (TUB), Sebastian Werner (TUB), Maya Anderson (IBM), Mattia Salnitri (POLIMI), Giovanni Meroni (POLIMI), Monica Vitali (POLIMI) |
| 0.7 | 7/06/2019 | Internal review version | Alexandros Psychas, Achilleas Marinakis (ICCS) |
| 0.91 | 21/06/2019 | Internal review comments | Monica Vitali (POLIMI), Maya Anderson (IBM) |
| 0.92 | 24/06/2019 | Addressed internal review comments | Alexandros Psychas, Achilleas Marinakis (ICCS) |
| 0.93 | 28/06/2019 | Final version for quality check | Alexandros Psychas, Achilleas Marinakis (ICCS) |
| 1.0 | 02/07/2019 | Quality check. Document ready for submission. | Enric Pages, María Teresa García (ATOS) |
Contents Version History 2
Contents 3
List of Figures 4
List of tables 4
Executive Summary 5
1 Introduction 6
1.1 Glossary of Acronyms 8
2 Final Abstract VDC Blueprint Schema 9
2.1 Internal Structure (Blueprint Section 1) 9
2.2 Data Management (Blueprint Section 2) 20
2.3 Abstract Properties (Blueprint Section 3) 22
2.5 Cookbook Appendix (Blueprint Section 4) 23
2.6 Exposed API (Blueprint Section 5) 32
3 Final Component Architecture and specification 35
3.1 VDC Blueprint Repository Engine 36
3.1.1 VDC Blueprint Validator 36
3.2 VDC Blueprint Resolution Engine 37
3.2.1 Content Based Resolution 38
Component API 39
3.2.2 Data Utility Resolution Engine 39
Component API 41
3.2.3 Privacy Security Evaluator 41
Component API 43
3.2.4 Recommendation Component 44
Component API 46
4 Data Access Layer (DAL) 47
Component API 51
5 Application Profiling and Deployment Strategies 54
6 DITAS SDK 56
7 Conclusions 58
8 References 59
Appendix 60
List of Figures

Figure 1: Abstract VDC Blueprint lifecycle ... 7
Figure 2: Resolution Engine Architecture ... 35
Figure 3: Creation & Storage phase of the Blueprint Lifecycle ... 36
Figure 4: Content Based Search Sequence Diagram ... 38
Figure 5: Data Utility Resolution Sequence Diagram ... 40
Figure 6: Simplified Architecture ... 42
Figure 7: Filtering process of PSE Sequence Diagram ... 43
Figure 8: Recommendation Component Sequence Diagram ... 45
Figure 9: Initialization of DAL Data Movement Sequence Diagram ... 48
Figure 10: Finalization of DAL Data Movement Sequence Diagram ... 48
Figure 11: Data transformation Sequence Diagram ... 49
Figure 12: DAL Interconnection with CAF and Privacy Enforcement Engine ... 50

List of Tables

Table 1. Acronyms ... 8
Executive Summary

This document describes the technical details of the components created in the context of WP3 and the DITAS SDK. More specifically, it covers the changes that took place in the development and implementation of the components responsible for the lifecycle of the VDC, from creation to final deployment. Regarding the VDC Blueprint, it documents all the changes made to the schema, both to better describe the structure of the VDC and to address the requirements of the components that consume it. The changes in functionality and the new features of the components are also documented. In the context of WP3, the Data Access Layer (DAL) is fully described and analyzed. This component was developed and implemented to help the Data Administrator dictate the policies, purposes and security measures needed to expose the data. Finally, the DITAS SDK section describes the services, UIs and guidelines that aid each stakeholder (Data Administrator, Application Developer, Application Designer and DITAS Operator) involved in the creation, storage, deployment, selection and operation of the VDC.
1 Introduction

One of the main missions of WP3 is to produce the DITAS-SDK, which contains all the information, services, guidelines, and tools needed to support the data-intensive Application Designer in creating a complete solution. More specifically, the DITAS-SDK aims at improving the productivity of the Application Designer in developing and deploying a data-intensive application. Another important task of the SDK is to help the Data Administrator enhance data management in cloud and fog environments. To reach these objectives, the DITAS-SDK components were created to support the full lifecycle of the VDC.
The VDC provides an abstraction layer that takes care of retrieving, processing and delivering data with the proper quality level, while putting special emphasis on data security, performance, privacy, and data protection. Acting as a middleware, the VDC takes responsibility for providing this data timely, securely and accurately, hiding the complexity of the underlying infrastructure from the Application Designer, who only has to define the requirements of the application in order to find the most appropriate VDC. The infrastructure may consist of different platforms, storage systems, and network capabilities. The VDC Blueprint describes the VDC thoroughly, since it includes, among others, information about its business characteristics, the data sources that the VDC connects to, how to deploy it, and the API that the data administrator exposes to the data consumers. Consequently, the VDC lifecycle is essentially the lifecycle of the Abstract VDC Blueprint, which is the definition and description of the VDC.
The Abstract Blueprint lifecycle covers all the phases and components involved from the creation, through the discovery, to the deployment of a VDC Blueprint. This lifecycle was established and described in D3.2, Section 1 [2]. The main developments in this process concerned the individual components, without affecting the general architecture.
Figure 1: Abstract VDC Blueprint lifecycle
Although the VDC is responsible for the complete orchestration and management of the data, a layer is needed that manages which data are exposed and under which privacy and security parameters. In the context of WP3, the Data Access Layer (DAL) was created, which has the fundamental role of exposing the data provided by the Data Administrator to the DITAS-EE infrastructure without violating any privacy and security constraints. More specifically, the DAL contains the Privacy Enforcement Layer, the component in charge of modifying the SQL query generated by a data consumer who wants to access the stored data. The query is executed in order to satisfy the call coming from the Processing Layer. However, different consumers might have different rights on the data they are allowed to see. To achieve this customized access to the data, the DAL transforms the original query into an SQL query that applies filters to avoid returning data that cannot be seen externally. This filtering is also affected by the location of the VDC (e.g., some data cannot be accessed from outside a safe location). Since the DAL is a fairly new component, its full architecture, functionalities, and API description are presented in this document.
The rest of the document is structured as follows. Section 2 describes the blueprint schema in detail, along with the changes that took place in the course of the project, which components interact with the blueprint, and how the requirements are addressed. Section 3 describes the final architecture of the components involved in the Blueprint lifecycle; most importantly, it documents the updates and changes in the components as well as the final API. Section 4 contains the architecture and component specifications of the Data Access Layer (DAL). Section 5 presents the work done on application profiling and deployment strategies in the context of WP3. Section 6 introduces the DITAS SDK. It is important to mention that this deliverable describes the main characteristics of the SDK and the idea behind its creation and implementation; the complete SDK is a wiki page that is continuously updated with new information and functionalities as the project evolves, making it accessible to a wider audience through the DITAS web page. Finally, the conclusions and future steps are presented in Section 7 of this deliverable.
1.1 Glossary of Acronyms
Acronym Definition
API Application Programming Interface
CAF Common Accessibility Framework
CLI Command-Line Interface
CRUD Create Read Update Delete
D Deliverable
DAL Data Access Layer
DB Data Base
DME Data Movement Enactor
DS4M Decision System for Data and Computation Movement
DUE Data Utility Evaluator
DUR Data Utility Refinement
DURE Data Utility Resolution Engine
EE Execution Environment
GUI Graphical User Interface
HTTP Hypertext Transfer Protocol
IAM Identity Access Management
JSON JavaScript Object Notation
N/A Not Applicable
PSE Privacy & Security Evaluator
PSES Privacy & Security Evaluator Service
REST Representational State Transfer
SDK Software Development Kit
SLA Service-Level Agreement
SQL Structured Query Language
UI User Interface
URI Uniform Resource Identifier
URL Uniform Resource Locator
UUID Universally Unique IDentifier
VDC Virtual Data Container
WP Work Package

Table 1. Acronyms
2 Final Abstract VDC Blueprint Schema

An abstract VDC Blueprint captures all the properties of a VDC and is developed according to the abstract VDC Blueprint schema. The latter is a general schema, through which all the abstract VDC Blueprints are created. It follows the JSON semi-structured format in order to address requirements T3.9, T3.11 and T3.12, which require the notation language to be accurate, user friendly, human readable and efficiently parsable. The Blueprint consists of five distinct sections, each of which is used by different DITAS roles or components. For instance, the Content Based Resolution component of the DITAS architecture uses section 1 to filter the blueprints based on the content they provide, whereas the Data Utility Resolution Engine component uses section 2 to match the application requirements with the data administrator capabilities. The following tables describe each field of the blueprint schema (Requirement B3.4), focusing on the updates compared to D3.2. The traceability of each field to the requirements is also presented. The complete list of the requirements can be found in Annex 1 of D1.2 [4].
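To make the five-section structure concrete, the sketch below shows a minimal skeleton of an abstract blueprint as a Python dictionary. The EXPOSED_API key name appears in this document (EXPOSED_API.paths); the other section key names and all values are assumptions for illustration only.

```python
import json

# Hypothetical skeleton of an abstract VDC Blueprint with its five sections.
# Only EXPOSED_API is a key name confirmed by this document; the rest are
# illustrative placeholders.
abstract_blueprint = {
    "INTERNAL_STRUCTURE": {        # Section 1: Overview, Data_Sources, Flow, ...
        "Overview": {"name": "example-vdc", "description": "...", "tags": []},
    },
    "DATA_MANAGEMENT": [],         # Section 2: per-method data utility/security/privacy
    "ABSTRACT_PROPERTIES": {},     # Section 3: empty in the abstract blueprint
    "COOKBOOK_APPENDIX": {},       # Section 4: deployment information
    "EXPOSED_API": {"paths": {}},  # Section 5: the API exposed to data consumers
}

# Being plain JSON, the blueprint can be serialized and parsed efficiently.
roundtripped = json.loads(json.dumps(abstract_blueprint))
```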
2.1 Internal Structure (Blueprint Section 1)

While the scope of this blueprint section remains the same with respect to the version described in D3.2 Section 3.1 [2], the table that analyzes each field of the blueprint has been enriched to include additional information about who (Role/Component column) and when and why (Phase/Process column) each field is used.
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Overview | Object | Information about the content of the data served by the VDC | Resolution Engine | Blueprint Selection Phase |
| Overview.name | String | The name of the VDC Blueprint | N/A | N/A |
| Overview.description | String | Textual description of the VDC Blueprint | Resolution Engine | Blueprint Selection Phase |
| Overview.tags | Array | Each element of this array contains keywords that describe the exposed data of each VDC method | Resolution Engine | Blueprint Selection Phase |
| Overview.tags.method_id | String | The id (operationId) of the method (as indicated in the EXPOSED_API.paths field) | Resolution Engine | Blueprint Selection Phase |
| Overview.tags.tags | Array | Keywords that describe the exposed data of this specific VDC method | Resolution Engine | Blueprint Selection Phase |
With respect to the version described in D3.2 Section 3.1 [2], the Overview field contains the same information.
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Data_Sources | Array | Data sources used by this VDC | Deployment Engine | VDC deployment and movement |
| Data_Sources.items.properties.id | String | A unique identifier of this data source | Deployment Engine | VDC deployment and movement |
| Data_Sources.items.properties.description | String | Description | Deployment Engine | VDC deployment and movement |
| Data_Sources.items.properties.location | Enum | Cloud/Edge | Deployment Engine | VDC deployment and movement |
| Data_Sources.items.properties.class | Enum | "relational database", "object storage", "time-series database", "api", "data stream" | Deployment Engine | VDC deployment and movement |
| Data_Sources.items.properties.type | Enum | "MySQL", "Minio", "InfluxDB", "rest", "other" | Deployment Engine | VDC deployment and movement |
| Data_Sources.items.properties.parameters | Object | Connection parameters | Deployment Engine | VDC deployment and movement |
| Data_Sources.items.properties.schema | Object | Schema | Deployment Engine | VDC deployment and movement |
Additional properties, such as the data source id, have been added to the Data_Sources field for use by the deployment engine when configuring the VDC and by the DS4M. This addresses requirement T2.1, which mandates that metadata describing the data sources be available.
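An individual data source entry might look as follows; the field names come from the table above, while the identifier and connection values are invented for illustration.

```python
# Illustrative Data_Sources entry. Field names follow the table above;
# all concrete values are hypothetical.
data_source = {
    "id": "patient-db",                     # unique identifier, also used by the DS4M
    "description": "Relational store with patient records",
    "location": "cloud",                    # enum: cloud / edge
    "class": "relational database",
    "type": "MySQL",
    "parameters": {"host": "db.example.org", "port": 3306},  # connection parameters
    "schema": {},                           # schema of the data source
}

# The class must be one of the enum values listed in the table.
allowed_classes = {
    "relational database", "object storage",
    "time-series database", "api", "data stream",
}
```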
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Methods_Input | Object | This field contains the part of the data source that each method needs to be executed | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods | Array | The list of methods | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.method_id | String | The id (operationId) of the method (as indicated in the EXPOSED_API.paths field) | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.dataSources | Array | The list of data sources required by the method | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.dataSources.dataSource_id | String | The id of the data source (as indicated in the Data_Sources field) | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.dataSources.dataSource_type | String | The type of the data source (relational/not_relational/object) | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.dataSources.database | Array | The list of databases required by a method in a data source | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.dataSources.database.database_id | String | The id of the database | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.dataSources.database.tables | Array | The list of tables/collections required by a method in a data source | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.dataSources.database.tables.table_id | String | The id of the table/collection | DS4M | Data and computation movement, Monitoring |
| Methods_Input.Methods.dataSources.database.tables.columns | Array | The IDs of the columns/fields to be moved | DS4M | Data and computation movement, Monitoring |
The Methods_Input field describes which parts of each data source are used by each method. It was inserted in the blueprint to allow the Decision System for Data and Computation Movement (DS4M) to move only the portion of data that is actually used [3]. This new section reduces the amount of data to be transferred and stored when data sources are replicated (Objective 2.5) and allows the data utility to be correctly estimated for each method (Requirement T3.18).
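A sketch of a Methods_Input fragment is shown below, together with the kind of traversal the DS4M could perform to collect the exact columns a method needs. The method id, database, table and column names are all hypothetical.

```python
# Hypothetical Methods_Input fragment: one method and the portion of one
# data source it reads. Identifiers are illustrative.
methods_input = {
    "Methods": [
        {
            "method_id": "getPatientBiographicalData",
            "dataSources": [
                {
                    "dataSource_id": "patient-db",
                    "dataSource_type": "relational",
                    "database": [
                        {
                            "database_id": "hospital",
                            "tables": [
                                {"table_id": "patients",
                                 "columns": ["name", "birthDate"]},
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}

# Collect every column any method requires, i.e. the minimal portion of data
# that would need to be moved when the data source is replicated.
columns_to_move = {
    column
    for method in methods_input["Methods"]
    for source in method["dataSources"]
    for database in source["database"]
    for table in database["tables"]
    for column in table["columns"]
}
```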
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Flow | Object | The data flow that implements the VDC | N/A | N/A |
| Flow.platform | Enum | Spark or Node-RED | N/A | N/A |
| Flow.parameters | Object | Platform details (for Spark) | N/A | N/A |
| Flow.source_code | Any JSON structure | The flow JSON file (for Node-RED) | N/A | N/A |
With respect to the version described in D3.2 Section 3.1 [2], the Flow field contains the same information.
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| DAL_Images | Object | Set of Docker images to include in the DALs associated to this VDC. The key is a unique DAL identifier while the values are the image information | Deployment Engine | VDC Deployment and movement |
| DAL_Images.[dal_id].original_ip | String | IP where the original DAL has been deployed | Deployment Engine | VDC Deployment and movement |
| DAL_Images.[dal_id].images | Object | Set of images to deploy in this DAL implementation. The key is a unique identifier and the values are the image information | N/A | N/A |
| DAL_Images.[dal_id].images.[image_id].image | String | The Docker image name in standard format [repository]/[group]/<image_name>:[version] | Deployment Engine | VDC Deployment and movement |
| DAL_Images.[dal_id].images.[image_id].internal_port | Int | The port on which the software of the image will be listening, if any. This port won't be exposed outside, but it will receive data through redirection. | Deployment Engine | VDC Deployment and movement |
| DAL_Images.[dal_id].images.[image_id].external_port | Int | The port on which the image will be accessible. It will redirect any request to this port to the one specified in internal_port | Deployment Engine | VDC Deployment and movement |
| DAL_Images.[dal_id].images.[image_id].environment | Object | Environment variables to pass to the image in key-value format. | Deployment Engine | VDC Deployment and movement |
The DAL_Images field is new in this version of the schema.
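A DAL_Images fragment following this structure might look as below; the DAL identifier, image name, ports and environment values are made up for illustration.

```python
# Hypothetical DAL_Images fragment. Keys follow the structure above;
# all concrete values (identifiers, IP, image name, ports) are illustrative.
dal_images = {
    "main-dal": {                         # unique DAL identifier
        "original_ip": "192.0.2.10",      # where the original DAL has been deployed
        "images": {
            "privacy-enforcement": {      # unique image identifier
                "image": "registry.example.org/ditas/privacy-enforcement:1.0",
                "internal_port": 8080,    # port the software listens on (not exposed)
                "external_port": 30080,   # reachable port, redirected to internal_port
                "environment": {"LOG_LEVEL": "info"},
            }
        },
    }
}
```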
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| VDC_Images | Object | Set of Docker images to include in the VDC. The key is a unique image identifier while the values are the image information | Deployment Engine | VDC Deployment and movement |
| VDC_Images.[image_id].image | String | The Docker image name in standard format [repository]/[group]/<image_name>:[version] | Deployment Engine | VDC Deployment and movement |
| VDC_Images.[image_id].internal_port | Int | The port on which the software of the image will be listening, if any. This port will not be exposed outside, but it will receive data through redirection. | Deployment Engine | VDC Deployment and movement |
| VDC_Images.[image_id].external_port | Int | The port on which the image will be accessible. It will redirect any request to this port to the one specified in internal_port | Deployment Engine | VDC Deployment and movement |
| VDC_Images.[image_id].environment | Object | Environment variables to pass to the image in key-value format. | Deployment Engine | VDC Deployment and movement |
The VDC_Images field is new in this version of the schema.
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Identity_Access_Management | Object | Information about identity and access management of this VDC | Access Management, Request Monitor | Blueprint Selection Phase |
| Identity_Access_Management.jwks_uri | String | The JWKS URL for getting verification keys | Application, Request Monitor, DAL, any component that needs to validate a token | Blueprint Selection Phase |
| Identity_Access_Management.iam_endpoint | String | The endpoint of the IAM server | Application, Request Monitor, DAL, any component that needs to validate a token | Blueprint Selection Phase |
| Identity_Access_Management.roles | List of Strings | A set of roles that a user might have. | Request Monitor, PSES | Blueprint Selection Phase |
| Identity_Access_Management.provider | List of Objects | A list of identity providers that can be used. Can be empty if only the DITAS internal one is used. | N/A | Blueprint Selection Phase |
| Identity_Access_Management.provider[i].name | String | Name of the provider | Request Monitor | Blueprint Selection Phase |
| Identity_Access_Management.provider[i].type | String | Type of the provider to use. Only OAuth supported for now. | Request Monitor | Blueprint Selection Phase |
| Identity_Access_Management.provider[i].uri | String | Address of the provider. | Request Monitor | Blueprint Selection Phase |
| Identity_Access_Management.provider[i].portal | | Login portal for that provider. | Request Monitor | Blueprint Selection Phase |
The Identity_Access_Management field is new in this version of the schema. It describes how identity access is managed. It was inserted in the blueprint for two main purposes:
● Giving App Developers all the information necessary to authenticate against the VDC
● Enabling pre-filtering of Blueprints that a Developer has no access to
All technical details about identity access management are stored in the cookbook section of the blueprint under the same field name, as that information is not relevant to an application developer but to the DITAS runtime.
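A filled-in Identity_Access_Management fragment could look as follows; the endpoints, roles and provider entries are placeholders, not actual DITAS deployments.

```python
# Hypothetical Identity_Access_Management fragment. All URLs and role names
# are illustrative placeholders.
identity_access_management = {
    "jwks_uri": "https://iam.example.org/certs",     # JWKS URL for verification keys
    "iam_endpoint": "https://iam.example.org/auth",  # endpoint of the IAM server
    "roles": ["doctor", "nurse", "researcher"],      # roles a user might have
    "provider": [
        {
            "name": "hospital-idp",
            "type": "oauth",                          # only OAuth supported for now
            "uri": "https://idp.example.org",
            "portal": "https://idp.example.org/login",
        }
    ],
}
```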
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Testing_Output_Data | Array | Sample dataset per VDC method | DUE | Blueprint Selection Phase |
| Testing_Output_Data.method_id | String | The id of this exposed VDC method | DUE | Blueprint Selection Phase |
| Testing_Output_Data.attributes | Array | The attributes of the output data returned by the method that are required by the end user | DUE | Blueprint Selection Phase |
| Testing_Output_Data.zip_data | String | The URI to the zip sample data for this exposed VDC method | DUE | Blueprint Selection Phase |
| Testing_Output_Data.history_time | Integer | The time interval, expressed in seconds before the current time, to compute the availability of the method. | SLA Manager | Monitoring |
| Testing_Output_Data.history_invocations | Integer | The maximum number of invocations of the method, to compute its availability | SLA Manager | Monitoring |
With respect to the version described in D3.2 Section 3.1 [2], the
Testing_Output_Data field contains almost the same information.
The attributes, history_time and history_invocations subfields are obtained from
the application requirements. Therefore, they are present only in the
intermediate and concrete blueprint. The attributes subfield is used by the Data
Utility Evaluator (DUE) to calculate the data quality only for those output
attributes that are relevant for the application designer [1]. Similarly, the
history_time and history_invocations subfields are used by the SLA manager to
compute the availability for that method based on, respectively, the indicated
time interval and number of service invocations.
The zip_data subfield, instead, may be used by the data administrator in order to
provide a reference to a file that contains a representative sample of the dataset
that a specific VDC method exposes (Requirement T2.2), and by the DUE to
correctly estimate the data utility for each method (Requirement T3.18).
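As a concrete sketch, a Testing_Output_Data entry in an intermediate blueprint might look as below. The method id, sample URI and all values are hypothetical; in an abstract blueprint the attributes, history_time and history_invocations subfields would still be absent, as explained above.

```python
# Hypothetical Testing_Output_Data entry as it might appear in an
# intermediate or concrete blueprint (all values illustrative).
testing_output_data = [
    {
        "method_id": "getPatientBiographicalData",
        # Provided by the data administrator: representative sample dataset.
        "zip_data": "https://data.example.org/samples/biographical-sample.zip",
        # Added at resolution time from the application requirements:
        "attributes": ["name", "birthDate"],  # output attributes relevant to the designer
        "history_time": 3600,        # look back one hour when computing availability
        "history_invocations": 100,  # cap on invocations considered
    }
]
```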
2.2 Data Management (Blueprint Section 2)

The Data Management section of the Blueprint specifies, for each method, the
guaranteed levels of data quality, security and privacy. Such information will be
used (i) for filtering the blueprints that do not fit the Application Designer
requirements; (ii) for the data and computation movement; (iii) as specifications
of metric thresholds agreed with the application developer. For further details,
please see D3.2 Section 3.2 Table 6 and Table 7 [2].
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| method_id | String | The id of this exposed VDC method | DURE, SLA Manager, DS4M | Blueprint Selection Phase, Data and computation movement, Monitoring |
| attributes | Object | Data utility, security and privacy attributes for this exposed VDC method | DURE, SLA Manager, DS4M | Blueprint Selection Phase, Data and computation movement, Monitoring |
| attributes.dataUtility | Array | A list with all the metrics related to data quality. The JSON schema for each metric is presented in the next table | DURE, SLA Manager, DS4M | Blueprint Selection Phase, Data and computation movement, Monitoring |
| attributes.security | Array | A list with all the properties related to security. The JSON schema for each property is identical to the schema used for the data utility metrics | DURE, SLA Manager, DS4M, PSE | Blueprint Selection Phase, Data and computation movement, Monitoring |
| attributes.privacy | Array | A list with all the properties related to privacy. The JSON schema for each property is identical to the schema used for the data utility metrics | DURE, SLA Manager, DS4M, PSE | Blueprint Selection Phase, Data and computation movement, Monitoring |
With respect to the contents described in D3.2 Section 3.2 [2], the data structure
was slightly changed. In particular, properties associated to a metric are no
longer defined as JSON objects. Instead, they are defined as JSON properties.
Changes in the JSON schema are highlighted in bold.
{
  "type": "object",
  "properties": {
    "id": {
      "description": "id of the metric",
      "type": "string"
    },
    "name": {
      "description": "name of the metric",
      "type": "string"
    },
    "type": {
      "description": "type of the metric",
      "type": "string"
    },
    "properties": {
      "description": "properties related to the metric",
      "type": "object",
      "additionalProperties": {
        "type": "object",
        "properties": {
          "unit": {
            "description": "unit of measure of the property",
            "type": "string"
          },
          "maximum": {
            "description": "upper limit of the offered property",
            "type": "number"
          },
          "minimum": {
            "description": "lower limit of the offered property",
            "type": "number"
          },
          "value": {
            "description": "value of the property",
            "anyOf": [
              { "type": "string" },
              { "type": "object" },
              { "type": "array" }
            ]
          }
        }
      }
    }
  }
}
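Under this schema, a metric offered by the data administrator might look as below; the metric id, name and property values are illustrative. A minimal hand-rolled structural check (a sketch, not a full JSON Schema validator) confirms that properties are now plain JSON properties keyed by name rather than objects in a list:

```python
# Illustrative metric instance matching the schema above; all values are
# hypothetical.
metric = {
    "id": "availability-1",
    "name": "Availability",
    "type": "DataUtility",
    "properties": {
        "availability": {"unit": "percentage", "minimum": 90,
                         "maximum": 99, "value": 95},
    },
}

def matches_metric_schema(m):
    """Minimal structural check mirroring the JSON schema above."""
    # id, name and type must be strings.
    if not all(isinstance(m.get(k), str) for k in ("id", "name", "type")):
        return False
    # properties is an object whose values are property objects.
    props = m.get("properties", {})
    if not isinstance(props, dict):
        return False
    allowed = {"unit", "maximum", "minimum", "value"}
    return all(isinstance(p, dict) and set(p) <= allowed for p in props.values())
```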
2.3 Abstract Properties (Blueprint Section 3)

This section contains the goal model that specifies the non-functional application requirements that the blueprint is expected to fulfill once the concrete blueprint is instantiated. This goal model is used by the SLA Manager to detect violations, and by the DS4M to identify the best data and computation movement actions. For further details on how the goal model is encoded, please see D3.2 Section 3.3 [2].
This section remains empty in the abstract blueprint. Once VDC Blueprint resolution takes place, an intermediate blueprint is generated. In particular, the Data Utility Resolution Engine, which is further discussed in section 3.2.2, inserts in this section a subset of the goal model taken from the application designer requirements.
2.5 Cookbook Appendix (Blueprint Section 4)
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Deployment | Object | Information about the available infrastructures and the software running on them | N/A | N/A |
| Deployment.id | String | Identifier of the deployment | N/A | N/A |
| Deployment.name | String | Human-friendly name of the deployment | N/A | N/A |
| Deployment.infrastructures | Object | Set of clusters deployed with this blueprint. The key is the infrastructure identifier and the value is the cluster nodes information | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.id | String | Identifier of the cluster infrastructure | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.name | String | Human-readable name of the cluster infrastructure | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.type | String | Type of the cluster. For example: cloudsigma, aws or baremetal | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes | Object | Nodes present in the cluster, indexed by node role. The value is a list of node information | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes.<role>.hostname | String | Hostname of the node | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes.<role>.role | String | Role of the node in the cluster. In case of Kubernetes it will be master or slave | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes.<role>.ip | String | External IP of the node | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes.<role>.drive_size | Int | Size of the boot drive in bytes | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes.<role>.data_drives | Array | List of data drives attached to the node | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes.<role>.data_drives[i].name | String | Name of the data drive | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes.<role>.data_drives[i].size | String | Size of the data drive | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.nodes.<role>.extra_properties | Object | Arbitrary properties associated with this node in key-value format, where keys and values are strings. It can be used to set labels, for example. | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.status | String | Status of the cluster. A healthy cluster should be in "running" status. | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.vdcs | Object | Object containing information about the VDCs deployed in the cluster. The key is the VDC identifier, while the value is the information relative to that particular VDC | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.vdcs.<vdc_id>.ports | Object | Information about which port is assigned to each image running in a VDC. The key is the image identifier in Docker format and the value is the port at which it can be reached on any node of the cluster. | N/A | N/A |
| Deployment.infrastructures.<infrastructure_id>.extra_properties | Object | Arbitrary properties associated with this cluster in key-value format, where keys and values are strings. It can be used to set labels, for example. | N/A | N/A |
| Deployment.extra_properties | Object | Arbitrary properties associated with this multi-cluster deployment in key-value format, where keys and values are strings. It can be used to set labels, for example. | N/A | N/A |
| Deployment.status | String | General status of the whole multi-cluster deployment. A healthy deployment should be in the "running" state | N/A | N/A |
The Deployment field has been added to this version of the blueprint to provide
information to the different components running in the VDCs about the clusters
available to them. This field will be automatically generated by the Deployment
Engine once the clusters are initialized and it will be part of the concrete
blueprint.
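For illustration, a minimal Deployment field conforming to the table above might look as follows; all identifiers and values are invented for this example:

```json
{
  "Deployment": {
    "id": "dep-001",
    "name": "demo deployment",
    "status": "running",
    "extra_properties": {},
    "infrastructures": {
      "infra-1": {
        "id": "infra-1",
        "name": "cloudsigma-cluster",
        "type": "cloudsigma",
        "status": "running",
        "nodes": {
          "master": [
            {
              "hostname": "demo-master-1",
              "role": "master",
              "ip": "198.51.100.10",
              "drive_size": 53687091200,
              "data_drives": [ { "name": "data-1", "size": "10240" } ],
              "extra_properties": {}
            }
          ]
        },
        "vdcs": {
          "vdc-1": { "ports": { "ditas/vdc-image:latest": 30001 } }
        },
        "extra_properties": { "default": "true" }
      }
    }
  }
}
```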
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Resources | Object | Set of resources available to deploy VDCs. These resources are machines with attached disks, grouped as clusters (infrastructures) that will form different Kubernetes clusters. | Deployment Engine | Deployment |
| Resources.description | string | Optional description for the whole resource set | Deployment Engine | Deployment |
| Resources.infrastructures | array | List of available clusters to create or use | Deployment Engine | Deployment |
| Resources.infrastructures[i].description | string | Optional description for the cluster | Deployment Engine | Deployment |
| Resources.infrastructures[i].name | string | Unique name for the cluster. It will be used to form the machines' hostnames if the cluster needs to be created | Deployment Engine | Deployment |
| Resources.infrastructures[i].type | string | Type of infrastructure. A "Cloud" value means that the resources are not initialized, so the deployment engine needs to create them as Virtual Machines and initialize the Kubernetes cluster over them. "Edge" means that the machines are already configured as a cluster and the data in the "resources" section is just informative. | Deployment Engine | Deployment |
| Resources.infrastructures[i].provider | object | Information about the cloud or edge provider | Deployment Engine | Deployment |
| Resources.infrastructures[i].provider.api_endpoint | string | Endpoint to use in case of a cloud provider | Deployment Engine | Deployment |
| Resources.infrastructures[i].provider.api_type | string | The type of provider, such as AWS, GCP, Cloudsigma, etc. | Deployment Engine | Deployment |
| Resources.infrastructures[i].provider.credentials | Object | A key-value map with the credentials to access the cloud provider and be able to create the cluster. In case of an "Edge" cluster type, the existing k8s cluster credentials must be provided here. | Deployment Engine | Deployment |
| Resources.infrastructures[i].provider.secret_id | string | If the deployment engine is configured to use a vault, the credentials can be provided as a link to the vault to retrieve them | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources | array | List of Virtual Machines to instantiate, in case of a "Cloud" deployment, or to use in case of an "Edge" one. | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].cores | int | Number of cores of the VM | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].cpu | int | CPU speed in MHz | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].disk | int | Boot disk size in MB | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].image_id | string | Identifier of the boot image to use | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].ip | string | IP to assign to the VM. If not present, a random one will be chosen. The provider must have enough free public IPs for all of the machines, since they need to have a fixed IP | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].name | string | Unique name of the machine. Along with the infrastructure name, it will be used to compose the machine hostname | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].ram | int | RAM size in MB | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].role | string | Role in the Kubernetes cluster. It can be "master" or "worker" | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].type | string | Optionally, the type of machine (e.g. n1-small) can be provided here instead of the individual RAM and CPU features | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].drives | array | Set of data drives attached to the machine | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].drives[i].name | string | Unique name for the data drive | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].drives[i].size | int | Size in MB of the data drive | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].drives[i].type | string | Type of the data drive (HDD or SSD) | Deployment Engine | Deployment |
| Resources.infrastructures[i].resources[i].extra_properties | string | Key-value map to provide arbitrary tags to the machine. It can be used to provide information to the deployment engine about the features installed or configured in the boot disk, or to mark a particular node with a particular tag. | Deployment Engine | Deployment |
| Resources.infrastructures[i].extra_properties | object | A key-value map that can be used to provide information to the deployment engine or to components running inside a VDC. This is a place to put arbitrary tags that can be useful, such as describing whether a cluster must be treated as the default one to deploy VDCs, whether it is a trustable or untrustable zone, etc. | Deployment Engine | Deployment |
Previously, the whole Cookbook_Appendix section consisted solely of the
information in this Resources field. Now this field represents the available
resources to initialize, while the Deployment field represents these same
resources once they have been initialized.
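For illustration, a minimal Resources field conforming to the table above might look as follows; the provider, endpoint and machine values are invented for this example:

```json
{
  "Resources": {
    "description": "Example resource set (illustrative values)",
    "infrastructures": [
      {
        "name": "demo-cloud",
        "description": "Cluster to be created on a cloud provider",
        "type": "Cloud",
        "provider": {
          "api_endpoint": "https://api.example-cloud.test",
          "api_type": "Cloudsigma",
          "credentials": { "username": "<user>", "password": "<secret>" }
        },
        "resources": [
          {
            "name": "master1",
            "role": "master",
            "cores": 2,
            "cpu": 2000,
            "ram": 4096,
            "disk": 20480,
            "image_id": "ubuntu-18.04",
            "drives": [ { "name": "data1", "size": 10240, "type": "SSD" } ],
            "extra_properties": {}
          }
        ],
        "extra_properties": { "default": "true" }
      }
    ]
  }
}
```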
The following is a continuation of the Identity_Access_Management field,
included in the Internal Structure section of the blueprint. The Cookbook
Appendix field is concerned with information relevant to the runtime of a VDC.
| Field | Type (JSON Format) | Description | Role/Component | Phase/Process |
|---|---|---|---|---|
| Identity_Access_Management | Object | Information about identity and access management of this VDC | Access Management, Request Monitor | Usage Phase |
| Identity_Access_Management.validation_keys | List of Strings | Set of public keys that can be used to validate a token if the JWKS is unreachable (optional) | Application, Request Monitor, DAL, any component that needs to validate a token | Usage Phase |
| Identity_Access_Management.mapping | List of Objects | Set of automatic role mappings for a provider (optional). In case more than one provider is used in the configuration, a component can use these rules to automatically associate roles based on the origin of a token. | Request Monitor | Usage Phase |
| Identity_Access_Management.mapping[i].provider | String | Provider name; needs to match one from the provider list | Request Monitor | Usage Phase |
| Identity_Access_Management.mapping[i].roles | List of Strings | Roles that this mapping can apply | Request Monitor | Usage Phase |
| Identity_Access_Management.mapping[i].role_map | List of Objects | Rules that are evaluated for each token | Request Monitor | Usage Phase |
| Identity_Access_Management.mapping[i].role_map[j].matcher | String | Anko Script Rule1 | Request Monitor | Usage Phase |
This field is new in this version of the schema; it describes how identity access
is managed and configured for this VDC. It was inserted in the blueprint for two
main purposes:
● Allowing the Request Monitor to automatically enforce access control
● Enabling pre-filtering of Blueprints that a Developer has no access to
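To make the mapping rules above concrete, the sketch below shows how a component such as the Request Monitor might apply them to an incoming token. This is a simplified Python illustration: the real DITAS matchers are Anko scripts, and all field names and values here are invented.

```python
# Simplified sketch of automatic role mapping based on token origin.
# Real matchers are Anko scripts; here a matcher is a plain predicate.

def map_roles(token, mappings):
    """Return the roles granted to a token by the matching provider rules."""
    granted = []
    for mapping in mappings:
        if mapping["provider"] != token.get("iss"):
            continue  # this mapping only applies to tokens from its provider
        for rule in mapping["role_map"]:
            if rule["matcher"](token):  # stand-in for an Anko script rule
                granted.extend(mapping["roles"])
    return granted

mappings = [{
    "provider": "https://idp.example.test",
    "roles": ["doctor"],
    "role_map": [{"matcher": lambda t: "medical" in t.get("groups", [])}],
}]

token = {"iss": "https://idp.example.test", "groups": ["medical"]}
print(map_roles(token, mappings))  # -> ['doctor']
```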
2.6 Exposed API (Blueprint Section 5)

According to the DITAS architecture, the VDC interacts with the applications
through the Common Accessibility Framework API, whose programming model is
REST-oriented. The data administrator is in charge of designing the API as well
as making it publicly available. Therefore, this section of the abstract blueprint
includes all the information about the methods through which the administrator
exposes, totally or partially, the data stored in the sources that he/she controls
[4]. The CAF RESTful API of the VDC is written according to the OpenAPI
Specification (originally known as the Swagger specification), so that big
vendors as well as new providers are able to publish their services and
components (Requirement B3.3). The OAS was extended with regard to the
operation object, which in the context of DITAS corresponds to the exposed
VDC method. The JSON schema of the latter is depicted in the table below,
whereas the complete schema of the EXPOSED_API section of the blueprint is
presented in the appendix.
1 https://github.com/mattn/anko
method

| Field | Type (JSON Format) | Description | Comments |
|---|---|---|---|
| summary | String | A short summary of what the operation does | mandatory field |
| operationId | String | Unique string used to identify the method | the same id is used to identify each exposed VDC method throughout the whole abstract VDC blueprint; mandatory field |
| parameters | Array | A list of the input parameters for this method | optional field |
| responses | Object | The list of possible responses as they are returned from calling this method | mandatory field |
| responses.200 (or 201).description | String | A short description of the response | mandatory field |
| responses.200 (or 201).content.application/json.schema | Object | The schema of the data included in the response payload | mandatory field, to enable the developer to conclude whether the method fits his/her application |
| x-data-sources | Array | An array that contains all the identifiers of the data sources (as indicated in the INTERNAL_STRUCTURE.Data_Sources field) that are accessed by the method | mandatory field, to enable the developer to conclude whether the method fits his/her application |
| x-iam-roles | Array | A list indicating that a client needs to have one of these roles to be able to call the method successfully | optional field |
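For illustration, a hypothetical exposed VDC method written against this extended operation object could look as follows; the path, identifiers, data source and role names are invented for this example:

```json
{
  "paths": {
    "/patients/{id}/bloodTests": {
      "get": {
        "summary": "Retrieve the blood test values of a patient",
        "operationId": "getBloodTests",
        "parameters": [
          { "name": "id", "in": "path", "required": true, "schema": { "type": "string" } }
        ],
        "responses": {
          "200": {
            "description": "The requested blood test values",
            "content": {
              "application/json": {
                "schema": { "type": "array", "items": { "type": "object" } }
              }
            }
          }
        },
        "x-data-sources": [ "patientsDB" ],
        "x-iam-roles": [ "doctor" ]
      }
    }
  }
}
```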
3 Final Component Architecture and specification

The VDC Blueprint (as described in the previous section) has an intricate schema
that contains a lot of information, in order to be able to describe all the
functionalities and features of a VDC. Application Designers are called to select
the most suitable Blueprint, and by extension the VDC, that fulfils their
Application Requirements. This task requires the Application Designer not only to
define in a structured manner the requirements of the application he/she
develops, but also to discover, in a database full of intricate blueprints, the one
that suits the application. In order to aid Application Designers in the process of
finding the most suitable blueprint, the Resolution Engine was created (Figure 2).
Figure 2: Resolution Engine Architecture
As depicted in the image above (also described in D3.2 Section 4.2 [2]), the
resolution process consists of different interconnected components aimed at
filtering and ranking the Blueprints according to the Application Requirements.
Different aspects and features of the Blueprint feed the Resolution Engine.
● Content Based Resolution: The scope of this component is to find the
most appropriate blueprints based on the type of data the VDC delivers.
This process receives as input free text provided by the user and delivers
to the follow-up component a list of blueprints that match the content,
based on the user's input.
● Data Utility Resolution Engine (DURE): Filtering based on the type of data
is a crucial step, but it is not enough to fulfill the needs of an application.
The quality of the data is an equally important factor, which is taken care
of by this component.
● Privacy and security Evaluator (PSE): This component is responsible for
filtering and ranking security and privacy aspects of the blueprint.
● Recommendation: Given that the requirements of the application are
matched with the appropriate blueprints, it is important to be able to
recommend and rank the best candidate blueprints in order to give the
Application Developer a more sophisticated and personalized solution.
● Repository Engine: This component handles all the CRUD operations for
the blueprint repository. It is connected with the resolution process in order
to retrieve the blueprints for evaluation.
In the following sections, these components are analyzed. More specifically, we
will focus on the changes made throughout the second period of the project,
on the motivations behind these changes, and on how they address the
requirements of the project.
3.1 VDC Blueprint Repository Engine

The Data Administrator interacts with the Repository Engine in order to perform
CRUD operations on his/her abstract VDC Blueprint(s). Using the interface, he/she
submits a blueprint that, after being evaluated by the Blueprint Validator and
found valid, is stored in the Blueprint Repository. After the Blueprint is stored, the
Repository Engine sends back a unique blueprint id, through which the
administrator is able to read, update, or delete his/her blueprint. Other DITAS roles
or components, such as the Resolution Engine, use the id to retrieve the Blueprint
from the Repository. The Creation & Storage phase of the Blueprint lifecycle is
presented in the figure below (being part of the whole lifecycle depicted in
Figure 1):
Figure 3: Creation & Storage phase of the Blueprint Lifecycle
With respect to the version described in D3.2 Section 4.1 [2], the component was
slightly changed, mainly to adapt to the final abstract VDC Blueprint schema
that is analyzed in this deliverable. The complete API of the VDC Blueprint
Repository Engine, written according to the Swagger specification, can be found
here:
https://github.com/DITAS-Project/VDC-Blueprint-Repository-Engine/blob/master/VDC_Blueprint_Repository_Engine_Swagger_v3.yaml
3.1.1 VDC Blueprint Validator
The goal of this subcomponent is to ensure that inserted or updated blueprints
are valid before they are stored in the Blueprint Repository and, in case of bad
POST or PATCH requests, to provide descriptive and helpful error messages that
assist the data administrator in creating valid blueprints.
The validator takes as input an abstract VDC Blueprint and validates it against
all defined limitations and requirements. A draft v4 JSON Schema is used to
describe the format of a blueprint, along with other grammar standards and
specifications. The validator also checks logical requirements, such as the
following:
● each exposed VDC method must have a unique (operation) id
○ every "method_id" that is used throughout the blueprint must also
be defined as an "operationId" in the EXPOSED_API section
● each data source must have a unique id
○ every data source id that is used throughout the blueprint must also
be declared in the INTERNAL_STRUCTURE.Data_Sources section
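These cross-reference checks can be sketched as follows. The blueprint layout here is heavily simplified, and the helper fields `referenced_method_ids` and `referenced_source_ids` are hypothetical stand-ins for the validator's actual traversal of the blueprint:

```python
# Simplified sketch of the validator's logical checks; the real validator also
# applies a draft-04 JSON Schema before running these cross-reference rules.

def check_blueprint(blueprint):
    errors = []
    exposed = blueprint["EXPOSED_API"]["operationIds"]
    # 1. every exposed VDC method must have a unique operationId
    if len(exposed) != len(set(exposed)):
        errors.append("duplicate operationId in EXPOSED_API")
    # 2. every method_id used elsewhere must be defined in EXPOSED_API
    for method_id in blueprint.get("referenced_method_ids", []):
        if method_id not in exposed:
            errors.append(f"method_id '{method_id}' not defined in EXPOSED_API")
    # 3. every referenced data source id must be declared in INTERNAL_STRUCTURE
    declared = set(blueprint["INTERNAL_STRUCTURE"]["Data_Sources"])
    for source_id in blueprint.get("referenced_source_ids", []):
        if source_id not in declared:
            errors.append(f"data source '{source_id}' not declared")
    return errors

bp = {
    "EXPOSED_API": {"operationIds": ["getBloodTests"]},
    "INTERNAL_STRUCTURE": {"Data_Sources": ["patientsDB"]},
    "referenced_method_ids": ["getBloodTests", "getPatients"],
    "referenced_source_ids": ["patientsDB"],
}
print(check_blueprint(bp))  # -> ["method_id 'getPatients' not defined in EXPOSED_API"]
```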
3.2 VDC Blueprint Resolution Engine

All of the components of the Resolution Engine (architecture, functionalities and
updates) are described in the subsections below. All these components are
connected: from a single input, the resolution process produces a list of ranked
Blueprints for the Application Designer to select from.
Component API
All the individual component services are described in their dedicated sections
below. These services are initiated through a single API.
Table x: Resolution Engine method documentation

/searchBlueprintByReq

Purpose
Input: JSON file with the Application Requirements:
https://github.com/DITAS-Project/VDC-Resolution-Engine/blob/master/src/main/resources/user_reqs.json
Candidates: Array of Blueprints
Output: ResultSet (Array of Blueprints and scores):
schema:
  type: array
  items:
    type: object
    properties:
      blueprint:
        type: object
      score:
        type: number
      methodNames:
        type: array
        items:
          type: string
3.2.1 Content Based Resolution
Content Based Resolution is a component created to filter the VDC blueprints
based on the content they provide (requirement T3.10). To achieve this goal,
elasticSearch2, one of the leading solutions for content-based search, was
integrated into the component.
Figure 4: Content Based Search Sequence Diagram
As depicted in Figure 4, the Content Based Resolution takes as input the
application requirements, more specifically the functional requirements
expressed by the application designer. The component then forms the
appropriate queries to retrieve the most suitable Blueprints from the
elasticSearch DB. The elasticSearch DB contains only snippets of the VDC
Blueprints; more specifically, it contains all the fields that describe the content of
a blueprint. In this way, both the search and the retrieval of the Blueprints can
be done much faster. After the successful retrieval of the blueprint IDs, the
Content Resolution contacts the Repository Engine to get the full Abstract
Blueprints from the Blueprint Repository. All these blueprints, as well as the
Application Requirements, are then reformed to ensure interoperability between
the components (requirements T3.21, T3.22) and are sent to DURE for further
filtering and ranking.
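For illustration, the kind of query body the component might build from the designer's free text could be sketched as follows; the indexed field names are assumptions for this example, not the actual index mapping:

```python
# Hedged sketch: turn a free-text functional requirement into an
# Elasticsearch-style query body restricted to the content-describing fields
# of the indexed blueprint snippets.

def build_content_query(free_text, size=10):
    return {
        "size": size,  # retrieve only the top hits, as described above
        "query": {
            "multi_match": {
                "query": free_text,
                # only fields that describe the blueprint content are indexed
                "fields": ["description", "tags", "methods.description"],
            }
        },
    }

body = build_content_query("blood test from italy")
print(body["query"]["multi_match"]["query"])  # -> blood test from italy
```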
As far as the architecture and interoperability with other components are
concerned, the Content Based Resolution remained the same, with small fixes in
the communication with the other resolution components. The main focus of this
component's evolution was on the technologies used. More specifically, in the
second iteration of this component, elasticSearch was upgraded to version 7.
The reason for this upgrade was the new features of elasticSearch that were
needed for the Resolution Process in terms of functionality and performance.
Features such as the high-level REST client and the script score query helped in
query building and in the overall interoperability of the components involved in
the resolution process. In addition, the ranking features of the new elasticSearch
made the filtering and ranking of the blueprints much more efficient and easier
to produce. Finally, the faster retrieval of top hits boosted the performance and
made the component even faster. It is important to mention that all these
updates are important not only to the Content Based Resolution, but also for the
recommendation system implemented in the resolution process.
As far as the component API is concerned, the service URL, the inputs, and the
outputs remain the same.
2 https://www.elastic.co/
Component API
/searchBlueprintByReq_ESresponse

Purpose: This method searches specific fields of the blueprint: the tags and the
description of the blueprint.
Input: Free text, e.g. "blood test from italy"
Candidates: Array of Blueprints
Output: ResultSet (Array of blueprint UUIDs and relevance scores):
[
  { "_index": "vdc_search", "_id": "VyOReGAB1xWEy8e1njck", "_score": 2.463948 },
  { "_index": "vdc_search", "_id": "OiOReGAB1xWEy8e1LDdq", "_score": 1.7260926 },
  { "_index": "vdc_search", "_id": "ViOReGAB1xWEy8e1lDfm", "_score": 1.1507283 }
]
3.2.2 Data Utility Resolution Engine
The Data Utility Resolution Engine (DURE) is used in the blueprint selection phase
of DITAS. This component is responsible for filtering out blueprints that do not fulfill
non-functional application requirements. In addition, the DURE ranks blueprints
based on how well they fulfill non-functional requirements.
Figure 5: Data Utility Resolution Sequence Diagram
Starting from the implementation described in D3.2 Section 4.2.2 [2], we further
improved it by dynamically updating the data utility of each blueprint before
the assessment takes place.
To this aim, application requirements are first passed to the Data Utility
Refinement (DUR) module, which is in charge of rebalancing weights in the goal
model based on the type of application developed by the application
developer.
In addition, for each blueprint, the DURE computes its data utility by taking into
account only the columns that are relevant for the output desired by the
application developer. To do so, the DURE specifies in the
INTERNAL_STRUCTURE.Testing_Output_Data section of the blueprint which
attributes are relevant for the desired output. Then, it invokes the Data Utility
Evaluator (DUE) module, which computes the new data utility values and
updates the DATA_MANAGEMENT section of the blueprint.
Once the data utility has been computed, the DURE assigns a score in the 0-1
range to the blueprint based on how well it fulfills the non-functional application
requirements. To compute the score, the DURE relies on the internal Ranker
component. The Ranker first transforms the goal tree into an expression whose
factors are represented by the requirements (we refer to D3.2 Section 4.2.2 for
the details on how the expression is produced). Then, for each requirement, the
Ranker estimates how well the blueprint fulfills it. This is done by the Ranker itself
for requirements related to data utility, whereas the assessment of security and
privacy requirements is performed by invoking the Privacy and Security
Evaluator (PSE) module.
Once the assessment is done, if the rank assigned to the blueprint is 0, the
blueprint is discarded. Otherwise, the goal model specified in the application
requirements is customized for that specific blueprint and then inserted into the
ABSTRACT_PROPERTIES section of the blueprint. To do so, the DURE relies on the
internal Pruner component. The Pruner customizes the goal tree by pruning
leaves associated with the requirements that cannot be fulfilled by the blueprint.
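The Ranker/Pruner interplay described above can be sketched as follows. The goal-tree layout and the weighted-sum expression are simplified stand-ins for the actual implementation described in D3.2:

```python
# Simplified sketch: a goal tree whose leaves carry weights and per-requirement
# satisfaction scores is collapsed into a single 0-1 rank; leaves whose
# requirement cannot be fulfilled (score 0) are pruned.

def rank(node, scores):
    if "requirement" in node:  # leaf: weighted requirement satisfaction
        return node["weight"] * scores.get(node["requirement"], 0.0)
    return sum(rank(child, scores) for child in node["children"])

def prune(node, scores):
    if "requirement" in node:
        return node if scores.get(node["requirement"], 0.0) > 0 else None
    children = [c for c in (prune(ch, scores) for ch in node["children"]) if c]
    return {"children": children} if children else None

tree = {"children": [
    {"requirement": "accuracy", "weight": 0.6},
    {"requirement": "completeness", "weight": 0.4},
]}
scores = {"accuracy": 0.9, "completeness": 0.0}
print(round(rank(tree, scores), 2))       # -> 0.54
print(len(prune(tree, scores)["children"]))  # -> 1 (completeness leaf pruned)
```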
Component API
POST /v2/filterBlueprints

Purpose: This method filters and ranks blueprints according to non-functional
requirements.
Input:
  ApplicationRequirements: JSON document specifying the application requirements
  Candidates: Array of (Blueprint, MethodNames) pairs
    Blueprint: JSON document describing the blueprint (see Chapter 3 of this document)
    MethodNames: Array of strings indicating the names of the methods in the blueprint that fulfill the functional requirements
Output:
  ResultSet: Array of (Blueprint, Score, MethodNames) tuples
    Blueprint: JSON document describing the blueprint (see Chapter 3 of this document), updated with the goal model and the non-functional application requirements
    Score: Double indicating the rank (in the 0-1 range) assigned to the blueprint
    MethodNames: Array of strings indicating the names of the methods in the blueprint that fulfill the functional requirements
3.2.3 Privacy Security Evaluator
The Privacy Security Evaluator Service (PSES) is used in the blueprint selection
phase of DITAS. Specifically, the PSES addresses DITAS requirement T3.19 by
determining if and how well privacy and security attributes fit the application
designer's requirements.
The PSES, therefore, is responsible for filtering and ranking the security- and
privacy-related properties (see section 2.2) of a Blueprint. The PSES is mainly used
by the DURE during Blueprint selection.
The PSES is built as a stateless microservice, which allows it to be easily scaled
and deployed. It is built on top of Java Spring Boot and provides a REST API to
perform the filtering and ranking process. We investigated both serverless
function-as-a-service approaches and a containerized microservice for the
PSES. Serverless approaches would offer high scalability with an attractive cost
model during low usage of the service [5][6]. However, for the current DITAS
provider model, the self-managed approach is more sensible for now, mainly
because DITAS does not come with the infrastructure required for a serverless
approach, such as a serverless framework (e.g., Fission3, KNative4) and
supporting services (e.g., scalable data storage, scalable messaging). Adding
such an infrastructure for only one component is not cost-effective, but it could
be beneficial for future versions of DITAS.
The service consists of the following components (Figure 6): a REST controller
handles parsing incoming requests as well as all result representations. An
evaluator service, in turn, uses the filter services in combination with the ranking
service to generate the final result.
The filter services apply several field filters to each blueprint metric to remove
blueprints that do not match the required security or privacy properties. Lastly,
the Ranking Service can use a Ranking Strategy to order the remaining blueprint
properties. We implemented multiple strategies that can be changed at
runtime, depending on the needs of the DITAS administrator (e.g., enforcement
of minimum security standards). The overall process can be seen in Figure 7.
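A minimal sketch of this pipeline (field filters followed by a pluggable ranking strategy) could look as follows; the property names and the margin-based strategy are invented for illustration:

```python
# Illustrative sketch of the PSES pipeline: field filters drop blueprint metrics
# that violate the requirement, then a pluggable strategy orders the rest.

def field_filter(requirement, metric):
    """Keep a metric only if every required property is met (>= semantics)."""
    required = {p["name"]: p["value"] for p in requirement["properties"]}
    offered = {p["name"]: p["value"] for p in metric["properties"]}
    return all(offered.get(name, 0) >= value for name, value in required.items())

def rank_by_margin(requirement, metric):
    """One possible strategy: score by the total margin above the requirement."""
    required = {p["name"]: p["value"] for p in requirement["properties"]}
    offered = {p["name"]: p["value"] for p in metric["properties"]}
    return sum(offered.get(n, 0) - v for n, v in required.items())

def evaluate(requirement, metrics, strategy=rank_by_margin):
    kept = [m for m in metrics if field_filter(requirement, m)]
    return sorted(kept, key=lambda m: strategy(requirement, m), reverse=True)

req = {"properties": [{"name": "keyLength", "value": 128}]}
metrics = [
    {"id": "blue1", "properties": [{"name": "keyLength", "value": 256}]},
    {"id": "blue2", "properties": [{"name": "keyLength", "value": 64}]},
]
print([m["id"] for m in evaluate(req, metrics)])  # -> ['blue1']
```

Passing a different `strategy` function mirrors the runtime-switchable Ranking Strategy described above.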
Figure 6: Simplified Architecture
3 https://fission.io/
4 https://cloud.google.com/knative/
Figure 7: Filtering process of PSE Sequence Diagram
Component API
POST /v1/filter
Purpose: This method filters and ranks one or more security- or privacy-related
blueprint properties against the specified user requirements.
Input: User Requirement (Object) and blueprintMetrics (Array)
{
  "requirement": {
    "id": "1",
    "name": "<any>",
    "type": "<any>",
    "properties": [
      { "name": "<any>", "unit": "<any>", "value": "<any>" },
      ...
    ]
  },
  "blueprintMetrics": [
    {
      "type": "object",
      "description": "<any>",
      "properties": {
        "id": "blue1",
        "name": "<any>",
        "type": "<any>",
        "properties": [
          { "name": "<any>", "unit": "<any>", "value": "<any>" },
          ...
        ]
      }
    },
    ...
  ]
}
Output: Result
[
  {
    "blueprint": { "id": "blue1", ... },
    "score": <num>
  },
  ...
]
3.2.4 Recommendation Component
The Recommendation Component is the final step of the resolution process. It
takes as input the list of ranked and filtered Blueprints, as well as the Application
Requirements, from the previous steps and reforms them (Requirements T3.21,
T3.22) in order to produce the complex queries that yield the user-based score
(recommendation) of the blueprints.
Figure 8: Recommendation Component Sequence Diagram
The architecture, as well as the technologies and features, of this component
was described in the previous deliverable (D3.2 Section 4.2 [2]). As described in
Section 3.2.1, the main focus of the development was the incorporation of
ElasticSearch 7 as well as the interoperability with the other resolution
components. Figure 8 depicts the sequence in which the recommendation
system produces the final blueprint score. For every Blueprint in the list of
candidates, the recommendation module queries the purchase repository to
find the purchase history of every blueprint and the application requirements of
the users that bought it. After correlating the stored application requirements
with the current ones, it produces a score that strongly takes this correlation into
account. In this way, it produces a recommendation that depends on what the
users needed from the specific Blueprint and how well this Blueprint fulfilled the
needs of the application. By correlating the different requirements, the system
gives strong weight to the scores of the users that had similar application
requirements. This allows the recommendation system to produce more
user-centric recommendations, rather than the technical filtering and ranking
that the other components produce.
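The idea of weighting past users' ratings by requirement similarity can be sketched as follows; the Jaccard similarity and the data layout are illustrative stand-ins for the actual correlation used by the component:

```python
# Simplified sketch of the recommendation idea: past buyers' requirements are
# compared with the current user's, and their ratings are weighted by that
# similarity to produce a user-centric score.

def similarity(reqs_a, reqs_b):
    """Jaccard similarity over requirement names, as a stand-in measure."""
    a, b = set(reqs_a), set(reqs_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_score(current_reqs, purchase_history):
    """Average of past ratings, weighted by requirement similarity."""
    weights = [(similarity(current_reqs, p["requirements"]), p["rating"])
               for p in purchase_history]
    total = sum(w for w, _ in weights)
    return sum(w * r for w, r in weights) / total if total else 0.0

history = [
    {"requirements": ["accuracy", "latency"], "rating": 5.0},
    {"requirements": ["volume"], "rating": 1.0},  # dissimilar user, ignored
]
score = recommend_score(["accuracy", "latency"], history)
print(round(score, 2))  # -> 5.0
```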
Component API
POST /rateBlueprint
Purpose: This method compares the user requirements of other users that
acquired and used each proposed blueprint with the user requirements of the
current user, and rates the blueprints according to their similarity rating in
combination with their user rating.
Input: The proposed blueprint list from DURE as a JSON array, and a JSON object
representing the user requirements of the current user.
{
  "requirements": {
    "id": "1",
    "name": "<any>",
    "type": "<any>",
    "properties": [
      { "name": "<any>", "unit": "<any>", "value": "<any>" },
      ...
    ]
  },
  "blueprintList": [
    {
      "blueprint": { "id": "blue1", ... },
      "score": <num>
    },
    ...
  ]
}
Output:
[
  {
    "blueprint": { "id": "blue1", ... },
    "score": <num>,
    "rating": <num>
  },
  ...
]
4 Data Access Layer (DAL)
As already described in the Architecture deliverable D1.2 [4], the DAL is an element
of a VDC whose role is to expose the data provided by the Data Administrator
of the DITAS-EE infrastructure without violating privacy and security constraints.
The DAL includes the Privacy Enforcement Layer, the component in charge of
rewriting the SQL that must be executed to satisfy a call coming from the
Processing Layer into SQL that avoids returning data that must not be seen
externally for a given purpose. This filtering is affected mainly by the location of
the VDC and the purpose of the access. Since it is possible to move the
computation, i.e., the processing and the CAF layer, and this movement can
affect which data may be transmitted, an important assumption about the DAL
is that this layer is deployed in the same place where the data is stored, i.e., it is
invariant to computation movement. The DAL is always deployed in the same
security and privacy realm as the data source made available by the data
administrator, and it is in charge of providing the required connectivity between
the data source and the VDC processing while enforcing the privacy policies.
The DAL is also used in data movement by the Data Movement Enactor (DME).
When the movement strategy is to duplicate the data source somewhere else
(e.g., on the premises of the consumer), the DAL first ensures that only the data
that may be stored at that location are replicated. Secondly, a new instance of
the DAL is instantiated at the new location to perform access control after the
data is moved. The data movement process is initiated by the DME, which uses
the DAL API to move the data from the original data source to the target data
source. If the original data source can still change, data movement is performed
continuously, step by step, with part of the data moved at each step. The data
to be moved during a single step is described by an SQL query. As data
movement is a continuous process, the DME (or another component) contacts
the DAL repeatedly to keep the movement going. Both the DAL at the original
location and the DAL at the movement target have to expose the same API to
the processing layer of the VDC, but they might have to comply with different
restrictions based on their privacy zone. For example, data might be moved from
the private hospital cloud to the public cloud, and the same VDC for the
researcher application should be able to retrieve data from either. However,
whereas in the private cloud the data might have been stored in plaintext, in the
public cloud it might be stored encrypted. The DAL should then be able to
operate on plaintext data in the private cloud and on encrypted data in the
public cloud. To this end, the DAL contains both a flow for accessing plaintext
data and a flow for accessing encrypted data, choosing the flow based on the
concrete blueprint from which it is created. In the above data movement
example, when running in the private cloud before
the movement, the DAL would use the plaintext mode, and when running in the
public cloud after the movement, it would use the encrypted mode.
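The choice between the two access flows can be sketched as a simple dispatch on the privacy zone recorded in the concrete blueprint. The field name `privacy_zone` is a hypothetical stand-in for whatever the blueprint actually records.

```python
# Minimal sketch: the DAL picks its data-access flow from the privacy
# zone of the concrete blueprint it was created from. The field name
# "privacy_zone" is an assumption for illustration.

def select_access_flow(blueprint):
    zone = blueprint.get("privacy_zone", "private")
    return "encrypted" if zone == "public" else "plaintext"

print(select_access_flow({"privacy_zone": "public"}))   # encrypted
print(select_access_flow({"privacy_zone": "private"}))  # plaintext
```

Because the flow is fixed at blueprint level, the same DAL image can serve both zones without runtime negotiation.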
Figure 9: Initialization of DAL Data Movement Sequence Diagram
Figure 10: Finalization of DAL Data Movement Sequence Diagram
The Privacy Enforcement Engine acts as a proxy before the query is executed over
the data. It rewrites the query so that it returns only data compliant with the
privacy policies, evaluated together with user identity information. To this end, the
original query is augmented with filters based on the policies and on additional
attributes of the request or the data, such as the data subject's consent.
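A minimal sketch of this kind of query augmentation follows, assuming illustrative column names (`consent`, `anonymized`) and a purpose-based rule; the real Privacy Enforcement Engine operates on its actual policy model.

```python
# Sketch of policy-driven query augmentation; the column names
# ("consent", "anonymized") and the purpose rule are illustrative.

def augment_query(sql, purpose):
    filters = ["consent = 1"]             # require data-subject consent
    if purpose == "research":
        filters.append("anonymized = 1")  # stricter rule for research use
    clause = " AND ".join(filters)
    # Append to an existing WHERE clause, or add one.
    joiner = "AND" if " where " in sql.lower() else "WHERE"
    return f"{sql} {joiner} {clause}"

print(augment_query("SELECT name FROM patients", "research"))
```

The caller still issues its original query; the extra predicates are invisible to the Processing Layer, which only ever sees policy-compliant rows.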
Figure 11: Data transformation Sequence Diagram
In addition, the Privacy Enforcement Engine creates encryption properties that
the DAL later uses to activate decryption when it reads data frames upon
application data access, and encryption when it writes data frames during data
movement.
Figure 12: DAL Interconnection with CAF and Privacy Enforcement Engine
The protocol of communication between the DAL and the rest of the VDC is
gRPC since, on the one hand, it is generic enough and supports both the
request-response model and streaming well and, on the other hand, it can be
more efficient than plain REST over HTTP. The interface of the DAL to the
processing layer is described by a protobuf definition, from which both server
and client code are generated. This helps maintain consistency between the
DAL API and the data-processing DAL client.
The DAL component indirectly addresses requirement T3.15: computation can
be moved to a different network, and the VDC is still able to access the data
stores.
Component API
service QueryService {
rpc query (QueryRequest) returns (QueryReply) {}
}
Purpose This method runs the supplied query on the data sources
managed by this DAL.
Input message QueryRequest {
DalMessageProperties dalMessageProperties = 1;
DalPrivacyProperties dalPrivacyProperties = 2;
string query = 3;
repeated string queryParameters = 4;
}
DAL message properties include properties common to all
DAL messages.
DAL privacy properties include the properties on which the
Policy Enforcement Engine bases its policy decisions, such as
whether the data is in the private or the public zone.
The query field contains the query for fetching the data from the data sources.
Output message QueryReply {
    repeated string values = 1;
}
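Under the generated-stub workflow described above, a client call could look like the following sketch. `FakeQueryServiceStub` stands in for the stub protoc would generate, so the flow runs without the actual generated modules; the field layout follows the QueryRequest message, with the privacy properties collapsed to a single illustrative field.

```python
# Sketch of a QueryService.query call. FakeQueryServiceStub stands in
# for the client stub that protoc would generate from the DAL protobuf;
# all names here are illustrative, not the generated API.

class QueryRequest:
    """Mirrors the fields of the QueryRequest protobuf message;
    the privacy properties are collapsed to one illustrative field."""
    def __init__(self, query, query_parameters, privacy_zone):
        self.query = query
        self.query_parameters = query_parameters
        self.privacy_zone = privacy_zone

class FakeQueryServiceStub:
    def query(self, request):
        # A real stub serializes the request and performs the gRPC call;
        # here we echo a canned QueryReply-like dict.
        return {"values": [f"row for: {request.query}"]}

stub = FakeQueryServiceStub()
request = QueryRequest(
    query="SELECT * FROM patients WHERE id = ?",
    query_parameters=["42"],
    privacy_zone="private",
)
reply = stub.query(request)
print(reply["values"][0])
```

Because both sides are generated from the same protobuf, a change to the message definition propagates to server and client alike, which is the consistency benefit noted above.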
service DataMovementService {
rpc startDataMovement (StartDataMovementRequest) returns
(StartDataMovementReply) {}
rpc finishDataMovement (FinishDataMovementRequest) returns
(FinishDataMovementReply) {}
}
This DAL service enacts part of the operations needed for data movement. Its
methods are called by the Data Movement Enactor to start and to finish data
movement. When data movement is started, the source DAL creates a Parquet
file with all the data that needs to move. When data movement is finished, the
target DAL reads the Parquet file and persists the data at the data sources.
startDataMovement()
Purpose This method is called by the data movement enactor to start data
movement.
Input message StartDataMovementRequest {
DalMessageProperties dalMessageProperties = 1;
DalPrivacyProperties sourcePrivacyProperties = 2;
DalPrivacyProperties destinationPrivacyProperties = 3;
string query = 4;
repeated string queryParameters = 5;
string sharedVolumePath = 6;
}
The source and destination privacy properties specify whether the
source and target data sources are in the private or the public zone;
the Policy Enforcement Engine bases its policy decisions on this
information.
The query specifies the query to run on the data source in order to
extract the data to be moved.
The shared volume path identifies the volume shared between the
source and the target DAL for exchanging the data to be moved.
Output message StartDataMovementReply {
}
finishDataMovement()
Purpose This method is called by the Data Movement Enactor to finish data
movement.
Input
message FinishDataMovementRequest {
DalMessageProperties dalMessageProperties = 1;
DalPrivacyProperties sourcePrivacyProperties = 2;
DalPrivacyProperties destinationPrivacyProperties = 3;
string query = 4;
repeated string queryParameters = 5;
string sharedVolumePath = 6;
string targetDatasource = 7;
}
The source and destination privacy properties specify whether the
source and target data sources are in the private or the public zone;
the Policy Enforcement Engine bases its policy decisions on this
information.
The query specifies the query to run on the data frame in the file
shared between the source and target DALs.
The shared volume path identifies the volume shared between the
source and the target DAL for exchanging the data to be moved. The
target datasource specifies which datasource will accept the persisted data.
Output message FinishDataMovementReply { }
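The start/finish protocol can be sketched as the following loop, with in-memory stand-ins for the two DALs and the shared Parquet file; batch size and the method names on the fakes are illustrative, not the gRPC API itself.

```python
# Sketch of the continuous movement loop: the DME repeatedly asks the
# source DAL to start a step and the target DAL to finish it until no
# data is left. FakeDal models both DALs in memory; the returned batch
# stands in for the Parquet file on the shared volume.

class FakeDal:
    def __init__(self, rows):
        self.rows = list(rows)   # data still at this DAL's data source
        self.stored = []         # data persisted at the target

    def start_data_movement(self, query, batch):
        # Source side: extract the next batch onto the shared volume.
        shared, self.rows = self.rows[:batch], self.rows[batch:]
        return shared

    def finish_data_movement(self, shared):
        # Target side: persist the batch into its data source.
        self.stored.extend(shared)

source, target = FakeDal(range(5)), FakeDal([])
while True:
    shared = source.start_data_movement("SELECT ...", batch=2)
    if not shared:
        break
    target.finish_data_movement(shared)
print(target.stored)
```

Moving in bounded steps is what lets the DME keep up with a source that is still changing, as the text above describes.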
5 Application Profiling and Deployment Strategies
The decision about the deployment of the VDC and of all the components
related to the management of the access to the data sources is not trivial. When
a blueprint is selected by an application developer, and throughout its lifecycle,
deployment decisions should be based on knowledge about the typical usage
of the VDC by the application requiring access to the data source. This
knowledge makes it possible to exploit relevant information such as the
frequency of the requests to access the data source, the portion of the data
source usually accessed by the application, and the typical violations of the
Data Utility expressed by the consumer.
In order to support decisions about the VDC deployment, relevant information
is collected by the Application Profiling activity.
Application profiling aims at gathering relevant information collected from
different repositories (e.g., the concrete blueprint, the monitoring data, and the
analytics) and making it available as an overview of the VDC instance behavior.
This information is useful to describe the requirements of the application using
the data through the VDC, as well as the typical interaction between the
application and the VDC in accessing these data.
The information collected in the Application Profile can be used in two different
phases:
● at deployment time: the application profile provides valuable input for the
deployment decisions.
● at run time: the application profile supports the Decision System for Data
and Computation Movement (DS4M) when selecting a movement action
for satisfying the application requirements.
The application profile is created indirectly by the interaction between the DITAS
platform and the Application Owner. It can be considered virtual metadata,
since it is generated by gathering together data already produced by other
components of the DITAS architecture.
More in detail, the Application Profile is composed of:
● a task description. When expressing the application requirements, the
application designer describes the application requiring access to the
data. This description, used by the DURE and by the DUR, is important in
the deployment phase to select the Data Utility metrics relevant for the
application purposes, and is stored as information that can also be used
at run time.
● the application requirements. Together with a general classification of the
task, the application developer expresses the functional and
non-functional requirements of the application. While the functional
requirements are used at deployment time to filter the VDC blueprints that
fit the application request, the non-functional requirements are also used
at run time to validate the proper management of the application.
For each metric composing the application requirements, a value
constraint is expressed representing the desired upper or lower limit for that
metric. This information is stored in the profile.
● the application SLA established at deployment time. Similarly to the
application requirements, the SLA contains the upper or lower limit for the
Data Utility dimensions, but this value represents the agreement between
the application designer's requirements and the data administrator's
capabilities and might differ from the initial requirements. This information
is also stored in the application profile.
● the application execution logs. At run time, the monitoring component
and the analytics collect data about the requests of the application to
the platform and their outcome in terms of Data Utility. The execution logs
are relevant to discover typical issues related to the application requests
that persist over time, and can be exploited to improve requirement
satisfaction by suggesting a new data source or by modifying the
application deployment.
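The four ingredients above can be pictured as a single aggregated structure. This is an illustrative sketch, not the DITAS data model; field names and the upper-limit reading of the comparison are assumptions.

```python
# Illustrative aggregation of the Application Profile's four parts;
# field names and upper-limit metric semantics are assumptions.

from dataclasses import dataclass, field

@dataclass
class ApplicationProfile:
    task_description: str
    requirements: dict                 # metric -> requested upper limit
    sla: dict                          # metric -> agreed upper limit
    execution_logs: list = field(default_factory=list)

    def weaker_than_requested(self):
        """Metrics whose agreed SLA limit is weaker (higher) than the
        limit originally requested, assuming upper-limit metrics."""
        return [m for m, limit in self.requirements.items()
                if self.sla.get(m, limit) > limit]

profile = ApplicationProfile(
    task_description="researcher analytics over patient data",
    requirements={"latency_ms": 100},
    sla={"latency_ms": 150},
)
print(profile.weaker_than_requested())
```

Keeping requirements and the negotiated SLA side by side makes it easy to spot, at run time, where the agreement already departed from the designer's original intent.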
By analyzing the information collected by the Application Profiling, it is possible
to gain insight into the typical behavior that might be expected from the system,
focusing on two main aspects:
● Relevant data: not all the data provided by the data source are used by
the application. When deciding which data to move from the edge to
the cloud and vice versa, the knowledge of the frequently accessed
data should be taken into account. This knowledge is relevant to improve
the performance of data retrieval. As an example, when data movement
to a different location is needed, we can copy to the new source the
data that are most likely to be used in the near future instead of moving
the whole data source.
● Data and computation resources reliability: in a fog environment,
connections between the cloud and the edge are often unreliable.
Some resources can be offline at some point in time, making
communication between the cloud and the edge impossible.
By observing the typical behavior of the resources in terms of connectivity
and reliability, we can use this information to prevent connectivity issues
when deciding where to place the data and the computation.
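As an illustration of the "relevant data" analysis, the following sketch ranks tables by access frequency from hypothetical execution-log entries; the log format is an assumption for illustration.

```python
# Hedged sketch: deriving "relevant data" from execution logs by access
# frequency, to choose what to replicate on data movement. The log
# entry format ("table" key) is an illustrative assumption.

from collections import Counter

def hot_tables(access_log, top_n=2):
    """Return the most frequently accessed tables, the best candidates
    for replication closer to the application."""
    counts = Counter(entry["table"] for entry in access_log)
    return [table for table, _ in counts.most_common(top_n)]

log = [{"table": "visits"}, {"table": "patients"},
       {"table": "visits"}, {"table": "billing"}, {"table": "visits"}]
print(hot_tables(log))
```

Copying only such hot tables, rather than the whole data source, is the optimization the bullet above describes.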
6 DITAS SDK
The major goal of the DITAS SDK is to manage the life-cycle of the VDC, which is
directly connected to the life-cycle of the VDC Blueprint (i.e., the descriptor of
the VDC), whose goal is manifold:
● to describe the characteristics of the exposed data sources;
● to support the application designer when looking for a dataset that
could be interesting for his/her purposes;
● to support the DITAS execution environment to properly deploy all the
components composing the VDC needed to expose the data.
Different roles are involved in the management of the VDC:
● The data administrator is the owner of the data sources and has complete
knowledge of them. The data administrator takes advantage of DITAS to
enable the provisioning of internal data that s/he would like to make
accessible to other subjects. Depending on the subject and the consent
of usage, the visibility of these data can be partial or total. With DITAS, the
data administrator can simplify the process of making her/his data
available, as, through the VDC, the DITAS platform is able to optimize the
data provisioning by means of data and computation movement. In fact,
the data administrator only has to define the exposed API, i.e., the
Common Access Framework (CAF), reflecting the methods to access the
data.
● The application developer is the actor in charge of creating the VDC.
Based on the data sources made available by the data administrator,
s/he is responsible for writing the code that exposes the API defined by
the data administrator. Depending on the case, the data processing
developed can be a simple connection to the provided data sources or
complex data analytics. As a result, the application developer is able to
provide a complete specification of a VDC. It is worth noting that in
several cases the same actor will hold both the data administrator and
the application developer roles.
● The application designer represents the service consumer, and her/his
goal is twofold. On the one hand, the goal is to select the most suitable
VDC with respect to her/his requirements. For this reason, the DITAS
platform has to provide a matchmaker able to compare the application
requirements with the capabilities offered by a VDC. This matchmaking is
mainly driven by the data utility, which encompasses quality of service,
quality of data, and reputation aspects. On the other hand, s/he has to
check whether the VDC is really providing what has been promised, from
both a functional and a non-functional perspective.
● The DITAS operator is responsible for the run-time platform; this includes the
responsibility for keeping the applications running. The system operator
has no specific application or data knowledge, but rather depends on
the monitoring tools to verify that all the applications are running properly,
to monitor the corrective actions the DITAS platform is taking, and to
provide feedback at design-time by suggesting refinements of the data
utility specification.
For each of these roles, DITAS provides a dedicated SDK5, which simplifies the life
of the actors involved in the management of the related VDC. Depending on
the role, the SDK is provided in different flavors, e.g., CLI, GUI, or web applications.
Details on the SDK offered for each of the roles follow:
● SDK for Data Administrator6
● SDK for VDC Developer (Application Developer)7
● SDK for Application Designer8
● SDK for DITAS Operator9
5 https://www.ditas-project.eu/wiki/ditas-sdk/
6 https://www.ditas-project.eu/wiki/guide-for-data-administrator/
7 https://www.ditas-project.eu/wiki/guide-for-vdc-developer/
8 https://www.ditas-project.eu/wiki/guide-for-application-designer/
9 https://www.ditas-project.eu/wiki/guide-for-ditas-operator/
7 Conclusions
Virtualizing the data sources and creating an end-to-end system that provides
all the functionalities needed to create, discover, deploy, and monitor a VDC
requires several components. Creating all these components, as well as ensuring
the interoperability between them, is one of the main focuses of the
development in this work package. Although creating the components is
essential to the project, creating an SDK that helps all the relevant parties
reproduce and run the system is also of great importance. This SDK contains
all the services, guidelines and UI documentation essential to the usability of the
DITAS platform. Taking into consideration all the established requirements, as
well as the new and reshaped ones that emerged throughout the project, the
components were extended or reworked in order to fulfill them. In addition, in
the context of this document, the DAL, a component that was introduced later
in the course of the project, is fully described, with all the functionalities that it
provides. As far as the SDK is concerned, since the project is evolving rapidly
and new functionalities or changes to existing ones are being made, a number
of dedicated wiki pages, which can be easily updated and are publicly
accessible, were established to document all the relevant information about
the SDK.
8 References
[1] Deliverable D2.2 of DITAS project: “DITAS Data Management – second
release”. © DITAS Consortium, 2018.
[2] Deliverable D3.2 of DITAS Project: “Data Virtualization SDK prototype (initial
version)”. © DITAS Consortium, 2018.
[3] Deliverable D4.2 of DITAS Project: “Execution environment prototype (first
release)”. © DITAS Consortium, 2018.
[4] Deliverable D1.2 of DITAS Project: “Final DITAS architecture and validation
approach”. © DITAS Consortium, 2019.
[5] Werner, Sebastian, Jörn Kuhlenkamp, Markus Klems, Johannes Müller, and
Stefan Tai. "Serverless Big Data Processing using Matrix Multiplication as
Example." In 2018 IEEE International Conference on Big Data (Big Data),
pp. 358-365. IEEE, 2018.
[6] Kuhlenkamp, Jörn, and Sebastian Werner. "Benchmarking FaaS Platforms:
Call for Community Participation." In 2018 IEEE/ACM International
Conference on Utility and Cloud Computing Companion (UCC
Companion), pp. 189-194. IEEE, 2018.
Appendix
Final Abstract VDC Blueprint Schema
{
"type":"object",
"description":"This is a VDC Blueprint which consists of five
sections",
"properties":{
"INTERNAL_STRUCTURE":{
"type":"object",
"description":"General information about the VDC
Blueprint",
"properties":{
"Overview":{
"type":"object",
"properties":{
"name":{
"type":"string",
"description":"This field should contain the
name of the VDC Blueprint"
},
"description":{
"type":"string",
"description":"This field should contain a
short description of the VDC Blueprint"
},
"tags":{
"type":"array",
"description":"Each element of this array
should contain some keywords that describe the functionality of each
one exposed VDC method",
"items":{
"type":"object",
"properties":{
"method_id":{
"type":"string",
"description":"The id (operationId) of
the method (as indicated in the EXPOSED_API.paths field)"
},
"tags":{
"type":"array",
"items":{
"type":"string"
},
"minItems":1,
"uniqueItems":true
}
},
"additionalProperties":false,
"mandatory":[
"method_id",
"tags"
]
},
"minItems":1,
"uniqueItems":true
}
},
"additionalProperties":false,
"required":[
"name",
"description",
"tags"
]
},
"Data_Sources":{
"type":"array",
"items":{
"type":"object",
"properties":{
"id":{
"type":"string",
"description":"A unique identifier"
},
"description":{
"type":"string"
},
"location":{
"enum":[
"cloud",
"edge"
]
},
"class":{
"enum":[
"relational database",
"object storage",
"time-series database",
"api",
"data stream"
]
},
"type":{
"enum":[
"MySQL",
"Minio",
"InfluxDB",
"rest",
"other"
]
},
"parameters":{
"type":"object",
"description":"Connection parameters"
},
"schema":{
"type":"object"
}
},
"required":[
"id"
]
},
"minItems":1,
"uniqueItems":true
},
"Methods_Input":{
"type":"object",
"description":"This filed contains the part of the
data source that each method needs to be executed",
"properties":{
"Methods":{
"type":"array",
"description":"The list of methods",
"items":{
"type":"object",
"properties":{
"method_id":{
"type":"string",
"description":"The id (operationId) of
the method (as indicated in the EXPOSED_API.paths field)"
},
"dataSources":{
"type":"array",
"description":"The list of data
sources required by the method",
"items":{
"type":"object",
"properties":{
"dataSource_id":{
"type":"string",
"description":"The id of the
data sources (as indicated in the Data_Sources field)"
},
"dataSource_type":{
"type":"string",
"description":"The type of
the data sources (relational/not_relational/object)",
},
"database":{
"type":"array",
"description":"the list of
databases required by a method in a data source",
"items":{
"type":"object",
"properties":{
"database_id":{
"type":"string",
"description":"The
id of the database"
},
"tables":{
"type":"array",
"description":"the
list of tables/collections required by a method in a data source",
"items":{
"type":"object",
"properties":{
"table_id":{
"type":"string",
"description":"The id of the tables/collection "
},
"columns":{
"type":"array",
"items":{
"type":"object",
"properties":{
"column_id":{
"type":"string",
"description":"The id of the column/field"
},
"computeDataUtility":{
"type":"boolean",
"description":"True if it is required for data utility computation"
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
},
"Flow":{
"type":"object",
"description":"The data flow that implements the
VDC",
"properties":{
"platform":{
"enum":[
"Spark",
"Node-RED"
]
},
"parameters":{
"type":"object"
},
"source_code":{
}
}
},
"DAL_Images":{
"description":"Docker images that must be deployed in
the DAL indexed by DAL name. It will be used to compose the service
name and the DNS entry that other images in the cluster can access
to.",
"type":"object",
"additionalProperties":{
"description":"Information about the DAL including
its original location",
"type":"object",
"required":[
"original_ip"
],
"properties":{
"original_ip":{
"description":"IP of the original DAL's
location",
"type":"string"
},
"images":{
"description":"Set of images to deploy
indexed by the image identifier",
"type":"object",
"additionalProperties":{
"description":"ImageInfo is the
information about an image that will be deployed by the deployment
engine",
"type":"object",
"required":[
"image"
],
"properties":{
"external_port":{
"description":"Port in which this
image must be exposed. It must be unique across all images in all
the ImageSets defined in this blueprint. Due to limitations in k8s,
the port range must be bewteen 30000 and 32767",
"type":"integer",
"format":"int64"
},
"image":{
"description":"Image is the image
name in the standard format [group]/<image_name>:[release]",
"type":"string"
},
"internal_port":{
"description":"Port in which the
docker image is listening internally. Two images inside the same
ImageSet can't have the same internal port.",
"type":"integer",
"format":"int64"
}
}
}
}
}
}
},
"VDC_Images":{
"$ref":"#/properties/INTERNAL_STRUCTURE/properties/DAL_Images/additi
onalProperties/properties/images"
},
"Identity_Access_Management":{
"type":"object",
"properties":{
"jwks_uri":{
"type":"string"
},
"iam_endpoint":{
"type":"string"
},
"roles":{
"type":"array",
"items":{
"type":"string"
},
"minItems":1
},
"provider":{
"type":"array",
"items":{
"type":"object",
"properties":{
"name":{
"type":"string"
},
"type":{
"type":"string"
},
"uri":{
"type":"string"
},
"loginPortal":{
"type":"string"
}
},
"required":[
"name",
"uri"
]
},
"minItems":1
}
},
"required":[
"jwks_uri",
"iam_endpoint"
]
},
"Testing_Output_Data":{
"type":"array",
"items":{
"type":"object",
"properties":{
"method_id":{
"type":"string",
"description":"The id (operationId) of the
method (as indicated in the EXPOSED_API.paths field)"
},
"zip_data":{
"type":"string",
"description":"The URI to the zip testing
output data for each one exposed VDC method"
}
},
"additionalProperties":false,
"required":[
"method_id",
"zip_data"
]
},
"minItems":1,
"uniqueItems":true
}
},
"additionalProperties":false,
"required":[
"Overview",
"Data_Sources"
]
},
"DATA_MANAGEMENT":{
"description":"list of methods",
"type":"array",
"items":{
"type":"object",
"properties":{
"method_id":{
"description":"The id (operationId) of the method
(as indicated in the EXPOSED_API.paths field)",
"type":"string"
},
"attributes":{
"type":"object",
"description":"goal trees",
"properties":{
"dataUtility":{
"type":"array",
"items":{
"type":"object",
"description":"definition of the metric",
"properties":{
"id":{
"description":"id of the metric",
"type":"string"
},
"name":{
"description":"name of the metric",
"type":"string"
},
"type":{
"description":"type of the metric",
"type":"string"
},
"properties":{
"type":"object",
"description":"properties related
to the metric",
"additionalProperties":{
"type":"object",
"description":"properties
related to the metric",
"properties":{
"unit":{
"description":"unit of
measure of the property",
"type":"string"
},
"maximum":{
"description":"lower limit
of the offered property",
"type":"number"
},
"minimum":{
"description":"upper limit
of the offered property",
"type":"number"
},
"value":{
"description":"value of
the property",
"type":[
"string",
"number",
"array",
"boolean"
]
}
}
}
}
}
}
},
"security":{
"$ref":"#/properties/DATA_MANAGEMENT/items/properties/attributes/pro
perties/dataUtility"
},
"privacy":{
"$ref":"#/properties/DATA_MANAGEMENT/items/properties/attributes/pro
perties/dataUtility"
}
}
}
},
"required":[
"method_id",
"attributes"
]
}
},
"ABSTRACT_PROPERTIES":{
},
"COOKBOOK_APPENDIX":{
"description":"CookbookAppendix is the definition of the
Cookbook Appendix section in the blueprint",
"type":"object",
"required":[
"Resources",
"Deployment"
],
"properties":{
"Identity_Access_Management":{
"type":"object",
"properties":{
"validation_keys":{
"type":"array",
"items":{
"type":"object"
}
},
"mapping":{
"type":"array",
"items":{
"oneOf":[
{
"type":"object",
"properties":{
"provider":{
"type":"string"
},
"roles":{
"type":"array",
"items":{
"type":"string"
}
},
"role_map":{
"type":"array",
"items":{
"type":"object",
"properties":{
"matcher":{
"type":"string"
},
"roles":{
"type":"array",
"items":{
"type":"string"
}
},
"priority":{
"type":"number"
}
}
}
},
"mapping_url":{
"enum":[
""
]
}
},
"required":[
"role_map"
]
},
{
"type":"object",
"properties":{
"provider":{
"type":"string"
},
"roles":{
"type":"array",
"items":{
"type":"string"
}
},
"mapping_url":{
"type":"string"
},
"role_map":{
"enum":[
""
]
}
},
"required":[
"mapping_url"
]
}
]
}
}
},
"required":[
"mapping"
]
},
"Deployment":{
"description":"DeploymentInfo contains information of
a deployment than may compromise several clusters",
"type":"object",
"required":[
"id"
],
"properties":{
"extra_properties":{
"type":"object",
"title":"ExtraPropertiesType represents extra
properties to define for resources, infrastructures or deployments.
This properties are provisioner or deployment specific and they
should document them when they expect any.",
"additionalProperties":{
"type":"string"
}
},
"id":{
"description":"Unique ID for the deployment",
"type":"string",
"uniqueItems":true
},
"infrastructures":{
"description":"Lisf of infrastructures, each
one representing a different cluster.",
"type":"object",
"additionalProperties":{
"type":"object",
"title":"InfrastructureDeploymentInfo
contains information about a cluster of nodes that has been
instantiated or were already existing.",
"required":[
"id",
"type",
"provider",
"Nodes"
],
"properties":{
"Nodes":{
"description":"Set of nodes in the
infrastructure indexed by role",
"type":"object",
"additionalProperties":{
"type":"array",
"items":{
"description":"NodeInfo is the
information of a virtual machine that has been instantiated or a
physical one that was pre-existing",
"type":"object",
"required":[
"ip",
"drive_size"
],
"properties":{
"cores":{
"description":"Number of
cores.",
"type":"integer",
"format":"int64"
},
"cpu":{
"description":"CPU speed
in Mhz.",
"type":"integer",
"format":"int64"
},
"data_drives":{
"description":"Data drives
information",
"type":"array",
"items":{
"description":"DriveInfo is the information of a drive that has been
instantiated",
"type":"object",
"required":[
"name",
"size"
],
"properties":{
"name":{
"description":"Name of the data drive",
"type":"string",
"uniqueItems":true
},
"size":{
"description":"Size of the disk in bytes",
"type":"integer",
"format":"int64"
}
}
}
},
"drive_size":{
"description":"Size of the
boot disk in bytes",
"type":"integer",
"format":"int64",
"uniqueItems":true
},
"extra_properties":{
"type":"object",
"title":"ExtraPropertiesType represents extra properties to define
for resources, infrastructures or deployments. This properties are
provisioner or deployment specific and they should document them
when they expect any.",
"additionalProperties":{
"type":"string"
}
},
"hostname":{
"description":"Hostname of
the node.\nrequiered:true",
"type":"string",
"uniqueItems":true
},
"ip":{
"description":"IP assigned
to this node.",
"type":"string",
"uniqueItems":true
},
"ram":{
"description":"RAM
quantity in bytes.",
"type":"integer",
"format":"int64"
},
"role":{
"description":"Role of the
node. Master or slave in case of Kubernetes.",
"type":"string",
"example":"master"
}
}
}
}
},
"VDM":{
"description":"Set weather the VDM is
running in this cluster or not",
"type":"boolean"
},
"extra_properties":{
"type":"object",
"title":"ExtraPropertiesType
represents extra properties to define for resources, infrastructures
or deployments. This properties are provisioner or deployment
specific and they should document them when they expect any.",
"additionalProperties":{
"type":"string"
}
},
"id":{
"description":"Unique infrastructure
ID on the deployment",
"type":"string",
"uniqueItems":true
},
"name":{
"description":"Name of the
infrastructure",
"type":"string"
},
"provider":{
"description":"CloudProviderInfo
contains information about a cloud provider",
"type":"object",
"required":[
"api_endpoint"
],
"properties":{
"api_endpoint":{
"description":"Endpoint to use
for this infrastructure",
"type":"string"
},
"api_type":{
"description":"Type of the
infrastructure. i.e AWS, Cloudsigma, GCP or Edge",
"type":"string"
},
"credentials":{
"description":"Credentials to
access the cloud provider. Either this or secret_id is mandatory.
Each cloud provider should define the format of this element.",
"type":"object",
"additionalProperties":{
"type":"string"
}
},
"secret_id":{
"description":"Secret identifier
to use to log in to the infrastructure manager.",
"type":"string"
}
}
},
"status":{
"description":"Status of the
infrastructure",
"type":"string"
},
"type":{
"description":"Type of the
infrastructure: cloud or edge",
"type":"string",
"pattern":"cloud|edge"
},
"vdcs":{
"description":"Configuration of VDCs
running in the cluster, indexed by VDC identifier.",
"type":"object",
"additionalProperties":{
"description":"VDCInfo contains
information about related to a VDC running in a kubernetes cluster",
"type":"object",
"properties":{
"Ports":{
"type":"object",
"additionalProperties":{
"type":"integer",
"format":"int64"
}
}
}
}
}
}
}
},
"name":{
"description":"Name of the deployment",
"type":"string"
},
"status":{
"description":"Global status of the
deployment",
"type":"string"
}
}
},
"Resources":{
"description":"Deployment is a set of infrastructures
that need to be instantiated or configurated to form clusters",
"type":"object",
"required":[
"name",
"infrastructures"
],
"properties":{
"description":{
"description":"Optional description",
"type":"string"
},
"infrastructures":{
"description":"List of infrastructures to
deploy for this hybrid deployment",
"type":"array",
"items":{
"description":"InfrastructureType is a set
of resources that need to be created or configured to form a
cluster",
"type":"object",
"required":[
"name",
"resources"
],
"properties":{
"description":{
"description":"Optional description
for the infrastructure",
"type":"string"
},
"extra_properties":{
"type":"object",
"title":"ExtraPropertiesType
represents extra properties to define for resources, infrastructures
or deployments. This properties are provisioner or deployment
specific and they should document them when they expect any.",
"additionalProperties":{
"type":"string"
}
},
"name":{
"description":"Unique name for the
infrastructure",
"type":"string",
"uniqueItems":true
},
"provider":{
"description":"CloudProviderInfo
contains information about a cloud provider",
"type":"object",
"required":[
"api_endpoint"
],
"properties":{
"api_endpoint":{
"description":"Endpoint to use
for this infrastructure",
"type":"string"
},
"api_type":{
"description":"Type of the
infrastructure. i.e AWS, Cloudsigma, GCP or Edge",
"type":"string"
},
"credentials":{
"description":"Credentials to
access the cloud provider. Either this or secret_id is mandatory.
Each cloud provider should define the format of this element.",
"type":"object",
"additionalProperties":{
"type":"string"
}
},
"secret_id":{
"description":"Secret identifier
to use to log in to the infrastructure manager.",
"type":"string"
}
}
},
"resources":{
"description":"List of resources to
deploy",
"type":"array",
"items":{
"type":"object",
"title":"ResourceType has
information about a node that needs to be created by a deployer.",
"required":[
"name",
"disk",
"image_id"
],
"properties":{
"cores":{
"description":"Number of
cores. Ignored if type is provided",
"type":"integer",
"format":"int64"
},
"cpu":{
"description":"CPU speed in
Mhz. Ignored if type is provided",
"type":"integer",
"format":"int64"
},
"disk":{
"description":"Boot disk size
in Mb",
"type":"integer",
"format":"int64"
},
"drives":{
"description":"List of data
drives to attach to this VM",
"type":"array",
"items":{
"description":"Drive holds
information about a data drive attached to a node",
"type":"object",
"required":[
"name",
"size"
],
"properties":{
"name":{
"description":"Unique name for the drive",
"type":"string"
},
"size":{
"description":"Size
of the disk in Mb",
"type":"integer",
"format":"int64"
},
"type":{
"description":"Type
of the drive. It can be \"SSD\" or \"HDD\"",
"type":"string",
"pattern":"SSD|HDD",
"example":"SSD"
}
}
}
},
"extra_properties":{
"type":"object",
"title":"ExtraPropertiesType
represents extra properties to define for resources, infrastructures
or deployments. This properties are provisioner or deployment
specific and they should document them when they expect any.",
"additionalProperties":{
"type":"string"
}
},
"image_id":{
"description":"Boot image ID
to use",
"type":"string"
},
"ip":{
"description":"IP to assign
this VM. In case it's not specified, the first available one will be
used.",
"type":"string"
},
"name":{
"description":"Suffix for the
hostname. The real hostname will be formed of the infrastructure
name + resource name",
"type":"string",
"uniqueItems":true
},
"ram":{
"description":"RAM quantity
in Mb. Ignored if type is provided",
"type":"integer",
"format":"int64"
},
"role":{
"description":"Role that this
VM plays. In case of a Kubernetes deployment at least one \"master\"
is needed.",
"type":"string"
},
"type":{
"description":"Type of the VM
to create i.e. n1-small",
"type":"string",
"example":"n1-small"
}
}
}
},
"type":{
"description":"Type of the
infrastructure: Cloud or Edge: Cloud infrastructures mean that the
resources will be VMs that need to be instantiated. Edge means that
the infrastructure is already in place and its information will be
added to the database but no further work will be done by a
deployer.",
"type":"string"
}
}
}
},
"name":{
"description":"Name for this deployment",
"type":"string",
"uniqueItems":true
}
}
}
}
},
"EXPOSED_API":{
"title":"CAF API",
"type":"object",
"description":"The CAF RESTful API of the VDC, written
according to the current version (3.0.1) of the OpenAPI
Specification (OAS), but also adapted to DITAS requirements",
"properties":{
"paths":{
"type":"object",
"patternProperties":{
"^/":{
"type":"object",
"patternProperties":{
"^get$":{
"allOf":[
{
"$ref":"#/properties/EXPOSED_API/definitions/method"
},
{
"properties":{
"parameters":{
}
}
}
]
},
"^post$":{
"allOf":[
{
"$ref":"#/properties/EXPOSED_API/definitions/method"
},
{
"properties":{
"requestBody":{
"type":"object",
"properties":{
"content":{
"$ref":"#/properties/EXPOSED_API/definitions/content"
}
}
}
},
"required":[
"requestBody"
]
}
]
}
}
}
}
}
},
"definitions":{
"method":{
"title":"An Exposed VDC Method",
"type":"object",
"description":"Corresponds to the Operation Object
defined in the OpenAPI Specification (OAS) version 3.0.1",
"properties":{
"summary":{
},
"operationId":{
},
"responses":{
"type":"object",
"patternProperties":{
"^200$|^201$":{
"type":"object",
"properties":{
"content":{
"$ref":"#/properties/EXPOSED_API/definitions/content"
}
},
"required":[
"content"
]
}
}
},
"x-data-sources":{
"type":"array",
"description":"An array that contains all the
identifiers of the data sources (as indicated in the
© Main editor and other members of the DITAS consortium
84 D3.3 Data Virtualization SDK prototype
INTERNAL_STRUCTURE.Data_Sources field) that are accessed by the
method",
"items":{
"type":"string"
},
"minItems":1,
"uniqueItems":true
},
"x-iam-roles":{
"type":"array",
"items":{
"type":"string"
}
}
},
"required":[
"summary",
"operationId",
"responses",
"x-data-sources"
]
},
"content":{
"type":"object",
"patternProperties":{
"^application/json$":{
"type":"object",
"properties":{
"schema":{
"type":"object"
}
},
"required":[
"schema"
]
}
}
}
}
}
},
"additionalProperties":false,
"required":[
"INTERNAL_STRUCTURE",
"DATA_MANAGEMENT",
"ABSTRACT_PROPERTIES",
"COOKBOOK_APPENDIX",
"EXPOSED_API"
]
}
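For readability, the following is a small, hand-written fragment showing what a Deployment object conforming to the schema above could look like. All identifiers, addresses, endpoints and sizes are invented for illustration and do not come from a real DITAS deployment.

```json
{
  "Deployment": {
    "id": "deployment-001",
    "name": "example-deployment",
    "status": "running",
    "infrastructures": {
      "infra-1": {
        "id": "infra-1",
        "name": "cloud-cluster",
        "type": "cloud",
        "provider": {
          "api_endpoint": "https://provider.example.com/api",
          "api_type": "AWS",
          "secret_id": "example-secret"
        },
        "VDM": true,
        "Nodes": {
          "master": [
            {
              "ip": "10.0.0.1",
              "drive_size": 10737418240,
              "role": "master",
              "cores": 2,
              "ram": 4294967296
            }
          ]
        },
        "vdcs": {
          "vdc-1": {
            "Ports": {
              "caf": 8080
            }
          }
        }
      }
    }
  }
}
```

Note that the example respects the required fields of the schema: the deployment carries an "id", each infrastructure carries "id", "type", "provider" and "Nodes", the provider carries an "api_endpoint", and each node carries "ip" and "drive_size".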