PROJECT DELIVERABLE - CORDIS...Deliverable Due date (and month since project start): 2013-08-31, m9 Deliverable Version: v1.0 Ref. Ares(2013)2985670 - 03/09/2013 BiobankCloud – Object

BiobankCloud – Object model for biobank data sharing

317871

Project number: 317871

Project acronym: BIOBANKCLOUD

Project title: Scalable, Secure Storage Biobank

Project website URL: http://www.biobankcloud.com/

Project Coordinator Name and Organisation: Jim Dowling, KTH

E-mail: [email protected]

WORK PACKAGE 1 :

Regulatory and Ethical Requirements for Biobanking Data Storage and Analysis

Work Package Leader Name and Organisation: JAN-ERIC LITTON, Karolinska Institute (KI)

E-mail: [email protected]

PROJECT DELIVERABLE

D1.2 Object model for biobank data sharing

Deliverable Due date (and month since project start): 2013-08-31, m9 Deliverable Version: v1.0

Ref. Ares(2013)2985670 - 03/09/2013


Document history Version Date Changes By Reviewed

0.1 2013-06-11 First draft

Roxana Merino Martinez

Jane Reichel

Karin Zimmermann

Lora Dimitrova

Michael Hummel

0.1 First draft Jan-Eric Litton

2013-08-19 Final Jan-Eric Litton



Executive Summary

This deliverable is a continuation of D1.1, which provided the information model that the platform should handle in order to guarantee sharing of data among biobanks and researchers, as well as the ethics regulations involved in this issue. This deliverable provides the design of the standard forms required by the BiobankCloud Model Data Management Policy (MDMP) specified in D1.5. It also provides a more complete data model for biobank data sharing, based on MIABIS 2.0 and on additional requirements for sharing data related to omics experiments.



Table of Contents

1 Introduction
2 BiobankCloud Model Data Management Policy (MDMP): Standard forms
2.1 Organization application
2.1.1 Classes and properties
2.2 Individual membership application
2.2.1 Classes and properties
2.3 Standard Research Project Approval (SRPA)
2.4 Standard Information About Consent (SIAC)
2.5 Standard Information on Non-Consented data (SINC)
2.6 Agreement between the controller & processor
2.7 Summary
3 Data Model
3.1 Omics representation
3.2 Summary
4 Conclusions
5 References



1 Introduction

The implementation of the BiobankCloud platform requires a data model that determines the structure of the data. The data model provided by this deliverable is limited to the representation of objects and relationships related to biobank data. Other data models, related to security, storage, management, etc., will be handled by other work packages.

The BiobankCloud Model Data Management Policy (MDMP) specified in D1.5 concluded that six standard electronic forms for BiobankCloud membership request and data submission should be designed:

Organization application

Individual membership application

SRPA: Standard Research Project Approval

SIAC: Standard Information About Consent

SINC: Standard Information on Non-Consented data

Agreement between the controller & processor

Section 2 provides the design elements of these forms.

Section 3 covers some aspects of the data model specification based on

MIABIS 2.0: Minimum Information About Biobank data Sharing [1], and

also based on specific requirements of the platform. Several projects in

Nordic countries and Europe are adopting this formalism to represent

biobank data sharing. Making the BiobankCloud data model MIABIS

compliant will ensure integration with other biobank information systems.



2 BiobankCloud Model Data Management Policy (MDMP): Standard forms

Deliverable D1.5 specified the MDMP on the basis of the legal framework of the BiobankCloud platform. To guarantee controlled management of the data flow in the platform, six standard forms need to be designed to keep track of the use of data. D1.1 provided more information about what the standard forms should include.

The standard forms are implicit in the data model but need to be explicitly available as printable documents.

The standard forms related to personal data protection apply only to data generated by experiments done on human samples.

2.1 Organization application

This form collects information about an organization applying for membership of the platform. The organization provides the information through a requester, a person designated by the organization.

Required information:

Name of the organization

Organization Identification

Organization website

Organization contact person:

o Contact information (name, e-mail, telephone, address)

o Position

o Affiliation (Organization, Department)

The aim and purpose of the organization. If the organization pursues any aim other than biomedical research, this must be explicitly documented.

Form of organization (public or private)

Means of funding of the organization

All public bodies exercising supervision over the organization (Data Protection Authority, Health Boards, Boards for Higher Education, etc.)

2.1.1 Classes and properties

Table 1. Classes and properties involved in the Organization application

Entity Class Name | Attribute Name | Attribute Description | Attribute Type
Organization | Name | Name of the organization | Text
Organization | Number | Organization Identification number (VAT identification number) | Identifier
Organization | Website | URL | Text
Organization | OrganizationType | Public or Private | Text
Organization | Aim | Aim of the organization | Text
Organization | AimReference | Reference (document, link, contact information) if the organization pursues any aim other than biomedical research | Reference
Contact | FirstName | Contact person first name | Text
Contact | LastName | Contact person last name | Text
Contact | Email | Contact person email | Text
Contact | Position | Position in the organization | Text
Contact | Department | Department of the contact person | Text
Contact | Street | Street address | Text
Contact | ZipCode | Address Zip code | Text
Contact | City | Address city | Text
Contact | Phone | Contact person phone number | Text
Country | Name | Country name | Text
Country | Code | Country code | Text
PublicBody | Name | Name of public body | Text
PublicBody | Reference | Reference to the public body | Reference
MeansOfFunding | Name | Name of the means of funding | Text
MeansOfFunding | Reference | Reference to the means of funding | Reference
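The classes of Table 1 could be sketched as plain data classes, for instance as below. This is only an illustration under stated assumptions: the attribute names follow Table 1, but the associations between the classes (one contact per organization, one country per contact, lists of public bodies and means of funding) are assumed here, since they are defined by Fig. 1 rather than by the table. The Reference and Identifier types are simplified to strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Country:
    name: str
    code: str  # Table 1 does not fix the coding scheme


@dataclass
class Contact:
    first_name: str
    last_name: str
    email: str
    position: str
    department: str
    street: str
    zip_code: str
    city: str
    phone: str
    country: Country  # assumed association, not stated in Table 1


@dataclass
class PublicBody:
    name: str
    reference: str  # Reference type simplified to a string


@dataclass
class MeansOfFunding:
    name: str
    reference: str  # Reference type simplified to a string


@dataclass
class Organization:
    name: str
    number: str  # organization (VAT) identification number
    website: str
    organization_type: str  # "Public" or "Private"
    aim: str
    contact: Contact  # assumed: one contact person per application
    aim_reference: Optional[str] = None  # for aims other than biomedical research
    public_bodies: List[PublicBody] = field(default_factory=list)
    means_of_funding: List[MeansOfFunding] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Table 1 restricts OrganizationType to two values.
        if self.organization_type not in ("Public", "Private"):
            raise ValueError("organization_type must be 'Public' or 'Private'")
```

Keeping the supervising public bodies and the means of funding as lists reflects the requirement to report all of them, not only one.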



Fig.1 Classes involved in the Organization application

Once the access committee has approved an organization as a member, the contact person for the organization will be requested to appoint at

least one “controller”. The appointed “controller” (or controllers) will then

apply for individual membership.

2.2 Individual membership application

This form collects information about a researcher applying for membership. A researcher requesting membership of the platform has to belong to an organization whose membership has already been granted. As defined in D1.1 (Section 6.1), a researcher has one of two main roles: controller or trusted researcher. A controller is appointed by the organization and can store, analyse and download data to or from the platform. A trusted researcher needs to be approved by the controller and can only analyse data and download analysis results associated with the data uploaded by the controller.

Required general information for controller and trusted researcher:

Contact information (name, e-mail, telephone, address)

Position

Affiliation (Organization, Department)

The platform provides two main services: storage and analysis. To use the platform services, the researcher should request the specific service after logging in to the platform. Only appointed controllers can upload data to the platform.

Required information for controller:

Purpose of the use of the platform (storage, analysis, both)

Scientific aims of the investigation

Standard Research Project Approval (SRPA) (Reference)

Agreement between controller and processor

2.2.1 Classes and properties

Table 2. Classes and properties involved in the individual membership application

Entity Class Name | Attribute Name | Attribute Description | Attribute Type
Organization | Organization | Organization already accepted as member | Organization
Contact | FirstName | Contact person first name | Text
Contact | LastName | Contact person last name | Text
Contact | Email | Contact person email | Text
Contact | Position | Position in the organization | Text
Contact | Department | Department of the contact person | Text
Contact | Street | Street address | Text
Contact | ZipCode | Address Zip code | Text
Contact | City | Address city | Text
Country | Name | Country name | Text
Country | Code | Country code | Text
Researcher | Agreement | Reference to document agreement between controller and processor | Reference
Researcher | Purpose [*] | Storage, Analysis, both | Identifier
Study | Aim | Scientific aims of the investigation | Text
Study | SRPA | Reference to the Standard Research Project Approval | Reference



The property “Purpose” is limited by the role of the user. A researcher who requires storage capability has to be approved as a “Controller” by the organization that the researcher belongs to.

Once a researcher has been granted the controller role, the “storage”, “analysis” and “download” functionalities are available to this researcher.

A trusted researcher has access to the “analysis” and “download” services. The “download” is limited to the analysis results. This applies if the data is related to human samples. Otherwise, the trusted researcher can use the rest of the functionalities.
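The restriction of “Purpose” by role can be illustrated with a small lookup, sketched below for data related to human samples. The names Role, Service, SERVICES_BY_ROLE and may_request are hypothetical and chosen for this example only; they are not part of the platform specification.

```python
from enum import Enum


class Role(Enum):
    CONTROLLER = "controller"
    TRUSTED_RESEARCHER = "trusted researcher"


class Service(Enum):
    STORAGE = "storage"
    ANALYSIS = "analysis"
    DOWNLOAD = "download"


# Services available per role, as described for data on human samples.
# For a trusted researcher, "download" covers analysis results only.
SERVICES_BY_ROLE = {
    Role.CONTROLLER: {Service.STORAGE, Service.ANALYSIS, Service.DOWNLOAD},
    Role.TRUSTED_RESEARCHER: {Service.ANALYSIS, Service.DOWNLOAD},
}


def may_request(role: Role, service: Service) -> bool:
    """Return True if a researcher holding `role` may request `service`."""
    return service in SERVICES_BY_ROLE[role]
```

With this table, a storage request from a trusted researcher is rejected until the organization has appointed that researcher as a controller.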

D1.1, Section 4.2, defined the data model for biobank data sharing based on MIABIS 2.0. The standard forms should be related to the entity “study” as specified in Table 2. The Aim and SRPA properties are provided each time a new study is created in the platform. The SIAC and SINC are provided if the study is done on human samples.

2.3 Standard Research Project Approval (SRPA)

Each study has at least one SRPA related to it. When a controller requests a storage service, the system will request the SRPA information.

Class and properties

Table 3. Class SRPA

Entity Class Name | Attribute Name | Attribute Description | Attribute Type
SRPA | Registration date | When the form was registered in the system | Date
SRPA | Controller | Controller reference | Contact
SRPA | Document | Reference to the research approval document | Reference

2.4 Standard Information About Consent (SIAC)

Each study on human samples has at least one SIAC related to it. When a controller requests a storage service, the system will request the SIAC information.

Table 4. Class SIAC

Entity Class Name | Attribute Name | Attribute Description | Attribute Type
SIAC | | | Date
SIAC | Document | Reference to the information about consent document | Reference
SIAC | Limit of use | Limit of time for storing the data | Date
SIAC | Remark | Comment or remark about use of data | Text

2.5 Standard Information on Non-Consented data (SINC)

D1.1, Section 6.2, provides the use case “Store study descriptive metadata”. If not all the human samples have been consented for the study, a SINC is required.

Table 5. Class SINC

Entity Class Name | Attribute Name | Attribute Description | Attribute Type
SINC | | | Date
SINC | Document | Reference to the information about non-consented data document | Reference
SINC | Limit of use | Limit of time for storing the data | Date
SINC | Remark | Comment or remark about use of data | Text
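The rules in Sections 2.3 to 2.5 on which forms a study needs can be summarised in a few lines. The following is a minimal sketch; the function name and its parameters are hypothetical and only restate the conditions given above.

```python
from typing import List


def required_forms(human_samples: bool, all_samples_consented: bool = True) -> List[str]:
    """List the standard forms the system should request for a study."""
    forms = ["SRPA"]  # every study has at least one SRPA
    if human_samples:
        forms.append("SIAC")  # consent information for human samples
        if not all_samples_consented:
            forms.append("SINC")  # some human samples lack consent
    return forms
```

For a study on cell lines the function returns only the SRPA, while a human study with partly non-consented samples requires all three forms.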



2.6 Agreement between the controller & processor

As stated in Deliverable 1.5, Section 2.3, the Data Protection Directive defines the controller as ‘the natural or legal person, public authority, agency or any other body which alone or jointly with others determines the purposes and means of the processing of personal data’ (Article 2.d). A processor is ‘the natural or legal person, public authority, agency or any other body which processes personal data on behalf of the controller’ (Article 2.e). According to Article 17.1 of the Data Protection Directive, the Member States are to provide that the controller must implement appropriate technical and organizational measures to protect personal data. If the controller does not conduct the processing him/herself but leaves this to a processor, Article 17.3 stipulates that the processing of data must be governed by a contract or legal act binding the processor to the controller and stipulating that the processor shall act only on instructions from the controller and in conformity with the measures set out by the Member States in accordance with Article 17.1.

For the sake of establishing a transparent and reliable chain of command between controller and processors, the BiobankCloud should take care to allocate all responsibilities stemming from the Data Protection Directive in a clear and concise manner¹. The agreements entered into by the BiobankCloud with the users wishing to upload and analyse data, at this stage Charité in Germany, should therefore state the division of labour between the parties, making clear that it is the user, Charité, that continues to be the controller of the data, and that the BiobankCloud is the processor [10].

It is a future task of this work package to develop a draft model agreement for the users/controller to enter into with the platform/processor.

¹ See further Article 29 Working Party Group, Opinion 5/2012 on Cloud Computing, adopted on 1st of July 2012, p. 8 and 12.



Fig 2. Class association representing standard forms



2.7 Summary

The standard forms are meant to hold information required by the BiobankCloud Model Data Management Policy (MDMP).

The SIAC and SINC are needed only for data related to experiments done on human samples.

The SRPA form is required for all studies uploaded to the platform.

The agreement between controller and processor is not related to the study but to the researcher who holds a controller role.

A controller for the data of one study can be a trusted researcher for the data of another study.

The controller of the study data provides the information needed by SRPA, SIAC and SINC.

The contact person of an organization provides the information for the Organization application.

All researchers (controller or trusted researcher) need to provide the personal information required for the Individual membership application, as specified in Section 2.2.

Deliverable 1.3 will provide a draft model agreement between controller and processor.

It is advisable to use specific entities holding the information related to the standard forms (Fig 2) to facilitate searching.

For non-human experiment data, the controller role does not have the same relevance, but it would be good practice to use this role as the owner of the data, who decides which data can be shared and by whom. A researcher can use the trusted researcher role to analyse and download data once the data has been authorized for sharing. Only the owner of the data can decide how to share his/her data.



3 Data Model

D1.1 presented the first draft of the data model for biobank data management in the BiobankCloud platform, based on MIABIS 2.0. Some modifications are needed to adapt the model to the platform requirements.

Fig 3. The white blocks represent extensions to be made to MIABIS 2.0

MIABIS 2.0 [1] is designed for sharing human samples. The donors and the samples are represented at the aggregated level, and the species and anatomical location of the sample are not included in the model.

As shown in Fig 3, each Study is associated with a set of Donors and a set of Samples. One solution is for each set of Samples to carry the species and anatomical site. Because the Samples are represented at the aggregated level, the Species and Anatomical sites will also be represented at the aggregated level. The species and anatomical sites should then also be specified in each omics descriptive metadata, as shown in Fig 4.



Fig 4. Species and anatomical part properties at aggregated and data levels. Properties in

class Study should be defined based on MIABIS 2.0 (table 6)

Having Species and Anatomical part at both levels (aggregated and data) lets a query engine search across studies without having to search the omics descriptive metadata.

We suggest using the species list from the Ensembl database [3] or the NCBI Taxonomy Database to define the species annotation.

Specifying the anatomical parts that describe the origin of a sample is a complicated matter, even more so if the system handles several species. A simplified solution could be to create a dictionary of anatomical terms used by the system, with the flexibility to add new anatomical parts when required. For a controlled vocabulary, it is advisable to use terms from established ontologies or standards such as NCI Thesaurus [5], SNOMED CT [6], AEO [7], etc.
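The simplified solution, a dictionary of anatomical terms that can grow over time, might look like the sketch below. The class name and its methods are hypothetical. The optional identifier is meant to hold a term code from whichever ontology is adopted, and the values used in any example are placeholders rather than real ontology codes.

```python
from typing import Dict, Optional


class AnatomicalDictionary:
    """Extensible dictionary of anatomical terms used by the system."""

    def __init__(self) -> None:
        # Maps a normalized label to an optional ontology identifier.
        self._terms: Dict[str, Optional[str]] = {}

    @staticmethod
    def _normalize(label: str) -> str:
        # Case-insensitive, with surrounding and repeated whitespace collapsed.
        return " ".join(label.lower().split())

    def add(self, label: str, ontology_id: Optional[str] = None) -> None:
        key = self._normalize(label)
        if not key:
            raise ValueError("an anatomical term needs a non-empty label")
        # Keep the first identifier registered for a term.
        self._terms.setdefault(key, ontology_id)

    def __contains__(self, label: str) -> bool:
        return self._normalize(label) in self._terms

    def ontology_id(self, label: str) -> Optional[str]:
        return self._terms.get(self._normalize(label))
```

A term can be registered first without an identifier and mapped to an ontology later, which matches the need to add anatomical parts as new species are handled.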

For the aggregated level of the data model (Fig 1), the classes Biobank,

SampleCollection and Study should be defined according to MIABIS 2.0

(Table 6). For more information:

http://bbmri-wiki.wikidot.com/en:dataset

Name | Possible values | Explanation | Data level
Biobank ID | Text | Text string of letters starting with the country code (according to standard ISO 3166-1 alpha-2) followed by the underscore “_” and post-fixed by a biobank ID or name specified by its juristic person (nationally specific) | Biobank
Name of biobank | Free text in English | Text string of letters denoting the name of the biobank in the local language | Biobank
ContactPerson | Structured data | Name, email, address, phone number, affiliation | Biobank


Sample Collection ID | Free text in any language | Text string depicting the unique ID or acronym for the sample collection or study | Sample Collection
ContactPerson | Structured data | Name, email, address, phone number, affiliation | Sample Collection
Sample Collection Description | Free text in English | Text string of letters describing the sample collection or study aim (max 200 characters) | Sample Collection
Sample Collection Responsible | Text | Text string of letters denoting the name of the sample collection responsible or principal investigator | Sample Collection
Sample Collection Contact Information | Text | Structured information about the contact person including address, phone, email, organization, department | Sample Collection
Type of Collection | Case-control, Cohort, Cross-sectional, Longitudinal, Twin-study, Quality control, Population-based, Disease specific, Other | Text string of letters denoting the type of sample collection or study design. Can be one or several of the following values: Case-control, Cohort, Cross-sectional, Longitudinal, Twin-study, Quality control, Population-based, Other. Definitions for the values are as follows: Case-control = A case-control study design compares two groups of subjects: those with the disease or condition under study (cases) and a very similar group of subjects who do not have the disease or condition (controls), Cohort = A group of individuals identified by a common characteristic (e.g. demographic, exposures, illness etc.), Cross-sectional = A study in which participants are examined at only a single time for characteristics of a disease, Longitudinal = Research studies involving repeated observations of the same entity over time. In the biobank context, longitudinal studies sample a group of people in a given time period, and study them at intervals by the acquisition and analyses of data and/or samples over time, Twin-study = A twin study design is a study design in behavior genetics which aids the study of individual differences between genetically identical twins by highlighting the role of environmental and genetic causes on behavior, Quality Control = A quality control testing study design type is where some aspect of the experiment is quality controlled for the purposes of quality assurance and Population-based = Multidisciplinary study done at the population level or among the population groups, generally to find the cause, incidence or spread of the disease or to see the response to the treatment, nutrition or environment, Disease specific = A study or biobank for which material and information is collected from subjects that have already developed a particular disease, Other | Sample Collection

Collection start | Date | Date in ISO-standard (8601) time format specifying when the sample collection starts | Sample Collection
Collection end | Date | Date in ISO-standard (8601) time format specifying when the sample collection ends, if applicable | Sample Collection
Sex | Female, Male | Text string of letters denoting the sex of the sample donors. Can be one or both of the following values: Female, Male | Sample Collection
Age interval | [001-999] | Age interval of youngest to oldest participant in sample collection | Sample Collection
Average age | Real | Average age of all sample donors in the sample collection | Sample Collection
Main diagnosis | Text | Diagnosis system defined by RD-Connect. Can be several values | Sample Collection
Categories of data collected | Biological samples, Register data, Survey data, Physiological measurements, Imaging data, Medical records, Other | Can be one or several of the following values: Biological samples, Register data, Survey data, Physiological measurements, Imaging data, Medical records, Other | Sample Collection
Material type | Whole blood, Plasma, Serum, Urine, Saliva, CSF, DNA, RNA, Tissue, Faeces, Other | Text string of letters denoting the nature of the biological samples that make up the sample collection. Can be one or several of the following values: Whole blood, Plasma, Serum, Urine, Saliva, CSF, DNA, RNA, Tissue, Faeces, Other | Sample Collection



Survey data | Individual Disease History, Individual History of Injuries, Medication, Perception of Health, Women's Health, Reproductive History, Familial Disease History, Life Habits/Behaviours, Sociodemographic Characteristics, Socioeconomic Characteristics, Physical Environment, Mental Health, Other | Text string of letters covering additional information existing about the sample donors. Can be one or several of the following values: Individual Disease History, Individual History of Injuries, Medication, Perception of Health, Women's Health, Reproductive History, Familial Disease History, Life Habits/Behaviors, Sociodemographic Characteristics, Socioeconomic Characteristics, Physical Environment, Mental Health, Other. | Sample Collection
Medical records | Text | Free text specifying which medical records are available in the sample collection/study | Sample Collection
Registers | Text | Free text specifying which registers are available in the sample collection/study | Sample Collection
Sample handling | Text | Text string of letters describing how the samples in the sample collection have been handled as an indication of sample quality. Can be one or several of the following values: Freeze chain, indicating if the samples in the collection have been kept cool from needle to freezer. Freeze time, time in hours from needle to freezer. SPREC compliant, if the samples are labeled according to SPREC, Other | Sample Collection
Current sampled individuals | Integer | Number of individuals with biological samples in the study at the date of Last update (also see Planned sampled individuals) | Sample Collection
Current total individuals | Integer | Total number of individuals in the study at the date of Last updated (also see Planned total individuals) | Sample Collection
Hosting biobank | Text | Text string of letters of the biobank/s storing the biological samples that are part of the sample collection. Can be several | Sample Collection
Date of entry | Date | Date in ISO-standard (8601) time format when data about the sample collection was reported into a database | Sample Collection
Last updated | Date | Date in ISO-standard (8601) time format when data about the sample collection was last updated in a database | Sample Collection
Study ID* | Free text in any language | Text string depicting the unique ID or acronym for the study. Can be generated by the system | Study



Study Description* | Free text in English | Text string of letters describing the sample collection or study aim (max 200 characters) | Study
Principal Investigator* | Text | Text string of letters denoting the name of the sample collection responsible or principal investigator | Study
Study Contact Person* | Text | Structured information about the contact person including address, phone, email, organization, department | Study
Sex | Female, Male | Text string of letters denoting the sex of the sample donors. Can be one or both of the following values: Female, Male | Study
Age interval* | [001 – 999] | Age interval of youngest to oldest participant in sample collection | Study
Main diagnosis* | Text | ICD-10 codes for the studied diagnoses. Can be several values | Study
Categories of data collected | Biological samples, Register data, Survey data, Physiological measurements, Imaging data, Medical records, Other | Can be one or several of the following values: Biological samples, Register data, Survey data, Physiological measurements, Imaging data, Medical records, Other | Study
Material type* | Whole blood, Plasma, Serum, Urine, Saliva, CSF, DNA, RNA, Tissue, Faeces, Cell line, Other | Text string of letters denoting the nature of the biological samples that make up the sample collection. Can be one or several of the following values: Whole blood, Plasma, Serum, Urine, Saliva, CSF, DNA, RNA, Tissue, Faeces, Cell line, Other | Study
Survey data | Individual Disease History, Individual History of Injuries, Medication, Perception of Health, Women's Health, Reproductive History, Familial Disease History, Life Habits/Behaviours, Sociodemographic Characteristics, Socioeconomic Characteristics, Physical Environment, Mental Health, Other | Text string of letters covering additional information existing about the sample donors. Can be one or several of the following values: Individual Disease History, Individual History of Injuries, Medication, Perception of Health, Women's Health, Reproductive History, Familial Disease History, Life Habits/Behaviors, Sociodemographic Characteristics, Socioeconomic Characteristics, Physical Environment, Mental Health, Other. | Study
Current sampled individuals | Integer | Number of individuals with biological samples in the study at the date of Last updated (also see Planned sampled individuals) | Study
Current total individuals | Integer | Total number of individuals in the study at the date of Last updated (also see Planned total individuals) | Study
Study name* | Free text in English | Text string of letters denoting the name of the study in English | Study
Planned sampled individuals | Integer | Number of individuals with biological samples planned for the study (also see Current sampled individuals) | Study
Planned total individuals | Integer | Number of individuals planned for the study (also see Current total individuals) | Study
Comorbidity | Yes/No | Text string of letters indicating if information about comorbidity is available. Can be Yes or No | Study



Storage temperature | Room temperature, +4 °C, -18 °C to -35 °C, -60 °C to -85 °C, Liquid nitrogen, Other | Text string of letters with the temperature for the long-term storage of the biospecimens in the sample collection. Can be one or several of the following values: Room temperature, +4 °C, -18 °C to -35 °C, -60 °C to -85 °C, Liquid nitrogen, Other. The intervals are chosen according to SPREC. | Study
Omics experiments* | Genomics, Transcriptomics, Proteomics, Metabolomics, Lipidomics, Other | Text string of letters denoting the -omics experiment(s) that have been performed on the samples in the sample collection. Can be one or several of the following values: Genomics, Transcriptomics, Proteomics, Metabolomics, Other. Definitions for the values are as follows: Genomics = The study of an organism's entire genome. Transcriptomics = The study of the transcription, i.e., the expression levels of mRNAs in a given organism, tissue, etc. (under a specific set of conditions). Proteomics = The study of proteins, their structures, and their functions, namely the study of the proteome and Metabolomics = The identification, quantification, and characterization of the small molecule metabolites in the metabolome (i.e., the set of all small molecule metabolites found in a specific cell, organ, or organism), Other | Study

Table 6. MIABIS 2.0: Biobank, Sample Collection and Study (* suggested mandatory

data elements)
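As an illustration of how the data elements of Table 6 could be checked on submission, the sketch below validates a Study record against the elements marked with * and against the Material type value list, and checks the Biobank ID pattern. All names are hypothetical. The record is assumed to be a dictionary keyed by the element names of Table 6, with Material type given as a list, and the assumption that the country code is written in upper case is ours.

```python
import re
from typing import Dict, List

# Study-level elements marked with * in Table 6.
MANDATORY_STUDY_ELEMENTS = [
    "Study ID", "Study name", "Study Description", "Principal Investigator",
    "Study Contact Person", "Age interval", "Main diagnosis",
    "Material type", "Omics experiments",
]

# Study-level value list for "Material type".
MATERIAL_TYPES = {
    "Whole blood", "Plasma", "Serum", "Urine", "Saliva", "CSF",
    "DNA", "RNA", "Tissue", "Faeces", "Cell line", "Other",
}

# Two-letter country code, underscore, then a biobank ID or name.
# Upper-case country codes are an assumption of this sketch.
BIOBANK_ID = re.compile(r"^[A-Z]{2}_.+$")


def is_valid_biobank_id(biobank_id: str) -> bool:
    return BIOBANK_ID.match(biobank_id) is not None


def validate_study(record: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for name in MANDATORY_STUDY_ELEMENTS:
        if not record.get(name):
            problems.append(f"missing mandatory element: {name}")
    # "Material type" is expected to be a list of values.
    for value in record.get("Material type") or []:
        if value not in MATERIAL_TYPES:
            problems.append(f"unknown material type: {value}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a submission form report all missing elements at once.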

Table 6 contains the common data elements for sharing data among biobanks according to MIABIS 2.0. For systems like the BiobankCloud platform, where the main aim is the analysis of experimental data, the most relevant information is related to the “Study” data level. Information related to biobanks and sample collections could be difficult to collect. Even if it is not realistic to collect information about the biobanks and sample collections, it is advisable to keep those data levels, even when most of the information is missing. The most relevant information at both levels, biobank and sample collection, is the contact reference for finding samples or information about the samples.



Fig 5. Main classes and relationships: MIABIS in yellow, Omics in pink, Regulations in

blue, Standard forms in grey and Management in white.

In D1.1, Section 6.2, the use cases were specified for studies done on human samples. Keeping in mind that the data to be uploaded can be related to animal experiments or cell lines, similar use cases should be designed to upload and analyse data from non-human samples. In this case, the forms related to consent (SIAC, SINC) are not required.

3.1 Omics representation

The analysis pipelines to be initially implemented in the platform belong to the genomics and transcriptomics disciplines.

At least three types of experiments are going to be analyzed by the

platform:

whole genome

ChIP-seq



transcriptome extraction

Fig 3 shows the hierarchical representation of the data. Omics data is associated with descriptive metadata, datasets and results. Only the datasets can contain sensitive data. For the three experiment types, they contain genome sequences, short DNA sequences and RNA sequences, respectively.

The sequences are pre-processed (alignment, splice alignment, filtering) and, depending on the type of genomics experiment, other processes are executed that generate more results. For instance, from the whole genome it is possible to decode a whole new organism. From the alignments, the analyses can be: SNP calling (to get genetic markers), peak calling (for epigenetics and Transcription Factor Binding Sites (TFBS)) and read assembly (for novel transcribed regions, differential expression, alternative splice sites, alternative promoters and terminators). From splice alignments, read assembly to get novel isoforms and fusion proteins is possible (Fig 6).

Fig 6. Overview of analysis pipelines to be implemented in the platform
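The relationship between pre-processing outputs and the analyses that can follow them, as described above, can be written down as a simple lookup. This is a sketch only: the dictionary and function names are hypothetical, and the mapping restates the text rather than the full content of Fig 6.

```python
from typing import Dict, Iterable, List

# Analyses that can follow each pre-processing output, as listed in the text.
ANALYSES_BY_PREPROCESSING: Dict[str, List[str]] = {
    "alignment": ["SNP calling", "peak calling", "read assembly"],
    "splice alignment": ["read assembly"],
}


def available_analyses(preprocessing_steps: Iterable[str]) -> List[str]:
    """Collect the analyses enabled by the given steps, without duplicates."""
    analyses: List[str] = []
    for step in preprocessing_steps:
        for analysis in ANALYSES_BY_PREPROCESSING.get(step, []):
            if analysis not in analyses:
                analyses.append(analysis)
    return analyses
```

Steps that are not keys of the lookup, such as filtering, simply enable no further analysis in this sketch.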



In relation to omics data storage and analysis, this deliverable only covers the issues regarding personal data protection and biobank data sharing.

D1.5, Section 3.4.3, states: “For each omics dataset a specification of the anonymization method should be provided (if needed)”. As stated in Deliverable 1.5, Section 1.5, there may be occasions where non-consented data will be processed within the platform, if this is in accordance with the law of the controller. This may be the case, for example, with data that was collected a long time ago, where the data subject is deceased, or in the case of data on anonymous cell lines. These data may fall outside the definition of personal data according to Article 2.1.a of the Data Protection Directive. No further anonymization will then be necessary.

Another issue to take care of is the download function. Only the controller can download the original datasets. The analysis results can be downloaded by both the controller and trusted researchers authorized to download data from the specific study (Fig 7).

If the omics experiment has been done on non-human species or cell

lines, all the data can be downloaded, including descriptive metadata,

datasets and analysis results. The controller grants these permissions.
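The download rules described above can be summarised in a small permission check. This is a minimal sketch under stated assumptions, not the platform's access-control implementation: `Study`, `Item` and `may_download` are hypothetical names, users are identified by plain strings, and the text does not say whether trusted researchers may download descriptive metadata of human studies, so the sketch conservatively refuses it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Item(Enum):
    """Kinds of omics item that can be requested for download."""
    DESCRIPTIVE_METADATA = "descriptive metadata"
    DATASET = "dataset"
    ANALYSIS_RESULT = "analysis result"


@dataclass
class Study:
    """Hypothetical study record holding only what the check needs."""
    controller: str
    human_samples: bool
    # Trusted researchers the controller has authorized for this study.
    authorized_researchers: Set[str] = field(default_factory=set)


def may_download(study: Study, user: str, item: Item) -> bool:
    """Return True if `user` may download `item` from `study`."""
    # The controller can download everything, including original datasets.
    if user == study.controller:
        return True
    # Anyone else must have been authorized by the controller.
    if user not in study.authorized_researchers:
        return False
    # Non-human species or cell lines: all data can be downloaded.
    if not study.human_samples:
        return True
    # Human samples: only analysis results (assumption: metadata refused).
    return item is Item.ANALYSIS_RESULT
```

For example, with `Study("alice", True, {"bob"})`, the call `may_download(study, "bob", Item.DATASET)` is refused, whereas the same request on a cell-line study (`human_samples=False`) is allowed.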

Fig 7. Properties to specify omics data, analyses and results



Fig 7 proposes a class diagram for omics data. The Reference property could be a reference to a file or any other form of relationship to the data. For instance, the DescriptiveMetadata can reference an XML file describing the omics experiment based on an established standard.

For the reporting, exchange, and management of omics data, it is suggested to use a standard specification for the omics type in question. For instance, microarray experiments can use MIAME [8][9] from the Functional Genomics Data Society (FGED). The information could be associated with the descriptive metadata as extended XML, RDF, etc.

Some standards can be found at:

http://mibbi.sourceforge.net/portal.shtml

Even though the platform will only implement analysis pipelines for some omics, it is a good idea to keep the model as abstract as possible so that new omics and new analyses can be added. This could pave the way to cross-experiment data analysis in the platform.

A suggestion is to keep the Analysis class as simple as possible and

maintain a reference to a semantic representation of the analysis (XML,

RDF, etc.).
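To make the idea of an abstract, reference-based model more concrete, the classes could be sketched as plain data holders. The sketch below is an assumption for illustration only: the class names follow the elements named in the text (descriptive metadata, datasets, analyses, results), but the attribute names, the `anonymization_method` field and the example file names are hypothetical and are not taken from Fig 7.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DescriptiveMetadata:
    # Reference to a description of the experiment, e.g. an XML file
    # based on an established standard.
    reference: str


@dataclass
class Dataset:
    # Reference to the stored sequences; only datasets can hold
    # sensitive data.
    reference: str
    # Anonymization method, if one is needed (None otherwise).
    anonymization_method: Optional[str] = None


@dataclass
class Analysis:
    name: str
    # Reference to a semantic representation of the analysis
    # (XML, RDF, etc.), keeping the class itself simple.
    reference: str


@dataclass
class Result:
    reference: str
    analysis: Analysis


@dataclass
class OmicsData:
    omics_type: str
    descriptive_metadata: DescriptiveMetadata
    datasets: List[Dataset] = field(default_factory=list)
    results: List[Result] = field(default_factory=list)


# Usage with made-up file names.
experiment = OmicsData(
    omics_type="transcriptomics",
    descriptive_metadata=DescriptiveMetadata("experiment_description.xml"),
)
experiment.datasets.append(Dataset("rna_reads.fastq"))
assembly = Analysis("read assembly", "read_assembly.rdf")
experiment.results.append(Result("isoforms.tsv", assembly))
```

Because each class carries only a reference, a new omics type or analysis would need no new classes, only new referenced descriptions.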

Fig 8. Classes related to Study


MIABIS 2.0 includes additional information related to publications generated by omics experiments (Fig 8). It is suggested to add these elements to the data model. They can help to determine whether the data can be made available for sharing and to obtain more specific information about the study (http://bbmri-wiki.wikidot.com/en:dataset-study#toc2).

3.2 Summary

The proposed data model is based on MIABIS 2.0. MIABIS was designed to facilitate data sharing among biobanks and researchers. The central data element is the study. In order to make the BiobankCloud platform MIABIS compliant, some issues should be taken into account:

In the proposed data model, only the Dataset element can contain personal data. The data elements related to MIABIS 2.0 are defined at aggregated and metadata levels and provide enough information related to studies done on biobanked samples. Information about species and anatomical parts related to the origin of the sample should be defined at aggregated and data levels.

It is advisable to use a controlled vocabulary for the annotation of species and anatomical parts.

Sex and age for non-human species should be specified differently (e.g. age can be defined in units other than years).

Information about biobanks and sample collections can be difficult for the users of the platform to collect. The model should be flexible enough to capture the relevant information in the study and keep the possibility of updating biobanks and sample collections at any time.

It is desirable to follow the MIABIS structure for future integration with biobank management systems.

Due to the diversity of omics data definitions and formats, it is suggested to keep the model as abstract as possible for the future adoption of new omics analysis pipelines and for data analysis integration with other informatics platforms.

Use controlled vocabulary and standards for omics data annotation.


4 Conclusions

This deliverable extends D1.1 and covers details about the design of the data model for data storage and analysis in the platform. It provides WP2 and WP3 with guidelines for implementing the security and storage data structures, as well as directions on how to implement data searching and sharing.

Section 2 defines the design of the standard forms required by the BiobankCloud Model Data Management Policy. Section 3 provides specific guidelines on how to adapt MIABIS 2.0 to the platform and also provides suggestions about the use of standards for data representation.

The data model does not capture all the system requirements, only those related to user interaction with the platform and to processes involving data protection and biobank data management and sharing.

The “Agreement between the controller & processor” will be provided in deliverable 1.3.

New regulations regarding ethical issues should be added to the

BiobankCloud ethical framework with the help of the BiobankCloud Ethical

Board.



5 References

1. A Minimum Data Set for Sharing Biobank Samples, Information, and Data: MIABIS, Biopreservation and Biobanking. August 2012, 10(4): 343-348. doi:10.1089/bio.2012.0003

2. The Role of “Roles” in Use Case Diagrams. http://infoscience.epfl.ch/record/268/files/WegmannG00.pdf

3. The Ensembl Project: http://www.ensembl.org/index.html

4. Functional Genomics Data Society: http://www.fged.org/

5. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007 Feb;40(1):30-43. Epub 2006 Mar 15. http://www.ncbi.nlm.nih.gov/pubmed/16697710

6. Snomed CT implementation. Mapping guidelines facilitating reuse of data. Methods Inf Med. 2012 Dec 4;51(6):529-38. doi: 10.3414/ME11-02-0023. Epub 2012 Oct 1.

7. The AEO, an Ontology of Anatomical Entities for Classifying Animal Tissues and Organs. 2012;3:18. doi: 10.3389/fgene.2012.00018. Epub 2012 Feb 14. http://www.ncbi.nlm.nih.gov/pubmed/22347883

8. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001 Dec;29(4):365-71.

9. The minimum information about a genome sequence (MIGS) specification. Nature Biotechnology 26, 541 - 547 (2008) Published online: 8 May 2008 | doi:10.1038/nbt1360

10. Article 29 Working Party Group, Opinion 5/2012 on Cloud Computing, adopted on 1st of July 2012
