BiobankCloud – Object model for biobank data sharing page 1/28
317871
Project number: 317871
Project acronym: BIOBANKCLOUD
Project title: Scalable, Secure Storage Biobank
Project website URL: http://www.biobankcloud.com/
Project Coordinator Name and Organisation: Jim Dowling, KTH
E-mail: [email protected]
WORK PACKAGE 1 :
Regulatory and Ethical Requirements for Biobanking Data Storage and Analysis
Work Package Leader Name and Organisation: JAN-ERIC LITTON, Karolinska Institute (KI)
E-mail: [email protected]
PROJECT DELIVERABLE
D1.2 Object model for biobank data sharing
Deliverable Due date (and month since project start): 2013-08-31, m9
Deliverable Version: v1.0
Ref. Ares(2013)2985670 - 03/09/2013
Document history
Version Date Changes By Reviewed
0.1 2013-06-11 First draft
Roxana Merino Martinez
Jane Reichel
Karin Zimmermann
Lora Dimitrova
Michael Hummel
0.1 First draft Jan-Eric Litton
2013-08-19 Final Jan-Eric Litton
Executive Summary
This deliverable is a continuation of D1.1, which provided the information
model that the platform should handle in order to guarantee sharing of data
among biobanks and researchers, as well as the ethics regulations involved.
This deliverable provides the design of the standard forms required by the
BiobankCloud Model Data Management Policy (MDMP) specified in D1.5. It also
provides a more complete data model for biobank data sharing, based on
MIABIS 2.0 and on additional requirements for sharing data related to omics
experiments.
Table of Contents
1 Introduction
2 BiobankCloud Model Data Management Policy (MDMP): Standard forms
2.1 Organization application
2.1.1 Classes and properties
2.2 Individual membership application
2.2.1 Classes and properties
2.3 Standard Research Project Approval (SRPA)
2.4 Standard Information About Consent (SIAC)
2.5 Standard Information on Non-Consented data (SINC)
2.6 Agreement between the controller & processor
2.7 Summary
3 Data Model
3.1 Omics representation
3.2 Summary
4 Conclusions
5 References
1 Introduction
The implementation of the BiobankCloud platform requires a data model
that helps to determine the structure of the data. The data model
provided by this deliverable is limited to the representation of objects and
relationships related to biobank data. Other data models related to
security, storage, management, etc. will be handled by other work
packages.
The BiobankCloud Model Data Management Policy (MDMP) specified in
D1.5 concluded that six standard electronic forms for BiobankCloud
membership request and data submission should be designed:
Organization application
Individual membership application
SRPA: Standard Research Project Approval
SIAC: Standard Information About Consent
SINC: Standard Information on Non-Consented data
Agreement between the controller & processor
Section 2 provides the design elements of these forms.
Section 3 covers some aspects of the data model specification based on
MIABIS 2.0: Minimum Information About Biobank data Sharing [1], and
also based on specific requirements of the platform. Several projects in
Nordic countries and Europe are adopting this formalism to represent
biobank data sharing. Making the BiobankCloud data model MIABIS
compliant will ensure integration with other biobank information systems.
2 BiobankCloud Model Data Management Policy (MDMP): Standard forms
Deliverable D1.5 specified the MDMP based on the BiobankCloud
platform legal framework. In order to guarantee a controlled management
of data flow in the platform, six standard forms need to be designed to
keep track of the use of data. D1.1 provided more information about what
should be included in the standard forms.
The standard forms are implicit in the data model but need to be explicitly
available as printable documents.
The standard forms related to personal data protection are useful only for
data generated by experiments done on human samples.
2.1 Organization application
This form collects information about an organization applying to become a
member of the platform. An organization requesting membership of the
platform provides the information through a requester, a person
designated by the organization.
Required information:
Name of the organization
Organization Identification
Organization website
Organization contact person:
o Contact information (name, e-mail, telephone, address)
o Position
o Affiliation (Organization, Department)
The aim and purpose of the organization. If the organization pursues
any aim other than biomedical research, this must be explicitly
documented.
Forms of organisation (public or private)
Means of funding of the organization
All public bodies exercising supervision over the organization (Data
Protection Authority, Health Boards, Boards for Higher Education,
etc.)
2.1.1 Classes and properties
Table 1. Classes and properties involved in the Organization application
Entity Class Name | Attribute Name | Attribute Description | Attribute Type
Organization | Name | Name of the organization | Text
Organization | Number | Organization Identification number (VAT identification number) | Identifier
Organization | Website | URL | Text
Organization | OrganizationType | Public or Private | Text
Organization | Aim | Aim of the organization | Text
Organization | AimReference | Reference (document, link, contact information) if the organization pursues any aim other than biomedical research | Reference
Contact | FirstName | Contact person first name | Text
Contact | LastName | Contact person last name | Text
Contact | Email | Contact person email | Text
Contact | Position | Position in the organization | Text
Contact | Department | Department of the contact person | Text
Contact | Street | Street address | Text
Contact | ZipCode | Address Zip code | Text
Contact | City | Address city | Text
Contact | Phone | Contact person phone number | Text
Country | Name | Country name | Text
Country | Code | Country code | Text
PublicBody | Name | Name of public body | Text
PublicBody | Reference | Reference to the public body | Reference
MeansOfFunding | Name | Name of the means of funding | Text
MeansOfFunding | Reference | Reference to the means of funding | Reference
Fig.1 Classes involved in the Organization application
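The classes in Table 1 could be sketched in Python as below. This is only an illustrative sketch, not the platform's implementation: the associations between classes (for example, that a Contact carries a Country) are assumptions, since they are defined in Fig.1, and the flag pursues_other_aims and the method problems() are hypothetical helpers introduced here to express the rule that aims other than biomedical research must be explicitly documented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Country:
    name: str
    code: str


@dataclass
class Contact:
    first_name: str
    last_name: str
    email: str
    position: str
    department: str
    street: str
    zip_code: str
    city: str
    phone: str
    country: Country  # assumed association; the real one is given in Fig.1


@dataclass
class PublicBody:
    name: str
    reference: str


@dataclass
class MeansOfFunding:
    name: str
    reference: str


@dataclass
class Organization:
    name: str
    number: str             # organization identification (VAT) number
    website: str
    organization_type: str  # "Public" or "Private"
    aim: str
    contact: Contact
    aim_reference: Optional[str] = None
    pursues_other_aims: bool = False  # hypothetical flag, not in Table 1
    public_bodies: List[PublicBody] = field(default_factory=list)
    means_of_funding: List[MeansOfFunding] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return human-readable problems found in the application."""
        found = []
        if self.organization_type not in ("Public", "Private"):
            found.append("OrganizationType must be Public or Private")
        if self.pursues_other_aims and not self.aim_reference:
            found.append("AimReference is required for aims other than "
                         "biomedical research")
        return found
```

An application form handler could call problems() before passing the request to the access committee and reject the submission while the list is non-empty.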
Once the access committee has approved an organization as a member, the contact person for the organization will be requested to appoint at
least one “controller”. The appointed “controller” (or controllers) will then
apply for individual membership.
2.2 Individual membership application
This form will collect information about a researcher applying for
membership. A researcher requesting membership to the platform has to
belong to an organization whose membership has already been granted. As
defined in D1.1 (Section 6.1), a researcher has one of two main roles:
controller or trusted researcher. The controller role is appointed by the
organization and can store, analyse and download data to or from the
platform. A trusted researcher needs to be approved by the controller and
can only analyse data and download analysis results associated with the
data uploaded by the controller.
Required general information for controller and trusted researcher:
Contact information (name, e-mail, telephone, address)
Position
Affiliation (Organization, Department)
The platform provides two main services: storage and analysis. In order to
use the platform services, the researcher should request the specific
service after logging in to the platform. Only appointed controllers can
upload data to the platform.
Required information for controller:
Purpose of the use of the platform (storage, analysis, both)
Scientific aims of the investigation
Standard Research Project Approval (SRPA) (Reference)
Agreement between controller and processor
2.2.1 Classes and properties
Table 2. Classes and properties involved in the individual membership application
Entity Class Name | Attribute Name | Attribute Description | Attribute Type
Organization | Organization | Organization already accepted as member | Organization
Contact | FirstName | Contact person first name | Text
Contact | LastName | Contact person last name | Text
Contact | Email | Contact person email | Text
Contact | Position | Position in the organization | Text
Contact | Department | Department of the contact person | Text
Contact | Street | Street address | Text
Contact | ZipCode | Address Zip code | Text
Contact | City | Address city | Text
Country | Name | Country name | Text
Country | Code | Country code | Text
Researcher | Agreement | Reference to document agreement between controller and processor | Reference
Researcher | Purpose [*] | Storage, Analysis, both | Identifier
Study | Aim | Scientific aims of the investigation | Text
Study | SRPA | Reference to the Standard Research Project Approval | Reference
The property “Purpose” is limited by the role of the user. A researcher who
requires storage capability has to be approved as a “Controller” by
the organization that the researcher belongs to.
Once a researcher has been granted the controller role, the “storage”,
“analysis” and “download” functionalities are available to this researcher.
A trusted researcher has access to the “analysis” and “download” services.
The “download” is limited to the analysis results. This applies if the data is
related to human samples. Otherwise, the trusted researcher can use the
rest of the functionalities.
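The access rules above can be summarised in a small permission check. This is a sketch under stated assumptions rather than the platform's access-control code: the names Role, Action and is_allowed are hypothetical, and the reading that a trusted researcher may download original datasets only for non-human data, and may never upload, is an interpretation of the text (uploading is reserved for controllers in Section 2.2).

```python
from enum import Enum


class Role(Enum):
    CONTROLLER = "controller"
    TRUSTED_RESEARCHER = "trusted researcher"


class Action(Enum):
    STORE = "storage"
    ANALYSE = "analysis"
    DOWNLOAD_RESULTS = "download analysis results"
    DOWNLOAD_DATASETS = "download original datasets"


def is_allowed(role: Role, action: Action, human_samples: bool) -> bool:
    """Decide whether a role may perform an action on a study's data."""
    if role is Role.CONTROLLER:
        # Controllers have storage, analysis and download functionality.
        return True
    if action is Action.STORE:
        # Only appointed controllers can upload data.
        return False
    if action is Action.DOWNLOAD_DATASETS:
        # Assumed reading: original datasets are downloadable by a
        # trusted researcher only when no human samples are involved.
        return not human_samples
    # Analysis and download of analysis results are always available.
    return True
```

For example, is_allowed(Role.TRUSTED_RESEARCHER, Action.DOWNLOAD_DATASETS, human_samples=True) evaluates to False, while the same call for a controller evaluates to True.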
D1.1, Section 4.2 defined the data model for biobank data sharing
based on MIABIS 2.0. The standard forms should be related to the entity
“study” as specified in Table 2. The Aim and SRPA properties are provided
each time a new study is created in the platform. The SIAC and SINC are
provided if the study is done on human samples.
2.3 Standard Research Project Approval (SRPA)
Each study has at least one SRPA related to it. When a controller requests a
storage service, the system will request the SRPA information.
Class and properties
Table 3. Class SRPA
Entity Class Name | Attribute Name | Attribute Description | Attribute Type
SRPA | Registration date | When the form was registered in the system | Date
SRPA | Controller | Controller reference | Contact
SRPA | Document | Reference to the research approval document | Reference
2.4 Standard Information About Consent (SIAC)
Each study on human samples has at least one SIAC related to it. When a
controller requests a storage service, the system will request the SIAC
information.
Class and properties
Table 4. Class SIAC
Entity Class Name | Attribute Name | Attribute Description | Attribute Type
SIAC | Registration date | When the form was registered in the system | Date
SIAC | Controller | Controller reference | Contact
SIAC | Document | Reference to the information about consent document | Reference
SIAC | Limit of use | Limit of time for storing the data | Date
SIAC | Remark | Comment or remark about use of data | Text
2.5 Standard Information on Non-Consented data (SINC)
D1.1, Section 6.2 provides the use case “Store study descriptive
metadata”. If not all the human samples have been consented for the
study, a SINC is required.
Class and properties
Table 5. Class SINC
Entity Class Name | Attribute Name | Attribute Description | Attribute Type
SINC | Registration date | When the form was registered in the system | Date
SINC | Controller | Controller reference | Contact
SINC | Document | Reference to the information about non-consented data document | Reference
SINC | Limit of use | Limit of time for storing the data | Date
SINC | Remark | Comment or remark about use of data | Text
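The rules in Sections 2.3 to 2.5 about which standard forms a study needs can be expressed as a short function. This is a minimal sketch; the function names required_forms and missing_forms and their parameters are hypothetical and only illustrate the decision logic described in the text.

```python
from typing import Iterable, List


def required_forms(human_samples: bool,
                   all_samples_consented: bool = True) -> List[str]:
    """List the standard forms a study must provide."""
    forms = ["SRPA"]  # every study has at least one SRPA
    if human_samples:
        forms.append("SIAC")  # every study on human samples needs a SIAC
        if not all_samples_consented:
            forms.append("SINC")  # needed when some samples lack consent
    return forms


def missing_forms(provided: Iterable[str], human_samples: bool,
                  all_samples_consented: bool = True) -> List[str]:
    """Return the required forms that have not been provided yet."""
    have = set(provided)
    return [form
            for form in required_forms(human_samples, all_samples_consented)
            if form not in have]
```

A storage request for a study on human samples with partly non-consented material would then be held back until missing_forms returns an empty list.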
2.6 Agreement between the controller & processor
As stated in Deliverable 1.5, Section 2.3, the Data Protection Directive
defines the controller as ‘the natural or legal person, public authority,
agency or any other body which alone or jointly with others determines
the purposes and means of the processing of personal data’ (Article 2.d).
A processor is ‘the natural or legal person, public authority, agency or any
other body which processes personal data on behalf of the controller’
(Article 2.e). According to Article 17.1 of the Data Protection Directive,
the Member States are to provide that the controller must implement
appropriate technical and organizational measures to protect personal
data. If the controller does not conduct the processing him/herself but
leaves this to a processor, Article 17.3 stipulates that the processing of
data must be governed by a contract or legal act binding the processor to
the controller and stipulating that the processor shall act only on
instructions from the controller and in conformity with the measures set
out by the Member States in accordance with Article 17.1.
For the sake of establishing a transparent and reliable chain of command
between controller and processors, the BiobankCloud should take care to
allocate all responsibilities stemming from the Data Protection Directive
in a clear and concise manner (see footnote 1). The agreements entered
into by the BiobankCloud with the users wishing to upload and analyse data,
at this stage Charité in Germany, should therefore state the division of
labour between the parties, making clear that it is the user, Charité, that
continues to be the controller of the data, and that the BiobankCloud is
the processor [10].
It is a future task of this work package to develop a draft model
agreement for the users/controller to enter into with the
platform/processor.
1 See further Article 29 Working Party Group, Opinion 5/2012 on Cloud Computing, adopted on 1st of July 2012, p. 8 and 12.
Fig 2. Class association representing standard forms
2.7 Summary
The standard forms are meant to hold information required by the BiobankCloud Model Data Management Policy (MDMP).
The SIAC and SINC are needed only for data related to experiments done on human samples.
The SRPA form is required for all studies uploaded to the platform.
The agreement between controller and processor is not related to the study but to the researcher that holds a controller role.
A controller for one study's data can be a trusted researcher associated with another study's data.
The controller of the study data provides the information needed by SRPA, SIAC and SINC.
The contact person of an organization provides the information for the Organization application.
All researchers (controller or trusted researcher) need to provide the personal information required for the Individual membership application, as specified in Section 2.2.
Deliverable 1.3 will provide a draft model agreement between controller and processor.
It is advisable to use specific entities holding the information related to the standard forms (Fig 2) to facilitate searching.
For non-human experiment data, the controller role does not have the same relevance, but it would be good practice to use this role as the owner of the data to decide which data can be shared and by whom. A researcher can use the trusted researcher role to analyse and download data once the data has been authorized for sharing. Only the owner of the data can decide how to share his/her data.
3 Data Model
D1.1 presented the first draft of the data model for biobank data
management in the BiobankCloud platform, based on MIABIS 2.0. Some
modifications need to be made in order to adapt the model to the platform
requirements.
Fig 3. The white blocks represent extensions to be made to MIABIS 2.0
MIABIS 2.0 [1] is designed for sharing human samples. The donors and
the samples are represented at the aggregated levels and the species and
anatomical location of the sample are not included in the model.
As shown in Fig 3, each Study is associated with a set of Donors and a set
of Samples. One solution is for each set of Samples to carry the species and
anatomical site. Because the Samples are represented at the aggregated
level, the Species and Anatomical sites will also be represented at the
aggregated level. The species and anatomical sites should then also be
specified in each omics descriptive metadata record, as shown in Fig 4.
Fig 4. Species and anatomical part properties at aggregated and data levels. Properties in
class Study should be defined based on MIABIS 2.0 (table 6)
Having Species and Anatomical part in both levels (aggregated and data)
can help a query engine to search into studies without having to search
into the omics descriptive metadata.
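The benefit of keeping the two properties at both levels can be illustrated with a small sketch. The class and function names are hypothetical, and attaching the aggregated sets directly to the study (rather than to its set of samples) is a simplification made only for this example.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class OmicsMetadata:
    """Data-level descriptive metadata of one omics dataset."""
    species: str
    anatomical_part: str


@dataclass
class Study:
    study_id: str
    omics: List[OmicsMetadata] = field(default_factory=list)
    # Aggregated-level copies, kept in sync by add_omics().
    species: Set[str] = field(default_factory=set)
    anatomical_parts: Set[str] = field(default_factory=set)

    def add_omics(self, record: OmicsMetadata) -> None:
        self.omics.append(record)
        self.species.add(record.species)
        self.anatomical_parts.add(record.anatomical_part)


def studies_with_species(studies: Iterable[Study], species: str) -> List[str]:
    """Search on the aggregated level only; omics records are not read."""
    return [study.study_id for study in studies if species in study.species]
```

The query touches one small set per study instead of iterating over every omics metadata record, which is the shortcut the aggregated level is meant to provide.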
We suggest using the species list from the Ensembl database [3] or the
NCBI Taxonomy Database to define the species annotation.
The specification of the anatomical parts that describe the origin of the
sample is a complicated matter, even more so if the system has to handle
several species. A simplified solution could be to create a
dictionary of anatomical terms used by the system and provide the
flexibility to add new anatomical parts when required. In order to use
a controlled vocabulary, it is advisable to use terms from established
ontologies or standards such as NCI Thesaurus [5], SNOMED CT [6], AEO [7], etc.
For the aggregated level of the data model (Fig 1), the classes Biobank,
SampleCollection and Study should be defined according to MIABIS 2.0
(Table 6). For more information:
http://bbmri-wiki.wikidot.com/en:dataset
Name | Possible values | Explanation | Data level
Biobank ID | Text | Text string of letters starting with the country code (according to standard ISO 3166-1 alpha-2) followed by the underscore “_” and post-fixed by a biobank ID or name specified by its juristic person (nationally specific) | Biobank
Name of biobank | Free text in English | Text string of letters denoting the name of the biobank in the local language | Biobank
ContactPerson | Structured data | Name, email, address, phone number, affiliation | Biobank
Sample Collection ID | Free text in any language | Text string depicting the unique ID or acronym for the sample collection or study | Sample Collection
ContactPerson | Structured data | Name, email, address, phone number, affiliation | Sample Collection
Sample Collection Description | Free text in English | Text string of letters describing the sample collection or study aim (max 200 characters) | Sample Collection
Sample Collection Responsible | Text | Text string of letters denoting the name of the sample collection responsible or principal investigator | Sample Collection
Sample Collection Contact Information | Text | Structured information about the contact person including address, phone, email, organization, department | Sample Collection
Type of Collection | Case-control, Cohort, Cross-sectional, Longitudinal, Twin-study, Quality control, Population-based, Disease specific, Other | Text string of letters denoting the type of sample collection or study design. Can be one or several of the following values: Case-control, Cohort, Cross-sectional, Longitudinal, Twin-study, Quality control, Population-based, Disease specific, Other. Definitions for the values are as follows: Case-control = A case-control study design compares two groups of subjects: those with the disease or condition under study (cases) and a very similar group of subjects who do not have the disease or condition (controls); Cohort = A group of individuals identified by a common characteristic (e.g. demographic, exposures, illness etc.); Cross-sectional = A study in which participants are examined at only a single time for characteristics of a disease; Longitudinal = Research studies involving repeated observations of the same entity over time. In the biobank context, longitudinal studies sample a group of people in a given time period, and study them at intervals by the acquisition and analyses of data and/or samples over time; Twin-study = A twin study design is a study design in behavior genetics which aids the study of individual differences between genetically identical twins by highlighting the role of environmental and genetic causes on behavior; Quality Control = A quality control testing study design type is where some aspect of the experiment is quality controlled for the purposes of quality assurance; Population-based = Multidisciplinary study done at the population level or among the population groups, generally to find the cause, incidence or spread of the disease or to see the response to the treatment, nutrition or environment; Disease specific = A study or biobank for which material and information is collected from subjects that have already developed a particular disease; Other | Sample Collection
Collection start | Date | Date in ISO-standard (8601) time format specifying when the sample collection starts | Sample Collection
Collection end | Date | Date in ISO-standard (8601) time format specifying when the sample collection ends, if applicable | Sample Collection
Sex | Female, Male | Text string of letters denoting the sex of the sample donors. Can be one or both of the following values: Female, Male | Sample Collection
Age interval | [001-999] | Age interval of youngest to oldest participant in sample collection | Sample Collection
Average age | Real | Average age of all sample donors in the sample collection | Sample Collection
Main diagnosis | Text | Diagnosis system defined by RD-Connect. Can be several values | Sample Collection
Categories of data collected | Biological samples, Register data, Survey data, Physiological measurements, Imaging data, Medical records, Other | Can be one or several of the following values: Biological samples, Register data, Survey data, Physiological measurements, Imaging data, Medical records, Other | Sample Collection
Material type | Whole blood, Plasma, Serum, Urine, Saliva, CSF, DNA, RNA, Tissue, Faeces, Other | Text string of letters denoting the nature of the biological samples that make up the sample collection. Can be one or several of the following values: Whole blood, Plasma, Serum, Urine, Saliva, CSF, DNA, RNA, Tissue, Faeces, Other | Sample Collection
Survey data | Individual Disease History, Individual History of Injuries, Medication, Perception of Health, Women's Health, Reproductive History, Familial Disease History, Life Habits/Behaviours, Sociodemographic Characteristics, Socioeconomic Characteristics, Physical Environment, Mental Health, Other | Text string of letters covering additional information existing about the sample donors. Can be one or several of the following values: Individual Disease History, Individual History of Injuries, Medication, Perception of Health, Women's Health, Reproductive History, Familial Disease History, Life Habits/Behaviors, Sociodemographic Characteristics, Socioeconomic Characteristics, Physical Environment, Mental Health, Other. | Sample Collection
Medical records | Text | Free text specifying which medical records are available in the sample collection/study | Sample Collection
Registers | Text | Free text specifying which registers are available in the sample collection/study | Sample Collection
Sample handling | Text | Text string of letters describing how the samples in the sample collection have been handled as an indication of sample quality. Can be one or several of the following values: Freeze chain, indicating if the samples in the collection have been kept cool from needle to freezer. Freeze time, time in hours from needle to freezer. SPREC compliant, if the samples are labeled according to SPREC, Other | Sample Collection
Current sampled individuals | Integer | Number of individuals with biological samples in the study at the date of Last updated (also see Planned sampled individuals) | Sample Collection
Current total individuals | Integer | Total number of individuals in the study at the date of Last updated (also see Planned total individuals) | Sample Collection
Hosting biobank | Text | Text string of letters of the biobank/s storing the biological samples that are part of the sample collection. Can be several | Sample Collection
Date of entry | Date | Date in ISO-standard (8601) time format when data about the sample collection was reported into a database | Sample Collection
Last updated | Date | Date in ISO-standard (8601) time format when data about the sample collection was last updated in a database | Sample Collection
Study ID* | Free text in any language | Text string depicting the unique ID or acronym for the study. Can be generated by the system | Study
Study Description* | Free text in English | Text string of letters describing the sample collection or study aim (max 200 characters) | Study
Principal Investigator* | Text | Text string of letters denoting the name of the sample collection responsible or principal investigator | Study
Study Contact Person* | Text | Structured information about the contact person including address, phone, email, organization, department | Study
Sex | Female, Male | Text string of letters denoting the sex of the sample donors. Can be one or both of the following values: Female, Male | Study
Age interval* | [001 – 999] | Age interval of youngest to oldest participant in sample collection | Study
Main diagnosis* | Text | ICD-10 codes for the studied diagnoses. Can be several values | Study
Categories of data collected | Biological samples, Register data, Survey data, Physiological measurements, Imaging data, Medical records, Other | Can be one or several of the following values: Biological samples, Register data, Survey data, Physiological measurements, Imaging data, Medical records, Other | Study
Material type* | Whole blood, Plasma, Serum, Urine, Saliva, CSF, DNA, RNA, Tissue, Faeces, Cell line, Other | Text string of letters denoting the nature of the biological samples that make up the sample collection. Can be one or several of the following values: Whole blood, Plasma, Serum, Urine, Saliva, CSF, DNA, RNA, Tissue, Faeces, Cell line, Other | Study
Survey data | Individual Disease History, Individual History of Injuries, Medication, Perception of Health, Women's Health, Reproductive History, Familial Disease History, Life Habits/Behaviours, Sociodemographic Characteristics, Socioeconomic Characteristics, Physical Environment, Mental Health, Other | Text string of letters covering additional information existing about the sample donors. Can be one or several of the following values: Individual Disease History, Individual History of Injuries, Medication, Perception of Health, Women's Health, Reproductive History, Familial Disease History, Life Habits/Behaviors, Sociodemographic Characteristics, Socioeconomic Characteristics, Physical Environment, Mental Health, Other. | Study
Current sampled individuals | Integer | Number of individuals with biological samples in the study at the date of Last updated (also see Planned sampled individuals) | Study
Current total individuals | Integer | Total number of individuals in the study at the date of Last updated (also see Planned total individuals) | Study
Study name* | Free text in English | Text string of letters denoting the name of the study in English | Study
Planned sampled individuals | Integer | Number of individuals with biological samples planned for the study (also see Current sampled individuals) | Study
Planned total individuals | Integer | Number of individuals planned for the study (also see Current total individuals) | Study
Comorbidity | Yes/No | Text string of letters indicating if information about comorbidity is available. Can be Yes or No | Study
Storage temperature | Room temperature, +4C, -18C to -35C, -60C to -85C, Liquid nitrogen, Other | Text string of letters with the temperature for the long-term storage of the biospecimens in the sample collection. Can be one or several of the following values: Room temperature, +4 °C, -18 °C to -35 °C, -60 °C to -85 °C, Liquid nitrogen, Other. The intervals are chosen according to SPREC. | Study
Omics experiments* | Genomics, Transcriptomics, Proteomics, Metabolomics, Lipidomics, Other | Text string of letters denoting the -omics experiment(s) that have been performed on the samples in the sample collection. Can be one or several of the following values: Genomics, Transcriptomics, Proteomics, Metabolomics, Other. Definitions for the values are as follows: Genomics = The study of an organism's entire genome. Transcriptomics = The study of the transcription, i.e., the expression levels of mRNAs in a given organism, tissue, etc. (under a specific set of conditions). Proteomics = The study of proteins, their structures, and their functions, namely the study of the proteome and Metabolomics = The identification, quantification, and characterization of the small molecule metabolites in the metabolome (i.e., the set of all small molecule metabolites found in a specific cell, organ, or organism), Other | Study
Table 6. MIABIS 2.0: Biobank, Sample Collection and Study (* suggested mandatory
data elements)
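Some of the constraints in Table 6 lend themselves to simple automatic checks. The sketch below is illustrative only: the helper names are hypothetical, the Biobank ID check verifies just the shape "two upper-case letters, underscore, non-empty suffix" (it assumes upper-case country codes and does not confirm that the code is an assigned ISO 3166-1 alpha-2 code), and the vocabulary shown is the Study-level "Material type" list.

```python
import re
from datetime import date
from typing import Iterable, List, Set

# Shape check only: country code, underscore, biobank ID or name.
BIOBANK_ID = re.compile(r"^[A-Z]{2}_.+$")

# Study-level controlled vocabulary for "Material type" (Table 6).
MATERIAL_TYPES: Set[str] = {
    "Whole blood", "Plasma", "Serum", "Urine", "Saliva", "CSF",
    "DNA", "RNA", "Tissue", "Faeces", "Cell line", "Other",
}


def is_valid_biobank_id(value: str) -> bool:
    """True if the value has the country-code prefix layout."""
    return BIOBANK_ID.match(value) is not None


def parse_iso_date(value: str) -> date:
    """Parse an ISO 8601 calendar date; raises ValueError if malformed."""
    return date.fromisoformat(value)


def unknown_values(values: Iterable[str], allowed: Set[str]) -> List[str]:
    """Return the values that are not part of a controlled vocabulary."""
    return sorted(set(values) - allowed)
```

With these helpers, a submitted record can be rejected early, for instance when unknown_values(record_material_types, MATERIAL_TYPES) is not empty, where record_material_types stands for the values entered by the user.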
Table 6 contains the common data elements for sharing data among
biobanks according to MIABIS 2.0. For systems like the BiobankCloud
platform, where the main aim is the analysis of experimental data, the
most relevant information is related to the “Study” data level. Information
related to biobanks and sample collections could be difficult to collect.
Where it is not realistic to collect information about the
biobanks and sample collections, it is advisable to keep those data
levels even when most of the information is missing. The most
relevant information at both levels, biobank and sample collection, is the
contact reference for finding samples or information about the samples.
Fig 5. Main classes and relationships: MIABIS in yellow, Omics in pink, Regulations in
blue, Standard forms in grey and Management in white.
In D1.1, Section 6.2, the use cases were specified for studies done on
human samples. Keeping in mind that the data to be uploaded can be
related to animal experiments or cell lines, similar use cases should be
designed to upload and analyse data from non-human samples. In this case,
the forms related to consent (SIAC, SINC) are not required.
3.1 Omics representation
The analysis pipelines to be initially implemented in the platform belong to the genomics and transcriptomics disciplines.
At least three types of experiments are going to be analyzed by the
platform:
whole genome
ChIP-seq
transcriptome extraction
Fig 3 shows the hierarchical representation of the data. The omics data
has associated descriptive metadata, datasets and results.
Only the datasets can contain sensitive data. For the three experiment
types above, they contain genome sequences, short DNA sequences, and RNA
sequences respectively.
The sequences are pre-processed (alignment, splice alignment, filtering)
and, depending on the type of genomics experiment, further processes are
executed that generate more results. For instance, from the whole genome it
is possible to decode a whole new organism. From the alignments, the
possible analyses are SNP calling (to get genetic markers), peak calling (for
epigenetics and transcription factor binding sites (TFBS)) and read
assembly (for novel transcribed regions, differential expression,
alternative splice sites, and alternative promoters and terminators). From
splice alignments, read assembly can identify novel isoforms and fusion
proteins (Fig 6).
Fig 6. Overview of analysis pipelines to be implemented in the platform
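As a rough illustration of the pipeline overview, the relationship between pre-processing outputs and downstream analyses described in the text can be written as a lookup table. This is only a sketch: the dictionary layout and the function name `analyses_for` are assumptions, and the table lists only what the paragraph above states, not the full content of Fig 6.

```python
from typing import Dict, List

# Pre-processing output -> downstream analysis -> what the analysis yields.
DOWNSTREAM_ANALYSES: Dict[str, Dict[str, List[str]]] = {
    "alignment": {
        "SNP calling": ["genetic markers"],
        "peak calling": [
            "epigenetics",
            "transcription factor binding sites (TFBS)",
        ],
        "read assembly": [
            "novel transcribed regions",
            "differential expression",
            "alternative splice sites",
            "alternative promoters and terminators",
        ],
    },
    "splice alignment": {
        "read assembly": ["novel isoforms", "fusion proteins"],
    },
}


def analyses_for(preprocessing_output: str) -> List[str]:
    """List the analyses the text names for a given pre-processing output."""
    return list(DOWNSTREAM_ANALYSES.get(preprocessing_output, {}))
```

Keeping the mapping as data rather than as code paths makes it easier to add new pipelines later without changing the surrounding logic.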
With regard to omics data storage and analysis, this deliverable covers
only the issues of personal data protection and biobank data
sharing.
D1.5, Section 3.4.3 states: “For each omics dataset a specification of the
anonymization method should be provided (if needed)”. As stated in
D1.5, Section 1.5, there may be occasions where non-consented
data will be processed within the platform, if in accordance with
the law of the controller. This may be the case, for example, with data that
was collected a long time ago, where the data subject is deceased, or
with data on anonymous cell lines. Such data may fall outside
the definition of personal data according to Article 2.1.a of the Data
Protection Directive. No further anonymization is then necessary.
Another issue to address is the download function. Only the controller can
download the original datasets. The analysis results can be
downloaded by both the controller and any trusted researcher authorized to
download data from the specific study (Fig 7).
If the omics experiment has been done on non-human species or cell
lines, all the data can be downloaded, including descriptive metadata,
datasets and analysis results. The controller grants these permissions.
Fig 7. Properties to specify omics data, analyses and results
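The download rules above can be sketched as a single permission check. The names `Item`, `Study` and `can_download` are hypothetical. The text does not say who may download descriptive metadata of human studies, so this sketch assumes it is handled like analysis results; that choice is an assumption, not a requirement from D1.5.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Item(Enum):
    DESCRIPTIVE_METADATA = "descriptive metadata"
    DATASET = "dataset"
    ANALYSIS_RESULT = "analysis result"


@dataclass
class Study:
    controller: str
    human_samples: bool
    # Trusted researchers to whom the controller has granted download rights.
    authorized_researchers: Set[str] = field(default_factory=set)


def can_download(user: str, item: Item, study: Study) -> bool:
    """Decide whether a user may download an item of a study."""
    if user == study.controller:
        return True  # the controller can download everything
    if user not in study.authorized_researchers:
        return False  # no permission granted by the controller
    if not study.human_samples:
        return True  # non-human species or cell lines: all data
    # Human samples: original datasets stay with the controller.
    # Assumption: descriptive metadata is treated like analysis results.
    return item is not Item.DATASET
```

For example, with `Study(controller="alice", human_samples=True, authorized_researchers={"bob"})`, the user `bob` may download analysis results but not the original datasets.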
Fig 7 proposes a class diagram for omics data. The Reference
property could be a reference to a file or any other form of link to
the data. For instance, DescriptiveMetadata can reference an XML file
describing the omics experiment based on an established standard.
For the reporting, exchange, and management of omics data, it is
suggested to use a standard specification for the omics type concerned. For
instance, for microarray experiments use MIAME [8][9] from the Functional
Genomics Data Society (FGED). The information could be attached to the
descriptive metadata as extended XML, RDF, etc.
Some standards can be found at:
http://mibbi.sourceforge.net/portal.shtml
Even though the platform will implement analysis pipelines for only some
omics types, it is a good idea to keep the model as abstract as possible so
that new omics types and new analyses can be added. This could pave the way to
cross-experiment data analysis in the platform.
A suggestion is to keep the Analysis class as simple as possible and
maintain a reference to a semantic representation of the analysis (XML,
RDF, etc.).
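A minimal sketch of such an abstract model is given below. It assumes a generic `Reference` value that points to a file or other resource, and it uses the omics values listed in Table 6 as the controlled vocabulary. All class and field names are hypothetical and do not reproduce the properties shown in Fig 7.

```python
from dataclasses import dataclass, field
from typing import List

# Omics values listed in Table 6 (MIABIS 2.0).
OMICS_TYPES = ("Genomics", "Transcriptomics", "Proteomics", "Metabolomics", "Other")


@dataclass
class Reference:
    """Pointer to a file or any other resource (hypothetical shape)."""
    uri: str
    media_type: str = "application/xml"


@dataclass
class Analysis:
    # Kept deliberately small: the details live in the referenced document.
    name: str
    description: Reference
    results: List[Reference] = field(default_factory=list)


@dataclass
class OmicsData:
    omics_type: str
    descriptive_metadata: Reference
    datasets: List[Reference] = field(default_factory=list)
    analyses: List[Analysis] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values outside the controlled vocabulary.
        if self.omics_type not in OMICS_TYPES:
            raise ValueError(f"unknown omics type: {self.omics_type!r}")
```

Because an analysis carries only a name and references, a new type of analysis would need a new semantic description file rather than a change to the classes.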
Fig 8. Classes related to Study
MIABIS 2.0 includes additional information related to publications
generated by omics experiments (Fig 8). It is suggested to add these
elements to the data model. They can help to determine whether the data can
be made available for sharing and to obtain more specific information about
the study (http://bbmri-wiki.wikidot.com/en:dataset-study#toc2).
3.2 Summary
The proposed data model is based on MIABIS 2.0. MIABIS was designed to
facilitate data sharing among biobanks and researchers. The central
data element is the study. In order to make the BiobankCloud platform
MIABIS compliant, some issues should be taken into account:
- In the proposed data model, only the Dataset element can contain
  personal data. The data elements related to MIABIS 2.0 are defined
  at aggregated and metadata levels and provide enough information
  about studies done on biobanked samples. Information about
  species and anatomical parts related to the origin of the sample
  should be defined at aggregated and data levels.
- It is advisable to use a controlled vocabulary for the
  annotation of species and anatomical parts.
- Sex and age for non-human species should be specified differently
  (e.g. age can be defined in units other than years).
- Information about biobanks and sample collections can be difficult
  for the users of the platform to collect. The model should be flexible
  enough to capture the relevant information in the study and keep
  the possibility of updating biobanks and sample collections at any
  time.
- It is desirable to follow the MIABIS structure for future integration
  with biobank management systems.
- Due to the diversity of omics data definitions and formats, it is
  suggested to keep the model as abstract as possible for future
  adoption of new omics analysis pipelines and data analysis
  integration with other informatics platforms.
- Use controlled vocabularies and standards for omics data annotation.
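The points on controlled vocabularies and on age units for non-human species could be captured as in the following sketch. The unit list, the species terms and the class names are placeholders chosen for illustration; a real deployment would draw its terms from an established ontology or thesaurus.

```python
from dataclasses import dataclass

# Placeholder vocabularies (assumptions, not part of MIABIS 2.0).
AGE_UNITS = ("years", "months", "weeks", "days")
SPECIES_TERMS = ("Homo sapiens", "Mus musculus", "Rattus norvegicus")


@dataclass(frozen=True)
class Age:
    value: float
    unit: str = "years"

    def __post_init__(self) -> None:
        if self.unit not in AGE_UNITS:
            raise ValueError(f"unsupported age unit: {self.unit!r}")
        if self.value < 0:
            raise ValueError("age cannot be negative")


@dataclass(frozen=True)
class SampleOrigin:
    species: str
    age: Age

    def __post_init__(self) -> None:
        # Only terms from the controlled vocabulary are accepted.
        if self.species not in SPECIES_TERMS:
            raise ValueError(f"species not in vocabulary: {self.species!r}")
```

With these definitions, `SampleOrigin("Mus musculus", Age(12, "weeks"))` is accepted, while a free-text species name or an unknown unit is rejected at construction time.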
4 Conclusions
This deliverable extends D1.1 and covers details of the design of the
data model for data storage and analysis in the platform. It provides WP2
and WP3 with guidelines for implementing the security and storage data
structures as well as directions on how to implement data searching and
sharing. Section 2 defines the design of the standard forms required by the
BiobankCloud Model Data Management Policy. Section 3 provides specific
guidelines on how to adapt MIABIS 2.0 to the platform and also provides
suggestions about the use of standards for data representation.
Regarding the data model, not all the system requirements are captured
in it but only those related to user interaction with the platform, and
processes involving data protection and biobank data management and
sharing.
The “Agreement between the controller & processor” will be provided in deliverable 1.3.
New regulations regarding ethical issues should be added to the
BiobankCloud ethical framework with the help of the BiobankCloud Ethical
Board.
5 References
1. A Minimum Data Set for Sharing Biobank Samples, Information, and Data: MIABIS, Biopreservation and Biobanking. August 2012, 10(4): 343-348. doi:10.1089/bio.2012.0003
2. The Role of “Roles” in Use Case Diagrams. http://infoscience.epfl.ch/record/268/files/WegmannG00.pdf
3. The Ensembl Project: http://www.ensembl.org/index.html
4. Functional Genomics Data Society: http://www.fged.org/
5. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007 Feb;40(1):30-43. Epub 2006 Mar 15. http://www.ncbi.nlm.nih.gov/pubmed/16697710
6. Snomed CT implementation. Mapping guidelines facilitating reuse of data. Methods Inf Med. 2012 Dec 4;51(6):529-38. doi: 10.3414/ME11-02-0023. Epub 2012 Oct 1.
7. The AEO, an Ontology of Anatomical Entities for Classifying Animal Tissues and Organs. 2012;3:18. doi: 10.3389/fgene.2012.00018. Epub 2012 Feb 14. http://www.ncbi.nlm.nih.gov/pubmed/22347883
8. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001 Dec;29(4):365-71.
9. The minimum information about a genome sequence (MIGS) specification. Nature Biotechnology 26, 541 - 547 (2008) Published online: 8 May 2008 | doi:10.1038/nbt1360
10. Article 29 Working Party Group, Opinion 5/2012 on Cloud Computing, adopted on 1st of July 2012