Towards the First Data Acquisition Standard in
Synthetic Biology
Iñaki Sainz de Murieta†‡, Matthieu Bultelle†‡ and Richard I Kitney†‡*
† Centre for Synthetic Biology and Innovation, Imperial College London, SW7 2AZ, United Kingdom
‡ Department of BioEngineering, Imperial College London, SW7 2AZ, United Kingdom
* Correspondence: Richard I Kitney, Co-Director of the Centre for Synthetic Biology and
Innovation, Imperial College London, SW7 2AZ, United Kingdom.
KEYWORDS: data acquisition, characterization, standard, biopart, synthetic biology.
ABSTRACT
This paper describes the development of a new data acquisition standard for synthetic biology.
This comprises the creation of a methodology that is designed to capture all the data, metadata
and protocol information associated with biopart characterization experiments. The new
standard, called DICOM-SB, is based on the highly successful Digital Imaging and
Communications in Medicine (DICOM) standard in medicine. A data model is described which
has been specifically developed for synthetic biology. The model is a modular, extensible data
model for the experimental process, which can optimize data storage for large amounts of data.
DICOM-SB also includes services orientated towards the automatic exchange of data and
information between modalities and repositories. DICOM-SB has been developed in the context
of systematic design in synthetic biology — which is based on the engineering principles of
modularity, standardization and characterization. The systematic design approach utilizes the
design, build, test and learn design cycle paradigm. DICOM-SB has been designed to be
compatible with and complementary to other standards in synthetic biology, including SBOL. In
this regard, the software provides effective interoperability. The new standard has been tested by
experiments and data exchange between Nanyang Technological University in Singapore and
Imperial College London.
INTRODUCTION
Synthetic biology is a young discipline (15 years, at most) that aims to design and engineer
biologically based parts, novel devices and systems — as well as redesigning existing, natural
biological systems 1,2. Bioparts are the key element of this definition: they perform specific
functions such as regulating transcription/translation or the binding to small molecules or protein
domains, and are used as the basic blocks for building devices and systems of higher complexity.
The first bioparts used in synthetic biology applications were natural parts, transplanted to other
settings (e.g. a different chassis). Originally only a few parts were available, but soon synthetic
libraries were built by modifying natural parts with techniques such as error-prone PCR 3–6.
Device design, spearheaded by the repressilator 7 and the toggle switch 8, and followed by an
extensive amount of important devices 9–20 , proved that the analogy between bioparts and
electronic components could be used to design devices — and that practically, it was possible to
endow biological systems with computing-like behavior by combining elementary bioparts.
The second wave of synthetic devices (2010 onwards) has been characterized not only by
attempts to build more complex devices and investigate robust design principles, but also by a
focus on applications such as biosensing, biofuels, pharmaceuticals and
biomaterials, with the stated aim of establishing synthetic biology as one of the key technologies to
solve major societal problems 21,22.
Standard workflows. As engineered biological systems and their applications become more
complex and ambitious, the traditional iterative approach to design, mainstream in many fields of
engineering, has also been adopted in synthetic biology 23. The design cycle (illustrated in Figure
1) comprises four distinct sections. Depending on the results, the process may be repeated
(iterated) several times until the initial specifications are met.
The concepts of modularity (the approach that builds larger systems by combining smaller
subsystems — here, bioparts and available devices) and division of labor (the specialization of
cooperating individuals who perform specific tasks and roles) are central to its success. The latter
plays an increasing role, as projects become more complex and larger teams of specialists are
needed. In a move that mirrors the electronic industry, where circuit design is “fabless” and
construction takes place in specialized foundries, outsourcing the DNA synthesis of genes and
gene fragments is now an integral part of the rational design cycle — final assembly taking place
in-house using an ever wider range of techniques 24,25.
Characterization is the process of describing distinctive characteristics or essential features of
bioparts. Accurate characterization is essential to the success of the iterative design approach.
Parts need to be characterized to a high standard, so the behavior of their combination may be
predicted with higher fidelity. Of equal importance is that repositories must make large libraries
of characterized parts available, such that new systems can be built by their
addition/combination.
Figure 1. The synthetic biology design cycle.
Several high profile repositories are in common use — the iGEM Registry of Standard
Biological Parts 26, the JBEI Inventory of Composable Elements (ICE) 27 and the Virtual Parts
Repository 28, to name a few. However, currently no repository offers a large catalogue of
bioparts that are characterized to a consistently high standard. Although regrettable, such a state
of affairs is not surprising. Collecting the data for such catalogues is a lengthy, staff and
resource-intensive affair – very difficult before the advent of high-throughput, automated
platforms. Building the catalogues also requires the development and adoption of a set of robust
data formats to describe the various components of the characterization experiments. In
particular, it requires the development and adoption of a data format to store the raw data they
generate — a format such as the one presented in this paper.
In order to build such an online catalogue of bioparts, a characterization pipeline has been
established at Imperial College, London (see Figure 2). It is supported by an IT-spine called
SynBIS 29, which enables characterization on a scale that is difficult for human experimentalists
to achieve. First developed with constitutive promoters, the system now supports the
characterization of other fundamental bioparts — such as inducible promoters — and continues
to be expanded. Plate reader and flow cytometry data are typically acquired.
Information and data standards. Design and characterization projects greatly vary in terms of
purpose, internal organization and output; but, nonetheless they deal with similar types of
information. We have identified three main categories:
Sequence description (Description of genetic objects of interest): FASTA 30 and GenBank
31 are well established formats that underpin very large public databases of naturally
occurring, annotated, sequences. However, they are not suitable for the description of the
genetic constructs encountered in synthetic biology — as they were not designed to
express the constructs' hierarchy and modularity. This has led to the creation of SBOL
(the Synthetic Biology Open Language), a standard that captures the same sequence-
oriented information found in a GenBank file, which allows full hierarchical annotation
of DNA components 32, and thus facilitates the exchange of genetic designs 33. SBOL’s
latest version, SBOL 2.0 34, proposed a revision to the core model in order to represent a
wider range of molecular interactions and components.
Figure 2. The CSynBI characterization pipeline.
Modelling (representing the actual or desired behavior of genetic objects): The Systems
Biology Markup Language (SBML) 35–37, is the most commonly used modelling standard
for the representation of biological phenomena. It is free, open, enjoys widespread
software support and is the de facto standard for representing computational models in
systems biology today 35–37. Modelling standards naturally complement genetic-construct
descriptions standards, as made apparent by SBOL 2.0 38 and by the mechanism to
annotate SBML models with SBOL files put forward by Roehner and Myers 39.
Raw data acquisition (learning about the genetic objects): As it matures into an
engineering discipline, synthetic biology will move from qualitative to quantitative data.
The amount of data captured and analyzed will consequently increase substantially —
due in no small part to the availability of new imaging modalities (generating ever larger
files) and the development of high throughput platforms 25.
RESULTS AND DISCUSSION
Motivation: data acquisition is the missing standard. In the design cycle, data acquisition
takes place during the testing phase. Since validating or rejecting a design is determined by
whether some concentrations and observed phenotypes fall within ranges listed in the
specifications, only a few repeat experiments may be needed for a given context. However,
testing may have to be performed for a potentially large number of candidate constructs and
experimental contexts.
With biopart characterization, a set of experiments are typically run on a small number of
characterization constructs (often, one containing the biopart and a set of controls).
Characterization differs from testing in that it has no margin of tolerance: it has to be as precise
as possible (so the results may be re-used to model more complex designs). In practice, a large
number of repeat experiments should be run and several acquisition modalities used. Because of
the need for catalogues of significant numbers of characterized parts, characterization can be
expected to be a main driver behind increased data capture in the future.
At present there is no data acquisition standard for synthetic biology. Hence, developing such a standard is of the utmost
importance for synthetic biology - as it will be a key driver in transforming the field into a fully-
fledged engineering discipline. For such a standard to be of use in synthetic biology, it needs to
effectively support data acquisition. Therefore, it must:
1. Be based on a modular, extensible data model for the experimental process.
2. Optimize data storage of very large amounts of data.
3. Provide services oriented to automate the exchange of information between modalities
and repositories.
4. Build, if possible, on an existing validated standard, to facilitate adoption by hardware
manufacturers and the engineering community.
DICOM and DICOM-SB. There is already a highly successful representation and
communication standard that meets requirements 1 to 4, described above, in the field of
biomedicine. DICOM (Digital Imaging and Communications in Medicine) is the de facto
standard for handling, storing, printing, and transmitting information in medical imaging. It is
also known as NEMA standard PS3, and as ISO standard 12052:2006 "Health informatics —
Digital Imaging and Communication in Medicine (DICOM) including workflow and data
management" 40. It is both a file format definition and a network communications protocol (based
on TCP/IP). Originally developed to achieve compatibility between different medical imaging
and information systems, it has developed into a comprehensive standard over the last 20 years.
Several key technical aspects of DICOM provide a strong case for extending the standard to
synthetic biology (technically by adding a new module; working title DICOM-SB), rather than
developing a novel standard or adapting an existing rival data acquisition standard:
First, DICOM was designed with data acquisition and transmission in mind. This means
that in practice most of the practical issues involved in networking various imaging
resources and repositories have already been solved. Also, DICOM’s real-world model
was built around the experimental process.
Second, DICOM already supports a number of imaging modalities, such as microscopy,
used in synthetic biology.
Third, because of DICOM’s popularity, there already are a large number of programmers
and engineers familiar with the standard. It would therefore take little development work
for manufacturers to adapt their equipment and support DICOM-SB.
DICOM has also had a transformative effect on medical ICT — the like of which would be
highly beneficial to synthetic biology. The combination of DICOM and HL7 has supported the
development of electronic health records (EHR) — a class of software that systematically
collects electronic health information about individual patients or populations – including
demographics, medical history, medication and allergies, images, vital signs, personal statistics,
and procedural information 41–43. In parallel, a class of software called PACS (picture archiving
and communication systems) was developed to provide economical storage of, and access to, images
acquired with multiple modalities 44–46. It is straightforward to draw an analogy between
EHRs/PACS and repositories of characterized bioparts that would collect raw data, procedural
data (assembly for instance), experimental protocol data and processed data.
Indeed we fully anticipate a successful data acquisition standard (such as the one we present in
this paper) to underpin such repositories. The next subsections will introduce a novel variant of
the successful DICOM standard and make the case that this standard is highly suitable to support
data acquisition in synthetic biology. In particular, it is complementary to SBOL and provides
efficient data storage.
DICOM for Synthetic Biology. As stated in the Introduction, our analysis of the DICOM
standard established that its features are compatible with the requirements for synthetic biology.
Therefore, we decided to develop a new synthetic biology extension for DICOM (DICOM-SB).
DICOM-SB provides a framework that allows the integration of wet lab experimental data
acquisition modalities into a common data model. It enhances the basic architecture inherited
from DICOM, to allow the encoding of new synthetic biology data acquisition modalities not
present in the general standard 40.
DICOM encodes data objects as a series of items (or data elements), such that each item is
identified by a predefined attribute (also called a tag). Attributes are named by the combination
of two fields: group and element. Groups organize the attributes into categories — while each
element identifies each different type of attribute within a group. Each attribute is related to a
data type (e.g. integer, float, character, string etc.) which in DICOM is called Value
Representation (VR). Finally, the value to be represented is encoded at the end (last bytes) of
each item, preceded by the length of that value. Figure 3-A depicts the DICOM encoding of data
items.
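The layout described above (tag, Value Representation, length, value) can be sketched in a few lines of Python. This is a minimal illustration rather than a complete encoder: it assumes the explicit-VR, little-endian form with a 2-byte length field, which applies only to the shorter VRs, and the helper names `encode_element` and `decode_element` are hypothetical. The value "0.5" is arbitrary.

```python
import struct

# VRs that use a longer header form (among others); not handled in this sketch.
LONG_FORM_VRS = {"OB", "OW", "OF", "SQ", "UT", "UN"}


def encode_element(group: int, element: int, vr: str, value: bytes) -> bytes:
    """Encode one data element as tag (group, element), VR, length, value."""
    if vr in LONG_FORM_VRS:
        raise ValueError(f"VR {vr} needs the long header form (not sketched here)")
    if len(value) % 2:
        # DICOM values have even length: UI pads with NUL, text VRs with a space.
        value += b"\x00" if vr == "UI" else b" "
    header = struct.pack("<HH2sH", group, element, vr.encode("ascii"), len(value))
    return header + value


def decode_element(buf: bytes):
    """Inverse of encode_element for a buffer holding a single element."""
    group, element, vr, length = struct.unpack_from("<HH2sH", buf, 0)
    return group, element, vr.decode("ascii"), buf[8:8 + length]


# Sampling Frequency, tag (003A,001A), stored as a Decimal String (DS).
item = encode_element(0x003A, 0x001A, "DS", b"0.5")
decoded = decode_element(item)
```

The value occupies the last bytes of the item and is preceded by its length, so a reader can skip an element it does not understand by jumping `length` bytes ahead.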
Figure 3. (A) DICOM encoding of data elements. (B) Nesting data elements using the SQ value
representation. (C) Building Modules and Information Entities. (D) The DICOM-SB data model.
Data objects can be nested into higher level objects. This is achieved by using the Sequence (SQ)
value representation. When an attribute is assigned an SQ VR it means the content of its value
field is a series of DICOM objects. Each object within the nested series includes another series of
data elements, and some of them may (or may not) themselves be encoded with an SQ VR, which would
add another nesting layer, and so on. The tree in Figure 3-B illustrates a nesting example
including different data elements and objects.
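As a rough illustration of SQ nesting, the sketch below models a data object as a list of (attribute name, VR, value) triples, where the value of an SQ attribute is itself a list of data objects. The in-memory representation and the function `nesting_depth` are assumptions made for this example; numeric tags are omitted, and the attribute names are those used later in this paper.

```python
from typing import List, Tuple

Element = Tuple[str, str, object]   # (attribute name, VR, value)
Dataset = List[Element]


def nesting_depth(dataset: Dataset) -> int:
    """Return the number of nesting layers, counting the top-level object as 1."""
    depth = 1
    for _name, vr, value in dataset:
        if vr == "SQ":
            # The value of an SQ attribute is a series of nested data objects.
            for nested in value:
                depth = max(depth, 1 + nesting_depth(nested))
    return depth


channel_od: Dataset = [("Channel Label", "SH", "OD")]
channel_gfp: Dataset = [("Channel Label", "SH", "GFP")]

waveform_object: Dataset = [
    ("Channel Definition Sequence", "SQ", [channel_od, channel_gfp]),
    ("Sampling Frequency", "DS", "0.5"),   # arbitrary illustrative value
]

top_level: Dataset = [("Waveform Sequence", "SQ", [waveform_object])]

depth = nesting_depth(top_level)
```

Here the top-level object contains one waveform object, which in turn contains two channel objects, giving three layers.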
DICOM includes two types of value representations to enable resource identification. On the one
hand the AE VR represents an Application Entity. An AE is the name of a DICOM device or
program which uniquely identifies it locally (e.g. inside a given network). It can refer to a
specific workstation (e.g. WORKSTATION2), a specific software service (e.g. DATASTORE),
etc. On the other hand, the UI VR encodes a Unique Identifier, used to reference
instances of DICOM data unambiguously. DICOM UI's must be globally unique, and they are built from groups
of digits separated by periods (e.g. 1.2.408.41112.3.1).
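The syntactic rules for these two identifiers can be checked with a short sketch. It assumes the usual DICOM constraints (a UI value has at most 64 characters and no component with a leading zero; an AE title has at most 16 characters); the function names are hypothetical, and a syntactically valid UID is of course not guaranteed to be globally unique.

```python
import re

# Digits grouped by periods; a component may be "0" but may not start with 0.
_UID_PATTERN = re.compile(r"(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*")


def is_valid_uid(uid: str) -> bool:
    """Check the syntax of a UI value (uniqueness cannot be checked locally)."""
    return len(uid) <= 64 and _UID_PATTERN.fullmatch(uid) is not None


def is_valid_ae_title(title: str) -> bool:
    """Check the length of an AE title, ignoring leading and trailing spaces."""
    stripped = title.strip(" ")
    return 0 < len(stripped) <= 16 and "\\" not in stripped


uid_ok = is_valid_uid("1.2.408.41112.3.1")
ae_ok = is_valid_ae_title("WORKSTATION2")
```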
Since the number of DICOM attributes is so large, building consistent objects that include
all the required information can become a tedious task if they have to be searched and chosen
one by one. In order to ease this task, DICOM clusters the attributes describing the same concept
into the same Information Module. Hence, when designing a DICOM object to encode a certain
data structure, modules will be the minimal blocks that will be combined. The module
specification also determines what attributes are mandatory (and thus must always be completed)
and which ones can be left incomplete. Finally, Information Modules are combined to build
Information Entities (IE), and IE's are aggregated to build Information Object Definitions (IOD)
(see Figure 3-C).
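The composition of modules into entities and entities into IODs, together with the mandatory/optional distinction, might be modelled as follows. This is a simplified sketch: the class names, the attribute names ("Host Name", "Host Type") and the single mandatory/optional split are assumptions for illustration, not the definitions given in the supporting material.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Module:
    name: str
    mandatory: List[str]                       # must always be completed
    optional: List[str] = field(default_factory=list)


@dataclass
class InformationEntity:
    name: str
    modules: List[Module]


@dataclass
class IOD:
    name: str
    entities: List[InformationEntity]


def missing_mandatory(iod: IOD, values: Dict[str, object]) -> List[str]:
    """List the mandatory attributes of an IOD that have no value yet."""
    missing = []
    for entity in iod.entities:
        for module in entity.modules:
            for attr in module.mandatory:
                if values.get(attr) in (None, ""):
                    missing.append(f"{entity.name}/{module.name}/{attr}")
    return missing


# Hypothetical module and attribute names, for illustration only.
host_module = Module("Host", mandatory=["Host Name"], optional=["Host Type"])
example_iod = IOD("Example IOD", [InformationEntity("Host", [host_module])])

gaps = missing_mandatory(example_iod, {"Host Type": "bacterium"})
```

Working at the module level means that a designer picks a handful of modules rather than searching through individual attributes one by one.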
In order to understand the basis of the extension of DICOM for synthetic biology, the
hierarchical data model for standard DICOM will now be described. Every data object must
implement a standard IOD. Their entities are related following a hierarchical information model.
In the standard DICOM model for medicine, the patient is at the top of this hierarchy, as they are
the object of analysis of any biomedical application. All the details related to a patient (name,
identifier, age, gender, etc.) are included in the Patient IE. Patients can be subject to different
medical studies, and this requires tracking additional study details (e.g. date, time,
study number, physician’s name, etc.). Going one step down in the hierarchy, this is represented
by the Study IE. Studies comprise different procedures, such that each one is performed on
specific equipment and can be repeated over time. Each procedure is termed a series, and their
features (series number, date, time, etc.) are included in the Series IE. Finally, each series
contains raw data acquired with one modality (e.g. electrocardiogram, magnetic resonance
imaging etc.). Overall, the data measured at the lowest level of the hierarchy (the modality
results) are annotated by the remaining levels: patient, study and series. It is worth noting that
although the main DICOM standard is associated first and foremost with images, the standard
also supports other types of data, waveforms being the most relevant for the present exercise
(see the section on Synthetic Biology Raw Data IOD).
The synthetic biology extension of DICOM organizes its data model following a similar strategy
(see class diagram in Figure 3-D):
Instead of “patient”, the object of study is the transformation of a host organism,
according to a transformation protocol, with a set of genetic constructs whose
behavior within the host is to be determined. We have created three new IE’s to model
this information (see top-left of the hierarchy in Figure 3-D, in green).
o Component: as the main target of the characterization process, this IE lies at the
top of the hierarchy. It describes the basic features to be tracked for each biopart
under analysis (the term Component has been chosen to be consistent with
SBOL). One of its attributes, named URI, allows the biopart to be described by
referencing an SBOL entity (a Component Definition), which allows a more
detailed annotation of its DNA sequence. Although nothing prevents the use of
GenBank or FASTA files, it is recommended to use SBOL, as it has the
advantage of supporting the representation of recursive biopart structures. This is
especially useful to represent in a single file a circular plasmid structure
integrating e.g. cargos, bioparts of interest and reporting genes.
o Host: the components (or bioparts) to be characterized are studied and analyzed in
the context of a specific host organism (also known as a chassis). This IE
describes the basic features to be tracked for a host in a characterization
experiment. Cell free systems are also represented by this entity by using a special
host type.
o Transformation: host cells may be genetically modified (transformed) by DNA
components before they are used in a characterization experiment. A
Transformation IE represents a cell design — as a combination of one host
organism and a list of components. The list of components in a transformation
can be empty, meaning that the host would be used untransformed in the
experiment (typically to be used as a control). Optionally, it can also include
details about the transformation protocol (more details about the Protocol IE
below).
In the next level of the hierarchy, the Experiment IE (analogous to Study in the medical
model) is defined. Its purpose is to perform all the procedures required to analyze the
change of behavior that the integrated set of components produces in the host. Each
experiment must also adhere to an experimental protocol whose details are defined by the
Protocol IE (see top-right of the hierarchy in Figure 3-D, in red).
An experiment comprises a set of procedures that are repeated on different compartments
(typically a well) over time. Each single repeat of a specific procedure in a compartment,
performed with dedicated equipment constitutes a Series (similar to the medical model).
The following IE’s expand the scope of a series (see bottom of the hierarchy in Figure 3-
D, in blue):
o Stimuli: when the series requires interaction with external stimuli, this IE may
represent either environmental conditions (e.g. temperature) or chemical
components to be added into the media during the course of the series.
Environmental conditions and chemical components can either be specified as
absolute values or as increments over time.
o Compartment: the experiment cells may be grouped in different compartments,
according to the cell interactions that need to be tested. Thus the Compartment IE
can be seen as a container (e.g. a vessel) where an experiment is performed. When
working with automated platforms, it is common to use plates that arrange wells
as a matrix of rows and columns, such that each well is assigned a different series.
In such a scenario, each well is represented by a Compartment IE that enables
tracking the localization of the series. The term ‘compartment’ was chosen to be
consistent with the modelling standard SBML (where it is defined as a bounded
space in which species are located).
o Equipment: this IE identifies and describes the piece of equipment performing the
measurements — whether it is microscopy, flow cytometry etc.
Finally, each series references the raw data generated by the equipment after that run. In
synthetic biology the raw data are often organized as a list of data arrays, such that each
array holds the values of one of the quantities measured by the equipment (e.g. time,
temperature, fluorescence intensity, optical density, etc.).
When the raw data are structured in this fashion they can be easily encoded using the
standard DICOM Waveform module 40. The next subsections of the paper show in more
detail how the Waveform module is used to encode both cell population and single cell
measurements.
As with the standard DICOM model, the experimental measurements for each data acquisition
activity are stored in the corresponding attributes of each modality — whereas the rest of the
higher entities in the data model store the metadata required for classifying, processing, analyzing
and disseminating the experimental measurements.
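The hierarchy described above can be summarised as a small set of Python data classes. This is only a sketch of the relationships between the Information Entities: the field names, the default host type and the example URI (a placeholder on example.org) are assumptions, and the actual attributes are those listed in the supporting material.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Component:
    name: str
    uri: Optional[str] = None      # e.g. a reference to an SBOL Component Definition


@dataclass
class Host:
    name: str
    host_type: str = "cell"        # a special type would denote a cell-free system


@dataclass
class Transformation:
    host: Host
    components: List[Component] = field(default_factory=list)

    @property
    def is_control(self) -> bool:
        # An empty component list means the host is used untransformed.
        return not self.components


@dataclass
class Series:
    compartment: str               # e.g. the well position on a plate
    equipment: str
    stimuli: List[str] = field(default_factory=list)
    raw_data: Dict[str, List[float]] = field(default_factory=dict)


@dataclass
class Experiment:
    transformation: Transformation
    protocol: str
    series: List[Series] = field(default_factory=list)


host = Host("E. coli MG1655")
promoter = Component("promoter_under_test", uri="https://example.org/sbol/promoter_under_test")

test_cells = Transformation(host, [promoter])
control_cells = Transformation(host)          # untransformed control

experiment = Experiment(
    transformation=test_cells,
    protocol="constitutive promoter characterization",
    series=[Series("A1", "plate reader", raw_data={"OD": [0.1, 0.2], "GFP": [5.0, 9.0]})],
)
```

The measurements live only in the series at the bottom of the hierarchy; every level above it contributes metadata.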
Having established the basic structure of the DICOM-SB data model, a new Synthetic Biology
IOD was defined to accommodate some of the modalities (and metadata) that are often used in
synthetic biology, but not already present in the DICOM standard. We call this IOD the
Synthetic Biology Raw Data IOD (SBRD). The SBRD was constructed both by reusing some of
the standard DICOM Information Modules (such as the Waveform Module), and defining new
modules. A full description of the DICOM-SB data model is available in the supporting material,
including details of the associated Tags and their Value Representations, as well as the
corresponding Information Modules and Information Entities.
To illustrate how the SBRD is used in practice, let us consider the following case study: the
characterization of constitutive promoters on a robotic automated platform as performed at the
Centre for Synthetic Biology and Innovation (CSynBI) at Imperial College London; see
Supplementary Information for more on the characterization protocol for constitutive promoters
at CSynBI and the data that are collected as part of the exercise.
Encoding cell population measurements. CSynBI’s characterization experiments use a plate
reader to measure the optical density and fluorescence of the population of E. coli (MG-1655)
transformed according to a specific experimental protocol. Thanks to previously established
calibration curves, these measurements are converted into estimates for cell population and GFP
population respectively. The characterization protocol states that the plate reader should
sample each well of the plate every 15 minutes and track measurements in
two channels:
Target measure: total fluorescence intensity
Population measure: total optical density
The SBRD handles plate reader population level data (which are time series) as follows.
As long as the sampling frequency is constant, and the channel values are chronologically sorted,
the attribute Sampling Frequency (003A, 001A) can be used to track the frequency value. If this
is not the case, an extra channel can be added with corresponding time marks.
The attributes defined as part of the Data Series module provide a data structure to store target
and population measurements in different channels, as well as sampling-independent data
related to the test itself (e.g. acquisition date and time, number of channels and samples, channel
names, channel properties, etc.).
Figure 4 depicts how the data generated by a plate reader experiment can be mapped and
structured into the Raw Data module as part of the SBRD IOD. Referring to Figure 4, reading
from left to right, it can be seen that the whole structure is encoded as a Waveform Sequence,
which allows the inclusion of several modality repeats (for instance, a range of fluorescence
channels, each corresponding to a given bandwidth). As per the protocol, there is only one repeat,
named “Object 1” (corresponding to an excitation wavelength of 385 nm ±10nm and an emission
wavelength of 428 nm ±10nm).
The first attribute within this object is the Channel Definition Sequence, which includes
the metadata required to describe all the channel related settings; this sequence must
contain as many objects as different channels (Objects 2.1 and 2.2 for OD and GFP
channels), and the data elements within each object relate to the different channel
features: Channel Number, Channel Label, Status (active / inactive / data / test / ...),
number of bits encoding each channel value (Waveform Bits Allocated), etc. Since our
experiments report in arbitrary units, there is no detail about units of measure. However,
such details can be included, if required, and there are attributes available to track them
(see supporting material).
Figure 4. Encoding the results of the plate reader using the Raw Data IOD.
The last attribute (Waveform Data) is used to encode the sequence of experimental data,
such that the samples are sorted in ascending order (from 1 to n). Each sample is built by
the concatenation of the different channel values, sorted as per the corresponding
Waveform Channel Numbers (first OD and then GFP).
The attributes in the center encode metadata not related to the channels, such as the
Sampling Frequency, the length of each data sample (Waveform Bits Allocated and
Waveform Sample Interpretation), total Number of Waveform Channels and Samples,
and Waveform Originality.
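The sample layout just described, in which each sample is the concatenation of its channel values in channel order, can be sketched as follows. The sketch assumes 16-bit signed little-endian values (that is, Waveform Bits Allocated of 16) and uses invented readings; the function names are hypothetical. It also shows the Sampling Frequency implied by one reading every 15 minutes.

```python
import struct
from typing import List, Sequence, Tuple


def pack_waveform(od: Sequence[int], gfp: Sequence[int]) -> bytes:
    """Interleave two channels sample by sample: OD first, then GFP."""
    if len(od) != len(gfp):
        raise ValueError("both channels need the same number of samples")
    data = bytearray()
    for od_value, gfp_value in zip(od, gfp):
        data += struct.pack("<hh", od_value, gfp_value)   # one sample = 2 x 16 bits
    return bytes(data)


def unpack_waveform(data: bytes) -> List[Tuple[int, int]]:
    """Recover the (OD, GFP) pairs in sample order."""
    return [pair for pair in struct.iter_unpack("<hh", data)]


# One reading every 15 minutes corresponds to 1/900 Hz.
SAMPLING_INTERVAL_S = 15 * 60
sampling_frequency_hz = 1 / SAMPLING_INTERVAL_S

# Invented readings in arbitrary units, for illustration only.
waveform_data = pack_waveform(od=[1, 2], gfp=[10, 20])
samples = unpack_waveform(waveform_data)
```

Because the sampling frequency is constant and the samples are stored in chronological order, no time channel is needed to recover the acquisition time of sample k: it is k times the sampling interval after the first reading.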
Encoding single cell measurements. Single cell modalities, such as flow cytometry, yield an
estimate of the amount of fluorescence on a per cell basis. As with plate reader data, the data
need processing before they can be used. However, with flow cytometry, the problem is not to
estimate population values, but, rather, to identify which of the measured particles (events)
correspond to growing bacteria, instead of cell debris. This is typically done through a process
called gating, which implies selecting area(s) on the scatter plot generated during the flow
experiment to decide which cells are to be analyzed and which not.
In the characterization protocol, data acquisition with flow cytometry takes place twice during
the assay - the first time after 3 hours, the second time after 6 hours. Each time a 10 % sacrificial
sample is extracted.
In addition to measuring fluorescence in a range of bandwidths, flow cytometry provides other
types of measurements that are related to properties of a particle. For example, forward scatter
(FSC) relates to the size of the event, while the side scatter (SSC) refers to its granularity. The
curation protocol we have implemented for data analysis uses the FSC value of each event to
determine whether it should be included as living bacteria.
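A very simple form of gating on the FSC channel could look like the sketch below. The thresholds and event values are invented for illustration, and the function `gate_by_fsc` is hypothetical; the curation protocol used for the actual data analysis is not reproduced here.

```python
from typing import Dict, List

Event = Dict[str, float]   # one measured particle, keyed by channel name


def gate_by_fsc(events: List[Event], fsc_min: float, fsc_max: float) -> List[Event]:
    """Keep only the events whose forward scatter lies inside the gate."""
    return [event for event in events if fsc_min <= event["FSC"] <= fsc_max]


# Invented events and thresholds, in arbitrary units.
events = [
    {"FSC": 120.0, "SSC": 40.0, "GFP": 3.0},     # small particle, likely debris
    {"FSC": 900.0, "SSC": 310.0, "GFP": 55.0},
    {"FSC": 1100.0, "SSC": 290.0, "GFP": 61.0},
]

kept = gate_by_fsc(events, fsc_min=500.0, fsc_max=5000.0)
```

Only the gated events would then contribute to the per-cell fluorescence statistics.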
The Raw Data module enables the encoding of as many scattering (FSC / SSC) and fluorescence
channels as generated by the flow cytometer. Even the time mark can be tracked as an extra
channel. The mapping is similar to that depicted in Figure 4, but with a larger number of
channels. It is worth noting that there have been attempts at using DICOM to encode cytometry
data 47; the SBRD presented in this article is a more general approach however, as it can be used
for any type of data.
Supporting communication. After having set the data model, DICOM must offer a service to
allow the communication of data objects (encoded as IOD's) between different Application
Entities. In a data acquisition context (such as here), there is a need for at least one service that
stores the acquired IOD into a data repository. Consequently, we have developed a web service
that implements a DICOM Message Service Element (DIMSE): the Store service. The
combination of an IOD and the corresponding DIMSE service creates the Service Object Pairs
(SOP's). Accordingly, our DICOM-SB extension has defined new SOP objects for each different
equipment modality used: Synthetic Biology Plate Reader (SBPR) and Synthetic Biology Flow
Cytometry (SBFC). Both of them include (see Figure 5-A):
The IOD called Synthetic Biology Raw Data (SBRD).
The Store DIMSE service.
DICOM-SB has been developed jointly with our automated biopart characterization pipeline (see
Figure 2) and is now mature enough to support all biopart characterization at the Centre for
Synthetic Biology and Innovation (CSynBI). The SBRD and the Store DIMSE service are used
as part of the data acquisition step of the pipeline (step 2), which, in practice, proceeds as follows
(Figure 5-B):
1. Execution of the experimental protocol in the laboratory.
2. Data conversion of experimental results following the DICOM-SB standard (generation
of the SBRD IOD).
3. Automated communication of experimental results from the laboratory equipment to a
centralized data repository (by using the Store DIMSE service).
Data are stored in the central repository, then analyzed and eventually published on SynBIS.
Internal (CSynBI) data are no longer the only data feeding SynBIS. We have established
a collaboration with the Poh Lab at the Nanyang Technological University (NTU). As part of this
joint work they have characterized a set of constitutive promoters following a manual version of
our experimental protocol that only collects cell population measurements (currently
unpublished). In addition, they have used DICOM-SB to standardize their characterization
results (encoding their data as an SBRD IOD) and make them available to SynBIS (uploading
them using our Store DIMSE service).
Figure 5. (A) The Synthetic Biology SOP Classes. (B) Receiving characterization data from
external partners.
DISCUSSION
No single standard will be able to support typical, full synthetic biology workflows. It is too
large and too diverse a task for one standard to encompass successfully. In our view a small
number of non-competing data standards can effectively describe the workflows. The first set of
standards should concern the description of genetic constructs. SBOL has been designed for such
a task, as well as for sequence annotations 32,33. It is now being expanded to improve the description of
the modular properties of bioparts 34. The second set of standards (SBML 35 and Kappa 48, for
example) is used to model the behavior of the constructs and comes from systems biology.
They can easily be interfaced with the first set of standards 39.
In this paper we have made the case that the synthetic biology community (from academia to
industry) should also focus its attention on the development and adoption of a third set of
complementary standards that would support and indeed enable data acquisition. To this end we
have developed DICOM-SB, an extension of the DICOM data standard designed for synthetic
biology. DICOM-SB was specifically developed with biopart characterization and construct-
testing in mind. Both are likely to involve heavy data acquisition.
For DICOM-SB we have developed a data model built around a typical experiment in synthetic
biology. It is fully compatible with, and indeed complementary to, SBOL in relation to the description
of the constructs involved in experiments. The data model contains in its header all the metadata
required to describe the experimental context accurately; crucially, it also standardizes the
description.
We have shown that one of the advantages of DICOM-SB is that it optimizes data storage.
Instead of using text-based representations like most standards (e.g. XML, SBML, SBOL),
DICOM-SB encodes data in binary format. While this makes no difference when dealing with
text strings or characters, it offers significant savings when dealing with numbers: a binary
representation can encode up to 256 different values per byte, whereas a text-based
representation uses 1 byte per digit and can therefore encode only up to 10 different values per
byte. In sum, DICOM offers up to a 25:1 downscale (256 versus 10 values per byte) just by
using binary encoding without data
compression. This feature becomes especially important when dealing with data-intensive
modalities, such as flow cytometry (up to 50000 events per file) and microarrays. It is even more
important in the context of characterization, where a significant number of experiments and
repeats may be needed in order to extract the properties of a biopart (for instance, when a
promoter is induced over a large range of concentrations, or when the biopart exhibits highly
stochastic behavior).
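The storage argument can be checked on a concrete data set. The sketch below is an illustration under stated assumptions: it compares newline-separated decimal text with fixed-width binary for a made-up channel of 50000 events, each a 16-bit integer. The helper names and the synthetic values are hypothetical, and the ratio obtained in practice depends on how many digits each value needs as text and on the binary width chosen.

```python
import struct


def text_size(values):
    """Bytes needed to store the values as newline-separated decimal text."""
    return len("\n".join(str(v) for v in values).encode("ascii"))


def binary_size(values, fmt="<H"):
    """Bytes needed to store the values as fixed-width binary.

    The default format "<H" is a little-endian unsigned 16-bit integer.
    """
    return len(b"".join(struct.pack(fmt, v) for v in values))


# Hypothetical flow cytometry channel: 50000 events with values in 0..65535.
events = [(i * 7919) % 65536 for i in range(50000)]

t = text_size(events)
b = binary_size(events)
print(f"text: {t} bytes, binary: {b} bytes, ratio: {t / b:.2f}:1")
```

Running the same comparison on real acquisition files, with the numeric types those files actually use, would give the saving that applies to a given modality.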
We have also described a less obvious, but practically important, feature of DICOM-SB in
relation to data acquisition: its communication layer. DICOM (and by extension DICOM-SB) is
a communication standard as well as a data standard. The DICOM data representation is
incorporated within a corresponding communication service, built to facilitate the automated
distribution of results between measuring equipment and repositories. Medical ICT has greatly
benefited from such automation over the last two decades, so much so that DICOM is supported
by all the major stakeholders in the medical ICT industry. This, in our opinion, is a crucial
advantage of DICOM-SB over potential rival standards for data acquisition. DICOM is a
well-known standard, widely adopted by both industry and academia. There are already a large
number of programmers and engineers familiar with DICOM. It would therefore take a
relatively small amount of development work for manufacturers to adapt their equipment to
support DICOM-SB. For some modalities such as microscopy that already use DICOM,
transitioning to DICOM-SB would be straightforward due to the similarity between the
standards.
Far from competing with other synthetic biology standards, DICOM-SB can be the perfect
complement to the standards currently available in the area. Taking the example of SBOL, it is
currently the most powerful and successful tool for the representation of structural 32,33 and
functional descriptions 34 in synthetic biology. However, when encoding data-inclusive
representations, DICOM-SB is highly effective due to its binary representation of files and its
storing efficiency. DICOM-SB is also ideal for encoding experimental results. The underlying
data model can, where appropriate, readily provide a binary implementation of an SBOL data
model.
The DICOM-SB data model is the first extensive attempt at standardizing the data acquisition
process in synthetic biology. The performance module standardizes the encoding of raw data in
synthetic biology.
As a follow-up to DICOM-SB, we have also developed a data model to standardize the encoding
of analyzed (experimental) data in synthetic biology: the datasheet module. It is based on the
model that was developed to help disseminate the biopart datasheets hosted by SynBIS, which
matches the workflow for a canonical characterization pipeline shown in Figure 1. The data
model has also been designed to promote compatibility between standards. We plan to release
the datasheets, together with a DICOM-SB implementation based on DICOM Structured
Reports 49. It is our belief that the datasheet module will provide a simple way for existing
repositories (which mainly deal with designs) to host raw characterization data (as DICOM-SB)
and their interpretation.
The adoption of DICOM-SB by the synthetic biology community (academia and industry)
would represent a discrete, important milestone in the development of synthetic biology,
particularly in relation to interoperability and industrial translation.
METHODS
Two different software applications have supported the results presented in the preceding sections:
A DICOM-SB converter, which takes as input the ad hoc text and FCS data files generated
by the different modalities (flow cytometry and plate reader, in our case) and produces
DICOM-SB files. This software has been developed in Java SE 50, using the libraries of the
dcm4che Toolkit 51 to produce DICOM data. dcm4che is a popular collection of open source
applications for healthcare. The dcm4che toolkit constitutes an excellent starting point for the
development of DICOM-SB applications in Java SE.
A DICOM Store service, responsible for uploading the DICOM-SB formatted raw data
into our SynBIS repository. This application has been developed as a RESTful Web
Service under Java EE 7 52. Although this service is not publicly available through
SynBIS, there are a number of open tools available elsewhere (e.g. the ones under the
dcm4che project 51) that can be used instead.
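To make the idea of a RESTful Store service concrete, the following Python sketch builds (but does not send) an HTTP upload request. It is not the interface of the SynBIS service, which is not public: the host name, the /store path and the use of the application/dicom media type for the request body are all assumptions made for illustration.

```python
from urllib.request import Request


def build_store_request(base_url: str, dicom_bytes: bytes) -> Request:
    """Build a POST request carrying one DICOM-SB file (hypothetical endpoint)."""
    return Request(
        url=f"{base_url.rstrip('/')}/store",   # assumed path, for illustration only
        data=dicom_bytes,
        method="POST",
        headers={"Content-Type": "application/dicom"},
    )


# Placeholder bytes stand in for the contents of a real DICOM-SB file.
req = build_store_request("https://repository.example.org/dicomsb/", b"\x00" * 132)
print(req.get_method(), req.full_url, len(req.data))
```

Passing the request to urllib.request.urlopen would perform the upload against a real server; authentication and error handling are left out of this sketch.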
ASSOCIATED CONTENT
A detailed description of the DICOM-SB standard is available free of charge via the Internet at
http://pubs.acs.org. The software tools used to generate DICOM-SB files from experimental
data, as well as sample DICOM-SB files, are available via the Internet at
http://synbis.bg.ic.ac.uk/dicomsb. Description of the characterization protocol as well as
examples of SBOL files for the characterization constructs can also be found there.
AUTHOR INFORMATION
Corresponding Author
* Richard I Kitney, Co-Director of the Centre for Synthetic Biology and Innovation, Imperial
College London, SW7 2AZ, United Kingdom.
Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval
to the final version of the manuscript. All authors contributed equally.
ACKNOWLEDGMENT
The authors acknowledge the support provided for synthetic biology research by the Engineering
and Physical Sciences Research Council [EP/J02175X/1] and the European Commission funded
7th Framework Program [FP7-KBBE 289326].
REFERENCES
(1) Kitney, R., Calvert, J., Challis, R., Cooper, J., Elfick, A., Freemont, P., Haseloff, J., Kelly,
M., and Paterson, L. (2009) Synthetic Biology: scope, applications and implications. The Royal
Academy of Engineering.
(2) Clarke, L., Adams, J., Sutton, P., Bainbridge, J., Birney, E., Calvert, J., Collis, A., Kitney, R.,
Freemont, P., Manson, P., Pandya, K., Ghaffar, T., Rose, N., Marris, C., and Woolfson, D.
(2012) A Synthetic Biology Roadmap for the UK., pp 1–35. Research Councils UK.
(3) Cheng, A. A., and Lu, T. K. (2012) Synthetic Biology: An Emerging Engineering Discipline.
Annu. Rev. Biomed. Eng. 14, 155–178.
(4) Isaacs, F. J., Dwyer, D. J., Ding, C., Pervouchine, D. D., Cantor, C. R., and Collins, J. J.
(2004) Engineered riboregulators enable post-transcriptional control of gene expression. Nat.
Biotechnol. 22, 841–847.
(5) Win, M. N., Liang, J. C., and Smolke, C. D. (2009) Frameworks for Programming Biological
Function through RNA Parts and Devices. Chem. Biol. 16, 298–310.
(6) Salis, H. M., Mirsky, E. A., and Voigt, C. A. (2009) Automated design of synthetic ribosome
binding sites to control protein expression. Nat. Biotechnol. 27, 946–950.
(7) Elowitz, M. B., and Leibler, S. (2000) A synthetic oscillatory network of transcriptional
regulators. Nature 403, 335–338.
(8) Gardner, T. S., Cantor, C. R., and Collins, J. J. (2000) Construction of a genetic toggle switch
in Escherichia coli. Nature 403, 339–342.
(9) Atkinson, M. R., Savageau, M. A., Myers, J. T., and Ninfa, A. J. (2003) Development of
Genetic Circuitry Exhibiting Toggle Switch or Oscillatory Behavior in Escherichia coli. Cell
113, 597–607.
(10) Deans, T. L., Cantor, C. R., and Collins, J. J. (2007) A Tunable Genetic Switch Based on
RNAi and Repressor Proteins for Regulating Gene Expression in Mammalian Cells. Cell 130,
363–372.
(11) Ham, T. S., Lee, S. K., Keasling, J. D., and Arkin, A. P. (2008) Design and Construction of
a Double Inversion Recombination Switch for Heritable Sequential Genetic Memory. PLoS ONE
3, e2815.
(12) Kramer, B. P., and Fussenegger, M. (2005) Hysteresis in a synthetic mammalian gene
network. Proc. Natl. Acad. Sci. U. S. A. 102, 9517–9522.
(13) Fung, E., Wong, W. W., Suen, J. K., Bulter, T., Lee, S., and Liao, J. C. (2005) A synthetic
gene–metabolic oscillator. Nature 435, 118–122.
(14) Danino, T., Mondragón-Palomino, O., Tsimring, L., and Hasty, J. (2010) A synchronized
quorum of genetic clocks. Nature 463, 326–330.
(15) Anderson, J. C., Voigt, C. A., and Arkin, A. P. (2007) Environmental signal integration by a
modular AND gate. Mol. Syst. Biol. 3, 133.
(16) Win, M. N., and Smolke, C. D. (2008) Higher-Order Cellular Information Processing with
Synthetic RNA Devices. Science 322, 456–460.
(17) Wang, B., Kitney, R. I., Joly, N., and Buck, M. (2011) Engineering modular and orthogonal
genetic logic gates for robust digital-like synthetic biology. Nat. Commun. 2, 508.
(18) Basu, S., Mehreja, R., Thiberge, S., Chen, M.-T., and Weiss, R. (2004) Spatiotemporal
control of gene expression with pulse-generating networks. Proc. Natl. Acad. Sci. U. S. A. 101,
6355–6360.
(19) Basu, S., Gerchman, Y., Collins, C. H., Arnold, F. H., and Weiss, R. (2005) A synthetic
multicellular system for programmed pattern formation. Nature 434, 1130–1134.
(20) You, L., Cox, R. S., Weiss, R., and Arnold, F. H. (2004) Programmed population control by
cell–cell communication and regulated killing. Nature 428, 868–871.
(21) Khalil, A. S., and Collins, J. J. (2010) Synthetic biology: applications come of age. Nat. Rev.
Genet. 11, 367–379.
(22) Weber, W., and Fussenegger, M. (2012) Emerging biomedical applications of synthetic
biology. Nat. Rev. Genet. 13, 21–35.
(23) Kitney, R., and Freemont, P. (2012) Synthetic biology – the state of play. FEBS Lett. 586,
2029–2036.
(24) Ellis, T., Adie, T., and Baldwin, G. S. (2011) DNA assembly for synthetic biology: from
parts to pathways and beyond. Integr. Biol. 3, 109–118.
(25) Kelwick, R., MacDonald, J. T., Webb, A. J., and Freemont, P. (2014) Developments in the
tools and methodologies of synthetic biology. Synth. Biol. 2, 60.
(26) Registry of Standard Biological Parts. http://parts.igem.org (accessed Oct 29, 2015).
(27) JBEI Inventory of Composable Elements (ICE). https://public-registry.jbei.org (accessed
Nov 29, 2015).
(28) Virtual Parts Repository. http://sbol.ncl.ac.uk:8081 (accessed Oct 29, 2015).
(29) Synthetic Biology Information System (SynBIS). http://synbis.bg.ic.ac.uk (accessed Oct 29,
2015). On-line repository of biopart-datasheets.
(30) Pearson, W. R., and Lipman, D. J. (1988) Improved tools for biological sequence
comparison. Proc. Natl. Acad. Sci. 85, 2444–2448.
(31) Bilofsky, H. S., and Christian, B. (1988) The GenBank® genetic sequence data bank.
Nucleic Acids Res. 16, 1861–1863.
(32) Galdzicki, M., Wilson, M., Rodriguez, C. A., Pocock, M. R., Oberortner, E., Adam, L.,
Adler, A., Anderson, J. C., Beal, J., Cai, Y., Chandran, D., Densmore, D., Drory, O. A., Endy,
D., Gennari, J. H., Grünberg, R., Ham, T. S., Hillson, N. J., Johnson, J. D., Kuchinsky, A., Lux,
M. W., Madsen, C., Misirli, G., Myers, C. J., Olguin, C., Peccoud, J., Plahar, H., Platt, D.,
Roehner, N., Sirin, E., Smith, T. F., Stan, G.-B., Villalobos, A., Wipat, A., and Sauro, H. M.
(2012) Synthetic Biology Open Language (SBOL) Version 1.1.0.
(33) Galdzicki, M., Clancy, K. P., Oberortner, E., Pocock, M., Quinn, J. Y., Rodriguez, C. A.,
Roehner, N., Wilson, M. L., Adam, L., Anderson, J. C., Bartley, B. A., Beal, J., Chandran, D.,
Chen, J., Densmore, D., Endy, D., Grünberg, R., Hallinan, J., Hillson, N. J., Johnson, J. D.,
Kuchinsky, A., Lux, M., Misirli, G., Peccoud, J., Plahar, H. A., Sirin, E., Stan, G.-B., Villalobos,
A., Wipat, A., Gennari, J. H., Myers, C. J., and Sauro, H. M. (2014) The Synthetic Biology Open
Language (SBOL) provides a community standard for communicating designs in synthetic
biology. Nat. Biotechnol. 32, 545–550.
(34) Roehner, N., Oberortner, E., Pocock, M., Beal, J., Clancy, K., Madsen, C., Misirli, G.,
Wipat, A., Sauro, H., and Myers, C. J. (2014) Proposed Data Model for the Next Version of the
Synthetic Biology Open Language. ACS Synth. Biol.
(35) Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., and the rest of
the SBML Forum: Arkin, A. P., Bornstein, B. J., Bray, D., Cornish-Bowden, A., Cuellar, A. A.,
Dronov, S., Gilles, E. D., Ginkel, M., Gor, V., Goryanin, I. I., Hedley, W. J., Hodgman, T. C.,
Hofmeyr, J.-H., Hunter, P. J., Juty, N. S., Kasberger, J. L., Kremling, A., Kummer, U., Le Novère,
N., Loew, L. M., Lucio, D., Mendes, P., Minch, E., Mjolsness, E. D., Nakayama, Y., Nelson,
M. R., Nielsen, P. F., Sakurada, T., Schaff, J. C., Shapiro, B. E., Shimizu, T. S., Spence, H. D.,
Stelling, J., Takahashi, K., Tomita, M., Wagner, J., and Wang, J. (2003) The systems biology
markup language (SBML): a medium for representation and exchange of biochemical network
models. Bioinformatics 19, 524–531.
(36) Hucka, M., Finney, A., Bornstein, B. J., Keating, S. M., Shapiro, B. E., Matthews, J.,
Kovitz, B. L., Schilstra, M. J., Funahashi, A., Doyle, J. C., and Kitano, H. (2004) Evolving a
lingua franca and associated software infrastructure for computational systems biology: the
Systems Biology Markup Language (SBML) project. Syst. Biol. IEE Proc. 1, 41–53.
(37) Finney, A., and Hucka, M. (2003) Systems biology markup language: Level 2 and
beyond.
(38) Bartley, B., Beal, J., Clancy, K., Misirli, G., Roehner, N., Oberortner, E., Pocock, M.,
Bissell, M., Madsen, C., Nguyen, T., Zhang, Z., Gennari, J. H., Myers, C., Wipat, A., and Sauro,
H. (2015) Synthetic Biology Open Language (SBOL) Version 2.0.0. J. Integr. Bioinforma. 12,
272.
(39) Roehner, N., and Myers, C. J. (2014) A Methodology to Annotate Systems Biology Markup
Language Models with the Synthetic Biology Open Language. ACS Synth. Biol. 3, 57–66.
(40) NEMA PS3 / ISO 12052, Digital Imaging and Communications in Medicine (DICOM)
Standard. National Electrical Manufacturers Association, Rosslyn, VA, USA.
(41) Gunter, T. D., and Terry, N. P. (2005) The Emergence of National Electronic Health Record
Architectures in the United States and Australia: Models, Costs, and Questions. J. Med. Internet
Res. 7.
(42) Hoerbst, A., and Ammenwerth, E. (2010) Electronic Health Records: A Systematic Review
on Quality Requirements. Methods Inf. Med. 49, 320–336.
(43) Poh, C.-L., Kitney, R. I., and Shrestha, R. B. K. (2007) Addressing the Future of Clinical
Information Systems — Web-Based Multilayer Visualization. IEEE Trans. Inf. Technol. Biomed.
11, 127–140.
(44) Choplin, R. H., Boehme, J. M., and Maynard, C. D. (1992) Picture archiving and
communication systems: an overview. RadioGraphics 12, 127–129.
(45) Meyer-Ebrecht, D. (1994) Picture archiving and communication systems (PACS) for
medical application. Int. J. Biomed. Comput. 35, 91–124.
(46) Müller, H., Michoux, N., Bandon, D., and Geissbuhler, A. (2004) A review of content-based
image retrieval systems in medical applications—clinical benefits and future directions. Int. J.
Med. Inf. 73, 1–23.
(47) Leif, R. C., and Leif, S. B. (2001) DICOM-compatible format for analytical cytology data
that can be expressed in XML, pp 238–248.
(48) Danos, V., Feret, J., Fontana, W., Harmer, R., and Krivine, J. (2008) Rule-Based Modelling,
Symmetries, Refinements, in Formal Methods in Systems Biology (Fisher, J., Ed.), pp 103–122.
Springer Berlin Heidelberg.
(49) Clunie, D. A. (2000) DICOM Structured Reporting. PixelMed Publishing, Bangor, Pa.
(50) Java Standard Edition (SE).
http://www.oracle.com/technetwork/java/javase/overview/index.html (accessed Oct 29, 2015).
(51) dcm4che2 DICOM Toolkit.
http://www.dcm4che.org/confluence/display/d2/dcm4che2+DICOM+Toolkit (accessed Oct 29,
2015).
(52) The Java™ API for RESTful Web Services. https://jcp.org/en/jsr/detail?id=311 (accessed
Oct 29, 2015).
For Table of Contents Use Only
We present in this paper the first data acquisition standard for synthetic biology: DICOM-SB.
Built on a modular data model for the experimental process, DICOM-SB optimizes data storage
and has a communication layer supporting exchange of information between modalities and
repositories. To demonstrate these features, we use the example of the biopart characterization
pipeline at Imperial College London.