Towards the First Data Acquisition Standard in Synthetic Biology

Iñaki Sainz de Murieta†‡, Matthieu Bultelle†‡ and Richard I Kitney†‡*

† Centre for Synthetic Biology and Innovation, Imperial College London, SW7 2AZ, United Kingdom

‡ Department of BioEngineering, Imperial College London, SW7 2AZ, United Kingdom

* Correspondence: Richard I Kitney, Co-Director of the Centre for Synthetic Biology and

Innovation, Imperial College London, SW7 2AZ, United Kingdom.

[email protected]

KEYWORDS: data acquisition, characterization, standard, biopart, synthetic biology.

ABSTRACT

This paper describes the development of a new data acquisition standard for synthetic biology.

This comprises the creation of a methodology that is designed to capture all the data, metadata

and protocol information associated with biopart characterization experiments. The new

standard, called DICOM-SB, is based on the highly successful Digital Imaging and

Communications in Medicine (DICOM) standard in medicine. A data model is described which

has been specifically developed for synthetic biology. It is a modular, extensible data

model for the experimental process, which can optimize data storage for large amounts of data.

DICOM-SB also includes services oriented towards the automatic exchange of data and

information between modalities and repositories. DICOM-SB has been developed in the context

of systematic design in synthetic biology — which is based on the engineering principles of

modularity, standardization and characterization. The systematic design approach utilizes the

design, build, test and learn design cycle paradigm. DICOM-SB has been designed to be

compatible with and complementary to other standards in synthetic biology, including SBOL. In

this regard, the software provides effective interoperability. The new standard has been tested by

experiments and data exchange between Nanyang Technological University in Singapore and

Imperial College London.

INTRODUCTION

Synthetic biology is a young discipline (15 years, at most) that aims to design and engineer

biologically based parts, novel devices and systems — as well as redesigning existing, natural

biological systems 1,2. Bioparts are the key element of this definition: they perform specific

functions such as regulating transcription/translation or binding to small molecules or protein

domains, and are used as the basic blocks for building devices and systems of higher complexity.

The first bioparts used in synthetic biology applications were natural parts, transplanted to other

settings (e.g. a different chassis). Originally only a few parts were available, but soon synthetic

libraries were built by modifying natural parts with techniques such as error-prone PCR 3–6.

Device design, spearheaded by the repressilator 7 and the toggle switch 8, and followed by an

extensive number of important devices 9–20, proved that the analogy between bioparts and

electronic components could be used to design devices — and that practically, it was possible to

endow biological systems with computing-like behavior by combining elementary bioparts.

The second wave of synthetic devices (2010 onwards) has been characterized not only by

attempts to build more complex devices and investigate robust design principles, but also by a

focus on applications such as biosensing, biofuels, pharmaceuticals and

biomaterials, with the stated aim to establish synthetic biology as one of the key technologies to

solve major societal problems 21,22.

Standard workflows. As engineered biological systems and their applications become more

complex and ambitious, the traditional iterative approach to design, mainstream in many fields of

engineering, has also been adopted in synthetic biology 23. The design cycle (illustrated in Figure

1) comprises four distinct stages: design, build, test and learn. Depending on the results, the process may be repeated

(iterated) several times until the initial specifications are met.

The concepts of modularity (the approach that builds larger systems by combining smaller

subsystems — here, bioparts and available devices) and division of labor (the specialization of

cooperating individuals who perform specific tasks and roles) are central to its success. The latter

plays an increasing role, as projects become more complex and larger teams of specialists are

needed. In a move that mirrors the electronic industry, where circuit design is “fabless” and

construction takes place in specialized foundries, outsourcing the DNA synthesis of genes and

gene fragments is now an integral part of the rational design cycle — final assembly taking place

in-house using an ever wider range of techniques 24,25.

Characterization is the process of describing distinctive characteristics or essential features of

bioparts. Accurate characterization is essential to the success of the iterative design approach.

Parts need to be characterized to a high standard, so the behavior of their combination may be

predicted with higher fidelity. Of equal importance is that repositories must make large libraries

of characterized parts available, such that new systems can be built by their

addition/combination.

Figure 1. The synthetic biology design cycle.

Several high profile repositories are in common use — the iGEM Registry of Standard

Biological Parts 26, the JBEI Inventory of Composable Elements (ICE) 27 and the Virtual Parts

Repository 28, to name a few. However, currently no repository offers a large catalogue of

bioparts that are characterized to a consistently high standard. Although regrettable, such a state

of affairs is not surprising. Collecting the data for such catalogues is a lengthy, staff- and

resource-intensive affair, and was very difficult before the advent of high-throughput, automated

platforms. Building the catalogues also requires the development and adoption of a set of robust

data formats to describe the various components of the characterization experiments. In

particular, it requires the development and adoption of a data format to store the raw data they

generate — a format such as the one presented in this paper.

In order to build such an online catalogue of bioparts, a characterization pipeline has been

established at Imperial College, London (see Figure 2). It is supported by an IT-spine called

SynBIS 29, which enables characterization on a scale that is difficult for human experimentalists

to achieve. First developed with constitutive promoters, the system now supports the

characterization of other fundamental bioparts — such as inducible promoters — and continues

to be expanded. Plate reader and flow cytometry data are typically acquired.

Information and data standards. Design and characterization projects vary greatly in terms of

purpose, internal organization and output; nonetheless, they deal with similar types of

information. We have identified three main categories:

Sequence description (description of genetic objects of interest): FASTA 30 and GenBank

31 are well established formats that underpin very large public databases of naturally

occurring, annotated, sequences. However, they are not suitable for the description of the

genetic constructs encountered in synthetic biology — as they were not designed to

express the hierarchy and modularity of constructs. This has led to the creation of SBOL

(the Synthetic Biology Open Language), a standard that captures the same sequence-oriented

information found in a GenBank file but also allows full hierarchical annotation

of DNA components 32, and thus facilitates the exchange of genetic designs 33. SBOL’s

latest version, SBOL 2.0 34, proposed a revision to the core model in order to represent a

wider range of molecular interactions and components.

Figure 2. The CSynBI characterization pipeline.

Modelling (representing the actual or desired behavior of genetic objects): The Systems

Biology Markup Language (SBML) 35–37 is the most commonly used modelling standard

for the representation of biological phenomena. It is free, open, enjoys widespread

software support and is the de facto standard for representing computational models in

systems biology today 35–37. Modelling standards naturally complement genetic-construct

description standards, as made apparent by SBOL 2.0 38 and by the mechanism to

annotate SBML models with SBOL files put forward by Roehner and Myers 39.

Raw data acquisition (learning about the genetic objects): As it matures into an

engineering discipline, synthetic biology will move from qualitative to quantitative data.

The amount of data captured and analyzed will consequently increase substantially —

due in no small part to the availability of new imaging modalities (generating ever larger

files) and the development of high throughput platforms 25.

RESULTS AND DISCUSSION

Motivation: data acquisition is the missing standard. In the design cycle, data acquisition

takes place during the testing phase. Since validating or rejecting a design is determined by

whether some concentrations and observed phenotypes fall within ranges listed in the

specifications, only a few repeat experiments may be needed for a given context. However,

testing may have to be performed for a potentially large number of candidate constructs and

experimental contexts.

With biopart characterization, a set of experiments is typically run on a small number of

characterization constructs (often, one containing the biopart and a set of controls).

Characterization differs from testing in that it has no margin of tolerance: it has to be as precise

as possible (so the results may be re-used to model more complex designs). In practice, a large

number of repeat experiments should be run and several acquisition modalities used. Because of

the need for catalogues of significant numbers of characterized parts, characterization can be

expected to be a main driver behind increased data capture in the future.

At present there is no data acquisition standard for synthetic biology. Developing such a standard is therefore of the utmost

importance for the field, as it will be a key driver in transforming it into a fully-fledged

engineering discipline. For such a standard to be of use in synthetic biology, it needs to

effectively support data acquisition. Therefore, it must:

1. Be based on a modular, extensible data model for the experimental process.

2. Optimize data storage of very large amounts of data.

3. Provide services oriented to automate the exchange of information between modalities

and repositories.

4. Build, if possible, on an existing validated standard, to facilitate adoption by hardware

manufacturers and the engineering community.

DICOM and DICOM-SB. There is already a highly successful representation and

communication standard that meets requirements 1 to 4, described above, in the field of

biomedicine. DICOM (Digital Imaging and Communications in Medicine) is the de facto

standard for handling, storing, printing, and transmitting information in medical imaging. It is

also known as NEMA standard PS3, and as ISO standard 12052:2006 "Health informatics —

Digital Imaging and Communication in Medicine (DICOM) including workflow and data

management" 40. It is both a file format definition and a network communications protocol (based

on TCP/IP). Originally developed to achieve compatibility between different medical imaging

and information systems, it has developed into a comprehensive standard over the last 20 years.

Several key technical aspects of DICOM provide a strong case for extending the standard to

synthetic biology (technically by adding a new module; working title DICOM-SB), rather than

developing a novel standard or adapting an existing rival data acquisition standard:

First, DICOM was designed with data acquisition and transmission in mind. This means

that most of the practical issues involved in networking various imaging

resources and repositories have already been solved. Also, DICOM’s real-world model

was built around the experimental process.

Second, DICOM already supports a number of imaging modalities, such as microscopy,

used in synthetic biology.

Third, because of DICOM’s popularity, there already are a large number of programmers

and engineers familiar with the standard. It would therefore take little development work

for manufacturers to adapt their equipment and support DICOM-SB.

DICOM has also had a transformative effect on medical ICT — the like of which would be

highly beneficial to synthetic biology. The combination of DICOM and HL7 has supported the

development of electronic health records (EHR) — a class of software that systematically

collects electronic health information about individual patients or populations – including

demographics, medical history, medication and allergies, images, vital signs, personal statistics,

and procedural information 41–43. In parallel, a class of software called PACS (picture archiving

and communication systems) was developed to provide economical storage of, and access to, images

acquired with multiple modalities 44–46. It is straightforward to draw an analogy between

EHRs/PACS and repositories of characterized bioparts that would collect raw data, procedural

data (assembly for instance), experimental protocol data and processed data.

Indeed we fully anticipate a successful data acquisition standard (such as the one we present in

this paper) to underpin such repositories. The next subsections will introduce a novel variant of

the successful DICOM standard and make the case that this standard is highly suitable to support

data acquisition in synthetic biology. In particular, it is complementary to SBOL and provides

efficient data storage.

DICOM for Synthetic Biology. As argued in the previous section, our analysis of the DICOM

standard established that its features are compatible with the requirements for synthetic biology.

Therefore, we decided to develop a new synthetic biology extension for DICOM (DICOM-SB).

DICOM-SB provides a framework that allows the integration of wet lab experimental data

acquisition modalities into a common data model. It enhances the basic architecture inherited

from DICOM, to allow the encoding of new synthetic biology data acquisition modalities not

present in the general standard 40.

DICOM encodes data objects as a series of items (or data elements), such that each item is

identified by a predefined attribute (also called a tag). Attributes are named by the combination

of two fields: group and element. Groups organize the attributes into categories, while each

element identifies a different type of attribute within a group. Each attribute is associated with a

data type (e.g. integer, float, character, string, etc.), which in DICOM is called a Value

Representation (VR). Finally, the value to be represented is encoded at the end (last bytes) of

each item, preceded by the length of the value. Figure 3-A depicts the DICOM encoding of data

items.
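
The layout described above can be sketched in a few lines of Python. This is a simplified illustration, not part of DICOM-SB: it assumes the Explicit VR Little Endian transfer syntax, handles only a handful of VRs, and the helper name `encode_element` is hypothetical. It encodes the Sampling Frequency attribute (003A,001A) as a decimal string (VR "DS").

```python
import struct

# VRs that use the long header form (2 reserved bytes + 4-byte length)
# in the Explicit VR Little Endian transfer syntax (simplified list).
LONG_FORM_VRS = {"OB", "OW", "OF", "SQ", "UT", "UN"}

def encode_element(group: int, element: int, vr: str, value: bytes) -> bytes:
    """Encode one data element: tag (group, element), VR, length, value."""
    if len(value) % 2:
        # Values must have an even length: UI and OB are padded with a
        # null byte, text values with a trailing space.
        value += b"\x00" if vr in ("UI", "OB") else b" "
    tag = struct.pack("<HH", group, element)
    if vr in LONG_FORM_VRS:
        header = vr.encode("ascii") + b"\x00\x00" + struct.pack("<I", len(value))
    else:
        header = vr.encode("ascii") + struct.pack("<H", len(value))
    return tag + header + value

# Sampling Frequency (003A,001A) with an illustrative value in Hz.
item = encode_element(0x003A, 0x001A, "DS", b"0.00111")
print(item.hex(" "))
```

The value bytes come last, immediately after the length field, which matches the ordering described in the text.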

Figure 3. (A) DICOM encoding of data elements. (B) Nesting data elements using the SQ value

representation. (C) Building Modules and Information Entities. (D) The DICOM-SB data model.

Data objects can be nested into higher level objects. This is achieved by using the Sequence (SQ)

value representation. When an attribute is assigned an SQ VR it means the content of its value

field is a series of DICOM objects. Each object within the nested series includes another series of

data elements, and some of them may (or may not) be encoded again with an SQ VR, which would

add another nesting layer, and so on. The tree in Figure 3-B illustrates a nesting example

including different data elements and objects.
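
As a rough illustration of this nesting, a data object can be modelled as a dictionary whose SQ-valued attributes hold lists of further data objects. The attribute names below are hypothetical and chosen for readability; they are not taken from the DICOM-SB dictionary.

```python
# A data object is modelled as a dict: attribute name -> (VR, value).
# For VR "SQ" the value is a list of nested data objects.
dataset = {
    "ExperimentLabel": ("LO", "promoter-assay"),
    "SeriesSequence": ("SQ", [
        {
            "SeriesNumber": ("IS", "1"),
            "ChannelSequence": ("SQ", [
                {"ChannelLabel": ("SH", "OD")},
                {"ChannelLabel": ("SH", "GFP")},
            ]),
        },
    ]),
}

def nesting_depth(ds: dict) -> int:
    """Count the nesting layers, including the object itself."""
    deepest = 0
    for vr, value in ds.values():
        if vr == "SQ":
            for nested in value:
                deepest = max(deepest, nesting_depth(nested))
    return 1 + deepest

print(nesting_depth(dataset))
```

Each additional SQ attribute inside a nested object adds one more layer, which the recursive function counts.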

DICOM includes two types of value representations to enable resource identification. On the one

hand the AE VR represents an Application Entity. An AE is the name of a DICOM device or

program that uniquely identifies it locally (e.g. within a single network). It can refer to a

specific workstation (e.g. WORKSTATION2), a specific software service (e.g. DATASTORE),

etc. On the other hand, the UI VR encodes a Unique Identifier, used to uniquely reference

instances of DICOM data. DICOM UIs must be globally unique, and they are built from groups

of digits separated by periods (e.g. 1.2.408.41112.3.1).
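
The two identifier formats can be checked with a short validator. This is a minimal sketch under the usual DICOM constraints (a UI holds at most 64 characters and its numeric components carry no leading zeros; an AE title holds at most 16 characters); the function names are hypothetical.

```python
import re

# Each component is "0" or a number without leading zeros; components
# are separated by single periods.
_UID_PATTERN = re.compile(r"(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*")

def is_valid_uid(uid: str) -> bool:
    """Check the syntax of a Unique Identifier (UI) value."""
    return len(uid) <= 64 and _UID_PATTERN.fullmatch(uid) is not None

def is_valid_ae_title(title: str) -> bool:
    """Check an Application Entity title: 1 to 16 characters, not blank,
    and free of the backslash used as a value separator."""
    return 0 < len(title.strip()) <= 16 and len(title) <= 16 and "\\" not in title

print(is_valid_uid("1.2.408.41112.3.1"), is_valid_ae_title("WORKSTATION2"))
```

Syntactic validity does not guarantee global uniqueness, which depends on how the identifier root is allocated.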

Since the number of DICOM attributes is so large, building consistent objects that include

all the required information can become a tedious task if they have to be searched and chosen

one by one. In order to ease this task, DICOM clusters the attributes describing the same concept

into the same Information Module. Hence, when designing a DICOM object to encode a certain

data structure, modules will be the minimal blocks that will be combined. The module

specification also determines which attributes are mandatory (and thus must always be completed)

and which ones can be left incomplete. Finally, Information Modules are combined to build

Information Entities (IEs), and IEs are aggregated to build Information Object Definitions (IODs)

(see Figure 3-C).
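
The composition of modules into entities and entities into an IOD can be sketched as follows. The class layout, module names and attribute names are illustrative assumptions; the actual definitions are those given in the DICOM standard and in the supporting material.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    mandatory: set                      # attributes that must be completed
    optional: set = field(default_factory=set)

@dataclass
class InformationEntity:
    name: str
    modules: list

@dataclass
class IOD:
    name: str
    entities: list

    def missing_mandatory(self, values: dict) -> list:
        """List mandatory attributes that are absent or empty in `values`."""
        missing = []
        for entity in self.entities:
            for module in entity.modules:
                for attr in sorted(module.mandatory):
                    if values.get(attr) in (None, ""):
                        missing.append(f"{entity.name}/{module.name}/{attr}")
        return missing

# Hypothetical modules and attributes, for illustration only.
host = Module("Host", mandatory={"HostName"}, optional={"HostStrain"})
series = Module("General Series", mandatory={"SeriesNumber"})
iod = IOD("Example IOD", [InformationEntity("Host", [host]),
                          InformationEntity("Series", [series])])
print(iod.missing_mandatory({"HostName": "E. coli"}))
```

Checking an object module by module, rather than attribute by attribute, is what makes building consistent objects manageable.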

In order to understand the basis of the extension of DICOM for synthetic biology, the

hierarchical data model for standard DICOM will now be described. Every data object must

implement a standard IOD. Their entities are related following a hierarchical information model.

In the standard DICOM model for medicine, the patient is at the top of this hierarchy, as they are

the object of analysis of any biomedical application. All the details related to a patient (name,

identifier, age, gender, etc.) are included in the Patient IE. Patients can be subject to different

medical studies, and this requires tracking additional study details (e.g. date, time,

study number, physician’s name, etc.). Going one step down in the hierarchy, this is represented

by the Study IE. Studies comprise different procedures, such that each one is performed on

specific equipment and can be repeated over time. Each procedure is termed a series, and its

features (series number, date, time, etc.) are included in the Series IE. Finally, each series

contains raw data acquired with one modality (e.g. electrocardiogram, magnetic resonance

imaging etc.). In summary, the data measured at the lowest level of the hierarchy (the modality

results) are annotated by the remaining levels: patient, study and series. It is worth noting that

although the main DICOM standard is associated first and foremost with images, the standard

also supports other types of data, waveforms being the most relevant for the present exercise

(see the section on Synthetic Biology Raw Data IOD).

The synthetic biology extension of DICOM organizes its data model following a similar strategy

(see class diagram in Figure 3-D):

Instead of “patient”, the object of study is the transformation of a host organism,

according to a transformation protocol, with a set of genetic constructs whose

behavior within the host is to be determined. We have created three new IEs to model

this information (see top-left of the hierarchy in Figure 3-D, in green).

o Component: as the main target of the characterization process, this IE lies at the

top of the hierarchy. It describes the basic features to be tracked for each biopart

under analysis (the term Component has been chosen to be consistent with

SBOL). One of its attributes, named URI, allows the biopart to be described by

referencing an SBOL entity (a Component Definition), which allows a more

detailed annotation of its DNA sequence. Although nothing prevents the use of

GenBank or FASTA files, it is recommended to use SBOL, as it has the

advantage of supporting the representation of recursive biopart structures. This is

especially useful to represent in a single file a circular plasmid structure

integrating e.g. cargos, bioparts of interest and reporting genes.

o Host: the components (or bioparts) to be characterized are studied and analyzed in

the context of a specific host organism (also known as a chassis). This IE

describes the basic features to be tracked for a host in a characterization

experiment. Cell free systems are also represented by this entity by using a special

host type.

o Transformation: host cells may be genetically modified (transformed) by DNA

components before they are used in a characterization experiment. A

Transformation IE represents a cell design — as a combination of one host

organism and a list of components. The list of components in a transformation

can be empty, meaning that the host would be used untransformed in the

experiment (typically to be used as a control). Optionally, it can also include

details about the transformation protocol (more details about the Protocol IE

below).

In the next level of the hierarchy, the Experiment IE (analogous to Study in the medical

model) is defined. Its purpose is to perform all the procedures required to analyze the

change of behavior that the integrated set of components produces in the host. Each

experiment must also adhere to an experimental protocol whose details are defined by the

Protocol IE (see top-right of the hierarchy in Figure 3-D, in red).

An experiment comprises a set of procedures that are repeated on different compartments

(typically a well) over time. Each single repeat of a specific procedure in a compartment,

performed with dedicated equipment, constitutes a Series (similar to the medical model).

The following IEs expand the scope of a series (see bottom of the hierarchy in Figure 3-D,

in blue):

o Stimuli: when the series requires interaction with external stimuli, this IE may

represent either environmental conditions (e.g. temperature) or chemical

components to be added into the media during the course of the series.

Environmental conditions and chemical components can either be specified as

absolute values or as increments over time.

o Compartment: the experiment cells may be grouped in different compartments,

according to the cell interactions that need to be tested. Thus the Compartment IE

can be seen as a container (e.g. a vessel) where an experiment is performed. When

working with automated platforms, it is common to use plates that arrange wells

as a matrix of rows and columns, such that each well is assigned a different series.

In such a scenario, each well is represented by a Compartment IE that enables

tracking the localization of the series. The term ‘compartment’ was chosen to be

consistent with the modelling standard SBML (where it is defined as a bounded

space in which species are located).

o Equipment: this IE identifies and describes the piece of equipment performing the

measurements — whether it is microscopy, flow cytometry etc.

Finally, each series references the raw data generated by the equipment after that run. In

synthetic biology the raw data are often organized as a list of data arrays, such that each

array represents one of the quantities measured by the equipment (e.g. time,

temperature, fluorescence intensity, optical density, etc.) with its corresponding values.

When the raw data are structured in this fashion they can be easily encoded using the

standard DICOM Waveform module 40. The next subsections of the paper show in more

detail how the Waveform module is used to encode both cell population and single cell

measurements.

As with the standard DICOM model, the experimental measurements for each data acquisition

activity are stored in the corresponding attributes of each modality — whereas the rest of the

higher entities in the data model store the metadata required for classifying, processing, analyzing

and disseminating the experimental measurements.
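
The hierarchy described above can be summarized as a set of plain data classes. This is only a sketch of the relationships between entities: the field names, the default values and the example URI are assumptions made for illustration, and the Stimuli and Protocol entities are reduced to simple fields or omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Component:
    name: str
    uri: Optional[str] = None        # may reference an SBOL Component Definition

@dataclass
class Host:
    name: str
    host_type: str = "cell"          # a special type would mark cell-free systems

@dataclass
class Transformation:
    host: Host
    components: List[Component] = field(default_factory=list)

    @property
    def is_untransformed(self) -> bool:
        # An empty component list means the host is used untransformed.
        return not self.components

@dataclass
class Series:
    number: int
    compartment: str                 # e.g. a well position on a plate
    equipment: str
    raw_data: List[List[float]] = field(default_factory=list)

@dataclass
class Experiment:
    transformation: Transformation
    protocol: str
    series: List[Series] = field(default_factory=list)

host = Host("E. coli")
control = Transformation(host)
design = Transformation(host, [Component("promoter", uri="https://example.org/sbol/promoter")])
experiment = Experiment(design, protocol="constitutive promoter assay",
                        series=[Series(1, "A1", "plate reader")])
print(control.is_untransformed, design.is_untransformed)
```

The raw measurements sit at the bottom of the tree (in `Series.raw_data`), while every level above contributes the metadata that annotates them.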

Having established the basic structure of the DICOM-SB data model, a new Synthetic Biology

IOD was defined to accommodate some of the modalities (and metadata) that are often used in

synthetic biology, but not already present in the DICOM standard. We call this IOD the

Synthetic Biology Raw Data IOD (SBRD). The SBRD was constructed both by reusing some of

the standard DICOM Information Modules (such as the Waveform Module), and defining new

modules. A full description of the DICOM-SB data model is available in the supporting material,

including details of the associated Tags and their Value Representations, as well as the

corresponding Information Modules and Information Entities.

To illustrate how the SBRD is used in practice, let us consider the following case study: the

characterization of constitutive promoters on a robotic automated platform as performed at the

Centre for Synthetic Biology and Innovation (CSynBI) at Imperial College London (see

Supplementary Information for more on the characterization protocol for constitutive promoters

at CSynBI and the data that are collected as part of the exercise).

Encoding cell population measurements. CSynBI’s characterization experiments use a plate

reader to measure the optical density and fluorescence of the population of E. coli (MG-1655)

transformed according to a specific experimental protocol. Thanks to previously established

calibration curves, these measurements are converted into estimates for cell population and GFP

population respectively. The characterization protocol states that the plate reader should

sample each well of the plate at intervals of 15 minutes and track measurements in

two channels:

Target measure: total fluorescence intensity

Population measure: total optical density

The SBRD handles plate reader population level data (which are time series) as follows.

As long as the sampling frequency is constant, and the channel values are chronologically sorted,

the attribute Sampling Frequency (003A, 001A) can be used to track the frequency value. If this

is not the case, an extra channel can be added with corresponding time marks.
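
The choice between the two options can be made programmatically. The sketch below assumes time marks expressed in seconds and a frequency in Hz; the function name and tolerance are hypothetical. A reading every 15 minutes corresponds to 1/900 Hz, roughly 0.00111 Hz.

```python
from typing import Optional, Sequence

def sampling_frequency_hz(times_s: Sequence[float], tol: float = 1e-6) -> Optional[float]:
    """Return the constant sampling frequency in Hz, or None when the
    time marks are not sorted and evenly spaced (in which case the time
    marks would be stored as an extra channel instead)."""
    if len(times_s) < 2:
        return None
    step = times_s[1] - times_s[0]
    if step <= 0:
        return None
    for earlier, later in zip(times_s, times_s[1:]):
        if abs((later - earlier) - step) > tol:
            return None
    return 1.0 / step

regular = [i * 15 * 60 for i in range(5)]    # one reading every 15 minutes
irregular = [0, 900, 1900, 2700]             # one delayed reading
print(sampling_frequency_hz(regular), sampling_frequency_hz(irregular))
```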

The attributes defined as part of the Data Series module provide a data structure to store target

and population measurements in different channels, as well as sampling-independent data

related to the test itself (e.g. acquisition date and time, number of channels and samples, channel

names, channel properties, etc.).

Figure 4 depicts how the data generated by a plate reader experiment can be mapped and

structured into the Raw Data module as part of the SBRD IOD. Referring to Figure 4, reading

from left to right, it can be seen that the whole structure is encoded as a Waveform Sequence,

which allows the inclusion of several modality repeats (for instance, a range of fluorescence

channels, each corresponding to a given bandwidth). As per the protocol, there is only one repeat,

named “Object 1” (corresponding to an excitation wavelength of 385 nm ± 10 nm and an emission

wavelength of 428 nm ± 10 nm).

The first attribute within this object is the Channel Definition Sequence, which includes

the metadata required to describe all the channel related settings; this sequence must

contain as many objects as different channels (Objects 2.1 and 2.2 for OD and GFP

channels), and the data elements within each object relate to the different channel

features: Channel Number, Channel Label, Status (active / inactive / data / test / ...),

number of bits encoding each channel value (Waveform Bits Allocated), etc. Since our

experiments report in arbitrary units, there is no detail about units of measure. However,

such details can be included, if required, and there are attributes available to track them

(see supporting material).

Figure 4. Encoding the results of the plate reader using the Raw Data IOD.

The last attribute (Waveform Data) is used to encode the sequence of experimental data,

such that the samples are sorted in ascending order (from 1 to n). Each sample is built by

the concatenation of the different channel values, sorted as per the corresponding

Waveform Channel Numbers (first OD and then GFP).

The attributes in the center encode metadata not related to the channels, such as the

Sampling Frequency, the length of each data sample (Waveform Bits Allocated and

Waveform Sample Interpretation), total Number of Waveform Channels and Samples,

and Waveform Originality.
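
The sample-by-sample layout of Waveform Data can be illustrated with a short packing routine. It is a sketch under stated assumptions: 16 bits allocated per value, signed little-endian integers, and made-up OD and GFP readings; the function names are hypothetical.

```python
import struct

def pack_waveform(channels):
    """Interleave equal-length channels sample by sample as signed 16-bit
    little-endian integers: s1c1, s1c2, s2c1, s2c2, ..."""
    lengths = {len(channel) for channel in channels}
    if len(lengths) != 1:
        raise ValueError("all channels must have the same number of samples")
    interleaved = [value for sample in zip(*channels) for value in sample]
    return struct.pack(f"<{len(interleaved)}h", *interleaved)

def unpack_waveform(data, n_channels):
    """Recover the per-channel lists from the interleaved byte string."""
    values = struct.unpack(f"<{len(data) // 2}h", data)
    return [list(values[i::n_channels]) for i in range(n_channels)]

od = [10, 12, 15]          # channel 1, illustrative values
gfp = [100, 180, 260]      # channel 2, illustrative values
data = pack_waveform([od, gfp])
print(len(data), unpack_waveform(data, 2))
```

Because the channel order is fixed by the channel numbers, a reader needs only the number of channels and the bits allocated to split the byte string back into samples.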

Encoding single cell measurements. Single cell modalities, such as flow cytometry, yield an

estimate of the amount of fluorescence on a per cell basis. As with plate reader data, the data

need processing before they can be used. However, with flow cytometry, the problem is not to

estimate population values, but, rather, to identify which of the measured particles (events)

correspond to growing bacteria, instead of cell debris. This is typically done through a process

called gating, which involves selecting area(s) on the scatter plot generated during the flow

experiment to decide which cells are to be analyzed and which are not.

In the characterization protocol, data acquisition with flow cytometry takes place twice during

the assay: the first time after 3 hours, the second time after 6 hours. Each time a 10% sacrificial

sample is extracted.

In addition to measuring fluorescence in a range of bandwidths, flow cytometry provides other

types of measurements that are related to properties of a particle. For example, forward scatter

(FSC) relates to the size of the event, while the side scatter (SSC) refers to its granularity. The

curation protocol we have implemented for data analysis uses the FSC value of each event to

determine whether it should be included as living bacteria.
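
A one-dimensional gate on forward scatter can be sketched as below. The text states only that the FSC value is used; the simple interval test, the bounds and the event values are assumptions made for illustration and do not reproduce the actual curation protocol.

```python
def gate_by_fsc(events, fsc_min, fsc_max):
    """Keep only the events whose forward scatter lies in [fsc_min, fsc_max]."""
    return [event for event in events if fsc_min <= event["FSC"] <= fsc_max]

# Hypothetical events in arbitrary units; the bounds are illustrative only.
events = [
    {"FSC": 120, "SSC": 40, "GFP": 5},       # small particle, treated as debris
    {"FSC": 900, "SSC": 310, "GFP": 640},
    {"FSC": 1100, "SSC": 350, "GFP": 700},
]
kept = gate_by_fsc(events, fsc_min=500, fsc_max=2000)
print(len(kept), "of", len(events), "events kept")
```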

The Raw Data module enables the encoding of as many scattering (FSC / SSC) and fluorescence

channels as generated by the flow cytometer. Even the time mark can be tracked as an extra

channel. The mapping is similar to that depicted in Figure 4, but with a larger number of

channels. It is worth noting that there have been attempts at using DICOM to encode cytometry

data 47; however, the SBRD presented in this article is a more general approach, as it can be used

for any type of data.

Supporting communication. After having set the data model, DICOM must offer a service to

allow the communication of data objects (encoded as IODs) between different Application

Entities. In a data acquisition context (such as here), there is a need for at least one service that

stores the acquired IOD into a data repository. Consequently, we have developed a web service

that implements a DICOM Message Service Element (DIMSE): the Store service. The

combination of an IOD and the corresponding DIMSE service creates a Service Object Pair

(SOP). Accordingly, our DICOM-SB extension defines a new SOP class for each

equipment modality used: Synthetic Biology Plate Reader (SBPR) and Synthetic Biology Flow

Cytometry (SBFC). Both of them include (see Figure 5-A):

The IOD called Synthetic Biology Raw Data (SBRD).

The Store DIMSE service.

DICOM-SB has been developed jointly with our automated biopart characterization pipeline (see

Figure 2) and is now mature enough to support all biopart characterization at the Centre for

Synthetic Biology and Innovation (CSynBI). The SBRD and the Store DIMSE service are used

as part of the data acquisition step of the pipeline (step 2), which, in practice, proceeds as follows

(Figure 5-B):

1. Execution of the experimental protocol in the laboratory.

2. Data conversion of experimental results following the DICOM-SB standard (generation

of the SBRD IOD).

3. Automated communication of experimental results from the laboratory equipment to a

centralized data repository (by using the Store DIMSE service).

Data stored in the central repository are then analyzed and eventually published onto SynBIS.
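
The conversion, storage and analysis steps listed above can be sketched end to end. The code is a toy model under stated assumptions: the repository is an in-memory dictionary rather than a DICOM Store service, identifiers are random UUIDs rather than DICOM UIDs, and the names `convert_to_sbrd`, `InMemoryRepository` and `mean_signal`, as well as the metadata keys, are hypothetical.

```python
import struct
import uuid

def convert_to_sbrd(metadata, samples):
    """Step 2: wrap raw measurements in a header plus a binary payload."""
    payload = struct.pack(f"<{len(samples)}d", *samples)  # float64, little-endian
    return {"header": dict(metadata), "n_samples": len(samples), "payload": payload}

class InMemoryRepository:
    """Stand-in for the centralized data repository."""

    def __init__(self):
        self._instances = {}

    def store(self, instance):
        """Step 3: accept one converted object and return its identifier."""
        instance_id = str(uuid.uuid4())
        self._instances[instance_id] = instance
        return instance_id

    def fetch(self, instance_id):
        """Return a previously stored object."""
        return self._instances[instance_id]

def mean_signal(instance):
    """A toy analysis on stored data: decode the payload and average it."""
    values = struct.unpack(f"<{instance['n_samples']}d", instance["payload"])
    return sum(values) / len(values)

repository = InMemoryRepository()
sbrd = convert_to_sbrd(
    {"modality": "SBPR", "construct": "example-promoter"},  # hypothetical metadata
    [0.5, 1.5, 2.5, 3.5],
)
instance_id = repository.store(sbrd)
result = mean_signal(repository.fetch(instance_id))
```

The point of the sketch is the separation of concerns: the analysis code only ever sees objects in the standardized form, regardless of which instrument or laboratory produced them.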

Internal (CSynBI) data are no longer the only data feeding SynBIS. We have established a

collaboration with the Poh Lab at Nanyang Technological University (NTU). As part of this

joint work they have characterized a set of constitutive promoters following a manual version of

our experimental protocol that only collects cell population measurements (currently

unpublished). In addition, they have used DICOM-SB to standardize their characterization

results (encoding their data as an SBRD IOD) and make them available to SynBIS (uploading

them using our Store DIMSE service).

Figure 5. (A) The Synthetic Biology SOP Classes. (B) Receiving characterization data from

external partners.

DISCUSSION

No single standard will be able to support typical, full synthetic biology workflows. It is too

large and too diverse a task for one standard to encompass successfully. In our view a small

number of non-competing data standards can effectively describe the workflows. The first set of

standards should concern the description of genetic constructs. SBOL has been designed for this

task, as well as for sequence annotation 32,33. It is now being expanded to improve the description of

the modular properties of bioparts 34. The second set of standards — SBML 35, Kappa 48, for

example — are used to model the behavior of the constructs and come from systems biology.

They can easily be interfaced with the first set of standards 39.

In this paper we have made the case that the synthetic biology community (from academia to

industry) should also focus its attention on the development and adoption of a third set of

complementary standards that would support and indeed enable data acquisition. To this end we

have developed DICOM-SB, an extension of the DICOM data standard designed for synthetic

biology. DICOM-SB was specifically developed with biopart characterization and construct-

testing in mind. Both are likely to involve heavy data acquisition.

We have developed for DICOM-SB a data model built around a typical experiment in synthetic

biology. It is fully compatible with, and indeed complementary to, SBOL in relation to the description

of the constructs involved in experiments. The data model contains in its header all the metadata

required to describe experimental context accurately — crucially, it also standardizes the

description.

We have shown that one of the advantages of DICOM-SB is that it optimizes data storage. In this

regard, instead of using text-based representations like most standards (e.g. XML, SBML,

SBOL), DICOM-SB encodes data in binary format. While this makes no difference when

dealing with text strings or characters, it offers significant savings when dealing with numbers:

a binary representation can encode up to 256 different values per byte. Conversely, text-based

representations use 1 byte per decimal digit, meaning at most 10 different values per byte.

In sum, DICOM offers up to a 25:1 downscale just by using binary encoding, without data

compression. This feature becomes especially important when dealing with data intensive

modalities, such as flow cytometry (up to 50000 events per file), microarrays etc. It is even more

important in the context of characterization, where a significant number of experiments and

repeats may be needed in order to extract the properties of a biopart (for instance, when inducing a

promoter over a large range of concentrations, or when the biopart exhibits highly stochastic

behavior).
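
The storage argument can be made concrete with a short sketch that writes the same synthetic flow cytometry data once as binary and once as delimited text, then compares the sizes. The event count is the upper bound quoted above; the four channels, the 16-bit value range, the comma and newline delimiters and the random values are assumptions chosen for the example. The ratio obtained for a real file depends on the bit depth of the instrument, the number of digits written and the delimiters used, so the printed figure illustrates this layout only and is not a measurement of DICOM-SB itself.

```python
import random
import struct

N_EVENTS = 50_000   # upper bound per file quoted in the text
N_CHANNELS = 4      # assumed channel count for this example

rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.randrange(0, 65536) for _ in range(N_EVENTS * N_CHANNELS)]

# Binary: every value is an unsigned 16-bit integer, i.e. exactly 2 bytes.
binary = struct.pack(f"<{len(values)}H", *values)

# Text: decimal digits, one event per line, channels separated by commas.
rows = (values[i:i + N_CHANNELS] for i in range(0, len(values), N_CHANNELS))
text = "\n".join(",".join(str(v) for v in row) for row in rows).encode("ascii")

ratio = len(text) / len(binary)
print(f"binary: {len(binary)} bytes, text: {len(text)} bytes, ratio: {ratio:.2f}")
```

The binary size is fixed by the sample type alone, whereas the text size grows with the magnitude of the values, which is why the gain matters most for data-intensive modalities.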

We have also described a less obvious, but practically important, feature of DICOM-SB in

relation to data acquisition: its communication layer. DICOM (and by extension DICOM-SB) is

a communication standard as well as a data standard. The DICOM data representation is

incorporated within a corresponding communication service, built to facilitate the automated

distribution of results between measuring equipment and repositories. Medical ICT has greatly

benefited from such automation over the last two decades, so much so that DICOM is supported

by all the major stakeholders in the medical ICT industry. This, in our opinion, is a crucial

advantage of DICOM-SB over potential rival standards for data acquisition. DICOM is a well-known

standard, widely adopted by both industry and academia. There already are a large

number of programmers and engineers familiar with DICOM. It would therefore take a

relatively small amount of development work for manufacturers to adapt their equipment to

support DICOM-SB. For some modalities such as microscopy that already use DICOM,

transitioning to DICOM-SB would be straightforward due to the similarity between the

standards.

Far from competing with other synthetic biology standards, DICOM-SB can be the perfect

complement to the standards currently available in the area. Taking the example of SBOL, it is

currently the most powerful and successful tool for the representation of structural 32,33 and

functional descriptions 34 in synthetic biology. However, when encoding data inclusive

representations, DICOM-SB is highly effective due to its binary representation of files and its

storing efficiency. DICOM-SB is also ideal for encoding experimental results. The underlying

data model can, where appropriate, readily provide a binary implementation of an SBOL data

model.

The DICOM-SB data model is the first extensive attempt at standardizing the data acquisition

process in synthetic biology. The performance module standardizes the encoding of raw data in

synthetic biology.

As a follow-up to DICOM-SB, we have also developed a data model to standardize the encoding

of analyzed (experimental) data in synthetic biology — the datasheet module. It is based on the

model that was developed to help disseminate the biopart datasheets hosted by SynBIS, which

matches the workflow for a canonical characterization pipeline shown in Figure 1. The data

model has also been designed to promote compatibility between standards. We plan to release

the datasheets, together with a DICOM-SB implementation based on DICOM Structured

Reports 49. It is our belief that the datasheet module will provide a simple way for existing

repositories (which mainly deal with designs) to host raw characterization data (as DICOM-SB)

and their interpretation.

The adoption of DICOM-SB by the synthetic biology community (academia and industry)

would represent a discrete, important milestone in the development of synthetic biology —

particularly in relation to interoperability and industrial translation.

METHODS

Two software applications have supported the results presented in this paper:

A DICOM-SB converter, which takes as input the ad hoc text and FCS data files generated

by the different modalities (flow cytometry and plate reader, in our case) and produces

DICOM-SB files. This software has been developed using Java SE as the programming

language 50, and relies on the libraries of the dcm4che Toolkit 51 to produce

DICOM data. Dcm4che is a popular collection of open source applications for healthcare.

The dcm4che toolkit constitutes an excellent starting point for the development of

DICOM-SB applications in Java SE.

A DICOM Store service, responsible for uploading the DICOM-SB formatted raw data

into our SynBIS repository. This application has been developed as a RESTful Web

Service under Java EE 7 52. Although this service is not publicly available through

SynBIS, a number of open tools available elsewhere (e.g. those under the

dcm4che project 51) can be used instead.

ASSOCIATED CONTENT

A detailed description of the DICOM-SB standard is available free of charge via the Internet at

http://pubs.acs.org. The software tools used to generate DICOM-SB files from experimental

data, as well as sample DICOM-SB files, are available via the Internet at

http://synbis.bg.ic.ac.uk/dicomsb. A description of the characterization protocol, as well as

examples of SBOL files for the characterization constructs, can also be found there.

AUTHOR INFORMATION

Corresponding Author

* Richard I Kitney, Co-Director of the Centre for Synthetic Biology and Innovation, Imperial

College London, SW7 2AZ, United Kingdom.

[email protected]

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval

to the final version of the manuscript. All authors contributed equally.

ACKNOWLEDGMENT

The authors acknowledge the support provided for synthetic biology research by the Engineering

and Physical Sciences Research Council [EP/J02175X/1] and the European Commission funded

7th Framework Program [FP7-KBBE 289326].

REFERENCES

(1) Kitney, R., Calvert, J., Challis, R., Cooper, J., Elfick, A., Freemont, P., Haseloff, J., Kelly,

M., and Paterson, L. (2009) Synthetic Biology: scope, applications and implications. The Royal

Academy of Engineering.

(2) Clarke, L., Adams, J., Sutton, P., Bainbridge, J., Birney, E., Calvert, J., Collis, A., Kitney, R.,

Freemont, P., Manson, P., Pandya, K., Ghaffar, T., Rose, N., Marris, C., and Woolfson, D.

(2012) A Synthetic Biology Roadmap for the UK, pp 1–35. Research Councils UK.

(3) Cheng, A. A., and Lu, T. K. (2012) Synthetic Biology: An Emerging Engineering Discipline.

Annu. Rev. Biomed. Eng. 14, 155–178.

(4) Isaacs, F. J., Dwyer, D. J., Ding, C., Pervouchine, D. D., Cantor, C. R., and Collins, J. J.

(2004) Engineered riboregulators enable post-transcriptional control of gene expression. Nat.

Biotechnol. 22, 841–847.

(5) Win, M. N., Liang, J. C., and Smolke, C. D. (2009) Frameworks for Programming Biological

Function through RNA Parts and Devices. Chem. Biol. 16, 298–310.

(6) Salis, H. M., Mirsky, E. A., and Voigt, C. A. (2009) Automated design of synthetic ribosome

binding sites to control protein expression. Nat. Biotechnol. 27, 946–950.

(7) Elowitz, M. B., and Leibler, S. (2000) A synthetic oscillatory network of transcriptional

regulators. Nature 403, 335–338.

(8) Gardner, T. S., Cantor, C. R., and Collins, J. J. (2000) Construction of a genetic toggle switch

in Escherichia coli. Nature 403, 339–342.

(9) Atkinson, M. R., Savageau, M. A., Myers, J. T., and Ninfa, A. J. (2003) Development of

Genetic Circuitry Exhibiting Toggle Switch or Oscillatory Behavior in Escherichia coli. Cell

113, 597–607.

(10) Deans, T. L., Cantor, C. R., and Collins, J. J. (2007) A Tunable Genetic Switch Based on

RNAi and Repressor Proteins for Regulating Gene Expression in Mammalian Cells. Cell 130,

363–372.

(11) Ham, T. S., Lee, S. K., Keasling, J. D., and Arkin, A. P. (2008) Design and Construction of

a Double Inversion Recombination Switch for Heritable Sequential Genetic Memory. PLoS ONE

3, e2815.

(12) Kramer, B. P., and Fussenegger, M. (2005) Hysteresis in a synthetic mammalian gene

network. Proc. Natl. Acad. Sci. U. S. A. 102, 9517–9522.

(13) Fung, E., Wong, W. W., Suen, J. K., Bulter, T., Lee, S., and Liao, J. C. (2005) A synthetic

gene–metabolic oscillator. Nature 435, 118–122.

(14) Danino, T., Mondragón-Palomino, O., Tsimring, L., and Hasty, J. (2010) A synchronized

quorum of genetic clocks. Nature 463, 326–330.

(15) Anderson, J. C., Voigt, C. A., and Arkin, A. P. (2007) Environmental signal integration by a

modular AND gate. Mol. Syst. Biol. 3, 133.

(16) Win, M. N., and Smolke, C. D. (2008) Higher-Order Cellular Information Processing with

Synthetic RNA Devices. Science 322, 456–460.

(17) Wang, B., Kitney, R. I., Joly, N., and Buck, M. (2011) Engineering modular and orthogonal

genetic logic gates for robust digital-like synthetic biology. Nat. Commun. 2, 508.

(18) Basu, S., Mehreja, R., Thiberge, S., Chen, M.-T., and Weiss, R. (2004) Spatiotemporal

control of gene expression with pulse-generating networks. Proc. Natl. Acad. Sci. U. S. A. 101,

6355–6360.

(19) Basu, S., Gerchman, Y., Collins, C. H., Arnold, F. H., and Weiss, R. (2005) A synthetic

multicellular system for programmed pattern formation. Nature 434, 1130–1134.

(20) You, L., Cox, R. S., Weiss, R., and Arnold, F. H. (2004) Programmed population control by

cell–cell communication and regulated killing. Nature 428, 868–871.

(21) Khalil, A. S., and Collins, J. J. (2010) Synthetic biology: applications come of age. Nat. Rev.

Genet. 11, 367–379.

(22) Weber, W., and Fussenegger, M. (2012) Emerging biomedical applications of synthetic

biology. Nat. Rev. Genet. 13, 21–35.

(23) Kitney, R., and Freemont, P. (2012) Synthetic biology – the state of play. FEBS Lett. 586,

2029–2036.

(24) Ellis, T., Adie, T., and Baldwin, G. S. (2011) DNA assembly for synthetic biology: from

parts to pathways and beyond. Integr. Biol. 3, 109–118.

(25) Kelwick, R., MacDonald, J. T., Webb, A. J., and Freemont, P. (2014) Developments in the

tools and methodologies of synthetic biology. Synth. Biol. 2, 60.

(26) Registry of Standard Biological Parts. http://parts.igem.org (accessed Oct 29, 2015).

(27) JBEI Inventory of Composable Elements (ICE). https://public-registry.jbei.org (accessed

Nov 29, 2015).

(28) Virtual Parts Repository. http://sbol.ncl.ac.uk:8081 (accessed Oct 29, 2015).

(29) Synthetic Biology Information System (SynBIS). http://synbis.bg.ic.ac.uk (accessed Oct 29,

2015). On-line repository of biopart-datasheets.

(30) Pearson, W. R., and Lipman, D. J. (1988) Improved tools for biological sequence

comparison. Proc. Natl. Acad. Sci. U. S. A. 85, 2444–2448.

(31) Bilofsky, H. S., and Christian, B. (1988) The GenBank® genetic sequence data bank.

Nucleic Acids Res. 16, 1861–1863.

(32) Galdzicki, M., Wilson, M., Rodriguez, C. A., Pocock, M. R., Oberortner, E., Adam, L.,

Adler, A., Anderson, J. C., Beal, J., Cai, Y., Chandran, D., Densmore, D., Drory, O. A., Endy,

D., Gennari, J. H., Grünberg, R., Ham, T. S., Hillson, N. J., Johnson, J. D., Kuchinsky, A., Lux,

M. W., Madsen, C., Misirli, G., Myers, C. J., Olguin, C., Peccoud, J., Plahar, H., Platt, D.,

Roehner, N., Sirin, E., Smith, T. F., Stan, G.-B., Villalobos, A., Wipat, A., and Sauro, H. M.

(2012) Synthetic Biology Open Language (SBOL) Version 1.1.0.

(33) Galdzicki, M., Clancy, K. P., Oberortner, E., Pocock, M., Quinn, J. Y., Rodriguez, C. A.,

Roehner, N., Wilson, M. L., Adam, L., Anderson, J. C., Bartley, B. A., Beal, J., Chandran, D.,

Chen, J., Densmore, D., Endy, D., Grünberg, R., Hallinan, J., Hillson, N. J., Johnson, J. D.,

Kuchinsky, A., Lux, M., Misirli, G., Peccoud, J., Plahar, H. A., Sirin, E., Stan, G.-B., Villalobos,

A., Wipat, A., Gennari, J. H., Myers, C. J., and Sauro, H. M. (2014) The Synthetic Biology Open

Language (SBOL) provides a community standard for communicating designs in synthetic

biology. Nat. Biotechnol. 32, 545–550.

(34) Roehner, N., Oberortner, E., Pocock, M., Beal, J., Clancy, K., Madsen, C., Misirli, G.,

Wipat, A., Sauro, H., and Myers, C. J. (2014) Proposed Data Model for the Next Version of the

Synthetic Biology Open Language. ACS Synth. Biol.

(35) Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., and the rest of

the SBML Forum: Arkin, A. P., Bornstein, B. J., Bray, D., Cornish-Bowden, A., Cuellar, A. A.,

Dronov, S., Gilles, E. D., Ginkel, M., Gor, V., Goryanin, I. I., Hedley, W. J., Hodgman, T. C.,

Hofmeyr, J.-H., Hunter, P. J., Juty, N. S., Kasberger, J. L., Kremling, A., Kummer, U., Le Novère,

N., Loew, L. M., Lucio, D., Mendes, P., Minch, E., Mjolsness, E. D., Nakayama, Y., Nelson,

M. R., Nielsen, P. F., Sakurada, T., Schaff, J. C., Shapiro, B. E., Shimizu, T. S., Spence, H. D.,

Stelling, J., Takahashi, K., Tomita, M., Wagner, J., and Wang, J. (2003) The systems biology

markup language (SBML): a medium for representation and exchange of biochemical network

models. Bioinformatics 19, 524–531.

(36) Hucka, M., Finney, A., Bornstein, B. J., Keating, S. M., Shapiro, B. E., Matthews, J.,

Kovitz, B. L., Schilstra, M. J., Funahashi, A., Doyle, J. C., and Kitano, H. (2004) Evolving a

lingua franca and associated software infrastructure for computational systems biology: the

Systems Biology Markup Language (SBML) project. Syst. Biol. IEE Proc. 1, 41–53.

(37) Finney, A., and Hucka, M. (2003, December 1) Systems biology markup language: Level 2 and

beyond.

(38) Bartley, B., Beal, J., Clancy, K., Misirli, G., Roehner, N., Oberortner, E., Pocock, M.,

Bissell, M., Madsen, C., Nguyen, T., Zhang, Z., Gennari, J. H., Myers, C., Wipat, A., and Sauro,

H. (2015) Synthetic Biology Open Language (SBOL) Version 2.0.0. J. Integr. Bioinforma. 12,

272.

(39) Roehner, N., and Myers, C. J. (2014) A Methodology to Annotate Systems Biology Markup

Language Models with the Synthetic Biology Open Language. ACS Synth. Biol. 3, 57–66.

(40) NEMA PS3 / ISO 12052, Digital Imaging and Communications in Medicine (DICOM)

Standard. National Electrical Manufacturers Association, Rosslyn, VA, USA.

(41) Gunter, T. D., and Terry, N. P. (2005) The Emergence of National Electronic Health Record

Architectures in the United States and Australia: Models, Costs, and Questions. J. Med. Internet

Res. 7.

(42) Hoerbst, A., and Ammenwerth, E. (2010) Electronic Health Records: A Systematic Review

on Quality Requirements. Methods Inf. Med. 49, 320–336.

(43) Poh, C.-L., Kitney, R. I., and Shrestha, R. B. K. (2007) Addressing the Future of Clinical

Information Systems — Web-Based Multilayer Visualization. IEEE Trans. Inf. Technol. Biomed.

11, 127–140.

(44) Choplin, R. H., Boehme, J. M., and Maynard, C. D. (1992) Picture archiving and

communication systems: an overview. RadioGraphics 12, 127–129.

(45) Meyer-Ebrecht, D. (1994) Picture archiving and communication systems (PACS) for

medical application. Int. J. Biomed. Comput. 35, 91–124.

(46) Müller, H., Michoux, N., Bandon, D., and Geissbuhler, A. (2004) A review of content-based

image retrieval systems in medical applications—clinical benefits and future directions. Int. J.

Med. Inf. 73, 1–23.

(47) Leif, R. C., and Leif, S. B. (2001) DICOM-compatible format for analytical cytology data

that can be expressed in XML, pp 238–248.

(48) Danos, V., Feret, J., Fontana, W., Harmer, R., and Krivine, J. (2008) Rule-Based Modelling,

Symmetries, Refinements, in Formal Methods in Systems Biology (Fisher, J., Ed.), pp 103–122.

Springer Berlin Heidelberg.

(49) Clunie, D. A. (2000) DICOM Structured Reporting. PixelMed Publishing, Bangor, Pa.

(50) Java Standard Edition (SE).

http://www.oracle.com/technetwork/java/javase/overview/index.html (accessed Oct 29, 2015).

(51) dcm4che2 DICOM Toolkit.

http://www.dcm4che.org/confluence/display/d2/dcm4che2+DICOM+Toolkit (accessed Oct 29,

2015).

(52) The Java™ API for RESTful Web Services. https://jcp.org/en/jsr/detail?id=311 (accessed

Oct 29, 2015).

For Table of Contents Use Only

We present in this paper the first data acquisition standard for synthetic biology - DICOM-SB.

Built on a modular data model for the experimental process, DICOM-SB optimizes data storage

and has a communication layer supporting exchange of information between modalities and

repositories. To demonstrate these features, we use the example of the biopart characterization

pipeline at Imperial College London.
