An XML Database for Gene Expression
MSc Dissertation Thesis
Lalit Kumar Registration No.: 061056187
[email protected] / [email protected]
M.Sc. Bioinformatics September 2007
Heriot-Watt University
Edinburgh, United Kingdom http://www.hw.ac.uk
Supervisors:
Dr. Albert Burger [email protected]
School of Mathematical and Computer
Sciences, Heriot-Watt University Edinburgh, United Kingdom
http://www.macs.hw.ac.uk
Dr. Yiya Yang
Human Genetics Unit Medical Research Council
Edinburgh, United Kingdom http://www.hgu.mrc.ac.uk
Declaration
I, Lalit Kumar, confirm that this work submitted for assessment is my own and is
expressed in my own words. Any uses made within it of the words of other authors in
any form e.g., ideas, equations, figures, text, tables, programs etc are properly
acknowledged. A list of references employed is included.
………………………………
Lalit Kumar
Acknowledgements
I wish to express my sincere gratitude to my supervisors, Dr. Albert Burger and
Dr. Yiya Yang. Thanks to Dr. Burger for reading my thesis drafts, listening to my
problems and supporting me every step of the way. Thanks to Dr. Yang for her
understanding and for putting up with all my learning-curve troubles. Their guidance and
generous time throughout this project helped me to complete the project on time
and to appreciate the potential of XML technologies.
I cannot thank my family enough for their love and support, which enabled me
to pursue this MSc. While the blessings of Mummy, Papa, Uncle and Auntie gave me
encouragement, the love of Kishore, Sudha, Pooja, Naveen, Kapil and Manish remained a
source of meaning in life. The arrival of a new member of the family, Dhairya, Kishore
and Sudha’s son, brought immense joy.
Thanks to the Scottish Executive, which awarded me the Scottish International
Scholarship for the pursuit of this MSc. The scholarship was managed by the British
Council. Thanks to Alison Kanby, Regional Services Officer, British Council,
Scotland, who did excellent work in managing my scholarship.
The Internet and mobile phones have made the world a much smaller place. My old friends
remained in touch with me when I moved to the UK for my studies. I thank Daniel Reinharz
for his words of wisdom and for his friendship that I treasure. Supatra Kundu,
Sangeeta Yaduvanshi and Reena Torawat have been great friends through all ups and
downs. No words would suffice for Shubhda.
Scotland has given me a number of new friends. I am thankful to my friends and
fellow SISP scholars Aishwarya, Anurag, Naren, Ruchi and Sujai for their cheerful
company on so many occasions. Chris, Daylan, Kanchan, Basil, Rizwan and Atiq also
have been pleasant folks to be with.
Edinburgh is an amazing city.
Abstract
Extensible Markup Language (XML) is fast becoming a standard method of
information exchange among computing devices. The flexible nature of XML has
made it possible to develop subject-specific languages from it (for example, MathML
is an XML language for describing mathematical notation). People with varying computer
skills can manage XML easily because it keeps data in a human-readable
format. For these reasons, among others, XML is being widely used in
bioinformatics applications as well.
In bioinformatics, gene expression databases have become very useful tools. Gene
expression data is fairly complex and is gathered using various experimental
techniques in laboratories. The transmission of this data from laboratories to gene
expression databases is now being standardized, and efforts are under way to develop
standard formats for storing this data. MISFISHIE is one such standard.
The EMAP project of the Medical Research Council’s Human Genetics Unit (MRC
HGU) has developed a gene expression database called EMAGE, which is an object
database. The MRC HGU now wants to investigate the possibility of converting
EMAGE into an XML-based, MISFISHIE-compliant database. The objective of this
dissertation project is to develop this XML database.
Table of Contents
Declaration
Acknowledgements
Abstract
Table of Contents
List of Figures
1. Introduction
  1.1 Dissertation Thesis Outline
2. Background
  2.1 Bioinformatics
    2.1.1 Gene Expression Data
    2.1.2 Gene Expression Databases
  2.2 Introduction to XML
  2.3 XML in Bioinformatics Applications
  2.4 The EMAP Project
  2.5 EMAGE: A Gene Expression Database
3. Project Description
  3.1 Problem Situation
  3.2 Project Aims and Objectives
  3.3 Overview of Solution Development
4. Development of XML Schema
  4.1 Existing Data Structure
  4.2 XML Schema for Original Dataset
  4.3 MISFISHIE Standard
  4.4 Rationale for a New XML Schema
  4.5 Validation of New XML Schema
5. Transformation of Original Dataset
  5.1 Investigated Approaches
    5.1.1 JDOM based Application
    5.1.2 XML Shredding using DB2 9 Database
    5.1.3 XSL Transformation
  5.2 XSL Transformation
    5.2.1 Rationale of Using XSL Transformation
    5.2.2 Mapping Scheme
    5.2.3 Mapping Tools
    5.2.4 Automated Code Generation
    5.2.5 Transformation Process
6. Database Preparation and Querying
  6.1 Why IBM DB2 9
    6.1.1 pureXML™ Technology
  6.2 Insertion of XML into DB2 Database
  6.3 Querying of XML Data
    6.3.1 XQuery Language
    6.3.2 SQL/XML
  6.4 User Interface Development
    6.4.1 Java Server Pages (JSP)
    6.4.2 Hypertext Preprocessor (PHP)
  6.5 User Interface for a Few Queries
7. Conclusion
  7.1 Summary of the Work Done
  7.2 Summary of Evaluation
  7.3 Accomplishments
  7.4 Limitations Encountered
  7.5 Skills Acquired
  7.6 Future Work
    7.6.1 Comprehensive Query Interface Development
    7.6.2 Interface for Inserting New Data
    7.6.3 Performance Evaluation
    7.6.4 Query Optimization
  7.7 Final Thoughts
References
Appendix A
Other Appendices (on CD-ROM)
List of Figures
Figure 1: Protein synthesis process [18]
Figure 2: Spatial queries formulated using EMAGE Java interface [4]
Figure 3: Screenshot of the EditiX interface
Figure 4: Screenshot of DTD/Schema menu of EditiX
Figure 5: Screenshot of Oxygen XML Editor
Figure 6: Screenshot of a portion of Mapping generated using MapForce
Figure 7: Structure of the relational table created to hold the XML documents
Figure 8: Input screen for query about gene expression detection
Figure 9: Output of the gene expression detection query
Figure 10: Input screen for query that counts the fully or partially sequenced assays
Figure 11: Output of the query that counts the fully or partially sequenced assays
Figure 12: Input screen for the query that finds components where a gene is expressed
Figure 13: Output of the query that finds components where a gene is expressed
1. Introduction
Recent years have seen rapid development in natural science disciplines such as
biotechnology, cell biology, molecular biology, genetics and bioinformatics. As a
consequence, an increasingly vast amount of data is being produced by
work in these fields. Managing this data and making it readily available
as and when required has become a matter of concern. Traditionally, the data produced
by experiments in these fields was distributed by means of journals and
other types of print publications. However, several problems are associated with
this traditional approach:
• Lack of standardized formats for data publication made it difficult to compile
data from different sources
• Availability of print publications is not the same everywhere in the world
• Data compilations produced in print media are difficult to search through
• Analysis of printed data and generating information out of it is difficult
The absence of standard formats has lately been mitigated to a
considerable extent by the advent and steady development of such formats.
Information Technology (IT) tools have been very useful in solving the other
problems listed above. A variety of databases containing bioinformatics data have
become available in the past couple of decades. The online availability of most of these
databases ensures global reach and easy access to the data. With the growing
importance and availability of gene expression data, many gene expression databases
have also come into existence (see Section 2.1.2).
Gene expression refers to the presence (or absence) of the effect of a particular gene.
The genome¹ contains genes which determine the amino acid² sequences of the resultant
proteins. In addition, the genome contains a comprehensive mechanism for
controlling the synthesis of functional proteins from genes.
Figure 1: Protein synthesis process [18]
Often the amino acid sequence produced from the genetic blueprint undergoes
extensive modifications before it becomes a functional protein. Moreover, the
functionality of the mature protein is itself regulated by a number of other factors
which can suppress or enhance it. Therefore, merely knowing the gene
and protein sequences is not enough. It is very important to “functionalise” the
genome by finding out the structure of genes and the regulatory mechanisms which
give rise to the functional proteins. In short, it is important to know how, when, why
and where a gene expresses itself. The systematic collection of gene expression data
helps in deducing the eventual effects or functions of the gene in the body of the
organism. [15]
¹ Genome refers to all the genetic (hereditary) material in an organism. It is mostly composed of DNA, and sometimes of RNA.
² Amino acids are the building blocks of proteins. A protein is composed of chain(s) of amino acid molecules.
Gene expression databases store not only information about genes and their
expression sites in the organism’s body but also other relevant
information, for example the details of the experiments that were conducted to find out
about gene expression. Often, these databases are put online and access is given
through a web interface. Using this interface the users can query the available data.
The Medical Research Council’s Human Genetics Unit (MRC HGU) has developed a
mouse gene expression database called EMAGE. It has been developed as a part of a
larger project called the Edinburgh Mouse Atlas Project (EMAP). The underlying
technology of the present EMAGE database is that of an object database. The format
of data stored in this database does not comply with relevant standards such as
MISFISHIE. XML is fast becoming a standard for data exchange
among computing devices. MRC HGU therefore intends to investigate the feasibility
of organizing the present data in compliance with the MISFISHIE standard and
migrating the EMAGE object database to an XML-based database.
This dissertation project takes the first step towards the intended migration. While
developing the XML version of the EMAGE database, the project aims to research
and document the nitty-gritty of the process. The performance evaluation of the
resultant XML database is not a part of this project and would be carried out
separately by MRC HGU.
1.1 Dissertation Thesis Outline
This thesis document describes the author’s work on the abovementioned migration.
Chapter 2 introduces the relevant background information. It explains what
gene expression is and how data about gene expression is collected. It then
introduces a few well-established gene expression databases before moving on to a
brief introduction to the concept of XML. XML is now used in a
number of bioinformatics applications, and the next section in Chapter 2 discusses the
advantages and disadvantages of using XML in such applications. The
EMAP project and the EMAGE database are then described in the subsequent
sections.
Chapter 3 outlines the objectives that the work involved in this dissertation was
supposed to fulfill. It also presents an overview of the solution strategy that was
adopted for this purpose.
The subsequent chapters describe the process that was followed to develop the XML
database. Chapter 4 explains the development and validation of the XML schema
which would act as a template for the XML documents to be stored in the database.
Chapter 5 explains the process of transformation of existing EMAGE data into XML
documents prepared as per the new XML schema. Chapter 6 details the development
of a relational database which would hold the XML documents. It also discusses the
retrieval of the desired data from the database by means of the XQuery language
and details the development of the user interface through which users
will be able to query the new XML database.
Chapter 7 presents the summary of and conclusions drawn from the project. This
project lays the foundation of the overall migration process. There are several other
tasks which should be done but are beyond the scope of this project. Some of
these tasks are discussed under the Future Work section of that chapter.
2. Background
2.1 Bioinformatics
“Bioinformatics is an interdisciplinary research area which uses computers for storage, retrieval,
manipulation and distribution of information related to biological macromolecules such as DNA,
RNA and proteins.” [11]
Bioinformatics is used to perform functions such as analysis of biological
sequence information, recovery of evolutionary patterns, prediction of gene function
and biological data mining using computer applications. Bioinformatics deals with
and generates enormous amounts of complex data. This gives rise to the need to
develop more efficient and sophisticated computer tools to manage and
analyze this data.
2.1.1 Gene Expression Data
Just as computers work according to software instructions, the development
and functioning of living organisms are controlled by the instructions encoded in
their genetic material (which is mostly DNA and sometimes RNA). Genes are the
portions of the genetic material that contain the encoded instructions. These
instructions initiate and control the process of protein formation in the cells of the
organism. The proteins then carry out various functions in the cells. The process by which
the instructions in a gene are translated into functional protein(s) is called gene
expression.
A paper by D’haeseleer, Liang and Somogyi (1999) titled “Gene Expression Data
Analysis and Modelling” [3] provides elementary information about what gene
expression is and how it is measured. The paper is written in the form of a tutorial.
Although the focus of the paper is on data analysis and modelling, which is out of the
scope of this dissertation project, the general information about gene expression was
found to be useful for understanding of gene expression data.
The paper begins by emphasizing that in this “Age of Genomics” we are no longer dealing
only with data on isolated genes and their products (i.e. proteins). The
advancements in the field of genomics have given rise to enormous amounts of gene
expression data. This data not only includes information about isolated gene
expressions and proteins but also contains data about complex interactions among
genes and proteins. Analysis of this data is of high importance because the more
information we have on gene expression, the better we will be able to understand
how organisms function.
In its section 2, the paper introduces the methods that are currently used for
obtaining gene expression data. In most organisms, genes are transcribed into messenger
RNA (mRNA), which is then translated into proteins (see Figure 1).
Therefore, by measuring the concentration of a particular type of mRNA in the cell, it
is possible to assess the expression level of the gene from which that mRNA was transcribed.
A higher concentration of mRNA generally implies a higher level of expression of the
associated gene. A higher concentration of mRNA does not, however, always
confirm a higher level of gene expression, because the expression of some genes is
known to be regulated after transcription [10]. In this approach to measuring gene
expression, DNA microarrays are often used for gene expression profiling. DNA
microarrays are glass slides onto which cDNA³ is deposited by high-speed
robotic printing.
³ cDNA, or complementary DNA, is made by the process of reverse transcription of mRNA.
The D’haeseleer, Liang and Somogyi (1999) paper is not specifically written to
provide a general description of gene expression and related measurement
techniques. However, the basic information that it provides about gene expression is
good and was found to be useful.
Christiansen et al (2006) in their paper “EMAGE: a spatial database of gene
expression patterns during mouse embryo development” [4] say that the major
challenge in modern biology is to functionalize genomes that have been sequenced,
and to understand the interactions among genes and their products. It has become
very important to know the sites in the body where genes are expressed, and in situ⁴
techniques are available for determining these sites.
These techniques include immunohistochemistry and in situ hybridization. Using
these techniques the concentration of the gene products in the cell can be visualized.
• Immunohistochemistry and in situ hybridization
Immunohistochemistry is a method of localizing proteins in cells.
Antibodies in an organism bind to specific antigens. Antigens are
molecules that trigger the production of antibodies and are recognized by the
immune system. They are usually proteins or polysaccharides. If an antibody is
available that works in situ, then it is possible to know the distribution of the
associated antigen protein throughout the body of the organism. In
immunohistochemistry, an antibody with a known target protein antigen is used
to find the presence, distribution or absence of the protein in tissues. Once applied,
the antibody will bind to the target protein if it is present in the tissue cells. These
antibodies are used in combination with coloring or fluorescent agents so
that the location and concentration of the antibody can be determined. [16]
In situ hybridization is a technique that uses a labeled cDNA or RNA sequence
to find the location of a specific DNA or RNA sequence in tissues. The labeled
sequence binds to the complementary, naturally present DNA/RNA
sequence, thus revealing its presence and location.
⁴ In situ, in the context of biology, indicates observation of a biological activity in the place where it naturally happens. For example, observation of cell activities in a living organism would be in situ observation.
The Christiansen et al (2006) paper states that gene expression data
has traditionally been archived by publication in journals, but this approach does not
allow easy access to the published data. The lack of availability in electronic format, lack of
proper citations and lack of standardization of the writing format are the main
factors which hamper the fast retrieval and distribution of gene expression data.
To overcome these problems, gene expression databases are now being
developed to make use of the benefits of information technology.
2.1.2 Gene Expression Databases
Collecting and managing gene expression data has become an important task in
bioinformatics. To cater to the need for managing large amounts of data from
different types of expression assays, several gene expression databases have been
developed. Some of these are mentioned below.
GXD is the “Gene Expression Database” developed and managed by The Jackson
Laboratory, United States. This database contains gene expression data on
mouse development. It gathers data from published literature, and researchers also
submit data directly to the database via electronic submissions. A number of web
forms are provided for querying the data available in this database. The records in the
database have links to other data resources, which makes it easier to find relevant
information about the data in GXD. The home page of GXD is available at:
http://www.informatics.jax.org/mgihome/GXD/aboutGXD.shtml
GENSAT stands for “Gene Expression Nervous System Atlas” and it contains
gene expression data for the mouse central nervous system. This atlas is managed by
the National Center for Biotechnology Information (NCBI). The aim of GENSAT is
to map the expression of all genes that are expressed in the mouse brain at various
stages of its development cycle. The data is freely available to anyone through the
homepage of GENSAT, which is available at:
http://www.ncbi.nlm.nih.gov/projects/gensat
“Gene Expression Atlas” is managed by the Genomics Institute of the Novartis Research
Foundation. It contains gene expression information for mouse and
human. This atlas can be accessed at:
http://expression.gnf.org/cgi-bin/index.cgi
EMAGE is the gene expression database developed and managed by the Human
Genetics Unit of MRC. This dissertation project aims to develop an XML-based
version of the EMAGE database. This database is discussed in more detail in
Section 2.5. The home page of the EMAGE database is available at:
http://genex.hgu.mrc.ac.uk/Emage/database/emageIntro.html
2.2 Introduction to XML
The development of the World Wide Web (WWW) gained immense momentum when
Hypertext Markup Language (HTML) became the de facto language for web
development. HTML provides a set of markups (or tags) which are used for
page layout and content formatting on web pages. For example, to
write text in a bold typeface on a web page, we write:
<b>Cardiovascular</b>
Here, the <b> and </b> tags tell the web browser to display the contained text
in a bold typeface. Thus, although HTML can be used to control the display of data,
it does not provide any information as to what the data is about. In the above
example, HTML says nothing about what the word “Cardiovascular” means. This
is where XML comes into the picture.
To get basic information about XML, the book “XML in a Nutshell, 3rd
Edition” by Harold and Means was consulted [6]. The book introduces XML as a
general-purpose language for marking up the data in a document with simple,
human-readable tags. XML does not have a fixed set of tags; instead it allows developers
to define and use their own tags (which makes XML an extensible and customizable
language). A developer can define relevant XML tags to give meaning to the word
“Cardiovascular”. For example:
<organsystem>Cardiovascular</organsystem>
The XML tags <organsystem> and </organsystem> indicate that the word
“Cardiovascular” denotes an organ system. By adding more XML tags, the meaning
of the text can be made even clearer. For example,
<mouse>
  <organsystem>Cardiovascular</organsystem>
</mouse>
Now the information is more focussed, as it relates to the cardiovascular organ
system of the mouse. It is important to note that, unlike HTML tags, the tags <mouse>
and <organsystem> are custom-made and do not have any effect on the
appearance of the data. In XML, the same information can be given a different
meaning by changing the tags around it. For example:
<venom>
  <targetorgans>Cardiovascular</targetorgans>
</venom>
Here “Cardiovascular” is no longer interpreted in the context of the mouse organ system.
The change in tags has given a new meaning to the information: it now describes
the target organs of a venom.
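The point that the surrounding tags carry the meaning can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree module, which is not part of the toolset used in this project, and the function name context_of is a hypothetical helper chosen for illustration. It reads the two fragments shown above and reports the enclosing tag, the inner tag and the text:

```python
import xml.etree.ElementTree as ET

# The two fragments from the text above, written on one line each.
MOUSE = "<mouse><organsystem>Cardiovascular</organsystem></mouse>"
VENOM = "<venom><targetorgans>Cardiovascular</targetorgans></venom>"


def context_of(fragment):
    """Return (outer tag, inner tag, text) for a fragment with one child element."""
    root = ET.fromstring(fragment)   # parse the string into an element tree
    child = root[0]                  # the single nested element
    return root.tag, child.tag, child.text


if __name__ == "__main__":
    for fragment in (MOUSE, VENOM):
        print(context_of(fragment))
```

The text “Cardiovascular” is identical in both cases; only the tag names returned alongside it differ, and it is these names that a program (or a reader) uses to interpret the value.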
In addition to this basic understanding of what exactly XML is, the book by Harold
and Means was also used for learning about features and syntax of XML (XML
syntax is a set of rules for writing XML in correct form). The book presents the XML
concepts and fundamentals in a clear, concise and easy to understand way. A large
number of examples are given, which makes XML easy to understand
even for those who have no previous knowledge of this language.
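As a small illustration of what “correct form” means in practice, the following sketch checks whether a string obeys the basic XML syntax rules (is well-formed). It assumes Python's standard xml.etree.ElementTree parser rather than any tool used in this project, and is_well_formed is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for syntax violations such as unclosed or mismatched tags.
        return False


if __name__ == "__main__":
    good = "<mouse><organsystem>Cardiovascular</organsystem></mouse>"
    unclosed = "<mouse><organsystem>Cardiovascular</mouse>"
    wrong_case = "<organsystem>Cardiovascular</OrganSystem>"
    for sample in (good, unclosed, wrong_case):
        print(is_well_formed(sample), sample)
```

The second sample fails because the inner element is never closed, and the third because XML tag names are case-sensitive, so the closing tag does not match the opening one.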
2.3 XML in Bioinformatics Applications
In recent years, XML has been increasingly used for managing biological data. The
data generated in the discipline of biology is flexible in nature, and this is one of the
reasons why XML is a good candidate for bioinformatics applications such as
gene expression databases.
Achard, Vaysseix and Barillot (2000), in their paper “XML, Bioinformatics and
Data Integration” [5], discuss the suitability of XML for
bioinformatics applications. The authors first briefly introduce XML
concepts and then outline some of the areas where XML is being used. The
paper states that a number of commercial and academic actors are now using XML
for managing their data and “within a few years (XML) will be as widespread as HTML is
today”. XML is also used as a framework for developing specialized languages for
use in various fields. For instance, Wireless Markup Language (WML) is
based on XML and is used in wireless application development. In the field of
biology, one of the uses of XML is to annotate gene or protein sequence data.
Two examples of specialized XML-based languages for biology
are listed in the paper:
1. Bioinformatics Sequence Markup Language (BSML): It is an extensible
language specification for bioinformatics data like DNA, RNA, protein
sequences and their graphical properties
2. BIOpolymer Markup Language (BioML): The paper quotes the developer of
the BioML as saying “BioML’s goal is to allow the expression of complex annotation
for protein and nucleotide sequence information. BioML was designed to mimic the
hierarchical structure of a living organism.”
The paper continues by discussing the data management in Bioinformatics. The
authors note that bioinformatics deals with very large quantities of data and managing
this data is one of the key concerns, which has started to become a bottleneck in the
development of the discipline. The amount of data in bioinformatics, however, is not
the real problem because disciplines like particle physics generate even more data
than bioinformatics and are still able to manage it efficiently. The main issues in
managing bioinformatics data arise from certain characteristics of this data. For
instance:
1. It is complex to model bioinformatics data. There are numerous data types
with complex relationships among them
2. New data types keep emerging regularly and proper modelling and integration
of these new data types often requires changes in the whole semantics of the
system. This happens because bioinformatics is under rapid development and
the new data often redefines the previously known concepts
3. Analysis of known data generates even more data and this new data has to be
integrated back into the original data
4. Experimental raw data needs to be archived because the researchers and
scientists often want to consult it in order to confirm the results given by the
computer analysis
5. The granularity of bioinformatics data is finer than in many other fields.
Objects and entities are smaller in size and therefore a unit amount of data
often contains a larger number of objects
6. Data is accessed, queried, exchanged and updated frequently
7. Data is used by a variety of users (with varying computer skills) which include
biologists, programmers, database administrators and data analysts etc.
Based on these observations, the paper suggests that the data management
technology in bioinformatics must be scalable, flexible and expressive. It also points
out some technical issues related to bioinformatics data management. These issues
include the sustained and rapid growth in the amount of data, the storage of data in a
number of different types of databases, data redundancy, and the frequent use of
different flat-file formats, which makes indexing difficult.
The paper then identifies some pros and cons of XML in the context of bioinformatics.
Some of the strengths of XML are listed below:
1. XML is very flexible. It is human readable and therefore can be easily edited
by people with little computer skills.
2. It is capable of linking data and is Internet oriented. This capability enables
XML to provide cross references among various data sources on the Internet.
3. XML allows customized specifications to be defined. The ever-changing
bioinformatics data lacks standardization, and XML can therefore be used to
construct a customized specification for the data.
Alongside these advantages, XML has its own share of weaknesses as well. Some of
these weaknesses are listed below:
1. XML is a text based format and has overhead of data parsing5.
2. XML itself does not provide facilities like indexing and clustering of data to
improve the performance.
3. The expressiveness of XML is not sufficient for molecular biology. Unlike
object-oriented technologies, XML does not provide a mechanism for
inheritance and has no concept of relationships among data as such.
There are no elaborate data types or constraints available.
This paper provided a good understanding of the challenges in managing bioinformatics
data and the role of XML in this context. The paper, however, was written in the year
2000. It is now 2007, and seven years is a long time in both computer science and
bioinformatics. Though the challenges posed by bioinformatics in the area of data
management still exist, XML technologies have become much more powerful
than they were in the year 2000. As the authors of the paper had hoped, XML
schemas have been able to solve many of the weaknesses listed above. In addition,
commercial database management systems now provide built-in XML
capabilities and, as a result, the performance of data stored in XML format has
improved.
5 In computer science, data parsing means the analysis of syntax of input against a pre-defined set of rules.
The paper continues further by comparing some approaches that are used for
bioinformatics data management. These approaches include flat-files, Abstract Syntax
Notation One (ASN.1), Common Object Request Broker Architecture (CORBA),
Java Remote Method Invocation (RMI) and Object Database Management Systems
(ODBMS). Of these, the ODBMS was of particular interest because the current
EMAGE gene expression database is based on this approach. The paper stated that
an ODBMS provides a rich data model which is well suited to the requirements of
the bioinformatics field. In addition, ODBMS:
• Provide indexing, object clustering and query optimization
• Have a standardized data definition and query language
• Guarantee security, concurrent access, integrity, consistency and reliability
• Provide Application Programming Interfaces (APIs)
While stating that XML does not provide many of these ODBMS features, the paper
suggests that a combination of ODBMS and XML technologies might be
best for bioinformatics needs. In this combination, the ODBMS is used for data
management and XML is used for display and exchange. The data is queried from the
ODBMS and ODBMS returns the results in XML format to enable users to exchange
the results easily.
The paper notes in its conclusion that XML was not completely mature at the time
of writing. It concludes that, because of its simplicity,
flexibility and interconnection capabilities, XML is a very promising candidate for a
standard language of data exchange.
By studying this paper, a fairly deep insight was gained into the challenges present in
bioinformatics data management. It explains these challenges in detail, and also
the promise of XML to overcome them. The paper has also been useful
because it compares ODBMS and XML technologies and points out the strengths
and weaknesses of both.
2.4 The EMAP Project
As this dissertation is related to the Edinburgh Mouse Atlas Project (EMAP) of the
MRC HGU, the author looked for published literature about the EMAP project that
could help in gaining a basic understanding of the overall project. The EMAP
project has set up a website (http://genex.hgu.mrc.ac.uk) to distribute information
about its activities. In addition to general information and publications, the
website provides the outcomes of the EMAP project that have been obtained so far.
Detailed information about the EMAP project was found in a paper titled “EMAP
and EMAGE: A Framework for Understanding Spatially Organized Data” by
Baldock et al (2003) [2]. The relevant parts of this paper are briefly reviewed
below.
Baldock et al (2003), in their paper, state that “the EMAP project has implemented a spatio-
temporal framework for capturing spatially organized and mapped data”. This framework
consists of three main components:
1. 3D grey-level (voxel) models of the mouse embryos
2. An anatomical ontology
3. A mapping between the spatial context of the digital model and the textual
context of the anatomy
The paper continues by outlining the details of these components. It describes
various tools (like Atlas browsing tools) and resources that have been developed
under the EMAP project. The information given in this paper about the Edinburgh
Mouse Atlas Gene Expression (EMAGE) database was of particular interest because
the primary aim of this dissertation is to convert the current EMAGE object database
into an XML based database. The EMAGE part of this paper is reviewed in
the next section.
The Baldock et al paper is very useful for understanding the context and background of this
dissertation project. It gives details of the aims and objectives of the EMAP project
and also describes the progress that had been made by the time the paper was
written.
2.5 EMAGE: A Gene Expression Database
The EMAGE database is a major component of the EMAP project. It holds the gene
expression data and is to be converted into an XML database under this dissertation
project. The current version of the EMAGE database is available at:
http://genex.hgu.mrc.ac.uk/Emage/database/emageIntro.html
Christiansen et al (2006) describe the EMAGE database in their paper [4]. The paper
states that this database contains data on gene expression in the mouse
embryo, gathered using in situ techniques. This data is described by a
combination of associated text and space. The text description presents a list of
anatomical parts where a particular gene is expressed. The space description is
used for showing the sites of expression in the images and 3D models of the mouse
embryo.
The paper explains the structure of EMAGE database and its content. The EMAGE
database is an integral part of the EMAP framework of Digital Atlas of Mouse
Development. The framework contains the 3D models of the post-implantation
Theiler Stages6 of mouse embryo. The domains (areas or spaces) within these 3D
models are mapped directly to the anatomy ontology. As a result, the corresponding
anatomical information can be fetched from the database upon selection of a domain
in a 3D model.
6 Theiler Stages describe and encompass gestation period of 18 days in mouse embryo. In total, there are 26 Theiler Stages.
The EMAGE database operates in a client-server environment. The server software
has been developed in C++ while the client software has been developed in Java. The
server software accesses the database and the client software communicates with the
server using Common Object Request Broker Architecture (CORBA).
The authors of the paper report that, in July 2005, there were 1905 records in the
EMAGE database. These records pertain to 704 genes and 22 Theiler Stages. Out of
these records, 10% were directly submitted by the individual laboratories; 46% were
submitted by screening consortia and 44% are those that have been previously
published in journals etc. The incorporation of previously published data is done in
cooperation with the GXD7, a gene expression database developed by the Jackson
Laboratory, USA. For this purpose, both these databases (EMAGE and GXD) have a
joint global copyright agreement with the Company of Biologists Ltd and Elsevier
B.V.
The paper also details the methods of interaction with the EMAGE database. The
EMAGE interface has been developed in Java which enables it to run on any
platform (like Windows, MacOSX, Solaris, UNIX and Linux) which has Java 1.4.2 or
higher installed on it. The EMAGE Java interface can be downloaded from the
EMAGE home page http://genex.hgu.mrc.ac.uk/Emage/database. The interface is
downloaded to user’s computer through JavaWebStart which ensures that on each
start the user gets the latest version of the interface software. This interface can be
used for querying and browsing the EMAGE database. It can also create a local
database on a researcher’s own computer, where partial data can be stored while
the dataset is being completed. Once the dataset is complete, the researchers can submit this
data to the central EMAGE database. The individual laboratories and screening
consortia submit data directly to EMAGE. There is a dedicated Editorial Staff for
EMAGE database which curates the data that comes from various sources.
7 http://www.informatics.jax.org/mgihome/GXD/aboutGXD.shtml
Figure 2: Spatial queries formulated using EMAGE Java interface [4]
The Christiansen et al (2006) paper is a particularly useful source of information
on the subject. It describes the complete setup of the EMAGE database in
a concise form. The paper provides a good understanding of how the EMAGE
database setup was working when the paper was written. This helped in
brainstorming about how to convert the EMAGE database into an
XML based database.
Before continuing, a quick review of part of the Baldock et al (2003) paper is
appropriate in this context. The paper discusses the EMAGE database and provides
the same information as the Christiansen et al (2006) paper but in more detail. A few
aspects of the overall working of the EMAGE system (e.g. how exactly the textual
and spatial data is mapped to the 3D models) become clearer by studying the
Baldock et al (2003) paper. This paper has provided an insight into the finer details of
how EMAGE functions. Although these details are not directly related to this
dissertation project, they still helped in visualizing the present system as a whole.
3. Project Description
3.1 Problem Situation
The MRC HGU intends to investigate the feasibility of, and approaches to,
converting the existing EMAGE object database into an XML based database. The
changeover is needed to address some of the concerns and problems which are
being encountered in the present setup of the database and the software client. Some
of these problems are listed below:
• The researchers who submit the data directly to EMAGE have to install the
software client.
• Though the client is based on Java technology and is therefore platform
independent (as long as an appropriate version of Java is installed), the
researchers sometimes face problems due to the lack of the required version of the JVM
or of a Java capable browser.
• Often the results of the experiments come slowly and the researchers want to
store the partial results in their local computers as the results keep trickling in.
They send the data to the central EMAGE database when the dataset is
complete. With the present setup, it is difficult for the researchers to
conveniently keep the partial records of the data on their local machines.
By developing an XML database for gene expression, complete with a web based
interface, the MRC HGU is aiming to solve the above problems. In addition, XML
is rapidly becoming a global standard for structured data exchange among
computing devices. Therefore, it is only rational to keep the data in the latest standard
format. This would make it easier to exchange the data with other organizations,
databases and other computer systems.
3.2 Project Aims and Objectives
To give structure to the project and to ensure that the project is completed within the
given timeframe, the scope of the project was clearly defined. The following aims and
objectives were determined for this dissertation project:
• To analyse the database structure of the existing EMAGE object database
• To develop an equivalent XML schema for the current EMAGE database
structure
• To design a new XML schema for the proposed XML database. This schema
needs to be MISFISHIE standard compliant and will be developed with the
help of MRC HGU staff
• To transform the existing EMAGE data into new XML documents as per the
new XML schema
• To develop an XML based database which would hold the XML documents
• To develop a web based user interface for querying the data present in the
XML database
3.3 Overview of Solution Development
After initial research and brainstorming, it was found that the following solution
could lead to the development of an XML based implementation of EMAGE, as
required:
1. Prepare a typical dataset of about 3000 XML documents drawn from the
existing data in EMAGE. This dataset would be used for the purpose of this
project
2. Develop an equivalent valid XML schema for the existing dataset. This
schema would be used for transformation purpose
3. Develop a MISFISHIE compliant valid XML schema
4. Develop a method to transform the original dataset as per the new
MISFISHIE compliant XML schema
5. Use IBM DB2 to create a database and put the transformed XML documents
into it
6. Use XQuery and SQL/XML to retrieve the required data from the DB2
database
7. Use a server-side scripting language (like JSP, PHP, ASP etc.) to present the
retrieved data to the user
The subsequent chapters give the details of how these steps were
implemented.
4. Development of XML Schema
4.1 Existing Data Structure
The existing EMAGE database is an object oriented database which uses the
ObjectStore® Object Database Management System (ODBMS). The ODBMS
databases are designed to efficiently handle the data from applications which are
created using object-oriented programming languages (e.g. C++ and Java). The data
created by such applications is represented in the form of objects and these objects
can directly be stored in the ODBMS.
To proceed further in the project, it was necessary to know the structure of the
current EMAGE object database and to have the existing data. For this purpose, the
structure of the current EMAGE database was exported in the form of CORBA
Interface Definition Language (IDL) by the MRC HGU staff. In addition, approximately
3600 records were exported from the EMAGE database in the form of XML
documents. This dataset of about 3600 XML documents is referred to as the
original dataset.
The IDL representation of a SpecimenDetails object is given below (See the
attached CD-ROM for complete EMAGE data structure in IDL form).
struct SpecimenDetails
{
    string species;
    string strain;
    string sex;
    boolean wildType;
    MutantDetailsSeq mutants;
    string stageFormat;
    string stage;
    string assayType;
    string fixationMethod;
    string embedding;
    string clearingMethod;
    string notes;
};
The elements corresponding to variables in an IDL struct were to be created in the
new MISFISHIE compliant XML schema. For example, the species string variable
became the element commonName under the parent element organismType in the
new XML schema.
<xsd:complexType name="organismType">
    <xsd:sequence>
        <xsd:choice>
            <xsd:element name="commonName" type="nonEmptyToken"/>
            <xsd:element name="taxon" type="taxonType"/>
        </xsd:choice>
        <xsd:element name="stage" type="stageSystemType"
                     minOccurs="0" maxOccurs="unbounded"/>
        <xsd:element name="tissue" type="xsd:token" minOccurs="0"/>
        <xsd:element name="strain" type="nonEmptyToken"
                     minOccurs="0"/>
    </xsd:sequence>
</xsd:complexType>
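To make the effect of this type concrete, the fragment below sketches an instance that would satisfy organismType. The wrapping element name organism and the values shown are illustrative assumptions, not taken from the EMAGE data; the optional stage and tissue children are left out because the definition of stageSystemType is not reproduced here.

```xml
<!-- Hypothetical element name: the excerpt above defines only the type -->
<organism>
  <!-- xsd:choice: exactly one of commonName or taxon may appear -->
  <commonName>mouse</commonName>
  <!-- optional; must come after the choice because of xsd:sequence -->
  <strain>C57BL/6</strain>
</organism>
```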
This correspondence between the variables of IDL and the elements of the new
schema was established through a mapping scheme. (See section 5.2.2)
While creating the new XML schema, conventions given below were followed so as
to make the schema easily understandable by present and future developers:
• All the element names in the schema were written using camel case8 with the first
letter in lower case (often referred to as lowerCamelCase)
• All the element type names ended in the word “Type”. For example,
sequenceType
• nonEmptyToken and nonEmptyString types were used to indicate that
the element must have a value
• Extended and restricted simple types were kept anonymous unless they were
utility types
• Complex types were always named and never left anonymous
• All element names were spelled in singular form
• minOccurs and maxOccurs were used to show the cardinality (the number
of occurrences that an element should have)
• Instead of ID the name accession was used for the unique identification
fields. Accession is used in most of the bioinformatics applications to give
data a unique identity.
• Union of the elements was used to indicate the preferred or common values
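The conventions above can be illustrated with a small schema sketch. It is not an excerpt from the actual MISFISHIE compliant schema: the definition given for nonEmptyToken, the type name specimenType and the child elements sex and note are assumptions chosen only to show each convention in use.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Utility type (assumed definition): a token with at least one character -->
  <xsd:simpleType name="nonEmptyToken">
    <xsd:restriction base="xsd:token">
      <xsd:minLength value="1"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Complex types are named, and the name ends in "Type" -->
  <xsd:complexType name="specimenType">
    <xsd:sequence>
      <!-- "accession" rather than "ID" for the unique identifier -->
      <xsd:element name="accession" type="nonEmptyToken"/>

      <!-- Anonymous union: preferred values are enumerated, but any
           other non-empty token is still accepted -->
      <xsd:element name="sex" minOccurs="0">
        <xsd:simpleType>
          <xsd:union memberTypes="nonEmptyToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="male"/>
                <xsd:enumeration value="female"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>

      <!-- Singular, lowerCamelCase name; cardinality via minOccurs/maxOccurs -->
      <xsd:element name="note" type="xsd:string"
                   minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```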
The resultant new XML schema was a fairly complex one and consisted of more than
850 lines of code.
8 In camel case, the words are written joined together with each word’s first letter as capital. E.g., AntibodyAssayType
4.2 XML Schema for Original Dataset
A partial dataset of existing EMAGE records, in the form of XML documents, was
to be used for the purpose of this project. The dataset consisted of approximately
3600 XML documents. These documents, however, were not created on the basis of an
XML schema. They were generated by automatically exporting the EMAGE data in
the form of XML documents. This conversion was done according to the CORBA
Interface Definition Language description, which defined the structure of the EMAGE object
database.
However, it was necessary for the project to have an XML schema for the original
dataset. Without a schema, the dataset could not be validated and mapped with the
new XML schema. To get around this problem, the author initially began to manually
write the schema for the dataset. This schema was to be a simple schema because it
was being written for data that already existed. No rules or constraints
needed to be enforced in this schema. At the same time, the author looked for a tool
which could create a schema for the XML documents of the original dataset.
It was found that EditiX 5.1 is capable of generating a W3C9 standard compliant
XML schema from an existing XML document. The schema for the original dataset
was therefore generated using this tool. A sample document
from the dataset was provided to the tool as input and it generated a basic schema
for the document.
9 W3C stands for the World Wide Web Consortium. This consortium defines the standards for the World Wide Web. For more information, see http://www.w3.org
Figure 3: Screenshot of the EditiX interface
Figure 4: Screenshot of DTD/Schema menu of EditiX
The automatically generated schema did not define any constraints for the data, nor
was this necessary. However, the generated schema was not entirely correct. A few
modifications were made by the author and by Dr. Yiya Yang of MRC HGU in
order to make the schema capture the biological information more accurately.
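For orientation, a constraint-free schema of the kind a generator might infer from a sample document could look like the sketch below. This is not the actual EditiX output, which is not reproduced here; the element names are assumed to mirror the first few IDL variables of SpecimenDetails, and every value is simply typed as a string.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical root element, named after the IDL struct -->
  <xsd:element name="SpecimenDetails">
    <xsd:complexType>
      <xsd:sequence>
        <!-- No restrictions: each child only has to be present, in this order -->
        <xsd:element name="species" type="xsd:string"/>
        <xsd:element name="strain" type="xsd:string"/>
        <xsd:element name="sex" type="xsd:string"/>
        <xsd:element name="wildType" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A schema like this is enough to validate the structure of the exported documents, while leaving the values themselves unchecked.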
4.3 MISFISHIE Standard
The XML schema developed at MRC for the purpose of this project is MISFISHIE
compliant. MISFISHIE stands for “Minimum Information Standard For In Situ
Hybridization and Immunohistochemistry Experiments”. A Minimum Information
Standard (MIS) is an information reporting guideline that specifies the minimum
information required to achieve a particular aim. In the case of MISFISHIE, this aim is
to enable the reproduction of the results of experiments involving in situ
hybridization and immunohistochemistry. The MISFISHIE Standard Working Group
defines the MISFISHIE standard as below:
“MISFISHIE specification details the minimum information that should be provided when
publishing, making public, or exchanging results from visual interpretation-based tissue gene
expression localization experiments such as in situ hybridization, immunohistochemistry, and
reporter construct genetic experiments (GFP/green fluorescent protein, β-galactosidase), etc.” [12]
The structure of information in the current EMAGE database has not been built
according to an international standard. This makes it difficult to exchange
information among various organizations due to lack of compatibility in the
information structure. Therefore, it was decided to base the new XML schema on the
MISFISHIE standard.
The MISFISHIE specification, in its current form, describes the following aspects of
the in situ hybridization and immunohistochemistry experiments:
• Experimental Design
• Specimens
• Probe or Antibody Information
• Staining Protocols and Parameters
• Imaging Data and Parameters; and
• Image Characterization
4.4 Rationale for a New XML Schema
Creation of an XML schema is one of the very first steps in the process of designing
an XML database. The schema defines the structure of the data that is going to reside
in the XML documents. The XML documents should conform to the underlying
XML schema to enable the database and any other application using it to function
smoothly.
The constraints in the existing EMAGE object database were not strictly defined.
Since the development of the equivalent XML database was to begin from
scratch, it was prudent to invest some time in coming up with a good functional
schema. The schema design staff at MRC HGU developed the schema in
consultation with the EMAGE editors and other biologists, so that a tightly defined
schema could be developed. Such a schema would enable the database to capture all
the biological information in a meaningful and consistent fashion.
An XML schema acts like a template with which the associated documents have to
comply. It is the successor of the Document Type Definition (DTD) and is more elaborate
and powerful than a DTD. An XML schema allows the rules and constraints
which the data must follow to be defined, and so ensures consistency in the data. The following are
some of the constraint types that can be applied using the XML Schema Definition
(XSD): [9]
• XSD can define which elements are allowed to appear in an XML document
• The correctness of the data in the XML document can be validated
• Restrictions on data values can be defined. For example, it can be defined that
an element must either have “DNA” or “RNA” as its data. No third value
would be allowed
• Patterns of data can be defined. For example, data in an element must start
with character “t” and end with character “e”
• Data can be converted between data types
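Two of the constraint types listed above, a restriction to a fixed set of values and a pattern, can be sketched in XSD as follows. The element names molecule and code are hypothetical and are used only to carry the two restrictions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Value restriction: only "DNA" or "RNA" is valid content -->
  <xsd:element name="molecule">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <xsd:enumeration value="DNA"/>
        <xsd:enumeration value="RNA"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <!-- Pattern restriction: content must start with "t" and end with "e".
       XSD patterns always match the whole value, so no anchors are needed. -->
  <xsd:element name="code">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="t.*e"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

</xsd:schema>
```

Under this sketch, `<molecule>DNA</molecule>` and `<code>table</code>` would validate, whereas `<molecule>protein</molecule>` would be rejected.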
An XML schema is extensible, which means that a schema can be used in another schema.
Extensibility also means that developers can define their own data
types using the standard data types available as part of the XML Schema Definition.
Another good feature of XML schema is that it is written in XML itself. XML
schema also supports namespaces and a variety of standard data types.
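As a minimal sketch of reusing one schema inside another, the fragment below pulls in type definitions from a separate file with xsd:include. The file name commonTypes.xsd and the element probeName are hypothetical, and the sketch assumes that the included file defines the nonEmptyToken type and that neither schema declares a target namespace.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical file assumed to define the utility type nonEmptyToken -->
  <xsd:include schemaLocation="commonTypes.xsd"/>

  <!-- The included type can now be used as if it were defined locally -->
  <xsd:element name="probeName" type="nonEmptyToken"/>

</xsd:schema>
```

When the two schemas belong to different target namespaces, xsd:import is used instead of xsd:include.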
4.5 Validation of New XML Schema
The XML schema which was provided to the author by MRC HGU was not a valid
schema and contained syntactical and other errors. Having a valid new XML schema
was a prerequisite for starting the project because this schema was the foundation of
the whole project.
A number of tools are available for working with XML schemas. These tools help
the user to create and validate schemas. The author downloaded evaluation versions
of several such tools and tested them for their features and user-friendliness. For
validation of the XML schema, Oxygen XML Editor 8.2 was used. It is a very
user-friendly tool but uses considerably more memory than some other tools.
Figure 5: Screen shot of Oxygen XML Editor
Oxygen XML Editor made the task of validation easier as it provides features
such as visual aids and automatic completion of end tags. Nevertheless, the
process of correcting the schema took quite some time. It was found, in the end, that
most of the errors reported by the editor were related to the incorrect definition
and use of namespaces in the schema. It is important to define namespaces correctly
while developing a schema. From the author’s experience, it is also recommended to
check the namespace definitions first while debugging. A wrong definition or usage
of a namespace gives rise to a large number of errors in the schema, and schema editing
tools are unable to identify the wrong namespace definition as the cause of these
errors.
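The kind of namespace declaration at issue can be sketched as follows. The namespace URI http://example.org/emage-schema is a placeholder, not the one used in the project, and the nonEmptyToken definition is an assumed one.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/emage-schema"
            xmlns="http://example.org/emage-schema"
            elementFormDefault="qualified">

  <!-- Defined in the target namespace declared above -->
  <xsd:simpleType name="nonEmptyToken">
    <xsd:restriction base="xsd:token">
      <xsd:minLength value="1"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- The unprefixed type reference resolves through the default namespace
       (xmlns="..."), which must therefore equal targetNamespace -->
  <xsd:element name="strain" type="nonEmptyToken"/>

</xsd:schema>
```

If the default namespace declaration is omitted or differs from targetNamespace, every unprefixed type reference of this kind fails to resolve, which is one way a single mistake can produce a large number of reported errors.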
Concept of Namespace in XML
Namespace in XML is a mechanism of avoiding conflicts among the names of
elements. For instance, consider the following two XML documents:
<table>
<tr><td>5 prime</td></tr>
</table>
and
<table>
<helix>3 prime</helix>
</table>
If these two documents are merged together, there would be a conflict
between the two <table> elements. Such conflicts can be avoided by using
namespace as below:
<htm:table xmlns:htm="http://www.w3.org/namespace1">
    <htm:tr><htm:td>5 prime</htm:td></htm:tr>
</htm:table>
and
<x:table xmlns:x="http://www.w3.org/namespace2">
    <x:helix>3 prime</x:helix>
</x:table>
Now, both the <table> elements are identified by two different namespace
prefixes “htm” and “x” and there would not be any conflict if two documents
are merged together. [6] [9]
5. Transformation of Original Dataset
After successfully validating the new XML schema, the transformation of the original
dataset according to the new XML schema was the next step. The inputs to this step
were the XML documents in the dataset and the validated new XML schema. As an
output, the XML documents were to be transformed into documents with the new structure.
The data in these documents would remain the same but the names and positions of the
elements would change as per the new schema.
5.1 Investigated Approaches
5.1.1 JDOM based Application
JDOM is a document object model for XML designed for the Java platform. It is
freely available in the form of an API which can be used to develop
applications. The API can be downloaded from the http://www.jdom.org website.
JDOM is Java-centric, Java-optimized and combines DOM with Simple API for
XML (SAX). For parsing of XML documents, JDOM uses external parsers and it is
possible to specify a particular parser to be used for the purpose. The XPath and
XSLT support is integrated in JDOM. [13]
This approach was investigated by starting to build a Java application which would use
JDOM. The fundamental logic behind the application was as below:
• Read the source XML document element by element
• Find a new name for the element from an XML document containing
mapping information
• Write the element into a new XML document using data from the source
element and element name from the mapping information.
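The mapping document read in the second step might be organised as in the sketch below. The format (a mappings root with one map entry per element) is purely hypothetical; only the species to commonName correspondence is taken from section 4.1, and the strain entry is an assumed identity mapping.

```xml
<!-- Hypothetical mapping document: one entry per source element -->
<mappings>
  <!-- species in the original dataset becomes commonName in the new schema -->
  <map source="species" target="commonName"/>
  <!-- assumed: strain keeps its name -->
  <map source="strain" target="strain"/>
</mappings>
```

A flat name-to-name table like this can express renaming but not the repositioning of elements under new parents, which hints at why the basic application proved insufficient.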
A basic application was developed which worked as per this logic. However, the
XML documents in the original dataset and the expected transformation were far
more complex than could be handled by this basic application. It was also noted that
building an application which could handle the transformation task at hand would
take much more time than was available. As a result, this approach was found to be
unsuitable for this project.
5.1.2 XML Shredding using DB2 9 Database
IBM DB2 9 database management system10 allows already existing XML documents
to be “shredded” into the columns of a relational table. It was thought that this
facility could be used for carrying out transformation. The logic behind this approach
was as below:
• Shred the XML documents into a relational database table. One XML
document would fill one row (record) in the table
• Develop a Java application which would connect to the relational database
table
• Read a record from the relational table and write the values in the record in a
new XML document according to the new element names and positions
This approach was also found unsuitable because it required the XML schema for the
original dataset to be annotated. The shredding functions of the DB2 database shred an
XML document as per the annotations given in the schema. It was estimated that
the time required for the correct annotation of the schema might exceed the time
available. Therefore, this approach was not pursued for this project.
10 See section 6.1 for more details
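To give an impression of the annotation effort involved, the sketch below marks two elements of a schema for shredding. It assumes the db2-xdb:rowSet and db2-xdb:column annotations of DB2 9's annotated schema decomposition, and should be checked against the DB2 documentation before use; the table name SPECIMEN, the column names and the element names are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:db2-xdb="http://www.ibm.com/xmlns/prod/db2/xdb1">
  <xsd:element name="SpecimenDetails">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Each annotated element names the target table and column
             (assumed annotation names; hypothetical table and columns) -->
        <xsd:element name="species" type="xsd:string"
                     db2-xdb:rowSet="SPECIMEN" db2-xdb:column="SPECIES"/>
        <xsd:element name="strain" type="xsd:string"
                     db2-xdb:rowSet="SPECIMEN" db2-xdb:column="STRAIN"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Every element whose value is to be stored relationally needs such an annotation, which is consistent with the estimate that annotating the full schema would take too long.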
5.1.3 XSL Transformation
XSL is a language used for transforming and presenting XML data, for example as an HTML
web page. XSL Transformation (XSLT) is a mechanism for transforming an XML document into
other formats such as HTML, XHTML or XML. This approach was found to be suitable
for the project because it required the least time of the approaches that
were considered.
5.2 XSL Transformation
XSL stands for Extensible Stylesheet Language. XSL Transformation (XSLT) is a
language that is used for transforming the XML data into various other formats
including XML. The XSLT itself is XML based. During transformation the source
XML document remains unchanged and the target document is created using only the
content of the source document.
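A minimal XML-to-XML stylesheet of this kind is sketched below. It is not the stylesheet used in the project: the source root SpecimenDetails and its children are assumed to mirror the IDL struct of section 4.1, and the target element organism is a hypothetical name for an element of type organismType.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Match the (assumed) root of a source document -->
  <xsl:template match="/SpecimenDetails">
    <organism>
      <!-- The value is copied; only the element name and position change -->
      <commonName>
        <xsl:value-of select="species"/>
      </commonName>
      <strain>
        <xsl:value-of select="strain"/>
      </strain>
    </organism>
  </xsl:template>

</xsl:stylesheet>
```

The source document is only read; the output is a new document built from the literal result elements and the values selected from the source.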
5.2.1 Rationale of Using XSL Transformation
After investigating different approaches described in the previous section, it was
found that using XSL transformation was the most suitable method to perform the
task at hand. The decision of selecting XSL transformation was based on many
advantages that this approach provides. These advantages are explained below.
The foremost reason for using XSL transformation was that it was the quickest way to
carry out the transformation of the XML documents in the original dataset. The
dissertation project was to be completed in a limited amount of time and meeting the
deadline was of utmost importance. Therefore, the approach that would take the least
time to perform the desired work would be the most suitable one. There are tools
available which can help the user to create XSLT code based on the mapping between
two XML schemas. This code is then used for the transformation of the original XML
source documents. The two XML schemas that were used in this project were
large and complex, and there existed a large number of mappings
between them. Consequently, the resultant XSLT code was going to be quite large too,
and producing the correct XSLT code manually would have been a time-consuming task.
The availability of mapping tools saved the effort of developing code
that was repetitive in nature and did not require the application of complex logic.
XSL transformation is a standard method of transforming XML documents, which is why
tools have been built to make XSL transformations easier. There is a well-established
W3C standard for XSLT, which provides for variable definitions, conditional constructs
such as if and choose/when, and loop constructs such as for-each. These facilities make
XSLT a full-fledged programming language for writing transformation code.
Another advantage of XSLT is that the code can be modified to generate different kinds
of target formats from the same source XML. This dissertation project required an
XML-to-XML transformation but, if needed, the XSLT code could be adapted to transform
the original XML data into other formats such as HTML, XHTML and PDF.
5.2.2 Mapping Scheme
In order to produce the XSLT code, it was necessary to know which elements and
attributes of the schema for the original source documents map to which elements and
attributes of the new XML schema. For this purpose a mapping scheme was provided by
the staff at the MRC HGU. The scheme assigned numbers to the IDL variables in the IDL
structure of the existing EMAGE database, and the corresponding elements in the new
XML schema were marked with the same numbers. This mapping scheme was provided to the
author for the purpose of carrying out the transformation work.
A small portion of this mapping scheme is given below to demonstrate how the source
schema was mapped to the target schema.
// details for any publication
struct PublicationDetails
{
    string authors;        ............. 56
    string journal;        ............. 57
    string title;          ............. 58
    string volume;         ............. 59
    string issue;          ............. 60
    unsigned short year;   ............. 61
    string pages;          ............. 62
    string accessionNo;    ............. 63
};
Snippet: A portion of the IDL structure showing mapping numbers in bold
<xsd:complexType name="publicationType">
  <xsd:sequence>
    <xsd:element name="author" type="nonEmptyToken"/> ... 56
    <xsd:element name="journal" type="xsd:token" minOccurs="0"/> ... 57
    <xsd:element name="title" type="nonEmptyToken"/> ... 58
    <xsd:element name="volume" type="xsd:token" minOccurs="0"/> ... 59
    <xsd:element name="issue" type="xsd:token" minOccurs="0"/> ... 60
    <xsd:element name="year" type="xsd:token"/> ... 61
    <xsd:element name="page" type="xsd:token" minOccurs="0"/> ... 62
    <xsd:group ref="accession" minOccurs="0"/> ... 63
  </xsd:sequence>
</xsd:complexType>
Snippet: A portion of the new XML schema showing mapping numbers in bold
As this portion of the mapping scheme shows, the variable numbered 56 in the IDL
structure (authors) maps to the author element of the publicationType complex type
in the new XML schema.
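A numbered mapping of this kind translates fairly directly into XSLT. The following is
a minimal sketch under stated assumptions, not the stylesheet produced for the project:
it assumes a hypothetical source element <PublicationDetails> whose children are named
after the IDL variables, and a hypothetical target element <publication> of type
publicationType. Only mappings 56, 57, 58 and 61 are shown, and the class and method
names are illustrative. The stylesheet is applied with the XSLT 1.0 processor that
ships with the Java platform.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PublicationMappingSketch {

    // Hypothetical stylesheet: source names follow the IDL variables,
    // target names follow the publicationType sequence.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/PublicationDetails'>"
      + "<publication>"
      + "<author><xsl:value-of select='authors'/></author>"    // mapping 56
      + "<xsl:if test='journal'>"                                // mapping 57, minOccurs=0
      + "<journal><xsl:value-of select='journal'/></journal>"
      + "</xsl:if>"
      + "<title><xsl:value-of select='title'/></title>"          // mapping 58
      + "<year><xsl:value-of select='year'/></year>"             // mapping 61
      + "</publication>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms one source document (given as a string) and returns the result.
    static String transform(String sourceXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(sourceXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The optional journal element is wrapped in xsl:if so that it is emitted only when the
source contains it, which mirrors the minOccurs="0" declaration in the target schema.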
5.2.3 Mapping Tools
Two tools available for mapping between XML schemas were tested: Altova® MapForce 2007
Enterprise Ed. and Stylus Studio® 2007 XML Enterprise Suite. Both provide the same
functionality through graphical mapping interfaces. Stylus Studio is an integrated tool
for working with different XML technologies, while MapForce is designed specifically
for mapping. The author downloaded evaluation versions of both tools from the websites
of their respective makers and explored the facilities they provide. In the end, the
author chose Altova® MapForce for the development of the XSLT code. The reasons behind
this decision were:
• Stylus Studio consumed a lot of computer resources and took a long time to start
on the author’s machine.
• The author was using evaluation versions of both tools under academic licenses.
The makers of Stylus Studio did not extend the evaluation period beyond 30 days,
so the software could not be used further. The makers of MapForce, however,
extended the academic license.
• In the author’s experience, MapForce was generally more user-friendly.
MapForce accepts a source and a target schema from the user and presents the elements
and attributes of both schemas on the screen. The user can drag an element or
attribute of the source schema and drop it on the corresponding element or attribute
of the target schema to make a connection between them. The connection indicates the
mapping between the two.
The tool also provides a graphical drag-and-drop interface for introducing:
• String functions, including concatenate string, find sub-string, find string
length etc.
• Functions for testing whether a node exists or does not exist
• Math functions such as add, divide, multiply, subtract, ceiling, floor etc.
• Logical functions, including equal-to, less-than, greater-than etc.
• Conversion functions for converting data to string, number and boolean
values
• XSLT-specific functions such as current, document, element-available,
generate-id etc.
• Constructs such as if, when and filter nodes
Figure 6: Screenshot of a portion of Mapping generated using MapForce
Mapping tools generate XSLT code in the background as the user proceeds with the
mapping. MapForce can generate code in several languages: XSLT 1.0, XSLT 2.0,
XQuery, Java, C# and C++.
Another very useful feature of MapForce is that it can accept a sample XML
document based on the source schema. As the user proceeds with the mapping between
the source and target schemas, the software can instantly show how the sample XML
document will be transformed by the mapping defined so far.
5.2.4 Automated Code Generation
Although software such as MapForce is very helpful in generating XSLT code from a
mapping, it was noted that the software could not always generate the code as desired.
In such cases, after the mapping work is finished, the automatically generated code
has to be edited manually to make it fit for purpose. Some important information was
being lost when the automatically generated XSLT code transformed the original
dataset. To resolve this, some changes were made to the code manually. Later, further
changes were made by the MRC staff in consultation with biologists in order to capture
the required biological meaning in the XML data.
The code generated by MapForce was also unnecessarily complicated, partly because the
XSLT code was created all in one template. The author believes that it would be a good
idea to develop mapping software that generates XSLT code using a template-based
approach. The software should allow the user to create templates and then define the
mapping in terms of these templates instead of on an element-to-element basis.
Although element-to-element mapping is easier to do and seems intuitive, the resulting
code is far more complicated than it should be. Towards the end of the dissertation,
the MRC HGU staff simplified the XSLT code by breaking it into templates. This should
make the code easier for MRC HGU staff to handle in further development after this
dissertation project is over.
5.2.5 Transformation Process
Once the required XSLT code was ready and validated, the next step was to use it to
transform the approximately 3600 XML documents in the original dataset. Transforming
one document at a time would have taken too long. Therefore, the author looked for a
tool that could carry out batch transformation.
Altova® XMLSpy 2007 Professional Ed. is a tool that can carry out batch XSL
transformation. To do this, the user creates a new project in XMLSpy and adds the
directory containing the source XML documents to the “XML Files” folder of the
project. In the properties of the project, the user can specify the XSL file to be
used for the transformation. In addition, the location of an XML schema can be
specified, which is then used to validate the transformed XML documents.
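The same batch step could also be scripted. The following is a minimal sketch, assuming
the standard Java XML transformation API rather than XMLSpy, which was the tool
actually used in this project. The class name, method name and directory layout are
illustrative, and schema validation of the output is left out.

```java
import java.io.File;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BatchTransformSketch {

    // Applies one stylesheet to every .xml file in inDir and writes each result,
    // under the same file name, into outDir. Returns the number of files written.
    static int transformAll(File stylesheet, File inDir, File outDir) {
        try {
            // Compile the stylesheet once and reuse it for all documents.
            Templates compiled = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(stylesheet));
            File[] sources = inDir.listFiles((dir, name) -> name.endsWith(".xml"));
            if (sources == null) {
                throw new IllegalArgumentException("Not a directory: " + inDir);
            }
            outDir.mkdirs();
            int count = 0;
            for (File source : sources) {
                compiled.newTransformer().transform(
                    new StreamSource(source),
                    new StreamResult(new File(outDir, source.getName())));
                count++;
            }
            return count;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Compiling the stylesheet into a Templates object once avoids re-parsing a large
generated stylesheet for each of several thousand documents.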
6. Database Preparation and Querying
Having transformed the original XML documents derived from the EMAGE object
database, the next step was to prepare a new database to store the transformed
documents. IBM DB2 9 was chosen as the platform for the database. It was not a
difficult choice, because MRC HGU had already acquired and set up a DB2 9 server on
their premises. Using the same platform for this dissertation meant that the results
of this project could easily be migrated to the MRC HGU server. However, this was not
the only reason why DB2 9 was a good choice: this latest version of the DB2 database
management system is well suited to XML data management.
6.1 Why IBM DB2 9
The DB2 9 database management system from IBM comes with the pureXML™ technology,
which is designed to overcome various long-standing problems in the storage and
retrieval of XML data.
Traditionally, XML data management has involved one or more of the following
approaches: [14]
• Store XML documents in the file system
• Stuff the XML data into relational databases using large objects (LOBs)
• Shred the XML data among different columns in relational tables
These are the obvious approaches, but they often fail to perform well. The file system
approach is easy but, as the number of documents grows, a file system does not scale
like a database. Searching through a large number of documents in the file system is
a very slow process. Features such as concurrency, security, recoverability and
usability are also not available to data stored in the file system.
By stuffing XML data into a VARCHAR column or a large object (such as a CLOB or BLOB)
in a relational database, some of the concerns related to storage in the file system
can be overcome. But the issue of low performance remains, because LOBs are good only
if the whole XML document is to be retrieved from the database. Searching for portions
of the XML (such as elements, attributes or sub-trees) is still a tedious task, as all
the documents need to be scanned at run-time in order to perform the search. [14]
Shredding is the process of decomposing XML documents. The decomposed portions of an
XML document are stored in the columns of relational tables. To achieve this, the XML
schema is annotated with mapping information, which the shredding facility then uses
to store the portions of XML in the appropriate columns of the relational tables. The
normalization rules of relational design and the complexity of the XML document may
cause the document to span a large number of columns. Retrieving the data or
reconstructing the XML document then requires complex queries, and sometimes
reconstruction may even become impossible. [14]
In order to learn about the capabilities and features of DB2, a paper titled
“DB2/XML: Designing for Evolution” by Beyer, Özcan, Saiprasad and Linden (2005) was
studied [1]. The paper states that “DB2 provides native XML storage, indexing,
navigation and query processing through both SQL/XML and XQuery”. DB2 can store XML
data in columns of relational tables. The way XML-typed data is physically stored in
DB2 preserves all the information in the XQuery data model, which means that DB2
supports XML fidelity. It can also shred XML data into relational form, thus
supporting relational fidelity. DB2 supports textual fidelity as well, by allowing XML
data to be stored in CLOB columns. Also, an XML column in a DB2 table does not need to
be associated with an XML schema for the purpose of validation. Validation can be done
during insertion of data or at the time of query.
The paper notes that the retrieval of data from large XML documents is slow because
XML itself does not provide any indexing capability. In DBMS products supporting XML,
it is possible to create indexes on entire XML documents, but data retrieval still
remains quite slow. DB2 overcomes this problem by indexing on XPath expressions
instead of indexing the whole document, which makes queries execute significantly
faster. The paper claims that the XML support in DB2 has been designed with evolution
in mind: the design decisions were taken to facilitate enhancement of the XML schemas
as and when required. The paper then explains these features of DB2 by means of a case
study. This paper provided the author with a good understanding of DB2 features with
respect to XML. Although the paper is not written for beginners in the field of
DB2/XML, it still proves useful in gaining insight into the issues related to XML data
management and how DB2 claims to resolve them. DB2 9 with pureXML™ technology was
released after the publication of this paper, and some of the concerns raised in the
paper regarding XML data management have been addressed more efficiently by the
pureXML™ technology.
To acquire a more practical understanding of DB2, courseware from IBM Corporation [7]
on DB2 9 was studied. This courseware is detailed and suitable for those who are new
to DB2 or XML databases. It provides not only information on how DB2 functions but
also good content on topics such as XML concepts, XPath and XQuery. The pureXML™
technology of DB2 9 is a new approach to storing XML data natively in databases. This
technology is discussed in the next section.
6.1.1 pureXML™ Technology
The pureXML™ technology in IBM DB2 9 provides an XML data type to store XML data
natively. It also provides efficient data management techniques for storing
hierarchical structures, which are very common in XML data. The pureXML™ technology
used in DB2 9: [7]
• Provides seamless consolidation of diverse data sources
• Provides XQuery/SQL interface, which enables faster and easier
development than the previously available methods
• Eliminates need for proprietary software to shred the XML data, which
means XML searches have become faster
• Provides flexible XML schema, which makes changes in the schema much
quicker
• Assists in conversion of data available as .DOC, .XLS and .PPT formats into
XML format
• Has XML support in all the APIs
6.2 Insertion of XML into DB2 Database
In order to query the XML documents through a web interface, it was important to
put all these documents in a relational database. This way, the web application would
be able to connect to the database server via APIs such as JDBC/ODBC and access
the XML data stored in the database.
For this purpose, a DB2 9 database was created, and in the database a simple table
was created with the following structure:
Figure 7: Structure of the relational table created to hold the XML documents
The structure of the database table is simple because all the data actually resides in the
XML documents. The table is used merely to put the XML documents in a relational
database. Therefore, one column, XMLDOC, was created to hold the XML
documents and another one, RECID, to hold the unique identification of an XML
document. Once the database was ready, a simple Java application was written which
could be used to insert records in the relational table.
At this point a problem occurred which took significant time to solve. When the Java
application attempted to connect to the DB2 9 database, the following error was shown
at runtime:
java.lang.ClassNotFoundException: COM.ibm.db2.jdbc.app.DB2Driver
ClassNotFoundException: COM.ibm.db2.jdbc.app.DB2Driver
A ClassNotFoundException is thrown by the Java Runtime Environment when it is not
able to find the definition of a class which is being loaded by the code. The DB2 9
database server comes with a driver which can be used to connect to a database via
JDBC. The driver class is COM.ibm.db2.jdbc.app.DB2Driver. It is stored in a zip file
called db2java.zip, which can be found in the installation directory of the DB2 9
server. On the Windows XP machine used by the author, the location of this file was:
C:\Program Files\IBM\SQLLIB\java\db2java.zip
If this error occurs, in the author’s experience, the first thing to check is whether
the location of the db2java.zip file is included in the CLASSPATH environment
variable. If it is not, the CLASSPATH variable should be edited to include the
location of the zip file. The Java Runtime uses the CLASSPATH variable to find the
classes that are used in a Java program.
The problem, however, was not solved even after setting the CLASSPATH variable
correctly. The author was using the Eclipse platform for Java development. It was
found that in Eclipse any referenced files must be added to the reference library of
the Eclipse project. To add the db2java.zip file to the reference library, the author
followed these steps:
Go to Project Properties of the project > Select Java Build Path > Add External Jar Files.
Zip files can also be added using the “Add external Jar files” option. Once the
db2java.zip file was added to the reference library, the Java Runtime was able to
find the driver required to connect to the DB2 9 database. Apparently, when Java
programs are run from within the Eclipse platform, the referenced files have to be
added to the project’s reference library, whereas the CLASSPATH variable must be set
correctly if the application is run from outside the Eclipse environment.
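A quick way to tell a classpath problem apart from other connection problems is to
test whether the driver class can be loaded at all. The following is a minimal sketch
of such a check. The class and method names are illustrative, and the driver class
name is the one quoted in the error message above.

```java
class DriverCheckSketch {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String driver = "COM.ibm.db2.jdbc.app.DB2Driver";
        System.out.println(driver + " found: " + isOnClasspath(driver));
        // Shows the locations the Java Runtime is actually searching.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

Printing java.class.path from inside the running program shows the effective classpath,
which under an IDE may differ from the CLASSPATH environment variable.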
It is useful to mention here that the XML documents were inserted into the XML
column of the table as a binary stream. The Java program used a FileInputStream
to open the XML file as a stream, and this input stream was then passed to
the XML column. The following snippet shows the code.
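import java.io.File;
import java.io.FileInputStream;
import java.sql.PreparedStatement;

// Inserts one XML file into LK.EMAGEXML. Assumes con is an open java.sql.Connection
// and that fileNameForID and fileLoc are Strings defined earlier in the program.
String strSQL = "INSERT INTO LK.EMAGEXML VALUES (?, ?)";
File xmlFile = new File(fileLoc);
try (PreparedStatement ps = con.prepareStatement(strSQL);
     FileInputStream in = new FileInputStream(xmlFile)) {
    ps.setString(1, fileNameForID);                      // RECID
    ps.setBinaryStream(2, in, (int) xmlFile.length());   // XMLDOC, sent as a binary stream
    ps.execute();
} catch (Exception e) {
    e.printStackTrace();
}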
6.3 Querying of XML Data
6.3.1 XQuery Language
XQuery is the language used for querying XML data. XQuery does for XML data what SQL
does for a Relational Database Management System (RDBMS). XQuery uses XPath and FLWOR
expressions to locate data in an XML document.
XPath is the syntax for navigating an XML document. It uses path expressions, similar
to paths in the directory structure of a computer file system, to select one or more
nodes in the document. For example, in the following XML snippet:
<specimen>
  <organism>
    <commonName>mouse</commonName>
    <stage>
      <name>dpc</name>
      <value>9.5</value>
    </stage>
    <strain>-</strain>
  </organism>
  <type>whole mount</type>
  <sex>unknown</sex>
  <genotype>
    <wildType>true</wildType>
  </genotype>
</specimen>
To select all the <wildType> nodes, the XPath would be:
/specimen/genotype/wildType
The initial forward slash (/) represents the root of the XML document. XPath has more
than a hundred built-in functions, which make node selection and other tasks easier.
For example, to get the text present within the <value> node, the text() node test
could be used:
/specimen/organism/stage/value/text()
Similarly, to select the last of the <specimen> elements matched at the top level, the
XPath would be:
/specimen[last()]
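These path expressions can be tried out without a database. The following is a minimal
sketch that evaluates them against the snippet above using the XPath 1.0 processor
included in the Java platform. The class and method names are illustrative.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathSketch {

    // The <specimen> snippet from the text, written on one line.
    static final String SPECIMEN_XML =
        "<specimen>"
      + "<organism><commonName>mouse</commonName>"
      + "<stage><name>dpc</name><value>9.5</value></stage>"
      + "<strain>-</strain></organism>"
      + "<type>whole mount</type><sex>unknown</sex>"
      + "<genotype><wildType>true</wildType></genotype>"
      + "</specimen>";

    // Evaluates an XPath 1.0 expression against the snippet and returns the
    // string value of the first node selected ("" if nothing matches).
    static String eval(String xpath) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(xpath, new InputSource(new StringReader(SPECIMEN_XML)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("/specimen/genotype/wildType"));
        System.out.println(eval("/specimen/organism/stage/value/text()"));
        System.out.println(eval("/specimen[last()]/type"));
    }
}
```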
XPath comes in very handy while working with XQuery, which uses XPath expressions to
select the required nodes in the XML document. The FLWOR (For, Let, Where, Order by
and Return) expressions of XQuery make it easy to manipulate the nodes and data
selected by the XPath expressions. A simple example is given below:
for $x in doc("EMAGE_100.xml")/specimen
let $cName := $x/organism/commonName
where $x/organism/stage/value/text() >= 9
return $cName
This XQuery expression first opens the “EMAGE_100.xml” document and selects all the
<specimen> elements within it. For each <specimen> element, it binds the commonName
element to the variable $cName and then returns $cName only if the dpc stage value is
9 or higher. The same process is repeated for all the <specimen> elements found in the
XML file.
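The Java platform does not include an XQuery engine, but the effect of the where
clause can be approximated with an XPath 1.0 predicate. The following is a minimal
sketch under that substitution. The class and method names are illustrative, and the
test document, including any wrapper element around several <specimen> elements, is
assumed. The number() call makes the comparison numeric, so a stage value of 10.5 is
treated as greater than 9, which a string comparison would not do.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StageFilterSketch {

    // Returns the commonName of every <specimen> whose stage value is >= minStage.
    static List<String> commonNamesFrom(String xml, double minStage) {
        try {
            // Plain decimal notation, because XPath 1.0 numbers have no exponent form.
            String limit = BigDecimal.valueOf(minStage).toPlainString();
            // The predicate plays the role of the FLWOR "where" clause.
            String path = "//specimen[number(organism/stage/value) >= " + limit + "]"
                        + "/organism/commonName";
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                path, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(hits.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```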
6.3.2 SQL/XML
SQL/XML is an extension of the SQL standard. It provides several functions which
can be used to construct and query XML data in SQL queries. With the introduction of
an XML data type in several commercial databases, retrieving and manipulating the XML
data became a major concern [17]. The SQL/XML extension answers this concern. It
provides functions that combine SQL queries with XQuery expressions. For instance:
SELECT RECID,
       XMLQUERY('$t/hguMrcSubmission/inSituAssay/entityBeingDetected/symbol/text()'
                PASSING XMLDOC AS "t") AS xmltxt
FROM LK.EMAGEXML
The above SQL query retrieves data from an XML column just like a column of any
other data type. RECID is a VARCHAR column, whereas the data type of the XMLDOC column
is XML. While the SQL SELECT statement can retrieve data from RECID directly, the
data from the XMLDOC column is retrieved using the XMLQUERY function of the
SQL/XML extension. The following query shows how an SQL SELECT statement and an
XQuery FLWOR expression can be used together:
SELECT XMLSERIALIZE(
         XMLQUERY('for $doc in $t/hguMrcSubmission
                   let $status := $doc/annotation/expressionByOntology/expression/strength/text()
                   where $status = "detected"
                   return $doc/inSituAssay/entityBeingDetected/symbol/text()'
                  PASSING XMLDOC AS "t" RETURNING SEQUENCE)
         AS CLOB (32K))
FROM LK.EMAGEXML
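Such a query can be issued from Java through JDBC like any other SELECT statement. The
following is a minimal sketch, not code from this project. It assumes that DB2 accepts
a parameter marker in the PASSING clause when the marker is wrapped in a CAST, so that
the strength value is supplied as the XQuery variable $s instead of being pasted into
the query text. The class name, method name and the variable name "s" are illustrative.
An open connection to the database described above is required to run it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class SqlXmlQuerySketch {

    // The "?" is bound from Java and passed into the XQuery as $s
    // (assumed DB2 syntax: the marker is wrapped in a CAST).
    static final String SQL =
        "SELECT RECID, XMLSERIALIZE(XMLQUERY("
      + "'for $doc in $t/hguMrcSubmission "
      + "where $doc/annotation/expressionByOntology/expression/strength/text() = $s "
      + "return $doc/inSituAssay/entityBeingDetected/symbol/text()' "
      + "PASSING XMLDOC AS \"t\", CAST(? AS VARCHAR(32)) AS \"s\" "
      + "RETURNING SEQUENCE) AS CLOB(32K)) "
      + "FROM LK.EMAGEXML";

    // Returns one "RECID: symbol" string per row of LK.EMAGEXML.
    static List<String> symbolsWithStrength(Connection con, String strength) {
        List<String> rows = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            ps.setString(1, strength);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {   // forward-only iteration over the rows
                    rows.add(rs.getString(1) + ": " + rs.getString(2));
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        return rows;
    }
}
```

Binding the value as a parameter keeps user input out of the query text, which matters
once the search criteria come from a web form.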
6.4 User Interface Development
Although IBM DB2 9 is a good candidate for the development of an XML database, in
the author’s experience, working with DB2 was a challenging task. Many DB2-related
issues slowed the progress of the project. It took significant time to resolve these
issues, and in some cases an issue could not be resolved and an alternative approach
had to be adopted in order to finish the project within the available time.
6.4.1 Java Server Pages (JSP)
For the development of the user interface, a web application based on Java Server
Pages (JSP) was tried out initially. The setup involved the Tomcat web server, JDBC
and DB2. Tomcat (http://tomcat.apache.org) is a freely available application server
which implements the JSP and Servlet specifications and is widely used for JSP/Servlet
applications. After installing Tomcat, the necessary configuration was done; the
instructions for configuring Tomcat come with the installation package.
The web server was installed successfully and served simple JSP pages without any
problem. However, when code to connect to the DB2 database was included in a JSP page,
the web server could not serve the page. The author searched the Internet for ways to
solve this DB2/Tomcat connection problem, but the problem could not be solved. In the
author’s experience, connecting to database management systems such as SQL Server or
Oracle through Tomcat is a rather straightforward procedure. With DB2 it was not as
straightforward.
The first problem which appeared while connecting to the DB2 database through
Tomcat took the form of the following exception:
java.lang.ClassNotFoundException: com.ibm.db2.jdbc.app.DB2Driver
It was learnt that the DB2 driver comes in the following two versions:
Application driver: com.ibm.db2.jdbc.app.DB2Driver
Network driver: com.ibm.db2.jdbc.net.DB2Driver
The network driver is used for connecting to DB2 from Java applets, while the
application driver is used by local applications. Since the JSP application in this
dissertation needed to connect to DB2 locally, the application driver was used. The
above-mentioned ClassNotFoundException was resolved by including the db2java.zip file
in the Eclipse environment (see section 6.2 for more details). This, however, did not
help much, because the following exception then appeared:
SQLException: [IBM][JDBC Driver] CLI0627E The result set is not scrollable
This exception arises when the JSP code tries to navigate through the resultset (see
footnote 11) obtained from the DB2 database. Navigation through the resultset is very
important, and the intended web application could not be developed without this
facility. The only possible solution for this error that could be found in the IBM DB2
manual and on the Internet was to request a scrollable resultset (scroll-insensitive
in the code below) when creating the statement. This could be done as below:
Statement stmtObj =
con.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE,
ResultSet.CONCUR_UPDATABLE);
The problem of resultset scrolling, however, was not solved even with the above
solution. Further research into the problem did not yield any success and, as a
result, the JSP implementation of the user interface was taking too much time. At this
point, in order to complete the project in time, the author decided to investigate an
alternative approach and develop the user interface using PHP instead of JSP.
11 A resultset is an object which contains the records matching with the criteria specified in the query.
6.4.2 Hypertext Preprocessor (PHP)
The decision to use PHP, after much time-consuming research into the
Tomcat/JDBC/DB2 problems, was made because PHP is one of the most widely used
server-side scripting languages. It was hoped that PHP would have a readily available
solution for DB2 connectivity.
PHP is a language used for developing dynamic web pages. To make use of it, a
PHP-enabled web server must be installed. The author used the Apache web server, which
is freely available from the Apache Software Foundation’s website
(http://httpd.apache.org/download.cgi). In addition, PHP needs to be downloaded from
http://www.php.net and installed. In order to connect to the DB2 database, a PECL
extension also needs to be installed on the computer where PHP has been installed.
PECL stands for PHP Extension Community Library; it is a mechanism for distributing
PHP extensions. The PECL extension required for working with DB2 is ibm_db2, which can
be downloaded from http://pecl.php.net/package/ibm_db2. The configuration of a PHP
installation resides in the php.ini file. The extension should be installed in the
directory where PHP has been installed (in the case of PHP 5 on a Windows machine, the
ibm_db2 extension consists of one DLL file which needs to be copied to the PHP
directory). The following changes need to be made in the php.ini configuration file to
make PHP work with DB2:
• Add the ibm_db2 extension so that it is loaded whenever PHP is loaded in
the web server. (Syntax is extension=php_ibm_db2.dll )
• The include_path variable should be set correctly to the locations where the
required include files are located. The location of the ibm_db2 PECL extension
must be in this path
Besides this, it is also recommended to set the following variables as shown. This
helps in tracking errors in the PHP pages.
• display_errors = On
• error_reporting = E_ALL
• log_errors = On
When PHP is installed, it automatically changes the Apache web server configuration
file so that Apache loads the PHP module whenever it starts. If there is any problem,
it is recommended to check whether Apache is loading the PHP module. This can be
ascertained by checking that the following line is present in the Apache configuration
file (httpd.conf):
LoadModule php5_module "C:\\Program Files\\PHP\\php5apache2_2.dll"
This line specifies that the PHP 5 module at the given location should be loaded
when Apache starts.
6.5 User Interface for a Few Queries
The development of a full-fledged query interface was not part of this dissertation.
Therefore, MRC HGU suggested a small set of queries which could be implemented, with a
user interface, for demonstration purposes. These queries were provided in the form of
plain English questions. Appropriate XQueries had to be prepared which could be
executed to get the answer data from the XML database. The user interface for entering
the query criteria and for displaying the returned data was also prepared using PHP.
Details of these queries and the user interface are given below.
Query 1: What genes are detected (or not detected) in an anatomical structure
between a specified range of Theiler Stages?
This query has three inputs: the status of detection of gene expression, the
anatomical structure and the range of Theiler Stages. The anatomical structure is
entered as a reference. The EMAGE database stores anatomical references to the EMAP
Mouse Atlas. In the intended design, the user enters the reference, and the name of
the anatomical structure, images and other details are fetched from the Mouse Atlas
and shown in the result. However, this connectivity between the EMAGE XML database and
the EMAP Mouse Atlas was not to be developed as part of this dissertation (it would
have required using web services). Consequently, the query takes the reference EMAP ID
as input from the user and returns a record if it has a matching EMAP ID associated
with it. To receive the inputs from the user, the following interface was developed:
Figure 8: Input screen for query about gene expression detection
When the user enters the query input through this interface and clicks the “Search”
button, the result of the query is displayed as below:
Figure 9: Output of the gene expression detection query
The XQuery code which brings the results for this query from the XML database is
given below:
// Build the SQL/XML statement; the XQuery is embedded in XMLQUERY('...').
$query = "SELECT XMLSERIALIZE(XMLQUERY('
  for \$doc in \$t/hguMrcSubmission
  let \$status:=\$doc/annotation/textAnnotation/expressionByOntology/expression/strength/text()
  let \$anatomyref:=\$doc/annotation/textAnnotation/expressionByOntology/accession/text()
  let \$TStage:=\$doc/annotation/referenceStage/value/text()
  let \$id:=\$doc/@accession
  let \$DPCStage:=\$doc/specimen/organism/stage/value/text()
  let \$inSituAssayPresence:=if(fn:exists(\$doc/inSituAssay/firstLabel/text()))
      then(\"ISH\") else(\"\")
  let \$antibodyAssayPresence:=if(fn:exists(\$doc/antibodyAssay/firstLabel/text()))
      then(\"IHC\") else(\"\")
  let \$reporterAssayPresence:=if(fn:exists(\$doc/reporterAssay/firstLabel/text()))
      then(\"ISR\") else(\"\")
  let \$inSituSymbol:=if(fn:exists(\$doc/inSituAssay/entityBeingDetected/symbol/text()))
      then(\$doc/inSituAssay/entityBeingDetected/symbol/text()) else(\"\")
  let \$antibodySymbol:=if(fn:exists(\$doc/antibodyAssay/entityBeingDetected/symbol/text()))
      then(\$doc/antibodyAssay/entityBeingDetected/symbol/text()) else(\"\")
  let \$reporterSymbol:=if(fn:exists(\$doc/reporterAssay/entityBeingDetected/symbol/text()))
      then(\$doc/reporterAssay/entityBeingDetected/symbol/text()) else(\"\")
  let \$specimenType:=\$doc/specimen/type/text()
  let \$inSituProbeID:=if(fn:exists(\$doc/inSituAssay/detectionReagent/name/text()))
      then(\$doc/inSituAssay/detectionReagent/name/text()) else()
  let \$antibodyProbeID:=if(fn:exists(\$doc/antibodyAssay/detectionReagent/name/text()))
      then(\$doc/antibodyAssay/detectionReagent/name/text()) else()
  let \$genotype:=if(\$doc/specimen/genotype/wildType/text())
      then(\"wild-type\") else(\"mutant\")
  where \$status=\"$detection\" and ";
// Restrict to an anatomical component only if the user supplied one.
if($anatomaicalname!="")
{$query=$query."\$anatomyref=\"$anatomaicalname\" and ";}
$query=$query."\$TStage>=\"$TSfrom\" and \$TStage<=\"$TSto\"
  return <tr>
    <td class=\"tab_item\">{\$inSituSymbol}{\$antibodySymbol}{\$reporterSymbol}</td>
    <td class=\"tab_item\">{data(\$id)}</td>
    <td class=\"tab_item\">{\$inSituProbeID}{\$antibodyProbeID}</td>
    <td class=\"tab_item\">TS{\$TStage}</td>
    <td class=\"tab_item\">{\$DPCStage}dpc</td>
    <td class=\"tab_item\">{\$inSituAssayPresence}{\$antibodyAssayPresence}{\$reporterAssayPresence}</td>
    <td class=\"tab_item\">{\$specimenType}</td>
    <td class=\"tab_item\">{\$genotype}</td>
  </tr>' PASSING XMLDOC AS \"t\" RETURNING SEQUENCE) AS CLOB (32K)) FROM LK.EMAGEXML";
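The way the optional anatomy condition is spliced into the where clause can be sketched as follows. This is an illustration in Python, not the PHP used in the project; the function name `build_where_clause` and its parameter names are hypothetical. The sketch also compares stages numerically, which assumes that stage values are stored as plain integers: compared as strings, "9" would sort after "10". Only `&` and double quotes are escaped here; quoting for the enclosing SQL literal is left out.

```python
def build_where_clause(detection, anatomy, ts_from, ts_to):
    """Return an XQuery where clause for the expression-detection query.

    Hypothetical helper: the names and the numeric stage comparison are
    assumptions made for illustration, not taken from the thesis code.
    """
    def quote(value):
        # Inside an XQuery string literal, & must be written as &amp;
        # and a double quote is escaped by doubling it.
        text = str(value).replace("&", "&amp;").replace('"', '""')
        return '"' + text + '"'

    conditions = ["$status = " + quote(detection)]
    if anatomy:  # the anatomy filter is optional
        conditions.append("$anatomyref = " + quote(anatomy))
    # int() rejects non-numeric input; number() makes the comparison numeric.
    conditions.append("number($TStage) >= %d" % int(ts_from))
    conditions.append("number($TStage) <= %d" % int(ts_to))
    return "where " + " and ".join(conditions)


print(build_where_clause("detected", "", "9", "12"))
```

Collecting the conditions in a list and joining them with " and " avoids having to track where a trailing "and" is needed when a filter is left empty.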
Query 2: How many fully (or partially) sequenced assay records are available
for different types of assays?
This query has two inputs: the sequence status (fully or partially sequenced) and the
type of assay (ISH, IHC or ISR)12. To receive these inputs from the user, the
following interface was developed:
12 ISH is used for in situ assay, IHC for antibody assay and ISR for reporter assay.
Figure 10: Input screen for query that counts the fully or partially sequenced assays
After selecting the sequence status and assay type, the user clicks the “Search”
button and the count of the matching records is displayed:
Figure 11: Output of the query that counts the fully or partially sequenced assays
The XQuery code that generates the above output is given below:
$query = "SELECT XMLSERIALIZE(XMLQUERY('
  for \$doc in \$t/hguMrcSubmission
  let \$assayType:=\$doc/experiment/assayType/text()
  let \$id:=\$doc/@accession
  let \$inSituSeqSts:=\$doc/inSituAssay/entityBeingDetected/sequence/sequenceField/@sequenceStatusType
  let \$antibodySeqSts:=\$doc/antibodyAssay/entityBeingDetected/sequence/sequenceField/@sequenceStatusType
  let \$reporterSeqSts:=\$doc/reporterAssay/entityBeingDetected/sequence/sequenceField/@sequenceStatusType
  where \$assayType=\"$assayType\" and
    (if (\$assayType eq \"ish\") then \$inSituSeqSts eq \"$seqStatus\"
     else if(\$assayType eq \"ihc\") then \$antibodySeqSts eq \"$seqStatus\"
     else false())
  return <tr><td>{data(\$id)}</td></tr>'
  PASSING XMLDOC AS \"t\" RETURNING SEQUENCE) AS CLOB (32K)) FROM LK.EMAGEXML";
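The selection logic of this query can be tried out on a few toy records without a database. The sketch below is in Python with the standard `xml.etree.ElementTree` module; the records, their accessions and the function name `matching_accessions` are invented for illustration. Like the query, it treats any assay type other than "ish" and "ihc" as non-matching.

```python
import xml.etree.ElementTree as ET

# Toy records; accessions and values are invented for illustration.
RECORDS = [
    """<hguMrcSubmission accession="X:1">
         <experiment><assayType>ish</assayType></experiment>
         <inSituAssay><entityBeingDetected><sequence>
           <sequenceField sequenceStatusType="fully-sequenced"/>
         </sequence></entityBeingDetected></inSituAssay>
       </hguMrcSubmission>""",
    """<hguMrcSubmission accession="X:2">
         <experiment><assayType>ish</assayType></experiment>
         <inSituAssay><entityBeingDetected><sequence>
           <sequenceField sequenceStatusType="partially-sequenced"/>
         </sequence></entityBeingDetected></inSituAssay>
       </hguMrcSubmission>""",
    """<hguMrcSubmission accession="X:3">
         <experiment><assayType>ihc</assayType></experiment>
         <antibodyAssay><entityBeingDetected><sequence>
           <sequenceField sequenceStatusType="fully-sequenced"/>
         </sequence></entityBeingDetected></antibodyAssay>
       </hguMrcSubmission>""",
]

# Element that carries the sequence status for each assay type handled
# by the query; other assay types fall through to "no match".
ASSAY_ELEMENT = {"ish": "inSituAssay", "ihc": "antibodyAssay"}


def matching_accessions(records, assay_type, seq_status):
    """Return accessions of records with the given assay type and status."""
    element = ASSAY_ELEMENT.get(assay_type)
    if element is None:
        return []
    path = element + "/entityBeingDetected/sequence/sequenceField"
    hits = []
    for text in records:
        doc = ET.fromstring(text)
        if doc.findtext("experiment/assayType") != assay_type:
            continue
        statuses = [f.get("sequenceStatusType") for f in doc.findall(path)]
        if seq_status in statuses:
            hits.append(doc.get("accession"))
    return hits


# The count shown to the user is the number of matching records.
print(len(matching_accessions(RECORDS, "ish", "fully-sequenced")))
```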
Query 3: Which components express a particular gene?
A gene symbol is the only input for this query. The following interface accepts this input:
Figure 12: Input screen for the query that finds components where a gene is expressed
And the following interface displays the results:
Figure 13: Output of the query that finds components where a gene is expressed
The XQuery code for this query is given below:
$query = "SELECT XMLSERIALIZE(XMLQUERY('
  for \$doc in \$t/hguMrcSubmission
  let \$status:=\$doc/annotation/textAnnotation/expressionByOntology/expression/strength/text()
  let \$anatomyref:=\$doc/annotation/textAnnotation/expressionByOntology/accession
  let \$inSituSymbol:=if(fn:exists(\$doc/inSituAssay/entityBeingDetected/symbol/text()))
      then(\$doc/inSituAssay/entityBeingDetected/symbol/text()) else(\"\")
  let \$antibodySymbol:=if(fn:exists(\$doc/antibodyAssay/entityBeingDetected/symbol/text()))
      then(\$doc/antibodyAssay/entityBeingDetected/symbol/text()) else(\"\")
  let \$reporterSymbol:=if(fn:exists(\$doc/reporterAssay/entityBeingDetected/symbol/text()))
      then(\$doc/reporterAssay/entityBeingDetected/symbol/text()) else(\"\")
  where \$status=\"detected\" and
    (\$inSituSymbol=\"$geneSymbol\" or \$antibodySymbol=\"$geneSymbol\"
     or \$reporterSymbol=\"$geneSymbol\")
  return <tr><td>{data(\$anatomyref)}</td></tr>'
  PASSING XMLDOC AS \"t\" RETURNING SEQUENCE) AS CLOB (32K)) FROM LK.EMAGEXML";
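One point worth checking in a query of this shape is how strengths and components are paired. If a record can hold several expressionByOntology entries (an assumption here, since that part of the schema is not shown), a comparison on the whole sequence of strengths is true as soon as any one entry is "detected", and every component of the record is then returned. The Python sketch below pairs each accession with its own strength instead. The record content and the function name `components_expressing` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Toy record with invented values: one structure detected, one not.
RECORD = """<hguMrcSubmission accession="X:1">
  <inSituAssay>
    <entityBeingDetected><symbol>Shh</symbol></entityBeingDetected>
  </inSituAssay>
  <annotation><textAnnotation>
    <expressionByOntology>
      <accession>EMAP:1</accession>
      <expression><strength>detected</strength></expression>
    </expressionByOntology>
    <expressionByOntology>
      <accession>EMAP:2</accession>
      <expression><strength>not detected</strength></expression>
    </expressionByOntology>
  </textAnnotation></annotation>
</hguMrcSubmission>"""

ASSAYS = ("inSituAssay", "antibodyAssay", "reporterAssay")


def components_expressing(doc, gene_symbol):
    """Return accessions of components where the gene is detected."""
    symbols = [doc.findtext(a + "/entityBeingDetected/symbol") for a in ASSAYS]
    if gene_symbol not in symbols:
        return []
    entries = doc.findall("annotation/textAnnotation/expressionByOntology")
    # Test the strength of each entry individually, not of the whole record.
    return [e.findtext("accession") for e in entries
            if e.findtext("expression/strength") == "detected"]


print(components_expressing(ET.fromstring(RECORD), "Shh"))
```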
7. Conclusion
7.1 Summary of the Work Done
The project developed an XML version of the EMAGE gene expression database.
At present, EMAGE is an object-based database and is part of the larger EMAP project
of the Human Genetics Unit of the MRC. During development, the project
investigated various approaches, technologies and tools that could be used to perform
similar tasks. The data in the current EMAGE database was transformed
into the new XML format. An XML schema was developed to validate the
newly created XML documents. As part of the project, a web-based
query interface was also developed, through which potential users of the
XML database can retrieve the desired data.
7.2 Summary of Evaluation
The purpose of this project was to convert the existing EMAGE into an XML
database and to explore and evaluate the tools, technologies and approaches involved
in the process. This evaluation has been discussed at the relevant places in this
thesis. Most of the work in this project does not produce a visible outcome; a user
evaluation was therefore neither possible nor required. The query interface is the
only visible outcome. However, it was developed mainly for demonstration purposes and
is not a full-fledged interface. The development and evaluation of a complete user
interface were not part of this project and are therefore listed in the Future Work
section. The performance evaluation of the developed XML database is also a future
activity.
7.3 Accomplishments
The following contributions have resulted from this project:
• The transformation code that converts the existing EMAGE data into XML
documents, and an XML schema for validating the transformed documents, are in
place. This was accomplished in cooperation with the MRC staff, particularly
Dr. Yiya Yang.
• Various tools for working with different XML technologies have been explored.
Their advantages and disadvantages have been noted in this document, which
should help the MRC select appropriate tools for further development of the
XML database.
• An XML database was set up using IBM DB2 9. This database holds the XML
documents.
• Retrieval of the desired data from the XML database using XQuery has been
demonstrated.
• A web-based query interface for potential users has been developed. Users
can interact with the XML database through this interface and get answers to
their queries.
• The difficulties encountered and the solutions used throughout this project
have been documented in this dissertation. This should help future developers
of the project.
7.4 Limitations Encountered
The main limitation encountered in the project was that a lot of time had to be spent
researching “side things”. Though equally important, these do not contribute
directly to the measurable output of the project. They included, for example,
setting up the web server and the DB2 connection, and validating the XML schema.
This left less time for tangible work. Another considerable limitation was that the
author was not involved in the XML schema design from the beginning. As a result, it
was sometimes difficult to understand the meaning, logic or need of a particular
element in the schema.
7.5 Skills Acquired
From this project, the author acquired considerable skills in a number of
technologies. These include developing XML schemas, performing XSL
transformations, using XPath and XQuery for XML data retrieval, and working with IBM
DB2 9, the Eclipse Java platform, JSP and PHP. In addition, the author became a
proficient user of various XML development tools, such as:
• Oxygen XML Editor 8.2 [ SyncRO Soft Ltd, www.oxygenxml.com ]
• EditiX 5.2.2 [ JAPIsoft, www.editix.com ]
• XMLWriter 2.7 [ Wattle Software, www.xmlwriter.net ]
• Altova® XMLSpy 2007 Professional Ed. [ Altova®, www.Altova.com/xmlspy ]
• Altova® MapForce 2007 Enterprise Ed. [ Altova®, www.Altova.com/mapforce ]
• Stylus Studio® 2007 XML Enterprise Suite [ DataDirect Technologies,
www.stylusstudio.com ]
Above all, the author gained a good understanding (or the Weltanschauung13) of how
XML databases can be used to store data that has traditionally been stored in
relational form, and how this XML data can be used to build websites. Naturally,
the project also gave insight into the content and workings of gene
expression databases.
7.6 Future Work
The XML database and a user interface catering to a few queries have been
successfully developed as part of this project. The dissertation has achieved its
aim of researching various approaches that could be helpful in
developing an XML database. Beyond this project, however, more work needs to be
done before the XML database can replace the object-based EMAGE database.
Future work that could build on this dissertation is given below.
7.6.1 Comprehensive Query Interface Development
The XML database contains gene expression data that users need to analyse thoroughly
in order to draw conclusions. A comprehensive and well-designed query
interface would be very useful in this regard. The potential users of the database are
mostly researchers with varying computer skills; the query interface should
therefore be easy to use. Applying website usability principles while
developing the interface is highly recommended.
7.6.2 Interface for Inserting New Data
This dissertation project only needed to deal with the existing data in the EMAGE
database. A future requirement would be an interface through which researchers from
around the world could add more data to the XML version of EMAGE. The
interface should allow users to save partial data either on the EMAGE server
or on the researcher's local machine. This is important because the data contained in
one EMAGE record often arrives gradually as the related laboratory experiments
progress. An XML-enabled web interface would be a good choice in this case. The
user would save the partial data through a web form, and the web application would
store it in an XML file on the server and keep it in file form until the user or the
database administrator submits it to the EMAGE database.
13 Weltanschauung is a German term used in philosophy. It means the “mental construct” or the “world view”.
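The idea of holding a partly completed record in an XML file until it is submitted can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a design for the EMAGE server: the wrapper element `draftSubmission`, the function names and the flat field layout are all hypothetical, and the field names are borrowed from the schema only as examples.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_draft(fields, path):
    """Write the fields filled in so far to an XML draft file."""
    root = ET.Element("draftSubmission")  # hypothetical wrapper element
    for name, value in fields.items():
        if value:  # skip fields the researcher has not filled in yet
            ET.SubElement(root, name).text = value
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_draft(path):
    """Read a draft back into a dict so that a web form can be pre-filled."""
    return {child.tag: child.text for child in ET.parse(path).getroot()}


# Example: the strain is still unknown, so it is left out of the draft.
draft_path = os.path.join(tempfile.mkdtemp(), "draft.xml")
save_draft({"commonName": "mouse", "strain": "", "sex": "female"}, draft_path)
print(load_draft(draft_path))
```

A draft written this way would still have to be completed and validated against the schema before being submitted to the database.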
7.6.3 Performance Evaluation
The XML version of the EMAGE database has been created, but it remains to be
investigated whether it can match or exceed the performance of the present
object-based database. Even though EMAGE does not contain a high number of records
at present, the equivalent XML database might take more time than the object
database to retrieve search results. Unless suitable XML indexes are defined, the
query engine of an XML database has to go through all the XML documents available in
the database. This may, or may not, affect the performance significantly. A carefully
conducted performance evaluation could reveal the answer.
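A simple starting point for such an evaluation is to time the same query several times against each database and compare the medians. The Python sketch below shows only the timing harness; `time_query` and `dummy_query` are hypothetical names, and the dummy workload stands in for a call that would submit a real query.

```python
import statistics
import time


def time_query(run_query, repeats=5):
    """Return the median wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def dummy_query():
    # Stand-in workload; replace with a call that submits a query
    # to the database under test.
    return sum(range(100000))


print("median seconds:", time_query(dummy_query))
```

Using the median rather than a single run reduces the influence of caching and of other activity on the machine.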
7.6.4 Query Optimization
Depending on the results of the performance evaluation, it may prove useful to
optimize the queries that the database application uses to answer questions from
the user. The objective of this exercise should be to minimize the time between
submitting a query to the database and receiving the answer.
7.7 Final Thoughts
XML has turned out to be a very important concept. With the continuous development
of related technologies, standards and tools, the scope of XML applications is
steadily increasing. The initiative taken by various relational database vendors to
provide native XML support in their products is likely to make XML
a “first class citizen” among the other data types in databases. This will help XML
realize its full potential, as it can then be used more easily in relational database
applications as well. Bioinformatics, like almost all other fields in need of data
management, has a number of applications where XML can make things easier
and better. Gene expression databases are among these applications.
Appendix A
Given below is the complete new MISFISHIE-compliant XML schema:
<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSpy v2007 rel. 3 (http://www.altova.com) by Lalit Kumar (Heriot-Watt University) -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    <xsd:documentation>MISFISHIE compliant schema for ISH/IHC data transfer</xsd:documentation>
  </xsd:annotation>
  <!--
  ========================================================================
  README:
  . purpose of this xml schema is to capture the data requirement analysis
    and to promote data sharing/integration among projects.
  . design guiding rule is to minimise name length, varieties of constructs
    and depth of nesting/hierarchy without losing essence.
  . all names use lower camel case.
  . all type names end with "Type".
  . mainly Venetian Blind style.
  . nonEmptyToken & nonEmptyString to indicate the elements which must have value.
  . extended/restricted simple types are anonymous unless they are utility type
  . complex types are always named.
  . all element names are singular while minOccurs/maxOccurs are used for cardinality.
  . accession (rather than ID) is used for any public identity
  . union is used to indicate preferred/common values
  ========================================================================
  -->
  <!--
  ========================================================================
  root element
  ========================================================================
  -->
  <xsd:element name="hguMrcSubmission" type="hguMrcSubmissionType"/>
  <xsd:complexType name="hguMrcSubmissionType">
    <xsd:sequence>
      <xsd:element name="administration" type="administrationType"/>
      <xsd:element name="specimen" type="specimenType"/>
      <xsd:element name="experiment" type="experimentType" minOccurs="0"/>
      <xsd:element ref="assayRef"/>
      <xsd:element name="result" type="resultType"/>
      <xsd:element name="annotation" type="annotationType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="contributor" type="contributorType"/>
      <xsd:element name="reference" type="referenceType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="accession" type="xsd:token" use="required"/>
    <xsd:attribute name="status" type="xsd:token" use="optional"/>
  </xsd:complexType>
  <!--
  ========================================================================
  specimen element
  ========================================================================
  -->
  <xsd:complexType name="specimenType">
    <xsd:sequence>
      <xsd:element name="organism" type="organismType"/>
      <xsd:element name="type">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="wholemount"/>
                <xsd:enumeration value="section"/>
                <xsd:enumeration value="section from wholemount"/>
                <xsd:enumeration value="whole cells"/>
                <xsd:enumeration value="sections of cells"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="sex" type="xsd:token" minOccurs="0"/>
      <xsd:element name="genotype" type="genotypeType"/>
      <xsd:element name="phenotype" type="phenotypeType" minOccurs="0"/>
      <xsd:element name="physiologicalState" type="xsd:token" minOccurs="0"/>
      <xsd:element name="supplier" type="supplierType" minOccurs="0"/>
      <xsd:element name="note" type="xsd:string" minOccurs="0"/>
      <xsd:element name="tissueExamined" type="tissueType" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="organismType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="commonName" type="nonEmptyToken"/>
        <xsd:element name="taxon" type="taxonType"/>
      </xsd:choice>
      <xsd:element name="stage" type="stageSystemType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="tissue" type="xsd:token" minOccurs="0"/>
      <xsd:element name="strain" type="nonEmptyToken" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="taxonType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:group ref="accession" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="genotypeType">
    <xsd:choice>
      <xsd:element name="wildType">
        <xsd:simpleType>
          <xsd:restriction base="xsd:token">
            <xsd:enumeration value="T"/>
            <xsd:enumeration value="t"/>
            <xsd:enumeration value="Y"/>
            <xsd:enumeration value="y"/>
            <xsd:enumeration value="TRUE"/>
            <xsd:enumeration value="true"/>
            <xsd:enumeration value="True"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="mutantAllele" type="mutantAlleleType" maxOccurs="unbounded"/>
    </xsd:choice>
  </xsd:complexType>
  <xsd:complexType name="mutantAlleleType">
    <xsd:choice>
      <xsd:sequence>
        <xsd:group ref="triplet"/>
        <xsd:element name="alleleOnFirstChromatid" type="nonEmptyToken"/>
        <xsd:choice minOccurs="0">
          <xsd:element name="alleleOnSecondChromatid" type="nonEmptyToken"/>
          <xsd:element name="nonPairedOrMissingChromosome" type="nonEmptyToken"/>
        </xsd:choice>
      </xsd:sequence>
      <xsd:element name="localAllele" type="localAlleleType"/>
    </xsd:choice>
  </xsd:complexType>
  <xsd:complexType name="localAlleleType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:element name="description" type="nonEmptyToken"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="phenotypeType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:group ref="accession" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="tissueType">
    <xsd:choice>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:group ref="accession"/>
    </xsd:choice>
  </xsd:complexType>
  <!--
  ========================================================================
  experiment element
  ========================================================================
  -->
  <xsd:complexType name="experimentType">
    <xsd:sequence>
      <xsd:element name="description" type="nonEmptyString"/>
      <xsd:element name="design" type="xsd:token" minOccurs="0"/>
      <xsd:element name="experimentalFactor" type="xsd:token" minOccurs="0"/>
      <xsd:element name="assayType" type="xsd:token" minOccurs="0"/>
      <xsd:element name="numberOfAssaysPerformed" type="xsd:integer"/>
      <xsd:element name="controlData" type="xsd:token" minOccurs="0"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <!--
  ========================================================================
  assay element
  ========================================================================
  -->
  <xsd:element name="assayRef" type="assayType"/>
  <xsd:element name="inSituAssay" type="inSituAssayType" substitutionGroup="assayRef"/>
  <xsd:element name="antibodyAssay" type="antibodyAssayType" substitutionGroup="assayRef"/>
  <xsd:element name="reporterAssay" type="reporterAssayType" substitutionGroup="assayRef"/>
  <xsd:complexType name="assayType">
    <xsd:sequence>
      <xsd:element name="firstLabel" type="nonEmptyToken"/>
      <xsd:element name="lastLabel" type="xsd:token" minOccurs="0"/>
      <xsd:element name="exogenousFactor" type="xsd:token" minOccurs="0"/>
      <xsd:element name="fixationReagent" type="xsd:token" minOccurs="0"/>
      <xsd:element name="embeddingReagent" type="xsd:token" minOccurs="0"/>
      <xsd:element name="clearingMethod" type="xsd:token" minOccurs="0"/>
      <xsd:element name="detectionProcedure" type="detectionProcedureType"/>
      <xsd:element name="protocol" type="protocolType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="protocolType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="value" type="nonEmptyString"/>
        <xsd:element name="linkedProtocol" type="nonEmptyString"/>
      </xsd:choice>
      <xsd:element name="type" minOccurs="0">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="general"/>
            <xsd:enumeration value="specimen pre treatment"/>
            <xsd:enumeration value="reagent production"/>
            <xsd:enumeration value="detection reagent binding"/>
            <xsd:enumeration value="staining"/>
            <xsd:enumeration value="embedding"/>
            <xsd:enumeration value="imaging"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="reagentTypeType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:element name="value" type="xsd:token"/>
      <xsd:element name="order">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="secondary"/>
                <xsd:enumeration value="tertiary"/>
                <xsd:enumeration value="quaternaery"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="detectionProcedureType">
    <xsd:sequence>
      <xsd:element name="signalDetectionMethod" type="nonEmptyToken"/>
      <xsd:element name="type" minOccurs="0">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="direct"/>
                <xsd:enumeration value="indirect"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="detectionReagentType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="name" type="nonEmptyToken"/>
        <xsd:element name="accession" type="nonEmptyToken"/>
      </xsd:choice>
      <xsd:element name="concentration" type="xsd:token" minOccurs="0"/>
      <xsd:element name="reagentType" type="reagentTypeType" minOccurs="0"/>
      <xsd:choice minOccurs="0">
        <xsd:element name="supplier" type="supplierType"/>
        <xsd:element name="localGenerated" type="xsd:token"/>
      </xsd:choice>
      <xsd:element name="permanentLabel" type="nonEmptyToken" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <!--
  ========================================================================
  reporter assay element
  ========================================================================
  -->
  <xsd:complexType name="reporterAssayType">
    <xsd:complexContent>
      <xsd:extension base="assayType">
        <xsd:sequence>
          <xsd:element name="detectionReagent" type="detectionReagentType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="entityBeingDetected" type="entityBeingDetectedByReporterType" minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="entityBeingDetectedByReporterType">
    <xsd:sequence>
      <xsd:group ref="triplet"/>
      <xsd:element name="note" type="nonEmptyString"/>
    </xsd:sequence>
  </xsd:complexType>
  <!--
  ========================================================================
  inSitu/antibody sharing elements
  ========================================================================
  -->
  <xsd:complexType name="sequenceType">
    <xsd:choice>
      <xsd:element name="sequenceField" type="sequenceFieldType"/>
      <xsd:element name="description" type="nonEmptyString"/>
    </xsd:choice>
  </xsd:complexType>
  <xsd:complexType name="sequenceFieldType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:choice minOccurs="0">
        <xsd:element name="sequenceInFile" type="fileType"/>
        <xsd:element name="sequenceDirect" type="xsd:token"/>
      </xsd:choice>
      <xsd:group ref="accession"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
      <xsd:choice minOccurs="0">
        <xsd:sequence>
          <xsd:element name="startLocation" type="xsd:integer"/>
          <xsd:element name="endLocation" type="xsd:integer"/>
        </xsd:sequence>
        <xsd:sequence>
          <xsd:element name="startLocationOfFragment" type="xsd:integer"/>
          <xsd:element name="fragmentSize" type="xsd:integer"/>
        </xsd:sequence>
        <xsd:sequence>
          <xsd:element name="endLocation" type="xsd:integer"/>
          <xsd:element name="fragmentSize" type="xsd:integer"/>
        </xsd:sequence>
        <xsd:element name="fivePrimePrimer" type="xsd:token"/>
        <xsd:element name="threePrimePrimer" type="xsd:token"/>
      </xsd:choice>
    </xsd:sequence>
    <xsd:attribute name="sequenceStatusType">
      <xsd:simpleType>
        <xsd:union memberTypes="anyMeaningfulToken">
          <xsd:simpleType>
            <xsd:restriction base="xsd:token">
              <xsd:enumeration value="fully-sequenced"/>
              <xsd:enumeration value="partially-sequenced"/>
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:union>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
  <xsd:complexType name="variantType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token"/>
      <xsd:element name="activity" type="xsd:token"/>
      <xsd:element name="activityType" minOccurs="0">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="inactive form"/>
                <xsd:enumeration value="activated form"/>
                <xsd:enumeration value="both"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="originType">
    <xsd:choice>
      <xsd:element name="cellLine" type="cellLineType"/>
      <xsd:element name="organism" type="organismType"/>
    </xsd:choice>
  </xsd:complexType>
  <!--
  ========================================================================
  antibody assay element
  ========================================================================
  -->
  <xsd:complexType name="antibodyAssayType">
    <xsd:complexContent>
      <xsd:extension base="assayType">
        <xsd:sequence>
          <xsd:element name="detectionReagent" type="detectionReagentType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="primaryReagentStorage" type="xsd:token" minOccurs="0"/>
          <xsd:element name="entityBeingDetected" type="entityBeingDetectedByAntibodyType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="thingToGenerateDetectionReagent" type="thingToGenerateAntibodyType" minOccurs="0"/>
          <xsd:element name="type" type="antibodyType" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="entityBeingDetectedByAntibodyType">
    <xsd:sequence>
      <xsd:group ref="triplet"/>
      <xsd:element name="sequence" type="sequenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="proteinVariant" type="proteinVariantType" minOccurs="0"/>
      <xsd:element name="anatomicalStructure" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="speciesSpecificity" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="molecularGroup" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="cdMarker" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="proteinVariantType">
    <xsd:complexContent>
      <xsd:extension base="variantType">
        <xsd:sequence>
          <xsd:element name="proteinIsoform" type="xsd:token" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="thingToGenerateAntibodyType">
    <xsd:sequence>
      <xsd:element name="antigen" type="nonEmptyToken"/>
      <xsd:element name="supplier" type="supplierType" minOccurs="0"/>
      <xsd:element name="sequence" type="sequenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="originOfAntigen" type="originType" minOccurs="0"/>
      <xsd:element name="proteinDomainCovered" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="postTranslationalModification" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="carrierOrFusion" type="xsd:token" minOccurs="0"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="antibodyType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="monoclonal" type="monoclonalType"/>
        <xsd:element name="polyclonal" type="polyclonalType"/>
      </xsd:choice>
      <xsd:element name="chainSubType" type="xsd:token" minOccurs="0"/>
      <xsd:element name="productionMethod" minOccurs="0"/>
      <xsd:element name="purificationMethod" type="xsd:token" minOccurs="0"/>
      <xsd:element name="immunoGlobulinIsoType" minOccurs="0">
        <xsd:simpleType>
          <xsd:restriction base="xsd:token">
            <xsd:enumeration value="I"/>
            <xsd:enumeration value="i"/>
            <xsd:enumeration value="G"/>
            <xsd:enumeration value="gm"/>
            <xsd:enumeration value="GM"/>
            <xsd:enumeration value="Gm"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="monoclonalType">
    <xsd:sequence>
      <xsd:element name="hybridoma" type="xsd:token" minOccurs="0"/>
      <xsd:element name="phageDisplay" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="polyclonalType">
    <xsd:sequence>
      <xsd:element name="speciesImmunized" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="cellLineType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token"/>
      <xsd:group ref="accession"/>
    </xsd:sequence>
  </xsd:complexType>
  <!--
  ========================================================================
  inSitu assay element
  ========================================================================
  -->
  <xsd:complexType name="inSituAssayType">
    <xsd:complexContent>
      <xsd:extension base="assayType">
        <xsd:sequence>
          <xsd:element name="detectionReagent" type="probeDetectionReagentType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="primaryReagentStorage" type="xsd:token" minOccurs="0"/>
          <xsd:element name="entityBeingDetected" type="entityBeingDetectedByProbeType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="thingToGenerateDetectionReagent" type="thingToGenerateProbeType" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="probeDetectionReagentType">
    <xsd:complexContent>
      <xsd:extension base="detectionReagentType">
        <xsd:sequence>
          <xsd:element name="chemistry">
            <xsd:simpleType>
              <xsd:union memberTypes="anyMeaningfulToken">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:token">
                    <xsd:enumeration value="RNA"/>
                    <xsd:enumeration value="DNA"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:union>
            </xsd:simpleType>
          </xsd:element>
          <xsd:element name="direction">
            <xsd:simpleType>
              <xsd:union memberTypes="anyMeaningfulToken">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:token">
                    <xsd:enumeration value="sense"/>
                    <xsd:enumeration value="antisense"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:union>
            </xsd:simpleType>
          </xsd:element>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="entityBeingDetectedByProbeType">
    <xsd:sequence>
      <xsd:group ref="triplet"/>
      <xsd:element name="sequence" type="sequenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="spliceVariant" type="spliceVariantType" minOccurs="0"/>
      <xsd:element name="anatomicalStructure" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="spliceVariantType">
    <xsd:complexContent>
      <xsd:extension base="variantType">
        <xsd:sequence>
          <xsd:element name="transcriptSplice" type="xsd:token" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="thingToGenerateProbeType">
    <xsd:sequence>
      <xsd:element name="cloneName" type="nonEmptyToken"/>
      <xsd:element name="supplier" type="supplierType" minOccurs="0"/>
      <xsd:element name="sequence" type="sequenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="templateDNAType" type="templateDNATypeType" minOccurs="0"/>
      <xsd:element name="originOfTemplate" type="originType" minOccurs="0"/>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="templateDNATypeType">
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name="genomic" type="xsd:token" maxOccurs="3"/>
        <xsd:element name="cdna" type="xsd:token" maxOccurs="3"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>
  <!--
  ========================================================================
  result element
  ========================================================================
  -->
  <xsd:complexType name="resultType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:element name="element" type="supplementaryFileType"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="supplementaryFileType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:element name="file" type="fileType"/>
      <xsd:element name="resolution" type="xsd:token" minOccurs="0"/>
      <xsd:element name="mode" type="xsd:token" minOccurs="0"/>
      <xsd:element name="magnification" type="xsd:token" minOccurs="0"/>
      <xsd:element name="photographicPlatform" type="xsd:token" minOccurs="0"/>
      <xsd:element name="description">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:string">
                <xsd:enumeration value="default image for single section"/>
                <xsd:enumeration value="default image for default section in multi-sections"/>
                <xsd:enumeration value="any other image of default section in multi-sections"/>
                <xsd:enumeration value="any other assay image"/>
                <xsd:enumeration value="multi-section montage image"/>
                <xsd:enumeration value="movie of 3D voxel"/>
                <xsd:enumeration value="best frame of the movie of 3D voxel"/>
                <xsd:enumeration value="OPT default image"/>
                <xsd:enumeration value="OPT wlz"/>
                <xsd:enumeration value="OPT movie"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="height" type="xsd:integer" minOccurs="0"/>
      <xsd:element name="width" type="xsd:integer" minOccurs="0"/>
      <xsd:element name="position" type="sectionType" minOccurs="0"/>
      <xsd:choice minOccurs="0">
        <xsd:element name="nonOverlayChannel" type="channelType"/>
        <xsd:element name="multipleOverlayChannel" type="channelType" minOccurs="2" maxOccurs="unbounded"/>
      </xsd:choice>
      <xsd:element name="note" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="channelType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:element name="falseColour" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <!--
  ========================================================================
  annotation element
  ========================================================================
  -->
  <xsd:complexType name="annotationType">
    <xsd:sequence>
      <xsd:element name="referenceStage" type="stageSystemType" minOccurs="0"/>
      <xsd:element name="annotator">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="EMAGE editor"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="confidenceOfAnnotator" type="confidenceType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="textAnnotation" type="textAnnotationType" minOccurs="0"/>
      <xsd:element name="imageAnnotation" type="imageAnnotationType" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="confidenceType">
    <xsd:sequence>
      <xsd:element name="aspect">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:string">
                <xsd:enumeration value="morphological match to model"/>
                <xsd:enumeration value="pattern clarity and extraction"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="level">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="low"/>
                <xsd:enumeration value="medium"/>
                <xsd:enumeration value="high"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="textAnnotationType">
    <xsd:sequence>
      <xsd:element ref="textAnnotationRef" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="imageAnnotationType">
    <xsd:sequence>
      <xsd:element ref="imageAnnotationRef" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="textAnnotationRef" type="expressionAnnotationType"/>
  <xsd:element name="expressionByOntology" type="expressionByOntologyType" substitutionGroup="textAnnotationRef"/>
  <xsd:element name="imageAnnotationRef" type="expressionAnnotationType"/>
  <xsd:element name="expressionByWholemount" type="expressionByImageType" substitutionGroup="imageAnnotationRef"/>
  <xsd:element name="expressionByVoxel" type="expressionByImageType" substitutionGroup="imageAnnotationRef"/>
  <xsd:complexType name="expressionAnnotationType">
    <xsd:sequence>
      <xsd:element name="expression" type="expressionType"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="expressionByOntologyType">
    <xsd:complexContent>
      <xsd:extension base="expressionAnnotationType">
        <xsd:sequence>
          <xsd:group ref="accession"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="expressionByImageType">
    <xsd:complexContent>
      <xsd:extension base="expressionAnnotationType">
        <xsd:sequence>
          <xsd:element name="correspondingResultName" type="xsd:token" minOccurs="0"/>
          <xsd:element name="file" type="fileType"/>
          <xsd:element name="section" type="sectionType" minOccurs="0" maxOccurs="unbounded"/>
          <xsd:element name="referenceModel" type="xsd:token" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="sectionType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:element name="x" type="xsd:decimal"/>
      <xsd:element name="y" type="xsd:decimal"/>
      <xsd:element name="z" type="xsd:decimal"/>
      <xsd:element name="theta" type="xsd:decimal"/>
      <xsd:element name="phi" type="xsd:decimal"/>
      <xsd:element name="distance" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="expressionType">
    <xsd:sequence>
      <xsd:element name="strength">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="not detected"/>
                <xsd:enumeration value="detected"/>
                <xsd:enumeration value="present"/>
                <xsd:enumeration value="not examined"/>
                <xsd:enumeration value="uncertain"/>
                <xsd:enumeration value="possible"/>
                <xsd:enumeration value="strong"/>
                <xsd:enumeration value="moderate"/>
                <xsd:enumeration value="weak"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="pattern" minOccurs="0">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="graded"/>
                <xsd:enumeration value="homogenous"/>
                <xsd:enumeration value="single cell"/>
                <xsd:enumeration value="spotted"/>
                <xsd:enumeration value="regional"/>
                <xsd:enumeration value="n/a"/>
                <xsd:enumeration value="not applicable"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="location" minOccurs="0" maxOccurs="unbounded">
        <xsd:simpleType>
          <xsd:union memberTypes="anyMeaningfulToken">
            <xsd:simpleType>
              <xsd:restriction base="xsd:token">
                <xsd:enumeration value="dorsal"/>
                <xsd:enumeration value="ventral"/>
                <xsd:enumeration value="anterior"/>
                <xsd:enumeration value="posterior"/>
                <xsd:enumeration value="caudal"/>
                <xsd:enumeration value="deep"/>
                <xsd:enumeration value="lateral"/>
                <xsd:enumeration value="medial"/>
                <xsd:enumeration value="proximal"/>
                <xsd:enumeration value="radial"/>
                <xsd:enumeration value="surface"/>
                <xsd:enumeration value="n/a"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:union>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="note" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <!--
  ========================================================================
  administration element
  ========================================================================
  -->
  <xsd:complexType name="administrationType">
    <xsd:sequence>
      <xsd:element name="softwareType" type="xsd:token" minOccurs="0"/>
      <xsd:element name="softwareVersion" type="xsd:token" minOccurs="0"/>
      <xsd:element name="creationDate" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="createdBy" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="lastModificationDate" type="xsd:string" minOccurs="0"/>
      <xsd:element name="modifiedBy" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="schemaVersion" type="xsd:token" fixed="1.0"/>
  </xsd:complexType>
  <!--
  ========================================================================
  contributor element
  ========================================================================
  -->
  <xsd:complexType name="contributorType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string" minOccurs="0"/>
      <xsd:element name="contactPerson" type="personType"/>
      <xsd:sequence minOccurs="0" maxOccurs="unbounded">
        <xsd:element ref="roleRef"/>
      </xsd:sequence>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:group name="contact">
    <xsd:sequence>
      <xsd:element name="tel" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="email" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="fax" type="xsd:token" minOccurs="0"/>
      <xsd:element name="url" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:group>
  <xsd:element name="roleRef" type="personType"/>
  <xsd:element name="submitter" type="personType" substitutionGroup="roleRef"/>
  <xsd:element name="principalInvestigator" type="personType" substitutionGroup="roleRef"/>
  <xsd:element name="acknowledgement" type="acknowledgementType" substitutionGroup="roleRef"/>
  <xsd:complexType name="personType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:group ref="contact"/>
      <xsd:element name="organization" type="organizationType"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="organizationType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:group ref="address"/>
      <xsd:group ref="contact"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="acknowledgementType">
    <xsd:complexContent>
      <xsd:extension base="personType">
        <xsd:sequence>
          <xsd:element name="description" type="xsd:token" minOccurs="0"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:group name="address">
    <xsd:sequence>
      <xsd:element name="addressOne" type="xsd:string"/>
      <xsd:element name="addressTwo" type="xsd:string" minOccurs="0"/>
      <xsd:element name="addressThree" type="xsd:string" minOccurs="0"/>
      <xsd:element name="city" type="xsd:string" minOccurs="0"/>
      <xsd:element name="county" type="xsd:string" minOccurs="0"/>
      <xsd:element name="postcode" type="xsd:string" minOccurs="0"/>
      <xsd:element name="country" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:group>
  <!--
  ========================================================================
  reference element
  ========================================================================
  -->
  <xsd:complexType name="referenceType">
    <xsd:sequence>
      <xsd:element name="history" type="linkType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="relation" type="linkType" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="publication" type="publicationType" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="linkType">
    <xsd:sequence>
      <xsd:element name="type" type="xsd:token" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:group ref="accession"/>
      <xsd:element name="url" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="publicationType">
    <xsd:sequence>
      <xsd:element name="author" type="nonEmptyToken"/>
      <xsd:element name="journal" type="xsd:token" minOccurs="0"/>
      <xsd:element name="title" type="nonEmptyToken"/>
      <xsd:element name="volume" type="xsd:token" minOccurs="0"/>
      <xsd:element name="issue" type="xsd:token" minOccurs="0"/>
      <xsd:element name="year" type="xsd:token"/>
      <xsd:element name="page" type="xsd:token" minOccurs="0"/>
      <xsd:group ref="accession" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <!--
  ========================================================================
  utility types
  ========================================================================
  -->
  <!-- begin to relax so that existing data can be validated -->
  <xsd:simpleType name="nonEmptyString">
    <xsd:restriction base="xsd:string">
      <!-- <xsd:minLength value="1"/> -->
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name="nonEmptyToken">
    <xsd:restriction base="xsd:token">
      <!-- <xsd:minLength value="1"/> -->
    </xsd:restriction>
  </xsd:simpleType>
  <!-- end to relax so that existing data can be validated -->
  <xsd:simpleType name="anyMeaningfulToken">
    <xsd:restriction base="xsd:token"/>
  </xsd:simpleType>
  <xsd:complexType name="stageSystemType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
      <xsd:element name="value" type="nonEmptyToken"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:group name="accession">
    <xsd:sequence>
      <xsd:element name="accession" type="nonEmptyToken" minOccurs="0"/>
      <xsd:element name="source" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:group>
  <xsd:group name="triplet">
    <xsd:sequence>
      <xsd:element name="symbol" type="nonEmptyToken"/>
      <xsd:element name="accession" type="xsd:token" minOccurs="0"/>
      <xsd:element name="name" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:group>
  <xsd:complexType name="supplierType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:element name="catalogueNumber" type="xsd:token" minOccurs="0"/>
      <xsd:element name="lotNumber" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="fileType">
    <xsd:sequence>
      <xsd:element name="name" type="nonEmptyToken"/>
      <xsd:element name="type" type="xsd:token" minOccurs="0"/>
      <xsd:element name="zipFileName" type="xsd:token" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="nonEmptyToken" type="xsd:token"/>
</xsd:schema>
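As an illustration of how the annotation part of the schema might be used, the fragment below sketches a textAnnotation element in which expressionByOntology stands in for the head element textAnnotationRef through its substitution group. It is only a sketch: all values are hypothetical placeholders rather than real EMAGE data, and namespace declarations are omitted on the assumption that element names are unqualified, since the schema's target namespace settings are declared outside this part of the listing.

```xml
<!-- Hypothetical instance fragment; all values are placeholders -->
<textAnnotation>
  <!-- expressionByOntology substitutes for the head element textAnnotationRef -->
  <expressionByOntology>
    <!-- inherited from expressionAnnotationType -->
    <expression>
      <strength>detected</strength>
      <pattern>regional</pattern>
      <location>dorsal</location>
      <location>anterior</location>
      <note>placeholder note</note>
    </expression>
    <!-- added by the extension through the "accession" group; both are optional -->
    <accession>EXAMPLE:0001</accession>
    <source>example ontology</source>
  </expressionByOntology>
</textAnnotation>
```

Because strength, pattern and location are each defined as a union with anyMeaningfulToken, values outside the enumerated lists would also be accepted by a validator; the enumerations document the preferred vocabulary rather than enforce it.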
Other Appendices (on CD-ROM)
The other appendices are included on the CD-ROM attached to this dissertation.
They contain the code used or produced during this project, and the CD-ROM
contains an index page listing all of them. This thesis is also available
in electronic form on the CD-ROM.
The CD should run automatically when inserted into the CD drive. If it does not,
open the CD drive and open the file index.htm. The following page, which lists all
the files on the CD, will appear.