Integration of Bioinformatics Web Services
through the Search Computing Technology
Davide Chicco [email protected]
Dipartimento di Elettronica e Informazione
Doctoral Minor Research
Project Defense
19th November 2012
SUMMARY
1. The problem
• Multidomain questions
2. The proposed solution
• GPDW Data Warehouse
• Search Computing
• Bio-SeCo
3. Developed and added services
• Exploiting GPDW Data Warehouse
• Semantic Similarity
4. Conclusions
3
Data and search service scenario
in the Life Sciences
• In the Life Sciences: numerous data, sparsely distributed in
many heterogeneous sources
• Many are ranked data (or partially ranked) of various
types, representing different phenomena, e.g.:
– physical ordering, e.g. within a genome
– analytical order through algorithmically assigned
scores, e.g. representing levels of sequence similarity
– experimentally measured values, such as gene
expression levels
• The ordering may represent a range of different notions,
such as quantity, confidence, or location
4
Life Science questions and their answering
– Several Life Science questions:
- are complex
- to be answered require integration and comprehensive
evaluation of different data
– often distributed, many of which ranked
• Answering complex questions requires integration of vertical
search services to create multi-topic searches
• where the different topic searches either refine or augment previous
search results
• Bioinformatics data integration platforms exist
– Ordered data are poorly served or no supported at all by
current data integration platforms
5
Life Science multidomain question
Example: “Which genes encode proteins in different
organisms with high sequence similarity to a protein X and
have some biomedical features in common e.g. up/down
significantly co-expressed in the same biological tissue or
condition Y and involved in the biological function Z?”
Information to answer such queries is available on the Internet,
but no available software system is capable of computing the
answer
The user should search in
different resources, often
indipendent.
6
• Several integrated databanks, including:
• Entrez Gene, Ensembl
• Homologene
• IPI, UniProt/Swiss-Prot
• Gene Ontology, GOA
• BioCyc, KEGG, Reactome
• InterPro, Pfam
• OMIM, eVOC, …
• Numerous integrated data, including:
• 8,085,152 genes of 8,410 organisms
• 31,347,655 proteins of 367,853 specie
• 33,252 Gene Ontology terms and 61,899 relations (is a, part of)
• 27,667 biochemical pathways
• 14,163 protein domains; 7,215 OMIM genetic disorders; …
Homologene
Entrez
Gene IPI
eVOC KEGG
Reactome Gene
Ontology
On-line databanks
Genomic and Proteomic
Data Warehouse
Database
server
GOA BioCyc
Automatic
updating
procedures
GPDW Data Warehouse
7
Search Computing project at PoliMi
Search Computing (SeCo) aims at:
1. Developing the informatics framework required for
computing multi-topic searches by combing single topic
search results from search engines, which are often ranked,
with other data and computational resources
• directly supporting multi-topic ordered data
• taking into account order when the results of several
requests are combined
• enabling exploration and expansion of search results
2. Applying SeCo technology in different fields, including Life
Sciences => Bio-SeCo: Support answering complex
bioinformatics queries
8
Bio-SeCo: SeCo technologies to answer Life Science questions
Life Science example query:
“Which genes encode proteins in different organisms with
high sequence similarity to a protein X and have some
biomedical features in common, e.g. up/down significantly
co-expressed in the same biological tissue or condition Y
and involved in a biological function Z?”
This multi-topic case study question can be decomposed into
the following four single topic sub-queries, each of these sub-
queries can be mapped to an available search service.
9
Bio-SeCo: SeCo technologies to answer Life Science questions
• “Which proteins in different organisms have high sequence
similarity to a protein X ?”
BLAST, a sequence similarity search program, in one
of its many implementations, e.g. WU-BLAST or
NCBI-Blast
• “Which genes encode which proteins ?”
GPDW (Genomic and Proteomic Data Warehouse), a
query service to a database of genomic and proteomic
data (GPDW_protein2gene)
10
Bio-SeCo: SeCo technologies to answer Life Science questions
• “Which genes are up/down significantly co-expressed in the
same biological condition / tissue Y ?”
Array Express Gene Expression Atlas, a search
engine of gene expression data
• “Which genes are involved in a biological function Z ?
GPDW (Genomic and Proteomic Data Warehouse), a
query service to a database of genomic and proteomic
data (GPDW_gene2biologicalFunctionFeature)
11
Bio-SeCo: SeCo technologies to answer Life Science questions
Each quesiton part answer is integrated with others, with all
the ranked results found
BLAST
ArrayExpress
GPDW_gene2biologicalFunctionFeature
GPDW_protein2gene
12
What I have done for my Minor Research project
13
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Is_involved_in
Semantic network: before
14
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Genetic Disorder
Is_involved_in
Codes
Is_involved_in
Pathway
Is_involved_in Is_involved_in
Is_involved_in
Is_functional_similar_to
Semantic network: now
15
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Genetic Disorder
Is_involved_in
Codes
Is_involved_in
Pathway
Is_involved_in Is_involved_in
Is_involved_in
Is_functional_similar_to
Services I added: GPDW exploitation
A Genetic Disorder is an illness caused by abnormalities in genes
or chromosomes, especially a condition that is present from
before birth.
In biochemistry, Metabolic Pathways are series of chemical
reactions occurring within a cell. In each pathway, a principal
chemical is modified by a series of chemical reactions.
16
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Genetic Disorder
Is_involved_in
Codes
Is_involved_in
Pathway
Is_involved_in Is_involved_in
Is_involved_in
Is_functional_similar_to
Services I added: GPDW exploitation
Which Genetic Disorders is the Gene X involved in ?
GPDW (Genomic and Proteomic Data Warehouse), a query
service to a database of genomic and proteomic data
(GPDW_gene2geneticDisorder)
17
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Genetic Disorder
Is_involved_in
Codes
Is_involved_in
Pathway
Is_involved_in Is_involved_in
Is_involved_in
Is_functional_similar_to
Services I added: GPDW exploitation
Which Genetic Disorders is the Protein Y involved in ?
GPDW (Genomic and Proteomic Data Warehouse), a query
service to a database of genomic and proteomic data
(GPDW_protein2geneticDisorder)
18
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Genetic Disorder
Is_involved_in
Codes
Is_involved_in
Pathway
Is_involved_in Is_involved_in
Is_involved_in
Is_functional_similar_to
Services I added: GPDW exploitation
Which Genes does the Genetic Disorder X involve?
GPDW (Genomic and Proteomic Data Warehouse), a query
service to a database of genomic and proteomic data
(GPDW_geneticDisorder2gene)
19
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Genetic Disorder
Is_involved_in
Codes
Is_involved_in
Pathway
Is_involved_in Is_involved_in
Is_involved_in
Is_functional_similar_to
Services I added: GPDW exploitation
Which Proteins does the Genetic Disorder X involve?
GPDW (Genomic and Proteomic Data Warehouse), a query
service to a database of genomic and proteomic data
(GPDW_geneticDisorder2gene)
20
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Genetic Disorder
Is_involved_in
Codes
Is_involved_in
Pathway
Is_involved_in Is_involved_in
Is_involved_in
Is_functional_similar_to
Services I added: GPDW exploitation
Same questions and GPDW services for Metabolic Pathways: • GPDW_gene2pathway
• GPDW_protein2pathway
• GPDW_pathway2gene
• GPDW_pathway2protein
21
Gene Protein
Biological Function Feature
Gene Expression
Is_similar_to
Is_encoded_by
Has
Is_involved_in
Genetic Disorder
Is_involved_in
Codes
Is_involved_in
Pathway
Is_involved_in Is_involved_in
Is_involved_in
Is_functional_similar_to
Services I added: GPDW exploitation
A Biological Function Feature is an item of information about a
gene or a protein. It defines a certain peculiarity of a biomolecular
entity. E.g.: “is involved in lung cancer”
GPDW_protein2biological_function_feature
22
Services I added
• These new services (Genetic Disorder and Pathway) are
very useful and important, but they don’t take advantage of
the main novelty provided by the Search Computing
technology: the Integration of ranked results
• There’s no ranking on “being involved” in a Genetic
Disorder or a Pathway…
23
Services I added: Gene Semantic Similarity
• The other service (SemSim) I integrated on Bio-SeCo is
related to the computation of the semantic similarity of a
gene into a list of genes:
• This service provides ranked results (given a gene X, it
returns a list of gene ranked from the most semantic similar
to X to the less semantic similar one)
• SemSim takes advantage of the Search Computing
potentiality of integrating ranked results
Gene
Is_functional_similar_to
24
Semantic Similarity?!? What does it mean?
• Keypoint: given the gene X and gene Y, how much similar
are they?
• Semantically similar genes can be involved in similar
activities, can be involved in similar pathways, and can have
many annotations in common
• To measure this similarity, I chose Latent Semantic
Indexing method, based on a matrix build with gene-
related annotations
25
Biomolecular annotation
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features
• This information is expressed through controlled
vocabularies, sometimes structured as ontologies, where
every controlled term of the vocabulary is associated with a
unique alphanumeric code
• The association of such a code with a gene or protein ID
constitutes an annotation
Gene /
Protein
Biological function feature
Annotation
gene2bff
26
Biomolecular annotation
• The association of an information/feature with a gene or
protein ID constitutes an annotation
• Annotation example:
• gene: GD4
• feature: “is present in the mitochondrial membrane”
Gene /
Protein
Biological function feature
Annotation
gene2bff
27
Latente Semantic Indexing:
Singular Value Decomposition – SVD
– Annotation matrix A {0, 1} m x n
− m rows: genes / proteins
− n columns: annotation terms
A(i,j) = 1 if gene / protein i is annotated to term j or to any
descendant of j in the considered ontology structure (true
path rule)
A(i,j) = 0 otherwise (it is unknown)
term01 term02 term03 term04 … termN
gene01 0 0 0 0 … 0
gene02 0 1 1 0 … 1
… … … … … … …
geneM 0 0 0 0 … 0
28
Latente Semantic Indexing:
Singular Value Decomposition – SVD
– Annotation matrix A {0, 1} m x n
− m rows: genes / proteins
− n columns: annotation terms
A(i,j) = 1 if gene / protein i is annotated to term j or to any
descendant of j in the considered ontology structure (true
path rule)
A(i,j) = 0 otherwise (it is unknown)
term01 term02 term03 term04 … termN
gene01 0 0 0 0 … 0
gene02 0 1 1 0 … 1
… … … … … … …
geneM 0 0 0 0 … 0
29
Compute SVD:
Compute reduced rank approximation:
• An annotation prediction is performed by computing a reduced
rank approximation Ak of the annotation matrix A
(where 0 < k < r, with r the number of non zero singular values
of A, i.e. the rank of A)
TA U V
TA U V
TA U V TA U V TA U V
T
k k k kA U V
k
T
k k k kA U V T
k k k kA U V T
k k k kA U V T
k k k kA U V
k
Latente Semantic Indexing: Singular Value Decomposition – SVD
30
Compute reduced rank approximation:
• A : genes – features matrix
• Uk : gene vectors matrix
• Σk : singular value matrix
• VTk : feature vectors matrix
T
k k k kA U V
k
T
k k k kA U V T
k k k kA U V T
k k k kA U V T
k k k kA U V
k
Latente Semantic Indexing: Singular Value Decomposition – SVD
31
• Uk : gene vectors matrix
• Σk : singular value matrix
• VTk : feature vectors matrix
• These matrices can be considered for measuring the distances
between objects (genes or feature) in the k-dimensional space.
• For example, is possibile to compute the distance between
two gene vector to understand their similarity level. The same
thing could be done for features.
Latente Semantic Indexing: Singular Value Decomposition – SVD
32
• Uk : gene vectors matrix
• Σk : singular value matrix
• VTk : feature vectors matrix
• For our implementation of the LSI, we chose to compute the
cosine similarity as measure of the semantic similarity
between genes.
Latente Semantic Indexing: Singular Value Decomposition – SVD
33
• A preprocessing software computes the Singular Value
Decomposition (SVD) algorithm
• It prints the matrices (Uk, Σk, VT
k) in three different files
• These files are inserted into the \data\ directory of the SemSim
REST web application
• SemSim (JSP + Java) computes the Latent Semantic Indexing
(LSI) measures and returns the ranked list of genes
Minor Research Project
34
• Developed with REST technology
• Integrated on Bio-SeCo as an external service, with a wrapper
• Input: gene (ID, name, taxonomy)
Minor Research Project
35
• Input: list of genes ranked on their semantic similarity with the
input gene
Minor Research Project
36
• Now is possible to answer to many other biological questions.
For example:
Among the proteins that are encoded by genes, in Chicken
organism, with higher functional semantic similarity to gene X,
which are those with higher sequence similarity to protein Y ?
Minor Research Project
SemSim ProteinByGene Sequence
Alignment
Input
Output
37
• Now is possible to answer to many other biological questions,
that involve Gene Semantic Similarity computation, Genetic
Disorders or Metabolic Pathways. For example:
Among the proteins that are encoded by genes, in Chicken
organism, with higher functional semantic similarity to gene X,
which are those with higher sequence similarity to protein Y ?
Minor Research Project
SemSim ProteinByGene Sequence
Alignment
Input
Output DEMO
38
Thanks for your attention