Detecting Phenotype-Specific Interactions Between Biological Processes
Nadeem A. Ansari
Department of Computer Science, Wayne State University
Detroit, MI 48202
1
Outline
• Biological background
• Motivation and problem description
• Challenges and limitations
• Mathematical background
• Detecting changed interactions between biological processes in a phenotype
• Improvements
• Results
• Summary
2
Cells, proteins, and DNA
• Cells: fundamental units of life that contain all the working machinery necessary for their functioning
• Proteins: the main contributors of this working machinery
• Deoxyribonucleic acid (DNA): contains the blueprint for making the working machinery
• Gene expression: the process of making the working machinery
4
DNA
• A linear molecule of two strands, each composed of subunits called nucleotides
• Nucleotide types: Adenine – A, Cytosine – C, Guanine – G, Thymine – T
5
DNA
• Base pairing:
  … A A C G G A T …
  … T T G C C T A …
6
Transcription
• Information stored in DNA letters is transcribed into Ribonucleic acid (RNA)
• RNA: a chain of nucleotides - A, C, G, U (uracil)
7
… G T G C A T … DNA
… C A C G U A … RNA
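The base-pairing and transcription examples on these slides can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it treats the DNA strand shown as the template that is paired base by base with RNA.

```python
# Illustrative helpers; the names are hypothetical and not taken from the slides.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand: str) -> str:
    """Return the base-paired partner of a DNA strand, position by position."""
    return "".join(DNA_COMPLEMENT[base] for base in strand)


def transcribe(template: str) -> str:
    """Pair each DNA template base with its RNA partner (U takes the place of T)."""
    return complement_strand(template).replace("T", "U")


print(complement_strand("AACGGAT"))  # TTGCCTA
print(transcribe("GTGCAT"))          # CACGUA
```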
Translation
8
• Information stored in RNA is translated into chains of amino acids - proteins
Gene expression
• The process of making the working machinery of a cell.
9
10
• Regions of DNA that are synthesized into functional RNA and proteins are known as genes
• An observable characteristic (or trait) of an organism caused by gene expression is known as a phenotype.
Gene expression measurement – why?
• Various stimuli cause changes in gene expression
• A change in expression level results in under- or overproduction of the working machinery, which can lead to diseases and other phenotypes
• All cells contain the same DNA but express genes selectively
• Measuring gene expression can help us understand the underlying biological phenomena
11
Gene expression measurements
• Typically, researchers measure gene expression in two different tissues or cell samples
  – e.g., cells treated with a drug vs. untreated cells
• Genes expressed differently than in a control sample are called differentially expressed (DE) genes
• High-throughput technologies such as DNA microarrays measure the expression levels of thousands of genes
12
Genes and annotations
• Functional characteristics of gene products are stored in annotation databases such as the Gene Ontology
• Gene Ontology (GO): a controlled and structured vocabulary
  – Molecular functions, biological processes, and cellular components
• Structured as directed acyclic graphs (DAGs)
  – Nodes represent terms
  – Edges represent relationships
• Parent-child relations (a term can have more than one parent)
  – is-a, part-of, and regulates (negatively, positively)
13
Biological processes – GO subset
• GO is a set of terms and their definitions organized in a structure that reflects their relationships
• GO also provides a set of annotations describing what is known about each gene product
14
Motivation and problem description
• Various stimuli cause differential gene expression, which results in the over- and underproduction of proteins
• Over- and underproduction of proteins can result in the expression of a disease and a disease-specific phenotype
• Understanding gene behavior can help us understand diseases in new ways, e.g., by identifying drug targets for curing diseases
16
Motivation and problem description
• Current approaches look for the biological functions that are under- or over-represented in the phenotype-specific gene expression patterns
• However, life is complex and biological functions also interact
• These interactions change in a phenotype
• Understanding changed interactions between biological functions is important in understanding the underlying biological mechanism that resulted in the phenotype
17
Goals
• Our goal is to detect the interactions between biological functions that have changed significantly in a given phenotype
• We detect these interactions between GO biological processes that are annotated with the genes differentially expressed in the phenotype
19
Challenges and limitations
• There is no simple way to establish which biological functions are important
  – No universally accepted statistical model exists
• Finding relationships between biological processes using mathematical models is challenging
• No known statistical model detects changed interactions in a given phenotype
• Using GO annotations presents its own challenges
20
Challenges and limitations
• GO is incomplete and is updated on a continuous basis
  – Missing information regarding gene annotations
• GO contains inconsistencies
  – New research may make previous annotations obsolete
• The GO hierarchy poses the challenge of dependencies
  – Genes annotated with a specific term are assumed to be annotated with all the ancestors of that term
21
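The dependency described in the last bullet (a gene annotated with a term is implicitly annotated with every ancestor of that term) can be sketched as a walk up the DAG. The term names, gene names, and function names below are made up for illustration; real GO identifiers and relationship types are not modeled.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect every ancestor of a term by following parent links in the DAG."""
    found: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def propagate(direct: Dict[str, Set[str]],
              parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Add all ancestor terms to each gene's directly annotated terms."""
    full: Dict[str, Set[str]] = {}
    for gene, terms in direct.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Toy DAG with placeholder names: term "c" has two parents, "a" and "b".
toy_parents = {"c": {"a", "b"}, "a": {"root"}, "b": {"root"}}
toy_direct = {"gene1": {"c"}, "gene2": {"a"}}
propagated = propagate(toy_direct, toy_parents)
```

After propagation, columns of a gene-function matrix that belong to a term and to its ancestors overlap heavily, which is the dependency the slide refers to.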
Information retrieval (IR)
• Problem: Given a query, find relevant documents from a collection
• Vector space model (VSM)
  – Represent documents and keywords in a matrix
    • Documents as columns with keywords as components: the columns are document vectors
  – Represent the query as a (column) vector
  – Find the document vectors closest to the query vector
    • These documents are relevant to the query
23
Example – document retrieval
A document collection
D1 How to bake bread without recipes
D2 The classic art of Viennese pastry
D3 Numerical recipes: the art of scientific computing
D4 Breads, pastries, pies, and cakes: quality baking recipes
D5 Pastry: a book of best French recipes
24
Example taken from Berry et al., SIAM Review 41, 2 (1999)
Example – document retrieval
Terms: T1 bake, T2 recipe, T3 bread, T4 cake, T5 pastry, T6 pie
25
Example – document retrieval
Term-by-document matrix
A    D1  D2  D3  D4  D5
T1   1   0   0   1   0
T2   1   0   1   1   1
T3   1   0   0   1   0
T4   0   1   0   1   0
T5   0   1   0   1   1
T6   0   0   0   1   0
26
Example (IR VSM)
• User searching for documents related to “baking bread”
        T1    T2      T3     T4    T5      T6
Terms   bake  recipe  bread  cake  pastry  pie
Query   1     0       1      0     0       0
• Query vector: $Q = (1\ 0\ 1\ 0\ 0\ 0)^T$
• Document vector: $D_1 = (1\ 1\ 1\ 0\ 0\ 0)^T$
27
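Finding relevant (similar) documents
A    D1   D2   …   Dn
T1   a11  a12  …   a1n
T2   a21  a22  …   a2n
…    …    …    …   …
Tm   am1  am2  …   amn

$Q = (q_1\ q_2\ \dots\ q_m)^T$, $D_j = (a_{1j}\ a_{2j}\ \dots\ a_{mj})^T$

\[ \mathrm{similarity}(Q, D_j) = \frac{Q^T D_j}{\|Q\|\,\|D_j\|} \]

\[ Q^T D_j = q_1 a_{1j} + q_2 a_{2j} + \dots + q_m a_{mj}, \qquad \|Q\| = \sqrt{Q^T Q} = \sqrt{q_1^2 + q_2^2 + \dots + q_m^2} \]
28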
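As a worked example of the similarity formula, the sketch below scores the "baking bread" query against the five documents of the term-by-document matrix used earlier in the deck. The function and variable names are illustrative choices, not part of the original slides.

```python
import math

# Term-by-document matrix from the example (rows T1..T6, columns D1..D5).
A = [
    [1, 0, 0, 1, 0],  # T1 bake
    [1, 0, 1, 1, 1],  # T2 recipe
    [1, 0, 0, 1, 0],  # T3 bread
    [0, 0, 0, 1, 0],  # T4 cake
    [0, 1, 0, 1, 1],  # T5 pastry
    [0, 0, 0, 1, 0],  # T6 pie
]
Q = [1, 0, 1, 0, 0, 0]  # query "baking bread"


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Extract the columns of A as document vectors and score each against the query.
documents = [[row[j] for row in A] for j in range(len(A[0]))]
similarities = [cosine_similarity(Q, d) for d in documents]
for j, s in enumerate(similarities, start=1):
    print(f"D{j}: {s:.4f}")
```

D1 scores highest (about 0.82), D4 follows (about 0.58), and D2, D3, and D5 score 0 because they share no term with the query.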
Correlation
• Determines if two random variables vary together
• Linear correlation between X and Y:
  – Positive correlation: X increases as Y increases
  – Negative correlation: X decreases as Y increases
  – No linear correlation: no linear relationship
• For $X = (x_1, x_2, \dots, x_m)$ and $Y = (y_1, y_2, \dots, y_m)$, the Pearson correlation coefficient is

\[ r_{XY} = \frac{\sum_{i=1}^{m} (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{Y})^2}} \]
29
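Pearson correlation coefficient – geometric interpretation

\[ r_{XY} = \frac{\sum_{i=1}^{m} (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{Y})^2}} \]

With the centered vectors $X_c = (x_1 - \bar{X}, \dots, x_m - \bar{X})^T$ and $Y_c = (y_1 - \bar{Y}, \dots, y_m - \bar{Y})^T$:

\[ \sum_{i=1}^{m} (x_i - \bar{X})(y_i - \bar{Y}) = X_c^T Y_c \]

\[ r_{XY} = \frac{X_c^T Y_c}{\|X_c\|\,\|Y_c\|} = \mathrm{similarity}(X_c, Y_c) \]
30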
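The identity between the Pearson coefficient and the cosine similarity of centered vectors can be checked numerically. The following is a small sketch on made-up data; the function names are illustrative, and both functions assume non-constant input vectors (a constant vector would give a zero denominator).

```python
import math


def pearson(x, y):
    """Pearson correlation computed directly from the summation formula."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator


def centered(v):
    """Subtract the mean from every component."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


x = [1.0, 2.0, 3.0, 4.0]  # made-up sample
y = [2.0, 1.0, 4.0, 3.0]  # made-up sample
r_direct = pearson(x, y)
r_cosine = cosine(centered(x), centered(y))
print(r_direct, r_cosine)
```

Both computations give 0.6 for this sample (up to floating-point rounding), which is why the same vector-space machinery used for document retrieval can be reused to correlate biological processes.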
Detecting interactions that have changed significantly in the phenotype
• Represent the differentially expressed genes in a phenotype and their biological functions as a matrix: a vector space model with biological processes as column vectors
• Find associations between pairs of biological processes
• Compare these associations with the corresponding associations in the absence of the phenotype
• Detect associations that are significantly different in the phenotype
32
Data inputs - genes and functions
• Reference genes and functions set (R)
  – M genes on a microarray
  – N GO terms annotated with the M genes
• In a biological condition under study (E)
  – m < M differentially expressed (DE) genes
  – n <= N GO terms annotated with the m DE genes
33
Gene function matrix – reference data
GF   f1   f2   …   fN
g1   a11  a12  …   a1N
g2   a21  a22  …   a2N
…    …    …    …   …
gM   aM1  aM2  …   aMN
34
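Gene function matrix – reference data

\[ GF^R_{M \times N} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]

[Figure: Example gene-function matrix]
35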
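Gene function matrix – experiment data

\[ GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]

[Figure: Example gene-function matrix]
36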
Gene function matrix – reference and experiment data
• The experiment gene-function matrix is a submatrix of the reference gene-function matrix
37
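A minimal sketch of how the two binary gene-function matrices could be built, and of why the experiment matrix is a submatrix of the reference matrix. The gene names, term names, and the `build_matrix` helper are placeholders invented for this illustration.

```python
def build_matrix(genes, terms, annotations):
    """Binary gene-by-term matrix: 1 where the gene is annotated with the term."""
    return [
        [1 if term in annotations.get(gene, set()) else 0 for term in terms]
        for gene in genes
    ]


# Toy annotation data with placeholder names.
annotations = {
    "g1": {"f1", "f2"},
    "g2": {"f2"},
    "g3": {"f3"},
    "g4": {"f1", "f3"},
}

# Reference: all M genes on the array and all N terms annotated with them.
ref_genes = ["g1", "g2", "g3", "g4"]
ref_terms = ["f1", "f2", "f3"]
GF_R = build_matrix(ref_genes, ref_terms, annotations)

# Experiment: the m DE genes and the n terms annotated with at least one of them.
de_genes = ["g1", "g2"]
de_terms = [t for t in ref_terms if any(t in annotations[g] for g in de_genes)]
GF_E = build_matrix(de_genes, de_terms, annotations)

print(GF_R)  # [[1, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1]]
print(GF_E)  # [[1, 1], [0, 1]]
```

Every entry of `GF_E` equals the entry of `GF_R` at the corresponding gene row and term column, so `GF_E` is obtained by selecting rows and columns of `GF_R`.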
Our approach to solve challenges
• Use singular value decomposition (SVD)
• SVD can find missing relationships between genes and annotations in the latent semantic space and also remove noise from the data
• Noise: multiple words describing the same concept
• SVD is a factorization of a matrix into three matrices consisting of the singular vectors and singular values of the original matrix
39
Singular value decomposition (SVD)
• SVD of a GF matrix:

\[ GF = G\,S\,F^T \]

• The columns of matrix G (F) are the left (right) singular vectors of GF
• S is a diagonal matrix of singular values $s_i$
  – The values on the main diagonal are in non-increasing order and represent the variability in the data
40
Matrix approximation – dimensionality reduction
• An approximated matrix can be computed by keeping only the k largest singular values
• We select the k that retains the desired data variance (say x%)
41
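The selection of k can be sketched as follows. The selection equation itself is not reproduced here, so this sketch makes an explicit assumption: retained variance is measured as the sum of the first k squared singular values divided by the sum of all squared singular values, which is a common convention but may differ from the criterion the authors used. With k chosen, the rank-k approximation keeps the first k singular triplets, $GF_k = G_k S_k F_k^T$. The function name and the sample values are hypothetical.

```python
def choose_rank(singular_values, variance_pct):
    """Smallest k whose leading singular values retain variance_pct of the variance.

    Assumption (not confirmed by the slides): variance is measured by
    squared singular values.
    """
    squared = [s * s for s in singular_values]
    total = sum(squared)
    if total == 0:
        return 0  # all-zero matrix: nothing to retain
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running / total * 100.0 >= variance_pct:
            return k
    return len(squared)


# Made-up singular values in non-increasing order.
singular_values = [5.0, 3.0, 1.0, 0.5]
k = choose_rank(singular_values, 90.0)
print(k)  # 2, since (25 + 9) / 35.25 is about 96%
```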
Approximated matrix – column view
• We approximate both the reference and the experiment matrices
• The approximated experiment gene-function matrix is not a submatrix of the approximated reference gene-function matrix
42
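Correlation Between Functions
• Indicates the strength and direction of a linear relationship between two biological processes
• The Pearson correlation coefficient $r_{f_i,f_j}$ between a pair of functions $f_i$ and $f_j$ is computed as:

\[ r_{f_i,f_j} = \frac{f_i^T f_j}{\|f_i\|\,\|f_j\|} \]

  where $f_i$ and $f_j$ are the column vectors of the two functions, mean-centered as in the geometric interpretation of the Pearson coefficient
• Matrices $R^R_{N \times N}$ and $R^E_{n \times n}$ of correlation coefficients are computed for the reference and experiment data, respectively
43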
Pair-wise Correlation Coefficients for Reference and Experiment data
• $R^R_{n \times n}$ contains the pair-wise correlation coefficients between the first n functions in the absence of the phenotype
44
Fisher Z Transform – Correlation Coefficient To Z-values
• Correlation coefficients from samples of a large population can be mapped to z-values using the Fisher z-transform; the transformed values are approximately normally distributed
• For a correlation coefficient r, the Fisher z-transform $Z_r$ can be computed as:

\[ Z_r = \frac{1}{2} \ln\!\left(\frac{1 + r}{1 - r}\right) \]

• Compute $Z^R_r$ from $R^R_{N \times N}$ and $Z^E_r$ from $R^E_{n \times n}$
45
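The Fisher z-transform is short enough to write directly. The sketch below applies the standard definition $\frac{1}{2}\ln\frac{1+r}{1-r}$ to single coefficients and to a correlation matrix; the helper names, and the choice to skip the diagonal (where r = 1 and the transform is undefined), are illustrative assumptions.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient, defined for -1 < r < 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def fisher_z_matrix(corr):
    """Transform the off-diagonal entries; the diagonal (r = 1) is left as None."""
    n = len(corr)
    return [
        [None if i == j else fisher_z(corr[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Made-up 2 x 2 correlation matrix.
z_values = fisher_z_matrix([[1.0, 0.5], [0.5, 1.0]])
print(z_values[0][1])
```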
Detecting Changes Between Functional Interactions
• Hypothesis: the correlation between two biological processes in the given phenotype differs from the correlation in the reference data
[Table: Hypothesis | Test statistic]
46
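The test statistic used by the authors is not reproduced here. As a stand-in, the sketch below uses the textbook two-sided z-test for the difference between two Fisher-transformed correlations. It assumes the two correlations come from independent samples (which holds only approximately here, since the experiment genes are a subset of the reference genes) and that the sample sizes are the numbers of genes behind each correlation; both assumptions, and the function name, are introduced for illustration only.

```python
import math


def correlation_change_test(r_ref, n_ref, r_exp, n_exp):
    """Two-sided z-test for a difference between two correlations.

    Assumptions (not confirmed by the slides): independent samples of
    sizes n_ref and n_exp, both larger than 3.
    """
    z_ref = math.atanh(r_ref)  # atanh(r) equals the Fisher z-transform of r
    z_exp = math.atanh(r_exp)
    standard_error = math.sqrt(1.0 / (n_ref - 3) + 1.0 / (n_exp - 3))
    z = (z_exp - z_ref) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Made-up inputs: a weak reference correlation vs. a strong experiment correlation.
z_stat, p = correlation_change_test(r_ref=0.1, n_ref=100, r_exp=0.9, n_exp=100)
print(z_stat, p)
```

A pair of biological processes would be reported as having a significantly changed interaction when the p-value falls below the chosen significance level.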
Improvements
• The dependencies between GO terms can be partly removed by using weights in our matrix.
48
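Scheme 1-1
• This is a binary scheme and was discussed while describing our main method

\[ GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]

\[ GF^R_{M \times N} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]
49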
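Scheme 1-e
• $e_i$ is the normalized log-transformed fold-change measured for gene $g_i$ in the given condition

\[ GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} e_i & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]

\[ GF^R_{M \times N} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]
50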
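Scheme IR 1-1

\[ GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} w_{ij} & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]

\[ w_{ij} = gb_j \times iab_i \]

\[ gb_j = \frac{1}{\#\text{ of genes annotated with } f_j}, \qquad iab_i = \ln\!\left(\frac{\text{Total annotations}}{\#\text{ of annotations for } g_i}\right) \]

• gb: gene (annotation) bias – GO DAG related
• iab: inverse annotation bias – experiment related
51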
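Scheme IR 1-e

\[ GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} e_i \times w_{ij} & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]

  where $e_i$ is the normalized log-transformed fold-change and $w_{ij} = gb_j \times iab_i$

\[ GF^R_{M \times N} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} w_{ij} & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases} \]
52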
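The four weighting schemes (1-1, 1-e, IR 1-1, IR 1-e) can be sketched in one function. Two assumptions are made here that the slides do not settle: "Total annotations" is taken to be the total number of gene-term annotations in the matrix, and both gb and iab are computed from the same binary matrix that is being weighted. The function names and the toy data are illustrative only.

```python
import math


def scheme_weights(binary):
    """Return w[i][j] = gb_j * iab_i for a binary gene-by-term matrix.

    Assumption: "Total annotations" is the sum of all entries of the matrix.
    """
    n_genes, n_terms = len(binary), len(binary[0])
    genes_per_term = [sum(binary[i][j] for i in range(n_genes)) for j in range(n_terms)]
    annotations_per_gene = [sum(row) for row in binary]
    total = sum(annotations_per_gene)
    gb = [1.0 / count if count else 0.0 for count in genes_per_term]
    iab = [math.log(total / count) if count else 0.0 for count in annotations_per_gene]
    return [[gb[j] * iab[i] for j in range(n_terms)] for i in range(n_genes)]


def weighted_matrix(binary, fold_changes=None, use_ir=False):
    """Apply a scheme: 1-1 (defaults), 1-e (fold_changes), IR 1-1 (use_ir), IR 1-e (both)."""
    weights = scheme_weights(binary) if use_ir else None
    result = []
    for i, row in enumerate(binary):
        e_i = fold_changes[i] if fold_changes is not None else 1.0
        result.append([
            e_i * (weights[i][j] if use_ir else 1.0) if annotated else 0.0
            for j, annotated in enumerate(row)
        ])
    return result


# Toy experiment data: 3 DE genes, 2 GO terms, made-up fold-changes.
binary = [[1, 1], [1, 0], [0, 1]]
fold = [2.0, -1.0, 0.5]
ir_1_1 = weighted_matrix(binary, use_ir=True)
ir_1_e = weighted_matrix(binary, fold_changes=fold, use_ir=True)
```

In this toy case the first gene has two annotations out of four in total, so its iab is ln 2, while each term is annotated with two genes, so every gb is 0.5.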
Breast cancer data set
• van 't Veer et al. (2002) found some differentially expressed genes in breast cancer
  – 24,000 reference genes on the microarray
  – 13,201 annotated biological processes from GO
  – 231 genes were found to be differentially expressed
  – 246 biological processes annotated with the DE genes
• Since then no satisfactory prediction has been made in this regard
54
Breast Cancer Data Set Results
A subset of predicted biological pairs with significant interaction change
Scheme      | GO Term 1                  | GO Term 2                                     | p-value
1-1, IR 1-e | Proteolysis                | Positive regulation of apoptosis              | .0001
1-1         | Transcription              | DNA replication initiation                    | .026
1-1         | DNA repair                 | Regulation of transcription, DNA-dependent    | .033
IR 1-1      | Vesicle-mediated transport | Transcription from RNA polymerase II promoter | .002
IR 1-1      | DNA replication initiation | Phosphoinositide-mediated signaling           | .00001
55
Breast Cancer Data Set Results Summary
Number of predicted biological pairs with significant interaction change
Scheme | Cat. 1 | Cat. 2 | Cat. 3 | Accuracy
1-1    | 10     | 5      | 1      | 93.7%
1-e    | 16     | 6      | 2      | 91.6%
IR 1-1 | 9      | 7      | 2      | 88.8%
IR 1-e | 15     | 9      | 2      | 92.3%
Total  | 50     | 27     | 7      | 91.6%
Cat. 1: Known interactions and trivial
Cat. 2: Known interactions and non-trivial
Cat. 3: Unknown
56
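The slides do not define how the Accuracy column is computed. The reported values appear consistent with the share of predicted pairs that fall in the two "known" categories, (Cat. 1 + Cat. 2) / (Cat. 1 + Cat. 2 + Cat. 3); the sketch below checks that reading against the breast cancer table. Treat the formula as an inference from the numbers, not as the authors' stated definition.

```python
# Counts copied from the breast cancer summary table:
# scheme -> (Cat. 1, Cat. 2, Cat. 3, reported accuracy in percent)
breast_rows = {
    "1-1": (10, 5, 1, 93.7),
    "1-e": (16, 6, 2, 91.6),
    "IR 1-1": (9, 7, 2, 88.8),
    "IR 1-e": (15, 9, 2, 92.3),
    "Total": (50, 27, 7, 91.6),
}


def known_share_pct(cat1, cat2, cat3):
    """Percentage of predicted pairs that are known interactions (assumed definition)."""
    return 100.0 * (cat1 + cat2) / (cat1 + cat2 + cat3)


for scheme, (c1, c2, c3, reported) in breast_rows.items():
    computed = known_share_pct(c1, c2, c3)
    print(f"{scheme}: computed {computed:.2f}%, reported {reported}%")
```

Under this reading every computed value is within 0.1 percentage points of the reported one, with the reported figures apparently truncated to one decimal.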
Lung cancer data set
• Beer et al. (2002) found some differentially expressed genes in lung cancer
  – 5541 reference genes on the microarray
  – 2908 annotated biological processes from GO
  – 87 genes were found to be differentially expressed
  – 248 biological processes annotated with the DE genes
57
Lung Cancer Data Set Results Summary
Number of predicted biological pairs with significant interaction change
Scheme | Cat. 1 | Cat. 2 | Cat. 3 | Accuracy
1-1    | 16     | 3      | 2      | 90.4%
1-e    | 39     | 3      | 2      | 95.4%
IR 1-1 | 29     | 2      | 0      | 100.0%
IR 1-e | 38     | 9      | 3      | 94.0%
Total  | 122    | 17     | 7      | 95.21%
58
Summary
• Various stimuli cause differential gene expression, which results in the expression of a disease and a disease-specific phenotype
• Biological processes interact, and their interactions change in a given phenotype
• We proposed methods to detect such significantly changed interactions in the observed phenotype
• We used the vector space model, matrix approximation, and statistical hypothesis testing to find changed interactions between biological processes from GO
• Each proposed scheme reached an accuracy of roughly 89% or higher on the two data sets
59
References:
• Ansari, N. A., Bao, R., and Drăghici, S. Detecting phenotype-specific interactions between biological processes from microarray data and annotations. Bioinformatics, under revision.
• Drăghici, S. Data Analysis Tools for DNA Microarrays. Chapman and Hall/CRC Press, 2003 (first print), 2006 (second print)
• Berry, M. W., Drmac, Z., and Jessup, R. E. Matrices, vector spaces, and information retrieval. SIAM Review 41, 2 (1999), 335-362
• Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391-407
• Done, B., Khatri, P., Done, A., and Drăghici, S. Predicting novel human Gene Ontology annotations using semantic analysis. IEEE/ACM Transactions on CBB (2009)
60
Special Thanks to
• Dr. Sorin Drăghici
61
Thank You
62