FINAL VERSION OF TNNLS-2014-P-4075 1
A Further Study On Mining DNA Motifs Using Fuzzy Self-Organizing Maps

Sarwar Tapan and Dianhui Wang, Senior Member, IEEE
Abstract—SOM-based motif mining, despite being a promising approach for problem solving, mostly fails to offer a consistent interpretation of clusters in respect to the mixed composition of signal and noise in the nodes. The main reason for this shortcoming is that the similarity metrics used in data assignment, although specially designed with a biological interpretation for this domain, do not account for the inevitable noise mixture in the clusters. This limits the explicability of the majority of clusters, which are typically noise dominated, degrading the overall system clarity in motif discovery. This paper aims to improve the explicability aspect of the learning process by introducing a Composite Similarity Function (CSF) that is specially designed for the k-mer-to-cluster similarity measure in respect to the degree of motif properties and the embedded noise in the cluster. Our proposed motif finding algorithm, built on our previous work READ [1] and termed READcsf, performs slightly better than READ and shows remarkable improvements over the SOM-based SOMBRERO and SOMEA tools in terms of F-measure on the testing datasets. A real dataset containing multiple motifs is used to explore the potential of READcsf for more challenging biological data mining tasks. Visual comparisons with the verified logos extracted from the JASPAR database demonstrate that our algorithm is promising for discovering multiple motifs simultaneously.
Index Terms—Computational DNA motif discovery, composite similarity metrics, robust elicitation algorithms, fuzzy self-organizing maps.
I. INTRODUCTION
In continuation to our previous study on fuzzy-SOM
(FSOM)-based motif discovery [1], this paper addresses a
persistent problem in the existing SOM-based tools (e.g., [2]–
[5]) such that they commonly demonstrate a critical limitation
in addressing the practical fuzzy-mixture of signal and noise
in the clusters. These tools ignore the presence of noise in
the clusters at all clustering states and optimize the clusters
based on only their degree of motif properties, despite the
known fact that every cluster practically comprises some
degree of noise in most cases due to the specific nature of
the problem. Such ignorance to embedded noise consequently
limits the explicability of the noise-dominated clusters that
occupy the largest portion in the maps, which is in general
a common problem in any clustering-based approach for this
task. The primary reason for this is the critical limitation of the
existing similarity metrics that are meant to consider the motif
properties and ignore the embedded noise in the clusters during
k-mer (k-length subsequence) assignment. Improvement in
this aspect of SOM-based motif discovery is necessary and
has motivated this study.

D. Wang is with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, Victoria 3086, Australia. E-mail: [email protected]
This paper extends our proposed READ framework [1] to
address previously unsolved issues in this approach for motif
discovery. Technical contributions of this paper include:
1) Improving the explicability aspect of clustering algo-
rithms using SOM networks through introducing a new
similarity metric which is designed to offer a rational
treatment to the embedded noise-mixture in clusters; and
2) Investigating the learning behaviour of SOM networks
for subtle pattern discovery task.
The nature of the problem necessitates describing two chal-
lenging properties of the k-mer dataset: (i) a considerably low
signal-to-noise ratio [6], [7], causing noise-dominated clusters
to largely populate the maps; and (ii) due to natural degenera-
tion caused by evolutionary pressure, motif (signal) elements
(binding sites) often have a close resemblance to noise, which
causes the unavoidable presence of some degree of noise in the
putative motif clusters. Thus, an explicable clustering requires
both signal and noise elements (also, their characteristics) in
the clusters to be combinedly and complementarily considered,
possibly through using specially designed similarity metrics
in the clustering process, contrasting the use of existing
similarity metrics that are mostly designed with a ‘signal only
characterization’ approach for motif discovery.
The use of biologically inclined similarity metrics, such as
MISCORE [8] and log-likelihood metric [9], offers an explica-
ble assignment of the putative binding site k-mers to putative
clusters (i.e., clusters with a good degree of motif properties)
through characterizing functional motif properties in the clus-
ters [8]. Their use in the iterative optimization of clusters aims
to consistently improve the degree of motif properties of the
clusters irrespective of their dominant signal type. This causes
a non-trivial inconsistency throughout the clustering process,
since there is a consistent attempt to improve every noise-
dominated cluster with a better degree of motif properties in
the same manner that only suits the optimization of putative
motif clusters. Thus, applying such similarity metrics limits
the interpretation to only the putative motif clusters as the
noise-dominated ones become inexplicable, imposing a major
drawback in terms of system clarity.
This paper proposes a new and adaptive similarity met-
ric, named Composite Similarity Function (CSF), which is
designed to address the discrete composition of signal and
noise in the clusters during k-mer distribution. The CSF-based
similarity quantification between a k-mer and a cluster gives
a composite similarity measure in respect to the current signal
composition (noise level) of the cluster using two separate
but complementary modelling schemes, connected through
adaptive composition weight. In CSF, the first component
is our MISCORE [8], which is a useful signal modelling
scheme with biological interpretation, capable of measuring
the potential of a k-mer through characterising several motif
properties of a cluster. The second one is a newly developed
background signal modelling scheme, named B-MISCORE
[10], which gives the similarity measure of a motif and its
elements to the backgrounds through a large random sampling
of the backgrounds (see preliminaries).
Technically, applying CSFs in clustering-based motif dis-
covery offers the following benefits: (i) a consistent interpre-
tation of all clusters in the system; (ii) a useful indication of the
noise level of each cluster throughout the iterations, offering an
effective monitoring of the ongoing clustering process; and
(iii) a means of embedding a discrete optimization of the putative
clusters throughout the iterations, potentially increasing the
chances of obtaining more putative motif candidates (detailed
in section III-D).
The remainder of this paper is organized as follows. Section
II provides some preliminaries used in this paper; Section III
details the proposed CSFs; Section IV describes the READcsf
algorithm; Section V reports our experimental results; and
Section VI concludes this paper.
II. PRELIMINARIES
The Positional Frequency Matrix (PFM)-based motif model,
denoted by M, is a matrix M = [f(bi, j)]4×k, where
bi ∈ χ = {A, C, G, T} and j = 1, . . . , k, and each entry
f(bi, j) represents the probability of nucleotide bi at the j-th
position. Similarly, a k-mer Ks = q1q2 . . . qk is encoded
as a binary matrix K = [k(bi, j)]4×k with k(qj, j) = 1 and
k(bi, j) = 0 for bi ≠ qj.
A. Background modelling using B-MISCORE
B-MISCORE [10] is a new modelling scheme for evalu-
ating a motif or its elements in respect to the backgrounds.
Firstly, a large collection of random sets ζ = {G1, G2, G3, . . .}, |ζ| ≥ 1000, is generated, where each
random set Gl consists of randomly grouped k-mers from the
backgrounds, i.e., Gl = {K1, K2, K3, . . .}, 25 ≤ |Gl| ≤ 50.
Then, the background probability of each K ∈ Γ, where Γ is the k-mer dataset produced from the input sequences, is
computed using a first-order Markov chain transition matrix
β = [π(a, a′)]4×4 as,

P(K, MB) = p(b1) ∏_{∀(a,a′)} π(a, a′)^{k(a,a′)},   (1)
where k(a, a′) gives the count of di-nucleotide aa′ in K ,
and p(b1) is the independent background probability of the
nucleotide appearing at the first position in K . The background
probability of the k-mers is globally normalized as,

Pn(K) = [P(K) − min_{∀K∈Γ} P(K)] / [max_{∀K∈Γ} P(K) − min_{∀K∈Γ} P(K)].   (2)
The background score of Ki ∈ Γ can then be measured
using a random k-mer set, namely Gl, as,

dB(Ki, Gl) = (1/|Gl|) ∑_{∀Kp∈Gl} Pn(Kp) d(Ki, Kp),   (3)
where d(Ki,Kp) is the Hamming distance [8] between two
k-mers.
It can be deduced from (3) that dB(Ki, Gl) is a weighted
measure of Ki being a background class element in respect
to ∀Kp ∈ Gl, where d(Ki,Kp) applies the weight to the
contribution of each Kp (∀Kp ∈ Gl) in evaluating the
similarity of Ki to the backgrounds. Then, a large collection
of random sets ζ is used to obtain a discriminative background
score of Ki (B-MISCORE) with denotation rb(Ki) as,
rb(Ki) = min∀Gl∈ζ
dB(Ki, Gl), (4)
where a smaller rb(Ki) score represents a higher chance of
Ki being a background class element, and vice versa.
For a given set S of k-mers, the B-MISCORE-based
Model Score (BMMS) can be written as,

Rb(S) = (1/|S|) ∑_{∀K∈S} rb(K),   (5)
where a larger Rb(∗) score represents a higher potential of
the model being a putative motif, and vice versa.
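The B-MISCORE pipeline in (1)-(5) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the background model is supplied as a dict of initial and di-nucleotide transition probabilities, and the weight Pn(K) in (3) is read as Pn(Kp), i.e., the normalized background probability of each background k-mer Kp. All function names here are ours, not from the READ codebase.

```python
def markov_prob(kmer, p0, pi):
    """Eq. (1): first-order Markov background probability of a k-mer."""
    p = p0[kmer[0]]
    for a, b in zip(kmer, kmer[1:]):
        p *= pi[(a, b)]
    return p

def normalize(probs):
    """Eq. (2): global min-max normalization over the k-mer dataset."""
    lo, hi = min(probs.values()), max(probs.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0
            for k, v in probs.items()}

def hamming(a, b):
    """Positional mismatch count between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def d_background(kmer, group, pn):
    """Eq. (3): weighted background score of `kmer` against one random set."""
    return sum(pn[kp] * hamming(kmer, kp) for kp in group) / len(group)

def b_miscore(kmer, zeta, pn):
    """Eq. (4): minimum background score over all random sets in zeta."""
    return min(d_background(kmer, g, pn) for g in zeta)

def bmms(kmers, zeta, pn):
    """Eq. (5): B-MISCORE-based Model Score of a k-mer set S."""
    return sum(b_miscore(k, zeta, pn) for k in kmers) / len(kmers)
```

A smaller `b_miscore` value then flags a k-mer as more background-like, matching the reading of (4) above.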
B. Fuzzy-SOM (FSOM) batch learning
Let Γ be the set of all binary encoded k-mers from the
input sequences, N represent the number of nodes in the
FSOM network where the j-th node has a 2-dimensional grid
coordinate as zj = [zj1, zj2] and a node-PFM as Mj . Then
the batch update rule of FSOM can be written from [1] as,
Mj(t+1) = [∑_{i=1}^{|Γ|} ∑_{k=1}^{N} µki^m(t) hjk(t) Ki] / [∑_{i=1}^{|Γ|} ∑_{k=1}^{N} µki^m(t) hjk(t)],   (6)
where µki(t) is the fuzzy membership [11] of Ki to Mk(t) that can be computed as,

µki(t+1) = [∑_{l=1}^{N} (Φ(Ki, Mk(t)) / Φ(Ki, Ml(t)))^{2/(m−1)}]^{−1},   (7)

where Φ is a similarity metric and the exponent m > 1 essentially controls the amount of fuzziness in µ.
In (6), hjk(t) is the following neighbourhood function,
hjk(t) = exp(−‖zj − zk‖² / (2σ(t)²)),   (8)
where the neighbourhood range σ(t) can be monotonically
shrunken using the criterion mentioned in [12] as,
σ(t+1) = σ(t0) exp(−2σ(t0) t / tmax),   (9)
where σ(t0) is a fairly large initial σ and tmax is the maximum
epoch set by the user.
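The batch learning rules (6)-(9) can be sketched as below. This is a minimal illustration under stated assumptions: k-mers and node PFMs are 4 × k matrices stored as lists of lists, and `phi_vals[l][i]` holds Φ(Ki, Ml(t)) (smaller means more similar). Names are ours, not from the READ codebase.

```python
import math

def membership(i, k, phi_vals, m):
    """Eq. (7): fuzzy membership of k-mer i to node k."""
    return 1.0 / sum((phi_vals[k][i] / phi_vals[l][i]) ** (2.0 / (m - 1.0))
                     for l in range(len(phi_vals)))

def neighborhood(zj, zk, sigma):
    """Eq. (8): Gaussian neighborhood between grid coordinates zj and zk."""
    dist2 = (zj[0] - zk[0]) ** 2 + (zj[1] - zk[1]) ** 2
    return math.exp(-dist2 / (2.0 * sigma ** 2))

def shrink(sigma0, t, tmax):
    """Eq. (9): monotone shrinking of the neighborhood range."""
    return sigma0 * math.exp(-2.0 * sigma0 * t / tmax)

def batch_update_node(kmers, mu, h, m):
    """Eq. (6): batch update of one node PFM M_j(t+1).
    mu[k][i] is the membership of k-mer i to node k; h[k] is h_jk(t)."""
    rows, cols = len(kmers[0]), len(kmers[0][0])
    num = [[0.0] * cols for _ in range(rows)]
    den = 0.0
    for i, K in enumerate(kmers):
        for k in range(len(mu)):
            w = (mu[k][i] ** m) * h[k]
            den += w
            for r in range(rows):
                for c in range(cols):
                    num[r][c] += w * K[r][c]
    return [[v / den for v in row] for row in num]
```

Because each one-hot k-mer column sums to 1 and (6) is a convex combination, each column of the updated PFM also sums to 1, as a probability column should.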
III. COMPOSITE SIMILARITY FUNCTIONS (CSFS)
The Composite Similarity Function (CSF) associated with
a given j-th node at the t-th iteration can be written as,

Θj(Ki, Mj(t)) = λj(t) r(Ki, Mj(t)) + (1 − λj(t)) [1 − rb(Ki)],   (10)
where r(Ki,Mj(t)) is the MISCORE-based similarity [8]
between Ki and Mj(t) in respect to the motif properties in the
j-th cluster and rb(Ki) is the B-MISCORE-based similarity
measure of Ki to the backgrounds. In (10), λj(t) is the
(adaptive) composition weight (0 < λj(t) < 1) that reflects the
current noise level in the node: a higher value represents
signal domination over noise in the j-th cluster and sensibly
assigns a higher weight to the r(Ki, Mj(t)) score in the
CSF-based similarity measure, and vice versa.
B-MISCORE in (4) gives the similarity of Ki to the back-
grounds, where a smaller score gives a higher similarity score
referring to a higher chance of Ki being categorized as noise in
the dataset. In the CSF-based similarity measure, an inverse of
the B-MISCORE score is taken as [1−rb(Ki)], interpreting the
dissimilarity of Ki to the backgrounds. This makes it comple-
mentary to MISCORE in measuring the merit of the k-mers,
where r(Ki, Mj) gives how similar Ki is to Mj in respect
to the embedded motif properties in Mj(t) and [1 − rb(Ki)]
gives how dissimilar Ki is from the backgrounds. This follows
a sensible understanding that a higher dissimilarity of a k-
mer from the backgrounds and a higher similarity of the k-
mer to a given cluster with a good degree of motif properties
conjunctively quantify a higher potential of the k-mer to be a
putative motif element. In (10), a smaller Θj(Ki, Mj(t)) score
gives a higher composite similarity between Ki and Mj(t), and vice versa.
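Eq. (10) is a convex combination of the two component scores; a minimal sketch (the function name is ours):

```python
def csf(r_km, rb_km, lam):
    """Eq. (10): composite similarity of k-mer K_i to the j-th node.
    r_km : MISCORE similarity r(K_i, M_j(t)), smaller = more motif-like.
    rb_km: B-MISCORE score r_b(K_i), smaller = more background-like.
    lam  : composition weight lambda_j(t), 0 < lam < 1 (higher value =
           more signal-dominated node, so the MISCORE term dominates)."""
    return lam * r_km + (1.0 - lam) * (1.0 - rb_km)
```

For a signal-dominated node (lam near 1) the measure reduces to plain MISCORE; for a noise-dominated node the background-dissimilarity term [1 − rb(Ki)] takes over.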
A. Adaptation
The adaptation of composition weights is functionally re-
quired to reflect the changes in the noise level of each node
throughout the clustering iterations. The composition weights
are updated at the end of each iteration as,
λj(t+1) = [∑_{i=1}^{|Γ|} µji(t) rb(Ki)] / [∑_{i=1}^{|Γ|} µji(t) (r(Ki, Mj(t)) + rb(Ki))],   (11)
where Γ is the k-mer dataset and µji(t) is the fuzzy mem-
bership of i-th k-mer to j-th node at t-th iteration that can be
computed using the CSFs as,
µji(t) = [∑_{l=1}^{N} (Θj(Ki, Mj(t)) / Θl(Ki, Ml(t)))^{2/(m−1)}]^{−1},   (12)
where λj(t) reflects the current noise-level in the j-th node
and N is the number of nodes in the network.
Fig. 1. Conceptualisation of CSF-based clustering of k-mers in an FSOM network for DNA motif discovery, where nodes are illustrated with discrete signal and noise composition.
Eq. (11) can be re-written by dividing the terms by Ωj = ∑_{i=1}^{|Γ|} µji(t) as:

λj(t+1) = [Ωj^{−1} ∑_{i=1}^{|Γ|} µji(t) rb(Ki)] / ([Ωj^{−1} ∑_{i=1}^{|Γ|} µji(t) r(Ki, Mj(t))] + [Ωj^{−1} ∑_{i=1}^{|Γ|} µji(t) rb(Ki)]) = Rfb(Mj(t)) / (Rf(Mj(t)) + Rfb(Mj(t))),   (13)
where Rf (Mj(t)) is the fuzzy extension of the MISCORE-
based Motif Score (MMS), previously described in [1], for
quantifying motif properties of a given fuzzy cluster; and
Rfb(Mj(t)) gives the background similarity score of the fuzzy
cluster, which can be rationally interpreted as the current degree
of noise in the cluster, implying that λj(t) is the current noise level
indicator of the j-th cluster at the t-th iteration. Note that
λj(t+1) > λj(t) results from (11) when the motif characteristics
increase and, consequently, the noise level of the j-th node
decreases in terms of its dissimilarity to the backgrounds, while
λj(t+1) < λj(t) is caused by the opposite.
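The weight update in (11) and its cluster-level form in (13) can be sketched together; the second function makes the equivalence explicit. Names are ours, not from the READ codebase.

```python
def update_lambda(mu_j, r_vals, rb_vals):
    """Eq. (11): new lambda_j(t+1) for node j.
    mu_j[i]   : fuzzy membership of k-mer i to node j.
    r_vals[i] : MISCORE r(K_i, M_j(t)).
    rb_vals[i]: B-MISCORE r_b(K_i)."""
    num = sum(m * rb for m, rb in zip(mu_j, rb_vals))
    den = sum(m * (r + rb) for m, r, rb in zip(mu_j, r_vals, rb_vals))
    return num / den

def fuzzy_mean(mu_j, vals):
    """Membership-weighted mean, i.e., Rf or Rfb in eq. (13)."""
    return sum(m * v for m, v in zip(mu_j, vals)) / sum(mu_j)
```

Dividing numerator and denominator of (11) by the total membership recovers (13): lambda equals Rfb / (Rf + Rfb), as the test below checks numerically.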
The adaptation process enables λj(t) to reveal a relative
measure of signal and noise composition in j-th cluster (for
1 ≤ j ≤ N) at any clustering cycle t.

Fig. 2. Discrimination of signal nodes from the noise nodes by different modes of adaptation in CSFs using the CREB [13] transcription factor dataset. [Four panels plot Z(Sq, Nq) against epoch t for q = 5, 10, 15 and 20, each comparing λAdaptive, λ0.5 and λ1.0.]

During the initialization
stage of training, each cluster usually demonstrates a high
presence of noise and gradually some of the clusters get
improved in terms of motif modelling. The value of λ is
associated with the degree of mixture of signals and noise
in each node (cluster) that can be useful in discriminating
potential motif models from the random ones. Note that
this value is only a relative indicator rather than a physical
quantification of noise level in the cluster. The implementation
of CSFs in FSOM is conceptualized in Fig. 1.
Remark 1: In typical SOM-based algorithms for DNA
motif discovery, CSFs can be applied to find the BMU (Best
Matching Unit), denoted ci(t), for a given Ki at the t-th
iteration as ci(t) = argmin_l Θl(Ki, Ml(t)). The adaptation
given in (11) can then be simplified for a crisp set of k-mers as,

λj(t+1) = [∑_{∀K∈Vj(t)} rb(K)] / [∑_{∀K∈Vj(t)} (r(K, Mj(t)) + rb(K))],

where Vj(t) is the j-th crisp cluster produced at the t-th iteration.
B. Demonstration on signal-discrimination
The demonstration uses a 10 × 10 FSOM network trained
on a k-mer (k=12) dataset generated from a set of promoter
sequences of co-regulated genes that contain a known motif
of the CREB [13] transcription factor. The objective is to
visualize the effectiveness of CSF adaptation in discriminating
the putative (signal-dominated) nodes from the non-putative
(noise) nodes with respect to the following three modes of the
composition parameter (λ):
1) λAdaptive: adaptive composition as shown in (10).
2) λ0.5: equal-weight composition in (10), yielding Θj(Ki, Mj(t)) = 0.5 r(Ki, Mj(t)) + 0.5 [1 − rb(Ki)].
3) λ1.0: B-MISCORE-omitted composition that rewrites (10) as Θj(Ki, Mj(t)) = r(Ki, Mj(t)).
We applied z-score to statistically measure the relative degree
of discrimination between a set of signal nodes (Sq) and a set
of noise nodes (Nq) as,
Z(Sq, Nq) = [EQ(Sq) − EQ(Nq)] / stdQ(Nq) × C,   (14)
where E∗ is the expectation on q models, Q(∗) is the
respective model quantification for the adaptation modes and
C = 3 is a scaling constant for visualization.
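Eq. (14) is a scaled z-score; a minimal sketch, assuming E is the sample mean and std the sample standard deviation of the node scores (the paper does not specify which estimator is used), with C = 3 as the visualization scaling constant. The function name is ours.

```python
import statistics

def z_discrimination(signal_scores, noise_scores, C=3.0):
    """Eq. (14): scaled z-score separating the top-q signal nodes (S_q)
    from the bottom-q noise nodes (N_q) by their model-quality scores."""
    gap = statistics.mean(signal_scores) - statistics.mean(noise_scores)
    return gap / statistics.stdev(noise_scores) * C
```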
In typical SOM-based motif discovery, a limited number
of top scoring models are extracted as putative signal nodes
from a trained map. Similarly in this demonstration, nodes are
evaluated at each iteration and a limited top q and the bottom
q scoring nodes are categorized as the putative signal and the
noise nodes, respectively, for Z(Sq, Nq) score computation.
The network is given the same initial state for each of
the modes of adaptation during each run. This is repeated
separately for q ∈ {5, 10, 15, 20} and a 10-run average is
presented in Fig. 2.
Observations: This visualization depicts that the adaptive
mode of the composition (λAdaptive), which is functionally
required in CSFs, offers a better discrimination of signal
nodes than the other two modes considered. This describes the
usefulness of the combination of background referencing by B-
MISCORE and the adaptive composition used in CSFs. Fig. 2
also depicts a rational decrease in the degree of discrimination
as q increases to a larger number, which agrees with the
distribution of signals in the nodes of the maps. Similar results
were observed on other datasets in our unreported experiments.
[Figure: both panels plot the mean λ(t) ± std of the putative nodes and of the noise nodes against training cycle t: (a) random initialization of the weights; (b) the same initializing value of the weights, i.e., ∀λj(t0) = 0.5.]
Fig. 3. Demonstration on the effects of different initialization of CSF weights.
C. Demonstration on CSF-initialization
In another experiment, we observed the impact of two
different initializations of the CSF weights, i.e., (i) random and
(ii) identical-value initialization, on their signal-discrimination
ability. The experiment was set to separate n
putative signal nodes with a comparatively lower degree of
noise in them from the rest of the nodes that are mostly noise-
dominated, based on their respective noise-level indication,
which allowed comparing the CSF-adaptation for these two
types of nodes with supposedly opposite signal characteristics
in the map throughout the iterations. In implementation,
n = 10 nodes were firstly separated as signal nodes during
each iteration using their respective noise-level indicator,
while the rest of the nodes were categorized as noise-nodes
in the network. Then, for each initialization, the mean and
standard deviation of the CSF weights (λ(t)) of these two types of
clusters were plotted iteration-wise in Fig. 3.
Observations: This visualization shows that CSF-adaptation
is capable of effectively discriminating the putative nodes
from the noise dominated ones in the map by assigning a
comparatively higher λ(t) value to the signal nodes throughout
the major portion of the training, regardless of the initialization
applied. That is, the adaptation receives a very minimal or a
negligible impact from the initialization of CSF weights, which
adds a supportive feature to its algorithmic robustness. Hence,
a random initialization of the weights can be conveniently
applied, as used in the experiments in this paper.
D. Benefits
This section describes the key benefits of the proposed CSF-
based similarity measure, contrasting the use of traditional
similarity metrics in SOM-based (clustering-based) motif dis-
covery, as follows.
1) The proposed CSFs address a major limitation of the
state-of-the-art clustering approaches that inconsistently
apply the same (analogous) optimization to the clusters
with opposite signal characteristics, causing inexplicable
noise clusters to largely populate the maps. CSFs resolve
this inconsistency by refining the clusters to become
more motif-like, depending on their signal composition.
In other words, CSFs ensure the degree of optimization
to each cluster is directly related to the present degree
of motif properties and embedded noise in that cluster
during each iteration, offering a consistent interpretation
to every cluster in the maps and resultantly, an improved
system clarity.
2) The proposed CSF adaptation reveals the current noise
level in each cluster (node) throughout the iterations,
which enables the monitoring of the ongoing clustering
process. In this aspect, the proposed similarity function
is certainly more useful than the traditional similarity
metrics that are not meant to: (i) reveal the degree
of embedded noise in the clusters at any clustering
iteration; and (ii) enable the monitoring of the quality
of the ongoing clustering.
3) CSFs follow a critical argument that, if a putative bind-
ing site k-mer has a similar match to multiple putative
clusters, then intuitively it is useful to consider their
individual merit in terms of the present degree of motif
properties and embedded noise in the clusters for a more
appropriate assignment of the k-mer. In this manner,
CSFs enable embedding a discrete optimization to the
putative clusters throughout the iterations, which poten-
tially increases the chances of producing more putative
motif candidates in the maps. In contrast, a discrete
optimization to the putative clusters in the state-of-the-
art clustering-based approaches can only be applied as
a separate process after the post-training clusters are
evaluated by external motif scoring metrics.
E. FSOM learning using CSFs
The update equation of FSOM learning in READcsf is
given in (6), which is the same as that used in READ
[1]. However, the CSF implementation distinguishes k-mer
distribution using fuzzy membership computation in READcsf
as,
µli(t) = [∑_{q=1}^{N} (Θl(Ki, Ml(t)) / Θq(Ki, Mq(t)))^{2/(m−1)}]^{−1}.   (15)
The classical FCM objective [11] then can be written as,
Jm(t) = ∑_{i=1}^{|Γ|} ∑_{j=1}^{N} µij^m Θj(Ki, Mj(t)),   (16)
where Θ(∗, ∗), Γ, N and m take the denotations previously
described. The decrease of Jm(t) can be monitored along
with the increase of the performance coefficient pc(µ(t)) = (1/|Γ|) ∑_{i=1}^{|Γ|} ∑_{j=1}^{N} µij(t)² to stop FSOM training in READcsf
when the neighborhood range is sufficiently shrunken [1].
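The two monitored quantities, the FCM objective in (16) and the performance coefficient pc, can be sketched as below; names are ours, not from the READ codebase.

```python
def fcm_objective(mu, theta, m):
    """Eq. (16): J_m(t) with CSF-based dissimilarities.
    mu[i][j]   : membership of k-mer i to node j.
    theta[i][j]: Theta_j(K_i, M_j(t))."""
    return sum((mu[i][j] ** m) * theta[i][j]
               for i in range(len(mu)) for j in range(len(mu[0])))

def performance_coefficient(mu):
    """pc(mu(t)) = (1/|Gamma|) sum_i sum_j mu_ij(t)^2; it rises toward 1
    as the fuzzy partition becomes crisper."""
    return sum(v * v for row in mu for v in row) / len(mu)
```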
IV. READcsf : IMPROVED READ WITH CSFS
The CSF-based similarity measure is implemented in the
FSOMs of our READ system [1] for motif discovery for
two reasons: (i) to demonstrate the usefulness of CSFs in
clustering-based motif discovery; and (ii) to obtain a new
motif mining tool (READcsf ) that benefits from the synergy
between (ii-a) the FSOM-based soft-clustering that addresses
the underlying fuzziness in the datasets and (ii-b) the CSF-
enabled treatment to the fuzzy signal and noise composition
of the clusters.
This section describes the READcsf algorithm (Robust
Elicitation Algorithms for Discovering DNA Motifs using
Composite Similarity Functions), emphasising: (i) the training
of multiple FSOMs using the CSF-based similarity measure
and (ii) the CSF-based motif scoring functions for node
calibration, while the technical description of the common
components between READ and READcsf are conveniently
referred to [1].
A. Overview
The 2-dimensional output grid of READcsf is a lattice of
N = R × C nodes, where R,C are the number of rows and
columns, respectively. The j-th node, j = 1, 2, . . . , R×C, has
a 2D coordinate zj = [zj1, zj2] in the lattice. The j-th node
is initialized with: (i) a randomly generated PFM Mj(t0); and
(ii) a randomly initialized composition weight 0 < λj(t0) < 1.
The learning steps at t-th iteration are then as follows:
• Membership computation: Compute the fuzzy member-
ship of each k-mer to every node using CSFs.
• Prototype updating: Update the node prototypes using
fuzzy membership distribution of k-mers and grid-based
neighborhood cooperation between the nodes.
• CSF adaptation: Update the composition weight λj(t) based on the current noise level of the j-th node.
Post-training nodes in the map are evaluated and ranked
using the proposed CSFs-based motif scoring metric given in
Fig. 4. READcsf algorithm overview.
(17). Multiple FSOMs are trained for variable k-mer lengths
(kmin ≤ k ≤ kmax) due to the unknown length of the motif
elements. User-defined top T candidates are then returned
as final motifs through an open competition between the
candidates with different consensus length (k-mer length)
extracted from multiple FSOMs. An overview of the READcsf
algorithm is presented in Fig. 4 illustrating the parallel training
of multiple FSOMs.
B. Candidate evaluation
The post-training nodes need calibration to identify the
putative candidates in the maps. The CSF-based similarity
measure gives a new motif scoring function that can be written
as,
Q(Mj(t′)) = [∑_{i=1}^{|Γ|} Θj(Ki, Mj(t′)) µji(t′)] / [∑_{i=1}^{|Γ|} µji(t′)],   (17)
where t′ (t′ ≤ tmax) is the final training iteration, λj(t′) is
the final noise level of the j-th node and µji(t′) is the final
fuzzy membership between Ki and Mj(t′).
By applying the CSF description given in (10) and by
algebraic derivation, (17) can be re-written as,
Q(Mj(t′)) = λj(t′) Rf(Mj(t′)) + [1 − λj(t′)] [1 − Rfb(Mj(t′))],   (18)
where Rf(∗) is the fuzzy-MMS metric [1] for quantifying the
degree of motif properties and [1 − Rfb(∗)] is the inverse of the
fuzzy-BMMS metric for measuring the dissimilarity of a fuzzy
cluster from the backgrounds. Smaller such scores, combined
through the final signal composition weight of the cluster, are
functionally required for a smaller Q(∗) score to calibrate a
fuzzy cluster as a putative motif candidate in the map. In (18),
the following holds: 0 < λj(t′), Rf(∗), Rfb(∗) < 1.
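The node scoring in (17) and its closed form (18) can be sketched together; since Θ is itself the convex combination in (10), the membership-weighted average in (17) collapses to (18), which the test below verifies numerically. Names are ours.

```python
def node_score(mu_j, theta_j):
    """Eq. (17): membership-weighted CSF score of node j at t'."""
    return sum(m * th for m, th in zip(mu_j, theta_j)) / sum(mu_j)

def node_score_closed(lam, Rf, Rbf):
    """Eq. (18): equivalent closed form from the cluster-level statistics
    Rf (fuzzy-MMS) and Rfb (fuzzy-BMMS); smaller = more putative."""
    return lam * Rf + (1.0 - lam) * (1.0 - Rbf)
```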
C. Post-processing
The top T candidates (user defined) are selected to be
optimized by their grid-based neighboring nodes, followed by
their defuzzification (decoding), as applied in READ [1]. The
candidates extracted from multiple FSOMs are then refined
with the post-processing scheme detailed in [1]. The top T
candidates are then returned as the final motifs through an open
contest among the motif candidates with different consensus
lengths extracted from multiple maps.
D. Relation to READ System
It is deemed meaningful to describe the relationship between
READcsf and its predecessor, the READ system [1]. The READ
system was introduced with a primary focus on addressing: (i)
the underlying fuzziness in the characterizing features of the
motif models; and (ii) the practically fuzzy-association of the
motif instances (binding sites) to multiple and different motif
models. Aiming to address such inherent fuzziness in DNA
motifs in their discovery, READ system adopted modified
Fuzzy Self-Organizing Maps (FSOMs) with an unsupervised
soft-clustering approach and several heuristics-based post-
processing schemes for effective motif mining [1].
In contrast, READcsf primarily focuses on addressing the
explicability aspect of the clusters in the map by introducing a
means of quantitatively measuring the signal and noise com-
position in a cluster at any given clustering state through the
use of a novel background-similarity measure [10]. READcsf
introduces a novel and sensible approach for cluster analysis
in motif discovery in addition to offering the features of its
predecessor.
V. PERFORMANCE EVALUATION
This section reports the performance evaluation with two
objectives: (i) to demonstrate the benefits of applying CS-
Fs through a performance comparison between READ and
READcsf ; and (ii) to study the usefulness of READcsf as a
motif mining tool in comparison to other SOM-based tools,
i.e., SOMBRERO [2] and SOMEA [5], and other prominent
tools namely MEME [14], AlignACE [15] and WEEDER [16].
This paper adopts the performance measures used in [1], i.e.,
recall (R), precision (P) and F-measure (F) rates.
Algorithm 1 READcsf learning pseudocode
1: START
2: input: Γ, T, N = R × C, tmax.
3: ensure: Γ ≠ ∅; 5 ≤ R ≤ 100; 5 ≤ C ≤ 100.
4: 1. Initialization:
5: For the j-th node, j = 1, 2, . . . , N:
6:   Generate a random PFM Mj(t0).
7:   Allocate a 2D coordinate zj = [zj1, zj2].
8:   Randomly initialize λj(t0), i.e., 0 < λj(t0) < 1.
9: 2. Training:
10: for t = 1 : tmax do
11:   ΔMj(t)[4×k] ⇐ [0]4×k; Δhj(t) ⇐ 0; j = 1, . . . , N.
12:   2.1. task: fuzzy membership computation
13:   for i = 1 : |Γ| do
14:     for l = 1 : N do
15:       µli(t) ⇐ [∑_{q=1}^{N} (Θl(Ki, Ml(t)) / Θq(Ki, Mq(t)))^{2/(m−1)}]^{−1}.
16:     end for
17:   end for
18:   2.2. task: node updates computation
19:   for i = 1 : |Γ| do
20:     for j = 1 : N do
21:       for l = 1 : N do
22:         ΔMj(t) ⇐ ΔMj(t) + µli^m(t) hjl(t) Ki.
23:         Δhj(t) ⇐ Δhj(t) + µli^m(t) hjl(t).
24:       end for
25:     end for
26:   end for
27:   2.3. task: adaptation
28:   for j = 1 : N do
29:     Mj(t+1) ⇐ ΔMj(t) / Δhj(t).
30:     λj(t+1) ⇐ [∑_{i=1}^{|Γ|} µji(t) rb(Ki)] / [∑_{i=1}^{|Γ|} µji(t) (r(Ki, Mj(t)) + rb(Ki))].
31:   end for
32:   σ(t+1) ⇐ σ(t0) exp(−2σ(t0) t / tmax).
33:   Stop training if the termination condition is satisfied.
34: end for
35: 3. Motif extraction and post-processing:
36:   • Evaluate and rank each node using (17).
37:   • Extract the top T candidates from the ranking.
38:   • Apply post-processing as described in [1].
39:   • Retain candidates for an open contest among the candidates extracted from multiple maps with different motif lengths.
40: END
41: Notations: |∗|: set cardinality; T: number of motifs to return; N: number of nodes in the map; tmax: maximum number of training epochs; t0: initialization stage of the map; Mj(t): node PFM of the j-th node at the t-th epoch; hjl(t): neighborhood function given in (8); ΔMj(t), Δhj(t): two computing components for node updating.
42: Note: This pseudocode applies to FSOM training for a given consensus length k; multiple FSOM trainings are required for user-defined kmin ≤ k ≤ kmax.
There exist a large number of different algorithms and
tools for DNA motif discovery in the current literature. However,
owing to constraints on time and resources, and for the
succinctness of this paper, we have carefully selected a subset
of these tools for the quantitative evaluation based on the
following reasons: (i) SOMBRERO and SOMEA represent a
class of SOM-based tools that are recent developments
in SOM-based/clustering-based motif discovery and due to
TABLE I
PERFORMANCE EVALUATION USING REAL DATASETS
Average recall (R), precision (P) and F-measure (F) rates over 10 runs

        READcsf          READ             SOMEA            SOMBRERO         MEME             AlignACE         WEEDER
TF      R    P    F      R    P    F      R    P    F      R    P    F      R    P    F      R    P    F      R    P    F
CRP     0.79 0.90 0.84 | 0.76 0.84 0.80 | 0.91 0.89 0.90 | 0.83 0.43 0.56 | 0.59 0.88 0.69 | 0.83 0.98 0.90 | 0.75 0.83 0.79
GCN4    0.46 0.59 0.50 | 0.48 0.70 0.55 | 0.69 0.45 0.54 | 0.80 0.41 0.53 | 0.52 0.52 0.52 | 0.61 0.62 0.60 | 0.64 0.87 0.73
ERE     0.85 0.63 0.72 | 0.92 0.59 0.71 | 0.74 0.58 0.65 | 0.80 0.59 0.67 | 0.72 0.82 0.77 | 0.75 0.77 0.76 | 0.76 0.54 0.63
MEF2    0.99 0.90 0.94 | 0.96 0.87 0.91 | 0.81 0.99 0.89 | 0.35 0.22 0.27 | 0.92 0.80 0.85 | 0.86 0.87 0.86 | 0.88 0.88 0.88
SRF     0.90 0.81 0.85 | 0.91 0.77 0.83 | 0.84 0.74 0.79 | 0.67 0.83 0.74 | 0.87 0.72 0.79 | 0.83 0.71 0.77 | 0.83 0.71 0.76
CREB    0.82 0.81 0.82 | 0.82 0.78 0.81 | 0.89 0.67 0.77 | 0.83 0.43 0.56 | 0.59 0.88 0.69 | 0.52 0.66 0.57 | 0.79 0.71 0.75
E2F     0.71 0.73 0.72 | 0.69 0.74 0.71 | 0.82 0.64 0.71 | 0.76 0.67 0.71 | 0.68 0.64 0.65 | 0.75 0.68 0.71 | 0.89 0.67 0.76
MyoD    0.67 0.43 0.52 | 0.65 0.42 0.51 | 0.66 0.39 0.49 | 0.50 0.32 0.39 | 0.23 0.38 0.27 | 0.34 0.31 0.32 | 0.43 0.50 0.46
avg     0.77 0.72 0.74 | 0.77 0.71 0.73 | 0.80 0.67 0.72 | 0.69 0.49 0.55 | 0.64 0.71 0.65 | 0.69 0.70 0.69 | 0.75 0.71 0.72
TABLE II
STATISTICAL DESCRIPTION OF THE EIGHT REAL DATASETS

TF     Res   Lbs (min, max, avg) (bp)   Nseq   Nbs   Nspp
CREB   H     (05, 30, 12)               17     19    1.12
SRF    H     (09, 22, 12)               20     35    1.75
MEF2   H     (07, 15, 10)               17     17    1.00
MyoD   H     (06, 06, 06)               17     21    1.24
ERE    M     (13, 13, 13)               25     25    1.00
E2F    M     (11, 11, 11)               25     27    1.08
CRP    E     (22, 22, 22)               18     24    1.33
GCN4   Y     (05, 15, 07)               09     21    2.33

Notations: Res is the resource: (H, M, Y, E) refer to (Human, Mouse, Saccharomyces cerevisiae, E. coli), respectively; Lbs is the length of the binding sites in bp; Nseq is the number of sequences in the dataset; Nbs is the number of binding sites in the dataset; and Nspp is the ratio of the number of binding sites per promoter sequence.
their high relevancy to this work; and (ii) MEME, AlignACE and WEEDER represent a very prominent group of tools developed on different state-of-the-art computational approaches. Additionally, our previous work [1] can be referred to for a comprehensive performance comparison among several state-of-the-art clustering-based approaches, namely: (i) a standard FCM-based approach [11]; (ii) a classical batch-learning SOM-based approach [17]; and (iii) our FSOM-based approach [1] for the DNA motif discovery task.
We also acknowledge that mining tools founded on different approaches and algorithms have different strengths and weaknesses, and that a comparison of their performance is not completely fair due to several unavoidable factors. Thus, perfect performance benchmarking is neither expected nor achievable, and the results reported here should serve only as references.
A. Results on real DNA datasets
Due to the significance of the results obtained on real
datasets in DNA motif discovery, we used eight real datasets
in the evaluation. These datasets, collected from [13], [18], are composed of real promoter sequences of co-regulated genes containing verified motifs (functional binding sites) of the ERE, MEF2, SRF, CREB, E2F, MyoD, CRP and GCN4 TFs. Each dataset contains a varying number of
sequences and one verified motif with known location of
its instances (known binding sites) in the sequences. These
datasets are useful to evaluate the tools with respect to the
original sequence properties in finding known motifs. The
statistical features of these datasets are given in Table II.
During each run of READcsf on a dataset, multiple FSOMs were trained with random map sizes between 10 × 10 and 20 × 20 for each consensus length k (kmin ≤ k ≤ kmax), such that (kmin, kmax) = (l − 3, l + 3), where l is the consensus length of the true motif in the dataset. Then, the top 10 candidates were set to be extracted from each map. The
composition weight associated with each node was randomly
initialized as 0 < λ(t0) < 1. The initial neighborhood range
σ(t0) = 3, fuzziness regulator m = 1.025 and maximum
epoch tmax = 100 were set for training as applied in [1].
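The multi-FSOM experiment setup above can be sketched as a small driver. The function below only assembles the per-k configurations (random map size, λ(t0), σ(t0), m, tmax) described in the text; train_fsom() and extract_top_candidates(), shown commented out, are hypothetical placeholders for the READcsf internals, which are not reproduced here.

```python
import random

def run_readcsf(sequences, true_consensus_len, seed=0):
    """Assemble one experiment run's FSOM training configurations."""
    rng = random.Random(seed)
    k_min, k_max = true_consensus_len - 3, true_consensus_len + 3
    candidates = []
    for k in range(k_min, k_max + 1):        # one FSOM per consensus length k
        side = rng.randint(10, 20)           # random map size in [10x10, 20x20]
        params = {
            "map_size": (side, side),
            "lambda_init": rng.uniform(0.01, 0.99),  # 0 < lambda(t0) < 1
            "sigma_init": 3,                 # initial neighborhood range
            "m": 1.025,                      # fuzziness regulator
            "t_max": 100,                    # maximum training epochs
        }
        # fsom = train_fsom(sequences, k, **params)          # hypothetical
        # candidates += extract_top_candidates(fsom, top=10) # hypothetical
        candidates.append((k, params))
    return candidates
```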
The training and parameter settings of READ, SOMBRERO and SOMEA were described in [1]. For the sake of a fair comparison, READcsf, READ, SOMBRERO and SOMEA were allowed to use the same map size, a random initialization of nodes, the same maximum number of epochs and the same expected motif width (k-mer length), with the top 10 candidates
AlignACE and WEEDER were run on these datasets using
the parameter settings detailed in [1]. The ‘best’ motif found
during each run of a tool on each dataset in terms of F -
measure was saved and the recall (R), precision (P) and F-
measure (F) rates obtained by these motifs were recorded.
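The recall, precision and F-measure rates can be computed as in the following minimal sketch, which uses nucleotide-level overlap counting, one common convention for motif benchmarks; the paper's exact counting scheme is the one defined in [1], so this is illustrative rather than authoritative.

```python
def evaluate_prediction(pred_sites, true_sites):
    """Nucleotide-level recall, precision and F-measure.

    Sites are (seq_id, start, end) half-open intervals; a true positive
    is a sequence position covered by both a predicted and a true site.
    """
    def positions(sites):
        return {(s, p) for s, a, b in sites for p in range(a, b)}

    P, T = positions(pred_sites), positions(true_sites)
    tp = len(P & T)
    recall = tp / len(T) if T else 0.0
    precision = tp / len(P) if P else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return recall, precision, f
```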
The average recall, precision and F -measure rates over
10 runs obtained by the tools on each dataset are presented
in Table I, showing that READcsf outperformed READ on seven of the eight datasets in terms of F-measure. This indicates the benefit of using CSFs, since their implementation is what distinguishes READcsf from READ. In comparison with the other SOM-based tools, READcsf (0.74) obtained a noticeable 25.7% improvement over SOMBRERO (0.55) and a 2.7% improvement over SOMEA (0.72) in terms of the average F-measure computed on these datasets. READcsf (0.77) also obtained a considerable 10.4% improvement in average recall rate over SOMBRERO (0.69), indicating a significantly improved ability to retrieve true binding sites. Likewise, it obtained a remarkable 31.9% and a 6.9% improvement in average precision rate over SOMBRERO and SOMEA, respectively.
In comparison with the other tools considered, the average
F -measure of READcsf (0.74) on these datasets shows a
significant improvement over MEME (0.65) and AlignACE
TABLE III
PERFORMANCE EVALUATION USING MULTIPLE MOTIF DATASETS
Average recall (R), precision (P) and F-measure (F) rates over 10 runs
(READcsf, READ: FSOM-based; SOMEA, SOMBRERO: SOM-based; MEME, WEEDER: other tools)

          3 TFs   READcsf          READ             SOMEA            SOMBRERO         MEME             WEEDER
                  R    P    F      R    P    F      R    P    F      R    P    F      R    P    F      R    P    F
Dataset1 CREB 0.43 0.28 0.34 0.39 0.29 0.33 0.43 0.26 0.33 0.44 0.26 0.33 0.20 1.00 0.33 0.00 0.00 0.00
MyoD 0.38 0.19 0.25 0.27 0.19 0.23 0.48 0.23 0.31 0.20 0.08 0.11 0.00 0.00 0.00 0.00 0.00 0.00
TBP 0.31 0.18 0.22 0.28 0.20 0.23 0.36 0.21 0.26 0.20 0.12 0.15 0.07 0.50 0.12 0.00 0.00 0.00
avg 0.37 0.22 0.27 0.31 0.23 0.26 0.42 0.23 0.30 0.28 0.15 0.20 0.09 0.50 0.15 0.00 0.00 0.00
Dataset2 HNF4 0.86 0.64 0.73 0.85 0.64 0.73 0.39 0.27 0.31 0.36 0.21 0.26 0.44 0.78 0.56 0.00 0.00 0.00
NFAT 0.40 0.32 0.36 0.36 0.29 0.32 0.57 0.40 0.47 0.63 0.39 0.48 0.60 0.82 0.69 0.40 1.00 0.57
SP1 0.53 0.44 0.48 0.53 0.43 0.47 0.50 0.53 0.50 0.53 0.35 0.42 0.38 0.54 0.44 0.00 0.00 0.00
avg 0.60 0.46 0.52 0.58 0.45 0.51 0.49 0.40 0.43 0.51 0.32 0.39 0.47 0.71 0.56 0.13 0.33 0.19
Dataset3 CAAT 0.46 0.25 0.32 0.28 0.20 0.23 0.43 0.21 0.25 0.32 0.17 0.22 0.29 0.80 0.42 0.00 0.00 0.00
MEF2 0.62 0.43 0.51 0.61 0.46 0.52 0.70 0.40 0.50 0.59 0.28 0.38 0.29 0.57 0.38 0.00 0.00 0.00
SRF 0.54 0.33 0.41 0.49 0.35 0.40 0.79 0.45 0.57 0.65 0.31 0.27 0.80 0.57 0.67 0.27 1.00 0.42
avg 0.54 0.34 0.41 0.46 0.33 0.38 0.64 0.35 0.44 0.52 0.25 0.29 0.46 0.65 0.49 0.09 0.33 0.14
Dataset4 HNF3B 0.24 0.17 0.19 0.21 0.14 0.17 0.68 0.39 0.48 0.73 0.48 0.57 0.41 0.88 0.56 0.00 0.00 0.00
NFKB 0.95 0.81 0.87 0.89 0.81 0.85 0.47 0.25 0.31 0.26 0.13 0.17 0.15 1.00 0.27 0.00 0.00 0.00
USF 0.65 0.52 0.57 0.61 0.52 0.56 0.71 0.47 0.56 0.66 0.46 0.54 0.80 0.57 0.67 0.33 1.00 0.50
avg 0.61 0.50 0.55 0.57 0.49 0.53 0.62 0.37 0.45 0.55 0.36 0.43 0.45 0.82 0.50 0.11 0.33 0.17
Dataset5 CMYC 0.94 0.73 0.82 0.94 0.75 0.83 0.61 0.37 0.46 0.49 0.33 0.36 0.40 0.75 0.52 0.40 1.00 0.57
EGR1 0.69 0.48 0.56 0.61 0.43 0.51 0.74 0.47 0.57 0.89 0.70 0.84 0.75 1.00 0.86 0.19 0.75 0.30
GATA3 0.50 0.37 0.42 0.46 0.37 0.41 0.66 0.36 0.47 0.47 0.26 0.33 0.64 0.81 0.72 0.00 0.00 0.00
avg 0.71 0.53 0.60 0.67 0.52 0.58 0.67 0.40 0.50 0.62 0.43 0.51 0.60 0.85 0.70 0.20 0.58 0.29
avg (5 datasets)  0.57 0.41 0.47   0.52 0.40 0.45   0.57 0.35 0.42   0.49 0.30 0.36   0.41 0.71 0.48   0.11 0.32 0.16
(0.69), and is also better than WEEDER (0.72). Noticeably, the average recall rate of READcsf (0.77) is significantly higher than that of MEME (0.64) and AlignACE (0.69), and better than that of WEEDER (0.75), even though these tools show similar average precision rates. Improving the recall rates without compromising the precision rates enabled READcsf to outperform the other tools considered.
Remark 2: It was previously shown in [1] that the operational complexity of the standard SOM (Ωsom) and the FSOM (Ωfsom) can be similar in practical implementations, i.e., Ωfsom ≈ Ωsom. This can be achieved by customizing FSOM learning without losing its integrity. We extend this understanding to READcsf training, which comprises: 1) B-MISCORE computation of k-mers; and 2) FSOM learning, giving the overall operational requirement ΩREADcsf = Ω(rb(K)) + Ωfsom. The first term imposes only a minor increase in training time due to its pre-computable nature, implying no major difference between the computation time of READcsf and that of the state-of-the-art SOMs. In a demonstration, the 10-run average training times of READcsf and a standard SOM on the eight datasets were found to be 115.62 s and 99.20 s, respectively, where the same number of nodes and a fixed number of cycles were set for a fair comparison on an Intel(R) Core(TM) i7-3612QM CPU @ 2.10 GHz machine.
B. Results on multiple motif datasets
Computational tools are expected to be capable of finding multiple motifs if these exist in the query set of input sequences. However, F-measure-based performance evaluation of a motif mining task requires knowing the specific locations of the instances (binding sites) of the different motifs in the set of sequences, as a prerequisite for computing recall and precision. To the best of our knowledge, it is difficult to find a set of co-regulated sequences with annotated locations of the binding sites of different transcription factors in the same sequence collection that can be applied to the quantitative performance evaluation of tools on a multiple motif mining task. Therefore, due to the lack of real datasets with such properties in the public databases, we adopted five artificial datasets from our previous studies [1], [5] in this evaluation. Applying these datasets serves two other purposes:
1) Each dataset contains twenty sequences of real pro-
moters taken from relevant species and each dataset
has three verified motifs, each for a different TF, and
the known motif instances are arbitrarily planted in
the promoters. These datasets are useful in evaluating
the tools in terms of simultaneously mining multiple
motifs to imitate a plausible scenario in real-world motif
mining, where the input set of promoter sequences may
harbour multiple functional motifs of different TFs.
2) Each dataset is composed of considerably long sequences and features a problematically low signal-to-noise ratio (≤ 0.0018). These challenging features test the ability of the tools to find motifs in large datasets in a simulated environment. Note that these results only serve as a reference rather than a complete scalability benchmarking of the tools, which is beyond the scope of this paper.
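The planted-motif construction described above can be illustrated with a toy helper. The placement rule used here (a uniformly random window per instance, overwritten in place) is an assumption for illustration only, not the exact procedure used to build the datasets in [1], [5].

```python
import random

def plant_motifs(promoters, motif_instances, seed=0):
    """Plant known motif instances into promoter sequences at random positions.

    promoters: list of sequence strings.
    motif_instances: dict mapping sequence index -> list of site strings.
    Returns the modified sequences and the (seq_id, start, end) locations,
    which serve as the ground truth for recall/precision evaluation.
    """
    rng = random.Random(seed)
    planted, locations = [], []
    for i, seq in enumerate(promoters):
        s = seq
        for site in motif_instances.get(i, []):
            start = rng.randrange(0, len(s) - len(site) + 1)
            s = s[:start] + site + s[start + len(site):]  # overwrite a window
            locations.append((i, start, start + len(site)))
        planted.append(s)
    return planted, locations
```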
READcsf, READ, SOMEA, SOMBRERO, MEME and WEEDER were run on these datasets. The training and parameter settings of these tools were kept the same as in the single motif discovery task. However, due to the increased size of the datasets, the SOM/FSOM-based tools were given a larger map size of 20×20, and all the tools were set to return the top 20 candidates during each run. The best motif for each TF in terms of F-measure was then recorded during each run of the tools on a dataset, and the average R, P and F-measure rates over 10 runs are presented in Table III.
The results show that READcsf obtained a noticeable 4.3% improvement in terms of the average F-measure and a remarkable 8.8% improvement in terms of the average recall rate over READ on these datasets. Their average precision rates were closely similar, i.e., READcsf (0.41) and READ (0.40), which is attributable to applying the same post-processing scheme for model refinement described in [1]. Thus, it is deducible that the improvement in the F-measures of READcsf over READ is caused by its higher recall of the true motif instances, which is potentially facilitated by the CSF-based similarity computation, revealing the usefulness of CSFs over traditional similarity metrics in FSOM-based motif discovery. READcsf also produced the best average F-measure among the SOM/FSOM-based tools considered. Remarkably, READcsf (0.47) obtained a noticeable 23.4% improvement over SOMBRERO (0.36) and a 10.6% improvement over SOMEA (0.42) in terms of the average F-measure on these datasets, demonstrating its potential to produce more useful mining results than the existing SOM/FSOM-based tools.
In comparison with the other tools, MEME obtained the best average F-measure on these datasets. Note that the SOM-based (and FSOM-based) tools face the following two major performance biases compared to the other tools (e.g., MEME) on multi-motif datasets: (i) the proper map size selection; and (ii) the k-mer length selection required to simultaneously satisfy multiple motifs, as elaborated in [1]. Despite these biases, READcsf (0.47) obtained an average F-measure similar to that of MEME (0.48). Noticeably, READcsf (0.57) obtained a 28.1% improvement in average recall rate over MEME (0.41), which is certainly advantageous in this complicated, low-performance motif discovery exercise.
Demonstration Using a Real Dataset: To assess the capability of the computational tool developed in this paper to discover multiple motifs, a real dataset containing the instances of multiple motifs is examined. Firstly, we collected sets of co-regulated sequences from the literature [19], [20] for the SWI4 and SWI6 transcription proteins. Then, we carefully selected only the common sequences (the intersection) of the two sets. This gives a sequence collection (named SWI4 SWI6) that contains the instances of both SWI4 and SWI6 TFs, however with no information on the specific locations of the binding sites in the sequences. The SWI4 SWI6 set contains 78 sequences with an average length of 717.5 bp each. The unavailability of the binding-site locations enables this discovery exercise to mimic a practical motif finding task.
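The intersection step above can be sketched as follows, assuming the two collections are represented as dictionaries keyed by gene name (a hypothetical representation; the paper does not specify the data format):

```python
def build_joint_set(swi4_seqs, swi6_seqs):
    """Keep only promoters present in both co-regulated collections.

    swi4_seqs, swi6_seqs: dict gene -> promoter sequence (assumed keying).
    Returns the common subset and simple summary statistics.
    """
    genes = sorted(set(swi4_seqs) & set(swi6_seqs))  # intersection of regulons
    seqs = {g: swi4_seqs[g] for g in genes}
    avg_len = (sum(len(s) for s in seqs.values()) / len(seqs)) if seqs else 0.0
    return seqs, {"n_seq": len(seqs), "avg_len_bp": avg_len}
```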
We ran READcsf, MEME and WEEDER on this dataset, and each tool was allowed to return at most the top 20 motifs during each run. After careful inspection of the returned motif lists and comparison with the verified logos collected from the JASPAR [21] database, it was observed that READcsf was able to find both motifs simultaneously in each run. For a qualitative comparison, the best sample logos discovered by these tools over 10 runs are presented in Fig. 5 alongside the verified logos from the JASPAR database, where READcsf shows a promising performance in discovering multiple motifs.
Noticeably, READcsf returned these two motifs within the top 5 candidate motifs in a run, while the other tools considered either failed to rank these motifs highly in their returned lists, or discovered one of the motifs and missed the other in a run. Thus, in order to quantitatively measure the motif-recognizability performance of these tools, we adopted the mean rank (φ) score from [8], computed as: φ(M) = [q(q + 1)/2] / Σ_{i=1}^{q} rank(M_i), where q is the number of relevant items (motifs) whose rank orders are considered; a higher φ(·) indicates better motif-recognizability.
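A minimal helper consistent with the φ formula above, where the input ranks are the 1-based positions of the q relevant motifs in a tool's returned list:

```python
def mean_rank(ranks):
    """Mean rank score phi = (q*(q+1)/2) / sum(ranks).

    phi equals 1.0 when the q relevant motifs occupy the top q positions
    of the returned list; lower values indicate worse recognizability.
    """
    q = len(ranks)
    return (q * (q + 1) / 2) / sum(ranks)
```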
We observed the following mean rank scores for the tools in finding these motifs over 10 runs:

          φ(SWI4)   φ(SWI6)   φ(SWI4, SWI6)
READcsf   0.61      0.60      0.79
MEME      0.45      0.42      0.41
WEEDER    0.44      0.45      0.42
These figures show that READcsf discovered both motifs at top ranks and outperformed the other tools. Previous studies [1], [2], [5] revealed that SOM-based clustering approaches are capable of returning multiple candidate motifs simultaneously in the same search, where multiple candidates are usually found to share partial representations of the same motif, and often they represent different motifs with significantly diverse properties. The latter observation underlies their usefulness in discovering multiple motifs in the query sequences if such motifs exist.
C. Robustness analysis
In the SOM-based tools, an improper map size degrades
the quality of clustering and motif mining performances. In
order to robustly handle the negative effects of the map size
setting, READ and READcsf adopted (i) FSOMs for soft-
partitioning in the k-mer dataset, and (ii) a post-processing
scheme [1] capable of quickly turning a noisy motif model into
a desired one by acquiring left-behind subtle motif elements
and by iteratively removing noise from the model. These two
mechanisms enabled READ to be more robust in handling
inaccurate map sizes than SOMBRERO and SOMEA, as
demonstrated in [1]. Similarly, READcsf is anticipated to demonstrate such robustness; however, the effect of involving CSFs on this aspect is of interest, and it is investigated in this section.
READcsf, READ, SOMBRERO and SOMEA were run on the real datasets using standard map sizes of 10 × 10, 15 × 15 and 20 × 20, while the other training parameters were kept the same as those applied in the single motif discovery task described in Section V-A. The 10-run average F-measure obtained by each tool for each map size on each dataset is presented in Table IV for comparison. Table IV also includes the standard deviation, as a robustness indicator, computed over the average F-measures obtained by each tool across the different map sizes. This shows that READcsf produces the smallest
[Figure: six motif logos in two rows — SWI4 (JASPAR), SWI4 (READcsf), SWI4 (MEME); SWI6 (JASPAR), SWI6 (READcsf), SWI6 (MEME)]
Fig. 5. Verified motif logos of SWI4 and SWI6 TFs collected from the JASPAR [21] database, compared with the logos discovered by READcsf and MEME.
TABLE IV
ROBUSTNESS ANALYSIS OF SOM/FSOM-BASED TOOLS IN HANDLING DIFFERENT MAP SIZES

        Average F-measure over 10 runs for different map sizes                          Standard deviation as robustness indicator
        READcsf             READ                SOMBRERO            SOMEA               READcsf  READ    SOMEA   SOMBRERO
TF      10x10 15x15 20x20   10x10 15x15 20x20   10x10 15x15 20x20   10x10 15x15 20x20   std(F)   std(F)  std(F)  std(F)
CREB 0.81 0.79 0.81 0.80 0.79 0.78 0.41 0.67 0.67 0.70 0.76 0.72 0.014 0.008 0.031 0.150
CRP 0.80 0.79 0.72 0.79 0.79 0.69 0.71 0.71 0.52 0.81 0.66 0.58 0.039 0.060 0.117 0.110
E2F 0.68 0.69 0.69 0.70 0.70 0.71 0.73 0.63 0.67 0.58 0.69 0.72 0.007 0.004 0.074 0.050
ERE 0.73 0.73 0.75 0.72 0.76 0.71 0.42 0.60 0.74 0.53 0.66 0.61 0.022 0.028 0.066 0.160
GCN4 0.49 0.49 0.46 0.53 0.50 0.49 0.44 0.52 0.60 0.41 0.51 0.58 0.017 0.018 0.085 0.080
MEF2 0.92 0.92 0.82 0.92 0.85 0.75 0.92 0.80 0.44 0.68 0.91 0.82 0.058 0.087 0.116 0.250
MyoD 0.51 0.53 0.49 0.50 0.52 0.44 0.23 0.42 0.49 0.32 0.49 0.47 0.017 0.043 0.093 0.135
SRF 0.83 0.82 0.76 0.82 0.81 0.72 0.67 0.72 0.71 0.70 0.77 0.71 0.039 0.055 0.038 0.026
std value in most of the cases, indicating better robustness than the other SOM/FSOM-based tools considered in handling improper map size settings. The noticeable improvement in such robustness of READcsf over the READ system can reasonably be attributed to the CSF implementation in the clustering process. This observation also indicates that the rational treatment of the signal and noise composition in the clusters by CSFs can further improve the robustness of clustering-based motif elicitation when applied together with post-processing schemes [1] especially designed for this task.
D. Discussion
This section discusses how motif elements are discriminated by CSFs through using signal and noise characteristics and quantifying their composition in the clusters. With the signal-type notations Mr = a random model, Mt = a true motif model, Kr = a random k-mer, and Kt = a true binding-site k-mer, a simplified and general interpretation of the CSF-based similarity measure can be described using the following four cases.
• Case 1: Θ(Kt,Mt) gives a smaller score caused by the
combined effects of (i) a good degree of motif properties
in Mt; (ii) consequently, a smaller degree of embedded
noise in the cluster represented by a larger λ value; and
(iii) a smaller [1− rb(Kt)] score indicating an inherently
higher dissimilarity of Kt to the backgrounds.
• Case 2: Θ(Kr,Mt) gives a larger score, contrasting Case 1, due to a larger [1 − rb(Kr)] score indicating a higher degree of noise resemblance of Kr.
• Case 3: Θ(Kt,Mr) gives a larger score contrasting case
1, caused by the combined effects of the following: (i) a
random model Mr is likely to have a higher degree of
noise embedded, causing a larger noise level represented
by a smaller λ value associated with Mr; and (ii) the
MISCORE-based similarity r(Kt,Mr) is larger due to
the absence of motif properties in a random model.
• Case 4: Θ(Kr,Mr) gives a larger score with some
degree of randomness contrasting case 1, caused by the
stochastic nature of r(Kr,Mr) quantification due to the
absence of motif properties in a random (noise) model
[8]. That is, the relationship between a random k-mer
and a noise-dominated model imposes some degree
of uncertainty in the modelling scheme, which is a
persistent problem in all existing approaches of signal
(motif) discrimination due to the special characteristics
of this problem. However, it is observed in [8] that
R(Mt) ≪ E[R(Mr)] holds, where E[·] is the expectation over a large number of random models, implying r(Kt,Mt) < r(Kr,Mr) and, consequently, Θ(Kt,Mt) < Θ(Kr,Mr) in the average case.
VI. CONCLUSIONS

A consistent interpretation of clusters, through an explicable distribution of k-mers with respect to the embedded signal and noise characteristics in the clusters, is a fundamental requirement for system clarity. This requirement had not previously been met, with the existing domain-specific similarity metrics playing a persistently problematic role. This work has addressed the problem by introducing composite similarity functions (CSFs) that are capable of measuring the degree of noise embedded in each cluster and of utilizing this information to discriminate putative motif clusters from noise-dominated ones during k-mer distribution throughout the training. This offers two significant benefits in SOM-based motif discovery: (i) an improved explicability of all clusters in the maps, with practical benefits in terms of improved motif mining results; and (ii) a new similarity measure that remedies several problematic aspects of the classical SOM-based approaches, which, by applying the existing similarity metrics, indiscriminately treat clusters with different degrees of signal and noise composition.
This paper has described the CSF implementation and introduced READcsf as an improved mining tool that has shown promising improvements in discovery results over READ, SOMBRERO, SOMEA and the other tools considered in the experiments, revealing the usefulness of the technical solutions presented. The outcome of this study may lead to new directions of future research on: (i) novel similarity metrics for DNA motif mining; and (ii) noise information-based clustering techniques in SOM/FSOM-based motif mining. Further research can also be conducted on the advanced characterization of noise and motif elements in biological datasets to benefit computational motif mining tools.
ACKNOWLEDGMENT

The authors are grateful to the anonymous reviewers for their insightful comments, which truly helped us to improve the quality of this publication. The authors also express their gratitude to our previous research group members, Dr Nung Kion Lee from Universiti Malaysia Sarawak, Malaysia, Dr Sean Li from CSIRO, Australia, and Dr Monther Alhamdoosh, for contributing to discussions and dataset collection.
REFERENCES

[1] D. Wang and S. Tapan, "A robust elicitation algorithm for discovering DNA motifs using fuzzy self-organizing maps," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1677–1688, October 2013.
[2] S. Mahony, D. Hendrix, A. Golden, T. J. Smith, and D. S. Rokhsar, "Transcription factor binding site identification using the self-organizing map," Bioinformatics, vol. 21, no. 9, pp. 1807–1814, May 2005.
[3] D. Liu, X. Xiong, Z.-G. Hou, and B. DasGupta, "Identification of motifs with insertions and deletions in protein sequences using self-organizing neural networks," Neural Networks, vol. 18, no. 5–6, pp. 835–842, June–July 2005.
[4] D. Liu, X. Xiong, B. DasGupta, and H. Zhang, "Motif discoveries in unaligned molecular sequences using self-organizing neural networks," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 919–928, July 2006.
[5] N. K. Lee and D. Wang, "SOMEA: self-organizing map based extraction algorithm for DNA motif identification with heterogeneous model," BMC Bioinformatics, vol. 12, no. Suppl 1, p. S16, February 2011.
[6] W. W. Wasserman and A. Sandelin, "Applied bioinformatics for the identification of regulatory elements," Nature Reviews Genetics, vol. 5, no. 4, pp. 276–287, 2004.
[7] K. D. MacIsaac and E. Fraenkel, "Practical strategies for discovering regulatory DNA sequence motifs," PLoS Computational Biology, vol. 2, no. 4, p. e36, April 2006.
[8] D. Wang and S. Tapan, "MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences," BMC Systems Biology, vol. 6, no. Suppl 2, p. S4, December 2012.
[9] G. D. Stormo and D. S. Fields, "Specificity, free energy and information content in protein-DNA interactions," Trends in Biochemical Sciences, vol. 23, no. 3, pp. 109–113, March 1998.
[10] D. Wang, "B-MISCORE: a new similarity metric for self-organization of DNA k-mers," LTU Technical Report, June 2013. [Online]. Available: http://homepage.cs.latrobe.edu.au/dwang/BMISCORE.pdf
[11] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Norwell, MA, USA: Kluwer Academic Publishers, 1981.
[12] M. M. Van Hulle, "Self-organizing maps," in Handbook of Natural Computing: Theory, Experiments, and Applications. Springer, 2011.
[13] Z. Wei and S. T. Jensen, "GAME: detecting cis-regulatory elements using a genetic algorithm," Bioinformatics, vol. 22, no. 13, pp. 1577–1584, April 2006.
[14] T. L. Bailey and C. Elkan, "Unsupervised learning of multiple motifs in biopolymers using expectation maximization," Machine Learning, vol. 21, no. 1, pp. 51–80, October/November 1995.
[15] F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church, "Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation," Nature Biotechnology, vol. 16, no. 10, pp. 939–945, October 1998.
[16] G. Pavesi, G. Mauri, and G. Pesole, "An algorithm for finding signals of unknown length in DNA sequences," Bioinformatics, vol. 17, no. Suppl 1, pp. S207–S214, April 2001.
[17] T. Kohonen, Self-Organizing Maps, Springer Series in Information Sciences. Berlin: Springer, 1995.
[18] J. Zhu and M. Zhang, "SCPD: a promoter database of the yeast Saccharomyces cerevisiae," Bioinformatics, vol. 15, no. 7, pp. 607–611, July/August 1999.
[19] C. T. Harbison, D. B. Gordon, T. I. Lee, et al., "Transcriptional regulatory code of a eukaryotic genome," Nature, vol. 431, no. 7004, pp. 99–104, September 2004.
[20] K. MacIsaac, T. Wang, D. B. Gordon, D. Gifford, G. Stormo, and E. Fraenkel, "An improved map of conserved regulatory sites for Saccharomyces cerevisiae," BMC Bioinformatics, vol. 7, no. 1, p. 113, March 2006.
[21] D. Vlieghe, A. Sandelin, P. J. De Bleser, et al., "A new generation of JASPAR, the open-access repository for transcription factor binding site profiles," Nucleic Acids Research, vol. 34, no. Database issue, January 2006.
Sarwar Tapan received his PhD in Computer Science from La Trobe University, Australia, in November 2013. He received his Bachelor of Computer Science from the University of Wollongong (Malaysia Campus), Australia, in 2004, and his Masters in Cognitive Sciences in 2008 from Universiti Malaysia Sarawak (UNIMAS). His research interests are in the applications of intelligent computing techniques in decision support systems, data visualization, data mining and business intelligence, predictive modelling, and biological sequence analysis, with an emphasis on the computational discovery of regulatory DNA motifs.
Dianhui Wang (M'03-SM'05) was awarded a Ph.D. from Northeastern University, Shenyang, China, in 1995.
From 1995 to 2001, he worked as a Postdoctoral Fellow with Nanyang Technological University, Singapore, and a Researcher with The Hong Kong Polytechnic University, Hong Kong, China. He joined La Trobe University in July 2001 and is currently a Reader and Associate Professor with the Department of Computer Science and Information Technology, La Trobe University, Australia. He is an adjunct Professor at The State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, China. His current research interests include data mining and computational intelligence techniques for bioinformatics and engineering applications, and randomized learning algorithms for big data modelling.
Dr Wang is a Senior Member of the IEEE, and serves as an Associate Editor for IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Cybernetics, Information Sciences, Neurocomputing, and the International Journal of Machine Learning and Cybernetics.