
FINAL VERSION OF TNNLS-2014-P-4075 1

A Further Study On Mining DNA Motifs Using Fuzzy Self-Organizing Maps

Sarwar Tapan and Dianhui Wang, Senior Member, IEEE

Abstract—SOM-based motif mining, despite being a promising approach for problem solving, mostly fails to offer a consistent interpretation of clusters with respect to the mixed composition of signal and noise in the nodes. The main reason behind this shortcoming is that the similarity metrics used in data assignment, which are specially designed with a biological interpretation for this domain, are not meant to consider the inevitable noise mixture in the clusters. This limits the explicability of the majority of clusters that are supposedly noise dominated, degrading the overall system clarity in motif discovery. This paper aims to improve the explicability of the learning process by introducing a Composite Similarity Function (CSF) that is specially designed for the k-mer-to-cluster similarity measure with respect to the degree of motif properties and embedded noise in the cluster. The proposed motif finding algorithm is built on our previous work READ [1] and is termed READcsf; it performs slightly better than READ and shows some remarkable improvements over the SOM-based SOMBRERO and SOMEA tools in terms of F-measure on the testing datasets. A real dataset containing multiple motifs is used to explore the potential of READcsf for more challenging biological data mining tasks. Visual comparisons with the verified logos extracted from the JASPAR database demonstrate that our algorithm is promising for discovering multiple motifs simultaneously.

Index Terms—Computational DNA motif discovery, Composite similarity metrics, Robust elicitation algorithms, Fuzzy self-organizing maps.

I. INTRODUCTION

In continuation of our previous study on fuzzy-SOM (FSOM)-based motif discovery [1], this paper addresses a persistent problem in the existing SOM-based tools (e.g., [2]–[5]): they commonly fail to address the practical fuzzy mixture of signal and noise in the clusters. These tools ignore the presence of noise in the clusters at all clustering states and optimize the clusters based only on their degree of motif properties, despite the known fact that, owing to the specific nature of the problem, almost every cluster contains some degree of noise. Ignoring the embedded noise limits the explicability of the noise-dominated clusters that occupy the largest portion of the maps, which is a common problem in any clustering-based approach to this task. The primary reason is that the existing similarity metrics are meant to consider the motif properties and ignore the embedded noise in the clusters during k-mer (k-length subsequence) assignment. Improvement in this aspect of SOM-based motif discovery is necessary and has motivated this study.

D. Wang is with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, Victoria 3086, Australia. e-mail: [email protected]

This paper extends our proposed READ framework [1] to address previously unsolved issues in this approach for motif discovery. Technical contributions of this paper include:

1) Improving the explicability aspect of clustering algorithms using SOM networks through introducing a new similarity metric which is designed to offer a rational treatment to the embedded noise-mixture in clusters; and

2) Investigating the learning behaviour of SOM networks for subtle pattern discovery task.

The nature of the problem necessitates describing two challenging properties of the k-mer dataset: (i) a considerably low signal-to-noise ratio [6], [7], causing noise-dominated clusters to largely populate the maps; and (ii) due to natural degeneration caused by evolutionary pressure, motif (signal) elements (binding sites) often closely resemble noise, which causes the unavoidable presence of some degree of noise in the putative motif clusters. Thus, an explicable clustering requires both signal and noise elements (and their characteristics) in the clusters to be considered jointly and complementarily, possibly through specially designed similarity metrics in the clustering process, in contrast to the existing similarity metrics that are mostly designed with a ‘signal only characterization’ approach for motif discovery.

The use of biologically inclined similarity metrics, such as MISCORE [8] and the log-likelihood metric [9], offers an explicable assignment of the putative binding site k-mers to putative clusters (i.e., clusters with a good degree of motif properties) through characterizing functional motif properties in the clusters [8]. Their use in the iterative optimization of clusters aims to consistently improve the degree of motif properties of the clusters irrespective of their dominant signal type. This causes a non-trivial inconsistency throughout the clustering process, since every noise-dominated cluster is pushed towards a better degree of motif properties in the same manner that only suits the optimization of putative motif clusters. Thus, applying such similarity metrics limits the interpretation to only the putative motif clusters, as the noise-dominated ones become inexplicable, imposing a major drawback in terms of system clarity.

This paper proposes a new and adaptive similarity metric, named Composite Similarity Function (CSF), which is designed to address the discrete composition of signal and noise in the clusters during k-mer distribution. The CSF-based similarity quantification between a k-mer and a cluster gives a composite similarity measure with respect to the current signal composition (noise level) of the cluster using two separate but complementary modelling schemes, connected through an adaptive composition weight. In CSF, the first component is our MISCORE [8], which is a useful signal modelling scheme with biological interpretation, capable of measuring the potential of a k-mer through characterising several motif properties of a cluster. The second one is a newly developed background signal modelling scheme, named B-MISCORE [10], which gives the similarity measure of a motif and its elements to the backgrounds through a large random sampling of the backgrounds (see preliminaries).

Technically, applying CSFs in clustering-based motif discovery offers the following benefits: (i) a consistent interpretation of all clusters in the system; (ii) a useful indication of the noise level of each cluster throughout the iterations, offering an effective monitoring of the ongoing clustering process; and (iii) a means of embedding a discrete optimization of the putative clusters throughout the iterations, potentially increasing the chances of obtaining more putative motif candidates (detailed in Section III-D).

The remainder of this paper is organized as follows. Section

II provides some preliminaries used in this paper; Section III

details the proposed CSFs; Section IV describes the READcsf

algorithm; Section V reports our experimental results; and

Section VI concludes this paper.

II. PRELIMINARIES

The Positional Frequency Matrix (PFM)-based motif model, denoted by $M$, is a matrix $M = [f(b_i, j)]_{4 \times k}$, where $b_i \in \chi = \{A, C, G, T\}$ and $j = 1, \ldots, k$, and each entry $f(b_i, j)$ represents the probability of nucleotide $b_i$ at the $j$-th position. Similarly, a k-mer $K_s = q_1 q_2 \ldots q_k$ is encoded as a binary matrix $K = [k(b_i, j)]_{4 \times k}$ with $k(q_j, j) = 1$ and $k(b_i, j) = 0$ for $b_i \neq q_j$.
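As an illustration of the two encodings above, the following Python sketch builds the binary matrix of a k-mer and averages several such matrices into a PFM. The function names `encode_kmer` and `pfm_from_kmers`, and the row order A, C, G, T, are assumptions made for this example and are not taken from READ.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed row order of the 4 x k matrices

def encode_kmer(kmer: str) -> np.ndarray:
    """Return the 4 x k binary matrix K with K[i, j] = 1 iff kmer[j] == ALPHABET[i]."""
    K = np.zeros((4, len(kmer)))
    for j, base in enumerate(kmer.upper()):
        K[ALPHABET.index(base), j] = 1.0
    return K

def pfm_from_kmers(kmers: list[str]) -> np.ndarray:
    """Average the binary encodings so that each column holds nucleotide frequencies."""
    return np.mean([encode_kmer(s) for s in kmers], axis=0)
```

For instance, `pfm_from_kmers(["AC", "AG"])` puts all mass on A in the first column and splits the second column evenly between C and G.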

A. Background modelling using B-MISCORE

B-MISCORE [10] is a new modelling scheme for evaluating a motif or its elements with respect to the backgrounds. Firstly, a large collection of random sets, denoted as $\zeta = \{G_1, G_2, G_3, \ldots\}$ with $|\zeta| \geq 1000$, is generated, where a random set $G_l$ consists of randomly grouped k-mers from the backgrounds, i.e., $G_l = \{K_1, K_2, K_3, \ldots\}$ with $25 \leq |G_l| \leq 50$. Then, the background probability of each $K \in \Gamma$, where $\Gamma$ is the k-mer dataset produced from the input sequences, is computed using a first order Markov chain transition matrix $\beta = [\pi(a, a')]_{4 \times 4}$ as,

$$P(K, M_B) = p(b_1) \prod_{\forall (a, a')} \pi(a, a')^{k(a, a')}, \tag{1}$$

where $k(a, a')$ gives the count of di-nucleotide $aa'$ in $K$, and $p(b_1)$ is the independent background probability of the nucleotide appearing at the first position in $K$. The background probabilities of the k-mers are globally normalized as,

$$P_n(K) = \frac{P(K) - \min_{\forall K \in \Gamma} P(K)}{\max_{\forall K \in \Gamma} P(K) - \min_{\forall K \in \Gamma} P(K)}. \tag{2}$$
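A minimal sketch of (1) and (2) in plain Python is given here for concreteness. It assumes that the first-position probabilities `p0` and the transition matrix `trans` have already been estimated from background sequences; both names, the A, C, G, T index order and the handling of the degenerate case in which all probabilities are equal are choices made for this example.

```python
ALPHABET = "ACGT"  # assumed index order for p0 and trans

def background_prob(kmer, p0, trans):
    """Eq. (1): p(b_1) multiplied by one transition probability per di-nucleotide."""
    idx = [ALPHABET.index(b) for b in kmer]
    prob = p0[idx[0]]
    for a, a_next in zip(idx, idx[1:]):
        prob *= trans[a][a_next]   # contributes pi(a, a') once per occurrence of aa'
    return prob

def min_max_normalise(probs):
    """Eq. (2): rescale the probabilities of all k-mers in the dataset to [0, 1]."""
    lo, hi = min(probs), max(probs)
    if hi == lo:                   # degenerate case, not covered by Eq. (2)
        return [0.0 for _ in probs]
    return [(p - lo) / (hi - lo) for p in probs]
```

Multiplying one transition per adjacent pair is equivalent to raising each $\pi(a, a')$ to the count $k(a, a')$, which is how (1) is written.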

The background score of $K_i \in \Gamma$ can then be measured using a random k-mer set, namely $G_l$, as,

$$d_B(K_i, G_l) = \frac{1}{|G_l|} \sum_{\forall K_p \in G_l} P_n(K)\, d(K_i, K_p), \tag{3}$$

where $d(K_i, K_p)$ is the Hamming distance [8] between two k-mers.

It can be deduced from (3) that $d_B(K_i, G_l)$ is a weighted measure of $K_i$ being a background class element with respect to all $K_p \in G_l$, where $d(K_i, K_p)$ applies the weight to the contribution of each $K_p$ in evaluating the similarity of $K_i$ to the backgrounds. Then, the large collection of random sets $\zeta$ is used to obtain a discriminative background score of $K_i$ (B-MISCORE), denoted $r_b(K_i)$, as,

$$r_b(K_i) = \min_{\forall G_l \in \zeta} d_B(K_i, G_l), \tag{4}$$

where a smaller $r_b(K_i)$ score represents a higher chance of $K_i$ being a background class element, and vice versa.
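The two steps in (3) and (4) can be sketched as follows. This is an illustration under two stated assumptions rather than the B-MISCORE implementation of [10]: the weight printed as $P_n(K)$ in (3) is read here as the normalised probability of the set member $K_p$, and the Hamming distance is normalised by the k-mer length so that scores stay in [0, 1]. The names `hamming`, `d_background` and `b_miscore` are hypothetical.

```python
def hamming(a: str, b: str) -> float:
    """Fraction of mismatching positions (normalising by length is an assumption)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def d_background(kmer, group, pn):
    """Eq. (3): mean of P_n(K_p) * d(K_i, K_p) over one random set G_l.

    `pn` maps each k-mer in the random set to its normalised background
    probability; reading the weight as P_n(K_p) is an assumption.
    """
    return sum(pn[kp] * hamming(kmer, kp) for kp in group) / len(group)

def b_miscore(kmer, groups, pn):
    """Eq. (4): the smallest d_B over all random sets in zeta."""
    return min(d_background(kmer, g, pn) for g in groups)
```

A k-mer that lies close to every member of at least one random set receives a small score, which matches the reading that a smaller $r_b$ indicates a background-like k-mer.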

For a given set ($S$) of k-mers, the B-MISCORE-based Model Score (BMMS) can be written as,

$$R_b(S) = \frac{1}{|S|} \sum_{\forall K \in M_S} r_b(K), \tag{5}$$

where a larger $R_b(*)$ score represents a higher potential of the model being a putative motif, and vice versa.

B. Fuzzy-SOM (FSOM) batch learning

Let $\Gamma$ be the set of all binary encoded k-mers from the input sequences, and let $N$ represent the number of nodes in the FSOM network, where the $j$-th node has a 2-dimensional grid coordinate $z_j = [z_{j1}, z_{j2}]$ and a node-PFM $M_j$. Then the batch update rule of FSOM can be written from [1] as,

$$M_j(t+1) = \frac{\sum_{i=1}^{|\Gamma|} \sum_{k=1}^{N} \mu_{ki}^{m}(t)\, h_{jk}(t)\, K_i}{\sum_{i=1}^{|\Gamma|} \sum_{k=1}^{N} \mu_{ki}^{m}(t)\, h_{jk}(t)}, \tag{6}$$

where $\mu_{ki}(t)$ is the fuzzy membership [11] of $K_i$ to $M_k(t)$ that can be computed as,

$$\mu_{ki}(t+1) = \left[ \sum_{l=1}^{N} \left( \frac{\Phi(K_i, M_k(t))}{\Phi(K_i, M_l(t))} \right)^{\frac{2}{m-1}} \right]^{-1}, \tag{7}$$

where $\Phi$ is a similarity metric and the exponent $m > 1$ controls the amount of fuzziness in $\mu$.

In (6), $h_{jk}(t)$ is the following neighbourhood function,

$$h_{jk}(t) = \exp\left( -\frac{\|z_j - z_k\|^2}{2\sigma(t)^2} \right), \tag{8}$$

where the neighbourhood range $\sigma(t)$ can be monotonically shrunk using the criterion mentioned in [12] as,

$$\sigma(t+1) = \sigma(t_0) \exp\left( -\frac{2\sigma(t_0)\, t}{t_{max}} \right), \tag{9}$$

where $\sigma(t_0)$ is a fairly large initial $\sigma$ and $t_{max}$ is the maximum epoch set by the user.
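The four rules (6) to (9) translate directly into array operations. The NumPy sketch below is one possible vectorisation, not the READ code; the function names and the array layouts (`phi[k, i]` for $\Phi(K_i, M_k)$, `mu[k, i]` for $\mu_{ki}$, `kmers[i]` for the 4 × k matrix of $K_i$) are assumptions, and $\Phi$ is assumed to be strictly positive and distance-like.

```python
import numpy as np

def memberships(phi, m=2.0):
    """Eq. (7): phi[k, i] = Phi(K_i, M_k) > 0; returns mu[k, i]."""
    ratio = phi[:, None, :] / phi[None, :, :]      # ratio[k, l, i] = phi[k, i] / phi[l, i]
    return 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=1)

def neighbourhood(coords, sigma):
    """Eq. (8): coords[j] = z_j; returns h[j, k]."""
    d2 = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def batch_update(kmers, mu, h, m=2.0):
    """Eq. (6): kmers has shape (n, 4, k); returns node PFMs of shape (N, 4, k)."""
    w = h @ (mu ** m)                              # w[j, i] = sum_k h[j, k] * mu[k, i]^m
    return np.einsum("ji,iab->jab", w, kmers) / w.sum(axis=1)[:, None, None]

def sigma_schedule(sigma0, t, t_max):
    """Eq. (9): neighbourhood range obtained from epoch t."""
    return sigma0 * np.exp(-2.0 * sigma0 * t / t_max)
```

Because every binary k-mer matrix has columns summing to one, the weighted average in `batch_update` keeps each node-PFM column normalised, and the memberships of each k-mer sum to one over the nodes.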


III. COMPOSITE SIMILARITY FUNCTIONS (CSFs)

The Composite Similarity Function (CSF) associated with a given $j$-th node at the $t$-th iteration can be written as,

$$\Theta_j(K_i, M_j(t)) = \lambda_j(t)\, r(K_i, M_j(t)) + (1 - \lambda_j(t))\,[1 - r_b(K_i)], \tag{10}$$

where $r(K_i, M_j(t))$ is the MISCORE-based similarity [8] between $K_i$ and $M_j(t)$ with respect to the motif properties in the $j$-th cluster and $r_b(K_i)$ is the B-MISCORE-based similarity measure of $K_i$ to the backgrounds. In (10), $\lambda_j(t)$ is the (adaptive) composition weight that reflects the current noise level in the node, where a higher value ($0 < \lambda_j(t) < 1$) represents signal domination over noise in the $j$-th cluster and sensibly assigns a higher weight to the $r(K_i, M_j(t))$ score in the CSF-based similarity measure, and vice versa.

B-MISCORE in (4) gives the similarity of $K_i$ to the backgrounds, where a smaller score indicates a higher similarity, i.e., a higher chance of $K_i$ being categorized as noise in the dataset. In the CSF-based similarity measure, the complement of the B-MISCORE score is taken as $[1 - r_b(K_i)]$, interpreting the dissimilarity of $K_i$ to the backgrounds. This makes it complementary to MISCORE in measuring the merit of the k-mers, where $r(K_i, M_j(t))$ gives how similar $K_i$ is to $M_j(t)$ with respect to the embedded motif properties in $M_j(t)$ and $[1 - r_b(K_i)]$ gives how dissimilar $K_i$ is from the backgrounds. This follows a sensible understanding that a higher dissimilarity of a k-mer from the backgrounds and a higher similarity of the k-mer to a given cluster with a good degree of motif properties conjunctively quantify a higher potential of the k-mer to be a putative motif element. In (10), a smaller $\Theta_j(K_i, M_j(t))$ score gives a higher composite-similarity between $K_i$ and $M_j(t)$, and vice versa.
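To make (10) concrete, the short NumPy sketch below evaluates the CSF for all node and k-mer pairs at once. It assumes the MISCORE values `r[j, i]` and the B-MISCORE values `rb[i]` are already available and lie in [0, 1]; the function name `csf` and the array layout are choices made for this example.

```python
import numpy as np

def csf(r, rb, lam):
    """Eq. (10): Theta[j, i] = lam[j] * r[j, i] + (1 - lam[j]) * (1 - rb[i])."""
    r = np.asarray(r, dtype=float)       # r[j, i]  = r(K_i, M_j), MISCORE
    rb = np.asarray(rb, dtype=float)     # rb[i]    = r_b(K_i), B-MISCORE
    lam = np.asarray(lam, dtype=float)   # lam[j]   = composition weight of node j
    return lam[:, None] * r + (1.0 - lam[:, None]) * (1.0 - rb[None, :])
```

With `lam` fixed at 1 the function returns `r` unchanged, which is the B-MISCORE omitted mode; with `lam` fixed at 0.5 both components contribute equally.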

A. Adaptation

The adaptation of composition weights is functionally required to reflect the changes in the noise level of each node throughout the clustering iterations. The composition weights are updated at the end of each iteration as,

$$\lambda_j(t+1) = \frac{\sum_{i=1}^{|\Gamma|} \mu_{ji}(t)\, r_b(K_i)}{\sum_{i=1}^{|\Gamma|} \mu_{ji}(t)\, [r(K_i, M_j(t)) + r_b(K_i)]}, \tag{11}$$

where $\Gamma$ is the k-mer dataset and $\mu_{ji}(t)$ is the fuzzy membership of the $i$-th k-mer to the $j$-th node at the $t$-th iteration that can be computed using the CSFs as,

$$\mu_{ji}(t) = \left[ \sum_{l=1}^{N} \left( \frac{\Theta_j(K_i, M_j(t))}{\Theta_l(K_i, M_l(t))} \right)^{\frac{2}{m-1}} \right]^{-1}, \tag{12}$$

where $\lambda_j(t)$ reflects the current noise-level in the $j$-th node and $N$ is the number of nodes in the network.
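The update in (11) is a ratio of two membership-weighted sums, which the following NumPy sketch computes for all nodes in one step. The function name `adapt_lambda` and the layouts `mu[j, i]`, `r[j, i]` and `rb[i]` are assumptions made for this example.

```python
import numpy as np

def adapt_lambda(mu, r, rb):
    """Eq. (11): new composition weight of every node.

    mu[j, i] : fuzzy membership of k-mer i to node j
    r[j, i]  : MISCORE r(K_i, M_j(t))
    rb[i]    : B-MISCORE r_b(K_i)
    """
    mu, r, rb = (np.asarray(x, dtype=float) for x in (mu, r, rb))
    numerator = mu @ rb                                    # sum_i mu[j, i] * rb[i]
    denominator = np.sum(mu * (r + rb[None, :]), axis=1)   # sum_i mu[j, i] * (r + rb)
    return numerator / denominator
```

Since both sums use the same memberships, dividing numerator and denominator by the total membership of the node leaves the result unchanged.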

Fig. 1. Conceptualisation of CSF-based clustering of k-mers in FSOM network for DNA motif discovery, where nodes are illustrated with discrete signal and noise composition.

Eq. (11) can be re-written by dividing the terms by $\Omega_j = \sum_{i=1}^{|\Gamma|} \mu_{ji}(t)$ as:

$$\lambda_j(t+1) = \frac{\Omega_j^{-1} \sum_{i=1}^{|\Gamma|} \mu_{ji}(t)\, r_b(K_i)}{\left[ \Omega_j^{-1} \sum_{i=1}^{|\Gamma|} \mu_{ji}(t)\, r(K_i, M_j(t)) \right] + \left[ \Omega_j^{-1} \sum_{i=1}^{|\Gamma|} \mu_{ji}(t)\, r_b(K_i) \right]} = \frac{R_b^f(M_j(t))}{R^f(M_j(t)) + R_b^f(M_j(t))}, \tag{13}$$

where $R^f(M_j(t))$ is the fuzzy extension of the MISCORE-based Motif Score (MMS), previously described in [1], for quantifying motif properties of a given fuzzy cluster; and $R_b^f(M_j(t))$ gives the background similarity score of the fuzzy cluster that can be rationally interpreted as the current degree of noise in the cluster, implying $\lambda_j(t)$ as the current noise level of the $j$-th cluster at the $t$-th iteration. Note that $\lambda_j(t+1) > \lambda_j(t)$ results from (11) when the motif characteristics increase and, consequently, the noise level of the $j$-th node decreases in terms of its dissimilarity to the backgrounds, while $\lambda_j(t+1) < \lambda_j(t)$ is caused by the opposite.

[Figure 2: four panels plotting Z(S_q, N_q) against epoch t (20 to 100) for q = 5, 10, 15 and 20; each panel compares the λAdaptive, λ0.5 and λ1.0 modes.]

Fig. 2. Discrimination of signal nodes from the noise nodes by different modes of adaptation in CSFs using CREB [13] transcription factor dataset.

The adaptation process enables $\lambda_j(t)$ to reveal a relative measure of signal and noise composition in the $j$-th cluster (for $1 \leq j \leq N$) at any clustering cycle $t$. During the initialization

stage of training, each cluster usually demonstrates a high

presence of noise and gradually some of the clusters get

improved in terms of motif modelling. The value of λ is

associated with the degree of mixture of signals and noise

in each node (cluster) that can be useful in discriminating

potential motif models from the random ones. Note that

this value is only a relative indicator rather than a physical

quantification of noise level in the cluster. The implementation

of CSFs in FSOM is conceptualized in Fig. 1.

Remark 1: In typical SOM-based algorithms for DNA motif discovery, CSFs can be applied to find the BMU (Best Matching Unit), denoted $c_i(t)$, for a given $K_i$ at the $t$-th iteration as $c_i(t) = \arg\min_{l} \Theta_l(K_i, M_l(t))$. The adaptation given in (11) can then be simplified for a crisp set of k-mers as

$$\lambda_j(t+1) = \frac{\sum_{\forall K \in V_j(t)} r_b(K)}{\sum_{\forall K \in V_j(t)} [r(K, M_j(t)) + r_b(K)]},$$

where $V_j(t)$ is the $j$-th crisp cluster produced at the $t$-th iteration.
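For a crisp SOM, Remark 1 can be sketched as below. The names `bmu` and `adapt_lambda_crisp` are hypothetical, and keeping the previous weight for a node whose crisp cluster is empty is an assumption of this example, since the remark does not cover that case.

```python
import numpy as np

def bmu(theta):
    """Best matching unit per k-mer: theta[l, i] = Theta_l(K_i, M_l)."""
    return np.argmin(theta, axis=0)

def adapt_lambda_crisp(assign, r, rb, lam_prev):
    """Crisp form of Eq. (11) over the clusters V_j defined by `assign`.

    assign[i] is the BMU index of k-mer i; r[j, i] and rb[i] are as in Eq. (10).
    Nodes with an empty cluster keep their previous weight (assumed behaviour).
    """
    lam = np.array(lam_prev, dtype=float)
    for j in range(r.shape[0]):
        members = np.flatnonzero(assign == j)
        if members.size:
            lam[j] = rb[members].sum() / (r[j, members] + rb[members]).sum()
    return lam
```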

B. Demonstration on signal-discrimination

The demonstration uses a 10 × 10 FSOM network trained on a k-mer (k = 12) dataset generated from a set of promoter sequences of co-regulated genes that contain a known motif of the CREB [13] transcription factor. The objective is to visualize the effectiveness of CSF adaptation in discriminating the putative (signal dominated) nodes from the non-putative (noise) nodes with respect to the following three modes of the composition parameter ($\lambda$):

1) $\lambda_{Adaptive}$ :⇒ adaptive composition as shown in (10).

2) $\lambda_{0.5}$ :⇒ equal weight composition in (10), yielding $\Theta_j(K_i, M_j(t)) = 0.5 \times r(K_i, M_j(t)) + 0.5 \times [1 - r_b(K_i)]$.

3) $\lambda_{1.0}$ :⇒ B-MISCORE omitted composition that rewrites (10) as $\Theta_j(K_i, M_j(t)) = r(K_i, M_j(t))$.

We applied the z-score to statistically measure the relative degree of discrimination between a set of signal nodes ($S_q$) and a set of noise nodes ($N_q$) as,

$$Z(S_q, N_q) = \frac{E\{Q(S_q)\} - E\{Q(N_q)\}}{\mathrm{std}\{Q(N_q)\} \times C}, \tag{14}$$

where $E\{*\}$ is the expectation on $q$ models, $Q(*)$ is the respective model quantification for the adaptation modes and $C = 3$ is a scaling constant for visualization.
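A small sketch of (14) using only the Python standard library follows. The function name `z_score` is hypothetical, and the use of the population standard deviation is an assumption, because the text does not state which estimator was used.

```python
from statistics import fmean, pstdev

def z_score(signal_scores, noise_scores, c=3.0):
    """Eq. (14): scaled gap between mean signal-node and mean noise-node scores.

    signal_scores : Q values of the q nodes taken as signal nodes
    noise_scores  : Q values of the q nodes taken as noise nodes
    c             : scaling constant C (3 in the text)
    """
    spread = pstdev(noise_scores)   # population std, an assumed choice
    return (fmean(signal_scores) - fmean(noise_scores)) / (spread * c)
```

The function is undefined when all noise-node scores are identical, since the spread in the denominator is then zero.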

In typical SOM-based motif discovery, a limited number

of top scoring models are extracted as putative signal nodes

from a trained map. Similarly in this demonstration, nodes are

evaluated at each iteration and a limited top q and the bottom

q scoring nodes are categorized as the putative signal and the

noise nodes, respectively, for Z(Sq, Nq) score computation.

The network is given the same initial state for each of the modes of adaptation during each run. This is repeated separately for $q \in \{5, 10, 15, 20\}$ and a 10-run average is presented in Fig. 2.

Observations: This visualization depicts that the adaptive

mode of the composition (λAdaptive), which is functionally

required in CSFs, offers a better discrimination of signal

nodes than the other two modes considered. This describes the

usefulness of the combination of background referencing by B-

MISCORE and the adaptive composition used in CSFs. Fig. 2

also depicts a rational decrease in the degree of discrimination

as q increases to a larger number, which agrees with the

distribution of signals in the nodes of the maps. Similar results

were observed on other datasets in our unreported experiments.

[Figure 3: two panels plotting mean λ(t) against training cycle t (0 to 100); each panel shows the mean ± std for the putative nodes and for the noise nodes.]

(a): Random initialization of the weights.

(b): Same initializing value of the weights, i.e., ∀λj(t0) = 0.5.

Fig. 3. Demonstration on the effects of different initialization of CSF weights.

C. Demonstration on CSF-initialization

In another experiment, we observed the impact of different initializations of the CSF weights, i.e., (i) random and (ii) same-value initialization, on the signal-discrimination ability. The experiment was set to separate $n$ putative signal nodes, which have a comparatively lower degree of noise, from the rest of the nodes, which are mostly noise-dominated, based on their respective noise-level indication. This allowed comparing the CSF adaptation for these two types of nodes, with supposedly opposite signal characteristics, throughout the iterations. In implementation, $n = 10$ nodes were first separated as signal nodes during each iteration using their respective noise-level indicator, while the rest of the nodes were categorized as noise nodes in the network. Then, for each initialization, $\mathrm{mean}\{*\}$ and $\mathrm{std}\{*\}$ of the CSF weights ($\lambda(t)$) of these two types of clusters were plotted iteration-wise in Fig. 3.

Observations: This visualization shows that CSF-adaptation

is capable of effectively discriminating the putative nodes

from the noise dominated ones in the map by assigning a

comparatively higher λ(t) value to the signal nodes throughout

the major portion of the training, regardless of the initialization

applied. That is, the adaptation receives a very minimal or a

negligible impact from the initialization of CSF weights, which

adds a supportive feature to its algorithmic robustness. Hence,

a random initialization of the weights can be conveniently

applied, as used in the experiments in this paper.

D. Benefits

This section describes the key benefits of the proposed CSF-based similarity measure, in contrast to the use of traditional similarity metrics in SOM-based (clustering-based) motif discovery, as follows.

1) The proposed CSFs address a major limitation of the state-of-the-art clustering approaches that inconsistently apply the same (analogous) optimization to the clusters with opposite signal characteristics, causing inexplicable noise clusters to largely populate the maps. CSFs resolve this inconsistency by refining the clusters to become more motif-like, depending on their signal composition. In other words, CSFs ensure that the degree of optimization applied to each cluster is directly related to the present degree of motif properties and embedded noise in that cluster during each iteration, offering a consistent interpretation of every cluster in the maps and, as a result, an improved system clarity.

2) The proposed CSF adaptation reveals the current noise level in each cluster (node) throughout the iterations, which enables the monitoring of the ongoing clustering process. In this aspect, the proposed similarity function is more useful than the traditional similarity metrics, which are not meant to: (i) reveal the degree of embedded noise in the clusters at any clustering iteration; or (ii) enable the monitoring of the quality of the ongoing clustering.

3) CSFs follow a critical argument that, if a putative binding site k-mer has a similar match to multiple putative clusters, then intuitively it is useful to consider their individual merit in terms of the present degree of motif properties and embedded noise in the clusters for a more appropriate assignment of the k-mer. In this manner, CSFs enable embedding a discrete optimization of the putative clusters throughout the iterations, which potentially increases the chances of producing more putative motif candidates in the maps. In contrast, a discrete optimization of the putative clusters in the state-of-the-art clustering-based approaches can only be applied as a separate process after the post-training clusters are evaluated by external motif scoring metrics.


E. FSOM learning using CSFs

The update equation of FSOM learning in READcsf is given in (6), which is the same as that used in READ [1]. However, the CSF implementation distinguishes the k-mer distribution through the fuzzy membership computation in READcsf as,

$$\mu_{li}(t) = \left[ \sum_{q=1}^{N} \left( \frac{\Theta_l(K_i, M_l(t))}{\Theta_q(K_i, M_q(t))} \right)^{\frac{2}{m-1}} \right]^{-1}. \tag{15}$$

The classical FCM objective [11] can then be written as,

$$J_m(t) = \sum_{i=1}^{|\Gamma|} \sum_{j=1}^{N} \mu_{ji}^{m}(t)\, \Theta_j(K_i, M_j(t)), \tag{16}$$

where $\Theta(*, *)$, $\Gamma$, $N$ and $m$ take the denotations previously described. The decrease of $J_m(t)$ can be monitored along with the increase of the performance coefficient $p_c(\mu(t)) = \frac{1}{|\Gamma|} \sum_{i=1}^{|\Gamma|} \sum_{j=1}^{N} \mu_{ji}(t)^2$ to stop FSOM training in READcsf when the neighborhood range is sufficiently shrunk [1].
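The two monitoring quantities can be computed from the membership matrix as in the sketch below, which assumes `mu[j, i]` holds the membership of k-mer `i` to node `j` and `theta[j, i]` the corresponding CSF value. The function names are hypothetical, and the stopping thresholds used in READcsf are not reproduced here.

```python
import numpy as np

def fcm_objective(mu, theta, m=2.0):
    """Eq. (16): sum over k-mers and nodes of mu^m * Theta."""
    mu, theta = np.asarray(mu, dtype=float), np.asarray(theta, dtype=float)
    return float(np.sum(mu ** m * theta))

def performance_coefficient(mu):
    """p_c: mean over k-mers of the squared memberships summed over nodes."""
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(mu ** 2) / mu.shape[1])
```

The coefficient equals 1 for a crisp assignment and 1/N when every k-mer is shared equally among N nodes, so a rising value indicates that the memberships are becoming sharper.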

IV. READcsf: IMPROVED READ WITH CSFs

The CSF-based similarity measure is implemented in the

FSOMs of our READ system [1] for motif discovery for

two reasons: (i) to demonstrate the usefulness of CSFs in

clustering-based motif discovery; and (ii) to obtain a new

motif mining tool (READcsf ) that benefits from the synergy

between (ii-a) the FSOM-based soft-clustering that addresses

the underlying fuzziness in the datasets and (ii-b) the CSF-

enabled treatment to the fuzzy signal and noise composition

of the clusters.

This section describes the READcsf algorithm (Robust Elicitation Algorithms for Discovering DNA Motifs using Composite Similarity Functions), emphasising: (i) the training of multiple FSOMs using the CSF-based similarity measure and (ii) the CSF-based motif scoring functions for node calibration. For the technical description of the components common to READ and READcsf, readers are referred to [1].

A. Overview

The 2-dimensional output grid of READcsf is a lattice of

N = R × C nodes, where R,C are the number of rows and

columns, respectively. The j-th node, j = 1, 2, . . . , R×C, has

a 2D coordinate zj = [zj1, zj2] in the lattice. The j-th node

is initialized with: (i) a randomly generated PFM Mj(t0); and

(ii) a randomly initialized composition weight 0 < λj(t0) < 1.

The learning steps at t-th iteration are then as follows:

• Membership computation: Compute the fuzzy member-

ship of each k-mer to every node using CSFs.

• Prototype updating: Update the node prototypes using

fuzzy membership distribution of k-mers and grid-based

neighborhood cooperation between the nodes.

• CSF adaptation: Update the composition weight λj(t) based on the current noise level of the j-th node.

Post-training nodes in the map are evaluated and ranked using the proposed CSF-based motif scoring metric given in (17). Multiple FSOMs are trained for variable k-mer lengths (kmin ≤ k ≤ kmax) because the length of the motif elements is unknown. The user-defined top T candidates are then returned as final motifs through an open competition between the candidates of different consensus lengths (k-mer lengths) extracted from the multiple FSOMs. An overview of the READcsf algorithm is presented in Fig. 4, illustrating the parallel training of multiple FSOMs.

Fig. 4. READcsf algorithm overview.

B. Candidate evaluation

The post-training nodes need calibration to identify the

putative candidates in the maps. The CSF-based similarity

measure gives a new motif scoring function that can be written

as,

$$Q(M_j(t')) = \frac{\sum_{i=1}^{|\Gamma|} \Theta_j(K_i, M_j(t')) \, \mu_{ji}(t')}{\sum_{i=1}^{|\Gamma|} \mu_{ji}(t')}, \tag{17}$$

where t′ (t′ ≤ tmax) is the final training iteration, λj(t′) is

the final noise level of the j-th node and µji(t′) is the final

fuzzy membership between Ki and Mj(t′).


By applying the CSF description given in (10) and by

algebraic derivation, (17) can be re-written as,

$$Q(M_j(t')) = \lambda_j(t') \, R^{f}(M_j(t')) + [1 - \lambda_j(t')] \left[ 1 - R^{f}_{b}(M_j(t')) \right], \tag{18}$$

where R^f(∗) is the fuzzy-MMS metric [1] for quantifying the degree of motif properties and [1 − R^f_b(∗)] is the inverse of the fuzzy-BMMS metric, which measures the dissimilarity of a fuzzy cluster from the backgrounds. Smaller values of both terms, combined through the final signal composition weight of the cluster, yield a smaller Q(∗) score, and a smaller Q(∗) score marks a fuzzy cluster as a putative motif candidate in the map. In (18), the following holds: 0 < λj(t′), R^f(∗), R^f_b(∗) < 1.
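The step from (17) to (18) can be checked numerically with a short Python sketch. Two assumptions are made here that the text above does not spell out: the CSF of (10) is taken to have the linear form λ·r(K, M) + (1 − λ)·[1 − rb(K)], as suggested by (18), and the fuzzy-MMS and fuzzy-BMMS values are taken to be membership-weighted means of r and rb. All function names are hypothetical.

```python
def csf(lam, r, rb):
    """Assumed linear form of the CSF in (10)."""
    return lam * r + (1.0 - lam) * (1.0 - rb)


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def node_score(lam, r_list, rb_list, mu_list):
    """Q as in (17): membership-weighted mean of the CSF scores."""
    thetas = [csf(lam, r, rb) for r, rb in zip(r_list, rb_list)]
    return weighted_mean(thetas, mu_list)


def node_score_decomposed(lam, r_list, rb_list, mu_list):
    """Q as in (18), using weighted means as stand-ins for R^f and R^f_b."""
    rf = weighted_mean(r_list, mu_list)
    rfb = weighted_mean(rb_list, mu_list)
    return lam * rf + (1.0 - lam) * (1.0 - rfb)


# Toy node with three k-mers (illustrative values only).
lam = 0.7
r_list = [0.10, 0.25, 0.60]   # k-mer-to-model scores r(K_i, M_j)
rb_list = [0.80, 0.70, 0.30]  # background scores r_b(K_i)
mu_list = [0.9, 0.6, 0.1]     # final memberships mu_ji(t')

q17 = node_score(lam, r_list, rb_list, mu_list)
q18 = node_score_decomposed(lam, r_list, rb_list, mu_list)
```

Under these assumptions both functions return the same value, because the weighted mean is linear in its arguments.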

C. Post-processing

The top T candidates (user defined) are selected to be

optimized by their grid-based neighboring nodes, followed by

their defuzzification (decoding), as applied in READ [1]. The

candidates extracted from multiple FSOMs are then refined

with the post-processing scheme detailed in [1]. The top T candidates are then returned as the final motifs through an open
contest among the motif candidates of different consensus
lengths extracted from the multiple maps.
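The pooling of candidates from maps trained with different k can be sketched as follows. The actual contest criterion is detailed in [1] and is not reproduced here; this sketch simply assumes that candidates are ranked by ascending Q(∗) score, since a smaller score marks a better candidate. The function name and data layout are hypothetical.

```python
def open_contest(candidates_by_k, top_t):
    """Pool (node, score) candidates from all maps and keep the top_t
    with the smallest score. candidates_by_k maps k -> list of (node, score)."""
    pooled = [
        (score, k, node)
        for k, candidates in candidates_by_k.items()
        for node, score in candidates
    ]
    pooled.sort()  # ascending score: smaller Q is better
    return [(k, node, score) for score, k, node in pooled[:top_t]]


# Toy candidates from two maps (illustrative values only).
candidates_by_k = {
    8: [("n3", 0.41), ("n7", 0.22)],
    9: [("n1", 0.30)],
}
winners = open_contest(candidates_by_k, top_t=2)
```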

D. Relation to READ System

It is useful to describe the relationship between READcsf and its predecessor, the READ system [1]. The READ system was introduced with a primary focus on addressing: (i) the underlying fuzziness in the characterizing features of the motif models; and (ii) the practically fuzzy association of the motif instances (binding sites) with multiple, different motif models. To address this inherent fuzziness of DNA motifs during their discovery, the READ system adopted modified Fuzzy Self-Organizing Maps (FSOMs) with an unsupervised soft-clustering approach and several heuristics-based post-processing schemes for effective motif mining [1].

In contrast, READcsf primarily focuses on addressing the

explicability aspect of the clusters in the map by introducing a

means of quantitatively measuring the signal and noise com-

position in a cluster at any given clustering state through the

use of a novel background-similarity measure [10]. READcsf

introduces a novel and sensible approach for cluster analysis

in motif discovery in addition to offering the features of its

predecessor.

V. PERFORMANCE EVALUATION

This section reports the performance evaluation with two objectives: (i) to demonstrate the benefits of applying CSFs through a performance comparison between READ and READcsf; and (ii) to study the usefulness of READcsf as a motif mining tool in comparison with other SOM-based tools, i.e., SOMBRERO [2] and SOMEA [5], and other prominent tools, namely MEME [14], AlignACE [15] and WEEDER [16]. This paper adopts the performance measures used in [1], i.e., the recall (R), precision (P) and F-measure (F) rates.
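For reference, the three rates can be computed from site-level counts as in the following Python sketch. It uses the standard definitions; how a predicted site is matched to a known binding site (i.e., what counts as a true positive) is specified in [1] and is treated here as given. The function name is hypothetical.

```python
def recall_precision_f(tp, fp, fn):
    """Standard recall, precision and F-measure from true-positive,
    false-positive and false-negative counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall == 0.0:
        return recall, precision, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Example: 8 known sites recovered, 2 spurious predictions, 2 sites missed.
r, p, f = recall_precision_f(tp=8, fp=2, fn=2)
```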

Algorithm 1 READcsf learning pseudocode

START
input: Γ, T, N = R × C, tmax.
ensure: Γ ≠ ∅; 5 ≤ R ≤ 100; 5 ≤ C ≤ 100.
1. Initialization:
   For the j-th node, j = 1, 2, . . . , N:
      Generate a random PFM Mj(t0).
      Allocate a 2D coordinate as zj = [zj1, zj2].
      Randomly initialize λj(t0), i.e., 0 < λj(t0) < 1.
2. Training:
   for t = 1 : tmax do
      ∆Mj(t)[4×k] ⇐ [0]4×k; ∆hj(t) ⇐ 0; j = 1, . . . , N.
      2.1. task: fuzzy membership computation
      for i = 1 : |Γ| do
         for l = 1 : N do
            µli(t) ⇐ [ Σ_{q=1}^{N} ( Θl(Ki, Ml(t)) / Θq(Ki, Mq(t)) )^{2/(m−1)} ]^{−1}.
         end for
      end for
      2.2. task: node updates computation
      for i = 1 : |Γ| do
         for j = 1 : N do
            for l = 1 : N do
               ∆Mj(t) ⇐ ∆Mj(t) + µli^m(t) hjl(t) Ki.
               ∆hj(t) ⇐ ∆hj(t) + µli^m(t) hjl(t).
            end for
         end for
      end for
      2.3. task: adaptation
      for j = 1 : N do
         Mj(t + 1) ⇐ ∆Mj(t) / ∆hj(t).
         λj(t + 1) ⇐ Σ_{i=1}^{|Γ|} µji(t) rb(Ki) / Σ_{i=1}^{|Γ|} µji(t) [r(Ki, Mj(t)) + rb(Ki)].
      end for
      σ(t + 1) ⇐ σ(t0) exp(−2σ(t0) t / tmax).
      Stop training if the termination condition is satisfied.
   end for
3. Motif extraction and post-processing:
   • Evaluate and rank each node using (17).
   • Extract top T candidates from the ranking.
   • Apply post-processing as described in [1].
   • Retain candidates for an open contest among the candidates extracted from multiple maps with different motif lengths.
END

Notations: |∗|: set cardinality; T: number of motifs to return; N: number of nodes in the map; tmax: maximum number of training epochs; t0: initialization stage of the map; Mj(t): node PFM of the j-th node at the t-th epoch; hjl(t): neighborhood function given in (8); and ∆Mj(t), ∆hj(t): two computing components for node updating.

Note: This pseudocode applies to FSOM training for a given consensus length k; multiple FSOM trainings are required for the user-defined range kmin ≤ k ≤ kmax.

A large number of algorithms and tools for DNA motif discovery exist in the current literature. However, owing to constraints on time and resources and to the need for succinctness, we selected a small set of tools for the quantitative evaluation, for the following reasons: (i) SOMBRERO and SOMEA were chosen because they are recent developments in SOM-based/clustering-based motif discovery and because of


TABLE I
PERFORMANCE EVALUATION USING REAL DATASETS

Average recall (R), precision (P) and F-measure (F) rates over 10 runs. READcsf and READ are FSOM-based tools; SOMEA and SOMBRERO are SOM-based tools; MEME, AlignACE and WEEDER are the other tools.

TF     READcsf         READ            SOMEA           SOMBRERO        MEME            AlignACE        WEEDER
       R    P    F     R    P    F     R    P    F     R    P    F     R    P    F     R    P    F     R    P    F
CRP    0.79 0.90 0.84  0.76 0.84 0.80  0.91 0.89 0.90  0.83 0.43 0.56  0.59 0.88 0.69  0.83 0.98 0.90  0.75 0.83 0.79
GCN4   0.46 0.59 0.50  0.48 0.70 0.55  0.69 0.45 0.54  0.80 0.41 0.53  0.52 0.52 0.52  0.61 0.62 0.60  0.64 0.87 0.73
ERE    0.85 0.63 0.72  0.92 0.59 0.71  0.74 0.58 0.65  0.80 0.59 0.67  0.72 0.82 0.77  0.75 0.77 0.76  0.76 0.54 0.63
MEF2   0.99 0.90 0.94  0.96 0.87 0.91  0.81 0.99 0.89  0.35 0.22 0.27  0.92 0.80 0.85  0.86 0.87 0.86  0.88 0.88 0.88
SRF    0.90 0.81 0.85  0.91 0.77 0.83  0.84 0.74 0.79  0.67 0.83 0.74  0.87 0.72 0.79  0.83 0.71 0.77  0.83 0.71 0.76
CREB   0.82 0.81 0.82  0.82 0.78 0.81  0.89 0.67 0.77  0.83 0.43 0.56  0.59 0.88 0.69  0.52 0.66 0.57  0.79 0.71 0.75
E2F    0.71 0.73 0.72  0.69 0.74 0.71  0.82 0.64 0.71  0.76 0.67 0.71  0.68 0.64 0.65  0.75 0.68 0.71  0.89 0.67 0.76
MyoD   0.67 0.43 0.52  0.65 0.42 0.51  0.66 0.39 0.49  0.50 0.32 0.39  0.23 0.38 0.27  0.34 0.31 0.32  0.43 0.50 0.46
avg    0.77 0.72 0.74  0.77 0.71 0.73  0.80 0.67 0.72  0.69 0.49 0.55  0.64 0.71 0.65  0.69 0.70 0.69  0.75 0.71 0.72

TABLE II
STATISTICAL DESCRIPTION OF THE EIGHT REAL DATASETS

TF     Res  Lbs (min, max, avg) (bp)  Nseq  Nbs  Nspp
CREB   H    (05, 30, 12)              17    19   1.12
SRF    H    (09, 22, 12)              20    35   1.75
MEF2   H    (07, 15, 10)              17    17   1.00
MyoD   H    (06, 06, 06)              17    21   1.24
ERE    M    (13, 13, 13)              25    25   1.00
E2F    M    (11, 11, 11)              25    27   1.08
CRP    E    (22, 22, 22)              18    24   1.33
GCN4   Y    (05, 15, 07)              09    21   2.33

Notations: Res is the resource, where (H, M, Y, E) refer to (Human, Mouse, Saccharomyces cerevisiae, E. coli), respectively; Lbs is the length of the binding sites in bp; Nseq is the number of sequences in the dataset; Nbs is the number of binding sites in the dataset; and Nspp is the ratio of the number of binding sites per promoter sequence.

their high relevancy to this work; and (ii) MEME, AlignACE and WEEDER represent a prominent group of tools built on different state-of-the-art computational approaches. Additionally, our previous work [1] provides a comprehensive performance comparison of several state-of-the-art clustering-based approaches for the DNA motif discovery task, namely: (i) a standard FCM-based [11] approach, (ii) a classical batch-learning SOM-based [17] approach and (iii) our FSOM-based [1] approach.

We also acknowledge that mining tools founded on different approaches and algorithms have different strengths and weaknesses, and that a comparison of their performance cannot be completely fair for several unavoidable reasons. Thus, a perfect performance benchmarking is neither expected nor achievable, and the results reported here should serve only as references.

A. Results on real DNA datasets

Due to the significance of the results obtained on real

datasets in DNA motif discovery, we used eight real datasets

in the evaluation. These datasets, collected from [13], [18],

are composed of the real promoter sequences of co-regulated

genes that contain verified motifs (functional binding sites)

that bind to ERE, MEF2, SRF, CREB, E2F, MyoD, CRP

and GCN4 TFs. Each dataset contains a varying number of

sequences and one verified motif with known location of

its instances (known binding sites) in the sequences. These

datasets are useful to evaluate the tools with respect to the

original sequence properties in finding known motifs. The

statistical features of these datasets are given in Table II.

During each run of READcsf on a dataset, multiple FSOMs were trained with random map sizes between 10 × 10 and 20 × 20 for each consensus length k (kmin ≤ k ≤ kmax) such that (kmin, kmax) = (l − 3, l + 3), where l is the consensus length of the true motif in the dataset. The top 10 candidates were extracted from each map. The composition weight associated with each node was randomly initialized as 0 < λ(t0) < 1. The initial neighborhood range σ(t0) = 3, fuzziness regulator m = 1.025 and maximum epoch tmax = 100 were set for training, as applied in [1].
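The settings above can be collected into a small configuration helper, sketched below in Python. The function and key names are hypothetical, and the sketch assumes square maps whose side length is drawn uniformly between 10 and 20, which the text does not state explicitly.

```python
import random


def read_csf_run_config(true_motif_length, rng=None):
    """Build one run's settings: one map per k in [l - 3, l + 3]."""
    rng = rng or random.Random()
    maps = []
    for k in range(true_motif_length - 3, true_motif_length + 4):
        side = rng.randint(10, 20)  # assumption: square map, side in [10, 20]
        maps.append({"k": k, "rows": side, "cols": side})
    return {
        "maps": maps,
        "top_candidates_per_map": 10,
        "sigma0": 3.0,   # initial neighborhood range
        "m": 1.025,      # fuzziness regulator
        "t_max": 100,    # maximum number of epochs
    }


# Example for the ERE dataset, whose binding sites are 13 bp long (Table II).
config = read_csf_run_config(13, random.Random(0))
```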

The training and parameter settings of READ, SOMBRERO and SOMEA were described in [1]. For the sake of a fair comparison, READcsf, READ, SOMBRERO and SOMEA were given the same map size, a random initialization of nodes, the same maximum number of epochs, the same expected motif width (k-mer length) and the same number of returned candidates (top 10) during each run on each dataset. MEME, AlignACE and WEEDER were also run on these datasets using the parameter settings detailed in [1]. The ‘best’ motif found during each run of a tool on each dataset in terms of F-measure was saved, and the recall (R), precision (P) and F-measure (F) rates obtained by these motifs were recorded.

The average recall, precision and F-measure rates over 10 runs obtained by the tools on each dataset are presented in Table I. READcsf outperformed READ on seven of the eight datasets in terms of F-measure, indicating the benefit of using CSFs, since the CSF implementation is what distinguishes READcsf from READ. In comparison with the other SOM-based tools, READcsf (0.74) obtained a 25.7% improvement over SOMBRERO (0.55) and a 2.7% improvement over SOMEA (0.72) in terms of the average F-measure computed on the datasets, where each percentage is the difference expressed relative to the READcsf score. READcsf (0.77) also obtained a 10.4% improved average recall rate over SOMBRERO (0.69), indicating an improved ability to retrieve true binding sites, and a 31.9% and a 6.9% improved average precision rate over SOMBRERO and SOMEA, respectively.
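The improvement percentages quoted above can be reproduced from the averages in Table I with the short Python sketch below. The normalization by the READcsf score is inferred from the reported numbers rather than stated in the text, and the function name is hypothetical.

```python
def relative_improvement(score, baseline):
    """Improvement of `score` over `baseline`, in percent of `score`."""
    return 100.0 * (score - baseline) / score


# Average rates from Table I.
f_vs_sombrero = round(relative_improvement(0.74, 0.55), 1)
f_vs_somea = round(relative_improvement(0.74, 0.72), 1)
recall_vs_sombrero = round(relative_improvement(0.77, 0.69), 1)
precision_vs_sombrero = round(relative_improvement(0.72, 0.49), 1)
precision_vs_somea = round(relative_improvement(0.72, 0.67), 1)
```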

In comparison with the other tools considered, the average

F -measure of READcsf (0.74) on these datasets shows a

significant improvement over MEME (0.65) and AlignACE


TABLE III
PERFORMANCE EVALUATION USING MULTIPLE MOTIF DATASETS

Average recall (R), precision (P) and F-measure (F) rates over 10 runs. Each dataset contains three TFs. READcsf and READ are FSOM-based tools; SOMEA and SOMBRERO are SOM-based tools; MEME and WEEDER are the other tools.

Dataset   TF      READcsf         READ            SOMEA           SOMBRERO        MEME            WEEDER
                  R    P    F     R    P    F     R    P    F     R    P    F     R    P    F     R    P    F
Dataset1  CREB    0.43 0.28 0.34  0.39 0.29 0.33  0.43 0.26 0.33  0.44 0.26 0.33  0.20 1.00 0.33  0.00 0.00 0.00
          MyoD    0.38 0.19 0.25  0.27 0.19 0.23  0.48 0.23 0.31  0.20 0.08 0.11  0.00 0.00 0.00  0.00 0.00 0.00
          TBP     0.31 0.18 0.22  0.28 0.20 0.23  0.36 0.21 0.26  0.20 0.12 0.15  0.07 0.50 0.12  0.00 0.00 0.00
          avg     0.37 0.22 0.27  0.31 0.23 0.26  0.42 0.23 0.30  0.28 0.15 0.20  0.09 0.50 0.15  0.00 0.00 0.00
Dataset2  HNF4    0.86 0.64 0.73  0.85 0.64 0.73  0.39 0.27 0.31  0.36 0.21 0.26  0.44 0.78 0.56  0.00 0.00 0.00
          NFAT    0.40 0.32 0.36  0.36 0.29 0.32  0.57 0.40 0.47  0.63 0.39 0.48  0.60 0.82 0.69  0.40 1.00 0.57
          SP1     0.53 0.44 0.48  0.53 0.43 0.47  0.50 0.53 0.50  0.53 0.35 0.42  0.38 0.54 0.44  0.00 0.00 0.00
          avg     0.60 0.46 0.52  0.58 0.45 0.51  0.49 0.40 0.43  0.51 0.32 0.39  0.47 0.71 0.56  0.13 0.33 0.19
Dataset3  CAAT    0.46 0.25 0.32  0.28 0.20 0.23  0.43 0.21 0.25  0.32 0.17 0.22  0.29 0.80 0.42  0.00 0.00 0.00
          MEF2    0.62 0.43 0.51  0.61 0.46 0.52  0.70 0.40 0.50  0.59 0.28 0.38  0.29 0.57 0.38  0.00 0.00 0.00
          SRF     0.54 0.33 0.41  0.49 0.35 0.40  0.79 0.45 0.57  0.65 0.31 0.27  0.80 0.57 0.67  0.27 1.00 0.42
          avg     0.54 0.34 0.41  0.46 0.33 0.38  0.64 0.35 0.44  0.52 0.25 0.29  0.46 0.65 0.49  0.09 0.33 0.14
Dataset4  HNF3B   0.24 0.17 0.19  0.21 0.14 0.17  0.68 0.39 0.48  0.73 0.48 0.57  0.41 0.88 0.56  0.00 0.00 0.00
          NFKB    0.95 0.81 0.87  0.89 0.81 0.85  0.47 0.25 0.31  0.26 0.13 0.17  0.15 1.00 0.27  0.00 0.00 0.00
          USF     0.65 0.52 0.57  0.61 0.52 0.56  0.71 0.47 0.56  0.66 0.46 0.54  0.80 0.57 0.67  0.33 1.00 0.50
          avg     0.61 0.50 0.55  0.57 0.49 0.53  0.62 0.37 0.45  0.55 0.36 0.43  0.45 0.82 0.50  0.11 0.33 0.17
Dataset5  CMYC    0.94 0.73 0.82  0.94 0.75 0.83  0.61 0.37 0.46  0.49 0.33 0.36  0.40 0.75 0.52  0.40 1.00 0.57
          EGR1    0.69 0.48 0.56  0.61 0.43 0.51  0.74 0.47 0.57  0.89 0.70 0.84  0.75 1.00 0.86  0.19 0.75 0.30
          GATA3   0.50 0.37 0.42  0.46 0.37 0.41  0.66 0.36 0.47  0.47 0.26 0.33  0.64 0.81 0.72  0.00 0.00 0.00
          avg     0.71 0.53 0.60  0.67 0.52 0.58  0.67 0.40 0.50  0.62 0.43 0.51  0.60 0.85 0.70  0.20 0.58 0.29
avg (5 datasets)  0.57 0.41 0.47  0.52 0.40 0.45  0.57 0.35 0.42  0.49 0.30 0.36  0.41 0.71 0.48  0.11 0.32 0.16

(0.69) and is also better than WEEDER (0.72). Noticeably,

the average recall rate of READcsf (0.77) is found to be

significantly higher than MEME (0.64) and AlignACE (0.69)

and better than WEEDER (0.75), even though similar average

precision rates of these tools are observed. Improvement in

the recall rates without compromising the precision rates

enabled READcsf to outperform the other tools considered.

Remark 2: It was previously shown in [1] that the operational

complexity of standard SOM (Ωsom) and FSOM (Ωfsom) can

be similar in practical implementation, i.e., Ωfsom ≈ Ωsom.

This can be achieved by customizing FSOM learning without

losing its integrity. We extend this understanding to READcsf training, which comprises: 1) B-MISCORE computation of k-mers; and 2) FSOM learning, giving its overall operational requirement as ΩREADcsf = Ω(rb(K)) + Ωfsom. Because the first term can be pre-computed, it imposes only a minor increase in training time, implying no major difference between the computation time of READcsf and that of the state-of-the-art SOMs. In a demonstration, the 10-run average training times of READcsf and standard SOMs on the eight datasets were 115.62 s and 99.20 s, respectively, with the same number of nodes and a fixed number of cycles set for a fair comparison on an Intel(R) Core(TM) i7-3612QM CPU @ 2.10 GHz machine.

B. Results on multiple motif datasets

Computational tools are expected to be capable of finding multiple motifs if these exist in the query set of input sequences. However, F-measure-based performance evaluation of a motif mining task requires knowledge of the specific locations of the instances (binding sites) of the different motifs in the set of sequences, as a prerequisite for computing recall and precision. To the best of our knowledge, it is difficult to find a set of co-regulated sequences, with annotated binding-site locations for different transcription factors in the same sequence collection, that can be used for a quantitative performance evaluation of tools in a multiple motif mining task. Therefore, owing to the lack of real datasets with such properties in the public databases, we adopted five artificial datasets from our previous studies [1], [5] in this evaluation. Applying these datasets serves two other purposes:

1) Each dataset contains twenty sequences of real pro-

moters taken from relevant species and each dataset

has three verified motifs, each for a different TF, and

the known motif instances are arbitrarily planted in

the promoters. These datasets are useful in evaluating

the tools in terms of simultaneously mining multiple

motifs to imitate a plausible scenario in real-world motif

mining, where the input set of promoter sequences may

harbour multiple functional motifs of different TFs.

2) Each dataset is composed of considerably large-length

sequences and features a problematically low signal-to-

noise ratio (≤ 0.0018). These challenging features test

the ability of the tools in finding motifs in large datasets

in a simulated environment. Note that these results only

serve as a reference rather than a complete scalability

benchmarking of the tools, which is beyond the scope

of this paper.

READcsf, READ, SOMEA, SOMBRERO, MEME and WEEDER were run on these datasets. The training and parameter settings of these tools were kept the same as in the single motif discovery task. However, due to the increased


size of datasets, the SOM/FSOM-based tools were given a

larger map size of 20×20, and all the tools were set to return

the top 20 candidates during each run. Then, the best motif

for each TF in terms of the F -measure was recorded during

each run of the tools on a dataset and the average R, P and

F -measure rates over 10 runs are presented in Table III.

The results show that READcsf obtained a 4.3% improvement in terms of the average F-measure and an 8.8% improvement in terms of the average recall rate over READ on these datasets. Their average precision rates were found to be closely similar, i.e., READcsf (0.41) and READ (0.40), which can be attributed to the same post-processing scheme for model refinement described in [1] being applied in both. Thus, it is deducible that the improvement in the F-measure of READcsf over READ is caused by its higher recall of the true motif instances, which is potentially facilitated by the CSF-based similarity computation, indicating the usefulness of CSFs over traditional similarity metrics in FSOM-based motif discovery. READcsf also produced the best average F-measure among the SOM/FSOM-based tools considered: READcsf (0.47) obtained a 23.4% improvement over SOMBRERO (0.36) and a 10.6% improvement over SOMEA (0.42) in terms of the average F-measure on these datasets, demonstrating its potential to produce more useful mining results than the existing SOM/FSOM-based tools.

In comparison with the other tools, MEME obtained the best average F-measure on these datasets. Note that the SOM-based (and FSOM-based) tools face two major performance biases compared with the other tools (e.g., MEME) on multi-motif datasets: (i) the proper map size selection and (ii) the k-mer length selection needed to satisfy multiple motifs simultaneously, as elaborated in [1]. Despite these biases, READcsf (0.47) obtained an average F-measure similar to that of MEME (0.48). Noticeably, READcsf (0.57) obtained a 28.1% improved average recall rate over MEME (0.41), which is advantageous in this complicated and low-performance motif discovery exercise.

Demonstration Using a Real Dataset: In order to learn the

capability of our computational tool developed in this paper

in discovering multiple motifs, a real dataset containing the

instances of multiple motifs is examined. Firstly, we collected

a set of co-regulated sequences from literature [19], [20] for

SWI4 and SWI6 transcription proteins. Then, we carefully

selected only the common sequences (intersection) from the

two sets of sequences. This gives a sequence collection (named

as SWI4 SWI6) that contains the instances of both SWI4

and SWI6 TFs, however with no information on the specific

locations of the binding sites in the sequences. The SWI4 SWI6 sequence set contains 78 sequences with an average length of 717.5 bp each. The unavailability of the binding-site locations in the sequences makes this discovery exercise mimic a practical motif finding task.

We ran READcsf, MEME and WEEDER on this dataset, and each tool was allowed to return at most the top 20 motifs during each run. After careful inspection of the list of motifs returned, and by comparing them with the verified logos of these motifs collected from the JASPAR [21] database, it was observed that READcsf was able to find both motifs simultaneously in each run. For a qualitative comparison, the best samples of logos discovered by these tools over 10 runs are presented in Fig. 5 alongside the verified logos from the JASPAR database, where READcsf has shown a promising performance in discovering multiple motifs.

Noticeably, READcsf returned those two motifs within the top 5 candidate motifs in a run, while the other tools considered either did not place these motifs at high ranks in their returned lists or discovered one of the motifs and missed the other in a run. Thus, in order to quantitatively measure this motif-recognizability performance of these tools, we adopted the mean rank (φ) score from [8], computed as

$$\phi(M) = \frac{q(q+1)/2}{\sum_{i=1}^{q} \mathrm{rank}(M_i)},$$

where q is the number of relevant items (motifs) whose rank orders are considered; a higher φ(∗) indicates better motif-recognizability.

We observed the following mean rank scores for the tools in finding these motifs over 10 runs:

Tool      φ(SWI4)  φ(SWI6)  φ(SWI4, SWI6)
READcsf   0.61     0.60     0.79
MEME      0.45     0.42     0.41
WEEDER    0.44     0.45     0.42
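The mean rank score can be computed for a single run as in the following Python sketch; the function name is hypothetical and the rank values are illustrative, not taken from the experiments. The score equals 1 when the q relevant motifs occupy ranks 1 to q and decreases as they are pushed further down the returned list.

```python
def mean_rank_score(ranks):
    """phi = (q (q + 1) / 2) / sum of the ranks of the q relevant motifs."""
    q = len(ranks)
    return (q * (q + 1) / 2) / sum(ranks)


best_case = mean_rank_score([1, 2])   # both motifs at the top of the list
pushed_down = mean_rank_score([2, 4])
single_motif = mean_rank_score([4])   # one relevant motif found at rank 4
```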

These figures show that READcsf ranked both motifs highly and outperformed the other tools. Previous studies [1], [2], [5] revealed that SOM-based clustering approaches are capable of returning multiple candidate motifs simultaneously in the same search, where multiple candidates are usually found to share a partial representation of the same motif and often represent different motifs with significantly diverse properties. The latter observation leads to their usefulness in discovering multiple motifs in the query sequences, if they exist.

C. Robustness analysis

In the SOM-based tools, an improper map size degrades

the quality of clustering and motif mining performances. In

order to robustly handle the negative effects of the map size

setting, READ and READcsf adopted (i) FSOMs for soft-

partitioning in the k-mer dataset, and (ii) a post-processing

scheme [1] capable of quickly turning a noisy motif model into

a desired one by acquiring left-behind subtle motif elements

and by iteratively removing noise from the model. These two

mechanisms enabled READ to be more robust in handling

inaccurate map sizes than SOMBRERO and SOMEA, as

demonstrated in [1]. READcsf is anticipated to demonstrate similar robustness. The effect of the CSFs on this robustness is of interest and is investigated in this section.

READcsf, READ, SOMBRERO and SOMEA were run on the real datasets using standard map sizes of 10 × 10, 15 × 15 and 20 × 20, while the other training parameters were kept the same as those applied in the single motif discovery task described in Section V-A. The 10-run average F-measure obtained by each tool for each map size on each dataset is presented in Table IV for comparison. Table IV also includes the standard deviation, as the robustness indicator, computed over the average F-measures obtained by each tool on the different map sizes. This shows that READcsf produces the smallest


Fig. 5. Verified motif logos of SWI4 and SWI6 TFs collected from the JASPAR [21] database are compared with the logos discovered by READcsf and MEME. Panels: SWI4 (JASPAR), SWI4 (READcsf), SWI4 (MEME); SWI6 (JASPAR), SWI6 (READcsf), SWI6 (MEME).

TABLE IV
ROBUSTNESS ANALYSIS OF SOM/FSOM-BASED TOOLS IN HANDLING DIFFERENT MAP SIZES

Average F-measure over 10 runs for different map sizes, and the standard deviation std(F) as robustness indicator.

       Average F-measure                                                                  std(F)
       READcsf             READ                SOMBRERO            SOMEA
TF     10x10 15x15 20x20   10x10 15x15 20x20   10x10 15x15 20x20   10x10 15x15 20x20   READcsf  READ   SOMEA  SOMBRERO
CREB   0.81  0.79  0.81    0.80  0.79  0.78    0.41  0.67  0.67    0.70  0.76  0.72    0.014    0.008  0.031  0.150
CRP    0.80  0.79  0.72    0.79  0.79  0.69    0.71  0.71  0.52    0.81  0.66  0.58    0.039    0.060  0.117  0.110
E2F    0.68  0.69  0.69    0.70  0.70  0.71    0.73  0.63  0.67    0.58  0.69  0.72    0.007    0.004  0.074  0.050
ERE    0.73  0.73  0.75    0.72  0.76  0.71    0.42  0.60  0.74    0.53  0.66  0.61    0.022    0.028  0.066  0.160
GCN4   0.49  0.49  0.46    0.53  0.50  0.49    0.44  0.52  0.60    0.41  0.51  0.58    0.017    0.018  0.085  0.080
MEF2   0.92  0.92  0.82    0.92  0.85  0.75    0.92  0.80  0.44    0.68  0.91  0.82    0.058    0.087  0.116  0.250
MyoD   0.51  0.53  0.49    0.50  0.52  0.44    0.23  0.42  0.49    0.32  0.49  0.47    0.017    0.043  0.093  0.135
SRF    0.83  0.82  0.76    0.82  0.81  0.72    0.67  0.72  0.71    0.70  0.77  0.71    0.039    0.055  0.038  0.026

std value in most of the cases, indicating better robustness than the other SOM/FSOM-based tools considered in handling improper map size settings. The improvement in robustness of READcsf over the READ system can reasonably be attributed to the CSF implementation in the clustering process. This observation also indicates that the rational treatment of the signal and noise composition of the clusters by CSFs can further improve such robustness in clustering-based motif elicitation when applied together with the post-processing schemes [1] designed for this task.
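The robustness indicator can be sketched in Python as the standard deviation of a tool's average F-measures across map sizes. The text does not state which estimator was used; the sample standard deviation is assumed here, and it reproduces, for example, the rounded CRP entries for SOMBRERO and SOMEA in Table IV. The function name is hypothetical.

```python
from statistics import stdev


def robustness_indicator(f_by_map_size):
    """Sample standard deviation of average F-measures over map sizes
    (assumed estimator); a smaller value indicates less sensitivity."""
    return stdev(list(f_by_map_size.values()))


# CRP rows of Table IV.
sombrero_crp = robustness_indicator({"10x10": 0.71, "15x15": 0.71, "20x20": 0.52})
somea_crp = robustness_indicator({"10x10": 0.81, "15x15": 0.66, "20x20": 0.58})
```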

D. Discussion

This section presents a discussion on how motif elements

are discriminated by CSFs through using signal and noise

characteristics and quantifying their composition in the clus-

ters. Given that the signal type notations read as: Mr = a

random model, Mt = a true motif model, Kr = a random

k-mer, and Kt = a true binding site k-mer, a simplified and

general interpretation of CSF-based similarity measure can be

described using the following four cases.

• Case 1: Θ(Kt,Mt) gives a smaller score caused by the

combined effects of (i) a good degree of motif properties

in Mt; (ii) consequently, a smaller degree of embedded

noise in the cluster represented by a larger λ value; and

(iii) a smaller [1− rb(Kt)] score indicating an inherently

higher dissimilarity of Kt to the backgrounds.

• Case 2: Θ(Kr,Mt) gives a larger score contrasting case
1, due to a larger [1 − rb(Kr)] score indicating a higher
degree of noise resemblance property of Kr.

• Case 3: Θ(Kt,Mr) gives a larger score contrasting case

1, caused by the combined effects of the following: (i) a

random model Mr is likely to have a higher degree of

noise embedded, causing a larger noise level represented

by a smaller λ value associated with Mr; and (ii) the

MISCORE-based similarity r(Kt,Mr) is larger due to

the absence of motif properties in a random model.

• Case 4: Θ(Kr,Mr) gives a larger score with some

degree of randomness contrasting case 1, caused by the

stochastic nature of r(Kr,Mr) quantification due to the

absence of motif properties in a random (noise) model

[8]. That is, the relationship between a random k-mer

and a noise-dominated model imposes some degree

of uncertainty in the modelling scheme, which is a

persistent problem in all existing approaches of signal

(motif) discrimination due to the special characteristics

of this problem. However, it is observed in [8] that R(Mt) ≪ E{R(Mr)} holds, where E{∗} is the expectation over a large number of random models, implying r(Kt,Mt) < r(Kr,Mr) and consequently, Θ(Kt,Mt) < Θ(Kr,Mr) in an average case.

VI. CONCLUSIONS

A consistent interpretation of clusters, through an explicable distribution of k-mers with respect to the embedded signal and noise characteristics of the clusters, is a fundamental requirement for system clarity. This requirement has not previously been met, and the existing domain-specific similarity metrics are a persistent obstacle. This work has addressed the problem by introducing composite similarity functions (CSFs) that are capable of measuring the degree of noise embedded in each cluster and utilizing this


information in discriminating putative motif clusters from the

noise dominated ones during k-mer distribution throughout the

training. This offers two significant benefits in SOM-based

motif discovery: (i) an improved explicability of all clusters

in the maps with practical benefits in terms of improved

motif mining results; and (ii) a new similarity measure to improve several problematic aspects of the classical SOM-based approaches, which indiscriminately treat clusters with different degrees of signal and noise composition because they apply the existing similarity metrics.

This paper has described the CSF implementation and introduced READcsf as an improved mining tool that has shown promising improvement in terms of discovery results over READ, SOMBRERO, SOMEA and the other tools considered in the experiments, indicating the usefulness of the technical solutions presented. The outcome of this study may lead to new directions of future research on: (i) novel similarity metrics for DNA motif mining, and (ii) noise-information-based clustering techniques in SOM/FSOM-based motif mining. Further research can also be conducted on advanced characterization of noise and motif elements in biological datasets to benefit computational motif mining tools.

ACKNOWLEDGMENT

The authors are grateful to the anonymous reviewers for their insightful comments, which helped to improve the quality of this publication. The authors also express their gratitude to their previous research group members, Dr Nung Kion Lee from Universiti Malaysia Sarawak, Malaysia, Dr Sean Li from CSIRO, Australia, and Dr Monther Alhamdoosh, for contributing to discussions and dataset collection.

REFERENCES

[1] D. Wang and S. Tapan, “A robust elicitation algorithm for discovering DNA motifs using fuzzy self-organizing maps,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1677–1688, October 2013.

[2] S. Mahony, D. Hendrix, A. Golden, T. J. Smith, and D. S. Rokhsar, “Transcription factor binding site identification using the self-organizing map,” Bioinformatics, vol. 21, no. 9, pp. 1807–1814, May 2005.

[3] D. Liu, X. Xiong, Z.-G. Hou, and B. DasGupta, “Identification of motifs with insertions and deletions in protein sequences using self-organizing neural networks,” Neural Networks, vol. 18, no. 5–6, pp. 835–842, June-July 2005.

[4] D. Liu, X. Xiong, B. DasGupta, and H. Zhang, “Motif discoveries in unaligned molecular sequences using self-organizing neural networks,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 919–928, July 2006.

[5] N. K. Lee and D. Wang, “SOMEA: self-organizing map based extraction algorithm for DNA motif identification with heterogeneous model,” BMC Bioinformatics, vol. 12, no. Suppl 1, p. S16, February 2011.

[6] W. W. Wasserman and A. Sandelin, “Applied bioinformatics for the identification of regulatory elements,” Nature Reviews Genetics, vol. 5, no. 4, pp. 276–287, 2004.

[7] K. D. MacIsaac and E. Fraenkel, “Practical strategies for discovering regulatory DNA sequence motifs,” PLoS Computational Biology, vol. 2, no. 4, p. e36, April 2006.

[8] D. Wang and S. Tapan, “MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences,” BMC Systems Biology, vol. 6, no. Suppl 2, p. S4, December 2012.

[9] G. D. Stormo and D. S. Fields, “Specificity, free energy and information content in protein-DNA interactions,” Trends in Biochemical Sciences, vol. 23, no. 3, pp. 109–113, March 1998.

[10] D. Wang, “B-MISCORE: a new similarity metric for self-organization of DNA k-mers,” LTU Technical Report, June 2013. [Online]. Available: http://homepage.cs.latrobe.edu.au/dwang/BMISCORE.pdf

[11] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms. Norwell, MA, USA: Kluwer Academic Publishers, 1981.

[12] M. M. Van Hulle, Handbook of Natural Computing: Theory, Experiments, and Applications. Springer, 2011, ch. Self-Organizing Maps.

[13] Z. Wei and S. T. Jensen, “GAME: detecting cis-regulatory elements using a genetic algorithm,” Bioinformatics, vol. 22, no. 13, pp. 1577–1584, April 2006.

[14] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine Learning, vol. 21, no. 1, pp. 51–80, October/November 1995.

[15] F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church, “Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation,” Nature Biotechnology, vol. 16, no. 10, pp. 939–945, October 1998.

[16] G. Pavesi, G. Mauri, and G. Pesole, “An algorithm for finding signals of unknown length in DNA sequences,” Bioinformatics, vol. 17, no. Suppl 1, pp. S207–S214, April 2001.

[17] T. Kohonen, Self-organizing maps, Springer series in information sciences. Berlin: Springer, 1995.

[18] J. Zhu and M. Zhang, “SCPD: a promoter database of the yeast Saccharomyces cerevisiae,” Bioinformatics, vol. 15, no. 7, pp. 607–611, July/August 1999.

[19] C. T. Harbison, D. B. Gordon, T. I. Lee, et al., “Transcriptional regulatory code of a eukaryotic genome,” Nature, vol. 431, no. 7004, pp. 99–104, September 2004.

[20] K. MacIsaac, T. Wang, D. B. Gordon, D. Gifford, G. Stormo, and E. Fraenkel, “An improved map of conserved regulatory sites for Saccharomyces cerevisiae,” BMC Bioinformatics, vol. 7, no. 1, p. 113, March 2006.

[21] D. Vlieghe, A. Sandelin, P. J. De Bleser, et al., “A new generation of JASPAR, the open-access repository for transcription factor binding site profiles,” Nucleic Acids Research, vol. 34, no. Database issue, January 2006.

Sarwar Tapan received his PhD in Computer Science from La Trobe University, Australia, in November 2013. He received his Bachelor of Computer Science from University of Wollongong (Malaysia Campus), Australia in 2004, and his Masters in Cognitive Sciences in 2008 from Universiti Malaysia Sarawak (UNIMAS). His research interests are in the applications of intelligent computing techniques in decision support systems, data visualization, data mining and business intelligence, predictive modelling and biological sequence analysis, with an emphasis on computational discovery of regulatory DNA motifs.

Dianhui Wang (M’03-SM’05) was awarded a Ph.D. from Northeastern University, Shenyang, China, in 1995.

From 1995 to 2001, he worked as a Postdoctoral Fellow with Nanyang Technological University, Singapore, and a Researcher with The Hong Kong Polytechnic University, Hong Kong, China. He joined La Trobe University in July 2001 and is currently a Reader and Associate Professor with the Department of Computer Science and Information Technology, La Trobe University, Australia. He is an adjunct Professor at The State Key Laboratory of Synthetical Automation of Process Industries, Northeastern University, China. His current research interests include data mining and computational intelligence techniques for bioinformatics and engineering applications, and randomized learning algorithms for big data modelling.

Dr Wang is a Senior Member of IEEE and serves as an Associate Editor for IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Cybernetics, Information Sciences, Neurocomputing, and International Journal of Machine Learning and Cybernetics.