
A Fast Gene Expression Analysis using Parallel Biclustering and Distributed Triclustering Approach

Tulika Kakati* [email protected]
Hasin A Ahmed† [email protected]
Dhruba K Bhattacharyya‡ [email protected]
Jugal K Kalita§ [email protected]

ABSTRACT
Biclustering, or simultaneous clustering, aims to mine a G×S dataset row-wise and column-wise into groups of genes coexpressed across a subset of conditions. Triclustering is a recent advancement in unsupervised learning, which groups genes under a subset of conditions and time points over the G×S×T plane. With the growing size of data, the cost of tricluster extraction becomes too high, which demands a cost-effective triclustering method that distributes the computational load to obtain optimal results. This paper presents a fast shared memory biclustering and shared nothing triclustering analysis architecture to analyze gene expression data and identify coexpressed patterns of high biological significance over the G×S×T plane. The proposed triclustering approach has been found able to identify shifted, scaled, and shifted-and-scaled coexpressed patterns at minimum cost over several benchmark datasets.

CCS Concepts
• Mathematics of computing → Mathematical optimization; • Theory of computation → Parallel algorithms; Distributed algorithms;

Keywords
Distributed, biclusters, triclusters, shifting-and-scaling patterns, coexpressed.

* Tulika Kakati, Department of CSE, Tezpur University, Napaam, Assam, India 784028
† Hasin A Ahmed, Department of CSE, Tezpur University, Napaam, Assam, India 784028
‡ Dhruba K Bhattacharyya, Department of CSE, Tezpur University, Napaam, Assam, India 784028
§ Jugal K Kalita, Department of Computer Science, University of Colorado, CO 80933, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ICTCS '16, March 04-05, 2016, Udaipur, India
© 2016 ACM. ISBN 978-1-4503-3962-9/16/03...$15.00
DOI: http://dx.doi.org/10.1145/2905055.2905182

1. INTRODUCTION
Gene expression is the mechanism by which information from a gene is used in the synthesis of gene products such as proteins and functional RNA. Microarray technologies make it possible to represent a repository of gene expression levels in a matrix format. Depending on the format, a microarray dataset can be classified as a gene-sample (G×S) dataset or a gene-sample-time (G×S×T) dataset. The latter assesses genes under multiple experimental conditions at varying time points. A G×S dataset can be represented as a real-valued gene expression matrix, $M = \{m_{ij} \mid 1 \le i \le n,\ 1 \le j \le m\}$, where G and S represent the sets of genes and samples respectively and $m_{ij}$ represents the expression value of gene $g_i$ for sample $s_j$. Similarly, a G×S×T microarray dataset can be represented as a real-valued gene expression cube, $C = \{C_{ijk} \mid 1 \le i \le m,\ 1 \le j \le n,\ 1 \le k \le l\}$, where each cell $C_{ijk}$ represents the expression value of gene $g_i$ under condition $s_j$ at time point $t_k$.

The study of co-expressed genes helps in determining various types of correlations, such as shifting, scaling, and shifting-and-scaling, among genes and diseases across time spaces [4]. Given two gene expressions $g_1 = [a_1, a_2, a_3, \ldots, a_n]$ and $g_2 = [b_1, b_2, b_3, \ldots, b_n]$, the correlations between these two patterns may be expressed as follows:

• Shifting patterns: if $a_i = p + b_i$, $i = 1, 2, 3, \ldots, n$, where p is the additive constant.

• Scaling patterns: if $a_i = q \times b_i$, $i = 1, 2, 3, \ldots, n$, where q is the multiplicative constant.

• Shifting-and-scaling patterns: if $a_i = q \times b_i + p$, $i = 1, 2, 3, \ldots, n$, where p is the additive constant and q is the multiplicative constant.
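The three pattern types listed above can be checked numerically. The following is a minimal sketch, not part of the proposed method: the function name `pattern_type` and the tolerance `tol` are hypothetical, and the check assumes noise-free profiles.

```python
import math


def pattern_type(a, b, tol=1e-9):
    """Classify the relation between profiles a and b (hypothetical helper).

    Returns "shifting" if a_i = p + b_i, "scaling" if a_i = q * b_i,
    "shifting-and-scaling" if a_i = q * b_i + p, and "none" otherwise.
    """
    # Shifting: the difference a_i - b_i is the same constant p for every i.
    diffs = [x - y for x, y in zip(a, b)]
    if all(math.isclose(d, diffs[0], abs_tol=tol) for d in diffs):
        return "shifting"
    # Scaling: the ratio a_i / b_i is the same constant q for every i.
    if all(y != 0 for y in b):
        ratios = [x / y for x, y in zip(a, b)]
        if all(math.isclose(r, ratios[0], abs_tol=tol) for r in ratios):
            return "scaling"
    # Shifting-and-scaling: estimate q and p from the first two points,
    # then verify a_i = q * b_i + p on every point.
    if b[1] != b[0]:
        q = (a[1] - a[0]) / (b[1] - b[0])
        p = a[0] - q * b[0]
        if all(math.isclose(x, q * y + p, abs_tol=tol) for x, y in zip(a, b)):
            return "shifting-and-scaling"
    return "none"


b = [1.0, 2.0, 4.0, 3.0]
shifted = [y + 5 for y in b]          # p = 5
scaled = [2 * y for y in b]           # q = 2
both = [2 * y + 5 for y in b]         # q = 2, p = 5
```

Real expression data is noisy, so an exact test like this one is only illustrative; a similarity measure with a threshold is what makes the idea usable in practice.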

In past years, several clustering methods [10, 11] have been proposed to classify groups of genes [8]. However, these traditional clustering approaches cannot find genes that exhibit shifting-and-scaling correlations. Later, biclustering algorithms [9, 22, 1, 19, 2, 25, 27] were proposed to mine biclusters over a subspace of conditions.

However, these approaches have difficulty in analysing gene expressions beyond a particular time point. In [28], a clustering method is proposed to mine genes across a subset of samples and time points, introducing the triclustering approach. Later, more triclustering algorithms [13, 14, 3, 6, 24, 23, 12, 16, 26, 20, 15, 7] were introduced to find clusters of genes over subsets of conditions and time points on the basis of some homogeneity criterion. However, these triclustering algorithms are also not free from the aforesaid limitations. This has motivated us to develop a fast triclustering algorithm that can mine not only shifting patterns and scaling patterns, but also genes with shifting-and-scaling patterns. Our method uses the SSSim metric [2], which can detect pairs of genes with high correlation in any gene expression data. The proposed triclustering algorithm can find co-expressed genes exhibiting shifting-and-scaling patterns and also takes care of inter-temporal coherence. In summary, it is able to identify coexpressed patterns of high biological significance over the G×S×T plane.

Although triclustering algorithms perform better in the analysis of gene expression data, the triclustering problem is NP-hard, because triclustering is an extension of biclustering, which is itself an NP-hard problem [18]. Moreover, with the increasing size of data, there is a growing demand for a cost-effective architecture to mine co-expressed genes over the G×S×T space in minimum execution time. A parallel biclustering and distributed triclustering approach can improve the performance of gene expression analysis on large datasets. This paper presents a shared memory shared nothing architecture to effectively find triclusters from a G×S×T dataset with less execution time.

1.1 Contribution
The major contributions of this paper are listed below.

• A parallel version of a biclustering algorithm is proposed to mine co-expressed genes over the G×S space.

• A distributed triclustering algorithm to mine co-expressed genes over G×S×T is reported.

• A shared memory shared nothing client-server architecture is proposed to parallelize the triclustering algorithm, thereby reducing the computational overhead.

• Both biological validation and execution time performance analysis are provided.

The rest of the paper is organised as follows. Section 2 gives an overview of existing triclustering algorithms and the parallelization of these algorithms. Section 3 elaborates the proposed parallel method of triclustering and the proposed architecture. Section 4 reports the experimental results to establish the effectiveness of the proposed architecture. Finally, Section 5 concludes the paper with possible future work.

2. RELATED WORKS
The increasing volume of information makes clustering of large data a challenging task. To address this problem, parallelization of clustering has become an emerging research problem in recent times. In [29], a parallel version of the K-means clustering algorithm based on Hadoop MapReduce is proposed. Hadoop MapReduce is a programming technique for processing large datasets and is scalable with increasing volume of data. In [17], algorithms for the parallelization of hierarchical clustering using distance metrics are described. The parallelization of hierarchical clustering gives more significant results than sequential algorithms.

A biclustering algorithm performs simultaneous clustering of genes over the sample space. Biclustering algorithms are more complex than traditional clustering algorithms on large datasets. Over time, research has been done to develop effective clustering algorithms that can mine genes over the sample space and scale to large datasets. In [30], a parallel version of a biclustering algorithm, named ParRescue, has been proposed and implemented using MPI on a cluster of 64 nodes. ParRescue distributes the load among processors, which makes it scalable over a large number of nodes. This implementation shows linear speedups with increasing numbers of computation nodes and dataset size as compared to the sequential algorithm [30].

Triclustering is a recent research topic in the analysis of gene expression data and, to the best of our knowledge, there is only one parallel version of an existing triclustering algorithm, named Partricluster [4], which is parallelized using the filter-stream paradigm of the Anthill platform. Anthill is a programming framework to support the implementation and evaluation of P2P applications. The parallelization gives better results in terms of scalability and execution time. However, to the best of our knowledge, no client-server model has been developed for the parallelization of a triclustering algorithm. So, in this paper, we propose a client-server based shared memory shared nothing architecture to mine gene expression data and identify biologically significant coexpressed triclusters over the G×S×T space.

3. PROPOSED METHOD
We introduce a parallel version of a triclustering algorithm that first mines biclusters using a parallel (shared memory) approach and then extracts the triclusters using a distributed (shared nothing) approach. Our triclustering method works on the G×S×T space and is able to identify clusters of genes exhibiting shifting, scaling, and shifting-and-scaling patterns. During bicluster and tricluster extraction, we use the measure defined in [2], named SSSim (shifting-and-scaling), which can detect gene pairs having high correlation. The measure is based on the local mean (lmean) of a gene $g_i \in G$, defined as the mean of $g_i(c_{j-1})$, $g_i(c_j)$ and $g_i(c_{j+1})$ for condition $c_j \in C$ such that $j \ne 1$ and $j \ne n$, where n is the total number of conditions. If $g_1$ and $g_2$ are two gene expressions such that $g_1 = \{a_1, a_2, a_3, a_4, \ldots, a_n\}$ and $g_2 = \{b_1, b_2, b_3, b_4, \ldots, b_n\}$, then SSSim($g_1$, $g_2$) is given by the following equation.

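$$SSSim(g_1, g_2) = 1 - \frac{\sum_{i=2}^{n-1} \frac{X}{Y}}{n - 2} \tag{1}$$

where

$$X = \left| \frac{a_{i+1} - a_i}{a_2 - a_1} - \frac{b_{i+1} - b_i}{b_2 - b_1} \right| \tag{2}$$

$$Y = 2 \cdot \max\left( \left| lmean_i - \frac{a_{i+1} - a_i}{a_2 - a_1},\ lmean_i - \frac{b_{i+1} - b_i}{b_2 - b_1} \right| \right) \tag{3}$$

For the sample i = 2,

$$lmean_i = mean\left( \frac{a_{i+1} - a_i}{a_2 - a_1},\ \frac{b_{i+1} - b_i}{b_2 - b_1},\ \frac{a_{i+2} - a_{i+1}}{a_2 - a_1},\ \frac{b_{i+2} - b_{i+1}}{b_2 - b_1} \right) \tag{4}$$

For the sample i = n − 1,

$$lmean_i = mean\left( \frac{a_i - a_{i-1}}{a_2 - a_1},\ \frac{b_i - b_{i-1}}{b_2 - b_1},\ \frac{a_{i+1} - a_i}{a_2 - a_1},\ \frac{b_{i+1} - b_i}{b_2 - b_1} \right) \tag{5}$$

Otherwise,

$$lmean_i = mean\left( \frac{a_i - a_{i-1}}{a_2 - a_1},\ \frac{b_i - b_{i-1}}{b_2 - b_1},\ \frac{a_{i+1} - a_i}{a_2 - a_1},\ \frac{b_{i+1} - b_i}{b_2 - b_1},\ \frac{a_{i+2} - a_{i+1}}{a_2 - a_1},\ \frac{b_{i+2} - b_{i+1}}{b_2 - b_1} \right) \tag{6}$$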
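Equations (1) to (6) can be turned into a short function. The sketch below is one possible reading and not the authors' implementation. It rests on three assumptions: every difference of $g_2$ is normalised by $b_2 - b_1$, the absolute value in Equation (3) is applied to each argument of max separately, and a term with Y = 0 (which then also has X = 0) contributes nothing to the sum. The name `sssim` and the requirement n ≥ 4 are also choices made for this sketch.

```python
def sssim(a, b):
    """Sketch of SSSim(g1, g2) for two expression profiles a and b."""
    n = len(a)
    if n != len(b) or n < 4:
        raise ValueError("need two profiles of equal length n >= 4")
    if a[1] == a[0] or b[1] == b[0]:
        raise ValueError("a2 - a1 and b2 - b1 must be non-zero")

    # da[i] is (a_{i+1} - a_i) / (a_2 - a_1) with the paper's 1-based index i.
    da = {i: (a[i] - a[i - 1]) / (a[1] - a[0]) for i in range(1, n)}
    db = {i: (b[i] - b[i - 1]) / (b[1] - b[0]) for i in range(1, n)}

    total = 0.0
    for i in range(2, n):                      # i = 2 .. n-1
        if i == 2:                             # Equation (4)
            window = [i, i + 1]
        elif i == n - 1:                       # Equation (5)
            window = [i - 1, i]
        else:                                  # Equation (6)
            window = [i - 1, i, i + 1]
        terms = [da[k] for k in window] + [db[k] for k in window]
        lmean = sum(terms) / len(terms)

        x = abs(da[i] - db[i])                                  # Equation (2)
        y = 2 * max(abs(lmean - da[i]), abs(lmean - db[i]))     # Equation (3)
        if y > 0:
            total += x / y
    return 1 - total / (n - 2)                 # Equation (1)


a = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]
affine = [3 * v + 2 for v in a]        # shifting-and-scaling copy of a
other = [1.0, 2.0, 9.0, 3.0, 8.0, 1.0]
```

Under these assumptions a profile and any shifted-and-scaled copy of it score 1, because the normalised differences coincide and every X is zero.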

3.1 The Proposed Architecture

The proposed architecture for tricluster extraction is given in Figure 1. It extracts triclusters from the G×S×T space in six distinct steps using a client-server framework. In step (1), the server partitions the GST dataset vertically with common gene names. The GS planes are then distributed among the clients in step (2) for parallel biclustering. For each GS plane, biclusters are extracted by each client machine and the server itself in step (3). These biclusters are then sent back by the clients to the server in step (4). The server merges them in step (5). Finally, in step (6), the server extracts triclusters from these biclusters, considering the inter-temporal coherence threshold.

Figure 1: Shared Memory Shared Nothing Client-Server Architecture

We employ the parallelization strategy using a shared memory shared nothing architecture. We use a cluster of n machines, assigning one of the machines as the Master (the server) and the remaining (n − 1) machines as clients. A shared nothing architecture is defined between the server and the clients to distribute the load of extracting biclusters among them, implemented with TCP/IP communication between the server and each client. The shared memory architecture is used by each client to extract biclusters from each GS plane, implemented with the parallel toolbox of the MATLAB programming environment. Figure 1 gives the architecture for the parallelization of our algorithm. The basic steps of this strategy are illustrated below.

1. The server machine accepts the GST dataset and uses the following input parameters: the SSSim threshold, $min_g$, $min_s$, and the inter-temporal coherence threshold (ρ). The server then vertically partitions the GST dataset with common gene names across different time points.

2. The server then distributes the load between the clients and itself.

3. The server extracts biclusters for its GS planes, and the clients do the same job independently and in parallel.

4. The clients send back the biclusters for each GS plane to the server.

5. The server then merges all these biclusters and subsequently extracts triclusters.
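Steps 1 and 2 above can be sketched as follows. This is an illustration only: the nested-list layout `cube[i][j][k]` (gene i, sample j, time point k), the function names, and the round-robin assignment are assumptions made for this sketch. The actual load distribution policy may differ.

```python
def split_into_gs_planes(cube):
    """Split a gene x sample x time cube into one GS plane per time point.

    cube[i][j][k] is the expression of gene i under sample j at time k.
    The result is a list of planes, where planes[k][i][j] == cube[i][j][k].
    """
    n_time = len(cube[0][0])
    return [
        [[gene[j][k] for j in range(len(gene))] for gene in cube]
        for k in range(n_time)
    ]


def assign_planes(n_planes, n_machines):
    """Assign plane indices to machines round-robin (machine 0 is the server)."""
    assignment = {machine: [] for machine in range(n_machines)}
    for k in range(n_planes):
        assignment[k % n_machines].append(k)
    return assignment


# 2 genes x 2 samples x 3 time points
cube = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
planes = split_into_gs_planes(cube)
plan = assign_planes(n_planes=14, n_machines=3)
```

Every plane keeps the full gene axis, which is what lets the server later intersect biclusters from different time points by gene name.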

Some definitions used to describe the bicluster and tricluster algorithms are reported next.

Definition 1. Mean Column Matrix: It is the matrix M of order 1 × n where M(i) is the mean of the column $c_i$.

Definition 2. Matrix Strength: It is the average of the similarity scores of all the pairs of genes over a subset of p features. It is given by $mS = \sum_{i=1}^{n} SSSim(g_i, g_1)/n$, where $g_1$ is the matrix of the means of columns $c_{j-1}$, $c_j$ and $c_{j+1}$, taking base columns from j = 2 to (p − 1), and n is the total number of genes.

Definition 3. Initial module: An initial module, IM, is extracted from a GS plane as a sub-matrix of genes. An IM consists of three columns, say $c_y$, $c_{y-1}$ and $c_{y+1}$, where $c_y$ is the base column and $c_{y-1}$ and $c_{y+1}$ are neighboring columns, such that the average of the similarity scores of all gene pairs across all three columns is greater than or equal to the SSSim threshold, i.e., mS is at least the SSSim threshold.

Definition 4. Bicluster: A bicluster, $Bi_{G'S'}$, is a group of co-expressed genes, extracted by each client, over a subspace of samples defined by two dimensions $G'$ and $S'$, where $G' \subseteq G$ and $S' \subseteq S$, such that for each pair of genes ($g_i$, $g_j$) in the bicluster $B_c$, SSSim($g_i$, $g_j$) is at least the SSSim threshold.

Definition 5. Tricluster: A tricluster, $Ci_{G'S'T'}$, is a group of co-expressed genes, extracted by the server, over a subspace of samples and time points defined by three dimensions $G'$, $S'$ and $T'$, where $G' \subseteq G$, $S' \subseteq S$ and $T' \subseteq T$, such that for each pair of genes ($g_i$, $g_j$) in the tricluster $Tri_c$, SSSim($g_i$, $g_j$) is at least the SSSim threshold and the inter-temporal coherence satisfies $Im \ge \rho$. Each tricluster is the result of merging the biclusters extracted in parallel from several GS planes by each client, on the basis of the inter-temporal coherence threshold.

Definition 6. Inter-temporal coherence: Inter-temporal coherence (Im) is defined as the coherence (homogeneity) among genes across different gene-sample planes. It is given by $Im = \sum_{i=1}^{n} SSSim(g_{iP}, g_{it})/n$, where $g_{iP}$ and $g_{it}$ are the genes of the average plane P and of the GS plane at time point $t_j$ respectively.

Definition 7. Average plane: An average plane, or mean representative plane, is a matrix of order m × n holding the means of the expression values of m genes across a subset of n conditions over t GS planes. If P is the average plane, then for gene $g_i$ and sample $s_j$, we can express P as follows.

$$P(i, j) = \frac{\sum_{k=1}^{t} exp\_val(g_i^{jk})}{t} \tag{7}$$
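Definitions 6 and 7 translate directly into code. The sketch below assumes that the GS planes are given as nested lists `planes[k][i][j]` (time point k, gene i, sample j) and that a similarity function such as SSSim is passed in by the caller. Both function names are hypothetical.

```python
def average_plane(planes):
    """Average plane P of Equation (7): element-wise mean over the t GS planes."""
    t = len(planes)
    n_genes = len(planes[0])
    n_samples = len(planes[0][0])
    return [
        [sum(plane[i][j] for plane in planes) / t for j in range(n_samples)]
        for i in range(n_genes)
    ]


def inter_temporal_coherence(planes, k, similarity):
    """Im of Definition 6 for the GS plane at time index k.

    Averages similarity(gene i in P, gene i in plane k) over all genes.
    """
    avg = average_plane(planes)
    plane = planes[k]
    return sum(similarity(avg[i], plane[i]) for i in range(len(plane))) / len(plane)


planes = [
    [[1, 2], [3, 4]],   # GS plane at the first time point
    [[3, 4], [5, 6]],   # GS plane at the second time point
]
P = average_plane(planes)
```

Passing the similarity as an argument keeps the sketch independent of any particular SSSim implementation. A partial tricluster would be kept only when the resulting value is at least ρ.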

3.2 Generation of Biclusters and Triclusters
Our architecture includes two basic algorithms, namely Generate biclusters and Generate triclusters. These algorithms are described in Algorithm 1 and Algorithm 2. After distribution of the GS planes to each client, the server and the clients run the Generate biclusters algorithm to extract biclusters from each plane. These biclusters are then merged by the server to extract triclusters using the Generate triclusters algorithm in steps (5) and (6).

Input: GST dataset with m genes and n samples, SSSim threshold, $min_g$
Output: Bicluster set $B_t$ for all time points or for a time point t
Initialization: $B_t = \emptyset$, $p_B = \emptyset$, where $p_B$ is the set of partial biclusters;
Compute Mean Column Matrix M for columns $c_2$ to $c_{n-1}$;
foreach (column $c_j$, j = 2 to j = n − 1) do
    compute mS of the similarity scores of all gene pairs with $c_{j-1}$ and $c_{j+1}$ as neighboring columns;
end
Arrange the indices of the columns in decreasing order of mS;
Taking each column $c_j$ with $c_{j-1}$ and $c_{j+1}$, generate initial modules (IMs);
foreach (IM) do
    compute mS;
    while mS < SSSim threshold and number of genes > $min_g$ do
        delete the gene for which mS < SSSim threshold;
        compute mS;
        if (mS ≥ SSSim threshold) then
            move IM to $p_B$;
        else
            no biclusters can be extracted from this IM;
        end
    end
end
foreach (partial bicluster $p_{bcl}$ in $p_B$) do
    extraFeatures = all features ∉ $p_{bcl}$;
    foreach (feature ef ∈ extraFeatures) do
        add ef ∈ extraFeatures to $p_{bcl}$;
        $newp_{bcl}$ = $p_{bcl}$;
        compute mS of $newp_{bcl}$;
        if (mS ≥ SSSim threshold) then
            $p_{bcl}$ = $newp_{bcl}$;
        else
            discard the feature;
        end
    end
    if $p_{bcl} \not\subseteq C$, where C ∈ $B_t$ then
        move $p_{bcl}$ to $B_t$
    end
end

Algorithm 1: Generate biclusters
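The gene-pruning loop of Algorithm 1 can be illustrated with a small sketch. It is an interpretation rather than the authors' code. The reference profile is taken to be the column means of the current module, the gene with the lowest score is deleted while mS stays below the threshold, and `toy_similarity` is a simple stand-in for SSSim. All names are hypothetical.

```python
def toy_similarity(x, y):
    """Stand-in for SSSim (an assumption): 1 / (1 + mean absolute difference)."""
    return 1.0 / (1.0 + sum(abs(p - q) for p, q in zip(x, y)) / len(x))


def matrix_strength(genes, similarity):
    """Return (mS, per-gene scores) against the column-mean reference profile."""
    profiles = list(genes.values())
    n_cols = len(profiles[0])
    reference = [sum(p[j] for p in profiles) / len(profiles) for j in range(n_cols)]
    scores = {name: similarity(profile, reference) for name, profile in genes.items()}
    return sum(scores.values()) / len(scores), scores


def prune_module(genes, threshold, min_genes, similarity=toy_similarity):
    """Delete the weakest gene until mS >= threshold; None if that is impossible."""
    genes = dict(genes)                       # work on a copy
    while True:
        ms, scores = matrix_strength(genes, similarity)
        if ms >= threshold:
            return genes                      # becomes a partial bicluster
        if len(genes) <= min_genes:
            return None                       # no bicluster from this module
        worst = min(scores, key=scores.get)   # gene that fits the module least
        del genes[worst]


module = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.1],
    "g3": [0.9, 1.9, 2.9],
    "g4": [9.0, 1.0, 7.0],   # outlier
}
kept = prune_module(module, threshold=0.8, min_genes=2)
```

The feature-extension phase of Algorithm 1 follows the same greedy idea in the other direction: a column is added tentatively and kept only if mS stays at or above the threshold.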

Input: $B_t$, ρ, $min_g$ and $min_s$
Output: Tricluster set, $Tri_c$
Initialization: $Tri_c = \emptyset$, $p_T = \emptyset$, where $p_T$ is the set of partial triclusters;
while ($B_x$ is not reached, where x is the last time point) do
    foreach (cluster $c_i$, present in $B_t$ of time point t) do
        foreach (cluster $c_j$, present in $B_t$ at time point t + 1) do
            $C \leftarrow c_i^{t} \cap c_j^{t+1}$;
            if (untraversed time points still left) then
                if (number of genes in C ≥ $min_g$ and number of samples in C ≥ $min_s$) then
                    $c_i \leftarrow C$;
                    t++;
                end
            else
                $p_T \leftarrow C$;
                break;
            end
        end
    end
end
foreach (partial tricluster ∈ $p_T$) do
    calculate the inter-temporal coherence;
    if (inter-temporal coherence ≥ ρ and partial tricluster $\not\subseteq Tri_c$) then
        $Tri_c \leftarrow$ partial tricluster;
    else
        discard the partial tricluster
    end
end

Algorithm 2: Generate triclusters

3.3 Implementation
In this section, we discuss the implementation strategy for the parallelization of our triclustering approach. The parallelization is done using the parallel and distributed paradigm of the MATLAB¹ programming environment. The TCP/IP protocol enables data transfer between two computers or between two applications within a single computer. MATLAB can call Java functions inline and thereby communicate between two computers using the TCP/IP protocol. First, we create a socket for communication between two computers using the MATLAB interface. This communication includes the following steps.

• The MATLAB session on the server side prepares the data to be transferred, creates a TCP/IP server socket, and waits for the client to connect.

• The second MATLAB session, running on a different computer, creates a TCP/IP client and connects to the server.

• Once the connection is made, the data is sent out and the socket is closed.

• The client retrieves the data from the server and the connection is closed.

¹ http://www.mathworks.com/matlabcentral/fileexchange/24524-tcp-ip-communications-in-matlab

MATLAB provides the Parallel Computing Toolbox to solve large computations on multi-core processors. This is possible by using parfor to run tasks with many iterations on multi-core processors. parfor distributes the independent tasks to the locally available workers, which run independently in one MATLAB session.
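The parfor idea of handing independent loop iterations to local workers has close analogues in other environments. The sketch below uses Python's standard `concurrent.futures` module purely as an analogy and is not the MATLAB implementation. The per-plane task `process_plane` is a hypothetical placeholder for bicluster extraction.

```python
from concurrent.futures import ThreadPoolExecutor


def process_plane(plane):
    """Placeholder per-plane task: count the entries above the plane mean."""
    values = [v for row in plane for v in row]
    mean = sum(values) / len(values)
    return sum(1 for v in values if v > mean)


def parallel_map(task, items, workers=2):
    """Run task on every item with a pool of local workers, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, items))


planes = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
]
results = parallel_map(process_plane, planes)
```

As with parfor, this only works when the iterations do not depend on each other. For CPU-bound work in CPython, `ProcessPoolExecutor` from the same module is usually the better fit, since threads share one interpreter lock.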

3.4 Complexity Analysis
Our method includes two main modules: generation of biclusters and generation of triclusters. The computation of generating biclusters adds the maximum overhead to the overall complexity. So, we have reduced this computation time by distributing the GS planes to clients, which extract biclusters from each plane in parallel. Assume the dataset consists of m genes, n samples and t time points. After removal of the genes for which the SSSim score is less than the SSSim threshold, let the number of genes and samples in each partial bicluster be $m'$ and 3 respectively. Again, let $n'$ be the number of samples not included in the partial biclusters and, after the addition of features, let the number of samples in each bicluster be $n''$. Therefore, the complexity of generation of biclusters is $O(m \times n^2 + n \times \log n + m \times m' \times n + n(n' \times m' \times n'' + n(m' + n'')))$.

Let n be the maximum number of biclusters in bicluster set $B_t$ at any time point t. So, the complexity of generating the set of partial triclusters $p_T$ is $O(m' \times n^3 \times t)$. If the maximum number of partial triclusters in $p_T$ is $n \times t$ and $m'$, $n''$, and $t_x$ are the respective numbers of genes, samples, and time points, then the complexity of generation of triclusters is $O(m' \times n^3 \times t) + O(m' \times n \times n'' \times t \times t_x)$. The complexity of data transmission between clients and server is linear in the size of the data.

4. EXPERIMENTAL RESULTS
The proposed triclustering method was implemented to extract triclusters from a real yeast microarray dataset on a cluster of three machines connected by a switch, each with an Intel(R) Core(TM) i3-2120 processor, 2.00 GB RAM and a 64-bit Windows 7 operating system. We use TCP/IP communication and the parallel paradigm of the MATLAB programming environment. Three (n = 3) GST datasets of yeast cell-cycle regulated genes [21], given below, were taken to validate the performance of our method. We mine these clusters with SSSim threshold = 0.7, inter-temporal coherence threshold (ρ) = 0.7, minimum number of genes ($min_g$) = 5 and minimum number of samples ($min_s$) = 3 for all three datasets:

• Dataset of yeast cell-cycle regulated genes for the Elutriation Experiment: 7161 genes, 13 samples and 14 time points.

• Dataset of yeast cell-cycle regulated genes for the Pheromone Experiment: 7161 genes, 13 samples and 10 time points.

• Dataset of yeast cell-cycle regulated genes for the cdc15 Experiment: 7161 genes, 13 samples and 4 time points.

We start our analysis by comparing the performance of our method in sequential and parallel mode. From Figure 2a, we observe that the execution time required for mining biclusters sequentially on one machine for all the GS planes is more than that of mining biclusters on two machines using distributed and parallel programming. Table 1 gives an empirical illustration of how the distributed and parallel strategy improves performance by reducing the execution time of extracting biclusters. Again, since the module Generate biclusters adds the most overhead to the overall complexity, reducing the execution time of this module plays a vital role in reducing the total execution time. From Table 2, it is clear that the overall execution time of extracting triclusters is reduced due to the distributed and parallel strategy adopted to minimize the execution time of bicluster extraction. Moreover, this parallelization strategy scales well with an increasing number of processors. That is, if the sequential execution of extracting triclusters from Dataset 1 takes 70.95 minutes, then it reduces to 65.45 minutes and 64.22 minutes with the parallelization strategy adopted on two processors and three processors respectively. Similarly, for Dataset 2, the execution time reduces from 84.92 minutes to 68.05 minutes and 66.564 minutes after using this parallelization strategy. Thus, from Figure 2b, we can confirm that execution time decreases linearly with the increase of processors.

To validate the performance of our method in terms of biological significance, we used the p value [5]. From Table 3, it is seen that, for Dataset 2, our method shows high enrichment, with p values of 1.667e-25 and 5.033e-34. Similarly, for Dataset 3, we obtained triclusters with high biological significance, with lower p values of 1.910e-77 and 2.230e-67.

Figure 2: Performance Analysis. (a) Comparison of performance using sequential execution and distributed and parallel execution. (b) Performance with increase of processors.

Table 1: Comparison of performance for extracting biclusters using sequential execution and distributed and parallel execution.

Execution time for extracting biclusters (in minutes)
Serial number    In single machine    In two machines
Dataset 1        21.62                15.67
Dataset 2        50.28                33.41
Dataset 3        0.63                 0.41

Table 2: Comparison of performance with increasing number of biclusters

Overall execution time for extracting triclusters (in minutes)
Serial number    In single machine    In two machines    In three machines
Dataset 1        70.95                65.45              64.22
Dataset 2        84.92                68.05              66.564
Dataset 3        27.28                27.08              27
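The execution times in Table 2 can be converted into speedup (single-machine time divided by p-machine time) and parallel efficiency (speedup divided by p). The short sketch below does this. The dictionary `TIMES` copies the values of Table 2, and the function name is hypothetical.

```python
# Overall execution times from Table 2, in minutes, for 1, 2 and 3 machines.
TIMES = {
    "Dataset 1": (70.95, 65.45, 64.22),
    "Dataset 2": (84.92, 68.05, 66.564),
    "Dataset 3": (27.28, 27.08, 27.0),
}


def speedup_and_efficiency(times):
    """Return (machines, speedup, efficiency) for each entry of times.

    times[0] is the single-machine time; times[p - 1] is the time on p machines.
    """
    base = times[0]
    return [(p, base / t, base / t / p) for p, t in enumerate(times, start=1)]


for name, times in TIMES.items():
    for machines, speedup, efficiency in speedup_and_efficiency(times):
        print(f"{name}: {machines} machine(s), "
              f"speedup {speedup:.3f}, efficiency {efficiency:.3f}")
```

For Dataset 1, for example, this gives a speedup of roughly 1.08 on two machines and 1.10 on three.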

Table 3: External Validation (p value)

Dataset 1                          Dataset 2                          Dataset 3
p value      Size of tricluster    p value      Size of tricluster    p value      Size of tricluster
7.507e-18    457×3×13              1.667e-25    966×3×9               1.910e-77    6280×3×10
9.981e-12    238×3×10              5.033e-34    288×3×7               2.230e-67    6239×3×10
7.456e-14    196×3×8               3.131e-6     173×3×2               1.144e-5     1706×3×10
4.841e-8     488×3×4               4.369e-7     77×3×9                4.48e-5      300×3×5
4.894e-7     198×3×3               9.22e-34     877×3×3               4.220e-7     3585×3×4
7.060e-10    85×3×4                3.149e-48    5894×3×9              2.091e-5     1976×3×4

5. CONCLUSION AND FUTURE WORK
This approach of parallelizing our algorithm is effective in terms of execution time and includes four main steps: data distribution, task parallelization, merging and result extraction. Data distribution is done by distributing the GS planes among the client machines and the server machine depending upon the load on each machine. This communication is asynchronous, as the same task of extracting biclusters can be done on each of the three machines, which reduces the computation time. Task parallelization is implemented by distributing the independent tasks to the locally available workers, which run independently on one machine using the parallel toolbox of the MATLAB programming environment. The number of iterations taken by each loop while extracting the biclusters is reduced in this step. The proposed method has been found to perform satisfactorily in terms of execution time and quality (p value) of the triclusters extracted. Our algorithm is also able to mine co-expressed genes exhibiting shifting-and-scaling patterns over gene-sample-time space in minimum time.

In future, we aim to analyze our method on human disease datasets and to develop a ranking scheme to rank the triclusters for a given disease query.

Acknowledgement

The authors are thankful to the Ministry of HRD and UGC for providing funding to carry out research under the schemes (i) CoE under FAST and (ii) SAP DRS-II of UGC.

6. REFERENCES

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data. Data Mining and Knowledge Discovery, 11(1):5–33, 2005.

[2] H. Ahmed, P. Mahanta, D. Bhattacharyya, and J. Kalita. Shifting-and-scaling correlation based biclustering algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(6):1239–1252, Nov 2014.

[3] H. Ahmed, P. Mahanta, D. Bhattacharyya, J. Kalita, and A. Ghosh. Intersected coexpressed subcube miner: An effective triclustering algorithm. In WICT, 2011, pages 846–851. IEEE, 2011.

[4] R. Araujo, G. Trielli, G. Orair, W. Meira, R. Ferreira, and D. Guedes. Partricluster: a scalable parallel algorithm for gene expression analysis. In 18TH ISBAC-PAD’06, pages 3–10. IEEE, 2006.

[5] G. F. Berriz, O. D. King, B. Bryant, C. Sander, and F. P. Roth. Characterizing gene sets with funcassociate. Bioinformatics, 19(18):2502–2504, 2003.

[6] A. Bhar, M. Haubrock, A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay, and E. Wingender. δ-trimax: extracting triclusters and analysing coregulation in time series gene expression data. In Algorithms in Bioinformatics, pages 165–177. Springer, 2012.

[7] A. Bhar, M. Haubrock, A. Mukhopadhyay, and E. Wingender. Multiobjective triclustering of time-series transcriptome data reveals key genes of biological processes. BMC bioinformatics, 16(1):200, 2015.

[8] D. K. Bhattacharyya and J. K. Kalita. Network anomaly detection: A machine learning perspective. CRC Press, 2013.

[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Intelligent Systems for Molecular Biology, volume 8, pages 93–103, 2000.

[10] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.

[11] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 99th edition, 1975.

[12] Z. Hu and R. Bhatnagar. Algorithm for discovering low-variance 3-clusters from real-valued datasets. In IEEE 10th ICDM, 2010, pages 236–245. IEEE, 2010.

[13] D. Jiang, J. Pei, M. Ramanathan, C. Tang, and A. Zhang. Mining coherent gene clusters from gene-sample-time microarray data. In Proceedings of the tenth ACM SIGKDD, pages 430–439. ACM, 2004.

[14] H. Jiang, S. Zhou, J. Guan, and Y. Zheng. gtricluster: a more general and effective 3d clustering algorithm for gene-sample-time microarray data. In Data Mining for Biomedical Applications, pages 48–59. Springer, 2006.

[15] Y. Liu, T. Yang, and L. Fu. A partitioning based algorithm to fuzzy tricluster. Mathematical Problems in Engineering, 2015, 2015.

[16] Y.-C. Liu, C.-H. Lee, W.-C. Chen, J. Shin, H.-H. Hsu, and V. S. Tseng. A novel method for mining temporally dependent association rules in three-dimensional microarray datasets. In Computer Symposium (ICS), 2010 International, pages 759–764. IEEE, 2010.

[17] C. F. Olson. Parallel algorithms for hierarchical clustering. Parallel computing, 21(8):1313–1325, 1995.

[18] J. Orlin. Contentment in graph theory: covering graphs with cliques. In Indagationes Mathematicae (Proceedings), volume 80, pages 406–424. Elsevier, 1977.

[19] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.

[20] K. Sim, Z. Aung, and V. Gopalkrishnan. Discovering correlated subspace clusters in 3d continuous-valued data. In IEEE 10th ICDM, 2010, pages 471–480, Dec 2010.

[21] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273–3297, 1998.

[22] A. Tanay, R. Sharan, M. Kupiec, and R. Shamir. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proceedings of the National Academy of Sciences of the United States of America, 101(9):2981–2986, 2004.

[23] G. Wang, L. Yin, Y. Zhao, and K. Mao. Efficiently mining time-delayed gene expression patterns. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 40(2):400–411, 2010.

[24] X. Xu, Y. Lu, K.-L. Tan, and A. Tung. Finding time-lagged 3d clusters. In IEEE 25th ICDE 2009, pages 445–456. IEEE, 2009.

[25] X. Xu, Y. Lu, A. Tung, and W. Wang. Mining shifting-and-scaling co-regulation patterns on gene expression profiles. In Proceedings of the 22nd ICDE 2006, pages 89–89. IEEE, 2006.

[26] Y. Yin, Y. Zhao, B. Zhang, and G. Wang. Mining time-shifting co-regulation patterns from gene expression data. In Advances in data and web management, pages 62–73. Springer, 2007.

[27] T. Yun and G.-S. Yi. Biclustering for the comprehensive search of correlated gene expression patterns using clustered seed expansion. BMC genomics, 14(1):144, 2013.

[28] L. Zhao and M. J. Zaki. Tricluster: an effective algorithm for mining coherent clusters in 3d microarray data. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 694–705. ACM, 2005.

[29] W. Zhao, H. Ma, and Q. He. Parallel k-means clustering based on mapreduce. In Cloud Computing, pages 674–679. Springer, 2009.

[30] J. Zhou and A. Khokhar. Parrescue: Scalable parallel algorithm and implementation for biclustering over large distributed datasets. In 26th IEEE ICDCS 2006, pages 21–21, 2006.