

Performance Enhancement Algorithms For Data Reduction in Hadoop Environment

Mr. A. Antony Prakash 1, Dr. A. Aloysius 2

1 Asst. Professor in Information Technology, St. Joseph's College, Tiruchirappalli - 2, [email protected]

2 Asst. Professor in Computer Science, St. Joseph's College, Tiruchirappalli - 2, [email protected]

Abstract: Current research in Interactomics focuses mainly on designing intelligent computational systems, which in turn produce large volumes of biological interaction data. Predicting protein complexes (dense sub-networks) from these voluminous host-viral interaction networks is one of the open research challenges. Such complexes take the form of cliques and non-cliques in single-species interactions, and bicliques and non-bicliques in host-viral (dual-species) interactions. Existing graph-theoretic approaches concentrate on mining cliques from interaction networks based on their topological properties and ignore dense non-cliques. Protein complexes in host-viral interactions, however, are dense sub-networks that may be either bicliques or non-bicliques. Score based Co-Clustering with MapReduce (MR-CoC) is a sub-network mining algorithm that uses a score measure to extract both cliques and non-cliques. This paper applies it to mine protein complexes (bicliques and non-bicliques) from the HIV-Human protein interaction network. When the extracted HIV-Human sub-graphs are mapped to the existing HIV-Human complexes, almost 95 percent of the complexes are covered. The remaining, unknown sub-graphs can be provided to biologists for new complex discovery. A Gene Ontology and pathway based analysis is also carried out; it shows that the viral infection targets human proteins of the immune system, which is consistent with the known functionality of HIV.

Keywords: Big Data, Biclique, Bipartite graph, Complete graph, Clustering, Co-Clustering, Sub-graph mining, Sub-network Mining, MR-CoC

1. Introduction

Protein complexes of host-viral interactions are bio-products, like ordinary protein complexes, and are used to understand viral dynamics, the central hubs of viral infection, disease diagnosis, and the biological characteristics of biological systems [1]. Throughout this paper, the term protein complex refers to a protein complex in host-viral interactions. Predicting such complexes is disease specific, and very few have been predicted so far. In general, protein complexes are sub-networks of protein interaction networks that are responsible for specific biological functions such as signal transduction, cell replication, cellular immunization, and catalytic activity in various parts of the cell [2] [3] [4]. Infections of host organisms by disease pathogens are diagnosed based on these bio-products. The available sub-network mining algorithms include CMC, COACH, MCODE [5], MCL [6], CFinder [7], RNSC [8], and STM [9]. These algorithms mainly mine cliques and do not attempt biclique mining. They also scale poorly to voluminous data. Parallelizing the existing computational approaches is one way to improve scalability.

In this scenario, the MapReduce programming model is one of the de facto standards for handling big data [10] [11] [12] [13]. The model serves several purposes, such as parallelization, avoiding redundant computation, and addressing scalability issues; here it is chosen mainly to parallelize the computation and to cope with scalability. In conventional parallel computation, much of the time is spent on context switching and event synchronization, an overhead that the MapReduce model avoids. Some MapReduce based approaches exist [14] [15] [16], but they still suffer from time complexity issues. In this work, the score based MR-CoC is used to detect bicliques and non-bicliques in interaction data. The complexity of this approach is O(E_s + log N_s), which is lower than that of existing approaches [17]. The limitation of the score based MR-CoC is that the sub-networks it identifies depend on the initial seeds fed to the model. The number of seeds can be increased to maximize the randomness of the approach and mine more sub-networks. To improve the results, seeds are generated 'n' times and the results of all seed configurations are combined. The approach is slightly modified to mine bipartite sub-graphs and is applied to the HIV-Human interaction dataset. The resulting sub-networks are then mapped to the available HIV-Human protein complexes to analyze the protein complex coverage.

International Journal of Pure and Applied Mathematics, Volume 119, No. 18, 2018, pp. 1803-1811, Special Issue. ISSN: 1314-3395 (on-line version). URL: http://www.acadpubl.eu/hub/


The results show that the proposed approach can mine more than 30,000 sub-networks beyond the existing complexes. The biological significance of these unknown sub-networks is annotated. The following section covers the literature study. The proposed approach and the discussion of the experimental results are presented in the subsequent sections. Finally, the summary of this research and its future enhancements are discussed in Section 6.

Figure 1. (a) A sample protein interaction network with sub-networks highlighted; (b) a complete bipartite graph and its adjacency matrix


2. Related Work

The literature study reveals that many supervised and unsupervised models [3] [8] [18] [19] have been devised to mine protein complexes (sub-networks) and protein patterns based on their topological properties. They use various graph-theoretic techniques, topological properties, and classification measures to refine their results. They focus on different criteria, for example cliques of fixed size, which ignores non-cliques and incurs high complexity, or dense sub-graphs of fixed size, which also incurs high complexity. Not all existing protein complexes are complete sub-graphs; in some cases they are merely dense sub-graphs. In some related work, computational overhead and scalability issues are avoided by parallelizing the computation in a distributed environment using MapReduce. Time complexity issues nevertheless remain, owing to the hybridization of graph algorithms with the MapReduce model. The related research works examined are listed in Table 1.

The current research work parallelizes the mining process using MapReduce and uses a score measure that captures the cohesiveness of the nodes. Some researchers have suggested that mining the adjacency matrix does not produce fruitful results. The proposed approach nevertheless uses the adjacency matrix, produces efficient results, and outperforms existing benchmark approaches such as MCODE [5]. The complexity of MCODE is O(n^3), where n is the number of nodes, which is higher than that of the proposed MR-CoC.

Table 1. Related Research Works

| Title | MapReduce | Algorithm | Input | Output |
|---|---|---|---|---|
| Detection of Functional Modules from Protein Interaction Networks [20] | - | Clustering | Weighted Score | Network Clusters |
| Identifying Functional Modules in Protein-Protein Interaction Networks: an integrated exact approach [21] | - | Mathematical optimization | Weighted Vertex | Network Modules |
| Weighted Consensus Clustering for Identifying Functional Modules in Protein-Protein Interaction Networks [22] | - | Clustering (combines 4 clustering algorithms) | V, E | Network clusters |
| An automated method for finding molecular complexes in large protein interaction networks [23] | - | MCODE | Vertex, degree | Clique (Connected Sub-graph) |
| A Faster Algorithm for Detecting Network Motifs [5] | - | Enumerating Sub-graphs | V, E, neighbor list | k-size Sub-graphs (Network Motifs) |
| Scalable Sub-graph Enumeration in MapReduce [14] | MapReduce | Enumerating Sub-graphs | V, E, Pattern | Sub-graphs |
| Efficient sampling algorithm for estimating sub-graph concentrations and detecting network motifs [24] | - | Edge Sampling | Edge list and neighbor list | Sub-graphs |
| A novel MapReduce-based approach for distributed frequent sub-graph mining [16] | MapReduce | MR based Graph Partitioning | V, E | Frequent Pattern |
| An Iterative MapReduce Approach to Frequent Sub-graph Mining in Biological Datasets [15] | MapReduce | Clustering | | Frequent Sub-graphs |


3. Bipartite Graph Mining using Score based MR-CoC

Network Representation and Terminology

A protein interaction network (PIN) is represented as an undirected graph G(V, E) [25] [18] [26] [27]. Interactions between different species, such as disease-host interactions, are represented as a bipartite graph G_B. Here the proteins of the two species are the sets of vertices or nodes (V_d, V_h), and the interactions between these proteins are the edges or links (E). An edge is represented as E_i = <P_a, P_b>, where P_a ∈ V_d and P_b ∈ V_h, that is, one protein from each species. The PIN is an undirected bipartite graph (<P_a, P_b> = <P_b, P_a>). The connectivity among the proteins is represented using an adjacency matrix of size |V_d| × |V_h|, whose entries hold the membership value of each edge as defined in equation (1). The adjacency matrix (A_{m,n}) of the undirected bipartite graph has the disease proteins in rows and the host (human) proteins in columns.

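$$A_{m,n}(i, j) = \begin{cases} 1, & \text{if } \langle P_i, P_j \rangle \in E,\ P_i \in V_d,\ P_j \in V_h \\ 0, & \text{otherwise} \end{cases} \tag{1}$$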
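The construction of the adjacency matrix described above can be sketched as follows. This is a minimal illustration, not the authors' Matlab implementation; the function name `build_adjacency` and the toy protein names are assumptions made for the example.

```python
def build_adjacency(edges, disease_proteins, host_proteins):
    """Return the |Vd| x |Vh| 0/1 adjacency matrix of a bipartite PIN.

    Rows follow disease_proteins, columns follow host_proteins.
    """
    row = {p: i for i, p in enumerate(disease_proteins)}
    col = {p: j for j, p in enumerate(host_proteins)}
    matrix = [[0] * len(host_proteins) for _ in disease_proteins]
    for pa, pb in edges:
        # The graph is undirected, so accept each pair in either order.
        if pa in row and pb in col:
            matrix[row[pa]][col[pb]] = 1
        elif pb in row and pa in col:
            matrix[row[pb]][col[pa]] = 1
    return matrix


# Toy data (hypothetical): two viral proteins and three human proteins.
viral = ["tat", "nef"]
human = ["CD4", "C3", "CFH"]
edges = [("tat", "CD4"), ("C3", "nef"), ("nef", "CFH")]
A = build_adjacency(edges, viral, human)
print(A)  # [[1, 0, 0], [0, 1, 1]]
```

A dense list of lists is enough for an 11 × 4481 matrix such as the HIV-Human PIN; a sparse structure would be the more natural choice for much larger networks.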

Complete Bipartite Graph (G_{m,n}): every vertex of the first vertex set (|V_d| = m) has an edge to every vertex of the second vertex set (|V_h| = n) [26] [28]. All entries of its adjacency matrix are 1, as in Figure 1(b).

Biclique: a complete bipartite sub-graph present in a given graph G_{m,n} [26]. It has all the properties of a complete bipartite graph, but it is a sub-graph.

Non-Biclique: a bipartite sub-graph that is not complete.

Initial Seed Vectors: a seed vector is a set of proteins chosen at random from which a sub-network is mined, as shown in Figure 2. Multiple random seeds are generated in the same way to mine sub-networks in a random sub-space.

Figure 2. A sample seed vector; each numeric value represents a protein index
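Seed generation as described above can be sketched in a few lines. The function `generate_seeds` and its parameters are hypothetical names; the example assumes that a seed holds distinct indices of host (column) proteins, and it borrows the 4481 human proteins and the seed length of 40 from the experimental setup reported later in the paper.

```python
import random


def generate_seeds(num_proteins, seed_length, num_seeds, rng=None):
    """Draw num_seeds random seed vectors of distinct protein indices."""
    rng = rng or random.Random()
    return [
        sorted(rng.sample(range(num_proteins), seed_length))
        for _ in range(num_seeds)
    ]


# A fixed generator makes the run repeatable; drop it for fresh seeds each run.
seeds = generate_seeds(num_proteins=4481, seed_length=40, num_seeds=5,
                       rng=random.Random(0))
print(len(seeds), len(seeds[0]))  # 5 40
```

Sampling without replacement keeps each seed free of repeated proteins, while different seeds may still overlap, which is why redundant sub-networks can appear later.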

Sub-Network Mining using Score based MR-CoC

The score measure is used for co-clustering the PIN. The score is defined using the frequency of '1's in the adjacency matrix. The score of the adjacency matrix of a graph or sub-graph is calculated to assess the nature of the local pattern derived from the given PIN. It is the ratio of the frequency of '1's to the number of elements [17]; for bipartite graphs the score is redefined as in equation (2).

$$\mathrm{Score}(A_{m,n}) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{m,n}(i, j)}{m \times n} \tag{2}$$

where A_{m,n} is the adjacency matrix of order m × n holding the membership value (either '0' or '1') of each edge, and the numerator represents the frequency of '1's in the adjacency matrix. The score of a biclique or a complete bipartite graph is always 1. Values below 1 can be used as thresholds for mining non-bicliques, that is, dense sub-networks. The proposed score based co-clustering approach for bipartite graphs is depicted in Figure 3. The scoreCoC(A_{m,n}) function is slightly modified for mining bipartite sub-networks, as shown below.

Function scoreCoC(A_{m,n})
    while (score < thres)
        Remove the protein (row or column) which has min(min(row_freq), min(col_freq))
        Evaluate score(A_{m,n})
    end
    return (protein ids in A_{m,n})
end
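The pseudocode above can be turned into a small runnable sketch. It is an illustration under stated assumptions rather than the authors' Matlab code: the names `score` and `score_coc` are chosen for the example, ties between the sparsest row and the sparsest column are broken in favour of the row, and the minimum sub-network size check is left to the caller.

```python
def score(matrix):
    """Ratio of 1s to the number of entries (0.0 for an empty matrix)."""
    cells = sum(len(r) for r in matrix)
    return sum(map(sum, matrix)) / cells if cells else 0.0


def score_coc(matrix, row_ids, col_ids, thres=0.8):
    """Prune the sparsest row or column until the score reaches thres.

    Returns the surviving row ids and column ids. Either list may be
    empty if no dense block is found.
    """
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0]))) if matrix else []
    while rows and cols:
        sub = [[matrix[i][j] for j in cols] for i in rows]
        if score(sub) >= thres:
            break
        row_freq = [sum(r) for r in sub]        # 1s per remaining row
        col_freq = [sum(c) for c in zip(*sub)]  # 1s per remaining column
        # Remove the protein with the overall minimum frequency
        # (assumption: a tie removes the row).
        if min(row_freq) <= min(col_freq):
            rows.pop(row_freq.index(min(row_freq)))
        else:
            cols.pop(col_freq.index(min(col_freq)))
    return [row_ids[i] for i in rows], [col_ids[j] for j in cols]


A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
print(score(A))  # 0.5555555555555556
print(score_coc(A, ["r0", "r1", "r2"], ["c0", "c1", "c2"]))
# (['r0', 'r1'], ['c0', 'c1'])
```

In the toy matrix the isolated row and column are pruned first, leaving a 2 × 2 block whose score is 1, that is, a biclique.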

MR-CoC has two main phases: the map phase and the reduce phase. The generated seeds are written to text files, which are fed as input to the map phase. For each seed, an adjacency sub-matrix is generated by extracting the columns of A_{m,n} that correspond to the seed proteins, and the score based co-clustering process given in the algorithm is then applied to it. Here row_freq is the frequency of 1's in each row, and col_freq is the frequency of 1's in each column.
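The division of work between the two phases can be sketched in a single process. This is only an illustration of the data flow, not a distributed implementation: `map_phase`, `reduce_phase` and `toy_mine` are hypothetical names, and `toy_mine` is a stand-in for the score based miner that simply keeps the seed proteins falling inside one assumed dense block.

```python
from collections import defaultdict


def map_phase(seeds, mine):
    """Map: emit (sub-network, 1) for every seed that yields a sub-network."""
    for seed in seeds:
        sub_network = mine(seed)
        if sub_network:
            # A frozenset key lets equal protein sets meet in the reducer.
            yield frozenset(sub_network), 1


def reduce_phase(pairs):
    """Reduce: keep each distinct sub-network once, with its seed count."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Hypothetical stand-in for the score based miner.
DENSE_BLOCK = {1, 2, 3, 4}


def toy_mine(seed):
    return set(seed) & DENSE_BLOCK


seeds = [[1, 2, 9], [2, 1, 7], [3, 4, 8], [5, 6, 7]]
distinct = reduce_phase(map_phase(seeds, toy_mine))
for sub, n in distinct.items():
    print(sorted(sub), n)
# [1, 2] 2
# [3, 4] 1
```

Because every seed is processed independently, the map calls need no synchronization with each other; only the reduce step has to see all emitted keys in order to discard repeats.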


Figure 3. Workflow of the proposed MR-CoC model: (a) seed generation, from the PIN to the initial random seed matrix; (b) mining sub-networks from the adjacency matrices using Score based MR-CoC; (c) biological significance of the sub-networks

The complexity of the proposed work is O(E_s + log N_s), where E_s is the number of edges in a seed and N_s is the number of nodes in a seed. This is lower than the O(n^3) complexity of the MCODE algorithm [5], which is widely used for mining sub-graphs.

4. Experimental Setup

The Homo Sapiens dataset from the STRING database [29], shown in Figure 5, was used previously for mining cliques and non-cliques. The sub-networks obtained depend entirely on the initial seeds generated [17]. In this work, the proposed approach is run 50 times for each seed configuration, with the same number of random seeds generated afresh for each run. First, the cliques and non-cliques are mined as in the previous work, and the protein complex coverage is evaluated to showcase the performance of the proposed approach. Second, the HIV-1 Human interaction database (a bipartite graph), shown in Figure 4, is used to mine sub-networks (bicliques and non-bicliques) with the proposed approach. The protein interaction networks are taken from NCBI [30], and their descriptions are given in Table 2. Each seed is an initial set of proteins from which one sub-graph is generated, so the generated sub-graphs can be redundant. The score measure is used to find the sub-graph from each seed, and the distinct sub-graphs are extracted in the reduce phase. The proposed approach is implemented using the MapReduce model in Matlab. The environmental setup is given in Table 2. The workers represent the number of parallel threads; 10 workers are used in this implementation. A score threshold of 0.8 is chosen for non-cliques and non-bicliques. The experiments are carried out on a system with an Intel i7 processor and 12 GB of RAM.

Table 2. Experimental Setup

| Parameter | Homo Sapiens PPI | HIV-Human PPI |
|---|---|---|
| Environment | Matlab 2016b (MapReduce model) | Matlab 2016b (MapReduce model) |
| Number of interactions | 85,48,003 | 17,104 |
| Number of proteins | 19,427 | HIV: 11, Human: 4481 |
| Seed length | 50 proteins | 40 proteins |
| Minimum size of sub-network | 3, 4, 5 proteins | 4 proteins |

The protein complexes of Homo sapiens are taken from the CORUM database, a comprehensive resource of mammalian protein complexes [31]. It contains 2358 human protein complexes. The results are mapped to these existing protein complexes to analyze the performance of the proposed approach.

Figure 4. Heat map of the adjacency matrix of the HIV-Human PIN (17,104 interactions)

Figure 5. Heat map of the adjacency matrix of the Homo Sapiens PIN (85,48,003 interactions)


5. Results and Discussion

The sub-networks are mined from the Homo Sapiens PPI using MR-CoC. The numbers of cliques and non-cliques obtained with the proposed methodology for different initial seed setups and minimum sub-network sizes are listed in Table 3. The score measure directly reflects the cohesiveness of a sub-network. The biological significance of the obtained cliques and non-cliques can be studied further by comparing them with the existing protein complexes taken from the CORUM database [31]. In the experimental results, some sub-networks are exactly the same as existing protein complexes, some are partially the same, and some remain unmapped. This protein complex coverage of the computational results is discussed in Tables 4, 5 and 6.

The protein complex coverage is evaluated in terms of fully mapped, partially mapped and un-mapped sub-networks.

Fully mapped sub-networks (FM): the same as an existing protein complex.

Partially mapped sub-networks (PM): contain at least 90 percent of the participants of an existing protein complex.

Un-mapped sub-networks (UM): contain less than 90 percent of the participants of any existing protein complex.
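The three coverage categories can be expressed as a short classification routine. The paper does not spell out how the overlap is computed, so this sketch assumes it is the fraction of a complex's members that appear in the sub-network; `classify` and the toy protein names are likewise assumptions made for illustration.

```python
def classify(sub_network, complexes, partial=0.9):
    """Label a sub-network FM, PM or UM against the known complexes."""
    sub = set(sub_network)
    best = 0.0
    for members in complexes:
        members = set(members)
        if sub == members:
            return "FM"  # identical to an existing complex
        if members:
            # Assumption: overlap is measured against the complex's size.
            best = max(best, len(sub & members) / len(members))
    return "PM" if best >= partial else "UM"


# Toy complexes (hypothetical protein names).
small = {"A", "B", "C"}
large = {f"P{i}" for i in range(10)}
known = [small, large]

print(classify({"A", "B", "C"}, known))                        # FM
print(classify({f"P{i}" for i in range(9)} | {"X"}, known))    # PM
print(classify({"A", "X"}, known))                             # UM
```

Other readings of the 90 percent rule are possible, for example measuring overlap against the sub-network's own size, and they would shift sub-networks between the PM and UM groups.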

The coverage of the human protein complexes by the resultant sub-networks for different seed configurations on the Homo Sapiens dataset is visualized in Figures 6, 7 and 8. Similarly, the coverage of the HIV-Human protein complexes by the resultant sub-networks is visualized in Figure 9. The X-axis represents the number of seed vectors in millions (M). The Y-axis represents the number of sub-networks that map the protein complexes fully, partially, or not at all. Of the 2358 existing human protein complexes, 2237 are mapped by the resultant sub-networks. The 26575 unmapped sub-networks are provided to biologists for further study of their biological significance and for the prediction of new complexes.
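As a quick check of the coverage figures, the reported counts can be turned into percentages. The first ratio uses the Homo Sapiens counts given above; the second uses the 38 of 40 HIV-Human complexes reported in the paper, which corresponds to the "almost 95 percent" quoted in the abstract.

```latex
\[
\frac{2237}{2358} \approx 0.9487 \;(94.87\%),
\qquad
\frac{38}{40} = 0.95 \;(95\%)
\]
```

The first value agrees with the "more than 94.86 percent" stated in the conclusion.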

Table 3. Number of cliques and non-cliques extracted from the Homo Sapiens dataset with different parameter values

| Number of seeds | Cliques (min. size 3, threshold 0.9) | Non-cliques (min. size 3, threshold 0.9) | Cliques (min. size 4, threshold 0.85) | Non-cliques (min. size 4, threshold 0.85) | Cliques (min. size 5, threshold 0.8) | Non-cliques (min. size 5, threshold 0.8) |
|---|---|---|---|---|---|---|
| 100000 | 2910 | 3181 | 821 | 2766 | 683 | 2194 |
| 500000 | 2174 | 4522 | 1139 | 4172 | 812 | 3178 |
| 1000000 | 4372 | 7188 | 2897 | 3921 | 2381 | 2190 |
| 5000000 | 5917 | 8620 | 3200 | 5193 | 1782 | 4822 |
| 10000000 | 8539 | 11861 | 3176 | 7885 | 2910 | 8821 |
| 50000000 | 6433 | 10865 | 5192 | 7231 | 4987 | 6975 |
| 100000000 | 18862 | 37297 | 11862 | 9021 | 10192 | 19021 |

In the proposed bipartite sub-network mining approach, 38 of the 40 existing HIV-Human protein complexes are predicted. Some of the bicliques and non-bicliques among the un-mapped sub-networks are annotated with their biological function, molecular process and cellular component using the STRING database [29] query service; they are listed in Table 4.

Figure 6. Protein complex coverage with minimum sub-network size 3

Figure 7. Protein complex coverage with minimum sub-network size 4


Figure 8. Protein complex coverage with minimum sub-network size 5

Some of the bicliques and non-bicliques, along with their biological function, molecular process and cellular component extracted using the STRING database [29] query service, are listed in Table 4. The biological significance of the sub-networks is studied further to understand their functionality. The study shows that the interaction of HIV-1 with the host organism affects the host's immune system. Most of the components contain proteins responsible for immune system maintenance. The HIV infection acts on the extracellular space of the host organism. The common functions of the components' subunits are wound healing, cell binding, defense response, immune system regulation, cell growth, regulation of metabolic activities and so on; the biological processes of some components extracted using the proposed methodology are listed in Table 4. The KEGG pathways found in some of the components are listed in Table 5. The pathway analysis shows that the components carry traces of viral infection pathways, respiration regulatory signals, tuberculosis pathways, natural killer cell mediated cytotoxicity pathways and so on. Thus the proposed approach can extract a larger number of components from the randomized space (random seeds) than other graph-theoretic methods.

Table 4. Biological Significances of the Components in HIV-Human PIN using Score based MR-CoC

| Component | GO Term | Biological Process |
|---|---|---|
| Component 1 (C3, CD46, CFH, CR1, IFNA8, IGF1, IGF2, ITIH4, MASTL) | GO.0002682 | regulation of immune system process |
| | GO.0006952 | defense response |
| | GO.0032269 | negative regulation of cellular protein metabolic process |
| | GO.0002252 | immune effector process |
| | GO.0042060 | wound healing |
| | GO.0048584 | positive regulation of response to stimulus |
| | GO.0048583 | regulation of response to stimulus |
| | GO.0006950 | response to stress |
| Component 2 (APOBEC3D, CD46, CFH, CR1, HFE, HGF, SEC62, SEC63, C3, TFRC) | GO.0006952 | defense response |
| | GO.0006950 | response to stress |
| | GO.0002252 | immune effector process |
| | GO.0002376 | immune system process |
| | GO.0045087 | innate immune response |
| Component 3 (C3, CD46, CFH, CR1, IGF1, IGF2, IGFBP1, MASTL, NUPL2, SH3RF1, VPRBP, ARPP19, SH3RF1) | GO.0006959 | humoral immune response |
| | GO.0006956 | complement activation |
| | GO.0010827 | regulation of glucose transport |
| | GO.0019538 | protein metabolic process |
| | GO.0002455 | humoral immune response mediated by circulating immunoglobulin |
| | GO.0002250 | adaptive immune response |
| | GO.0002252 | immune effector process |
| | GO.0002684 | positive regulation of immune system process |
| | GO.0043086 | negative regulation of catalytic activity |
| | GO.0048583 | regulation of response to stimulus |
| | GO.0048584 | positive regulation of response to stimulus |
| | GO.0031324 | negative regulation of cellular metabolic process |
| | GO.0045087 | innate immune response |
| Component 4 (CDK1, CFH, FLNA, GRM1, HFE, HGF, IFI16, IFI27, ITIH4) | GO.0006103 | 2-oxoglutarate metabolic process |
| | GO.0030162 | regulation of proteolysis |
| | GO.0006102 | isocitrate metabolic process |
| | GO.0051246 | regulation of protein metabolic process |
| | GO.0009060 | aerobic respiration |
| Component 5 (IFI35, IFIT1, IFIT2, IFIT3, IFNA1, IFNA2, IFNA4, IFNA7, IFNGR2, ITIH4, ITK, NEDD4, SP110, VPRBP) | GO.0009615 | response to virus |
| | GO.0051607 | defense response to virus |
| | GO.0006955 | immune response |
| | GO.0034097 | response to cytokine |
| | GO.0045087 | innate immune response |
| | GO.0006952 | defense response |
| | GO.0043330 | response to exogenous dsRNA |
| | GO.0002376 | immune system process |
| | GO.0002250 | adaptive immune response |
| | GO.0009615 | response to virus |
| | GO.0071345 | cellular response to cytokine stimulus |
| | GO.0002323 | natural killer cell activation involved in immune response |
| | GO.0006950 | response to stress |
| | GO.0042110 | T cell activation |
| | GO.0002286 | T cell activation involved in immune response |
| | GO.0002520 | immune system development |
| | GO.0006959 | humoral immune response |
| | GO.0050794 | regulation of cellular process |
| | GO.0050896 | response to stimulus |

Table 5. KEGG Pathway Significances of the Components in HIV-Human Interactions using Score based MR-CoC

| Component | Pathway | KEGG Pathway |
|---|---|---|
| Component 1 (C3, CD46, CFH, CR1, IFNA8) | 4610 | Complement and coagulation cascades |
| | 5152 | Tuberculosis |
| | 5150 | Staphylococcus aureus infection |
| Component 2 (C3, CD46, CFH, CR1) | 4610 | Complement and coagulation cascades |
| | 5144 | Malaria |
| Component 3 (C3, CD46, CFH, CR1, IGF1, IGF2, IGFBP1, NUPL2, SH3RF1, VPRBP) | 4610 | Complement and coagulation cascades |
| | 5152 | Tuberculosis |
| | 5134 | Legionellosis |
| | 5150 | Staphylococcus aureus infection |
| Component 4 (CDK1, CFH, FLNA, GRM1, HFE, HGF, IFI16, IFI27, ITIH4) | 1210 | 2-Oxocarboxylic acid metabolism |
| | 1230 | Biosynthesis of amino acids |
| | 20 | Citrate cycle (TCA cycle) |
| | 1200 | Carbon metabolism |
| Component 5 (IFIT1, IFNA1, IFNA2, IFNA4, IFNA7, IFNGR2) | 5168 | Herpes simplex infection |
| | 4140 | Regulation of autophagy |
| | 4650 | Natural killer cell mediated cytotoxicity |
| | 5160 | Hepatitis C |
| | 5162 | Measles |
| | 5320 | Autoimmune thyroid disease |
| | 4630 | Jak-STAT signaling pathway |
| | 5164 | Influenza A |
| | 5152 | Tuberculosis |
| | 4622 | RIG-I-like receptor signaling pathway |
| | 5164 | Influenza A |
| | 4060 | Cytokine-cytokine receptor interaction |
| | 5161 | Hepatitis B |

Further analysis of all the components for their biological significance will help biologists study the characteristics of the disease in the host organism. These components can also assist drug discovery, drug target identification, and related tasks. The proposed methodology is suited to distributed environments for mining sub-networks from complex interaction networks. The computational time can be reduced considerably if the computation is carried out in a distributed setup, which will help to overcome the big data issues in Interactomics.

6. Conclusion

Protein complex mining is one of the emerging research areas. The proposed methodology, the Score based Co-Clustering algorithm with the MapReduce model (MR-CoC), is devised to mine all kinds of dense sub-graphs: cliques, bicliques, non-cliques and non-bicliques. This approach was previously applied to mine sub-networks from large networks such as protein interaction networks (PIN). Its performance is studied in terms of the complex coverage level of the results. The resultant sub-networks map 94.86 percent of the existing complexes, and the approach also discovers 26575 unmapped protein sub-networks. In this work the approach is extended to extract bicliques and non-bicliques from HIV-Human interactions. Here it extracts 6824 sub-networks, and 38 of the 40 existing HIV-Human protein complexes are mapped.
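The coverage figures above are ratios of mapped complexes to known complexes. The following is a minimal sketch of such a calculation, not the authors' implementation: the function names, the toy protein sets and the matching rule (a known complex counts as mapped when some extracted sub-network contains at least a chosen fraction of its proteins) are all assumptions made for illustration.

```python
from typing import Sequence, Set


def is_mapped(known: Set[str], subnetworks: Sequence[Set[str]],
              min_overlap: float = 1.0) -> bool:
    """Hypothetical matching rule: a known complex is 'mapped' when at least
    min_overlap of its proteins occur together in one extracted sub-network."""
    return any(len(known & sub) / len(known) >= min_overlap
               for sub in subnetworks)


def coverage_percent(known_complexes: Sequence[Set[str]],
                     subnetworks: Sequence[Set[str]],
                     min_overlap: float = 1.0) -> float:
    """Percentage of known complexes that are mapped by the sub-networks."""
    known = [c for c in known_complexes if c]  # ignore empty complexes
    if not known:
        return 0.0
    mapped = sum(is_mapped(c, subnetworks, min_overlap) for c in known)
    return 100.0 * mapped / len(known)


# Toy data (placeholder protein names, not taken from the paper).
known = [{"A", "B"}, {"C", "D", "E"}, {"F", "G"}, {"H"}]
subs = [{"A", "B", "X"}, {"C", "D"}, {"H", "Y"}]

print(coverage_percent(known, subs))                   # 50.0
print(coverage_percent(known, subs, min_overlap=0.6))  # 75.0

# The reported HIV-Human figure: 38 of 40 known complexes mapped.
print(100.0 * 38 / 40)                                 # 95.0
```

With a stricter overlap threshold fewer complexes count as mapped, so any reported coverage depends on the matching rule that is used.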

The unmapped sub-networks of the HIV-Human dataset are annotated for their biological significance. The results show that the viral pathogens infect the human proteins that monitor the immune system, which is evidence of HIV functionality. They target proteins in the extracellular spaces of blood particles. The pathway analysis reveals that HIV infection affects the respiration regulatory proteins and shows traces of other viruses that possess similar pathways. The extracted components (protein sub-networks) were further analyzed to understand the dynamics of HIV in the host system.

REFERENCES

[1] “Structures of Life”, 2007.

[2] E. M. Hanna, N. Zaki and A. Amin, "Detecting Protein Complexes in Protein Interaction Networks Modeled as Gene Expression Biclusters," pp. 1-19, 2015.

[3] F. Y. Yu, Z. H. Yang, X. H. Hu, Y. Y. Sun, H. F. Lin and J. Wang, "Protein complex detection in PPI networks based on data integration and supervised learning method," BMC Bioinformatics, vol. 16, no. 12, pp. 1-9, 2015.

[4] L. Ou-Yang, X.-F. Zhang, D.-Q. Dai, M.-Y. Wu, Y. Zhu, Z. Liu and H. Yan, "Protein complex detection based on partially shared multi-view clustering," BMC Bioinformatics, 2016.

[5] S. Wernicke, "A Faster Algorithm for Detecting Motifs," in 5th WABI-05, 2005.

[6] A. Enright, S. Van Dongen and C. Ouzounis, "An efficient algorithm for large-scale detection of protein families," Nucleic Acids Research, vol. 30, no. 7, pp. 1575-1584, 2002.

[7] B. Adamcsek, G. Palla, I. J. Farkas, I. Derényi and T. Vicsek, "CFinder: locating cliques and overlapping modules in biological networks," Bioinformatics, vol. 22, no. 8, pp. 1021-1023, 2006.

[8] A. D. King, N. Przulj and I. Jurisica, "Protein complex prediction via cost-based clustering," Bioinformatics, vol. 20, no. 17, pp. 3013-3020, 2004.

[9] W. Hwang, Y. R. Cho, A. Zhang and M. Ramanathan, "A novel functional module detection algorithm for protein-protein interaction networks," Algorithms for Molecular Biology, vol. 1, no. 24, 2006.

[10] J. Ekanayake, S. Pallickara and G. Fox, "MapReduce for Data Intensive Scientific Analyses," in Proceedings of the 2008 Fourth IEEE International Conference on eScience (ESCIENCE '08), 2008.

[11] S. Chen and S. W. Schlosser, "Map-reduce meets wider varieties of applications," 2008.

[12] J. Rosen, N. Polyzotis, V. Borkar, Y. Bu, M. J. Carey, M. Weimer, T. Condie and R. Ramakrishnan, "Iterative mapreduce for large scale machine learning".

[13] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[14] L. Lai, L. Qin, X. Lin and L. Chang, "Scalable Subgraph Enumeration in MapReduce," in Proceedings of the VLDB Endowment.

[15] S. Hill, B. Srichandan and R. Sunderraman, "An Iterative MapReduce Approach to Frequent Subgraph Mining in Biological Datasets," ACM-BCB '12, pp. 7-10, 2012.

[16] S. Aridhi, L. D'Orazio, M. Maddouri and E. Mephu, "A Novel MapReduce-based Approach for Distributed Frequent Subgraph Mining," RFIA, 2014.

[17] R. Gowri and R. Rathipriya, "Cohesive Sub-Network Mining in Protein Interaction Networks using Score based Co-Clustering with MapReduce Model (MR-CoC)," 2017.

[18] S. E. Schaeffer, "Graph clustering," Computer Science Review, pp. 27-64, 2007.

[19] H. S. M. Mosaddek, Z. Mahboob, R. Chowdhury, A. Sohel and S. Ray, "Protein Complex Detection in PPI Network by Identifying Mutually Exclusive Protein-protein Interactions," Procedia Computer Science, vol. 93, pp. 1054-1060, 2016.

[20] J. B. Pereira-Leal, A. J. Enright and C. A. Ouzounis, "Detection of Functional modules from protein Interaction Networks," PROTEINS: Structure, Function, and Bioinformatics, vol. 54, pp. 49-57, 2004.

[21] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar and T. Müller, "Identifying Functional Modules in Protein-Protein interaction Networks: an integrated exact approach," ISMB, vol. 24, pp. 223-231, 2008.

[22] Y. Zhang, E. Zeng, T. Li and G. Narasimhan, "Weighted Consensus Clustering for Identifying Functional Modules In Protein-Protein Interaction Networks".

[23] G. D. Bader and C. W. Hogue, "An automated method for finding molecular complexes in large protein interaction networks," BMC Bioinformatics, vol. 4, no. 2, 2003.

[24] J. Rao and S. M. Ms., "Efficient Method for Finding Conserved Regions in Protein Interactions Network," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 7, pp. 756-761, 2013.

[25] A. Gursoy, O. Keskin and R. Nussinov, "Topological properties of protein interaction networks from a structural perspective," Biochemical Society Transactions, pp. 1398-1403, 2008.

[26] R. Diestel, Graph Theory, 5 ed., Springer, 2016.

[27] S. Saha Ray, "Subgraphs, Paths and Connected Graphs," in Graph Theory with Algorithms and its Applications, 2013, pp. 11-24.

[28] R. B. Bapat, Graphs and Matrices, Springer, Hindustan Book Agency, 2010.

[29] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, M. Kuhn, P. Bork, L. J. Jensen and C. von Mering, "STRING v10: protein-protein interaction networks, integrated over the tree of life," Nucleic Acids Research, vol. 43, pp. 447-452, 2015.

[30] D. Ako-Adjei, W. Fu, C. Wallin, K. S. Katz, G. Song, D. Darji, J. R. Brister, R. G. Ptak and K. D. Pruitt, "HIV-1, human interaction database: current status and new features," Nucleic Acids Research, vol. 43, pp. 566-570, 2015.

[31] A. Ruepp, B. Brauner, I. Dunger-Kaltenbach, G. Frishman, C. Montrone, M. Stransky, B. Waegele, T. Schmidt, O. Doudieu, V. Stümpflen and H. Mewes, "CORUM: the Comprehensive Resource of Mammalian Protein Complexes," Nucleic Acids Res, pp. 449-454, 2008.

[32] D. Emig-Agius, K. Olivieri, L. Pache, H. L. Shih and O. Pustovalova, "An Integrated Map of HIV-Human Protein Complexes that Facilitate Viral Infection," PLoS ONE, vol. 9, no. 5, 2014.

[33] S. Pinkert, J. Schultz and J. Reichardt, "Protein Interaction Networks- More than mere Modules," PLoS Computational Biology, vol. 6, no. 1, 2010.

[34] M. P. Samanta and S. Liang, "Predicting protein functions from redundancies in large-scale protein interaction networks," Proc. of the National Academy of Science, vol. 100, no. 22, pp. 12579-12583, 2003.

[35] G. D. Bader and C. W. Hogue, "An automated method for finding molecular complexes in large protein-protein interaction networks," BMC Bioinformatics, vol. 4, no. 2, 2003.

[36] R. Gowri and R. Rathipriya, "A Study on Clustering the Protein Interaction Networks using Bio-Inspired Optimization," International Journal Computational Intelligence and Informatics, vol. 3, no. 2, pp. 89-95, 2013.

[37] R. Gowri and R. Rathipriya, "Extraction of Protein Sequence Motif Information using PSO K-Means," Journal of Network and Information Security, 2014.

[38] R. Gowri, S. Sivabalan and R. Rathipriya, "Biclustering using Venus Flytrap Optimization Algorithm," in Computational Intelligence in Data Mining, Proceedings of International Conference on CIDM, Advances in Intelligent Systems and Computing series, vol. 410, 2015, pp. 199-207.

[39] H. Ke, P. Li, S. Guo and M. Guo, "On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications," IEEE Transactions on Parallel and Distributed Systems, 2015.

[40] R. Gowri and R. Rathipriya, "Protein motif comparator using PSO k-means," International Journal of Applied Metaheuristic Computing (IJAMC), vol. 7, no. 3, 2016.
