Storytelling and Clustering for Cellular Signaling Pathways M. Shahriar Hossain, Monika Akbar, Nicholas F. Polys Department of Computer Science, Virginia Tech, Blacksburg, VA 24061.


Storytelling and Clustering for Cellular Signaling Pathways

M. Shahriar Hossain, Monika Akbar, Nicholas F. Polys

Department of Computer Science, Virginia Tech, Blacksburg, VA 24061.


Objective

STKE Dataset: cell interactions through chemical signals

Discover relationships between the pathways

Graph structure: subgraph discovery problem

Pathway relationships: clustering, storytelling


Myocyte Adrenergic Pathway (CMP_9043)


Dataset properties

Total Pathways = 50

[Figure: histogram of pathway sizes; x-axis: Size Range (bins 1-10 through 100-110); y-axis: Number of Pathways in Size Range (0 to 12).]


Design Pipeline

[Diagram: STKE Dataset → Preprocessor → Pathway Graphs → Frequent Subgraph Discovery → Frequent Subgraphs → Clustering / NN Storytelling]


Subsequent Candidate Generation

Apriori: incremental approach [17]

FSG [2]

Generate a (k+1)-edge candidate subgraph by combining two k-edge subgraphs that share a common core subgraph of (k-1) edges.

The cost of comparing subgraphs (and core subgraphs) is reduced by using the hash code of each subgraph object.

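[Figure: example subgraphs over vertices l, m, n, o, p, q; two k-edge subgraphs with a common core are joined into a (k+1)-edge candidate.]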
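The join step described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that vertex labels are unique across the dataset, so a subgraph can be represented as a `frozenset` of directed `(source, target)` edges, and that k >= 2 so the shared core is non-empty. The function name `join_candidates` is hypothetical. Python's built-in hashing of `frozenset` plays the role of the subgraph hash code mentioned above.

```python
from itertools import combinations


def join_candidates(k_subgraphs):
    """Join pairs of k-edge subgraphs that share a (k-1)-edge core.

    Each subgraph is a frozenset of (source, target) edges (an assumption
    of this sketch). Returns the set of (k+1)-edge candidate subgraphs.
    """
    candidates = set()
    for a, b in combinations(k_subgraphs, 2):
        if len(a) != len(b):
            continue  # only subgraphs of the same size are joined
        core = a & b  # edges common to both subgraphs
        if len(core) == len(a) - 1:
            # The union has exactly k + 1 edges; the set removes duplicates
            # produced by different pairs.
            candidates.add(a | b)
    return candidates


# Hypothetical 3-edge subgraphs using the vertex labels from the slide.
s1 = frozenset({("m", "n"), ("n", "o"), ("o", "p")})
s2 = frozenset({("m", "n"), ("n", "o"), ("o", "q")})
s3 = frozenset({("l", "m"), ("m", "n"), ("n", "p")})

for candidate in join_candidates([s1, s2, s3]):
    print(sorted(candidate))
```

Only `s1` and `s2` share a 2-edge core, so a single 4-edge candidate is produced. Note that every pair of subgraphs is compared, which is the cost that grows quickly as the number of k-edge subgraphs increases.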


Subsequent Candidate Generation

Instance:

Number of 5-edge subgraphs: 21

Core subgraph comparisons for s1: 20

[Figure: subgraph s1 compared against the other 5-edge subgraphs (vertices l through t and z); pairs without a common core are marked "Not generated".]


Master Pathway Graph (MPG)

Total Unique Nodes: 1205

Total Relations: 1376


SEG - Subgraph Extension Generation

Neighborhood Extension

Neighborhood list: {q, r, s}

Comparison is not required.

Subgraph is extended from physical evidence.

[Figure: a subgraph over vertices l, m, n, o, p is extended by one edge to each of the neighboring vertices q, r and s.]
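The neighborhood extension idea can be sketched as below. This is an illustrative sketch under stated assumptions rather than the SEG implementation: subgraphs are `frozenset`s of directed edges, the Master Pathway Graph is a plain set of edges, and the function name `extend_subgraph` is hypothetical. Support counting against `min_sup` is omitted.

```python
def extend_subgraph(subgraph, mpg_edges):
    """Extend a k-edge subgraph by one edge taken from the master graph.

    An edge qualifies if it is not already in the subgraph and touches at
    least one vertex of the subgraph. No pairwise subgraph comparison is
    needed because every extension comes from an edge that exists in the MPG.
    """
    vertices = {v for edge in subgraph for v in edge}
    extensions = set()
    for edge in mpg_edges:
        source, target = edge
        if edge not in subgraph and (source in vertices or target in vertices):
            extensions.add(subgraph | {edge})  # frozenset | set -> frozenset
    return extensions


# Hypothetical master graph using the vertex labels from the slide.
mpg = {
    ("l", "m"), ("m", "n"), ("n", "o"), ("o", "p"),
    ("p", "q"), ("p", "r"), ("o", "s"),
    ("x", "y"),  # not adjacent to the subgraph, so never used
}
subgraph = frozenset({("l", "m"), ("m", "n"), ("n", "o"), ("o", "p")})

for extended in extend_subgraph(subgraph, mpg):
    print(sorted(extended - subgraph))
```

In this example the three extensions add the vertices q, r and s, matching the neighborhood list on the slide.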


Design Pipeline

[Diagram: STKE Dataset → Preprocessor → Pathway Graphs → Frequent Subgraph Discovery → Frequent Subgraphs → Clustering / NN Storytelling]


Subgraph Discovery

k      # of Subgraphs generated    Time (sec.)
1      1,376                       Existing
2      5,380                       41
3      29,565                      149
4      187,508                     971
5      1,274,852                   7,518
...    ...                         ...

min_sup = 2%

• What is so novel about pruning edges?


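‘Importance Factor’ of a subgraph: sfipf

For the i-th subgraph and the j-th pathway:

$\mathrm{sfipf}_{i,j} = \mathrm{sf}_{i,j} \times \mathrm{ipf}_{i}$

where $\mathrm{sf}_{i,j}$ is the subgraph frequency and $\mathrm{ipf}_{i}$ is the inverse pathway frequency, computed over the pathway set $D$.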
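A small sketch of how such an importance factor could be computed is given below. The two component formulas are an assumption of this sketch: it uses the standard tf-idf forms, with subgraph frequency taken as a subgraph's count in a pathway divided by the number of subgraph occurrences in that pathway, and inverse pathway frequency taken as log(|D| / number of pathways containing the subgraph). The slide may define the components differently. The function name `sfipf_scores` and the input layout are hypothetical.

```python
import math


def sfipf_scores(pathway_subgraphs):
    """Compute sf * ipf for every (subgraph, pathway) pair.

    pathway_subgraphs maps a pathway id to the list of subgraph ids found
    in it (repeats allowed). The tf-idf style definitions used here are an
    assumption of this sketch.
    """
    n_pathways = len(pathway_subgraphs)

    # Number of pathways in which each subgraph occurs at least once.
    containing = {}
    for subgraphs in pathway_subgraphs.values():
        for s in set(subgraphs):
            containing[s] = containing.get(s, 0) + 1

    scores = {}
    for pathway, subgraphs in pathway_subgraphs.items():
        total = len(subgraphs)
        for s in set(subgraphs):
            sf = subgraphs.count(s) / total
            ipf = math.log(n_pathways / containing[s])
            scores[(s, pathway)] = sf * ipf
    return scores


# Hypothetical data: subgraph "a" occurs in every pathway.
example = {"P1": ["a", "b"], "P2": ["a"], "P3": ["a", "c", "c"]}
for key, value in sorted(sfipf_scores(example).items()):
    print(key, round(value, 4))
```

Under these definitions a subgraph that appears in every pathway scores zero, which is the behaviour a threshold such as min_sfipf would exploit to discard uninformative edges.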


Dataset Properties (sfipf)

[Figure: # of edges left (0 to 1400) versus min_sfipf (0.00 to 0.20).]

[Figure: # of pathways left (0 to 50) versus min_sfipf (0.00 to 0.20).]

Number of edges in MPG = 1376

Total pathways = 50


Subgraph Discovery

min_sup = 4.0%, min_sfipf = 0.01

[Figure: Time (ms) versus k (3 to 21) for FSG and SEG; y-axis from 0 to 400x10^3.]


Subgraph Discovery

min_sup = 4.0%, min_sfipf = 0.01

[Figure: Time (ms) versus k (3 to 21) for FSG and SEG; y-axis from 0 to 3000.]


Subgraph Discovery

min_sup = 4.0%, min_sfipf = 0.01

[Figure: # of Attempts versus k (3 to 21) for FSG and SEG; y-axis from 0 to 1250000.]

k     Number of Subgraphs    Time Saved (%)    Attempts Saved (%)
2     186                    99.83             98.98
3     246                    98.33             86.15
4     305                    98.57             86.38
5     323                    98.95             86.91
6     313                    98.96             85.64
7     279                    98.88             83.25
8     263                    98.67             78.91
9     292                    98.38             74.76
10    364                    98.58             74.75
11    470                    98.76             78.08
12    608                    99.04             81.84
13    785                    99.22             85.02
14    980                    99.38             87.63
15    1117                   99.48             89.48
16    1075                   99.53             90.26
17    804                    99.51             89.40
18    430                    99.34             85.22
19    141                    98.76             71.22
20    20                     96.15             9.19
21    1                      75.74             -574.47

Overall attempts saved = 89.52%

Overall time saved = 99.39%
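The "saved" percentages can be read with a simple relative-reduction formula. The sketch below assumes that FSG is the baseline and that the saving is expressed as a share of the FSG cost. The slide does not state the formula, so this is an interpretation, and the function name `percent_saved` and the sample numbers are hypothetical.

```python
def percent_saved(baseline, improved):
    """Share of the baseline cost avoided by the improved method, in percent."""
    return (baseline - improved) / baseline * 100.0


# Hypothetical counts, not taken from the slide.
print(percent_saved(200, 50))   # improved method needs a quarter of the work
print(percent_saved(100, 150))  # improved method needs more work than baseline
```

Under this reading, a negative value such as the attempts figure at k = 21 would mean that SEG made more attempts than FSG at that level, while still finishing faster.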


Clustering

Hierarchical Agglomerative Clustering (HAC)

k-means

Unsupervised measure of clusters’ validity: Average Silhouette Coefficient (ASC) [19]
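The Average Silhouette Coefficient can be sketched in a few lines. This is a generic illustration of the standard definition, not the authors' code: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b). The sketch assumes at least two clusters, and it assigns a silhouette of 0 to singleton clusters by convention. The function name `average_silhouette` is hypothetical.

```python
def average_silhouette(points, labels, dist):
    """Average silhouette coefficient of a clustering.

    points: list of items, labels: cluster label per item,
    dist: function returning the distance between two items.
    Assumes at least two distinct labels.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster contributes 0
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 1-D data with two well-separated groups.
values = [0.0, 1.0, 10.0, 11.0]
distance = lambda x, y: abs(x - y)
print(average_silhouette(values, [0, 0, 1, 1], distance))
print(average_silhouette(values, [0, 1, 0, 1], distance))
```

The first labelling groups nearby values together and scores close to 1; the second mixes the groups and scores below 0. In the pathway setting, `dist` would be derived from a pathway similarity measure.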


Clustering

min_sup = 4%, min_sfipf = 0.01

[Figure: k-means; ASC (0.0 to 0.4) versus # of Clusters (2 to 20); series: Cosine sfipf, Dice, Jaccard, Overlap.]

[Figure: HAC; ASC (0.0 to 0.4) versus # of Clusters (2 to 20); series: Cosine sfipf, Dice, Jaccard, Overlap.]
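The similarity measures named in the plot legends have standard definitions, sketched below. How the authors applied them is not stated on the slide, so this sketch assumes that each pathway is represented by the set of frequent subgraphs it contains (for Dice, Jaccard and Overlap) or by a dictionary of subgraph weights such as sfipf values (for cosine). All inputs are assumed non-empty.

```python
import math


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def dice(a, b):
    """Twice the intersection size divided by the sum of set sizes."""
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a, b):
    """Intersection size divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


# Hypothetical pathways described by subgraph ids.
p1 = {"s1", "s2", "s3"}
p2 = {"s2", "s3", "s4", "s5"}
print(jaccard(p1, p2), dice(p1, p2), overlap(p1, p2))
print(cosine({"s1": 1.0, "s2": 2.0}, {"s2": 2.0}))
```

A distance for clustering can then be taken as one minus the similarity.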


Clustering

[Figure: ASC contour map for 10 clusters using HAC; x-axis: min_sup (4 to 12); y-axis: min_sfipf (0.01 to 0.05); contour levels from 0.08 to 0.20.]

[Figure: ASC contour map for 10 clusters using k-means; x-axis: min_sup (4 to 12); y-axis: min_sfipf (0.01 to 0.05); contour levels from 0.04 to 0.14.]


Design Pipeline

[Diagram: STKE Dataset → Preprocessor → Pathway Graphs → Frequent Subgraph Discovery → Frequent Subgraphs → Clustering / NN Storytelling]


Pathway Relations (StoryTelling)

Bidirectional Search

Cover tree for NN

[Figure: search expanding from S (through p1, p2, p3) and from T (through p7, p8, p9).]
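A bidirectional search over a nearest-neighbor graph can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: nearest neighbors are found by brute-force sorting instead of a cover tree, each item may step only to one of its b nearest neighbors (b being the branching factor), and the search is a plain breadth-first expansion from both ends. The function names are hypothetical.

```python
from collections import deque


def build_knn_graph(items, dist, b):
    """Forward and reverse adjacency of the b-nearest-neighbor graph."""
    fwd = {
        x: sorted((y for y in items if y != x), key=lambda y: dist(x, y))[:b]
        for x in items
    }
    bwd = {x: [] for x in items}
    for x, neighbors in fwd.items():
        for y in neighbors:
            bwd[y].append(x)  # x can step to y
    return fwd, bwd


def _expand(queue, adjacency, seen, other_seen):
    """Expand one node; return the meeting node if the two searches touch."""
    node = queue.popleft()
    for nxt in adjacency[node]:
        if nxt not in seen:
            seen[nxt] = node
            if nxt in other_seen:
                return nxt
            queue.append(nxt)
    return None


def find_story(start, goal, fwd, bwd):
    """Return a chain start -> ... -> goal along neighbor links, or None."""
    if start == goal:
        return [start]
    parent_f, parent_b = {start: None}, {goal: None}
    queue_f, queue_b = deque([start]), deque([goal])
    while queue_f and queue_b:
        meet = _expand(queue_f, fwd, parent_f, parent_b)
        if meet is None:
            meet = _expand(queue_b, bwd, parent_b, parent_f)
        if meet is not None:
            story, node = [], meet
            while node is not None:  # walk back to the start
                story.append(node)
                node = parent_f[node]
            story.reverse()
            node = parent_b[meet]
            while node is not None:  # walk forward to the goal
                story.append(node)
                node = parent_b[node]
            return story
    return None


# Hypothetical items on a line; distance is the absolute difference.
items = [0, 1, 2, 3, 4, 10]
fwd, bwd = build_knn_graph(items, lambda x, y: abs(x - y), b=2)
print(find_story(0, 4, fwd, bwd))
print(find_story(0, 10, fwd, bwd))
```

The second query returns None because no item has 10 among its two nearest neighbors, which illustrates why a larger branching factor tends to yield more stories.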


Day-to-day life example

From: Roman Holiday To: Terminator 3

Roman Holiday, Sabrina, Breakfast at Tiffany’s, Some Like it Hot, Rear Window, 2001: A Space Odyssey, Golden Eye, Die Another Day, Terminator 3

From Terminator 3

Terminator 3, Collateral damage, Lethal Weapon 4, Die Hard 2, Speed, Air Force One, U.S. Marshals, S.W.A.T., The day after Tomorrow, van Helsing, Blade: Trinity

From Roman Holiday

Roman Holiday, Sabrina, Funny Face, Deep in my Heart, Singing in the rain, An American in Paris, Kismet, Kiss me Kate, High Society, Anchors Aweigh, On the Town, Take me out to the Ball Game


Examples in STKE

http://people.cs.vt.edu/msh/infoviz/3/


Pathway Relations (StoryTelling)

Numbers of varying length stories for different branching factors

[Figure: Number of t-length stories (0 to 350) versus Story length, t (3 to 16); series: b=2, b=4, b=6, b=8.]


Pathway Relations (StoryTelling)

Numbers of varying length stories for different branching factors

[Figure: Number of t-length stories (0 to 350) versus Story length, t (3 to 16); series: b=2 through b=10.]


Pathway Relations (StoryTelling)

[Figure: Total stories from all pairs (0 to 1000) versus Branching factor, b (2 to 10).]

[Figure: Time to generate all stories (ms) (0 to 1.4x10^6) versus Branching factor, b (2 to 10).]

[Figure: Length of the longest story (4 to 16) versus Branching factor, b (2 to 10).]


Future Directions

Compare our SEG graph methods with text-based clustering and storytelling

Examine costs and benefits for combining text and graph mining techniques


References

[1] Science Signaling, The Signal Transduction Knowledge Environment (STKE), "The Database of Cell Signaling", http://stke.sciencemag.org/cm/

[2] Kuramochi, M. and Karypis, G., "An efficient algorithm for discovering frequent subgraphs", IEEE Transactions on KDE, Vol. 16(9), September 2004, pp. 1038-1051.

[3] Breslin, T., Krogh, M., Peterson, C., and Troein, C., "Signal transduction pathway profiling of individual tumor samples", BMC Bioinformatics, June 29, 2005.

[4] Kumar, D., Ramakrishnan, N., Helm, R. F., and Potts, M., "Algorithms for Storytelling", IEEE Transactions on KDE, Vol. 20(6), June 2008, pp. 736-751.

[5] Ratprasartporn, N., Cakmak, A., and Ozsoyoglu, G., "On Data and Visualization Models for Signaling Pathways", 18th SSDBM, 2006, pp. 133-142.

[6] Xu, X., and Yu, Y., "Modeling and Verifying WNT Signaling Pathway", 3rd Intl. Conf. on ICNC. 2007, Vol. 2, pp. 319 - 323.

[7] Schreiber, F., "Comparison of metabolic pathways using constraint graph drawing", 1st Asia-Pacific bioinformatics Conf. on Bioinfo., Australia, Vol. 19, 2003, pp. 105 - 110.

[8] Abello, J., van Ham, F., and Krishnan, N., "ASKGraphView: A Large Scale Graph Visualization System", IEEE Transactions on Visualization and Computer Graphics, Vol. 12(5), 2006, pp. 669 - 676.

[9] Miyake, S., Tohsato, A., Takenaka, Y., and Matsuda, H. "A clustering method for comparative analysis between genomes and pathways", 8th Intl. Conf. on Database Systems for Advanced Applications, March 2003 pp. 327 - 334.

[10] Yan, X., and Han, J. "gSpan: graph-based substructure pattern mining", IEEE ICDM, 2002, pp. 721-724.

[11] Moti, C., and Ehud, G. "Diagonally Subgraphs Pattern Mining", 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2004, pp. 51-58.

[12] Ketkar, N., Holder, L., Cook, D., Shah, R., and Coble, J. "Subdue: Compression-based Frequent Pattern Discovery in Graph Data", ACM KDD Workshop on Open-Source Data Mining, August 2005, pp. 71-76.

[13] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD Intl. Conf. on Management of Data, Canada, 1996, pp. 103-114.

[14] Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S., "Constrained K-means Clustering with Background Knowledge", ICML 2001, pp. 577-584.

[15] Lin, F., and Hsueh, C. M., "Knowledge map creation and maintenance for virtual communities of practice", Intl. Journal of Information Processing and Management, ACM, Vol. 42(2), 2006, pp. 551-568.

[16] Beygelzimer, A., Kakade, S., Langford, J., "Cover trees for nearest neighbor", ICML 2006, pp. 97-104.

[17] Agrawal, R., and Srikant, R. "Fast Algorithms for Mining Association Rules", Intl. Conf. on Very Large Data Bases, Santiago, Chile, September 1994, pp. 487-499.

[18] Agrawal, R., Mehta, M., Shafer, J., Srikant, R., Arning, A. and Bollinger, T. "The Quest Data Mining System", KDD'96, USA, 1996, pp. 244-249.

[19] Tan, P. N., Steinbach, M., and Kumar, V., "Introduction to Data Mining", Addison-Wesley, ISBN: 0321321367, April 2005, pp. 539-547.

[20] http://people.cs.vt.edu/amonika/infoviz/


Thank You