33
VISUALIZING THE VISUALIZING THE PROTEIN SEQUENCE PROTEIN SEQUENCE UNIVERSE UNIVERSE L.STANBERRY 1 , R.HIGDON 1 , W.HAYNES 1 , N.KOLKER 1 , W.BROOMALL 1 , S.EKANAYAKE 2 , A.HUGHES 2 , Y.RUAN 2 , J.QIU 2 , E.KOLKER E.KOLKER 1 , G.FOX 2 1 SEATTLE CHILDREN’S, 2 INDIANA UNIVERSITY ECMLS 2012, 18 June 2012 ECMLS 2012, 18 June 2012

ECMLS 2012, 18 June 2012

  • Upload
    shelly

  • View
    19

  • Download
    2

Embed Size (px)

DESCRIPTION

Visualizing the Protein Sequence Universe L.Stanberry 1 , R.Higdon 1 , W.Haynes 1 , N.Kolker 1 , W.Broomall 1 , S.Ekanayake 2 , A.Hughes 2 , Y.Ruan 2 , J.Qiu 2 , E.Kolker 1 , G.Fox 2 1 Seattle Children’s, 2 Indiana University. ECMLS 2012, 18 June 2012. Outline. - PowerPoint PPT Presentation

Citation preview

Page 1: ECMLS 2012, 18 June 2012

VISUALIZING THE VISUALIZING THE PROTEIN SEQUENCE PROTEIN SEQUENCE UNIVERSEUNIVERSE

L.STANBERRY1, R.HIGDON1, W.HAYNES1, N.KOLKER1, W.BROOMALL1, S.EKANAYAKE2, A.HUGHES2, Y.RUAN2, J.QIU2, E.KOLKERE.KOLKER11, G.FOX2

1SEATTLE CHILDREN’S, 2INDIANA UNIVERSITY

ECMLS 2012, 18 June 2012 ECMLS 2012, 18 June 2012

Page 2: ECMLS 2012, 18 June 2012

Outline

Visualizing PSU

2

A 4th paradigm problem in biology Assigning functions (annotating)

proteins Challenge Our goal PSU: methods & initial results Conclusions

ECMLS 2012

Page 3: ECMLS 2012, 18 June 2012

Grand Challenge of Functional Genomics

Visualizing PSU

3

New technologies produce peta- and exabytes of data

Protein Sequence Universe (PSU), the protein sequence space, expand exponentially EMP, i5K, iPlant, NEON 30% of existing sequenced proteins

unannotated Existing resources overwhelmed, many

unsupported: COG, Systers, ClusTr, eggNOG.

ECMLS 2012

Page 4: ECMLS 2012, 18 June 2012

Ultimate Goal: Annotate All Proteins

Visualizing PSU

4

Our approach: Revitalize, expand & enhance protein

annotation resources. Develop sustainable software framework. Use HPC and most powerful CI – grids &

clouds. Provide rigorous and reliable tools to

annotate protein sequences.

ECMLS 2012

Page 5: ECMLS 2012, 18 June 2012

COG: Clusters of Orthologous Groups

Visualizing PSU

5

COG database was developed by NCBI. Proteins classified into groups with common

function encoded in complete genomes. Prokaryotes (COG): 66 genomes, 200K proteins,

5K clusters. Eukaryotes (KOG): 7 genomes, 113K proteins,

5K clusters. Valuable scientific resource: 5K citations. Last updated: 2006.

ECMLS 2012

Page 6: ECMLS 2012, 18 June 2012

Clustering 10 million UniRef100

Visualizing PSU

6

UniRef100: 10M proteins including 5.3M bacterial & archaeal

BLAST - common sequence alignment approach

All vs. All alignment on Azure 475 eight-core virtual machines

produced 3+ billion filtered records in 6 days

ECMLS 2012

Page 7: ECMLS 2012, 18 June 2012

Clustering 10 million UniRef100

Visualizing PSU

7

Use prokaryotic COG as a starting point. Expand COGs ~20 fold (3.5 million

proteins). Cluster 2M proteins into 500K functional

groups Single linkage clustering with

MapReduce framework on Hadoop

ECMLS 2012

Page 8: ECMLS 2012, 18 June 2012

Promise and Challenge of Annotation

Visualizing PSU

8

Clustering facilitates mass annotationBUT

Takes considerable efforts and expertise

Multiple cloud systems and compute solutions

ECMLS 2012

Page 9: ECMLS 2012, 18 June 2012

Public Resources

ECMLS 2012Visualizing PSU

9

Struggle to cope with the influx of data Provide limited interactive and analytic

capabilities Many no longer supported (SYSTERS,

CluSTr, COG)

Biological community needs scalable, Biological community needs scalable, sustainable and efficient approach to sustainable and efficient approach to visualize, explore and annotate new visualize, explore and annotate new data. data.

Page 10: ECMLS 2012, 18 June 2012

Protein Sequence Universe

Visualizing PSU

10

PSU Goal: Enhance annotation resources with analytic and visualization (browser) tools.

Project sequence data into 3D using multidimensional scaling (MDS).

MDS interpolation allows expanding the universe without time consuming all vs all O(N2) 3D map allows much faster interpolation

ECMLS 2012

Page 11: ECMLS 2012, 18 June 2012

Multi-Dimensional Scaling (MDS)

Visualizing PSU

11

Sammon‘s objective function

is dissimilarity measure between sequences i and j

d is Euclidean distance between projections xi and xj

Denominator: larger contribution from smaller dissimilarities

f is monotone transformation of dissimilarity measure chosen “artistically”

ECMLS 2012

n

ji ij

jiij

f

xxdfH

)(

),()( 2

ij

Page 12: ECMLS 2012, 18 June 2012

Typical Metagenomic MDS

ECMLS 2012Visualizing PSU

12

Page 13: ECMLS 2012, 18 June 2012

MDS Details

ECMLS 2012Visualizing PSU

13

f chosen heuristically to increase the ratio of standard deviation to mean for and to increase the range of dissimilarity measures.

O(n2) complexity to map n sequences into 3D. MDS can be solved using EM (SMACOF – fastest

but limited) or directly by Newton's method (it’s just 2 )

Used robust implementation of nonlinear 2 minimization with Levenberg-Marquardt

3D projections visualized in PlotViz

)( ijf

Page 14: ECMLS 2012, 18 June 2012

MDS Details

ECMLS 2012Visualizing PSU

14

Input Data: 100K sequences from well-characterized prokaryotic COGs.

Proximity measure: sequence alignment scores Scores calculated using Needleman-Wunsch Scores “sqrt 4D” transformed and fed into MDS

Analytic form for transformation to 4D ij

n decreases dimension n > 1; increases n < 1 “sqrt 4D” reduced dimension of distance data

from 244 for ij to14 for f(ij) Hence more uniform coverage of Euclidean space

Page 15: ECMLS 2012, 18 June 2012

3D View of 100K COG Sequences

Visualizing PSU

15

ECMLS 2012

Page 16: ECMLS 2012, 18 June 2012

Implementation

Visualizing PSU

16

NW computed in parallel on 100 node 8-core system.

Used Twister (IU) in the Reduce phase of MapReduce

MDS Calculations performed on 768 core MS HPC cluster (32 nodes)

Scaling, parallel MPI with threading intranode Parallel efficiency of the code approximately 70% Lost efficiency due memory bandwidth saturation NW required 1 day, MDS job - 3 days.

ECMLS 2012

Page 17: ECMLS 2012, 18 June 2012

Cluster Annotation

Visualizing PSU

17

COG Annotation Uniref100

COG1131 ABC-type multidrug transport system, ATPase component 14406

COG1136ABC-type antimicrobial peptide transport system, ATPase

component 7306

COG1126 ABC-type polar amino acid transport system, ATPase component 4061

COG3839 ABC-type sugar transport systems, ATPase component 4121

COG0444ABC-type dipeptide/oligopeptide/nickel transport system ATPase

comp 3520

COG4608 ABC-type oligopeptide transport system, ATPase component 3074

COG3842 ABC-type spermidine/putrescine transport systems, ATPase comp 3665

COG0333 Ribosomal protein L32 1148

COG0454 Histone acetyltransferase HPA2 and Related acetyltransferases 14085

COG0477 Permeases of the major facilitator superfamily 48590

COG1028 Dehydrogenases with different specificities 37461

ECMLS 2012

Page 18: ECMLS 2012, 18 June 2012

Heatmap of NW vs Euclidean Distances

Visualizing PSU

18

ECMLS 2012

Page 19: ECMLS 2012, 18 June 2012

Dendrogram of Cluster Centroids

Visualizing PSU

19

ECMLS 2012

Page 20: ECMLS 2012, 18 June 2012

Selected Clusters

Visualizing PSU

20

ECMLS 2012

Page 21: ECMLS 2012, 18 June 2012

Heatmap for Selected Clusters

Visualizing PSU

21

ECMLS 2012

Page 22: ECMLS 2012, 18 June 2012

Future Steps

Comparison Needleman-Wunsch v. Blast v. PSIBlast NW easier as complete; Blast has missing distances

Different Transformations distance monotonic function(distance) to reduce formal starting dimension (increase sigma/mean)

Automate cluster consensus finding as sequence that minimizes maximum distance to other sequences

Improve O(N2) to O(N) complexity by interpolating new sequences to original set and only doing small regions with O(N2) Successful in metagenomics Can use Oct-tree from 3D mapping or set of consensus vectors Some clusters diffuse?

ECMLS 2012Visualizing PSU

22

Page 23: ECMLS 2012, 18 June 2012

ECMLS 2012Visualizing PSU

23

Blast6

Page 24: ECMLS 2012, 18 June 2012

ECMLS 2012Visualizing PSU

24

Full DataBlast6

Originalrun has 0.96 cut

Page 25: ECMLS 2012, 18 June 2012

ECMLS 2012Visualizing PSU

25Cluster DataBlast6

Originalrun has 0.96 cut

Page 26: ECMLS 2012, 18 June 2012

26

Use Barnes Hut OctTree originally developed to make O(N2) astrophysics O(NlogN)

Page 27: ECMLS 2012, 18 June 2012

27

OctTree for 100K sample of Fungi

We use OctTree for logarithmic

interpolation

Page 28: ECMLS 2012, 18 June 2012

440K Interpolated28

Page 29: ECMLS 2012, 18 June 2012

Conclusions

Visualizing PSU

29

Data Knowledge: protein annotation Overwhelming influx of new sequences Annotation is an immense challenge. HPC and advanced analytics needed.

PSU as tool to facilitate annotation: Interactive visualization and exploration Integrates info on function, pathways, structure, and

environment MDS preserves grouping structure of protein space MDS can use different proximities and biological data Parallel MDS handles large-scale data MDS interpolation quickly maps new sequences into existing

space

ECMLS 2012

Page 30: ECMLS 2012, 18 June 2012

DELSA: Data → Knowledge → Action

Visualizing PSU

30

Data-Enabled Life Sciences Alliance International

Collective innovation to tackle modern biological challenges through best computational practices and advanced cyberinfrastructure.

Harness expertise and resources across disciplines Promote accurate, sustainable, scalable

approaches Facilitate translation of data influx into tangible

innovations and groundbreaking discoveries

ECMLS 2012

Page 31: ECMLS 2012, 18 June 2012

References and Resources

Visualizing PSU

31

COG data is available at the NCBI siteftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0303/

MDS results are available at http://manxcatcogblog.blogspot.com/

All software used to analyze and visualize the data is an open source.

DELSA: http://www.delsaglobal.org Protein Global Atlas and Data Accessibility

Projects

ECMLS 2012

Page 32: ECMLS 2012, 18 June 2012

Acknowledgements

ECMLS 2012Visualizing PSU

32

Grant support NSF: under DBI: 0969929 (EK) and

0910818 (GF) NIH: 5 RC2 HG 005806- 02 (GF); NIGMS

grant R01 GM-076680-04 (EK); NIDDK grants U01-DK-089571 and U01-DK-072473 (EK)

Page 33: ECMLS 2012, 18 June 2012

Visualizing PSU

25

Thank you for your attention

ECMLS 2012