
__________________________________________________________________________________________________ 10/18/2013 GCBA 815

Tools and Algorithms in Bioinformatics GCBA815, Fall 2013

Week6: Interaction Network Analysis

(Cytoscape)

Babu Guda Department of Genetics, Cell Biology and Anatomy

University of Nebraska Medical Center


Background:

•  Gene products (RNA/proteins) rarely work alone; most often they interact with other gene products to accomplish a task

•  Most of the cellular processes are regulated by protein-protein or DNA/RNA-protein complexes

•  Impaired protein interactions can be causative factors for diseases or metabolic abnormalities

•  Guilt by association: The unknown function of a protein can be inferred based on the proteins it interacts with, if those proteins have a known function

•  The field of protein-protein interactions (PPIs) is rapidly advancing at various fronts of biomedical research.
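The guilt-by-association idea above can be illustrated with a small sketch. It assumes the simplest possible rule, a majority vote over the annotated interaction partners of a protein; the protein names and annotations are hypothetical toy data, and real methods typically weight partners by interaction confidence.

```python
from collections import Counter


def infer_function(protein, interactions, known_functions):
    """Predict a function for `protein` by majority vote over its annotated partners."""
    votes = Counter()
    for a, b in interactions:
        if protein not in (a, b):
            continue
        partner = b if a == protein else a
        # Skip self-interactions and partners without a known function.
        if partner != protein and partner in known_functions:
            votes[known_functions[partner]] += 1
    if not votes:
        return None  # no annotated partner, so nothing can be inferred
    return votes.most_common(1)[0][0]


# Toy data (hypothetical protein names and annotations).
interactions = [("P1", "P2"), ("P1", "P3"), ("P4", "P1"), ("P2", "P3")]
known_functions = {"P2": "DNA repair", "P3": "DNA repair", "P4": "transport"}

predicted = infer_function("P1", interactions, known_functions)
print(predicted)
```

Here P1 has two partners annotated with one function and one partner annotated with another, so the vote picks the more common annotation.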


Yeast interactome Wuchty et al., 2003; Barabasi and Oltvai, 2004

Red: lethal

Green: non-lethal

Orange: slow growth

Yellow: Unknown


Human interactome (Stelzl et al., 2005; figure reproduced from Cell)

•  Screened about 25 million candidate protein pairs

•  Found 3186 PPIs among 1705 proteins

•  Mapped 195 disease proteins to new interaction partners

•  Provided functional annotation for 342 uncharacterized human proteins

Light blue: known proteins

Orange: Disease proteins

Yellow: Uncharacterized


Types of protein-protein interactions

•  Permanent and transient interactions

•  Direct (physical) and indirect (sharing a common partner) interactions

•  Homo and hetero interactions

•  Interchain and intrachain interactions

•  Binary interactions and complexes

•  Spoke and matrix models to expand binary interactions from complexes

•  Interologs (shared PPIs across species)
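The spoke and matrix models mentioned above differ only in which pairs they generate from a purified complex. A minimal sketch, assuming a complex is given as one bait protein plus the prey proteins pulled down with it (the function names and the proteins A to D are hypothetical):

```python
from itertools import combinations


def spoke_model(bait, preys):
    """Spoke model: the bait is paired with each co-purified prey."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_model(bait, preys):
    """Matrix model: every pair of complex members is assumed to interact."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical complex: bait A pulled down with preys B, C and D.
spoke = spoke_model("A", ["B", "C", "D"])
matrix = matrix_model("A", ["B", "C", "D"])
print(len(spoke), len(matrix))
```

The spoke pairs are always a subset of the matrix pairs; the matrix model adds prey-prey pairs such as B-C, which raises coverage at the cost of more false positives.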


Experimental methods for identifying PPIs

Adapted from PLoS Computational Biology, Shoemaker and Panchenko, 2007, 3:E42


Computational methods for predicting PPIs

Adapted from PLoS Computational Biology, Shoemaker and Panchenko, 2007, 3:E43


PPI databases and their characteristics

Interaction Database | URL | Comments on the source and type of data covered
BIND | http://www.bind.ca/Action | All major model species covered.
BioGRID | http://www.thebiogrid.org | Mixture of in vivo, in vitro and Y2H interactions from different sources. Major species covered are Yeast, Drosophila, C. elegans and Human.
DIP | http://dip.doe-mbi.ucla.edu/ | Mostly Y2H studies, all major model species covered.
HPRD | http://www.hprd.org | Only Human, manually curated from the literature
IntAct | http://www.ebi.ac.uk/intact | Mainly literature-curated. All major model species covered
MINT | http://mint.bio.uniroma2.it/mint | Both experimental and literature-based, major species covered are Yeast, Drosophila and Human.
OPHID | http://ophid.utoronto.ca/ophid | Only Human (Experimental and predicted)
PRISM | http://gordion.hpc.eng.ku.edu.tr/prism | Predicted interactions based on interacting surfaces in X-ray crystal structures
STRING | http://string.embl.de | Mostly predicted interactions based on multiple criteria


Domain-Domain Interactions (DDIs)

PPIs → DDIs


PPIs


Protein Domain databases and InterPro

•  PROSITE: A database of protein profiles and patterns

•  PRODOM: PROtein DOMain database, built from UniProt

•  PRINTS: A compendium of protein fingerprints

•  PFAM: Protein families database of alignments and HMMs

•  TIGRfams: Protein families based on HMMs

•  SMART: Simple Modular Architecture Research Tool

•  BLOCKS: Blocks WWW Server, derived from PROSITE

•  PANTHER: Protein Analysis Through Evolutionary Relationships

•  CATH: Class, Architecture, Topology and Homologous superfamily

•  SCOP: Structural Classification of Proteins

•  Superfamily: Structural and functional protein annotation

•  Gene3D: Domain architecture classification

•  INTERPRO: Integrated resource of protein domains and functional sites


Domain architecture of the human EGF protein family (Pal and Guda, 2006, BMC Evolutionary Biology 6:91)


Functional architecture of EGF Receptor protein


Significance of studying DDIs

•  Protein-protein interaction (PPI) data are available as binary data, i.e., an interaction is either ‘found’ or ‘not found’.

•  About 70% of eukaryotic proteins are multi-domain proteins. In these cases, it is difficult to know which domains actually participate in each interaction.

•  Studying interactions at the domain level is vital for understanding the functional significance of PPIs.

•  Experimental determination of all DDIs is tedious, hence computational methods can be used to infer DDIs in PPIs and thus can complement experimental investigations.

GR Riggs et al, 2003, EMBO 22:1158
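The multi-domain ambiguity described above can be made concrete: for one interacting protein pair, every cross-protein domain pair is a candidate DDI, and the PPI alone cannot say which one mediates the contact. A minimal sketch, assuming domain assignments are already available per protein (the proteins, the domain lists and the function name are hypothetical):

```python
from itertools import product


def candidate_domain_pairs(ppi, domains_of):
    """List every domain pair that could mediate one protein-protein interaction."""
    protein_a, protein_b = ppi
    # Sort each pair so that (X, Y) and (Y, X) count as the same candidate.
    return {
        tuple(sorted(pair))
        for pair in product(domains_of[protein_a], domains_of[protein_b])
    }


# Hypothetical domain assignments for two interacting proteins.
domains_of = {
    "ProtA": ["SH2", "SH3", "Kinase"],
    "ProtB": ["SH3", "PDZ"],
}
pairs = candidate_domain_pairs(("ProtA", "ProtB"), domains_of)
for pair in sorted(pairs):
    print(pair)
```

A single binary PPI between a 3-domain and a 2-domain protein already yields six candidate DDIs, which is why a scoring scheme is needed to rank them.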


Flow diagram showing the steps involved in the method development (Guda et al., 2009, PLoS ONE, 4:e5096)

STEPS

•  Datasets: positive and negative PPIs

•  Domain mapping

•  Scoring features and scoring function

•  Testing and validation


Experimental PPI datasets (Pint) from 5 source databases

Interaction Database | Number of PPIs | Comments on the source and type of data covered
BIND | 80,378 | All major model species covered.
DIP | 53,778 | Mostly Y2H studies, all major model species covered.
HPRD | 34,367 | Only Human, manually curated from the literature
IntAct | 125,792 | Mainly literature-curated. All major model species covered
MINT | 115,383 | Both experimental and literature-based, major species covered are Yeast, Drosophila and Human.
Combined unique set | 209,165 | A non-redundant set of PPIs corresponding to 70,769 unique proteins was obtained by combining the above 5 datasets.
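Building a combined unique set comes down to treating each PPI as an unordered pair, so that A-B from one database and B-A from another collapse into one entry. A minimal sketch, assuming protein identifiers have already been mapped to a common namespace (the source does not describe that step); the tiny datasets below are hypothetical and only reuse three of the database names as variable names:

```python
def merge_ppi_sets(*datasets):
    """Merge PPI lists into a non-redundant set of unordered protein pairs."""
    merged = set()
    for dataset in datasets:
        for protein_a, protein_b in dataset:
            # frozenset ignores order; a self-interaction becomes a 1-element set.
            merged.add(frozenset((protein_a, protein_b)))
    return merged


# Hypothetical toy datasets with overlapping and reversed pairs.
bind = [("P1", "P2"), ("P2", "P3")]
dip = [("P2", "P1"), ("P3", "P4")]
hprd = [("P4", "P3"), ("P5", "P5")]

combined = merge_ppi_sets(bind, dip, hprd)
proteins = set().union(*combined)
print(len(combined), len(proteins))
```

Six input records reduce to four unique interactions over five proteins, the same kind of reduction that takes the five source databases down to the combined set in the table.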


Domain Mapping

•  Domain definitions were obtained from the InterPro database, which integrates 10 distinct domain databases such as Pfam, Prosite, SMART, Superfamily, etc. Out of 15,064 domains in the InterPro database, 10,389 (~70%) were used in this study.

Positive DDI dataset for validation:

•  About 4000 known DDIs from the iPfam database were used.

•  iPfam was created based on domain-domain contacts in solved protein structures and complexes. This dataset has been extensively used as a ‘gold standard’ for validating computational prediction methods.


Five orthogonal scoring features used in this study

1.  Probabilistic – Ratio of expected frequency of occurrence

2.  Evolutionary – Co-occurrence in Rosetta stone proteins

3.  Evidence-based – Observed in multiple species

4.  Spatial – Co-localized in the same subcellular location

5.  Functional – Semantic similarity of GO annotation

$$\mathrm{FinalScore}(d_{ij}) = \sum_{k=1}^{5} \mathrm{Score}(S_k)_{ij}$$
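The final score on this slide is a sum over the five feature scores for one candidate domain pair. A minimal sketch, assuming an unweighted sum of scores that are already on a comparable scale (the slides do not show how each feature is normalized); the feature keys and the example values are hypothetical:

```python
# Short keys standing in for the five scoring features listed on the slide.
FEATURES = ("probabilistic", "evolutionary", "evidence", "spatial", "functional")


def final_score(feature_scores):
    """Sum the five feature scores for one candidate domain-domain pair."""
    missing = [name for name in FEATURES if name not in feature_scores]
    if missing:
        raise KeyError(f"missing feature scores: {missing}")
    return sum(feature_scores[name] for name in FEATURES)


# Hypothetical scores for one domain pair d_ij.
scores = {
    "probabilistic": 0.5,
    "evolutionary": 0.25,
    "evidence": 1.0,
    "spatial": 0.0,
    "functional": 0.75,
}
total = final_score(scores)
print(total)
```

Requiring all five keys keeps a missing feature from silently lowering the score of a pair.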


Validation using positive and negative datasets


Validation of the method using ROC curves
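The slide names ROC curves without showing how the points are obtained. A minimal sketch of the standard construction, assuming each candidate DDI carries a final score and a label saying whether it belongs to the positive or the negative dataset; the scored examples are hypothetical and are not results from the study:

```python
def roc_points(scored_examples):
    """Return ROC points (FPR, TPR) from (score, is_positive) pairs."""
    positives = sum(1 for _, label in scored_examples if label)
    negatives = len(scored_examples) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(scored_examples, key=lambda item: item[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical scored predictions: (final score, is it a known DDI?).
examples = [(0.9, True), (0.8, True), (0.7, False),
            (0.6, True), (0.4, False), (0.2, False)]
points = roc_points(examples)
print(points)
print(round(auc(points), 3))
```

Each threshold on the score gives one (FPR, TPR) point, so a fixed false positive rate corresponds to a specific score cutoff on this curve.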


Comparison of predictive performance against existing methods


Domain-Domain Interaction Network of Breast Cancer Proteins


Figure panels: BRCT-1% FPR, BRCT-5% FPR, BRCT-10% FPR


Using Graph theory to study Biological Interaction Networks

•  Graphs are data structures that provide the framework to represent biological networks

•  Nodes or vertices (singular: vertex) are the building blocks of graphs

•  An edge is the connection between two nodes

•  A leaf node is a terminal node in a graph that is connected at only one end

•  Degree is a node attribute that describes how many edges connect a node to other nodes

•  Both nodes and edges can have weights that show their relative importance in a graph; weights are used for quantitative modeling of networks
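The terms above map directly onto a small data structure. A minimal sketch, assuming an undirected, unweighted graph stored as an adjacency list; the node names and the helper `build_graph` are hypothetical:

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency list: node -> set of neighboring nodes."""
    graph = defaultdict(set)
    for node_a, node_b in edges:
        graph[node_a].add(node_b)
        graph[node_b].add(node_a)
    return graph


# Hypothetical interaction network with four proteins.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")]
graph = build_graph(edges)

# Degree: number of edges attached to each node.
degree = {node: len(neighbors) for node, neighbors in graph.items()}

# Leaf nodes: nodes connected at only one end (degree 1).
leaves = sorted(node for node, d in degree.items() if d == 1)

print(degree)
print(leaves)
```

Edge weights could be added by storing a dictionary of neighbor-to-weight values instead of a set of neighbors.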


Basic Network terminology

•  Directed and undirected graphs

•  Cyclical and linear graphs

•  Complete and incomplete graphs

•  Hub nodes

•  Subgraphs

•  Graph centrality

•  Shortest path

•  Graph density

•  Power law distribution
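Three of these terms can be computed in a few lines. A minimal sketch for an undirected, unweighted graph, assuming density is defined as the fraction of possible edges that are present, shortest paths are found by breadth-first search, and a hub is simply a node with the highest degree (hub definitions vary; this cutoff is an assumption). The graph itself is hypothetical:

```python
from collections import deque


def density(graph):
    """Fraction of possible undirected edges present: 2m / (n(n-1))."""
    n = len(graph)
    if n < 2:
        return 0.0
    m = sum(len(neighbors) for neighbors in graph.values()) / 2
    return 2 * m / (n * (n - 1))


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical undirected network as an adjacency list.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"A", "C", "E"},
    "E": {"D"},
}
max_degree = max(len(neighbors) for neighbors in graph.values())
hubs = sorted(node for node in graph if len(graph[node]) == max_degree)

print(density(graph))
print(shortest_path(graph, "B", "E"))
print(hubs)
```

A complete graph has density 1.0, so the density value also shows how far a network is from being complete.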


Cytoscape

•  http://www.cytoscape.org

•  Interaction network visualization and analysis software, first published in 2003 by Trey Ideker’s group

•  Open-source tool with active developer support

•  Cytoscape version 2.8

•  Cytoscape version 3.0 is newly released with new features

•  Available for all platforms (Mac, PC, Linux)

•  Contains an extensive collection of plugins to analyze a variety of datasets from biology, social sciences and the semantic web

•  Integrated with other tools such as the R package


What can you do with Cytoscape?

•  Network Visualization
    •  Load molecular and genetic interaction datasets
        •  Protein-protein interaction data
        •  Protein-DNA interactions
        •  KEGG pathways
        •  Expression datasets

•  Network Analysis (mostly using Plugins)
    •  Analyze networks
        •  Network properties (degree distribution etc)
        •  Annotation-based filtering (subcellular mapping)
        •  Node and edge attribute analysis


How to use Cytoscape?

•  Register and download the software from Cytoscape
    •  http://www.cytoscape.org

•  Install on your local computer (PC/Mac/Linux)

•  Locate the folder (Program Files) where the files are stored

•  Use the example datasets
    •  .sif files are network input files
        •  A pp B or A pd C
    •  Node or edge attribute files
    •  .cys files are Cytoscape session files (contain info on network, attribute and session option data)
    •  Other formats: Text, Excel, GML, XGMML, SBML, BioPAX, PSI-MI files
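A .sif line such as `A pp B` reads as source node, interaction type, target node. A minimal sketch of a reader, assuming whitespace-delimited fields and node names without spaces; it also accepts several targets on one line and a lone node name for an unconnected node. The function `parse_sif` is a hypothetical helper, not part of Cytoscape:

```python
def parse_sif(text):
    """Parse SIF text into a set of nodes and a list of (source, type, target) edges."""
    nodes, edges = set(), []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source = fields[0]
        nodes.add(source)
        if len(fields) == 1:
            continue  # a lone name declares an unconnected node
        if len(fields) == 2:
            raise ValueError(f"interaction without a target: {line!r}")
        interaction = fields[1]
        for target in fields[2:]:
            nodes.add(target)
            edges.append((source, interaction, target))
    return nodes, edges


# The two example lines from the slide, plus one isolated node.
example = "A pp B\nA pd C\nD\n"
nodes, edges = parse_sif(example)
print(sorted(nodes))
print(edges)
```

In the slide's example, `pp` and `pd` are the interaction types for the two edges; the parser keeps the type with each edge so it can later be used as an edge attribute.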