ISSN 1742-206X
1742-206X(2012)8:8;1-5
www.molecularbiosystems.org Volume 8 | Number 8 | August 2012 | Pages 2013–2222
PAPER
Mark A. Ragan
Rewiring the dynamic interactome
Interfacing chemical biology with the -omic sciences and systems biology
MolecularBioSystems
Indexed in MEDLINE.
Published on 01 June 2012. Downloaded by Universitat Autonoma de Barcelona on 30/10/2014 12:40:45.
2054 | Mol. BioSyst., 2012, 8, 2054–2066 | This journal is © The Royal Society of Chemistry 2012
Cite this: Mol. BioSyst., 2012, 8, 2054–2066
Rewiring the dynamic interactome†
Melissa J. Davis,‡abc Chang Jin Shin,‡ab Ning Jing a and Mark A. Ragan*ab
Received 14th February 2012, Accepted 30th May 2012
DOI: 10.1039/c2mb25050k
Transcriptomics continues to provide ever-more evidence that in morphologically complex
eukaryotes, each protein-coding genetic locus can give rise to multiple transcripts that differ in
length, exon content and/or other sequence features. In humans, more than 60% of loci give rise
to multiple transcripts in this way. Motifs that mediate protein–protein interactions can be
present or absent in these transcripts. Analysis of protein interaction networks has been a
valuable development in systems biology. Interactions are typically recorded for representative
proteins or even genes, although exploratory transcriptomics has revealed great spatiotemporal
diversity in the output of genes at both the transcript and protein-isoform levels. The increasing
availability of high-resolution protein structures has made it possible to identify the
domain–domain interactions that underpin many protein interactions. To explore the impact of
transcript and isoform diversity we use full-length human cDNAs to interrogate the protein-
coding transcriptional output of genes, identifying variation in the inclusion of protein interaction
domains. We map these data to a set of high-quality protein interactions, and characterise the
variation in network connectivity likely to result. We find strong evidence for altered interaction
potential in nearly 20% of genes, suggesting that transcriptional variation can significantly rewire
the human interactome.
Introduction
Mammalian transcriptome sequencing has revealed surprising
diversity in the transcriptional output of those regions of the
genome we typically think of as genes. The ‘‘one gene–one
protein’’ hypothesis prevalent in molecular biology since the
1940s1 has been superseded by a new paradigm formulated
from the results of projects such as the Functional Annotation
of Mouse (FANTOM)2,3 and the Human Genome Informa-
tion Integration Project.4 These projects have demonstrated
great diversity not only in the transcriptional output of genetic
loci, but also in the complement of proteins produced from
those transcripts.3 The idea of a gene as a unit of information
encoding a single piece of biological functionality has given
way to the realisation that any given genetic locus within the
genome may produce a great variety of protein-coding and
non-coding transcripts. Even a single coding transcript may
be spliced in different ways to produce a variety of protein
isoforms. It is likely that in excess of 60% of what were
once thought of as human genes are alternatively spliced.5–7
This, coupled with other sources of variation such as alter-
native promoter/first exon usage, and alternative transcrip-
tional initiation and termination sites, generates great diversity
in the mammalian transcriptome and proteome. These obser-
vations have also given rise to the use of terms such as
‘‘transcriptional unit’’ to capture the idea that a genomic
region can generate a set of related transcripts sharing a core
of genetic information.8
In parallel with our evolving understanding of the complex
mammalian transcriptome, high-throughput experimental
techniques have led to the increasing use of network analysis
as a framework for interpreting results. Systems of molecular
interactions underpin all cellular processes; as such, under-
standing the ways in which molecular-interaction networks
function in cells provides vital insight into cellular processes
and enhances our understanding of both normal and pathological
cellular states. In recent years, significant advances have been
made in characterizing the molecular interaction networks of
morphologically complex eukaryotes. Currently, public and com-
mercial databases provide access to collections of interaction data
extracted manually from research publications (for example
IntAct,9 DIP,10 and MINT11), or integrated collections
harvested from primary databases (for example STRING,12
APID,13 and ConsensusPathDB14). Many such resources
are listed at http://ppi.fli-leibniz.de/jcb_ppi_databases.html.
It is possible to query these resources and find very large
numbers of interactions, either for specific organisms, or for
a The University of Queensland, Institute for Molecular Bioscience, St Lucia, Queensland 4072, Australia. E-mail: [email protected]; Fax: 61-7-3346-2101; Tel: 61-7-3346-2616
b ARC Centre of Excellence in Bioinformatics, Australia
c Queensland Facility for Advanced Bioinformatics, Australia
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c2mb25050k
‡ Joint First Authors.
individually queried proteins, thus building a picture of the
networks present in a given system of interest. However, these
networks still do not project the dynamics of the interactome:
cellular processes are by nature dynamic, and often specific to
a particular cell type, stage or physiological context.15,16 Like-
wise, many databases and meta-collections assign interactions
to the longest protein known for a particular gene,17,18 or to
the gene itself,19 and the specific protein isoform is rarely
identified. Indeed, the nature of the experiment may mean that
the precise identity of the interacting protein is not known.
Proteins are typically composed of structural and/or func-
tional modules referred to as domains, and the specific,
ordered set of these domains within a protein forms its domain
architecture. Many domains are well-studied, and computa-
tional models exist for their prediction from sequence. Analysis
of the alternative splicing of protein-coding transcripts indicates
that complete domains are alternatively spliced more frequently
than expected by chance,20 and alteration of the domain
architecture of proteins due to alternative splicing or other
post-transcriptional mechanisms is known to result in several
types of loss or gain of function. For example, recent studies
have described changed enzymatic activity,21 altered protein
stability, and changed subcellular localisation.22
Certain combinations of domains are also known to mediate
protein–protein interactions23–25 as determined either by
examination of high-resolution structures,26,27 or by methods
that examine domain co-occurrence in the interactome.28,29
Domains known to be involved in protein interactions are
more likely to be spliced than are others,30 and many alter-
natively spliced regions correspond to interaction interfaces.31
Although some smaller-scale studies have investigated the
effects of splicing on protein functions,32,33 to our knowledge
there has as yet been no systematic, transcriptome-wide study
to examine the collective impact of these changes on the
mammalian interactome. We present here the first such
systematic analysis over integrated human transcriptomic
datasets. We characterise the extent to which changes in the
domain architecture of protein isoforms are likely to alter their
interaction potential, and we illustrate our findings with
experimentally verified examples.
Results
Integrated human transcriptome data
We identified two sources of transcriptome data that provide
full-length cDNA sequences suitable for building a high-
quality picture of the human transcriptome.34 Transcript
clusters based on full-length human cDNA derived from the
H-Invitational Database (H-InvDB)4 and FANTOM3² were
integrated. A full characterisation of this dataset is available in
the Supplementary Material, and notable attributes of the
integrated dataset important for our analysis are described
briefly here.
Integration of these two datasets created a set of 26955 clusters,
covering 68748 transcripts that encode distinct protein isoforms.
Fewer than half (11 397) of the clusters in our set are composed
of transcripts from both H-InvDB and FANTOM3. A further
700 contained representatives from both sets, but not all
transcripts were identified with an Entrez Gene ID (gene
identifier). Over half of the clusters were derived from only one
of H-InvDB or FANTOM3: 6517 (1453 with gene identifiers)
contain sequences only from H-InvDB, while 8341 (7268 with gene
identifiers) contain sequences only from FANTOM3. We further
analysed the composition of data drawn from each resource by
examining the extent to which transcripts were shared between
the datasets, and the extent to which individual exons are
shared between sets (details in Supplementary Fig. 1).
We divided transcript clusters into two categories: those
containing only a single protein-coding transcript (single-isoform
units, 48%) and those containing multiple protein-coding
transcripts (multiple-isoform clusters, 52%) (Fig. 1A). The
average length of coding regions in the transcripts of single-
isoform units was significantly shorter than the average from
multiple-isoform clusters (184 amino acids (aa) compared to
498 aa, p ≪ 0.0001). Further, half of the single-isoform units
have no match to an Entrez gene identifier. Units containing
only one protein-coding transcript are not useful for this study
because they cannot illuminate variation in the protein pro-
ducts of genes. The multiple-isoform clusters contain an
average of four transcripts per unit. Only these clusters con-
tain data informative for our analysis. The high degree of
concordance we see between multiple-isoform clusters and
Entrez gene identifiers (fewer than 2% of these clusters fail
to map to a gene identifier) means that we can, for practical
purposes, consider these transcript clusters to represent genes
in the sense that the encoded isoforms share common genetic
information associated with a particular gene identifier.
Domain predictions
We generated Pfam domain predictions for all protein-coding
transcripts in our dataset; 4251 unique types of Pfam domain
were predicted. The most abundant domains were the
Cys2His2 Zinc finger (PF00096), the Protein kinase domain
(PF00069), and the WD40 repeat (PF00400) (Supplementary
Fig. 2A and B). For 72% of the single-isoform units no Pfam
domain was predicted (Fig. 1A), whereas when we examined the
longest protein-coding region from each multiple-isoform gene,
Fig. 1 Analysis of transcriptional units encoding either a single isoform
(single-isoform unit, or si Unit) or multiple protein-coding transcripts
(multiple-isoform cluster, or mi Cluster). (A) Pfam domain predictions
for clusters, showing the proportion with and without Pfam domain
predictions. (B) Protein–protein interactions (PPI) classified by the type
of gene that encodes each interacting protein (both si Units (s,s), both mi
Clusters (m,m), or one from each (m,s)). The pie chart depicts the
percentage of PPIs collected in this study according to this classification
(χ² = 1692.969, df = 2, P ≪ 0.001).
85% had at least one predicted Pfam domain. A chi-squared
test shows this domain distribution to be highly significant
(χ² = 9021.24, df = 3, p ≪ 0.0001). This contrasts with the
reported coverage of Pfam domains over human protein
sequences (~72%³⁵), and suggests that these genes are enriched
with well-studied functions.
We further investigated the low level of domain prediction
observed in the single-isoform units, examining both sequence
length and conservation against orthologous mouse sequences
using SSEARCH.36 The average length of protein-coding
regions without domain prediction in this set (369 nt) was
one-third the length of regions with domain predictions (1014 nt).
Moreover, those protein-coding sequences with no predicted
domains showed low sequence identity when compared to a
presumptive ortholog in the mouse proteome (Supplementary
Table 1), averaging only 38% identity. In contrast, protein-
coding regions with no domain in multiple-isoform genes were
substantially longer on average (498 aa) and showed higher
average sequence identity to mouse (68%).
Variation of domain architecture
Here we are specifically interested in the use of domains likely to
be involved in protein interactions, which we refer to as Protein
Interaction Domains (PIDs). We define a PID as a Pfam domain
present in one of two structurally derived domain–domain inter-
action datasets, 3DID and iPfam (Materials and Methods). Of
11890 multi-isoform genes that contain predicted Pfam domains
(Fig. 1A), we identified a subset of 8860 in which at least one
protein-coding transcript is predicted to contain at least one PID.
We compared the PID architecture of isoforms within each
of these TUXs, defining five types of variation in PID archi-
tecture: identical, conserved, subset, mutually exclusive and
totally removed (Table 1). Except for the identical types, we
did not consider non-PID domains. Genes with identical
structure show no variation in domain architecture. For all
the other categories of variation, we considered the PIDs
predicted in each protein-coding transcript as a multiset
(a set for which we consider the number, but not the order,
of items). Due to the presence of more than two isoforms in
many clusters, individual genes may show more than one kind
of variation.
The conserved type indicates that no variation exists
between a pair of isoforms which have the same multiset of
PIDs (although other isoforms from the gene may vary). The
subset type of variation is seen where the set of PIDs in one
isoform is a subset of the PIDs in another isoform. Mutually
exclusive variation is found where two isoforms each have a
PID not found in the other. Duplicated domains may appear
as mutually exclusive if a novel domain is present in the
isoform with a lower domain copy number (e.g. for isoforms
P and P′, where P = [A1, A2, C1] and P′ = [A1, B1, C1], A2 and
B1 are mutually exclusive). The last category, totally removed,
indicates that at least one protein-coding transcript has no
PIDs. A gene that does not have identical domain archi-
tectures (5612 in our dataset, see Table 1) may contain any
combination of kinds of variation, depending on the number
of isoforms and the diversity observed in the use of PIDs for
products of that gene. In this schema, repeats of a specific
domain are considered to be variation.
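The pairwise classification described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `classify_pair` and `gene_variation` are hypothetical, PIDs are represented as plain lists of domain labels, and the precedence given to "totally removed" over "subset" (and the treatment of two PID-free isoforms as "conserved") is an assumption made here for concreteness.

```python
from collections import Counter
from itertools import combinations


def classify_pair(pids_a, pids_b):
    """Classify PID variation between two isoforms of one gene.

    Each argument is a list of domain labels. Order is ignored but copy
    number is kept, so each list is treated as a multiset.
    """
    a, b = Counter(pids_a), Counter(pids_b)
    if a == b:
        # Same multiset of PIDs (assumption: this includes two empty isoforms).
        return "conserved"
    if not a or not b:
        # Assumption: an isoform with no PIDs is reported as "totally removed"
        # rather than as a trivial subset of the other isoform.
        return "totally removed"
    only_a = a - b  # PIDs (with multiplicity) found only in isoform A
    only_b = b - a  # PIDs (with multiplicity) found only in isoform B
    if only_a and only_b:
        return "mutually exclusive"
    return "subset"


def gene_variation(isoforms):
    """Return the set of variation types seen across all isoform pairs."""
    return {classify_pair(x, y) for x, y in combinations(isoforms, 2)}
```

With the example from the text, `classify_pair(["A", "A", "C"], ["A", "B", "C"])` is classed as mutually exclusive, because the second copy of A and the single copy of B are each found in only one isoform. Because `gene_variation` collects results over all pairs, a gene with more than two isoforms can report several kinds of variation at once.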
We used this classification to determine how many genes
show the different kinds of variable domain usage (Table 1,
derived from data presented in Supplementary Data File 1).
Notably, we found that of 8860 genes, 40% do not demon-
strate any variation in the use of PIDs. The proteins produced
from these transcripts are likely to have the same domain-
mediated interaction potential, although their interactions
may be modulated in other ways, for example through post-
translational variation.
As seen in Table 1, 40% of the clusters also produce at least
one isoform that has completely lost all of the known PIDs
found in other isoforms produced from that gene (totally
removed). Unless they contain currently unknown PIDs, these
isoforms are unlikely to interact with other proteins, lacking as
they do the structural elements required to underpin potential
interactions. The next-most-common kind of variation is the
loss of some, but not all, PIDs from an isoform (subset); 31% of our mi-TUXs contain such isoforms. These proteins may
share some, but not all, interaction partners with isoforms
containing more PIDs. The least common category of varia-
tion, mutually exclusive, was observed for relatively few genes
(7%). Finally, 48% of TUXs have at least two protein iso-
forms for which there is no variation in the PID architecture,
although other isoforms from that TUX may vary (conserved).
Table 1 Patterns of variation in protein interaction domains for genes with variable transcriptional output affecting the coding regions of transcripts
Total number of TUXs: 8860
Domain type                        Category of variation      Number of TUXs with pattern
All domains(a)                     Identical architectures    3248
Protein interaction domains(b)     Conserved                  4282
Protein interaction domains(b)     Subset                     2669
Protein interaction domains(b)     Mutually exclusive         631
Protein interaction domains(b)     Totally removed            3562
(Protein interaction domain categories are counted across the remaining 5612 genes with variable architectures. The graphical representation of each pattern is not reproduced here.)
a These identical domain architectures include PID and non-PID domains. All coding transcripts produced from these genes appear to generate
proteins with identical domain architectures. b Clusters can contain more than one kind of variation (Conserved (C), Subset (S), Mutually
exclusive (M), Totally removed (R)): 1413 genes have only one kind of variation (C = 320, S = 344, M = 14, R = 735), 3081 show two kinds
of variation (most common: CR = 1789, CS = 1043, other combinations total 249) and 1118 show three or more categories of variation
(most common CRS = 607).
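The counts in Table 1 and its footnotes can be cross-checked with simple arithmetic. The sketch below uses only the numbers given in the table; the variable names are illustrative, and the percentages are computed relative to all 8860 genes, which is an assumption about the denominator used in the text.

```python
total_genes = 8860
identical = 3248

# Genes whose isoforms do not all share an identical domain architecture.
variable = total_genes - identical

# Footnote b: genes grouped by how many kinds of variation they show.
one_kind = {"C": 320, "S": 344, "M": 14, "R": 735}
two_kinds = {"CR": 1789, "CS": 1043, "other": 249}
three_or_more = 1118

n_one = sum(one_kind.values())
n_two = sum(two_kinds.values())
footnote_total = n_one + n_two + three_or_more

# Per-pattern counts from the table body; a gene may appear in several rows.
pattern_counts = {
    "conserved": 4282,
    "subset": 2669,
    "mutually exclusive": 631,
    "totally removed": 3562,
}
percent_of_all = {
    name: round(100 * count / total_genes, 1)
    for name, count in pattern_counts.items()
}

for name, pct in percent_of_all.items():
    print(f"{name}: {pct}% of {total_genes} genes")
```

Because a gene can show more than one kind of variation, the per-pattern counts are not expected to sum to the number of variable genes, whereas the footnote groupings are.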
We also examined a set of 4196 human signalling genes
derived from the Gene Ontology database. Using Entrez Gene
Identifiers, we mapped 4151 of these genes to clusters in our
dataset, and determined that signalling genes were strongly
enriched for genes with variation in protein interaction
domains, and that reciprocally, variable genes are strongly
enriched for signalling function (χ² = 428.6, df = 3, p ≪ 0.0001). The full set of 8860 genes, with variation classified and
corresponding domain predictions is presented as Supplementary
Data 1.
Isoforms with altered interaction potential in a set of known
protein interactions
In order to determine the likely impact of these variations on
known protein–protein interactions (PPIs), we mapped our
dataset onto a collection of high-quality, experimentally
determined PPIs (see Materials and Methods). This process
mapped 28 309 interactions involving 8522 unique protein
isoforms. Genes that produce more than one distinct protein
encode the majority of the proteins involved in these PPIs.
Only 14% of proteins involved in our PPI dataset belong to
single-isoform units. As expected given these numbers, most
protein interactions (77%) occur between a pair of proteins
that are each encoded by a multiple-isoform gene. A substan-
tial proportion (20%) of interactions occur between a protein
from a multiple-isoform gene and another protein from a
single-isoform gene, whereas few protein interactions occur
between proteins encoded only by single-isoform genes (3%)
(Fig. 1B).
From this set, we identified 7805 PPIs in which the inter-
acting proteins together contain at least one pair of domains
known to interact (Supplementary Table 2). These interacting
protein pairs contain 1649 DDIs, the vast majority of which
(73%) are found in only five or fewer PPIs. Only 2% of DDIs
occur in more than 50 of our mapped protein interactions
(the full table of DDIs is provided in Supplementary Data 2).
These 41 domain pairs include many signalling-related
domains (involving common domains such as SH2, SH3,
and kinase-related domains) that are abundant in the human
proteome (see Supplementary Fig. 2). The pattern for DDI
frequency within the PPI set is similar: 63% of the DDIs occur
five or fewer times in our PPI set, while 4% occur over 200 times.
The 69 most-frequently occurring DDIs include interactions
between known protein interaction domains (such as Immuno-
globulin I-set, Fibronectin type III and Spectrin domains).
Many PPIs contain more than one DDI, but half of the 7805 PPIs
contain only one identified DDI and the great majority (86%)
contain five or fewer.
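The mapping of domain–domain interactions (DDIs) onto protein–protein interactions described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the pipeline used in the study: the function names, the representation of a DDI as an unordered pair (`frozenset`), and the input structures are all hypothetical.

```python
from collections import Counter
from itertools import product


def ddis_in_ppi(domains_a, domains_b, known_ddis):
    """Return the known DDIs that could mediate one PPI.

    known_ddis is a set of frozensets, each holding the one or two domain
    labels of an unordered (possibly homotypic) domain pair.
    """
    found = set()
    for da, db in product(set(domains_a), set(domains_b)):
        pair = frozenset((da, db))
        if pair in known_ddis:
            found.add(pair)
    return found


def ddi_usage(ppis, protein_domains, known_ddis):
    """Map each DDI-supported PPI to its DDIs and count PPIs per DDI."""
    per_ppi = {}
    usage = Counter()
    for a, b in ppis:
        found = ddis_in_ppi(
            protein_domains.get(a, []), protein_domains.get(b, []), known_ddis
        )
        if found:
            per_ppi[(a, b)] = found
            usage.update(found)  # each DDI is counted once per PPI
    return per_ppi, usage
```

From `per_ppi` one can tabulate how many DDIs each PPI contains, and from `usage` how many PPIs each DDI occurs in, which are the two distributions summarised in the text.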
Our results show that variable use of PIDs occurs in 60%
of transcript clusters for which we have informative data
(Table 1 – all clusters not identical are variable, with the
exception of 320 transcript clusters with only Conserved PID
multisets). We used the loss or gain of PIDs (as described in
the previous section) to infer the interactive potential of the
protein products of 3814 genes that map to the set of 7805
reference PPIs. With respect to variation in PID architecture,
we observe three classes of genes: (i) those with only one
apparent protein product, which are therefore not capable of
giving rise to functionally differentiated products (614); (ii) those
which generate multiple protein products with identical PID
architectures (1413); and (iii) those which generate multiple
protein products with variable PID architectures (1787, Table 2).
Isoforms of the first two classes (53%) have a static interaction
domain architecture that is unlikely to affect connectivity.
However, nearly half of the genes that map to our reference
PPI network have isoforms with variation in the PID archi-
tecture that may modulate network connectivity.
We examined the effect of this variation on the connectivity
of proteins from these 1787 genes (Table 2). By far the most
common form of variation we found was the loss of all PIDs
from an isoform, implying that 1287 genes produce at least
one protein isoform incapable of participating in known
protein–protein interactions. In the second scenario, the number
of occurrences of a given domain varies. Only a small number of
genes is affected by such variation (215) and it is generally
unclear what effect such variation might have on connectivity.
An example of this type of variation involves the domain repeat
WD40 (PF00400), where 4–16 WD40 repeats are known to form
a rigid scaffold for protein interactions. The outcome of an
alteration in the number of repeats is not obvious and it is not
clear if truncation of domain repeats affects a protein’s binding
potential. Finally, 644 genes showed a pattern of PID variation
in which loss of some domains changed interaction potential.
Below we explore examples of this kind of variation identified
through searches of the primary literature.
Table 2 Three possible scenarios in isoform interactions
Scenario                                                                                     # of genes
Loss of whole DDI domain(s), no interaction                                                  1287
Variable members of DDI domains (repeats only), uncertain impact on interaction mechanism    215
Variable use of DDI domain(s), modulation of connectivity                                    644
(Diagrams not reproduced here.)
P1 and P′1 are isoforms, and P2 and P3 represent partners known to interact with P1. Rectangles and circles in each panel indicate the presence of
interaction domains. The solid line between proteins represents a known interaction and a dotted line indicates a predicted interaction based on DDI
existence in PPI. Genes may fall into more than one of the above categories.
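The three scenarios in Table 2 can be expressed as a simple decision procedure. The following is a rough sketch under several assumptions and is not the authors' method: an isoform is taken to retain a partner if it carries at least one domain with a known DDI to a domain in that partner, the function names are hypothetical, and the ordering of the checks is a design choice made here.

```python
from collections import Counter


def supported_partners(pids, partners, known_ddis):
    """Return the partners for which at least one known DDI is available."""
    kept = set()
    for name, partner_pids in partners.items():
        if any(
            frozenset((d, p)) in known_ddis
            for d in set(pids)
            for p in set(partner_pids)
        ):
            kept.add(name)
    return kept


def scenario(ref_pids, iso_pids, partners, known_ddis):
    """Compare an alternative isoform with a reference isoform."""
    ref_partners = supported_partners(ref_pids, partners, known_ddis)
    iso_partners = supported_partners(iso_pids, partners, known_ddis)
    if not ref_partners:
        return "no DDI-supported interaction in reference"
    if not iso_partners:
        # All interaction-mediating domains are missing from the isoform.
        return "no interaction"
    same_types = set(ref_pids) == set(iso_pids)
    if same_types and Counter(ref_pids) != Counter(iso_pids):
        # Same domain types, different copy numbers (repeats only).
        return "uncertain (repeat number differs)"
    if iso_partners != ref_partners:
        return "modulated connectivity"
    return "unchanged"
```

For a reference isoform with domains A and B, where A binds partner P2 and B binds partner P3, an isoform retaining only A keeps P2 but loses P3, which corresponds to the third column of Table 2.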
To characterise the variable genes (Table 2) more fully, we
performed functional enrichment clustering using the DAVID
Bioinformatics Database (see Materials and Methods). Clusters
with enrichment scores >10 are presented in full in Supplementary
Data 3. The most-significantly enriched cluster was, unsurprisingly,
related to protein binding and protein dimerisation (enrichment
score = 30.02, p value range: 1.5 × 10⁻³⁸–2.4 × 10⁻¹⁷).
The second most-enriched cluster was related to protein kinase
activity and phosphorylation (enrichment score = 27.46,
p value range: 5.3 × 10⁻⁴²–1.1 × 10⁻¹⁸), while the third cluster
was related to the regulation and induction of apoptosis and
programmed cell death (enrichment score = 24.34, p value
range: 5.6 × 10⁻³⁸–5.8 × 10⁻¹²). Other clusters with strong
enrichment scores and significant p values include (in decreas-
ing order of enrichment) associations with (i) cell junctions,
focal adhesion and the basolateral plasma membrane,
(ii) cell–cell adhesion, (iii) cell migration and motility,
(iv) regulation of kinase activity and MAPK signalling, and
(v) response to hormone and insulin stimulus. Reactome
pathways related to signalling, apoptosis, and cell surface
interactions were also enriched in this set (Supplementary
Data 3). These results are consistent with the strong enrich-
ment for signalling proteins in the whole variable dataset, and
also indicate that this subset at least contains a strong repre-
sentation of functions associated with cell adhesion and
motility, and apoptosis.
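The enrichment analysis above relies on DAVID, whose scoring is not reproduced here. As a generic illustration of what an over-representation p value measures, the sketch below computes a hypergeometric upper-tail probability. The function name and the example numbers in the usage note are hypothetical and are not taken from the paper.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes.

    N is the size of the background gene set and K the number of background
    genes carrying the annotation. Genes are drawn without replacement.
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total
```

For example, `hypergeom_upper_tail(k=40, n=200, K=500, N=20000)` would give the chance of finding 40 or more annotated genes in a list of 200 when only 500 of 20000 background genes carry the annotation. A small value indicates that the annotation is over-represented in the list relative to the background.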
Validation and case studies
Literature searches for papers describing alternative splicing
and protein–protein interactions prioritised the review of fifty
articles, thirty of which met the selection criteria, that is,
they reported experimentally verified protein interactions with
isoform-level specificity. Of these papers, 17 described scenarios
in which alternative splicing changed protein interactions, or
had other significant effects likely to impact interactions, such
as changing the subcellular location of the protein. The effect
of alternative splicing on the 22 genes identified in these papers
is presented in Supplementary Table 3. We select two case
studies from this review (ADAM15 and NRP1), along with a
well-known signalling pathway (JAK-STAT pathway), to
illustrate the impact of changes in the PID architecture on
protein interactions. We also briefly highlight other examples in
which alternative splicing causes the disruption of protein–protein
interactions through other mechanisms (BRCA-1) or disrupts
protein-RNA interactions (RBM9).
Transcript variations alter protein interactions
ADAM15. The protein disintegrin and metalloproteinase
domain-containing protein 15 (ADAM15: ADA15_HUMAN)
is a single-pass type I membrane protein encoded by the
ADAM15 gene. The protein appears to localise largely to a
perinuclear compartment, where it may associate with the
trans-Golgi network or the late endosome.37 It has also been
shown to localise to adherens junctions in epithelial cells.38
Following standard Type I membrane protein topology, the
C-terminus of the protein remains in the cytoplasm.39 The
cytoplasmic part of ADAM15 isoforms is encoded by exons
18–23, and most ADAM15 protein isoforms contain proline-rich
SH3 ligand domains in this C-terminal region.40
Domain analysis of the N-terminal (lumenal or extracellular)
region of ADAM15 isoforms indicates that most conserve the
N-terminal Disintegrin (PF00200) and ADAM, cysteine-rich
(PF08516) domains, and the peptidase functionality (PF01421
and PF01562) expected of a peptidase involved in ectodomain
shedding.41,42
Alternative splicing causes the cytoplasmic SH3 ligand
domains encoded by exons 18–23 to be assembled in different
combinations across six different isoforms: five contain
domains in various combinations, whereas one isoform lacks
any cytoplasmic SH3-binding domains.40 Two of the SH3
ligand domain subtypes bind most SH3 domains in known
binding partners Src, Tks5 and Lyn amongst others, while an
alternative domain subtype predominantly binds SNX9 and
SNX33. Notably, the isoform lacking SH3 ligand domains in
the C-terminal region fails to bind any SH3-domain-containing
proteins.40 This clearly demonstrates that alternative splicing of
protein interaction domains, in this case SH3 ligand domains,
alters the ability of isoforms to bind SH3-containing signalling
proteins, while the loss of all SH3 ligand domains completely
removes the ability of this protein to transduce an intracellular signal
via SH3 signalling. This suggests a divergence in the functional
roles of the ADAM15 isoforms, allowing extracellular signalling
to stimulate alternative pathways of intracellular signalling, or
to completely decouple the external function of the molecule
(integrin binding and proteolysis) from signal transduction.
NRP1. Neuropilins are transmembrane glycoproteins with
large extracellular regions containing two CUB domains
(PF00431), two coagulation factor V/VIII homology domains,
and a MAM domain (PF00629).43,44 In addition they contain
a transmembrane domain and a short intracellular domain
with no clearly defined signalling function.45,46 The protein
neuropilin-1 (NRP1: NRP1_HUMAN) is encoded by the
NRP1 gene. To date, NRP1 is thought to function in
endothelial cells by enhancing vascular endothelial growth
factor (VEGF) binding to vascular endothelial growth factor
receptor 2 (KDR) and downstream signalling events, through
binding of NRP1 to VEGF and KDR.47 Isoform VEGF-165
promotes the formation of a complex of NRP1 and KDR in
endothelial cells, which is thought to be important for optimal
VEGF signalling and function.47 Moreover, the cytosolic
MAM domain of NRP1 is required for interaction of NRP1
and KDR.48
Based on our computational analysis as well as previous
experimental results, neuropilin-1 (NRP1) interacts with KDR
through the DDI between the MAM and immunoglobulin-like
C2-type domains (PF05790). Of the two NRP1 isoforms,
isoform 1 is a membrane-bound receptor involved in the
development of the cardiovascular system, angiogenesis and organogenesis
outside the nervous system, whereas isoform 2 is a
soluble protein.49 The interaction of isoform 1 with KDR
leads to increased VEGF-165 binding to KDR as well as
increased chemotaxis. Alternative splicing removes the signal
peptide and transmembrane domains to generate a soluble
protein. The MAM domain is also excluded from isoform 2 by
alternative splicing, thus disrupting the DDI between MAM
and ig-like C2-type domains. Consequently, the PPI between
NRP1 and KDR is lost.50
Transcript variation affects the JAK-STAT signalling pathway.
JAK-STAT signalling represents a classical signal transduc-
tion pathway, and is in many respects well-studied and well-
understood. We mapped our dataset of genes with variable
protein interaction domain (PID) architectures onto the 155 genes
recognised by KEGG as components of the JAK-STAT
signalling pathway: first we retrieved Uniprot IDs for the
KEGG pathway components, yielding 152 matches, and then
we retrieved the genes encoding these proteins from our
dataset. Of the genes encoding pathway elements, 72% con-
tained one or more isoforms, and 43 produced alternative
isoforms that did not include these domains. Fig. 2 shows the
KEGG JAK-STAT pathway. We have outlined in red any
node associated with genes for which we found variable PID
architecture in the protein products.
Several well-known genes in the pathway highlight the
potential impact of these variations on signal transduction:
STAT1. Two isoforms of STAT1, α and β, are known.
These isoforms are identical except that isoform β lacks 38
C-terminal residues present in the α form. This region encodes
the STAT1 TAZ2 binding domain (PF12162), which selectively
binds the TAZ2 domain of the CREB-binding protein
CREBBP (p300), an acetyltransferase involved in transcriptional
regulation.51 STAT1β is unable to bind CREBBP.52
Loss of binding to CREBBP results in the failure of STAT1β
to activate transcription on chromatin templates, as normal
STAT1 transcriptional activation involves the CREBBP-mediated
acetylation of histones at the site of transcription.52
AKT1. There are three canonical transcripts for AKT1
recorded in Entrez Gene DB (GeneID 207), all of which
encode the same protein sequence. Our analysis identified an
additional four transcripts that encode additional putative
isoforms of the protein (Supplementary Data 1). Three distinct
PID architectures are seen for AKT1 products: the canonical
protein contains a pleckstrin homology (PH) domain (PF00169),
a protein kinase catalytic domain (PF00069) and a protein
kinase C-terminal domain (PF00433). Two isoforms retain the
kinase domains, but do not include the PH domain, raising the
possibility that there exist isoforms of AKT1 that retain kinase
function, but lose the ability to interact with other signalling
molecules that is usually mediated through the PH domain.
PTPN11. The non-receptor protein tyrosine phosphatase
PTPN11 (SHP-2) contains two SH2 domains in the N-terminal
region of the protein. It has been demonstrated that the
N-terminal SH2 domain binds Jak2, and is responsible for
the recruitment of Jak2 to the angiotensin II type AT1
receptor, whereas PTPN11 lacking this region is unable to
bind Jak2, and thus cannot recruit it to the receptor.53
Although this study was conducted using a cell line in which
a point mutation in PTPN11 results in aberrant splicing that
removes exon 3 and results in the deletion of the N-terminal
SH2 domain,54 we hypothesise that naturally occurring alter-
natively spliced isoforms that lack the SH2 domain would
demonstrate a similar uncoupling of downstream JAK-STAT
signalling from angiotensin II stimulation.
Fig. 2 The JAK-STAT signalling pathway (hsa04630) from KEGG. Nodes associated with genes that have variable PID architecture are
highlighted in red.
GRB2. The gene GRB2, which encodes the adaptor protein growth
factor receptor-bound protein 2, generates a canonical protein
product that contains an SH2 domain flanked by two SH3
domains. This protein acts to couple receptor signalling (by
binding to receptor tyrosine kinases via the SH2 domain) to
downstream signal transduction (binding signalling molecules
through the SH3 domains).55 Alternative splicing of GRB2
generates an isoform (known as GRB3-3) lacking a functional
SH2 domain, but retaining the flanking SH3 domains. The
GRB3-3 isoform of GRB2 has been noted to have a dominant
negative effect over canonical GRB2, causing apoptosis in an
over-expression system.56 It is generally expressed at very low
levels in adult tissue, and is often not detected. Experiments in
rat, however, detected the presence of this isoform spiking in
the rat hippocampus at the same time as a wave of pro-
grammed cell death responsible for neuronal pruning in that
region.56 GRB3-3 has also been observed to be selectively
up-regulated in HIV infected T-cells, where it appears to
promote an environment conducive to HIV replication
through a pathway unrelated to the normal signalling pathway
of the canonical GRB2 protein.57 Identification of a GRB3-3
specific binding partner, adenosine deaminase (ADA),58 which
is known to play a role in immunodeficiency, hints at a
potential mechanism for this effect, as well as pointing to the
possibility of GRB3-3-specific downstream signalling unrelated
to normal GRB2 signalling.
The strong representation of signalling proteins and
domains in our results, as illustrated by the JAK-STAT results
(1.5 times as many variable genes as expected, with a hyper-
geometric probability of 0.00073) suggests that the production
of proteins with tunable interaction potential generates
significant potential for plasticity in signalling networks, and
presents an important mechanism through which these key
information transduction networks can be rapidly modulated,
without disrupting the genomic encoding of core components
of the network, which remain expressed in canonical isoforms.
Alternative splicing changes the location of BRCA1, altering
its interaction profile. BRCA1 is a nuclear protein with a
molecular mass of 220 kDa. Defects in BRCA1 function have
been implicated as a cause of susceptibility to breast cancer,
breast-ovarian cancer familial type 1, and ovarian cancer.59,60
Experiments to identify the subcellular location of the BRCA1
isoform BRCA1Δ672-4095 have demonstrated that, in contrast
to the full-length BRCA1, which is found primarily in the
nucleus, BRCA1Δ672-4095 is found in the cytoplasm.61
BRCA1Δ672-4095 is generated by in-frame splicing that excludes
exon 11, and produces a 97 kDa protein lacking a
functional nuclear localisation signal. Full-length BRCA1 protein
carries out DNA repair by binding to FANCD1-BRCA2
and RAD51 in a nuclear complex.62 Isoforms missing exon 11,
however, are located only in the cytoplasm, and are therefore not
available to interact with BRCA1’s nuclear-localised binding
partners. This example illustrates how splicing out a localisation
signal alters the location of the protein within
the cell, disrupting protein interactions that normally occur with
correctly localised binding partners and rendering the isoform
unavailable for its normal function.
Alternative splicing disrupts the protein–RNA interactions of
RBM9. RBM9 (Fox-2) is a member of the Fox protein family,
members of which control the alternative splicing of many
transcripts in neurons, muscle, and other tissues. RBM9
produces proteins with a single RNA-binding domain
(RRM) flanked by N- and C-terminal domains that are highly
diversified through the utilization of alternative promoters and
alternative splicing patterns.63–65 All the isoforms have calcitonin
gene-related peptide regulator C terminal domains (PF12414)
and RNA recognition motif 1 domains (RRM_1) (PF00076),
though the RRM_1 domain of isoform 3 is 36 aa shorter than
the canonical one (67 aa). Fox-induced splicing creates a Fox
isoform that lacks a proper RRM_1 domain, and thus will
not mediate splicing enhancement or repression through the
UGCAUG element. Instead, the intact N- and C-terminal
domains of this isoform can counteract the effect of full-length
Fox proteins in enhancing a Fox-dependent exon. As a
consequence, rather than the auto-regulated splicing reducing
the overall level of the protein, the new isoform directly
antagonizes Fox activity.66
Discussion
Quality of the datasets
The most recent analysis comparable to the work we present
here was published in 2004, used assembled EST data, and
covered ~4500 genes.30 By contrast, we considered ~26 000
transcript clusters, and found informative data on 13 704 genes.
Our analysis is based on full-length cDNAs, avoiding the
problems of 3′ and 5′ bias seen in EST-based studies.67 We
also filtered domain–domain interactions to ensure only highly
confident DDIs with evidence from eukaryotic structures were
included in our analysis. Because of the stringent criteria we
applied in filtering the DDI data, some, perhaps many, inter-
actions may have been omitted from our dataset. For example,
the disintegrin domain occurs in the 3DID dataset, but because
DDIs including it were not supported by evidence from
eukaryotic PDB structures, DDIs with this domain were not
included in our dataset.
Based on our analysis of the integrated transcript set, we
were initially surprised at the low rate of variation compared
with recent estimates of transcript variability, which range from
68% to 92% of genes.68,69 However, higher estimates include
several categories of variability in the untranslated regions of
transcripts,68 only 20% of which is estimated to impact open
reading frames.70 Thus the majority of these events will not be
captured in our variable dataset. Additionally, these studies
consider transcripts mapped to genes, and not all of our
transcripts map to gene identifiers. If we remove clusters not
mapped to known genes, our rate of variability increases to
67%, well within range of some recent estimates. We suspect
that low-quality or dubious transcripts are concentrated in the
single-isoform units rather than in the multiple-isoform genes.
Of those clusters that fail to map to Entrez Gene identifiers,
97% are units producing only a single coding transcript. This,
along with low rates of domain prediction and the very short
average length of sequences that fail to generate domains (369 nt),
suggests that at least the uncharacterised transcriptional units in
the single-isoform set may (i) be of dubious quality and provenance,
(ii) be newly discovered genes, (iii) represent
transcripts of unknown function that do not necessarily encode
proteins, or, more likely, (iv) simply represent short proteins, which
are typically not well studied or annotated.71 We also noted that
4% of Entrez Gene IDs in our data set map to more than one
transcript cluster. In most of the above cases, one cluster has
multiple isoforms and the other is a single-isoform unit,
suggesting that transcripts assigned to some genes via gene
identifiers do not completely cluster using our criteria.
We built our proteome dataset by examining the proteins
encoded by ORFs in our transcript dataset. If two transcripts
encoded the same protein sequence, the transcripts were
included in the cluster, but only one representative protein
sequence was included in the proteome set attached to that
gene. Single-nucleotide polymorphisms and minor sequence
variants that produce changes in the amino acid sequence are
included as distinct proteins in our dataset. In most cases, such
minor variation will not affect the domain composition or
function of the protein; however it is possible that variation at
critical points may produce a functionally distinct protein, as
seen, for example, in the human growth hormone receptor
GHR, in which many single-nucleotide polymorphisms have
been associated with diseases (NCBI OMIM database MIM
ID *600946). Inclusion of such sequences in our dataset
represents natural variations on the canonical human genome
sequence.
The DDI data used in this study were derived from high-
resolution structures deposited in the Protein Data Bank.72
We use the domain interaction datasets made available by
3DID and Pfam, without further curation. However, not all
the DDI pairs in the dataset may have a causative role in
protein interactions, but instead may owe their appearance in
the DDI data to protein interactions caused by other features.
Additionally, many domains known to have a role in signal
transduction (for example the SH2, SH3 and kinase domains
common in our dataset) may participate in transient, rather
than stable, protein interactions and may interact with peptide
regions rather than with structured domains. Although the
interaction target of such domains may not always be another
domain, their loss will nevertheless alter the ability of protein
isoforms to participate in transient signalling interactions, and
for this reason we consider the variable use of these domains to
be informative with respect to variation in the interactome.
Emerging theme of isoforms having opposing function
The computational results and examples presented here high-
light an interesting theme: protein isoforms frequently appear
to have opposing function. This trend has been noted for
enzymes, in which splicing events produce truncated versions
of the enzyme, lacking the active centre or deleting catalytic
domains, thus generating isoforms with a dominant negative
effect.73,74 A review of the functional impact of alternative
splicing75 describes alternatively spliced isoforms that display
a dominant negative effect with respect to interactions between
proteins and small ligands and nucleic acids, as well as in the
function of enzymes.
This trend is also seen in receptors. For example, in
membrane-bound receptors, alternative splicing of trans-
membrane domains can generate soluble isoforms which often
act as decoy receptors to decouple a signal from its trans-
membrane transduction. A classic example is the difference in
function of membrane-bound and soluble variants of the surface
antigen FAS (TNR6_HUMAN), for which the isoform integral
to the membrane promotes apoptosis, whereas an alternatively
spliced soluble isoform inhibits it.76 In another example, loss
of the ligand-binding domain (LBD) in androgen receptor
(AR) isoforms leads to constitutively active AR signalling,
uncoupled from ligand-binding activation (that is, signalling is
ON in the absence of ligand, whereas AR with a functional
LBD has signalling OFF in absence of ligand).77 The presence
of this constitutively active AR splice variant is known to be
a significant factor in development of hormone-refractory
prostate cancer.78
Unfortunately, most functional annotation (such as with
terms from the Gene Ontology) is aggregated at the gene level,
and very little transcript- or isoform-specific annotation is
available external to the literature or in machine-readable
format. Increasing appreciation for the importance of anno-
tating biological entities with greater resolution (i.e. at the
isoform-specific level, rather than at the gene or representative
protein level) will enable broader analysis of this phenomenon.
Variable use of other features also impacts interaction potential
of proteins
Other mechanisms through which alternative splicing or other
kinds of transcript variation may induce alterations in the
interactome are hinted at in the literature. Specifically, we
observe that variation in sequence features that encode locali-
sation signals (such as the NLS in BRCA1) can change the
location of the protein, resulting in the disruption of inter-
actions with otherwise co-located partners. A previous study
characterised extensive alteration in the use of other sequence
features (specifically signal peptides and transmembrane
domains) that determine localisation of mouse proteins.22
Given our understanding of the importance of co-localisation
for protein interactions,79 it is likely that many of these events,
which frequently result in dramatically different protein locali-
sation or topology with respect to the membrane, will have
consequences for the participation of those isoforms in molecular
interactions. There is also evidence that small sequence varia-
tions that do not disrupt domains may, nonetheless, have
potential to disrupt post-translational modifications. As many
binding proteins recognise specific post-translational modifi-
cations in target proteins (for example, PTPN11 binds Jak2 at
the phosphorylated Y201 residue,53 loss of which prevents
binding), such small variations may still alter interactions
between proteins.
Implications for the interactome and systems biology
There is evidence that the frequency of alternative splicing is
inversely proportional to gene or paralog copy number.80,81
This suggests that alternative splicing provides an alternative
to gene duplication and divergence as a source of protein
diversity. We also see a strong enrichment for signalling
pathways and function in our set of genes with variable PID
architecture, and a strong enrichment for variability in signal-
ling proteins. Together, these results suggest that the variation
at the level of the transcriptome we describe here presents a
more-rapid, context-specific mechanism to modulate the con-
nectivity of signalling systems than changes to the genome
itself. Other studies have also demonstrated that the inter-
actions of signalling networks evolve at a faster rate than
many other kinds of interactions82 highlighting the importance
of plasticity and robustness in these systems. We see examples of
isoforms which retain selective binding ability while decoupling
physical interactions with other proteins or complexes, or
alternate between signalling pathways (see the ADAM15 and
GRB2 examples above).
It is now known that the production of variable transcripts
can be regulated in a tissue- or developmental stage-specific
manner.83–86 For this reason, we hypothesise that modulation
of interaction networks by different protein isoforms will also
demonstrate tissue- or stage-specificity. Evidence in the literature
supports this hypothesis. For example, the FLT1 (VEGF
receptor-1) gene expresses a soluble, kinase-deficient isoform
in the murine cornea.87 This isoform functions as a ligand trap,
soaking up the growth factor ligands of FLT1 and preventing
the development of vasculature in the cornea. This demon-
strates how the dominant negative effect of isoforms generates
a functional result in a tissue-specific manner.
Since isoforms frequently exhibit very different interaction
capabilities, it is critical to identify specific isoforms in experi-
ments, and in the capture of literature-based metadata for PPI
databases. A shallow (one protein deep) and flattened (ignoring
the specificity introduced by various cell types representing
different tissues or developmental stages) interactome completely
fails to capture the complexity present in the full set of molecular
interactions supported by a proteome of >68 000 protein
isoforms, many with the capability for distinct and indepen-
dent function. The results we present here demonstrate that
protein interactions need to be analysed at the level of protein
isoforms, not of representative proteins or, worse, genes.
Increasingly, advanced-generation sequencing datasets will
capture organism-specific, tissue-specific, or even cell-type-
specific transcript sequences. Consideration of the impact of
transcript diversity will enable context-specific networks to be
derived from these data, as opposed to mapping data to a
shallowly annotated reference interactome.
Material and Methods
Transcriptome data
Two transcriptional datasets were integrated: transcrip-
tional units (TUs) from FANTOM3 (2004-10-17)2,88 and
H-Invitational clusters (HIXs) from H-Invitational DB
(H-InvDB) (2009-03-30).4,89 In our study, only HIXs and
TUs which are constructed based on the full-length cDNAs
and encode proteins were used. We identified transcriptional
units in our merged dataset using an internal transcriptional
unit index (TUX). Clusters were built using the following
shared attributes: genomic loci, mRNAs and assigned Entrez
GeneID (or gene identifier) annotated for each TU and HIX.
Then interacting proteins in the collected PPI set were mapped
to the corresponding cluster and isoform based on either an
exact sequence match to an isoform within a cluster or the
assigned gene identifier for the protein (Supplementary Table 4).
In the FANTOM dataset, if a TU is composed of more than
one transcription framework (a region containing commonly
expressed regions and sharing common transcriptional
features, described in ref. 2) the TU was split into separate
clusters based on the annotated transcriptional frameworks.2,90
Clusters from either set composed of chimeric genes were
discarded. To obtain a non-redundant protein set from each
HIX or TU, we used the isoform protein set (IPS)2 from
FANTOM3 and collected proteins annotated as an isoform
verified by full length cDNA in H-InvDB. Thus, if a HIX
or TU contained no protein-coding transcript, it was not
included in the merge process (depicted in Supplementary
Fig. 3) outlined in the steps below:
STEP (1) TUs in FANTOM3 were built based on the
human genome 17 (hg17) while HIXs in H-InvDB were
constructed based on the human genome 18 (hg18). For
consistency of locus annotation between datasets, we mapped
TUs from hg17 to hg18 by using ‘liftover’ in the UCSC
genome browser91 with the following options: minimum ratio
of bases that must remap (1.0), minimum chain size in target (0),
minimum hit size in query (0), allow multiple output regions
(NO) and min ratio of alignment blocks/exons that must
map (1). After the mapping process, we retained 21480 TUs
from FANTOM3 and 18513 HIXs from H-InvDB for inte-
gration (Supplementary Table 5).
STEP (2) We grouped HIXs and TUs according to their
orientation on the human genome: 20119 and 19874 (HIXs
and TUs in total) for the FORWARD and REVERSE strand
sets, respectively.
STEP (3) HIXs and TUs were merged when the genomic
locations of their transcripts overlapped in the same strand.
HIXs and TUs within a group were regrouped based on
mRNAs produced from each cluster member to distinguish
overlapped genes such as nested genes.92,93 For this regrouping,
we used GenBank accession numbers to identify each mRNA.
When HIXs or TUs within a cluster share at least one mRNA,
they were merged into a sub-cluster. For a cluster in which
there is no common mRNA, we used the gene identifier given
to each transcriptional unit as a proxy for shared transcripts.
Thus, if a transcriptional unit has the same gene identifier as
another transcriptional unit, they are merged. For example,
the cluster consisting of TU7908, HIX0023813 and TU30007
was separated into two groups based on shared transcripts:
HIX0023813 and TU30007 share AK130277 (cluster 2), while
TU7908 (cluster 1) has no common mRNA or gene identifier
with TU30007 or HIX0023813.
STEP (4) We reassessed our clusters by comparing the gene
identifiers of the transcripts in each cluster. If there were any
transcripts with different annotated gene identifiers, we discarded
those clusters from our study. We also examined redundant
isoforms present in more than one cluster. If the corresponding clusters
had different gene identifiers (GIs), those clusters were removed from our dataset.
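The regrouping described in STEP (3) can be illustrated with a minimal Python sketch. It treats transcriptional units as nodes and links any two units that share an mRNA accession or a gene identifier, then reads off the connected groups. This is a simplification: the text applies the gene identifier only when no common mRNA exists, whereas the sketch applies both rules at once. The function name `subcluster`, the data layout, and the accession `AK000001` given to TU7908 are assumptions made for illustration; only the unit names and AK130277 come from the example above.

```python
from collections import defaultdict

def subcluster(units):
    """Group units that share an mRNA accession or a gene identifier.

    units: dict mapping unit name -> (set of mRNA accessions, gene ID or None)
    Returns a list of sets of unit names, one set per sub-cluster.
    """
    parent = {name: name for name in units}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Index units by each shared attribute, then link units under the same key.
    by_key = defaultdict(list)
    for name, (mrnas, gene_id) in units.items():
        for accession in mrnas:
            by_key[("mrna", accession)].append(name)
        if gene_id is not None:
            by_key[("gene", gene_id)].append(name)
    for members in by_key.values():
        for other in members[1:]:
            union(members[0], other)

    groups = defaultdict(set)
    for name in units:
        groups[find(name)].add(name)
    return list(groups.values())

units = {
    "TU7908": ({"AK000001"}, None),      # hypothetical accession
    "HIX0023813": ({"AK130277"}, None),
    "TU30007": ({"AK130277"}, None),
}
clusters = subcluster(units)
```

With these inputs, HIX0023813 and TU30007 fall into one group and TU7908 into another, mirroring the worked example in STEP (3).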
Protein and protein interaction data
Interaction databases and transcriptome data sources use
different standard systems for protein identifiers. We imple-
mented the CRC64 checksum algorithm94 for consistent
identification of protein sequences. All protein sequences in
the interaction databases and the two isoform sets from H-InvDB and
FANTOM3 were assigned a 16-character checksum consisting of digits
and letters, using SPcrc
(ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/old/SPcrc.tar.gz).
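To show how a sequence can be reduced to a 16-character identifier, the following is a minimal Python sketch of a table-driven CRC-64. It uses the generator polynomial whose high 32-bit word is 0xD8000000, the variant commonly associated with Swiss-Prot sequence checksums. It is an assumption that this reproduces the output of SPcrc exactly; the function names are ours.

```python
def _build_table():
    """Precompute the high 32-bit words of the 256-entry CRC table."""
    table = []
    for i in range(256):
        low, high = i, 0
        for _ in range(8):
            carry = low & 1
            low >>= 1
            if high & 1:
                low |= 1 << 31
            high >>= 1
            if carry:
                high ^= 0xD8000000  # high word of the generator polynomial
        table.append(high)
    return table

_TABLE = _build_table()

def crc64(sequence):
    """Return a 16-character hexadecimal checksum for a sequence string."""
    low, high = 0, 0
    for ch in sequence:
        carry_bits = (high & 0xFF) << 24      # bits moving from high to low word
        high_shifted = high >> 8
        low_shifted = (low >> 8) | carry_bits
        index = (low ^ ord(ch)) & 0xFF
        high = high_shifted ^ _TABLE[index]
        low = low_shifted
    return "%08X%08X" % (high, low)
```

Two records can then be treated as the same protein when their checksums agree, which avoids comparing identifiers across databases. A checksum match is not a proof of identity, so a full string comparison is a sensible confirmation step.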
We downloaded human protein–protein interaction (PPI)
sets obtained from six publicly available PPI databases: BIND
(2006-05-25),95 DIP (2008-10-14),10 HPRD (2009-06-07),96
IntAct (2009-08-05),9 MINT (2009-07-29)11 and MPPI
(2005).97 We considered only PPIs defined as binary, direct
physical interactions. We excluded indirectly predicted inter-
actions in which PPIs are predicted based on the SPOKE and
the MATRIX models98 because of the high false-positive (FP)
rate of those predicted interactions from protein complexes.99
Then, we merged PPIs based on sequence identity of interacting
proteins (see above Section 1).
Artificial conditions in high-throughput PPI detection methods
are known to generate a high proportion of false positive PPIs,
especially in yeast-two-hybrid screens (Y2H).100 PPIs discovered
through low-throughput screens (LTS) and supported by several
different experimental methods showed more-reliable interactions
than those identified in high-throughput screens (HTS).101 To
obtain a more-reliable PPI set, we filtered PPIs based on the
discovery methods and the lines of evidence, the scale of a
detection method and supporting literature:
1. PPIs generated from only the Y2H analysis were
removed from each PPI dataset due to the high FP rate;102
2. we then integrated all six PPI datasets, yielding
43104 PPIs among 10484 proteins (Supplementary Table 2); and
3. we retained a PPI in our dataset if the interaction has
been reported in more than one paper, or if a PPI was detected
in a small-scale experiment (one where the associated paper
describes fewer than 10 PPIs).
This process resulted in a reliable PPI dataset of 29761
interactions between 8872 proteins (Supplementary Table 2).
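The three filtering rules above can be sketched in Python as a predicate over the evidence attached to one PPI. The record layout (`Evidence`), the method label `"Y2H"`, and the decision to count all papers when testing for more than one report are assumptions for illustration; the text does not specify how the source databases encode these fields.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    pubmed_id: str       # paper reporting the interaction
    method: str          # detection method, e.g. "Y2H" (hypothetical label)
    ppis_in_paper: int   # number of PPIs described in that paper

def is_reliable(evidence, small_scale_cutoff=10):
    """Return True if a PPI passes the filtering rules sketched above."""
    # Rule 1: discard PPIs supported only by yeast two-hybrid evidence.
    if not evidence or all(e.method == "Y2H" for e in evidence):
        return False
    # Rule 3a: keep PPIs reported in more than one paper.
    papers = {e.pubmed_id for e in evidence}
    if len(papers) > 1:
        return True
    # Rule 3b: keep PPIs detected in a small-scale experiment.
    return any(e.ppis_in_paper < small_scale_cutoff for e in evidence)
```

For example, a PPI reported once in a paper describing 500 interactions would be dropped, whereas the same PPI reported in a second paper would be kept.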
We assigned an interacting protein in the PPI set to the
corresponding transcript cluster where that interacting protein
is encoded. For the assignment, the protein sequence and GI
were used. If an interacting protein matches to an isoform with
100 percent sequence identity over the entire length, this
protein was considered to be encoded by that transcript and
to belong to the corresponding gene. If no exactly matched
isoforms could be found, interacting proteins were mapped to
genes identified by the same gene identifier as the interacting
protein. This process mapped 28 309 interactions involving
8522 proteins from our dataset.
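The two-tier assignment described above (exact sequence match first, gene identifier as fallback) can be sketched as follows. The lookup tables and their names are hypothetical, and using an identical sequence checksum as a stand-in for 100 percent identity over the entire length is an assumption of this sketch.

```python
def map_protein(checksum, gene_id, isoform_index, gene_index):
    """Assign an interacting protein to a transcript cluster.

    isoform_index: dict mapping sequence checksum -> (cluster ID, isoform ID)
    gene_index:    dict mapping gene identifier -> cluster ID
    Returns (cluster ID, isoform ID or None), or None if unmapped.
    """
    hit = isoform_index.get(checksum)
    if hit is not None:
        return hit                    # exact match to a known isoform
    cluster = gene_index.get(gene_id)
    if cluster is not None:
        return (cluster, None)        # mapped at gene level only
    return None
```

Keeping the isoform slot empty for gene-level matches preserves the distinction between proteins mapped to a specific isoform and proteins mapped only to a gene.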
Domain prediction and domain–domain interactions
To examine domain architecture for isoforms and interacting
proteins, we used Pfam-A (version 24.0).35 We ran ‘pfam_scan.pl’,
which is based on HMMER 3.0 beta 3,103 with Pfam-A.hmm as
the profile hidden Markov model library. Of the 68 748 protein
isoforms in our dataset, 54 630 have at least one Pfam domain.
We took domain–domain interactions (DDIs) from iPfam
(2007, Pfam 21)27 and 3DID (2010, Pfam 24).26 Both DDI sets
were generated based on known, high-resolution 3D structures
available in PDB.104 Only DDIs found in known 3D structures
of eukaryotes were collected, with 1962 DDIs contributed by
iPfam and 2642 DDIs contributed by 3DID to give a total of
3021 DDIs (Supplementary Fig. 4). For this study, we accept
DDIs that result from both inter- and intra-chain molecular
interactions. From 3DID, we used only DDIs with a Z-score ≥ 1.3,
corresponding to a significance of over 90%.105 These conditions
are stringent, and generate a conservative list of DDIs.
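Two of the numbers above can be checked with a short Python sketch. First, if the Z-score is read against a one-sided standard normal distribution (an assumption; 3DID may define its score differently), a cutoff of 1.3 corresponds to a cumulative probability of about 0.903. Second, the stated counts imply by inclusion-exclusion how many DDIs the two sources share. The function names are ours.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def passes_threshold(z_score, cutoff=1.3):
    """Keep a DDI only if its Z-score reaches the cutoff."""
    return z_score >= cutoff

significance = normal_cdf(1.3)        # roughly 0.903

# Inclusion-exclusion on the stated counts: |A and B| = |A| + |B| - |A or B|
ipfam, threedid, union = 1962, 2642, 3021
shared = ipfam + threedid - union     # DDIs present in both sources
```

On these figures, 1583 DDIs would be common to iPfam and 3DID.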
The impact of variation in transcript sequences on potential
isoform interactions was inferred with reference to our DDI
dataset and known protein interactions (Supplementary
Fig. 5). First, the integrated set of DDIs was mapped to our
set of high-quality human PPIs. We then identified interacting
protein pairs that contain at least one DDI, and examined the
isoforms of these reference proteins to see whether they maintain
the relevant domains. If a pair of isoforms maintains one of
the DDIs found in the reference PPI, this pair of isoforms was
designated a potential isoform interaction.
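The rule for designating a potential isoform interaction can be written as a small Python function. The Pfam accessions for the AKT1 architectures are those given in the Results; the partner domain `PF_PARTNER` and the DDI pairing it with the PH domain are hypothetical placeholders, not data from this study.

```python
def potential_isoform_interaction(domains_a, domains_b, reference_ddis):
    """True if the isoform pair retains at least one DDI of the reference PPI.

    domains_a, domains_b: sets of Pfam accessions found in each isoform
    reference_ddis: iterable of (domain, domain) pairs seen in the reference PPI
    """
    for d1, d2 in reference_ddis:
        # DDIs are unordered, so test both orientations.
        if (d1 in domains_a and d2 in domains_b) or \
           (d2 in domains_a and d1 in domains_b):
            return True
    return False

akt1_canonical = {"PF00169", "PF00069", "PF00433"}   # PH, kinase, kinase C-terminal
akt1_no_ph = {"PF00069", "PF00433"}                  # isoform lacking the PH domain
partner = {"PF_PARTNER"}                             # hypothetical partner domain
ddis = [("PF00169", "PF_PARTNER")]                   # hypothetical DDI
```

Under these assumed inputs the canonical protein retains the interaction and the PH-less isoform does not, which is the pattern of loss the analysis is designed to detect.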
We analysed the frequency of protein interaction domains
(PIDs) in isoforms and PPIs (Supplementary Data 2), counting
the number of hits for a given PID in isoforms. For the
abundance of DDI in PPIs, numbers of hits for a given DDI
in PPIs were counted. In both analyses, when a given domain
or DDI was repeated in an isoform or a PPI, only one hit was
counted.
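The counting convention (one hit per isoform, however many times a domain is repeated) can be sketched in Python as below. The input layout and the isoform names are hypothetical.

```python
from collections import Counter

def pid_frequency(isoform_domains):
    """Count, for each domain, the number of isoforms that contain it.

    isoform_domains: dict mapping isoform ID -> list of Pfam accessions,
    in which a repeated domain may appear more than once.
    """
    counts = Counter()
    for domains in isoform_domains.values():
        counts.update(set(domains))   # a repeated domain counts once per isoform
    return counts

example = {
    "isoform_1": ["PF00017", "PF00018", "PF00018"],   # SH3 domain repeated
    "isoform_2": ["PF00018"],
}
freq = pid_frequency(example)
```

The same function applies to DDIs in PPIs if each value is a list of domain pairs rather than single domains.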
Statistical tests
Enrichment tests for the JAK-STAT pathway were based on
the hypergeometric distribution function, against the back-
ground of our complete dataset, in which genes were assigned
as either variable or not variable. For the JAK-STAT pathway
enrichment, 155 gene identifiers associated with the pathway
were downloaded from KEGG. Clusters were mapped to the
JAK-STAT set via Entrez Gene IDs. Five genes mapped to
two clusters (Gene IDs: 867, 2057, 3575, 10000, and 30837). In
these cases, we took the value (variable or not variable) of the
multi-isoform cluster mapped to the gene in preference to the
single-isoform unit. Four JAK-STAT genes were not mapped
to clusters in our dataset (Gene IDs: 1270, 3563, 10379 and
64109). In the absence of data, these were assigned as not
variable. All other enrichment tests were calculated using the
Chi-squared distribution.
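A hypergeometric enrichment test of the kind described can be sketched in Python with exact integer arithmetic. Whether the reported probability is the upper tail P(X ≥ k) or the point probability P(X = k) is not stated, so the tail used here is an assumption, as are the function name and parameter names.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable.

    population: genes in the background dataset
    successes:  background genes classed as variable
    draws:      genes in the pathway
    k:          pathway genes classed as variable
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible terms drop out of the sum automatically.
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total
```

As a toy check, drawing 3 genes from a background of 10 that contains 4 variable genes gives P(X ≥ 2) = 40/120 = 1/3. For the pathway analysis, `draws` would be 155 and the other arguments would come from the complete dataset.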
The statistical significance of differences in the length and
identity of proteins from single-isoform units compared with
proteins from multiple-isoform clusters was calculated using
Student’s t-test. The analysis of the distribution of categories
of PPI in Fig. 1B used an expected frequency calculated from
the conditional probability of picking a pair of proteins
given the number of proteins in each category (P(m,m) = 0.66,
P(m,s) = 0.30, P(s,s) = 0.04). Pairs are undirected, so the value for P(m,s)
combines the two ordered cases, P(m,s) = 0.15 and P(s,m) = 0.15.
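The expected frequencies of the three pair categories can be sketched in Python from the number of proteins in each category. The sketch assumes the two members of a pair are drawn independently, which may differ in detail from the calculation used here, and the counts in the example are hypothetical.

```python
def pair_category_probabilities(n_multi, n_single):
    """Expected frequency of undirected protein pairs by category.

    n_multi:  proteins from multiple-isoform clusters (m)
    n_single: proteins from single-isoform units (s)
    """
    total = n_multi + n_single
    p_m = n_multi / total
    p_s = n_single / total
    return {
        "mm": p_m * p_m,
        "ms": 2 * p_m * p_s,   # undirected: (m,s) and (s,m) combined
        "ss": p_s * p_s,
    }

expected = pair_category_probabilities(n_multi=8, n_single=2)   # hypothetical counts
```

The factor of two in the mixed category is the same doubling described above, where the two ordered cases of 0.15 combine to 0.30.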
Functional analysis of the variable genes mapped to PPIs
was conducted using the DAVID Bioinformatics Database
(http://david.abcc.ncifcrf.gov/). Entrez Gene IDs associated
with our variable multiple-isoform clusters were uploaded
as a query list, and compared to the whole human genome
background. Gene ontology categories Biological Process,
Molecular Function and Cellular component were collectively
interrogated using the functional annotation clustering
method. DAVID pathway analysis was conducted against
annotation from the Reactome Pathway Database. The p values
we report from DAVID control for multiple hypothesis testing
using the Benjamini-Hochberg correction. Results of these
enrichment tests are available in Supplementary Data File 3.
Signalling proteins were identified based on annotation in
the Gene Ontology Database, where 7053 gene products are
annotated with the term GO:0023052 Signalling, or annotated
with children of this term. We collected these proteins from
the human gene association file (September 2011). Uniprot
entries for the Uniprot IDs in this file were downloaded, and
Entrez Gene IDs and HGNC IDs were extracted from the
entries. If Entrez Gene IDs were not present in the Uniprot
protein entry, the HGNC ID was used to retrieve Entrez Gene
IDs from the HGNC database. The 571 proteins for which we
were unable to retrieve an Entrez Gene ID were omitted from
the analysis. These lists were converted to a non-redundant set
of Entrez Gene IDs, which were mapped onto our dataset to
assign a value of either variable or not variable.
Acknowledgements
The authors acknowledge the computational infrastructure
provided by the Queensland Facility for Advanced Bioinformatics,
which is supported by a Queensland Government Smart State
grant. The authors also thank Chang Liu for his contribution
to the STAT1 case study, and Dr Stefan Maetschke for
technical advice. MJD and MAR are funded by the Australian
Research Council [grant number DP110103384].
References
1 G. W. Beadle and E. L. Tatum, Proc. Natl. Acad. Sci. U. S. A., 1941, 27, 499–506.
2 P. Carninci, T. Kasukawa, S. Katayama, J. Gough, M. C. Frith,N. Maeda, R. Oyama, T. Ravasi, B. Lenhard, C. Wells,R. Kodzius, K. Shimokawa, V. B. Bajic, S. E. Brenner,S. Batalov, A. R. Forrest, M. Zavolan, M. J. Davis, L. G.Wilming, V. Aidinis, J. E. Allen, A. Ambesi-Impiombato,R. Apweiler, R. N. Aturaliya, T. L. Bailey, M. Bansal, L. Baxter,K. W. Beisel, T. Bersano, H. Bono, A. M. Chalk, K. P. Chiu,V. Choudhary, A. Christoffels, D. R. Clutterbuck, M. L. Crowe,E. Dalla, B. P. Dalrymple, B. de Bono, G. Della Gatta,D. di Bernardo, T. Down, P. Engstrom, M. Fagiolini, G. Faulkner,C. F. Fletcher, T. Fukushima, M. Furuno, S. Futaki, M. Gariboldi,P. Georgii-Hemming, T. R. Gingeras, T. Gojobori, R. E. Green,S. Gustincich, M. Harbers, Y. Hayashi, T. K. Hensch, N. Hirokawa,D. Hill, L. Huminiecki, M. Iacono, K. Ikeo, A. Iwama, T. Ishikawa,M. Jakt, A. Kanapin, M. Katoh, Y. Kawasawa, J. Kelso,H. Kitamura, H. Kitano, G. Kollias, S. P. Krishnan, A. Kruger,S. K. Kummerfeld, I. V. Kurochkin, L. F. Lareau, D. Lazarevic,L. Lipovich, J. Liu, S. Liuni, S. McWilliam, M. Madan Babu,M. Madera, L. Marchionni, H. Matsuda, S. Matsuzawa, H. Miki,F. Mignone, S. Miyake, K. Morris, S. Mottagui-Tabar, N. Mulder,N. Nakano, H. Nakauchi, P. Ng, R. Nilsson, S. Nishiguchi,S. Nishikawa, F. Nori, O. Ohara, Y. Okazaki, V. Orlando,K. C. Pang, W. J. Pavan, G. Pavesi, G. Pesole, N. Petrovsky,S. Piazza, J. Reed, J. F. Reid, B. Z. Ring, M. Ringwald, B. Rost,Y. Ruan, S. L. Salzberg, A. Sandelin, C. Schneider, C. Schonbach,K. Sekiguchi, C. A. Semple, S. Seno, L. Sessa, Y. Sheng, Y. Shibata,H. Shimada, K. Shimada, D. Silva, B. Sinclair, S. Sperling, E. Stupka,K. Sugiura, R. Sultana, Y. Takenaka, K. Taki, K. Tammoja,S. L. Tan, S. Tang, M. S. Taylor, J. Tegner, S. A. Teichmann,H. R. Ueda, E. van Nimwegen, R. Verardo, C. L. Wei, K. Yagi,H. Yamanishi, E. Zabarovsky, S. Zhu, A. Zimmer, W. Hide, C. Bult,S. M. Grimmond, R. D. Teasdale, E. T. Liu, V. Brusic,J. Quackenbush, C. Wahlestedt, J. S. Mattick, D. A. Hume,C. 
Kai, D. Sasaki, Y. Tomaru, S. Fukuda, M. Kanamori-Katayama, M. Suzuki, J. Aoki, T. Arakawa, J. Iida, K. Imamura, M. Itoh, T. Kato, H. Kawaji, N. Kawagashira, T. Kawashima, M. Kojima, S. Kondo, H. Konno, K. Nakano, N. Ninomiya, T. Nishio, M. Okada, C. Plessy, K. Shibata, T. Shiraki, S. Suzuki, M. Tagami, K. Waki, A. Watahiki, Y. Okamura-Oho, H. Suzuki, J. Kawai and Y. Hayashizaki, Science, 2005, 309, 1559–1563.
3 Y. Hayashizaki and P. Carninci, PLoS Genet., 2006, 2, e63.
4 T. Imanishi, T. Itoh, Y. Suzuki, C. O’Donovan, S. Fukuchi, K. O. Koyanagi, R. A. Barrero, T. Tamura, Y. Yamaguchi-Kabata, M. Tanino, K. Yura, S. Miyazaki, K. Ikeo, K. Homma,
A. Kasprzyk, T. Nishikawa, M. Hirakawa, J. Thierry-Mieg,D. Thierry-Mieg, J. Ashurst, L. Jia, M. Nakao, M. A. Thomas,N. Mulder, Y. Karavidopoulou, L. Jin, S. Kim, T. Yasuda,B. Lenhard, E. Eveno, Y. Suzuki, C. Yamasaki, J. Takeda,C. Gough, P. Hilton, Y. Fujii, H. Sakai, S. Tanaka, C. Amid,M. Bellgard, F. Bonaldo Mde, H. Bono, S. K. Bromberg,A. J. Brookes, E. Bruford, P. Carninci, C. Chelala,C. Couillault, S. J. de Souza, M. A. Debily, M. D. Devignes,I. Dubchak, T. Endo, A. Estreicher, E. Eyras, K. Fukami-Kobayashi, G. R. Gopinath, E. Graudens, Y. Hahn, M. Han,Z. G. Han, K. Hanada, H. Hanaoka, E. Harada, K. Hashimoto,U. Hinz, M. Hirai, T. Hishiki, I. Hopkinson, S. Imbeaud,H. Inoko, A. Kanapin, Y. Kaneko, T. Kasukawa, J. Kelso,P. Kersey, R. Kikuno, K. Kimura, B. Korn, V. Kuryshev,I. Makalowska, T. Makino, S. Mano, R. Mariage-Samson,J. Mashima, H. Matsuda, H. W. Mewes, S. Minoshima,K. Nagai, H. Nagasaki, N. Nagata, R. Nigam, O. Ogasawara,O. Ohara, M. Ohtsubo, N. Okada, T. Okido, S. Oota, M. Ota,T. Ota, T. Otsuki, D. Piatier-Tonneau, A. Poustka, S. X. Ren,N. Saitou, K. Sakai, S. Sakamoto, R. Sakate, I. Schupp,F. Servant, S. Sherry, R. Shiba, N. Shimizu, M. Shimoyama,A. J. Simpson, B. Soares, C. Steward, M. Suwa, M. Suzuki,A. Takahashi, G. Tamiya, H. Tanaka, T. Taylor, J. D. Terwilliger,P. Unneberg, V. Veeramachaneni, S. Watanabe, L. Wilming,N. Yasuda, H. S. Yoo, M. Stodolsky, W. Makalowski, M. Go,K. Nakai, T. Takagi, M. Kanehisa, Y. Sakaki, J. Quackenbush,Y. Okazaki, Y. Hayashizaki, W. Hide, R. Chakraborty,K. Nishikawa, H. Sugawara, Y. Tateno, Z. Chen, M. Oishi,P. Tonellato, R. Apweiler, K. Okubo, L. Wagner, S. Wiemann,R. L. Strausberg, T. Isogai, C. Auffray, N. Nomura, T. Gojoboriand S. Sugano, PLoS Biol., 2004, 2, e162.
5 A. J. Matlin, F. Clark and C. W. Smith, Nat. Rev. Mol. Cell Biol., 2005, 6, 386–398.
6 B. Modrek and C. Lee, Nat. Genet., 2002, 30, 13–19.
7 B. R. Graveley, Trends Genet., 2001, 17, 100–107.
8 Y. Hasegawa and Y. Hayashizaki, in Introduction to Systems Biology, ed. S. Choi, Humana Press, Totowa, 2007, pp. 85–105.
9 S. Kerrien, Y. Alam-Faruque, B. Aranda, I. Bancarz, A. Bridge, C. Derow, E. Dimmer, M. Feuermann, A. Friedrichsen, R. Huntley, C. Kohler, J. Khadake, C. Leroy, A. Liban, C. Lieftink, L. Montecchi-Palazzi, S. Orchard, J. Risse, K. Robbe, B. Roechert, D. Thorneycroft, Y. Zhang, R. Apweiler and H. Hermjakob, Nucleic Acids Res., 2007, 35, D561–D565.
10 L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie and D. Eisenberg, Nucleic Acids Res., 2004, 32, D449–D451.
11 A. Chatr-aryamontri, A. Ceol, L. M. Palazzi, G. Nardelli, M. V. Schneider, L. Castagnoli and G. Cesareni, Nucleic Acids Res., 2007, 35, D572–D574.
12 D. Szklarczyk, A. Franceschini, M. Kuhn, M. Simonovic, A. Roth, P. Minguez, T. Doerks, M. Stark, J. Muller, P. Bork, L. J. Jensen and C. von Mering, Nucleic Acids Res., 2011, 39, D561–D568.
13 C. Prieto and J. De Las Rivas, Nucleic Acids Res., 2006, 34, W298–W302.
14 A. Kamburov, K. Pentchev, H. Galicka, C. Wierling, H. Lehrach and R. Herwig, Nucleic Acids Res., 2011, 39, D712–D717.
15 R. M. Kaake, T. Milenkovic, N. Przulj, P. Kaiser and L. Huang, J. Proteome Res., 9, 2016–2029.
16 A. J. Lusis and J. N. Weiss, Circulation, 2010, 121, 157–170.
17 B. Lehne and T. Schlitt, Hum. Genomics, 2009, 3, 291–297.
18 A. Chatr-aryamontri, S. Kerrien, J. Khadake, S. Orchard, A. Ceol, L. Licata, L. Castagnoli, S. Costa, C. Derow, R. Huntley, B. Aranda, C. Leroy, D. Thorneycroft, R. Apweiler, G. Cesareni and H. Hermjakob, Genome Biology, 2008, 9(Suppl 2), S5.
19 S. Mathivanan, B. Periaswamy, T. K. Gandhi, K. Kandasamy, S. Suresh, R. Mohmood, Y. L. Ramachandra and A. Pandey, BMC Bioinformatics, 2006, 7(Suppl 5), S19.
20 E. V. Kriventseva, I. Koch, R. Apweiler, M. Vingron, P. Bork, M. S. Gelfand and S. Sunyaev, Trends Genet., 2003, 19, 124–128.
21 M. Lorenz, B. Hewing, J. Hui, A. Zepp, G. Baumann, A. Bindereif, V. Stangl and K. Stangl, FASEB J., 2007, 21, 1556–1564.
22 M. J. Davis, K. A. Hanson, F. Clark, J. L. Fink, F. Zhang, T. Kasukawa, C. Kai, J. Kawai, P. Carninci, Y. Hayashizaki and R. D. Teasdale, PLoS Genet., 2006, 2, e46.
23 T. Pawson and P. Nash, Science, 2003, 300, 445–452.
24 P. Aloy and R. B. Russell, FEBS Lett., 2005, 579, 1854–1858.
25 T. Pawson and P. Nash, Genes Dev., 2000, 14, 1027–1047.
26 A. Stein, A. Panjkovich and P. Aloy, Nucleic Acids Res., 2009, 37, D300–D304.
27 R. D. Finn, M. Marshall and A. Bateman, Bioinformatics, 2005, 21, 410–412.
28 M. Deng, S. Mehta, F. Sun and T. Chen, Genome Res., 2002, 12, 1540–1548.
29 R. Riley, C. Lee, C. Sabatti and D. Eisenberg, Genome Biology, 2005, 6, R89.
30 A. Resch, Y. Xing, B. Modrek, M. Gorlick, R. Riley and C. Lee, J. Proteome Res., 2004, 3, 76–83.
31 K. Yura, M. Shionyu, K. Hagino, A. Hijikata, Y. Hirashima, T. Nakahara, T. Eguchi, K. Shinoda, A. Yamaguchi, K. Takahashi, T. Itoh, T. Imanishi, T. Gojobori and M. Go, Gene, 2006, 380, 63–71.
32 P. R. Romero, S. Zaidi, Y. Y. Fang, V. N. Uversky, P. Radivojac, C. J. Oldfield, M. S. Cortese, M. Sickmeier, T. LeGall, Z. Obradovic and A. K. Dunker, Proc. Natl. Acad. Sci. U. S. A., 2006, 103, 8390–8395.
33 A. Valletti, A. Anselmo, M. Mangiulli, I. Boria, F. Mignone, G. Merla, V. D’Angelo, A. Tullo, E. Sbisa, A. D’Erchia and G. Pesole, Mol. Cancer, 2010, 9, 230.
34 J. Kawai, A. Shinagawa, K. Shibata, M. Yoshino, M. Itoh,Y. Ishii, T. Arakawa, A. Hara, Y. Fukunishi, H. Konno,J. Adachi, S. Fukuda, K. Aizawa, M. Izawa, K. Nishi,H. Kiyosawa, S. Kondo, I. Yamanaka, T. Saito, Y. Okazaki,T. Gojobori, H. Bono, T. Kasukawa, R. Saito, K. Kadota,H. Matsuda, M. Ashburner, S. Batalov, T. Casavant,W. Fleischmann, T. Gaasterland, C. Gissi, B. King,H. Kochiwa, P. Kuehl, S. Lewis, Y. Matsuo, I. Nikaido,G. Pesole, J. Quackenbush, L. M. Schriml, F. Staubli,R. Suzuki, M. Tomita, L. Wagner, T. Washio, K. Sakai,T. Okido, M. Furuno, H. Aono, R. Baldarelli, G. Barsh,J. Blake, D. Boffelli, N. Bojunga, P. Carninci, M. F. de Bonaldo,M. J. Brownstein, C. Bult, C. Fletcher, M. Fujita, M. Gariboldi,S. Gustincich, D. Hill, M. Hofmann, D. A. Hume, M. Kamiya,N. H. Lee, P. Lyons, L. Marchionni, J. Mashima, J. Mazzarelli,P. Mombaerts, P. Nordone, B. Ring, M. Ringwald, I. Rodriguez,N. Sakamoto, H. Sasaki, K. Sato, C. Schonbach, T. Seya,Y. Shibata, K. F. Storch, H. Suzuki, K. Toyo-oka,K. H. Wang, C. Weitz, C. Whittaker, L. Wilming,A. Wynshaw-Boris, K. Yoshida, Y. Hasegawa, H. Kawaji,S. Kohtsuki and Y. Hayashizaki, Nature, 2001, 409, 685–690.
35 R. D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunasekaran, G. Ceric, K. Forslund, L. Holm, E. L. Sonnhammer, S. R. Eddy and A. Bateman, Nucleic Acids Res., 2010, 38, D211–D222.
36 W. R. Pearson, Methods Mol. Biol., 2000, 132, 185–219.
37 L. Lum, M. S. Reid and C. P. Blobel, J. Biol. Chem., 1998, 273, 26236–26247.
38 C. Ham, B. Levkau, E. W. Raines and B. Herren, Exp. Cell Res., 2002, 279, 239–247.
39 M. Higy, T. Junne and M. Spiess, Biochemistry, 2004, 43, 12716–12722.
40 I. Kleino, R. M. Ortiz, M. Yritys, A.-P. J. Huovila and K. Saksela, J. Cell. Biochem., 2009, 108, 877–885.
41 D. R. Dries and G. Yu, Proc. Natl. Acad. Sci. U. S. A., 2009, 106, 14737–14738.
42 P. D. Sbarba and E. Rovida, Biol. Chem., 2002, 383, 69–83.
43 B. Herzog, C. Pellet-Many, G. Britton, B. Hartzoulakis and I. C. Zachary, Mol. Biol. Cell, 2011, 22, 2766–2776.
44 C. Pellet-Many, P. Frankel, H. Jia and I. Zachary, Biochem. J., 2008, 411, 211–226.
45 H. Fujisawa and T. Kitsukawa, Curr. Opin. Neurobiol., 1998, 8, 587–592.
46 M. Rossignol, M. L. Gagnon and M. Klagsbrun, Genomics, 2000, 70, 211–222.
47 S. Soker, H. Q. Miao, M. Nomi, S. Takashima and M. Klagsbrun, J. Cell. Biochem., 2002, 85, 357–368.
48 C. Prahst, M. Heroult, A. A. Lanahan, N. Uziel, O. Kessler, N. Shraga-Heled, M. Simons, G. Neufeld and H. G. Augustin, J. Biol. Chem., 2008, 283, 25110–25114.
49 B. A. Appleton, P. Wu, J. Maloney, J. Yin, W. C. Liang, S. Stawicki, K. Mortara, K. K. Bowman, J. M. Elliott, W. Desmarais, J. F. Bazan, A. Bagri, M. Tessier-Lavigne, A. W. Koch, Y. Wu, R. J. Watts and C. Wiesmann, EMBO J., 2007, 26, 4902–4912.
50 M. L. Gagnon, D. R. Bielenberg, Z. Gechtman, H. Q. Miao, S. Takashima, S. Soker and M. Klagsbrun, Proc. Natl. Acad. Sci. U. S. A., 2000, 97, 2573–2578.
51 S. Biethahn, F. Alves, S. Wilde and W. Hiddemann, Exp. Hematol., 1999, 27, 885–894.
52 N. Zakharova, E. S. Lymar, E. Yang, S. Malik, J. J. Zhang, R. G. Roeder and J. E. Darnell, J. Biol. Chem., 2003, 278, 43067–43073.
53 M. D. Godeny, J. Sayyah, D. VonDerLinden, M. Johns, D. A. Ostrov, J. Caldwell-Busby and P. P. Sayeski, Cell. Signalling, 2007, 19, 600–609.
54 T. M. Saxton, M. Henkemeyer, S. Gasca, R. Shen, D. J. Rossi, F. Shalaby, G.-S. Feng and T. Pawson, EMBO J., 1997, 16, 2352–2364.
55 E. J. Lowenstein, R. J. Daly, A. G. Batzer, W. Li, B. Margolis, R. Lammers, A. Ullrich, E. Y. Skolnik, D. Bar-Sagi and J. Schlessinger, Cell, 1992, 70, 431–442.
56 I. Fath, F. Schweighoffer, I. Rey, M.-C. Multon, J. Boiziau, M. Duchesne and B. Tocque, Science, 1994, 264, 971–974.
57 X. Li, M.-C. Multon, Y. Henin, F. Schweighoffer, C. Venot, J. Josef, C. Zhou, J. LaVecchio, P. Stuckert, M. Raab, A. Mhashilkar, B. Tocque and W. A. Marasco, J. Biol. Chem., 2000, 275, 30925–30933.
58 F. Ramos-Morales, A. Domínguez, R. M. Rios, S. I. Barroso, C. Infante, F. Schweighoffer, B. Tocque, J. Pintor-Toro and M. Tortolero, Biochem. Biophys. Res. Commun., 1997, 237, 735–740.
59 S. Cantor, D. Bell, S. Ganesan, E. Kass, R. Drapkin, S. Grossman, D. Wahrer, D. Sgroi, W. Lane and D. Haber, Cell, 2001, 105, 149–160.
60 F. Durocher, D. Shattuck-Eidens, M. McClure, F. Labrie, M. Skolnick, D. Goldgar and J. Simard, Hum. Mol. Genet., 1996, 5, 835.
61 S. Thakur, H. B. Zhang, Y. Peng, H. Le, B. Carroll, T. Ward, J. Yao, L. M. Farid, F. J. Couch, R. B. Wilson and B. L. Weber, Mol. Cell. Biol., 1997, 17, 444–452.
62 E. Witt and A. Ashworth, Science, 2002, 297, 534.
63 G. W. Yeo, E. Van Nostrand, D. Holste, T. Poggio and C. B. Burge, Proc. Natl. Acad. Sci. U. S. A., 2005, 102, 2850–2855.
64 J. L. Ponthier, C. Schluepen, W. Chen, R. A. Lersch, S. L. Gee, V. C. Hou, A. J. Lo, S. A. Short, J. A. Chasis, J. C. Winkelmann and J. G. Conboy, J. Biol. Chem., 2006, 281, 12468–12474.
65 A. P. Baraniak, J. R. Chen and M. A. Garcia-Blanco, Mol. Cell. Biol., 2006, 26, 1209–1222.
66 A. Damianov and D. L. Black, RNA, 16, 405–416.
67 B. J. Blencowe, Cell, 2006, 126, 37–47.
68 E. T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F. Kingsmore, G. P. Schroth and C. B. Burge, Nature, 2008, 456, 470–476.
69 G. Koscielny, V. Le Texier, C. Gopalakrishnan, V. Kumanduri, J. J. Riethoven, F. Nardone, E. Stanley, C. Fallsehr, O. Hofmann, M. Kull, E. Harrington, S. Boue, E. Eyras, M. Plass, F. Lopez, W. Ritchie, V. Moucadel, T. Ara, H. Pospisil, A. Herrmann, G. R. J. R. Guigo, P. Bork, M. K. Doeberitz, J. Vilo, W. Hide, R. Apweiler, T. A. Thanaraj and D. Gautheret, Genomics, 2009, 93, 213–220.
70 F. Denoeud, P. Kapranov, C. Ucla, A. Frankish, R. Castelo, J. Drenkow, J. Largarde, T. Alioto, C. Manzano, J. Chrast, S. Dike, C. Wyss, C. Henrichsen, N. Holroyd, M. Dickson, R. Taylor, Z. Hance, S. Foissac, R. Myers, J. Rogers, T. Hubbard, J. Harrow, R. Guigo, T. Gingeras, S. Antonarakis and A. Reymond, Genome Res., 2007, 17, 746–759.
71 M. C. Frith, A. R. Forrest, E. Nourbakhsh, K. C. Pang, C. Kai, J. Kawai, P. Carninci, Y. Hayashizaki, T. L. Bailey and S. M. Grimmond, PLoS Genet., 2006, 2, e52.
72 A. Kouranov, L. Xie, J. de la Cruz, L. Chen, J. Westbrook, P. Bourne and H. Berman, Nucleic Acids Res., 2006, 34, 302–305.
73 S. Li and A. E. Koromilas, J. Biol. Chem., 2001, 276, 13881–13890.
74 Y. Stasiv, M. Regulski, B. Kuzin, T. Tully and G. Enikolopov, J. Biol. Chem., 2001, 276, 42241–42251.
75 S. Stamm, S. Ben-Ari, I. Rafalska, Y. Tang, Z. Zhang, D. Toiber, T. A. Thanaraj and H. Soreq, Gene, 2005, 344, 1–20.
76 I. Cascino, G. Fiucci, G. Papoff and G. Ruberti, Immunology, 1995, 154, 2706–2713.
77 S. M. Dehm and D. J. Tindall, Endocrine Related Cancer, 2011.
78 R. Hu, T. A. Dunn, S. Wei, S. Isharwal, R. W. Veltri, E. Humphries, M. Han, A. W. Partin, R. L. Vessella, W. B. Isaacs, G. S. Bova and J. Luo, Cancer Res., 2009, 69, 16–22.
79 C. J. Shin, S. Wong, M. J. Davis and M. A. Ragan, BMC Syst. Biol., 2009, 3, 28.
80 N. M. Kopelman, D. Lancet and I. Yanai, Nat. Genet., 2005, 37, 588–589.
81 Z. Su, J. Wang, J. Yu, X. Huang and X. Gu, Genome Res., 2006, 16, 182–189.
82 C. Shou, N. Bhardwaj, H. Y. K. Lam, K.-K. Yan, P. M. Kim, M. Snyder and M. B. Gerstein, PLoS Comput. Biol., 2011, 7, e1001050.
83 S.-J. Noh, K. Lee, H. Paik and C.-G. Hur, DNA Res., 2006, 13, 229–243.
84 Q. Xu, B. Modrek and C. Lee, Nucleic Acids Res., 2002, 30, 3754–3766.
85 Y. Xing and C. J. Lee, PLoS Genet., 2005, 1, e34.
86 K. Shimokawa, Y. Okamura-Oho, T. Kurita, M. C. Frith, J. Kawai, P. Carninci and Y. Hayashizaki, BMC Bioinformatics, 2007, 8, 161.
87 B. Ambati, M. Nozaki, N. Singh, A. Takeda, P. Jani, T. Suthar, R. Albuquerque, E. Richter, E. Sakurai, M. Newcomb, M. Kleinman, R. Caldwell, Q. Lin, Y. Ogura, A. Orecchia, D. Samuelson, D. Agnew, J. St Leger, W. Green, P. Mahasreshti, D. Curiel, D. Kwan, H. Marsh, S. Ikeda, L. Leiper, J. Collinson, S. Bogdanovich, T. Khurana, M. Shibuya, M. Baldwin, N. Ferrara, H. Gerber, S. De Falco, J. Witta, J. Baffi, B. Raisler and J. Ambati, Nature, 2006, 443.
88 T. Kasukawa, S. Katayama, H. Kawaji, H. Suzuki, D. A. Hume and Y. Hayashizaki, Genomics, 2004, 84, 913–921.
89 J. Takeda, Y. Suzuki, M. Nakao, R. A. Barrero, K. O. Koyanagi, L. Jin, C. Motono, H. Hata, T. Isogai, K. Nagai, T. Otsuki, V. Kuryshev, M. Shionyu, K. Yura, M. Go, J. Thierry-Mieg, D. Thierry-Mieg, S. Wiemann, N. Nomura, S. Sugano, T. Gojobori and T. Imanishi, Nucleic Acids Res., 2006, 34, 3917–3928.
90 S. Gustincich, A. Sandelin, C. Plessy, S. Katayama, R. Simone, D. Lazarevic, Y. Hayashizaki and P. Carninci, J. Physiol., 2006, 575, 321–332.
91 Q. Pan, O. Shai, L. J. Lee, B. J. Frey and B. J. Blencowe, Nat. Genet., 2008, 40, 1413–1415.
92 A. Kumar, Eukaryotic Cell, 2009, 8, 1321–1329.
93 J. Takeda, Y. Suzuki, M. Nakao, T. Kuroda, S. Sugano, T. Gojobori and T. Imanishi, Nucleic Acids Res., 2007, 35, D104–D109.
94 W. H. Press and W. H. Press, Numerical recipes in C: the art of scientific computing, Cambridge University Press, Cambridge, 2nd edn, 1992.
95 C. Alfarano, C. E. Andrade, K. Anthony, N. Bahroos, M. Bajec,K. Bantoft, D. Betel, B. Bobechko, K. Boutilier, E. Burgess,K. Buzadzija, R. Cavero, C. D’Abreo, I. Donaldson,D. Dorairajoo, M. J. Dumontier, M. R. Dumontier, V. Earles,R. Farrall, H. Feldman, E. Garderman, Y. Gong, R. Gonzaga,V. Grytsan, E. Gryz, V. Gu, E. Haldorsen, A. Halupa, R. Haw,A. Hrvojic, L. Hurrell, R. Isserlin, F. Jack, F. Juma, A. Khan,T. Kon, S. Konopinsky, V. Le, E. Lee, S. Ling, M. Magidin,J. Moniakis, J. Montojo, S. Moore, B. Muskat, I. Ng,J. P. Paraiso, B. Parker, G. Pintilie, R. Pirone, J. J. Salama,S. Sgro, T. Shan, Y. Shu, J. Siew, D. Skinner, K. Snyder,R. Stasiuk, D. Strumpf, B. Tuekam, S. Tao, Z. Wang,M. White, R. Willis, C. Wolting, S. Wong, A. Wrong, C. Xin,R. Yao, B. Yates, S. Zhang, K. Zheng, T. Pawson, B. F. Ouelletteand C. W. Hogue, Nucleic Acids Res., 2005, 33, D418–D424.
96 T. S. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan, A. Marimuthu, S. Banerjee, D. S. Somanathan, A. Sebastian, S. Rani, S. Ray, C. J. Harrys Kishore, S. Kanth, M. Ahmed, M. K. Kashyap, R. Mohmood, Y. L. Ramachandra, V. Krishna, B. A. Rahiman, S. Mohan, P. Ranganathan, S. Ramabadran, R. Chaerkady and A. Pandey, Nucleic Acids Res., 2009, 37, D767–D772.
97 P. Pagel, S. Kovac, M. Oesterheld, B. Brauner, I. Dunger-Kaltenbach, G. Frishman, C. Montrone, P. Mark, V. Stumpflen, H. W. Mewes, A. Ruepp and D. Frishman, Bioinformatics, 2005, 21, 832–834.
98 G. D. Bader and C. W. Hogue, Nat. Biotechnol., 2002, 20, 991–997.
99 B. Zhang, B. H. Park, T. Karpinets and N. F. Samatova, Bioinformatics, 2008, 24, 979–986.
100 S. Fields, FEBS J., 2005, 272, 5391–5399.
101 C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields and P. Bork, Nature, 2002, 417, 399–403.
102 E. Sprinzak, S. Sattath and H. Margalit, J. Mol. Biol., 2003, 327, 919–923.
103 S. R. Eddy, Bioinformatics, 1998, 14, 755–763.
104 S. Dutta, K. Burkhardt, J. Young, G. J. Swaminathan, T. Matsuura, K. Henrick, H. Nakamura and H. M. Berman, Mol. Biotechnol., 2009, 42, 1–13.
105 P. Aloy and R. B. Russell, Bioinformatics, 2003, 19, 161–162.