[Figure: bar chart. x-axis: genomic region (5' UTR, CDS, 3' UTR, Intron, ncRNA, Intergenic, Antisense); y-axis: percentage of peaks (0 to 100); series by peak height threshold: 0 (n=167,766), 30 (n=74,741), 100 (n=21,048).]
Supplementary Figure 1 eIF4AIII is associated with coding sequences. Percentage of CLIP-seq peaks in genomic regions depending on different peak height thresholds. Light blue bars, no threshold. Intermediate blue bars, threshold 30. Heavy blue bars, threshold 100. The total number n of peaks above each threshold is indicated.
Nature Structural & Molecular Biology: doi:10.1038/nsmb.2420
[Figure, panel a: x-axes: distance from exon 3' end (nt, -200 to 0) and distance from exon 5' end (nt, 0 to 200); left y-axis: frequency of mRNA-seq reads (0 to 5); right y-axis: frequency of CLIP1 reads (0.0 to 0.6); position -24 is marked.]
[Figure, panel b: x-axis: distance from exon 3' end (nt); y-axis: frequency of CLIP2 reads (0 to 2.5); position -24 is marked.]
[Figure, panel c: read-coverage tracks for PPME1 mRNA, exon 2; y-axis: number of reads (track maxima 207 and 172); scale bar: 50 bases; position -24 is marked.]
Supplementary Figure 2 eIF4AIII binds at the canonical EJC location. (a) Distribution of the frequency of read centers (blue: mRNA-seq, left y-axis; red: CLIP1, right y-axis) according to their distance (in nucleotides, nt) from the 5' or 3' exon-intron junction. Only internal exons of protein-coding, multi-exonic transcripts were considered. (b) Distribution of the frequency of read centers for exons upstream of minor introns (AT-AC). CLIP2 reads for 99 minor introns are shown in red. (c) Exon 2 of the PPME1 mRNA shows a -24 signature linked to the minor spliceosome. Brown boxes and black lines represent exons and introns, respectively. The number of reads is indicated to the left of the exons. Size scales in bases are shown under the exons. The 5' to 3' orientation is also indicated.
[Figure: read-coverage tracks (CLIP eIF4AIII and mRNA-seq; y-axis: number of reads) for the SRSF1, TOMM7, ATP5I, NDUFA7, SF3B5, PRR24, MAT2A and SRSF10 mRNAs, in panels labelled NC mRNAs, C mRNAs, Intronless mRNAs and C + NC mRNAs; scale bars from 200 bases to 5 kb.]
Supplementary Figure 3 Selected intronless transcripts and spliced transcripts associated with (i) canonical and non-canonical peaks (C + NC mRNAs), (ii) only canonical peaks (C mRNAs) and (iii) only non-canonical peaks (NC mRNAs). Brown boxes and black lines represent exons and introns, respectively. For some mRNAs, not all isoforms are shown. The number of reads is indicated to the left of the transcripts. Size scales in bases are shown under the transcripts. The 5' to 3' orientation is also indicated.
[Figure: horizontal bar chart titled "Top 15 GO categories for top peaks"; x-axis: -log(p-value), 0 to 20. Categories shown (in extraction order, not ranked): RNA localization; Establishment of RNA localization; Nucleic acid transport; RNA transport; Chromatin organization; Chromatin modification; Chromosome organization; Nuclear transport; Nucleocytoplasmic transport; RNA processing; mRNA processing; RNA splicing; mRNA metabolic process; mRNA transport; Nucleotide and nucleic acid transport.]
Supplementary Figure 4 eIF4AIII is mainly associated with mRNAs involved in gene expression. Gene Ontology (GO) categories significantly enriched in the genes containing CLIP-seq peaks relative to the genes identified as expressed with the mRNA-seq data. The top 1,000 canonical peaks and the top 1,000 non-canonical peaks were used. The top 15 categories identified by the DAVID functional annotation tool are shown, ranked by increasing p-value.
[Figure: bar chart. x-axis: deleted base (A, C, G, U); y-axis: percentage of deletions (0 to 50); series: CLIP1, CLIP2, mRNA-seq.]
Supplementary Figure 5 Crosslinking-induced mutation sites in CLIP-seq. Percentage of deleted bases (A, C, G or U) at crosslinking-induced mutation sites (CIMS) in reads uniquely mapped to the human genome. Orange, red and blue columns represent CLIP1, CLIP2 and mRNA-seq, respectively.
Supplementary note for
CLIP-seq of eIF4AIII reveals transcriptome-wide mapping of the
human exon junction complex
Jérôme Saulière, Valentine Murigneux, Zhen Wang, Emélie Marquenet, Isabelle
Barbosa, Olivier Le Tonquèze, Yann Audic, Luc Paillard, Hugues Roest Crollius
and Hervé Le Hir
Preprocessing of reads
Before mapping, the 3' adapter sequence was searched for and removed from CLIP-seq
reads in two steps. First, we used the Btrim software¹ to trim reads containing the
entire 3' adapter sequence (parameters -3 -v7 -l 15). Reads shorter than 15 nt after
trimming were discarded. Then a Python script was used to trim reads containing a
partial 3' adapter at the end of the read (up to 3 bases). Of the reads, 94.0% and
99.9% were trimmed in CLIP1 and CLIP2, respectively, indicating that nearly
all RNA fragments were fully sequenced. The median read length was 40 nt and 38
nt for CLIP1 and CLIP2, respectively.
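The second trimming step could look roughly like the sketch below. It is not the authors' script: the adapter sequence is a placeholder, the function name is hypothetical, and "up to 3 bases" is read here as an adapter prefix of 1 to 3 nt at the read's 3' end, which is only one possible interpretation.

```python
# Placeholder adapter; the actual 3' adapter sequence is not given in this note.
ADAPTER = "TGGAATTCTCGG"


def trim_partial_adapter(read: str, adapter: str = ADAPTER, max_partial: int = 3) -> str:
    """Remove the longest adapter prefix (1..max_partial nt) found at the read's 3' end.

    Assumption: a "partial adapter" is a prefix of the adapter that matches
    the last bases of the read exactly (no mismatches allowed).
    """
    longest = min(max_partial, len(adapter), len(read))
    for k in range(longest, 0, -1):  # try the longest candidate first
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read
```

Trying the longest prefix first keeps a 3-nt match from being trimmed as a 1-nt match.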
Mapping of reads to the human genome
Prior to mapping, pseudogene regions in the human genome (hg19, Ensembl63)
were masked. Reads were mapped to the genome using a two-step strategy. We first
used the program Novoalign V2.07.11 (Novocraft, 2010, http://www.novocraft.com/),
which allows gapped alignments and therefore the detection of small insertions and
deletions. Fastq files were used as input and the option -F ILMFQ was added to
ensure correct interpretation of quality values. Second, reads that were not uniquely
mapped by Novoalign were mapped to the human genome using BLAT version 34
(BLAST-Like Alignment Tool²). BLAT can efficiently map short reads across
exon-exon junctions. BLAT was run with FASTA files as input and default parameters,
except that minScore was set to 15. All reads uniquely mapped by Novoalign or BLAT were
used in subsequent processing. The .wig format files were used for visualization in
the University of California at Santa Cruz Genome Browser. To identify ribosomal
reads, raw reads were mapped to rRNA consensus sequences with Novoalign. The
accession numbers for the rRNA sequences used are U13369 (Genbank Human
ribosomal DNA complete repeating unit) and NR_023371 (NCBI Reference
Sequence Homo sapiens RNA, 5S ribosomal 9).
Distribution of reads and peaks in the genome
Ensembl63 annotations based on the human hg19 genome assembly were used to
assign reads and peaks to genomic regions. Six hierarchical categories were
constructed: CDS > 5' UTRs > 3' UTRs > ncRNAs > introns > intergenic regions.
Genomic regions were identified using the subtractBED function v2.5.4 from
BEDTools³. Reads were assigned to a category using the intersectBED function
v2.12.0 from BEDTools. The -s option preserved strand information and thus allowed
the detection of antisense reads. To compute read or peak enrichment in a given
category, we calculated the total number of nucleotides in the human genome for
each category and divided the corresponding proportion of reads or peaks by this
value.
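The hierarchy and the enrichment calculation described above can be sketched as follows. This is a minimal illustration, not the BEDTools-based pipeline: the function names are hypothetical, the overlap sets are assumed to have been computed beforehand, and antisense reads are not handled.

```python
# Priority order taken from the text: CDS > 5' UTRs > 3' UTRs > ncRNAs > introns > intergenic.
HIERARCHY = ["CDS", "5' UTR", "3' UTR", "ncRNA", "intron", "intergenic"]


def assign_category(overlapping: set) -> str:
    """Return the highest-priority category among those a read or peak overlaps."""
    for category in HIERARCHY:
        if category in overlapping:
            return category
    return "intergenic"  # nothing annotated overlaps the feature


def enrichment(counts: dict, category_sizes: dict) -> dict:
    """Proportion of reads (or peaks) per category divided by the category size in nt."""
    total = sum(counts.values())
    return {
        category: (count / total) / category_sizes[category]
        for category, count in counts.items()
    }
```

Because each feature receives exactly one category, the per-category proportions sum to 1 before they are divided by the category sizes.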
Mapping of reads to representative transcripts
The selection of a representative transcript from each gene was done among
annotated transcripts from Ensembl63. Processed transcripts and pseudogenes were
filtered out. For genes having multiple isoforms, the transcript with the largest number
of exons was chosen as representative for the gene, leading to the selection of
28,838 transcripts. Mapping of reads to representative transcripts was performed
with the same two-step strategy as previously described for the mapping to the
genome. We detected 1.5% and 3% of antisense reads for CLIP1 and CLIP2
respectively, confirming the strand specificity of the protocol. The expression of a
transcript was computed as the number of mRNA-seq reads divided by the length of
the transcript (Reads Per Base, RPB).
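As an illustration of the two definitions above, the sketch below picks the isoform with the most exons and computes Reads Per Base. The function names and the input layout are hypothetical, and because the note does not say how ties between isoforms were broken, the first isoform listed wins here.

```python
def pick_representative(isoforms: list) -> str:
    """Return the ID of the isoform with the largest number of exons.

    `isoforms` is a list of (transcript_id, n_exons) tuples for one gene.
    Assumption: on a tie, the first isoform in the list is kept.
    """
    best_id, _ = max(isoforms, key=lambda isoform: isoform[1])
    return best_id


def reads_per_base(n_reads: int, transcript_length: int) -> float:
    """Expression as the number of mRNA-seq reads divided by transcript length (RPB)."""
    if transcript_length <= 0:
        raise ValueError("transcript length must be positive")
    return n_reads / transcript_length
```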
Distribution of reads in exons (Frequency of read centers relative to splice
sites)
We limited the analysis to internal coding exons of multi-exonic and well-expressed
transcripts (RPB > 0.1). We considered reads mapped to transcripts by Novoalign
and plotted the distance from the center of each read to the 5' and 3' ends of the exon. To
adjust for the uneven distribution of exon lengths, we divided the number of read
centers mapped at a given position by the number of exons that cover this position.
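The length adjustment can be sketched as below for distances measured from the exon 3' end. The function name, the input layout and the 0-based offset convention are assumptions made for illustration only.

```python
from collections import Counter


def normalized_center_profile(exon_lengths: list, center_offsets: list, window: int = 200) -> dict:
    """Read-center counts per offset, divided by the number of exons covering that offset.

    `center_offsets` holds, for each read, the number of nt between the read
    center and the 3' end of its exon (0 = last exon base). An exon of length
    L is assumed to cover offsets 0..L-1.
    """
    counts = Counter(d for d in center_offsets if 0 <= d < window)
    profile = {}
    for offset in range(window):
        n_exons = sum(1 for length in exon_lengths if length > offset)
        if n_exons:  # skip offsets that no exon is long enough to reach
            profile[offset] = counts[offset] / n_exons
    return profile
```

Without this division, positions far from the junction would look depleted simply because fewer exons are long enough to contain them.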
Peak calling (Identification of binding sites)
FindPeaks version 4.0.16⁴ from the Vancouver Short Read Analysis Package was
used for peak calling. The parameters used were: -aligner bed -dist_type 0 40
-landerwaterman 0.01 -subpeaks 0.5 -bedgraph -readahead_window. We selected
the Lander-Waterman based FDR calculation (FDR < 0.01). This parameter
estimates the background noise by assuming that the reads follow a Poisson
distribution. Prior to peak calling, uniquely mapped reads from the two CLIP
replicates were pooled. To detect enriched regions in the whole genome, peak calling
was first performed using reads mapped to the genome. We identified 167,766 peaks
in the genome. Then to identify binding sites within mRNAs, we used the mapping to
representative transcripts and identified 177,524 peaks. Reads that align antisense to
transcripts were removed. Each peak has a height corresponding to the number of
reads at the most occupied position within the peak.
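The definition of peak height can be made concrete with the small sketch below. It is not a reimplementation of FindPeaks: the function name is hypothetical, reads are assumed to be half-open (start, end) intervals, and no read extension or FDR filtering is applied.

```python
def peak_height(reads: list, peak_start: int, peak_end: int) -> int:
    """Maximum per-base read coverage within the peak interval [peak_start, peak_end)."""
    coverage = [0] * (peak_end - peak_start)
    for start, end in reads:
        # Only count the part of the read that falls inside the peak.
        for pos in range(max(start, peak_start), min(end, peak_end)):
            coverage[pos - peak_start] += 1
    return max(coverage, default=0)
```

A height threshold such as 30 or 100 can then be applied by keeping only peaks whose height reaches that value.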
Peak classification
We focused on peaks identified in the set of representative transcripts. Exonic peaks
were classified into two classes: canonical (peaks with centers between -40 and -10
nt upstream of splice junctions) and non-canonical (peaks with centers outside the -40
to -10 nt window). Among protein-coding, multi-exonic and highly expressed (RPB >
0.5) transcripts, we selected three classes of exons of at least 40 nt (excluding the
last one). The first class corresponded to exons containing one canonical peak with a
height threshold of 100 (4,661 exons). The second class comprised exons containing
one non-canonical peak with a height threshold of 100 (7,211 exons). The last class
included exons without a peak among transcripts containing at least one canonical
peak above 100 (7,120 exons).
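The canonical window test can be sketched as follows. The function name and coordinate convention are hypothetical, and treating both window bounds as inclusive is an assumption, since the note does not specify it.

```python
CANONICAL_WINDOW = (-40, -10)  # nt relative to the downstream splice junction


def classify_peak(peak_center: int, exon_end: int) -> str:
    """Classify an exonic peak by the offset of its center from the exon 3' end.

    Both arguments are positions on the same transcript; the offset is
    negative for centers upstream of the junction.
    """
    offset = peak_center - exon_end
    low, high = CANONICAL_WINDOW
    return "canonical" if low <= offset <= high else "non-canonical"
```

For example, a peak centered 24 nt upstream of the junction falls in the canonical class, while one centered 5 nt upstream does not.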
Variability between exons within a transcript
We selected 4,251 highly expressed (RPB > 0.5) and protein-coding transcripts
containing more than 3 exons. We computed the mean read per base in the
canonical region (between -40 and -10 nt upstream of splice junctions) for the
different exons of each transcript (excluding the last one). Exon means were
scaled such that the largest value for each transcript became 1.0, and exons were
ranked by increasing value. Then we used the lm function from R to fit a linear model
with these values. We compared the estimates of the slope of the linear model
between CLIP-seq and mRNA-seq.
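A rough Python equivalent of the scale, rank and fit procedure is sketched below. The authors used R's lm. This version uses an ordinary least-squares slope and assumes the scaled values were regressed on the exon rank (1, 2, 3, ...), which the note does not state explicitly. The function name is hypothetical.

```python
def scaled_rank_slope(exon_means: list) -> float:
    """Slope of scaled exon means regressed on their rank (ordinary least squares)."""
    top = max(exon_means)
    if len(exon_means) < 2 or top <= 0:
        raise ValueError("need at least two exons and a positive maximum")
    ys = sorted(value / top for value in exon_means)  # largest value becomes 1.0
    xs = list(range(1, len(ys) + 1))                   # ranks by increasing value
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Under this reading, a slope near zero means the exons of a transcript carry similar signal, and a steeper slope means larger differences between exons.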
Gene Ontology analysis
The enrichment of Gene Ontology (GO) terms in eIF4AIII targets was analysed using
the DAVID Bioinformatics Resources version 6.7 (the Database for Annotation,
Visualization, and Integrated Discovery⁵). Gene lists corresponding to the top 1,000
canonical and the top 1,000 non-canonical peaks were merged and submitted to the
functional annotation tool of DAVID. Background genes were obtained from the
mRNA-seq data (10,225 well-expressed transcripts, RPB > 0.1). DAVID reported a
p-value for each biological process associated with the gene list. The Bonferroni
correction was used to correct for multiple testing of genes. The top 15 enriched
categories are shown.
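For reference, a Bonferroni adjustment followed by a top-15 selection can be sketched as below. DAVID performs this correction itself. The function names are hypothetical, and the number of tests is assumed here to be the number of p-values supplied.

```python
def bonferroni(p_values: list) -> list:
    """Multiply each p-value by the number of tests, capping the result at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def top_categories(p_by_category: dict, n_top: int = 15) -> list:
    """Return the n_top (category, adjusted p-value) pairs with the smallest adjusted p-values."""
    names = list(p_by_category)
    adjusted = bonferroni([p_by_category[name] for name in names])
    ranked = sorted(zip(names, adjusted), key=lambda pair: pair[1])
    return ranked[:n_top]
```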
5' and 3' splice sites strength
To compute a score for the 5' and 3' splice sites, we used the maximum entropy
models for splice sites (MaxEntScan⁶). The 5'ss scoring uses a 9-mer sequence (the
last 3 nt of the exon and the first 6 nt of the downstream intron), while the 3'ss
scoring uses a 23-mer sequence (the last 20 nt of the upstream intron and the first 3
nt of the exon).
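The two input sequences can be extracted as in the sketch below, which covers only the sequence extraction and not the MaxEntScan scoring. The function names are hypothetical, and coordinates are assumed to be 0-based and half-open on a sequence given in the transcript's 5' to 3' orientation.

```python
def five_ss_9mer(seq: str, exon_end: int) -> str:
    """Last 3 nt of the exon plus first 6 nt of the downstream intron.

    `exon_end` is the index one past the last exon base.
    """
    if exon_end - 3 < 0 or exon_end + 6 > len(seq):
        raise ValueError("sequence too short around the 5' splice site")
    return seq[exon_end - 3:exon_end + 6]


def three_ss_23mer(seq: str, exon_start: int) -> str:
    """Last 20 nt of the upstream intron plus first 3 nt of the exon.

    `exon_start` is the index of the first exon base.
    """
    if exon_start - 20 < 0 or exon_start + 3 > len(seq):
        raise ValueError("sequence too short around the 3' splice site")
    return seq[exon_start - 20:exon_start + 3]
```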
AG percent
For exons associated with canonical and non-canonical peaks, the 31-nt sequence
around the center of the peak was extracted to compute the AG percent. As a
control, we calculated the AG percent in the -40 to -10 sequence of exons without
a peak.
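A minimal sketch of this calculation follows. It assumes that "AG percent" means the percentage of A or G nucleotides in the window (purine content) rather than the frequency of the AG dinucleotide, and that the 31-nt window is the peak center plus 15 nt on each side. Both are assumptions, and the function names are hypothetical.

```python
def window_around(seq: str, center: int, width: int = 31) -> str:
    """Return the `width`-nt window centered on the 0-based position `center`."""
    half = width // 2
    if center - half < 0 or center + half + 1 > len(seq):
        raise ValueError("window extends beyond the sequence")
    return seq[center - half:center + half + 1]


def ag_percent(seq: str) -> float:
    """Percentage of A or G nucleotides in the sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return 100.0 * (seq.count("A") + seq.count("G")) / len(seq)
```

The -40 to -10 control region is also 31 nt long when both ends are included, so the two percentages are computed over windows of the same size.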
Motif analysis
Multiple Em for Motif Elicitation version 4.4.0 (MEME⁷) was used to identify enriched
motifs within the sequences around peaks. Canonical and non-canonical peaks were
ranked according to their enrichment in CLIP reads relative to the expression of the
mRNA in which they occurred. For each class, the top 1,000 peaks were selected.
The motif search was performed using the discriminative motif discovery tool of the
MEME suite⁸, where a positive and a negative set of sequences allow the
identification of motifs specific to the positive set. For the two classes of peaks, 31-nt
sequences around peak centers were input as the positive set. The two negative sets
of 1,000 sequences of 31 nt were randomly extracted from exons without a peak (the
-40 to -10 nt region for the canonical peaks and a randomly chosen sequence within
the exon but outside the canonical region for the non-canonical peaks). The motif
width was set between 6 and 8 nt, and only zero or one occurrence of the
motif per input peak sequence was allowed (-mod zoops -minw 6 -maxw 8). Other
parameters used were: -dna -nmotifs 5 -revcomp.
Secondary structure prediction
For the three defined classes of exons, the set of 31-nt sequences used for the
secondary structure prediction was the same as described in the AG percent
calculation. We used the hybrid-ss-min function from the UNAFold software for
Nucleic Acid Folding and Hybridization version 3.8⁹ to compute the minimum folding
energy of the sequences.
1. Kong, Y. Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 98, 152-3 (2011).
2. Kent, W.J. BLAT--the BLAST-like alignment tool. Genome Res 12, 656-64 (2002).
3. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-2 (2010).
4. Fejes, A.P. et al. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 24, 1729-30 (2008).
5. Huang da, W., Sherman, B.T. & Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57 (2009).
6. Yeo, G. & Burge, C.B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 11, 377-94 (2004).
7. Bailey, T.L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36 (1994).
8. Bailey, T.L., Boden, M., Whitington, T. & Machanick, P. The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics 11, 179 (2010).
9. Markham, N.R. & Zuker, M. UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 453, 3-31 (2008).