
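[Supplementary Figure 1 plot: bar chart. x-axis: genomic region (5' UTR, CDS, 3' UTR, intron, ncRNA, intergenic, antisense). y-axis: percentage of peaks (0 to 100). Series: peak height threshold 0 (n=167,766), 30 (n=74,741) and 100 (n=21,048).]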

Supplementary Figure 1 eIF4AIII is associated with coding sequences. Percentage of CLIP-seq peaks in genomic regions depending on different peak height thresholds. Light blue bars, no threshold. Intermediate blue bars, threshold 30. Heavy blue bars, threshold 100. The total number n of peaks above each threshold is indicated.

Nature Structural & Molecular Biology: doi:10.1038/nsmb.2420

[Supplementary Figure 2 plots. Panel a: x-axes, distance from exon 3' end (nt, -200 to 0) and distance from exon 5' end (nt, 0 to 200); left y-axis, frequency of mRNA-seq reads; right y-axis, frequency of CLIP1 reads; position -24 is marked. Panel b: x-axis, distance from exon 3' end (nt, -200 to 0); y-axis, frequency of CLIP2 reads; position -24 is marked. Panel c: PPME1 mRNA, exon 2; y-axis, number of reads; scale bar, 50 bases; position -24 is marked.]

Supplementary Figure 2 eIF4AIII binds at the canonical EJC localization. (a) Distribution of the frequency of read centers (blue: mRNA-seq, y-axis left, red: CLIP1, y-axis right) according to their distance (in nucleotides, nt) from the 5´ or 3´ exon-intron junction. Only internal exons of protein-coding and multi-exonic transcripts were considered. (b) Distribution of the frequency of read centers for exons upstream of minor introns (AT-AC). CLIP2 reads for 99 minor introns are represented in red. (c) Exon 2 of the PPME1 mRNA shows a -24 signature linked to the minor spliceosome. Brown boxes and black lines represent exons and introns, respectively. The number of reads is indicated to the left of the exons. Size scales in bases are shown under the exons. The 5´ to 3´ orientation is also indicated.


[Supplementary Figure 3 plots: genome browser views with two tracks each (CLIP eIF4AIII and mRNA-seq); y-axis, number of reads. Transcripts shown: SRSF1, TOMM7, ATP5I, NDUFA7, SF3B5, PRR24, MAT2A and SRSF10 mRNAs. Group labels: NC mRNAs, C mRNAs, Intronless mRNAs, C + NC mRNAs. Scale bars range from 200 bases to 5 kb.]

Supplementary Figure 3 Selected intronless transcripts and spliced transcripts associated with (i) canonical and non-canonical peaks (C + NC mRNAs), (ii) only canonical peaks (C mRNAs) and (iii) only non-canonical peaks (NC mRNAs). Brown boxes and black lines represent exons and introns, respectively. For some mRNAs, not all isoforms are shown. The number of reads is indicated to the left of the transcripts. Size scales in bases are shown under the transcripts. The 5´ to 3´ orientation is also indicated.


[Supplementary Figure 4 plot: horizontal bar chart titled "Top 15 GO categories for top peaks". x-axis: -log(p-value), 0 to 20. Categories shown: RNA localization; Establishment of RNA localization; Nucleic acid transport; RNA transport; mRNA transport; Nucleotide and nucleic acid transport; Chromatin organization; Chromatin modification; Chromosome organization; Nuclear transport; Nucleocytoplasmic transport; RNA processing; mRNA processing; RNA splicing; mRNA metabolic process.]

Supplementary Figure 4 eIF4AIII is mainly associated with mRNAs involved in gene expression. Gene Ontology (GO) categories significantly enriched in the genes containing CLIP-seq peaks relative to the genes identified as expressed with the mRNA-seq data. The top 1,000 canonical peaks and the top 1,000 non-canonical peaks were used. The top 15 categories identified by the DAVID functional annotation tool are represented and ranked by increasing p-values.


[Supplementary Figure 5 plot: bar chart. x-axis: deleted base (A, C, G, U). y-axis: percentage of deletions (0 to 50). Series: CLIP1, CLIP2 and mRNA-seq.]

Supplementary Figure 5 Crosslinking-induced mutation sites in CLIP-seq. Percentage of deleted bases (A, C, G or U) at crosslinking-induced mutation sites (CIMS) in reads uniquely mapped to the human genome. Orange, red and blue columns represent CLIP1, CLIP2 and mRNA-seq, respectively.


Supplementary note for

CLIP-seq of eIF4AIII reveals transcriptome-wide mapping of the human exon junction complex

Jérôme Saulière, Valentine Murigneux, Zhen Wang, Emélie Marquenet, Isabelle Barbosa, Olivier Le Tonquèze, Yann Audic, Luc Paillard, Hugues Roest Crollius and Hervé Le Hir

Preprocessing of reads

Before mapping, the 3' adapter sequence was identified and removed from CLIP-seq reads in two steps. First, we used the Btrim software¹ to trim reads containing the entire 3' adapter sequence (parameters -3 -v7 -l 15). Reads shorter than 15 nt after trimming were discarded. Then a Python script was used to trim reads containing a partial 3' adapter at the end of the read (up to 3 bases). Adapters were trimmed from 94.0% and 99.9% of reads in CLIP1 and CLIP2, respectively, indicating that nearly all RNA fragments were fully sequenced. The median read length was 40 nt for CLIP1 and 38 nt for CLIP2.
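As an illustration only, the partial-adapter step could be sketched in Python as follows. The original script is not provided, so the function name, the example adapter sequence and the reading of "up to 3 bases" as a minimum overlap of 3 nt are all assumptions.

```python
def trim_partial_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a partial adapter (a prefix of `adapter`) found at the 3' end of `read`.

    Assumption: a partial match must cover at least `min_overlap` bases.
    """
    # Try the longest possible overlap first so the largest adapter fragment is removed.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read  # no partial adapter found


# Hypothetical adapter sequence, used only for this example.
EXAMPLE_ADAPTER = "TGGAATTC"
trimmed = trim_partial_adapter("ACGTACGTTGGA", EXAMPLE_ADAPTER)
```

Exact matching is used here for simplicity; a real script might tolerate sequencing errors in the adapter fragment.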

Mapping of reads to the human genome

Prior to mapping, pseudogene regions in the human genome (hg19, Ensembl63) were masked. Reads were mapped to the genome using a two-step strategy. First, we used Novoalign V2.07.11 (Novocraft, 2010, http://www.novocraft.com/), which allows gapped alignments and therefore the detection of small insertions and deletions. FASTQ files were used as input and the option -F ILMFQ was added to ensure correct interpretation of quality values. Second, reads that were not uniquely mapped by Novoalign were mapped to the human genome using BLAT version 34 (BLAST-Like Alignment Tool²). BLAT can efficiently map short reads across exon-exon junctions. BLAT was run with FASTA files as input and default parameters, except for minScore, which was set to 15. All reads uniquely mapped by Novoalign or BLAT were used in subsequent processing. The .wig format files were used for visualization in the University of California at Santa Cruz Genome Browser. To identify ribosomal reads, raw reads were mapped to rRNA consensus sequences with Novoalign. The accession numbers for the rRNA sequences used are U13369 (GenBank, Human ribosomal DNA complete repeating unit) and NR_023371 (NCBI Reference Sequence, Homo sapiens RNA, 5S ribosomal 9).
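The selection logic of the two-step strategy can be sketched as below. This is a simplified illustration that works on already-parsed alignment results; it does not run Novoalign or BLAT, and the data layout (a dictionary from read identifier to a list of hits) and function names are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int, str]  # (chromosome, start, strand); hypothetical layout


def unique_hits(alignments: Dict[str, List[Hit]]) -> Dict[str, Hit]:
    """Keep only reads that have exactly one reported alignment."""
    return {read: hits[0] for read, hits in alignments.items() if len(hits) == 1}


def two_step_mapping(read_ids: Iterable[str],
                     first_pass: Dict[str, List[Hit]],
                     second_pass: Dict[str, List[Hit]]) -> Dict[str, Hit]:
    """Combine unique hits from the first aligner with unique hits that the
    second aligner finds for reads the first aligner could not place uniquely."""
    mapped = unique_hits(first_pass)
    leftovers = [r for r in read_ids if r not in mapped]
    rescued = unique_hits({r: second_pass.get(r, []) for r in leftovers})
    mapped.update(rescued)
    return mapped
```

Reads uniquely placed in the first pass are never re-evaluated with the second aligner, which mirrors the order described in the text.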

Distribution of reads and peaks in the genome

Ensembl63 annotations based on the human hg19 genome assembly were used to assign reads and peaks to genomic regions. Six hierarchical categories were constructed: CDS > 5' UTRs > 3' UTRs > ncRNAs > introns > intergenic regions. Genomic regions were identified using the subtractBed function v2.5.4 from BEDTools³. Reads were assigned to a category using the intersectBed function v2.12.0 from BEDTools. The -s option preserved strand information and thus allowed the detection of antisense reads. To compute read or peak enrichment in a given category, we calculated the total number of nucleotides in the human genome for each category and divided the corresponding proportion of reads or peaks by this value.
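A minimal sketch of the hierarchical assignment and of the enrichment calculation is given below. It assumes that the set of annotation categories overlapping each read has already been determined (for example from interval intersection output); the function names and category labels are illustrative.

```python
from typing import Dict, Set

# Priority order taken from the text: CDS > 5' UTR > 3' UTR > ncRNA > intron > intergenic.
HIERARCHY = ["CDS", "5' UTR", "3' UTR", "ncRNA", "intron", "intergenic"]


def assign_category(overlapping: Set[str]) -> str:
    """Return the highest-priority category among those a read overlaps."""
    for category in HIERARCHY:
        if category in overlapping:
            return category
    return "intergenic"  # assumption: reads with no annotation are intergenic


def enrichment(counts: Dict[str, int], category_sizes: Dict[str, int]) -> Dict[str, float]:
    """Proportion of reads (or peaks) in each category divided by the
    number of genomic nucleotides belonging to that category."""
    total = sum(counts.values())
    return {cat: (n / total) / category_sizes[cat] for cat, n in counts.items()}
```

For instance, a read overlapping both an intron and a 3' UTR of different isoforms would be counted once, as 3' UTR.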


Mapping of reads to representative transcripts

A representative transcript was selected for each gene from the transcripts annotated in Ensembl63. Processed transcripts and pseudogenes were filtered out. For genes with multiple isoforms, the transcript with the largest number of exons was chosen as representative of the gene, leading to the selection of 28,838 transcripts. Mapping of reads to representative transcripts was performed with the same two-step strategy as described above for mapping to the genome. We detected 1.5% and 3% of antisense reads for CLIP1 and CLIP2, respectively, confirming the strand specificity of the protocol. The expression of a transcript was computed as the number of mRNA-seq reads divided by the length of the transcript (Reads Per Base, RPB).
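The selection rule and the RPB measure can be illustrated with the short sketch below. The record layout and function names are hypothetical, and the text does not say how ties between isoforms with the same number of exons were resolved; here the first such isoform in the input is kept.

```python
from typing import Dict, List


def pick_representative(transcripts: List[Dict]) -> Dict:
    """Choose the isoform with the most exons (first one wins on ties)."""
    return max(transcripts, key=lambda t: t["n_exons"])


def reads_per_base(n_reads: int, transcript_length: int) -> float:
    """Expression as mRNA-seq read count divided by transcript length (RPB)."""
    return n_reads / transcript_length


isoforms = [
    {"id": "T1", "n_exons": 3},
    {"id": "T2", "n_exons": 7},
    {"id": "T3", "n_exons": 5},
]
representative = pick_representative(isoforms)
```

With this definition, a 1,500-nt transcript covered by 150 reads has an RPB of 0.1, the threshold used in the text for well-expressed transcripts.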

Distribution of reads in exons (Frequency of read centers relative to splice sites)

We limited the analysis to internal coding exons of multi-exonic and well-expressed transcripts (RPB > 0.1). We considered reads mapped to transcripts by Novoalign and plotted the distance from the center of each read to the 5' and 3' ends of the exon. To adjust for the uneven distribution of exon lengths, we divided the number of read centers mapped at a given position by the number of exons that cover this position.
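The length normalization can be sketched as follows for distances measured from the exon 3' end. The convention that the last exon base lies at distance -1, and the function and argument names, are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List


def normalized_center_frequency(center_distances: List[int],
                                exon_lengths: List[int],
                                max_dist: int = 200) -> Dict[int, float]:
    """Read-center counts per position divided by the number of exons
    long enough to cover that position.

    center_distances: negative distances of read centers from the exon 3' end.
    exon_lengths: lengths of all exons included in the analysis.
    """
    counts = Counter(center_distances)
    freq = {}
    for d in range(1, max_dist + 1):
        # An exon covers distance -d only if it is at least d nt long.
        covering = sum(1 for length in exon_lengths if length >= d)
        if covering:
            freq[-d] = counts.get(-d, 0) / covering
    return freq
```

Positions that no exon reaches are omitted rather than reported as zero, so that short exons do not artificially depress the signal far from the junction.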

Peak calling (Identification of binding sites)

FindPeaks version 4.0.16⁴ from the Vancouver Short Read Analysis Package was used for peak calling. The parameters used were: -aligner bed -dist_type 0 40 -landerwaterman 0.01 -subpeaks 0.5 -bedgraph -readahead_window. We selected the Lander-Waterman based FDR calculation (FDR < 0.01). This method estimates the background noise by assuming that the reads follow a Poisson distribution. Prior to peak calling, uniquely mapped reads from the two CLIP replicates were pooled. To detect enriched regions in the whole genome, peak calling was first performed using reads mapped to the genome. We identified 167,766 peaks in the genome. Then, to identify binding sites within mRNAs, we used the mapping to representative transcripts and identified 177,524 peaks. Reads that aligned antisense to transcripts were removed. Each peak has a height corresponding to the number of reads at the most occupied position within the peak.
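The definition of peak height given above can be made concrete with a small sketch. This is not FindPeaks code; it only illustrates "number of reads at the most occupied position", assuming reads are given as half-open (start, end) intervals.

```python
from typing import List, Tuple


def peak_height(reads: List[Tuple[int, int]], peak_start: int, peak_end: int) -> int:
    """Maximum per-base read depth within the half-open interval [peak_start, peak_end)."""
    depth = [0] * (peak_end - peak_start)
    for start, end in reads:
        # Count only the part of the read that falls inside the peak.
        for pos in range(max(start, peak_start), min(end, peak_end)):
            depth[pos - peak_start] += 1
    return max(depth, default=0)
```

For large data sets a difference array or sorted sweep would be more efficient; the nested loop is kept here for readability.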

Peak classification

We focused on peaks identified in the set of representative transcripts. Exonic peaks were classified into two classes: canonical (peaks with centers between -40 and -10 nt upstream of splice junctions) and non-canonical (peaks with centers outside the -40 to -10 nt window). Among protein-coding, multi-exonic and highly expressed (RPB > 0.5) transcripts, we selected three classes of exons of at least 40 nt (excluding the last exon). The first class corresponded to exons containing one canonical peak with a height threshold of 100 (4,661 exons). The second class comprised exons containing one non-canonical peak with a height threshold of 100 (7,211 exons). The last class included exons without a peak in transcripts containing at least one canonical peak above 100 (7,120 exons).
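The classification rule amounts to a window test on the distance between the peak center and the downstream junction. The sketch below assumes transcript coordinates in which `exon_end` is the position immediately after the last base of the exon, and treats both window bounds as inclusive; neither detail is stated in the text.

```python
def classify_peak(peak_center: int, exon_end: int,
                  lower: int = -40, upper: int = -10) -> str:
    """Label a peak by the position of its center relative to the exon 3' end."""
    distance = peak_center - exon_end  # negative inside the exon
    if lower <= distance <= upper:
        return "canonical"
    return "non-canonical"
```

A peak centered 24 nt upstream of the junction, the expected EJC position, is therefore classified as canonical.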

Variability between exons within a transcript

We selected 4,251 highly expressed (RPB > 0.5) and protein-coding transcripts containing more than 3 exons. We computed the mean number of reads per base in the canonical region (between -40 and -10 nt upstream of splice junctions) for the different exons of each transcript (excluding the last exon). Exon means were scaled such that the largest value for each transcript became 1.0, and exons were ranked by increasing value. Then we used the lm function from R to fit a linear model to these values. We compared the estimates of the slope of the linear model between CLIP-seq and mRNA-seq.
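For a single transcript, the scaling, ranking and slope estimation can be sketched as below. The analysis itself used the lm function in R; this Python version uses ordinary least squares on (rank, scaled value) pairs, and both the choice of rank as the predictor and the per-transcript fit are assumptions about how the model was set up.

```python
from typing import List


def scaled_rank_slope(exon_means: List[float]) -> float:
    """Scale exon means to a maximum of 1.0, sort them in increasing order,
    and return the least-squares slope of scaled value against rank.

    Requires at least two exons and a positive maximum.
    """
    top = max(exon_means)
    scaled = sorted(value / top for value in exon_means)
    n = len(scaled)
    ranks = range(1, n + 1)
    mean_x = sum(ranks) / n
    mean_y = sum(scaled) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ranks, scaled))
    sxx = sum((x - mean_x) ** 2 for x in ranks)
    return sxy / sxx
```

A slope near zero indicates similar signal across the exons of a transcript, whereas a steeper slope indicates stronger exon-to-exon variability.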

Gene Ontology analysis

The enrichment of Gene Ontology (GO) terms in eIF4AIII targets was analysed using the DAVID Bioinformatics Resources version 6.7 (the Database for Annotation, Visualization, and Integrated Discovery⁵). Gene lists corresponding to the top 1,000 canonical and the top 1,000 non-canonical peaks were merged and submitted to the functional annotation tool of DAVID. Background genes were obtained from the mRNA-seq data (10,225 well-expressed transcripts, RPB > 0.1). DAVID reported a p-value for each biological process associated with the gene list. The Bonferroni correction was used to correct for multiple testing. The top 15 enriched categories are shown.
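The correction and ranking step can be illustrated as follows. DAVID performs these calculations itself; the sketch only shows what a Bonferroni adjustment and a top-15 selection look like. The function names are hypothetical, and the base-10 logarithm is an assumption, since the figure axis does not state the base.

```python
import math
from typing import Dict, List, Tuple


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def top_categories(term_pvalues: Dict[str, float],
                   n_top: int = 15) -> List[Tuple[str, float, float]]:
    """Return (term, adjusted p, -log10 adjusted p) for the most significant terms.

    Assumes all p-values are strictly positive.
    """
    terms = list(term_pvalues)
    adjusted = bonferroni([term_pvalues[t] for t in terms])
    ranked = sorted(zip(terms, adjusted), key=lambda item: item[1])
    return [(t, p, -math.log10(p)) for t, p in ranked[:n_top]]
```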

5' and 3' splice site strength

To compute a score for the 5' and 3' splice sites, we used the maximum entropy models for splice sites (MaxEntScan⁶). The 5'ss scoring uses a 9-mer sequence (the last 3 nt of the exon and the first 6 nt of the downstream intron), while the 3'ss scoring uses a 23-mer sequence (the last 20 nt of the upstream intron and the first 3 nt of the exon).
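Extraction of the two input sequences can be sketched as below. The scoring itself is done by MaxEntScan and is not reproduced here; the function names and the 0-based coordinate convention are assumptions.

```python
def five_prime_ss_9mer(seq: str, exon_end: int) -> str:
    """Last 3 nt of the exon plus first 6 nt of the downstream intron.

    exon_end: 0-based index of the first intronic base.
    """
    if exon_end < 3 or exon_end + 6 > len(seq):
        raise ValueError("not enough flanking sequence for a 9-mer")
    return seq[exon_end - 3:exon_end + 6]


def three_prime_ss_23mer(seq: str, exon_start: int) -> str:
    """Last 20 nt of the upstream intron plus first 3 nt of the exon.

    exon_start: 0-based index of the first exonic base.
    """
    if exon_start < 20 or exon_start + 3 > len(seq):
        raise ValueError("not enough flanking sequence for a 23-mer")
    return seq[exon_start - 20:exon_start + 3]
```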


AG percent

For exons associated with canonical and non-canonical peaks, the 31-nt sequence around the center of the peak was extracted to compute the AG percent. As a control, we calculated the AG percent in the -40 to -10 sequence of exons without a peak.
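A possible implementation of this calculation is sketched below. It assumes that "AG percent" means the percentage of bases that are A or G in the 31-nt window (15 nt on each side of the peak center); the text does not define the measure more precisely, and the function names are illustrative.

```python
def window_around(seq: str, center: int, width: int = 31) -> str:
    """Return the `width`-nt window centered on the 0-based index `center`."""
    half = width // 2
    if center - half < 0 or center + half >= len(seq):
        raise ValueError("window extends beyond the sequence")
    return seq[center - half:center + half + 1]


def ag_percent(window: str) -> float:
    """Percentage of A or G bases in the window (assumed definition)."""
    window = window.upper()
    return 100.0 * sum(base in "AG" for base in window) / len(window)
```

The -40 to -10 control region is also 31 nt long, so the same `ag_percent` function applies to it directly.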

Motif analysis

Multiple EM for Motif Elicitation version 4.4.0 (MEME⁷) was used to identify enriched motifs within the sequences around peaks. Canonical and non-canonical peaks were ranked according to their enrichment in CLIP reads relative to the expression of the mRNA in which they occurred. For each class, the top 1,000 peaks were selected. The motif search was performed using the discriminative motif discovery tool of the MEME suite⁸, in which a positive and a negative set of sequences allow the identification of motifs specific to the positive set. For the two classes of peaks, 31-nt sequences around peak centers were input as the positive set. The two negative sets of 1,000 sequences of 31 nt were randomly extracted from exons without a peak (the -40 to -10 nt region for the canonical peaks and a randomly chosen sequence within the exon but outside the canonical region for the non-canonical peaks). The motif width was constrained to between 6 and 8 nt, and only zero or one occurrence of the motif per input peak sequence was allowed (-mod zoops -minw 6 -maxw 8). Other parameters used were: -dna -nmotifs 5 -revcomp.
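Sampling of the negative sequences for non-canonical peaks can be sketched as follows. The sketch interprets "outside the canonical region" as a 31-nt window with no overlap with the -40 to -10 region; because only 9 nt lie downstream of that region, such a window can only be placed upstream of it. This interpretation and the function name are assumptions.

```python
import random
from typing import Optional


def sample_noncanonical_start(exon_len: int, width: int = 31, lower: int = -40,
                              rng: Optional[random.Random] = None) -> Optional[int]:
    """Return a random 0-based start of a `width`-nt window lying entirely
    upstream of the canonical region, or None if the exon is too short."""
    rng = rng or random.Random()
    canonical_start = exon_len + lower  # index of the base at distance -40
    last_valid_start = canonical_start - width
    if last_valid_start < 0:
        return None
    return rng.randint(0, last_valid_start)
```

Passing a seeded `random.Random` instance makes the negative set reproducible.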

Secondary structure prediction

For the three defined classes of exons, the set of 31-nt sequences used for the secondary structure prediction was the same as described for the AG percent calculation. We used the hybrid-ss-min function from the UNAFold software for Nucleic Acid Folding and Hybridization version 3.8⁹ to compute the minimum folding energy of the sequences.

1. Kong, Y. Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 98, 152-3 (2011).

2. Kent, W.J. BLAT--the BLAST-like alignment tool. Genome Res 12, 656-64 (2002).

3. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-2 (2010).

4. Fejes, A.P. et al. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 24, 1729-30 (2008).

5. Huang da, W., Sherman, B.T. & Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57 (2009).

6. Yeo, G. & Burge, C.B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 11, 377-94 (2004).

7. Bailey, T.L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36 (1994).

8. Bailey, T.L., Boden, M., Whitington, T. & Machanick, P. The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics 11, 179 (2010).

9. Markham, N.R. & Zuker, M. UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 453, 3-31 (2008).

