View
222
Download
0
Category
Preview:
Citation preview
Searching for Transcription Start Sites in Drosophila
01/2018
Wilson Leung
OutlineTranscription start sites (TSS) annotation goalsPromoter architecture in D. melanogaster
New D. melanogaster TSS datasetsFind the initial transcribed exonAnnotate putative transcription start sites
Search for core promoter motifs
Muller F element, heterochromatic, and euchromatic genes show similar expression
levels
Riddle NC, et al. PLoS Genet. 2012 Sep;8(9):e1002954.
TSS of F element genes show lower levels of H3K9me3 and HP1a
H3K9me2
H3K9me3
H3K36me3
HP1a
POF
Su(var)3-9
PolII
Riddle NC, et al. PLoS Genet. 2012 Sep;8(9):e1002954.
Three strategies for motif finding
Multiple genes in a single speciesGenes with common expression patternSequences associated with ChIP-Seq peaks
Single gene in multiple speciesPhylogenetic footprinting
Multiple genes in multiple speciesCompare multiple sequence alignment profiles of multiple genes (Magma)
Motif finding using multiple genes within a single species
Sequences surrounding TSS
Predicted motif instances
1 2 3 4 5 6 7 8 9 10A 4 6 12 0 3 1 13 5 7 9C 22 23 6 64 0 70 0 54 13 33G 9 11 30 3 14 0 2 7 11 10T 36 31 23 4 54 0 56 5 40 19
Bits
0.0
1.0
2.0
5 10
Trl: FlyReg_DNaseI
Genes
PhyloP
phastConsConserved Elements
Multiple Sequence Alignment
D. mel: chr4
Motif finding using single gene in multiple species
Motif finding using multiple genes in multiple species (PhyloNet)
Based on Figure 1 from Wang T and Stormo GD. PNAS 2005 Nov 29;102(48):17400-5.
Promoter sequences
Conserved motifs
1. Identify conserved regions (profiles) in whole genome multiple sequence alignments
2. Identify multiple genes in the genome with similar alignment profiles
Magma: Multiple Aligner of Genomic Multiple AlignmentsKey features of Magma:
Runs ~70x faster than PhyloNetAnalyze multiple sequence alignments with gapsUse set-covering approach to minimize redundancy in discovered motifs
Ihuegbu NE, Stormo GD, Buhler J. J Comput Biol. 2012 Feb;19(2):139-47.
Computationally tractable to analyze conserved motifs in multiple eukaryotic genomes
Goals for the transcription start sites (TSS) annotations
Research goal:Identify motifs that enable Muller F element genes to function within a heterochromatic environment
Annotation goals:Define search regions enriched in regulatory motifs
Define precise location of TSS if possibleDefine search regions where TSS could be found
Document the evidence used to support the TSS annotations
Estimated evolutionary distances with respect to D. melanogaster
D. melanogasterD. simulansD. sechelliaD. yakubaD. erectaD. ficusphilaD. eugracilisD. biarmipesD. takahashiiD. elegansD. rhopaloaD. kikkawai
D. ananassaeD. bipectinata
D. pseudoobscuraD. persimilisD. willistoniD. mojavensisD. virilisD. grimshawi
Species sequenced by modENCODE
Species Substitutions per neutral
siteD. ficusphila 0.80D. eugracilis 0.76D. biarmipes 0.70D. takahashii 0.65D. elegans 0.72D. rhopaloa 0.66D. kikkawai 0.89D. bipectinata 0.99Data from Table 1 of the modENCODE comparative genomics white paper
GEP annotation projects
Challenges with TSS annotations
Fewer constraints on untranslated regions (UTRs)
UTRs evolve more quickly than coding regionsOpen reading frames, compatible phases of donor and acceptor sites do not apply to UTRs
Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRsMost gene finders do not predict UTRsLack of experimental data
Cannot use RNA-Seq data to precisely define the TSS
TSS annotation workflow1. Identify the ortholog2. Note the gene structure in D. melanogaster
3. Annotate the coding exons4. Classify the type of core promoter in D.
melanogaster
5. Annotate the initial transcribed exon6. Identify any core promoter motifs in region7. Define TSS positions or TSS search regions
RNA Polymerase II core promoter
Initiator motif (Inr) contains the TSSTFIID binds to the TATA box and Inr to initiate the assembly of the pre-initiation complex (PIC)
Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol. 2010 Mar 15;339(2):225-9.
Peaked versus broad promoters
Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012 Jan-Feb;1(1):40-51.
Peaked promoter(Single strong TSS)
Broad promoter(Multiple weak TSS)
50-300 bp
RNA-Seq biases introduced by library construction
cDNA fragmentationStrong bias at the 3’ end
RNA fragmentationMore uniform coverageMiss the 5’ and 3’ ends of the transcript
Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.
Gene Span
RNA-
Seq
Read
Cou
nt
5’ 3’
Techniques for finding TSSIdentify the 5’ cap at the beginning of the mRNA
Cap Analysis of Gene Expression (CAGE)RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE)Cap-trapped Expressed Sequence Tags (5’ ESTs)
More information on these techniques:Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012 786:181-200.Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007 Jun;8(6):424-36.
Promoter architecture in Drosophila
Classify core promoter based on the Shape Index (SI)
Determined by the distribution of CAGE and 5’ RLM-RACE readsShape index is a continuum
Most promoters in D. melanogaster contain multiple TSS
Median width = 162 bp
~70% of vertebrate genes have broad promoters
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
Genes with peaked promoters show stronger spatial and tissue specificity
46% of genes with broad promoters are expressed in all stages of embryonic development19% of genes with peaked promoters are expressed in all stages
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
Peaked and broad promoters are enriched in different core promoter motifs
Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73.
Resources for classifying the type of core promoter in D. melanogaster Only a subset of the modENCODE data are available through FlyBase D. melanogaster GEP UCSC Genome Browser [Aug. 2014 (BDGP Release 6) assembly]
FlyBase gene annotations (release 6.16)modENCODE TSS (Celniker) annotationsDNase I hypersensitive sites (DHS)CAGE and RAMPAGE TSS datasets9-state and 16-state chromatin modelsTranscription factor binding site (TFBS) HOT spots
9-state chromatin model
Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011 Mar 24;471(7339):480-5.
DNaseI Hypersensitive Sites (DHS) correspond to accessible regions
Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119:355-62.
Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature. 2014 Aug 28;512(7515):449-52.
modENCODE TSS annotations
Two sets of modENCODE TSS predictionsTSS (Celniker)
Most recent dataset produced by modENCODEAvailable on the GEP UCSC Genome Browser
TSS (Embryonic)Older dataset available from FlyBase GBrowse
Use TSS (Celniker) dataset as the primary evidenceHoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92
Classify the D. melanogaster core promoter based on (TSS) Celniker annotations and
DHS positions
Consider DHS positions within a 300bp window surrounding the start of the D. melanogaster transcript
TSS classification
# Annotated TSS
# DHS positions
Peaked101
011
Intermediate ≤ 1> 1
> 1≤ 1
Broad > 1 > 1Insufficient evidence 0 0
DEMO: Classify the core promoter of D. melanogaster Rad23
BG3 9-stateS2 9-state
chr4
CG2316-RBCG2316-RDCG2316-RCCG2316-RACG2316-RGCG2316-RH
BG3 DHSS2 DHSKc DHS
Additional DHS data from different stages of embryonic development
DHS data produced by the BDTNP project
Evidence tracks:Detected DHS Positions (Embryos)DHS Read Density (Embryos)
Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43.
Stage 11Stage 14
TSS (Celniker)
Stage 5Stage 9
Stage 10
Additional TSS data available in FlyBase release 6.11
Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013 Jan;23(1):169-80.
Benefits of RAMPAGERAMPAGE = RNA Annotation and Mapping of Promoters for Analysis of Gene ExpressionCAGE only allows sequencing of short sequence tags (~27 bp) near the 5’ cap
Ambiguous read mapping to large parts of the genome
RAMPAGE produces long paired-end reads instead of short sequence tags
Developed novel algorithm to identify TSS clustersUsed paired-end information during peak callingUsed Cufflinks to produce partial transcript models
Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5'-complete cDNAs. Curr Protoc Mol Biol. 2013 Nov 11;104:Unit 25B.11.
RAMPAGE results on the GEP UCSC Genome Browser
Lifted RAMPAGE results from release 5 to release 6Results from 36 developmental stagesCombined TSS peak call from all samples
Available under the “Expression and Regulation” section
Standardize analysis of MachiBase and modENCODE CAGE data using
CAGErBioconductor package developed by RIKEN
Map datasets against release 6 assembly37 modENCODE CAGE samples; 7 MachiBase samplesDefine TSS and promoters for each sampleDefine consensus promoters across all samples
Haberle V, et al. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015 Apr 30;43(8):e51.
TSS classifications based on CAGErPeakedFlyBase Genes
modENCODE CAGE PeaksmodENCODE CAGE (Plus)
IntermediateFlyBase GenesmodENCODE CAGE PeaksmodENCODE CAGE (Plus)
BroadmodENCODE CAGE Peaks
modENCODE CAGE (Minus)
FlyBase Genes
Changes in the dominant TSS of Rad23 across different developmental stages
Stag
es o
f Dev
elop
men
t
Adult females
Rad23
CAGE Tag Clusters
Evidence for TSS annotations(in general order of importance)
1. Experimental dataRNA-SeqRNA Pol II ChIP-Seq
2. ConservationType of TSS (peaked/intermediate/broad) in D. melanogasterSequence similarity to initial exon in D. melanogasterSequence similarity to other Drosophila species (Multiz)
3. Core promoter motifsInr, TATA box, etc.
Determine the gene structure in D. melanogaster
FlyBase: GBrowse UTR CDS
Gene Record Finder: Transcript Details
Identify the initial transcribed exon using NCBI blastn
Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record FinderUse placement of the flanking exons to reduce the size of the search region if possibleIncrease sensitivity of nucleotide searches
Change Program Selection to blastnChange Word size to 7Change Match/Mismatch Scores to +1, -1Change Gap Costs to Existence: 2, Extension: 1
Extrapolate TSS position based on blastn alignment of the initial
transcribed exon
Assume the length of the initial transcribed exon is conserved between D. melanogaster and D. biarmipes
blastn: D. mel: Rad23:1 (Query) vs. contig19 (Sbjct)
Query start: 6
Extrapolate TSS position:
28,941-5 = 28,936
Core promoter motifs can affect gene expression levels
Juven-Gershon T, et al. Rational design of a super core promoter that enhances gene expression. Nat Methods. 2006 Nov;3(11):917-22.
SCP1:
Use core promoter motifs to support TSS annotations
Some sequence motifs are enriched in the region (~300 bp) surrounding the TSS
Some motifs (e.g., Inr, TATA) are well-characterizedOther motifs are identified based on computational analysis
Presence of core promoter motifs can be used to support the TSS annotations
Inr motif (TCAKTY) overlaps with the TSS (-2 to +4)
Absence of core promoter motifs is a negative result
Most D. melanogaster TSS do not contain the Inr motif
Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs
Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087.
TATA box
Initiator (Inr)
Available under “Projects” “Annotation Resources” “Core Promoter Motifs” on the GEP web site: http://gander.wustl.edu/~wilson/core_promoter_motifs.html
Core Promoter Motifs tracks
Show core promoter motif matches for each contig
Separated by strand
Visualize matches to different core promoter motifsUse UCSC Table Browser (or other means) to export the list of motif matches within the search region
DEMO: Use the Inr motif to support the TSS position of Rad23
RNA PolII ChIP-Seq tracks(available for D. biarmipes, D. elegans, and D. ficusphila)
Show regions that are enriched in RNA Polymerase II compared to input DNA
Gene Models
RNA PolII Peaks
RNA PolII Enrichment
RNA-Seq
Using RNA-Seq and RNA PolII ChIP-Seq data to define the TSS search region
Narrow TSS search region
D. mel Transcripts
RNA PolII PeaksRNA PolII
Enrichment
RNA-Seq
TSS annotation for Rad23TSS position: 28,936
Conservation with D. melanogaster blastn search of initial exon “D. mel Transcripts” track
Location of the Inr motif
TSS search region: 28,716-28,936Enrichment of RNA PolII upstream of the TSS positionRNA-Seq read coverage upstream of the TSS positionSearch region defined by the extent of the RNA PolII peak
TSS annotation resourcesWalkthroughs:
Annotation of Transcription Start Sites in DrosophilaSample TSS report for onecut
Reference:TSS Annotation Workflow
GEP Annotation Report:Classify the type of core promoterEvidence that supports or refutes the TSS annotationDistribution of core promoter motifs
Additional TSS annotation resources
The D. melanogaster gene annotations are the primary source of evidenceResources that could be useful if the D. melanogaster evidence is ambiguous
Whole genome alignments of 26 Drosophila speciesPhastCons and PhyloP conservation scores
Genome browsers for 26 Drosophila speciesRNA Pol II ChIP-Seq (D. biarmipes, D. elegans, and D. ficusphila)CAGE data for D. pseudoobscuraRNA-Seq coverage, splice junctions, assembled transcriptsGnomon and Augustus gene predictions
Tree scale: 0.1
Published
GEP projects
Reference
Phylogenetic tree for 26 Drosophila species
Additional species
D. serrata
D. s
uzuk
ii
D. m
irand
a
D. a
rizon
aeD. navojoa
D. busckiiD. takahashii
Genome alignments of 25 Drosophila species against D. melanogaster
Drosophila Chain/Net:
Use the conservation tracks to identify regions under selection
PhyloP scores:
Under negative selection
Fast-evolving
phastCons
PhyloP
Conserved Elements
Examine the ROAST alignments to identify the orthologous TSS regions
TSS annotation summary
Most of the D. melanogaster core promoters have multiple TSS
Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogasterDefine search regions that might contain TSS
Use multiple lines of evidence to infer the TSS region
Identify initial exonRNA-Seq coverageblastn (change search parameters)
Distribution of core promoter motifs (e.g., Inr)RNA PolII ChIP-Seq peaksMaintain conservation compared to D. melanogaster
Questions?
Structure of a typical mRNA
Pesole G. et al. Untranslated regions of mRNAs. Genome Biology. 2002: 3(3) reviews0004.1-reviews0004.10.
D. ananassae and D. melanogaster F element genes show similar range of expression levels
Chen Z-X, et al. Genome Res. 24:1209-1223
D. ananassae Adult Females
CAI (Codon Bias)
Expression Levels (rlog)
LOESS Regression Line
F elementAdult MalesAdult Females
D. mel: F (modENCODE)
D. ana: F (modENCODE) 4R4L
4
Phylogenetic tree based on the analysis of 13 Type IIB restriction endonucleases
Simulate restriction digests of 21 genomesDNA fragments range from 21-33 bp in sizeCalculate distance between two genomes based on number of shared fragments
D. melanogaster
D. simulansD. sechellia
D. yakuba
D. erecta
D. ficusphila
D. eugracilisD. biarmipesD. takahashiiD. elegansD. rhopaloa
D. kikkawaiD. ananassaeD. bipectinata
D. santomea
D. persimilis
D. willistoni
D. mojavensisD. virilis
D. grimshawi
D. pseudoobscuraSeetharam AS and Stuart GW. Whole genome phylogeny for 21 Drosophila species using predicted 2b-RAD fragments. PeerJ. 2013 Dec 23;1:e226.
RAMPAGE protocol
Signals in the FlyBase RAMPAGE and MachiBase TSS tracks are off by one base
https://wiki.flybase.org/wiki/FlyBase:GBrowse_Tracks#Aligned_Evidence
“FlyBase: GBrowse Tracks” page on the FlyBase Wiki
DEMOblastn search of the initial transcribed exon of Rad23 against D. biarmipes contig19
Optimize alignment parameters based on expected levels of conservation
Derive alignment scores using information theory
Relative entropy of target and background frequencies
Match +2, Mismatch -3 optimized for 90% identityMatch +1, Mismatch -1 optimized for 75% identity
States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. Methods. 1991 3:66-70.
Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS
April 2013 (BCM-HGSC/Dbia_2.0) assemblySearch for orthologous regions in D. elegans
Conservation tracks on the D. melanogaster GEP UCSC Genome Browser
Whole genome alignments of multiple Drosophila species
Drosophila Chain/Net composite track
Generate multiple sequence alignments from these pairwise alignments (ROAST) Identify conserved regions from Multiz alignments
PhastCons: identify conserved elementsPhyloP: measure level of selection at each nucleotide
Multiz alignment of 27 insect species available on the official UCSC Genome Browser
Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) assembly
Improve the multiple sequence alignments for Drosophila
Use the most recent version of the RefSeq assemblies
UCSC uses draft assemblies instead of the CAF1 assemblies
Species-specific repeat maskingUse more sensitive alignment tools
LAST, ROAST
Manual optimizations of alignment parameters
Recommended