Chap 9. Gene Discovery


DNA → RNA → protein

RNA → cDNA → EST (Expressed Sequence Tag)

Gene Discovery

- A major application of bioinformatics
- Matching known patterns of genes
- A gene = Promoter + 5’ UTR + protein-coding sequence + 3’ UTR
- The coding sequence starts with ATG and stops with TAG, TGA, or TAA
- The coding sequence is called an open reading frame (ORF)


Gene Structure

ORF (Open Reading Frame): DNA can encode six proteins, one per reading frame

Forward strand, frames 1-3:
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT

Reverse strand, frames 1-3:
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
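The six reading frames can be enumerated with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names (`reverse_complement`, `codons`, `six_frames`) and the frame labels `+1..+3` / `-1..-3` are illustrative choices, and only unambiguous bases A, C, G, T are handled.

```python
# Complement table for unambiguous DNA bases (assumption: no N or IUPAC codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def codons(seq: str, offset: int) -> list[str]:
    """Split seq into complete codons starting at offset 0, 1, or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq: str) -> dict[str, list[str]]:
    """Map frame labels (+1..+3 forward, -1..-3 reverse) to codon lists."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(seq, offset)
        frames[f"-{offset + 1}"] = codons(rc, offset)
    return frames

# Top strand from the slide, 5'->3'.
top = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
frames = six_frames(top)
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    # Show only the first two codons of each frame, as on the slide.
    print(name, " ".join(frames[name][:2]))
```

The first two codons printed for each frame correspond to the six frame starts listed on the slide.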

Transcription

- The gene sequence is copied from one strand
- Sense strand = mRNA sequence (with U in place of T)
- Antisense strand is used as the template to generate the mRNA sequence

5’ CGCTATAGCGTTTCAT 3’ -- antisense, template strand
3’ GCGATATCGCAAAGTA 5’ -- sense, coding strand
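As a small illustration of the strand relationship, the sketch below derives the mRNA from the template strand shown on the slide. The function name `transcribe_from_template` is hypothetical, and the code assumes the template is given 5'->3' with unambiguous bases only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def transcribe_from_template(template_5to3: str) -> str:
    """Return the mRNA (5'->3') made from a template strand written 5'->3'."""
    # The reverse complement of the template is the sense (coding) strand.
    coding = template_5to3.translate(COMPLEMENT)[::-1]
    # The mRNA has the same sequence as the sense strand, with U instead of T.
    return coding.replace("T", "U")

template = "CGCTATAGCGTTTCAT"  # antisense strand from the slide, 5'->3'
mrna = transcribe_from_template(template)
print(mrna)  # AUGAAACGCUAUAGCG
```

Read 5'->3', the sense strand on the slide is ATGAAACGCTATAGCG, so the mRNA begins with the start codon AUG.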

[Figure: sense strand and template (antisense) strand]

Transcription initiation

- Double-helix DNA strands are separated in the gene coding region
- Which enzyme detects the beginning of a gene?
  - RNA polymerase (a multi-subunit enzyme that synthesizes RNA) binds to the promoter
    - RNA polymerase I – 28S, 5.8S and 18S rRNA genes
    - RNA polymerase II – coding genes, snRNA
    - RNA polymerase III – tRNA, 5S rRNA, snoRNA
  - Other enzymes
    - General (basal) transcription factors (GTFs): TFIIA, TFIIB, TFIID
    - TFIID – recognizes the promoter sequence

http://www.youtube.com/watch?v=MkUgkDLp2iE


Promoter in E.coli


Transcription initiation in E.coli

Transcription initiation in eukaryotes

- Promoter consists of
  - the -25 or TATA box (TATAWAW; W = A or T)
  - and the Inr (initiator) sequence (YYCARR; Y = C or T; R = A or G)

Transcription initiation in eukaryotes

Initial contact is made by the general transcription factor (GTF) TFIID, which consists of the TATA-binding protein (TBP) and at least 12 TBP-associated factors (TAFs).


Transcription Start Site (TSS)
www.cs.uml.edu/~Kim/580/review_polII_11_Kadonaga.pdf

- TSS – the first base copied to mRNA
- Core promoter – region around a TSS
- Conventionally, the core promoter has a TATA box at -30 bp relative to the Inr (Initiator)
- Transcription factors (TFs) bind to the TATA box, the Inr sequence, and other sites; bend the DNA by 90 degrees; and recruit general TFs
- CpG islands: 300-3000 bp regions rich in C and G, found in 40% of promoters
- More recently, the TATA box has been found in only 10-20% of promoters
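To make "rich in C and G" concrete, the sketch below computes two simple statistics for a sequence window: the GC fraction and an observed/expected CpG ratio. The ratio used here (CpG count × length / (C count × G count)) is a commonly used definition and is an assumption on top of the slides, which give no formula; the function names and toy sequences are likewise illustrative.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Assumption: this is one commonly used definition, not taken from the slides.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

# Toy windows invented for illustration.
print(gc_fraction("CGCGCG"), cpg_obs_exp("CGCGCG"))
print(gc_fraction("ATATAT"), cpg_obs_exp("ATATAT"))
```

In practice these statistics would be computed over sliding windows of a few hundred bases, with thresholds chosen by the analyst.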

Core Promoter Elements

- TFIIB Recognition Element (BRE) (SSRCGCC)
  - BREu (BREd) suppresses (enhances) transcription
- TATA box – TATAWAAR (metazoans)
  - W = A or T; R = A or G (purine); Y = T or C (pyrimidine)
- Inr – YYANWYY (the A is position +1)
- DPE (Downstream core Promoter Element)
- MTE (Motif Ten Element)
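Consensus sequences written with IUPAC codes can be turned into regular expressions and scanned against a sequence. The following is a minimal sketch under stated assumptions: the helper names (`consensus_to_regex`, `find_motif`) and the toy sequence are invented for illustration, only the IUPAC codes that appear in the consensus sequences above are mapped, and exact consensus matching is a simplification (real promoter finders typically use position weight matrices).

```python
import re

# IUPAC codes used in the consensus sequences above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}

def consensus_to_regex(consensus: str) -> re.Pattern:
    """Compile an IUPAC consensus into a regex that reports overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus)
    # A lookahead with a capture group lets finditer report overlapping matches.
    return re.compile(f"(?=({body}))")

def find_motif(seq: str, consensus: str) -> list[tuple[int, str]]:
    """Return (0-based position, matched text) for every hit on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]

# Hypothetical toy sequence, invented for illustration only.
toy = "GGGCGCCTATAAAAGGCTCCAGTTCC"
for name, consensus in (("BRE", "SSRCGCC"), ("TATA", "TATAWAAR"), ("Inr", "YYANWYY")):
    print(name, find_motif(toy, consensus))
```

Only the given strand is scanned; searching the reverse complement as well would be a natural extension.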

Focused/Dispersed TSS

- Focused (sharp) TSS
  - Distinct TSS site
  - Usually a TATA box in sharp TSS
  - Primarily in tissue-specific expression
- Dispersed (broad) TSS
  - Multiple weak start sites within 50-100 nt
  - A few Inr or Inr-like sequences in the neighborhood
  - Generally associated with ubiquitously expressed genes
  - Thought to be related to CpG islands


How to recognize the end of transcription?

- Terminator sequence stalls the polymerase

Splicing

- Alternative splicing to produce mRNA
- Spliceosome – a complex of snRNAs and associated proteins

Function of Introns
www.cs.uml.edu/~kim/580/review_intron.pdf

- When inserted into a promoter, introns boost the expression level
- First introns are long
- Alternative exons are flanked by long introns
- But no association between intron length and expression breadth has been found in human
- Removal of the 2nd intron of the human beta-globin gene reduces the efficiency of 3’-end formation
- RNA pol II elongation rate – 3.8 kb/min
  - Introns may serve as time delays between activation of a gene and the appearance of its product
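The time-delay idea follows from simple arithmetic: at 3.8 kb/min, the time to transcribe a region is its length divided by the elongation rate. The sketch below works this out; the function name and the intron lengths are hypothetical values chosen for illustration, and a constant elongation rate is assumed.

```python
ELONGATION_RATE_KB_PER_MIN = 3.8  # RNA pol II rate quoted on the slide

def transcription_minutes(length_bp: int) -> float:
    """Minutes needed to transcribe length_bp bases at a constant rate."""
    return (length_bp / 1000) / ELONGATION_RATE_KB_PER_MIN

# Hypothetical intron lengths in bp, for illustration only.
for length in (1_000, 38_000, 100_000):
    print(length, round(transcription_minutes(length), 1))
```

Under this assumption a 38 kb intron adds about 10 minutes between the start of transcription and the completed transcript.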


Annotation: How do I get from this…

>mouse_ear_cress_1080 AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTGTGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGTTGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGAGAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACCAGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAGATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTTTCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGACTTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAAGCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCCCAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGGAAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTTGGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAATAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTTCTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAAAGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACACTTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTTTACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATATTTGA


…to this?


Meaning?


Comparative Tools (Database searches)

What do we know about genes?

- Expressed (transcribed)
  - Transcriptional start & termination sites (TXSS, TXTS)
  - Transcription artifacts (cDNA & ESTs (Expressed Sequence Tags))
- Regulated
  - Promoters (TATAAA)
  - Transcription factor binding sites
  - CpG
- Meaningful (translated)
  - 3n base pairs
  - Codon usage
  - Translational start & stop/termination codons (TLSS, TLTS)
  - Translation artifacts (proteins)
- Spliced
  - Splice sites (GT-AG)
- Derived (homology: paralogy/orthology)
  - Search for known genes, proteins (BLAST)

How might this knowledge help to find genes?

- Predict genes
  - Look for potential starts and stops.
  - Connect them into open reading frames (ORFs).
  - Filter for “correct” length & codon usage.
- Search databases
  - Known genes: UniGene
  - Known proteins: UniProt
- Use transcript evidence
  - cDNA
  - ESTs (Expressed Sequence Tags)
  - proteins
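The "predict genes" steps (find starts and stops, connect them into ORFs, filter by length) can be sketched as a six-frame ORF scan. This is an illustrative sketch only: `find_orfs` and its `min_codons` parameter are hypothetical names, the length threshold counts the stop codon, introns are ignored, and the codon-usage filter mentioned above is not implemented.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(seq, min_codons=3):
    """Yield (strand, start, end, orf) for ATG...stop ORFs in all six frames.

    start/end are 0-based offsets on the scanned strand (end exclusive);
    the stop codon is included and counted in min_codons.
    """
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    for strand, s in (("+", seq), ("-", reverse)):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ORF in this frame

# Toy sequence invented for illustration.
print(list(find_orfs("CCATGAAATAGGG")))
```

For real genomes the minimum length would be set far higher (often 100 codons or more), since short ORFs occur frequently by chance.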

Canonical splice sites

[Figure: pre-mRNA with Exon, Intron, Exon, marking the 5’ splice site and the 3’ splice site. Reddy, S.N. Annu. Rev. Plant Biol. 2007 58:267-94]

Of 1588 examined predicted splice sites in Arabidopsis, 1470 sites (93%) followed the canonical GT…AG consensus. (Plant (2004) 39, 877–885)
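A canonical intron starts with GT at the 5’ (donor) splice site and ends with AG at the 3’ (acceptor) splice site. The sketch below checks that rule and reproduces the 93% figure from the counts quoted on this slide; the function name and the toy intron sequences are hypothetical.

```python
def is_canonical_intron(intron: str) -> bool:
    """True if the intron (DNA, 5'->3') begins with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

# Toy introns invented for illustration.
print(is_canonical_intron("GTAAGTTTTCAG"))  # True
print(is_canonical_intron("GCAAGTTTTCAG"))  # False

# Share of canonical sites from the counts quoted on the slide.
percent_canonical = round(100 * 1470 / 1588)
print(percent_canonical)  # 93
```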

Alternative Splicing

The primary transcript of a gene is spliced into different mRNAs, leading to multiple proteins generated from the same gene.

- Contributes to protein diversity.
- Can occur in any part of the transcript, including UTRs.
- Can alter start codons, stop codons, reading frame, CDS, UTRs.
- May alter stability/half-life, translation (time, location, duration), protein sequence, or both.

The dogmas – they are a-changing…

One gene, one enzyme

One gene, one polypeptide

One gene, one set of transcripts (> 0)

Alternative splicing in metazoans (Animalia)

• Alternative splicing is well characterized in animals.
• As many as 96% of human genes may have multiple splice forms.
• The functional significance of alternative splicing is still poorly understood.

Alternative splicing in animals. Nature Genetics Research 36; 2004

Bridging the gap between genome and transcriptome Nucleic Acids Research 32, 2004.

Splice statistics for human genes

Alternative splicing in plants

Alternative splicing of rubisco activase was one of the first plant examples:

“The data presented here demonstrate the existence of alternative splicing in plant systems, but the physiological significance of synthesizing two forms of rubisco activase remains unclear. However, this process may have important implications in photosynthesis. If these polypeptides were functionally equivalent enzymes in the chloroplast, there would be no need for the production of both….”

Biological significance of AS in plants includes:

- regulation of flowering;
- resistance to diseases;
- enzyme activity (timing, duration, turn-over time, location).

Most genome databases give alternatively spliced plant gene variants


Example: Jasmonate signaling in Arabidopsis

- Plant hormone; affects cell division, growth, reproduction and responses to insects, pathogens, and abiotic stress factors.

- Jasmonate Signaling Repressor Protein JAZ 10 splice variants JAZ 10.1, JAZ 10.3 and JAZ 10.4 differ in susceptibility to degradation.

- Phenotypic consequences include male sterility and altered root growth.


Example: Jasmonate signaling in Arabidopsis

- Alternative splice sites C’ and D’ lead to different splice variants

- JAZ10.3: premature stop codon in D exon, intact JAS domain

- JAZ10.4: truncated C exon, protein lacks JAS domain

- JAZ10 is encoded by At5G13220


AS in different Reading Frames


Gene Prediction

Gene Prediction Methods

- Intrinsic or template methods (ab initio)
  - Search by signal
    - Signals: short, functional DNA elements involved in gene specification
    - Four basic signals defining coding exons: translation start site, 5’ (donor) splice site, 3’ (acceptor) splice site, stop site
  - Search by content
- Extrinsic or look-up methods
  - Homology-based
    - Compare sequence of interest against known coding sequences
  - Comparative gene prediction
    - Compare sequence of interest against anonymous sequences

Gene Prediction Methods

- Sequence-based
  - Search for ORFs and consensus sequences
- Alignment-based
  - Search for orthologous genes of other organisms
  - Search for strong conservation of a genome region
- Content-based
  - Search for patterns such as nucleotide or codon frequency, characteristic of coding sequences
- Probabilistic
  - Prediction algorithms

Typical Computational Steps in Gene Prediction

- Identify and score suitable splice sites and start/stop signals along the query sequence
- Predict candidate exons as detected by these signals
- Score exons as a function of signals and coding statistics
  - Factor in the quality of alignment between the query and known coding sequences
- Assemble a subset of these exon candidates into a predicted gene structure
  - Assemble so as to maximize a particular scoring function

Prediction and Scoring of Exons

- Protein-coding regions have a characteristic compositional bias
  - e.g., a triplet pattern in coding regions
- The hexamer frequency method with 5th-order Markov models is widely used
  - The likelihood of a particular base at a given position depends on the five preceding bases
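A 5th-order Markov model estimates the probability of each base given the five bases before it, so each scored unit is a hexamer. The sketch below trains one model on coding examples and one on background examples, then scores a query by its log-odds. It is a simplified illustration and not the method of any particular program: the function names, the pseudocount smoothing, and the tiny training strings are all assumptions, and real gene finders typically use frame-specific (3-periodic) models trained on large datasets.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with pseudocount smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def log_odds(seq, coding_prob, background_prob, order=5):
    """Sum of log(P_coding / P_background) over positions with full context."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding_prob(context, base) / background_prob(context, base))
    return score

# Tiny toy training sets, invented for illustration only.
coding_model = train_markov(["ATGGCCGCCGCCGCCGCCGCCTAA"])
background_model = train_markov(["ATATTATAATATTTATAATATTA"])

print(log_odds("GCCGCCGCCGCC", coding_model, background_model))  # positive
print(log_odds("ATATTATAATAT", coding_model, background_model))  # negative
```

A positive score means the query looks more like the coding training data than the background; the magnitude is only meaningful relative to other candidates scored with the same models.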

From Exons to RNA

- Assembly of several exons into a gene
  - Combinatorially difficult
  - Can use dynamic programming
    - GRAIL (Gene Recognition and Analysis Internet Link), FGENESH, GENEID
  - HMM (Hidden Markov Model)
    - GENSCAN
- Sequence similarity-based gene prediction
  - GENEWISE
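Dynamic programming avoids enumerating every combination of candidate exons. The sketch below shows the idea in its simplest form: choose a set of non-overlapping exons with the highest total score (weighted interval scheduling). It is not the algorithm of GRAIL, FGENESH, GENEID, or any other named program; the function name, the (start, end, score) tuple layout, and the toy candidates are assumptions, and reading-frame compatibility between exons is ignored.

```python
from bisect import bisect_right

def assemble_exons(candidates):
    """Pick non-overlapping exons with maximal total score.

    candidates: list of (start, end, score) with end exclusive.
    Returns (best_total_score, chosen_exons_in_order).
    """
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon's start.
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

# Hypothetical exon candidates, invented for illustration.
candidates = [(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (160, 260, 3.0)]
total, chain = assemble_exons(candidates)
print(total, chain)
```

Sorting by end coordinate and reusing the stored optimum for each prefix keeps the work near O(n log n) instead of exponential in the number of candidates.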

How Well Do Predictions Work?

- Sensitivity (Sn) = TP / (TP+FN)
- Specificity (Sp) = TP / (TP+FP)
- Correlation coefficient (CC)
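These measures can be computed directly from the counts of true/false positives and negatives. In the sketch below, Sn and Sp follow the definitions given on the slide. The slide names CC without a formula, so the expression used here, (TP·TN − FN·FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)), is an assumption: it is the form commonly used in nucleotide-level gene-finder evaluations. The counts are hypothetical.

```python
import math

def accuracy(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float, float]:
    """Return (Sn, Sp, CC) from confusion-matrix counts."""
    sn = tp / (tp + fn)  # fraction of true coding bases that were predicted
    sp = tp / (tp + fp)  # fraction of predicted coding bases that are true
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, cc

# Hypothetical nucleotide-level counts, for illustration only.
sn, sp, cc = accuracy(tp=80, fp=20, tn=880, fn=20)
print(round(sn, 3), round(sp, 3), round(cc, 3))
```

Note that Sp as defined here is what other fields call precision; CC combines all four counts into a single value between -1 and 1.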


Accuracy of Gene Finding Programs

• Sanja Rogic, Alan K. Mackworth, and Francis B.F. Ouellette (2001) Genome Research 11


Promoter Analysis

Annotation Cheat Sheet

A. DNA Subway

• Open existing project or generate new (Red square)

• Run RepeatMasker

• Generate evidence (Predictions, BLAST searches)

• Synthesize evidence into gene models (Apollo)

• Browse results locally and in context (Phytozome)

• Conduct functional analysis (link from Browser)

• Prospect for gene family (Yellow Line from Browser)

B. Apollo

• Select region that holds biological gene evidence

• Optimize work space and zoom to region (View tab)

• Expand all tiers (Tiers tab)

• Drag evidence item(s) onto workspace (mouse)

• Edit to match biological evidence (right-click item for tools)

• Record what was done in Annotation Info Editor

• Assess necessity to build alternative model(s)

• Upload model(s) to DNA Subway (File tab)