Genome project
• Genome projects have generally become small-scale affairs that are often carried out by an individual laboratory.
• Genome annotation:
– gene prediction & functional annotation
[Figure: workflow. Sequence → Assembly → Genome annotation → Downstream analysis → Biological significance]
Eukaryotic genome annotation
Sequencing has become quick and cheap, but annotation has become more challenging.
• Shorter read length of NGS
• The contents of genomes are often terra incognita
Genome annotation
1. General considerations about genes and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation
• Prokaryote versus Eukaryote versus Organelle
• Genome size:
– Number of chromosomes
– Number of base pairs
– Number of genes
• GC/AT relative content
• Repeat content
• Genome duplications and polyploidy
• Gene content
See: Genomes, 2nd edition, Terence A. Brown. ISBN-10: 0-471-25046-5
See NCBI Bookshelf: http://www.ncbi.nlm.nih.gov/books/NBK21128/
General Variables of Genomes
Size
• Eukaryote: large (10 Mb – 100,000 Mb). There is generally no relationship between organism complexity and genome size (many plants have larger genomes than human!)
• Prokaryote: generally small (<10 Mb; most <5 Mb). Complexity (as measured by number of genes and metabolism) is generally proportional to genome size
Content
• Eukaryote: most DNA is non-coding
• Prokaryote: DNA is “coding gene dense”
Telomeres/Centromeres
• Eukaryote: present (linear DNA)
• Prokaryote: circular DNA, so no telomeres are needed; no mitosis, hence no centromeres
Number of chromosomes
• Eukaryote: more than one, (often) including those discriminating sexual identity
• Prokaryote: often one, sometimes more; plasmids are not true chromosomes
Chromatin
• Eukaryote: histone bound (which serves as a genome regulation point)
• Prokaryote: no histones; uses supercoiling to pack the genome
Eukaryote versus Prokaryote Genomes
Genes
• Eukaryote: often have introns; intraspecific gene order and number generally relatively stable; many non-coding (RNA) genes; there is NOT generally a relationship between organism complexity and gene number
• Prokaryote: no introns; gene order and number may vary between strains of a species
Gene regulation
• Eukaryote: promoters, often with distal long-range enhancers/silencers, MARs, transcriptional domains; generally monocistronic
• Prokaryote: promoters; enhancers/silencers rare; genes often regulated as polycistronic operons
Repetitive sequences
• Eukaryote: generally highly repetitive, with genome-wide families from transposable element propagation
• Prokaryote: generally few repeated sequences; relatively few transposons
Organelle (subgenomes)
• Eukaryote: mitochondrial (all); chloroplasts (in plants)
• Prokaryote: absent
Eukaryote versus Prokaryote Genomes
• Physical:
– Amount of DNA / number of base pairs
– Number of chromosomes/linkage groups
– Information resources:
• NCBI: http://www.ncbi.nlm.nih.gov/genome
• Animals: http://www.genomesize.com/
• Plants: http://data.kew.org/cvalues/
• Fungi: http://www.zbi.ee/fungal-genomesize/
• Genetic:
– Number of genes in the genome
Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.
Genome Size
Species Type of organism Genome size (kb)
Mitochondrial genomes
Plasmodium falciparum Protozoan (malaria parasite) 6
Chlamydomonas reinhardtii Green alga 16
Mus musculus Vertebrate (mouse) 16
Homo sapiens Vertebrate (human) 17
Metridium senile Invertebrate (sea anemone) 17
Drosophila melanogaster Invertebrate (fruit fly) 19
Chondrus crispus Red alga 26
Aspergillus nidulans Ascomycete fungus 33
Reclinomonas americana Protozoa 69
Saccharomyces cerevisiae Yeast 75
Suillus grisellus Basidiomycete fungus 121
Brassica oleracea Flowering plant (cabbage) 160
Arabidopsis thaliana Flowering plant (thale cress) 367
Zea mays Flowering plant (maize) 570
Cucumis melo Flowering plant (melon) 2500
Chloroplast genomes
Pisum sativum Flowering plant (pea) 120
Marchantia polymorpha Liverwort 121
Oryza sativa Flowering plant (rice) 136
Nicotiana tabacum Flowering plant (tobacco) 156
Chlamydomonas reinhardtii Green alga 195
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511
Size of Organelle Genomes
DOGMA is for annotating plant chloroplast and animal mitochondrial genomes.
Species DNA molecules Size (Mb) Number of genes
Escherichia coli K-12 One circular molecule 4.639 4397
Vibrio cholerae El Tor N16961 Two circular molecules
Main chromosome 2.961 2770
Megaplasmid 1.073 1115
Deinococcus radiodurans R1 Four circular molecules
Chromosome 1 2.649 2633
Chromosome 2 0.412 369
Megaplasmid 0.177 145
Plasmid 0.046 40
Borrelia burgdorferi B31 Seven or eight circular molecules, 11 linear molecules
Linear chromosome 0.911 853
Circular plasmid cp9 0.009 12
Circular plasmid cp26 0.026 29
Circular plasmid cp32* 0.032 Not known
Linear plasmid lp17 0.017 25
Linear plasmid lp25 0.024 32
Linear plasmid lp28-1 0.027 32
Linear plasmid lp28-2 0.030 34
Linear plasmid lp28-3 0.029 41
Linear plasmid lp28-4 0.027 43
Linear plasmid lp36 0.037 54
Linear plasmid lp38 0.039 52
Linear plasmid lp54 0.054 76
Linear plasmid lp56 0.056 Not known
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524
Size of Prokaryote Genomes
Species Genome size (Mb)
Fungi
Saccharomyces cerevisiae 12.1
Aspergillus nidulans 25.4
Protozoa
Tetrahymena pyriformis 190
Invertebrates
Caenorhabditis elegans 97
Drosophila melanogaster 180
Bombyx mori (silkworm) 490
Strongylocentrotus purpuratus (sea urchin) 845
Locusta migratoria (locust) 5000
Vertebrates
Takifugu rubripes (pufferfish) 400
Homo sapiens 3200
Mus musculus (mouse) 3300
Plants
Arabidopsis thaliana (thale cress) 125
Oryza sativa (rice) 430
Zea mays (maize) 2500
Pisum sativum (pea) 4800
Triticum aestivum (wheat) 16 000
Fritillaria assyriaca (fritillary) 120 000
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471
Size of Eukaryote Genomes
http://en.wikipedia.org/wiki/Genome_size
http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes
Genome size
Species Ploidy Chromosomes Size (Mb) No. genes
Saccharomyces cerevisiae 2 16 12 6,281
Plasmodium falciparum 2 14 23 5,509
Caenorhabditis elegans 2 6 100 21,175
Drosophila melanogaster 2 6 139 15,016
Oryza sativa 2 12 410 30,294
Canis lupus familiaris 2 39 2,445 24,044
Homo sapiens 2 24 3,100 36,036
Zea mays 2 10 2,046 42,000-56,000 (*)
Protopterus aethiopicus ? ? 130,000 ?
Paris japonica 8 40 150,000 ?
Polychaos dubium ? ? 670,000 ?
http://www.ncbi.nlm.nih.gov/genome
(*) Haberer et al. Structure and architecture of the maize genome. Plant Physiol. 2005 Dec;139(4):1612-24
Number of Genes
• Regional variation correlates with genomic content and function, such as transposable element distribution, gene density, gene regulation, methylation, etc.
• Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)
Species GC%
Streptomyces coelicolor A3(2) 72
Plasmodium falciparum 20
Arabidopsis thaliana 36
Saccharomyces cerevisiae 38
Homo sapiens 41 (35 – 60)
Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009
AT/GC content
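The GC percentages above, and the regional variation the slide mentions, can be illustrated with a minimal Python sketch. The function names, window size and toy sequence are assumptions made for illustration only; they do not come from any tool named in this deck.

```python
def gc_percent(seq):
    """Return GC content as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def windowed_gc(seq, window, step):
    """Yield (start, GC%) for sliding windows, to expose regional variation."""
    for start in range(0, max(len(seq) - window + 1, 1), step):
        yield start, gc_percent(seq[start:start + window])


# Toy sequence (invented): an AT-rich half followed by a GC-rich half.
example = "ATATATATGCGCGCGC"
print(gc_percent(example))
print(list(windowed_gc(example, window=8, step=8)))
```

The whole-sequence value hides the fact that the two halves differ completely, which is the kind of regional variation that can bias library preparation and sequencing.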
• Large genomes generally reflect evolutionary expansion of large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)
• Repeats drive genome mutational processes:
– Recombination resulting in insertion, deletion, translocation, segmental duplication of DNA
– Insertional mutagenesis, possibly including de novo creation of genes
– Insertion of novel regulatory signals
• Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic as transposons mimic gene structures.
Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 8:241-59.
Repeat Content
• Segmental duplications (i.e. by recombination)
– Tandem: direct and inverted
• Whole genome duplication & loss, e.g.
– Ancestral vertebrate: 2 rounds (HOX gene clusters…)
• Polyploidy: ~70% of all angiosperms
– Genomic hybridization (allopolyploids)
– Can lead to immediate and extensive changes in gene expression
– Mapping of homeologous gene loci can be tricky
Dehal P and Boore JL. 2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10): e314. doi:10.1371
Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141
Genome Duplications/Polyploidy
• All of these genomic variables:
– Type of organism, i.e. prokaryote versus eukaryote
– Genome size
– GC/AT relative content
– Repeat content
– Genome duplications and polyploidy
– Gene content
are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.
The bottom line
Composition of human genome
Human genome > 3000 Mb
• Genes & gene-related sequences: 1200 Mb
– Exons: 48 Mb
– Related sequences: 1152 Mb (pseudogenes, gene fragments, introns & UTRs)
• Intergenic DNA: ~2000 Mb
– Genome-wide repeats: 1400 Mb (LINEs 640 Mb, SINEs 420 Mb, LTR elements 250 Mb, transposons 90 Mb)
– Microsatellites: 90 Mb
– Others: >500 Mb
46% of human genome is repeats
• Classic approach: search against repeat libraries
• RepeatMasker: http://www.repeatmasker.org/
– Uses a previously compiled library of repeat families
– Uses a (user-configured) external sequence search program
– Computationally intensive but…
– …the project web site also provides “pre-masked” genomic data for many completed genomes, complete with some statistical characterization.
Genomic (DNA) Sequence Repeat Masking
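To make the idea of masking concrete, here is a minimal Python sketch of what happens once repeat coordinates are known. It is not RepeatMasker's algorithm or output format: the 0-based, half-open interval list and both function names are hypothetical, and the repeat search itself is assumed to have been done already.

```python
def soft_mask(seq, repeats):
    """Lower-case the bases inside each 0-based, half-open (start, end) interval."""
    chars = list(seq.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(seq, repeats):
    """Replace the bases inside each interval with N."""
    chars = list(seq.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


# Invented toy sequence with one assumed repeat at positions 2..4.
print(soft_mask("ACGTACGTAC", [(2, 5)]))
print(hard_mask("ACGTACGTAC", [(2, 5)]))
```

Soft-masking keeps the sequence information, so downstream tools can still extend alignments through repeats, whereas hard-masking discards it.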
• de novo identification and classification:
– RECON: http://www.genetics.wustl.edu/eddy/recon
– RepeatGluer: http://nbcr.sdsc.edu/euler/
– PILER: http://www.drive5.com/piler
• Repeat databases:
– Repbase: http://www.girinst.org/repbase/index.html
– plants: http://plantrepeats.plantbiology.msu.edu/
• Related algorithms:
– “probability clouds”: Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83
More Repeat Masking …
• Review the differences in prokaryotic and eukaryotic gene organization.
• Understand the consequences and challenges for gene-finding algorithms for prokaryotes and eukaryotes.
• Appreciate HMMs as a powerful tool (in many areas of computational biology!)
• Be reminded that not all genes encode proteins, and that predictions of such genes have their own computational challenges.
Objectives
• Which genes are present?
• How did they get there (evolution)?
• Are the genes present in more than one copy?
• Which genes are not there that we would expect to be present?
• What order are the genes in, and does this have any significance?
• How similar is the genome of one organism to that of another?
Genome annotation questions
• Whole-genome annotation
– A genome sequence does not give you a list of all genes
• Fully characterizing Yfg (“your favourite gene”)
– Example: a disease is associated with a SNP at a location in the human genome. BLAST finds similarity to a protein-coding gene in the area, but it is only similar to part of the whole protein. What is the whole gene?
Why gene finding?
After completing the human genome we faced 3 gigabases of this
Not immediately apparent where the genes are…
Prokaryotes
• High gene density
• mRNA transcription and translation are coupled
• Genes are usually contiguous stretches of coding DNA
• mRNAs are often polycistronic
Eukaryotes
• Low gene density
• mRNA is transcribed, then transported to the cytoplasm for translation
• Genes' coding DNA is often split by non-coding introns
• mRNAs are generally monocistronic
[Figure: gene and transcript layout in prokaryotes versus eukaryotes]
Great real-time transcription-translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls
Raw Biological Materials
• 2000: must be at least 100,000 (Rice has ~40,000, C. elegans has ~19,000)
• 2001: only 35,000?
• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)
• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)
• Today: Ensembl 64 release, Sept 2011: 20,900 coding genes + 14,266 RNA genes, but with alternative splicing these likely produce >100,000 proteins (178,191 currently annotated in Ensembl)
How many genes are in the human genome?
1 gene in how many base pairs?
a. 1:10,000,000
b. 1:1,000,000
c. 1:100,000 (roughly, for human)
d. 1:10,000 (1:5,000 for C. elegans)
e. 1:1,000 (roughly, for most bacteria)
f. 1:100
g. 1:10
Gene density
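The quiz answers above can be checked with simple arithmetic on the genome sizes and gene counts given in the earlier tables of this deck. The sketch below is only that arithmetic; the dictionary and function name are illustrative.

```python
# (genome size in Mb, number of genes), as listed in the tables earlier in this deck
genomes = {
    "Escherichia coli K-12": (4.639, 4397),
    "Caenorhabditis elegans": (100, 21175),
    "Homo sapiens": (3100, 36036),
}


def bp_per_gene(size_mb, n_genes):
    """Average number of base pairs per gene."""
    return size_mb * 1_000_000 / n_genes


for species, (size_mb, n_genes) in genomes.items():
    print(f"{species}: about 1 gene per {bp_per_gene(size_mb, n_genes):,.0f} bp")
```

These work out to roughly 1,000 bp per gene for E. coli, roughly 5,000 bp for C. elegans and somewhat under 100,000 bp for human, in line with answers e, d and c.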
ab initio gene predictors
Evidence-drivable gene predictor
Annotation pipeline & browser
• Identify repetitive sequences
• Identify structural RNA-encoding genes (by comparison to known rRNA/tRNA sequences)
• Identify protein-encoding genes
• Identify functions of these genes
Steps in genome annotation
Identifying ORFs
• Relatively easy in bacteria: the sequence is scanned for ORFs (sequences between a start and a stop codon) longer than a fixed length
• More complicated in eukaryotes because of introns.
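The bacterial case described above can be sketched in a few lines of Python. This is a simplified illustration, not the method of any gene finder named in this deck: it assumes ATG as the only start codon, scans only the forward strand, and uses an arbitrary minimum length; the function name and toy sequence are invented.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=30):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and half-open, and the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (invented): ATG AAA TTT GGG TAA embedded in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC", min_len=9))
```

A complete scan would repeat this on the reverse complement. In eukaryotes the same approach fails because introns interrupt the reading frame.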
Exons and Introns
• The size distribution of exons varies according to position in the gene. It is also quite different between plants and animals.
• Exons are generally shorter than prokaryotic ORFs, and can be as short as 10 bp.
• Introns can be incredibly long, with some human introns over 400,000 bp. The minimum size is about 50 bp.
• Many genes have alternative splicing patterns: a sequence that is an exon in one tissue might be an intron in another tissue.
Splicing consensus sequences
• 5′ splice site – GU
• 3′ splice site – AG
• 5′-UACUAAC-3′ sequence between 18 and 140 bases upstream of the 3′ splice site (yeast).
• Second type of intron (quite rare): 5′ splice site – AU, 3′ splice site – AC.
Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to “learn” how to find a pattern.
A common machine learning approach used in gene discovery (and many other bioinformatics applications) is hidden Markov models (HMMs).
ab initio gene discovery approaches
An example state diagram for an HMM for gene discovery
[Figure: HMM state diagram. States: begin gene region, 5' UTR, start translation, single exon, initial exon, donor splice site, intron, acceptor splice site, exon, final exon, stop translation, 3' UTR, end gene region. Emitted symbols: A, T, G, C.]
Use a training set of known genes (from the same or a closely related species) to determine transition and emission probabilities.
ab initio gene discovery—HMMs
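How an HMM turns transition and emission probabilities into a labelling of the sequence can be shown with the Viterbi algorithm on a deliberately tiny model. The two states and every probability below are invented for illustration; a real gene finder uses many more states (as in the state diagram) and parameters estimated from a training set.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for an observation sequence."""
    # best[t][s] = log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (all numbers assumed): AT-rich "intergenic" versus GC-rich "exon".
states = ("intergenic", "exon")
start_p = {"intergenic": 0.5, "exon": 0.5}
trans_p = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
emit_p = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "exon": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
print(viterbi("ATATGCGCGC", states, start_p, trans_p, emit_p))
```

Working in log space avoids numerical underflow on long sequences, which matters once the observation is a chromosome rather than ten bases.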
• Combine gene models with alignments to known ESTs and protein sequences
• EST sequences/RNA-Seq data used for training set/consensus gene model.
Evidence based Approaches
E.g., tRNA, rRNA, miRNA, various other ncRNAs
Harder to find than protein-coding genes. Why?
• Often not poly-A tailed, so they don't end up in cDNA libraries
• No ORF structure
• Constraint on sequence divergence is at the nucleotide level, not the protein level.
How do we find these?
• secondary structure
• homology, especially alignment of related species
• experimentally: isolation through non-polyA-dependent cloning methods; microarrays
Finding non-protein-coding genes
• Standard types of evidence for validation of predictions include:
– match to previously annotated cDNA
– match to EST from same organism
– similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank
– protein structure prediction match to a PFAM domain
ab initio gene discovery—validating predictions and refining gene models
• Three commonly used measures of gene-finder performance are sensitivity, specificity and accuracy. (Genomics, 1996).
SN = TP / (TP + FN)
SP = TP / (TP + FP)
AC = (SN + SP) / 2
Annotation edit distance (AED): AED = 1 – AC
How gene prediction accuracies are calculated
• Sensitivity (SN) is the fraction of the reference feature that is predicted by the gene predictor
• Specificity (SP) is the fraction of the prediction that overlaps the reference feature
• Accuracy (AC): SN and SP are often combined into a single measure called accuracy
TP = true positive; FN = false negative; FP = false positive
SN = TP / (TP + FN)
SP = TP / (TP + FP)
AC = (SN + SP) / 2
[Figure: reference and predicted exon structures, with segment lengths of 50, 75 and 100 bp]
Worked example:
TP = 75 + 50 = 125; FN = 25 + 50 = 75; FP = 0
SN = 125 / (125 + 75) = 0.625
SP = 125 / (125 + 0) = 1
AC = (0.625 + 1) / 2 = 0.8125
Annotation edit distance (AED): AED = 1 – AC = 1 – 0.8125 ≈ 0.19
How gene prediction accuracies are calculated
Values in parentheses are at the exon level.
AED = 0 indicates that the annotation is in perfect agreement with its evidence, whereas AED = 1 indicates a complete lack of evidence support for the annotation.
Annotation edit distance (AED)
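The SN, SP, AC and AED formulas can be checked at the nucleotide level with a short Python sketch. The exon coordinates below are assumptions chosen only so that the counts match the slide's worked example (TP = 125, FN = 75, FP = 0); they are not read from the original figure, and the function names are illustrative.

```python
def positions(intervals):
    """Expand 0-based, half-open (start, end) intervals into a set of positions."""
    return {p for start, end in intervals for p in range(start, end)}


def accuracy_measures(reference, prediction):
    """Return (SN, SP, AC, AED) for a prediction against a reference annotation."""
    ref, pred = positions(reference), positions(prediction)
    tp = len(ref & pred)   # predicted and in the reference
    fn = len(ref - pred)   # in the reference but missed
    fp = len(pred - ref)   # predicted but not in the reference
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    ac = (sn + sp) / 2
    return sn, sp, ac, 1 - ac


# Assumed coordinates: two 100 bp reference exons, predicted as 75 bp and 50 bp.
reference = [(0, 100), (200, 300)]
prediction = [(25, 100), (200, 250)]
sn, sp, ac, aed = accuracy_measures(reference, prediction)
print(sn, sp, ac, round(aed, 2))
```

Note that SP here follows the slide's definition, TP / (TP + FP), which in other fields is usually called precision.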
Nature Reviews, May 2012
Gene prediction & gene annotation
High-quality draft genome
• Obtaining a high-quality draft assembly is a first achievable goal for most genome projects.
– Scaffold and contig N50s
• larger than gene size
– Percent gaps
– Percent coverage
• Genome coverage of 90–95% is generally considered to be good, as most genomes contain a considerable fraction of repetitive regions that are difficult to sequence.
When do we start the annotation process?
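Since scaffold and contig N50 are the first assembly statistics to check before annotating, here is a minimal Python sketch of the calculation. It uses the common definition (the length L such that contigs of length at least L contain half or more of the assembled bases); the function name and the example lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Invented lengths: total 300, so the running sum passes 150 at the second contig.
print(n50([100, 80, 60, 40, 20]))
```

If the N50 is smaller than a typical gene, many genes will be split across contigs and cannot be annotated as complete models.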
MAKER
GeneMark-ES → maker1 → SNAP 1st → SNAP 2nd → maker2 → Annotation result
• Repeats from RepeatMasker and the MAKER internal RepeatRunner
• EST alignments from both EXONERATE and BLASTN
• Protein alignments from EXONERATE and BLASTX
• ab initio gene predictions from SNAP, Augustus, FGENESH, and GeneMark …
• Final gene models from MAKER
Maker2 annotation pipeline
• Requirements:
– Genome assembly (nucleotide FASTA file)
– CDSs (ESTs or RNA-Seq assembly) from the same species, if possible
– Protein set from a closely related species, if possible
– MAKER2 pipeline from http://www.yandell-lab.org/software/maker.html
– GeneMark-ES gene finder from http://exon.gatech.edu/license_download.cgi
Maker2 annotation pipeline
Step 0: GeneMark-ES prediction. Elapsed time: 1:45:08
Step 1: Maker1 prediction. Elapsed time: 13:39:08
Step 2: SNAP1 prediction. Elapsed time: 13:48:30
Step 3: SNAP2 prediction. Elapsed time: 13:50:22
Step 4: Maker2 prediction. Elapsed time: 14:51:44
Elapsed time of whole pipeline: 57:54:56
Run time with cpu=4 on a 32 Mb genome
MAKER PIPELINE
Predictor Gene count
Augustus 7,641
GeneMark-ES 9,637
FGENESH 7,302
SNAP 9,579
After MAKER:
maker 7,050
non_overlapping_ab_initio 2,938
statistics of Gene model
1. BLAST hits of “non_overlapping_ab_initio” against nr (with E-value ≤ 10^-10)
2. Swiss-Prot, which is manually annotated and reviewed.
– Release 2013_10 of 16-Oct-13 of UniProtKB/Swiss-Prot contains 541,561 sequence entries, comprising 192,480,382 amino acids abstracted from 223,284 references.
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
Add other protein datasets
Predictor Gene count
Augustus 7,641
GeneMark-ES 9,637
FGENESH 7,302
SNAP 9,549
After MAKER:
maker 8,088
non_overlapping_ab_initio 1,742
statistics of Gene model
MAKER-generated annotations, shown in Apollo
Way of representing gene structure: http://www.sequenceontology.org/gff3.shtml
gff3 file
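As an illustration of how gene structure is laid out in GFF3, the sketch below parses one feature line into its nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The parser is a minimal sketch rather than a validator, and the example line (scaffold name, coordinates, ID) is invented for illustration, not taken from a real MAKER run.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    record["start"] = int(record["start"])  # GFF3 coordinates are 1-based, inclusive
    record["end"] = int(record["end"])
    attrs = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[unquote(key)] = unquote(value)  # attribute values are URL-escaped
    record["attributes"] = attrs
    return record


# Invented example line
example = "scaffold_1\tmaker\tgene\t1200\t3400\t.\t+\t.\tID=gene0001;Name=hypothetical gene"
rec = parse_gff3_line(example)
print(rec["type"], rec["start"], rec["end"], rec["attributes"]["ID"])
```

Child features such as mRNA, exon and CDS refer back to their parent through a Parent attribute, which is how the file encodes the gene → transcript → exon hierarchy.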
http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
Online Validator
• MAKER's AED score (examples shown: AED = 0 and AED = 0.19)
Annotation edit distance (AED): AED = 0 indicates that the annotation is in perfect agreement with its evidence. AED = 1 indicates a complete lack of evidence support for the annotation.
Prediction accuracy?
ANNOTATION
Predictor Gene count
Augustus 7,641
GeneMark-ES 9,637
FGENESH 7,302
SNAP 9,549
After MAKER:
maker 8,088
non_overlapping_ab_initio 1,742
Genome annotation