Genome - unina.it

Genome project

• Genome projects have generally become small-scale affairs that are often carried out by an individual laboratory.

• Genome annotation:

– gene prediction & functional annotation

Figure: workflow from Sequence through Assembly, Genome annotation and Downstream analysis to Biological significance.

Eukaryotic genome annotation

• Sequencing has become quick and cheap, but annotation has become more challenging.
• Shorter read length of NGS
• The contents of genomes are often terra incognita

Genome annotation

1. General considerations about genes and genomes

2. Genome Repeat Masking

3. Gene Finding

4. Gene annotation

• Prokaryote versus Eukaryote versus Organelle
• Genome size:
– Number of chromosomes
– Number of base pairs
– Number of genes
• GC/AT relative content
• Repeat content
• Genome duplications and polyploidy
• Gene content

See: Genomes, 2nd edition, Terence A. Brown. ISBN-10: 0-471-25046-5
See NCBI Bookshelf: http://www.ncbi.nlm.nih.gov/books/NBK21128/

General Variables of Genomes

Size
Eukaryote:
• Large (10 Mb – 100,000 Mb)
• There is generally no relationship between organism complexity and genome size (many plants have larger genomes than human!)
Prokaryote:
• Generally small (<10 Mb; most <5 Mb)
• Complexity (as measured by number of genes and metabolism) is generally proportional to genome size

Content
Eukaryote:
• Most DNA is non-coding
Prokaryote:
• DNA is “coding gene dense”

Telomeres/Centromeres
Eukaryote:
• Present (linear DNA)
Prokaryote:
• Circular DNA, doesn't need telomeres
• No mitosis, hence no centromeres

Number of chromosomes
Eukaryote:
• More than one, (often) including those discriminating sexual identity
Prokaryote:
• Often one, sometimes more, but plasmids are not true chromosomes

Chromatin
Eukaryote:
• Histone bound (which serves as a genome regulation point)
Prokaryote:
• No histones
• Uses supercoiling to pack the genome

Eukaryote versus Prokaryote Genomes

Genes
Eukaryote:
• Often have introns
• Intraspecific gene order and number generally relatively stable
• Many non-coding (RNA) genes
• There is NOT generally a relationship between organism complexity and gene number
Prokaryote:
• No introns
• Gene order and number may vary between strains of a species

Gene regulation
Eukaryote:
• Promoters, often with distal long-range enhancers/silencers, MARs, transcriptional domains
• Generally monocistronic
Prokaryote:
• Promoters
• Enhancers/silencers rare
• Genes often regulated as polycistronic operons

Repetitive sequences
Eukaryote:
• Generally highly repetitive, with genome-wide families from transposable element propagation
Prokaryote:
• Generally few repeated sequences
• Relatively few transposons

Organelle (subgenomes)
Eukaryote:
• Mitochondrial (all)
• Chloroplasts (in plants)
Prokaryote:
• Absent

Eukaryote versus Prokaryote Genomes

• Physical:
– Amount of DNA / number of base pairs
– Number of chromosomes/linkage groups
– Information resources:
• NCBI: http://www.ncbi.nlm.nih.gov/genome
• Animals: http://www.genomesize.com/
• Plants: http://data.kew.org/cvalues/
• Fungi: http://www.zbi.ee/fungal-genomesize/

• Genetic:
– Number of genes in the genome

Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.

Genome Size

Species Type of organism Genome size (kb)

Mitochondrial genomes

Plasmodium falciparum Protozoan (malaria parasite) 6

Chlamydomonas reinhardtii Green alga 16

Mus musculus Vertebrate (mouse) 16

Homo sapiens Vertebrate (human) 17

Metridium senile Invertebrate (sea anemone) 17

Drosophila melanogaster Invertebrate (fruit fly) 19

Chondrus crispus Red alga 26

Aspergillus nidulans Ascomycete fungus 33

Reclinomonas americana Protozoa 69

Saccharomyces cerevisiae Yeast 75

Suillus grisellus Basidiomycete fungus 121

Brassica oleracea Flowering plant (cabbage) 160

Arabidopsis thaliana Flowering plant (thale cress) 367

Zea mays Flowering plant (maize) 570

Cucumis melo Flowering plant (melon) 2500

Chloroplast genomes

Pisum sativum Flowering plant (pea) 120

Marchantia polymorpha Liverwort 121

Oryza sativa Flowering plant (rice) 136

Nicotiana tabacum Flowering plant (tobacco) 156

Chlamydomonas reinhardtii Green alga 195

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511

Size of Organelle Genomes

DOGMA is a tool for annotating plant chloroplast and animal mitochondrial genomes.

Species / DNA molecules   Size (Mb)   Number of genes

Escherichia coli K-12
  One circular molecule   4.639   4397

Vibrio cholerae El Tor N16961 (two circular molecules)
  Main chromosome   2.961   2770
  Megaplasmid   1.073   1115

Deinococcus radiodurans R1 (four circular molecules)
  Chromosome 1   2.649   2633
  Chromosome 2   0.412   369
  Megaplasmid   0.177   145
  Plasmid   0.046   40

Borrelia burgdorferi B31 (seven or eight circular molecules, 11 linear molecules)
  Linear chromosome   0.911   853
  Circular plasmid cp9   0.009   12
  Circular plasmid cp26   0.026   29
  Circular plasmid cp32*   0.032   Not known
  Linear plasmid lp17   0.017   25
  Linear plasmid lp28-1   0.027   32
  Linear plasmid lp56   0.056   Not known

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524

Size of Prokaryote Genomes

Species Genome size (Mb)

Fungi

Saccharomyces cerevisiae 12.1

Aspergillus nidulans 25.4

Protozoa

Tetrahymena pyriformis 190

Invertebrates

Caenorhabditis elegans 97

Drosophila melanogaster 180

Bombyx mori (silkworm) 490

Strongylocentrotus purpuratus (sea urchin) 845

Locusta migratoria (locust) 5000

Vertebrates

Takifugu rubripes (pufferfish) 400

Homo sapiens 3200

Mus musculus (mouse) 3300

Plants

Arabidopsis thaliana (thale cress) 125

Oryza sativa (rice) 430

Zea mays (maize) 2500

Pisum sativum (pea) 4800

Triticum aestivum (wheat) 16 000

Fritillaria assyriaca (fritillary) 120 000

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471

Size of Eukaryote Genomes

http://en.wikipedia.org/wiki/Genome_size
http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes

Genome size

Species Ploidy Cs Size (Mb) No. Genes

Saccharomyces cerevisiae 2 16 12 6,281

Plasmodium falciparum 2 14 23 5,509

Caenorhabditis elegans 2 6 100 21,175

Drosophila melanogaster 2 6 139 15,016

Oryza sativa 2 12 410 30,294

Canis lupus familiaris 2 39 2,445 24,044

Homo sapiens 2 24 3,100 36,036

Zea mays 2 10 2,046 42,000-56,000 (*)

Protopterus aethiopicus ? ? 130,000 ?

Paris japonica 8 40 150,000 ?

Polychaos dubium ? ? 670,000 ?

http://www.ncbi.nlm.nih.gov/genome

(*) Haberer et al. 2005. Structure and architecture of the maize genome. Plant Physiol. Dec;139(4):1612-24

Number of Genes

• Regional variation in GC content correlates with genomic content and function, such as transposable element distribution, gene density, gene regulation, methylation, etc.

• Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)

Species GC%
Streptomyces coelicolor A3(2) 72
Plasmodium falciparum 20
Arabidopsis thaliana 36
Saccharomyces cerevisiae 38
Homo sapiens 41 (35 – 60)

Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009

AT/GC content
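The GC percentages and the regional variation mentioned above can be computed directly from sequence. The following is a minimal sketch, not taken from the slides: the function names and the default window size are illustrative assumptions, and ambiguous bases such as N are simply ignored.

```python
def gc_content(seq: str) -> float:
    """Return the GC fraction of the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases (e.g. all N) get 0.0 by convention here.
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int = 1000, step: int = 1000):
    """Yield (start, GC fraction) for fixed-size windows to expose regional variation."""
    for start in range(0, max(len(seq) - window + 1, 1), step):
        yield start, gc_content(seq[start:start + window])
```

Multiplying the returned fraction by 100 gives values comparable to the GC% column above.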

• Large genomes generally reflect evolutionary expansion of large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)

• Repeats drive genome mutational processes:
– Recombination resulting in insertion, deletion, translocation, segmental duplication of DNA
– Insertional mutagenesis, possibly including de novo creation of genes
– Insertion of novel regulatory signals

• Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic as transposons mimic gene structures.

Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 8:241-59.

Repeat Content

• Segmental duplications (i.e. by recombination)
– Tandem: direct and inverted

• Whole genome duplication & loss, e.g.
– Ancestral vertebrate: 2 rounds
– HOX gene clusters…

• Polyploidy: ~70% of all angiosperms
– Genomic hybridization (allopolyploids)
– Can lead to immediate and extensive changes in gene expression
– Mapping of homeologous gene loci can be tricky

Dehal P and Boore JL. 2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10): e314. doi:10.1371

Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141

Genome Duplications/Polyploidy

• All of these genomic variables:
– Type of organism, i.e. prokaryote versus eukaryote
– Genome size
– GC/AT relative content
– Repeat content
– Genome duplications and polyploidy
– Gene content

are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.

The bottom line

Composition of human genome

Human genome > 3000 Mb
• Genes & gene-related sequences 1200 Mb
– Exons 48 Mb
– Related sequences 1152 Mb (pseudogenes, gene fragments, introns & UTRs)
• Intergenic DNA ~2000 Mb
– Genome-wide repeats 1400 Mb (LINEs 640 Mb, SINEs 420 Mb, LTR elements 250 Mb, transposons 90 Mb)
– Others >500 Mb (including microsatellites 90 Mb)

46% of human genome is repeats


• Classic approach: search against repeat libraries
• RepeatMasker: http://www.repeatmasker.org/
– Uses a previously compiled library of repeat families
– Uses a (user-configured) external sequence search program
– Computationally intensive, but…
– …the project web site also provides “pre-masked” genomic data for many completed genomes, complete with some statistical characterization.

Genomic (DNA) Sequence Repeat Masking
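Once repeat positions are known, masking itself is simple: repeat bases are either lowercased (soft masking) or replaced with N (hard masking) so that gene finders can skip them. The sketch below only illustrates that step. It assumes the repeat intervals have already been obtained and converted to 0-based, half-open (start, end) tuples; it does not reproduce RepeatMasker's search or its output format, and `mask_repeats` is a hypothetical helper name.

```python
def mask_repeats(seq: str, intervals, soft: bool = True) -> str:
    """Mask repeat intervals in seq.

    intervals: iterable of (start, end) pairs, 0-based and end-exclusive
    (an assumed convention for this sketch).
    soft=True lowercases repeat bases; soft=False replaces them with 'N'.
    """
    chars = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)
```

Soft masking keeps the underlying sequence available to downstream tools, whereas hard masking discards it.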


• De novo identification and classification:
– RECON: http://www.genetics.wustl.edu/eddy/recon
– RepeatGluer: http://nbcr.sdsc.edu/euler/
– PILER: http://www.drive5.com/piler

• Repeat databases:
– Repbase: http://www.girinst.org/repbase/index.html
– Plants: http://plantrepeats.plantbiology.msu.edu/

• Related algorithms:
– “Probability clouds”: Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83

More Repeat Masking …


• Review the differences between prokaryotic and eukaryotic gene organization.

• Understand the consequences and challenges for gene-finding algorithms in prokaryotes and eukaryotes.

• Appreciate HMMs as a powerful tool (in many areas of computational biology!)

• Be reminded that not all genes encode proteins, and that predicting such genes has its own computational challenges.

Objectives

• Which genes are present?
• How did they get there (evolution)?
• Are the genes present in more than one copy?
• Which genes are not there that we would expect to be present?
• What order are the genes in, and does this have any significance?
• How similar is the genome of one organism to that of another?

Genome annotation questions

• Whole-genome annotation
– A genome sequence does not give you a list of all genes

• Fully characterizing Yfg (“your favourite gene”)
– Example: a disease is associated with a SNP at a location in the human genome. BLAST finds similarity to a protein-coding gene in the area, but it is only similar to part of the whole protein. What is the whole gene?

Why gene finding?

After completing the human genome we faced 3 Gigabytes of this

Not immediately apparent where the genes are…

Prokaryotes
• High gene density
• mRNA transcription and translation are coupled
• Genes are usually contiguous stretches of coding DNA
• mRNAs are often polycistronic

Eukaryotes
• Low gene density
• mRNA is transcribed, then transported to the cytoplasm for translation
• Genes' coding DNA is often split by non-coding introns
• mRNAs are generally monocistronic

Great real-time transcription-translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls

Raw Biological Materials


•  2000: must be at least 100,000 (Rice has ~40,000, C. elegans has ~19,000)

•  2001: only 35,000?

•  2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)

•  2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)

• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes, but with alternative splicing these likely produce >100,000 proteins (178,191 currently annotated in Ensembl)

How many genes in human genome?

1 gene in how many base pairs?
a. 1:10,000,000
b. 1:1,000,000
c. 1:100,000 (roughly, for human)
d. 1:10,000 (1:5000 for C. elegans)
e. 1:1000 (roughly, for most bacteria)
f. 1:100
g. 1:10

Gene density
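The orders of magnitude above can be checked with simple arithmetic on figures quoted elsewhere in these slides (genome size divided by gene count). The sketch below is only that back-of-the-envelope check. It assumes the human coding-gene count from the Ensembl 64 bullet and the E. coli K-12 and C. elegans values from the genome tables; the dictionary and function names are illustrative.

```python
# (genome size in bp, number of genes), as quoted in the slides' tables
GENOMES = {
    "Homo sapiens": (3_100_000_000, 20_900),      # coding genes, Ensembl 64
    "Caenorhabditis elegans": (100_000_000, 21_175),
    "Escherichia coli K-12": (4_639_000, 4_397),
}


def bp_per_gene(size_bp: int, n_genes: int) -> float:
    """Average number of base pairs per annotated gene."""
    return size_bp / n_genes


for species, (size_bp, n_genes) in GENOMES.items():
    print(f"{species}: about 1 gene per {bp_per_gene(size_bp, n_genes):,.0f} bp")
```

With these inputs the ratios fall near 1:100,000 for human, 1:5,000 for C. elegans and 1:1,000 for E. coli, in line with answers c, d and e.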

ab initio gene predictors

Evidence-drivable gene predictor

Annotation pipeline & browser

• Identify repetitive sequences
• Identify structural RNA-encoding genes (by comparison to known rRNA / tRNA sequences)
• Identify protein-encoding genes
• Identify functions of these genes

Steps in genome annotation

Identifying ORFs
• Relatively easy in bacteria: the sequence is scanned for ORFs (sequences between a start and a stop codon) of greater than a fixed length

• More complicated in eukaryotes because of introns.
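The bacterial case described above (scan for start-to-stop stretches longer than a fixed length) can be sketched in a few lines. This is an illustrative toy, not the method of any particular gene finder: it assumes ATG as the only start codon, the standard three stop codons, and a hypothetical `min_len` threshold, and it reports minus-strand coordinates relative to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 300):
    """Return (strand, start, end) for ORFs of at least min_len bases.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon (stop included); start is 0-based and end is exclusive.
    Coordinates for strand '-' refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs
```

A scan like this breaks down in eukaryotes because introns interrupt the reading frame, which is why the later slides turn to splice signals and HMMs.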

Exons and Introns

• The size distribution of exons varies according to position in the gene. It is also quite different between plants and animals.

• Exons are generally shorter than prokaryotic ORFs, and can be as short as 10 bp.

• Introns can be incredibly long, with some human introns over 400,000 bp. The minimum size is about 50 bp.

• Many genes have alternative splicing patterns: a sequence that is an exon in one tissue might be an intron in another tissue.

Splicing consensus sequences
• 5′ splice site – GU

• 3′ splice site – AG

• 5′-UACUAAC-3′ sequence between 18 and 140 bases upstream of the 3′ splice site (yeast).

• Second type of intron (quite rare): 5′ splice site – AU, 3′ splice site – AC.
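The GU…AG rule alone already defines a (very permissive) search for candidate introns. The sketch below enumerates every GT…AG span in a DNA sequence within a length range; the minimum of 50 bp follows the intron size mentioned earlier, while `max_len` and the function name are assumptions made for illustration. Real gene finders score the surrounding consensus rather than accepting every dinucleotide match.

```python
def candidate_introns(seq: str, min_len: int = 50, max_len: int = 1000):
    """Return (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based with end exclusive. RNA input (U) is
    converted to DNA (T) first, so GU...AG is handled as GT...AG.
    """
    seq = seq.upper().replace("U", "T")
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    # Store the position just past each AG so spans are end-exclusive.
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptors
        if min_len <= a - d <= max_len
    ]
```

The number of candidates grows quickly with sequence length, which illustrates why dinucleotide signals by themselves are too weak for eukaryotic gene prediction.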

Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to “learn” how to find a pattern.

A common machine learning approach used in gene discovery (and many other bioinformatics applications) is hidden Markov models (HMMs).

ab initio gene discovery approaches

An example state diagram for an HMM for gene discovery

Figure: HMM state diagram. States shown: begin gene region, 5′ UTR, start translation, initial exon, single exon, exon, final exon, donor splice site, intron, acceptor splice site, stop translation, 3′ UTR, end gene region. Emitted symbols: A, T, G, C.

Use a training set of known genes (from the same or a closely related species) to determine transition and emission probabilities.

ab initio gene discovery—HMMs
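Given transition and emission probabilities, the most likely state path for a sequence is usually found with the Viterbi algorithm. The sketch below applies it to a deliberately tiny two-state model (AT-rich "intergenic" versus GC-rich "exon"), far simpler than the state diagram above. All probabilities are made-up illustrative values, not trained parameters, and the function assumes a non-empty observation sequence.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (log-space Viterbi)."""
    # Scores for the first symbol.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for sym in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            col[s] = scores[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][sym])
            ptr[s] = prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy parameters (assumed for illustration only).
STATES = ("intergenic", "exon")
START = {"intergenic": 0.5, "exon": 0.5}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "exon": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

path = viterbi("AATTGCGCGCAATT", STATES, START, TRANS, EMIT)
```

In a real gene finder the states would correspond to the exon, intron, UTR and signal states of the diagram, with probabilities estimated from the training set.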

• Combine gene models with alignments to known ESTs and protein sequences
• EST sequences/RNA-Seq data are used for the training set/consensus gene model.

Evidence-based approaches

E.g., tRNA, rRNA, miRNA, various other ncRNAs
Harder to find than protein-coding genes. Why?
• Often not poly-A tailed, so they don't end up in cDNA libraries
• No ORF structure
• Constraint on sequence divergence is at the nucleotide level, not the protein level.

How do we find these?
• Secondary structure
• Homology, especially alignment of related species
• Experimentally:
– isolation through non-polyA-dependent cloning methods
– microarrays

Finding non-protein-coding genes

• Standard types of evidence for validation of predictions include:
– match to previously annotated cDNA
– match to EST from same organism
– similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank
– protein structure prediction match to a PFAM domain

ab initio gene discovery—validating predictions and refining gene models

• Three commonly used measures of gene-finder performance are sensitivity, specificity and accuracy (Genomics, 1996).

SN = TP / (TP + FN)
SP = TP / (TP + FP)
AC = (SN + SP) / 2

Annotation edit distance (AED): AED = 1 – AC

• Sensitivity (SN) is the fraction of the reference feature that is predicted by the gene predictor

• Specificity (SP) is the fraction of the prediction overlapping the reference feature

• Accuracy (AC): SN and SP are often combined into a single measure called accuracy

TP = true positive; FN = false negative; FP = false positive

Figure: reference and predicted gene models, with segment lengths of 50 bp, 75 bp and 100 bp.

Worked example:
TP = 75 + 50 = 125; FN = 25 + 50 = 75
SN = 125 / (125 + 75) = 0.625
FP = 0; SP = 125 / (125 + 0) = 1
AC = (0.625 + 1) / 2 = 0.8125
AED = 1 – AC = 0.1875 ≈ 0.19

How gene prediction accuracies are calculated
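The worked example can be reproduced at the nucleotide level by treating the reference and the prediction as sets of covered positions. The sketch below is a minimal illustration and not MAKER's implementation. The exon coordinates are hypothetical, chosen only so that the overlaps match the slide (75 + 50 bp shared, 25 + 50 bp missed), and the function assumes both annotations are non-empty.

```python
def covered(intervals):
    """Set of positions covered by 0-based, end-exclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def accuracy_metrics(reference, prediction):
    """Nucleotide-level SN, SP, AC and AED for two sets of exon intervals."""
    ref, pred = covered(reference), covered(prediction)
    tp = len(ref & pred)   # predicted and in the reference
    fn = len(ref - pred)   # in the reference but not predicted
    fp = len(pred - ref)   # predicted but not in the reference
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    ac = (sn + sp) / 2
    return {"SN": sn, "SP": sp, "AC": ac, "AED": 1 - ac}


# Hypothetical coordinates matching the slide's overlaps.
reference = [(0, 100), (200, 300)]    # two 100 bp reference exons
prediction = [(25, 100), (200, 250)]  # 75 bp and 50 bp predicted exons
metrics = accuracy_metrics(reference, prediction)
```

For these inputs the metrics are SN = 0.625, SP = 1.0, AC = 0.8125 and AED = 0.1875, matching the values above.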

Values in parentheses are at the exon level.

AED = 0 indicates that the annotation is in perfect agreement with its evidence, whereas AED = 1 indicates a complete lack of evidence support for the annotation.

Annotation edit distance (AED)

Nature Reviews, May 2012

Gene prediction & gene annotation

High-quality draft genome
• Obtaining a high-quality draft assembly is a first achievable goal for most genome projects.
– Scaffold and contig N50s larger than gene size
– Percent gaps
– Percent coverage

• Genome coverage of 90–95% is generally considered to be good, as most genomes contain a considerable fraction of repetitive regions that are difficult to sequence.

When do we start the annotation process?
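Two of the assembly statistics listed above are easy to compute from contig or scaffold sequences. N50 is the length L such that contigs of length L or longer contain at least half of the assembled bases; percent gaps is taken here as the share of N characters. This is a small sketch with illustrative function names, not the output of any specific assembly-statistics tool.

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def percent_gaps(seq: str) -> float:
    """Percentage of the sequence made up of N (gap) characters."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)
```

Comparing the resulting N50 with the expected gene size for the organism gives a quick indication of whether most genes can lie on a single contig or scaffold.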


MAKER

Workflow: GeneMark-ES → maker1 → SNAP 1st → SNAP 2nd → maker2 → Annotation result

• Repeats from RepeatMasker and the MAKER internal RepeatRunner
• EST alignments from both EXONERATE and BLASTN
• Protein alignments from EXONERATE and BLASTX
• ab initio gene predictions from SNAP, Augustus, FGENESH, and GeneMark …
• Final gene models from MAKER

Maker2 annotation pipeline

• Requirements:
– Genome assembly (nucleotide FASTA file)
– CDSs (ESTs or RNA-Seq assembly) from the same species, if possible
– Protein set from a closely related species, if possible
– MAKER2 pipeline from http://www.yandell-lab.org/software/maker.html
– GeneMark-ES gene finder from http://exon.gatech.edu/license_download.cgi

Maker2 annotation pipeline

Run time with cpu=4 on a 32 Mb genome:

Run Step 0: GeneMark-ES prediction: Elapsed time: 1:45:08
Run Step 1: Maker1 prediction: Elapsed time: 13:39:08
Run Step 2: SNAP1 prediction: Elapsed time: 13:48:30
Run Step 3: SNAP2 prediction: Elapsed time: 13:50:22
Run Step 4: Maker2 prediction: Elapsed time: 14:51:44
Elapsed time of whole pipeline: 57:54:56

MAKER PIPELINE

Predictor   Gene counts
Augustus   7,641
GeneMark-ES   9,637
FGENESH   7,302
SNAP   9,579
After MAKER:
maker   7,050
non_overlapping_ab_initio   2,938

Statistics of gene models

1. BLAST hits of “non_overlapping_ab_initio” against nr (with E-value ≤ 10^-10)

2. Swiss-Prot, which is manually annotated and reviewed.

– Release 2013_10 of 16-Oct-13 of UniProtKB/Swiss-Prot contains 541561 sequence entries, comprising 192480382 amino acids abstracted from 223284 references.

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

Add other protein datasets

Predictor   Gene counts
Augustus   7,641
GeneMark-ES   9,637
FGENESH   7,302
SNAP   9,549
After MAKER:
maker   8,088
non_overlapping_ab_initio   1,742

Statistics of gene models

MAKER-generated annotations, shown in Apollo

A way of representing gene structure: http://www.sequenceontology.org/gff3.shtml

gff3 file
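A GFF3 feature line has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase and attributes, with attributes written as semicolon-separated key=value pairs. The sketch below parses one such line into a small record. It is a minimal reader for illustration, not a validator, and the `Feature` type, the function name and the example line are hypothetical.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int      # 1-based, inclusive (GFF3 convention)
    end: int        # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse one GFF3 line; return None for blank lines and comments/directives."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attrs = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = unquote(value)  # GFF3 attribute values are URL-escaped
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attrs)


# Hypothetical example line (not taken from a real annotation).
example = "chr1\tmaker\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=Yfg\n"
feature = parse_gff3_line(example)
```

Parent/child relationships between gene, mRNA, exon and CDS features are carried by the ID and Parent attributes, which this sketch simply exposes in the `attributes` dictionary.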

http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

Online Validator

• MAKER's AED score

Figure: example annotations with AED = 0 and AED = 0.19.


Prediction accuracy?


ANNOTATION


Genome annotation