
Genome - unina.it


Genome project

• Genome projects have generally become small-scale affairs that are often carried out by an individual laboratory.

• Genome annotation:
   – gene prediction & functional annotation

[Figure: workflow from sequence through assembly, genome annotation and downstream analysis to biological significance]

Eukaryotic genome annotation

Sequencing has become quick and cheap, but annotation has become more challenging.

• Shorter read length of NGS
• The contents of the genome are often terra incognita

Genome annotation

1. General considerations about genes and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

• Prokaryote versus Eukaryote versus Organelle
• Genome size:
   – Number of chromosomes
   – Number of base pairs
   – Number of genes
• GC/AT relative content
• Repeat content
• Genome duplications and polyploidy
• Gene content

See: Genomes, 2nd edition. Terence A Brown. ISBN-10: 0-471-25046-5
See NCBI Bookshelf: http://www.ncbi.nlm.nih.gov/books/NBK21128/

General Variables of Genomes

Size
   Eukaryote:
   • Large (10 Mb – 100,000 Mb)
   • There is generally no relationship between organism complexity and genome size (many plants have larger genomes than human!)
   Prokaryote:
   • Generally small (<10 Mb; most <5 Mb)
   • Complexity (as measured by number of genes and metabolism) is generally proportional to genome size

Content
   Eukaryote:
   • Most DNA is non-coding
   Prokaryote:
   • DNA is "coding gene dense"

Telomeres/Centromeres
   Eukaryote:
   • Present (linear DNA)
   Prokaryote:
   • Circular DNA, doesn't need telomeres
   • Don't have mitosis, hence no centromeres

Number of chromosomes
   Eukaryote:
   • More than one, (often) including those discriminating sexual identity
   Prokaryote:
   • Often one, sometimes more; additional molecules are often plasmids rather than true chromosomes

Chromatin
   Eukaryote:
   • Histone bound (which serves as a genome regulation point)
   Prokaryote:
   • No histones
   • Uses supercoiling to pack the genome

Eukaryote versus Prokaryote Genomes

Genes
   Eukaryote:
   • Often have introns
   • Intraspecific gene order and number generally relatively stable
   • Many non-coding (RNA) genes
   • There is NOT generally a relationship between organism complexity and gene number
   Prokaryote:
   • No introns
   • Gene order and number may vary between strains of a species

Gene regulation
   Eukaryote:
   • Promoters, often with distal long-range enhancers/silencers, MARs, transcriptional domains
   • Generally monocistronic
   Prokaryote:
   • Promoters
   • Enhancers/silencers rare
   • Genes often regulated as polycistronic operons

Repetitive sequences
   Eukaryote:
   • Generally highly repetitive, with genome-wide families from transposable element propagation
   Prokaryote:
   • Generally few repeated sequences
   • Relatively few transposons

Organelle (subgenomes)
   Eukaryote:
   • Mitochondrial (all)
   • Chloroplasts (in plants)
   Prokaryote:
   • Absent

Eukaryote versus Prokaryote Genomes

• Physical:
   – Amount of DNA / number of base pairs
   – Number of chromosomes/linkage groups
   – Information resources:
      • NCBI: http://www.ncbi.nlm.nih.gov/genome
      • Animals: http://www.genomesize.com/
      • Plants: http://data.kew.org/cvalues/
      • Fungi: http://www.zbi.ee/fungal-genomesize/
• Genetic:
   – Number of genes in the genome

Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.

Genome Size

Species                      Type of organism                 Genome size (kb)

Mitochondrial genomes
Plasmodium falciparum        Protozoan (malaria parasite)     6
Chlamydomonas reinhardtii    Green alga                       16
Mus musculus                 Vertebrate (mouse)               16
Homo sapiens                 Vertebrate (human)               17
Metridium senile             Invertebrate (sea anemone)       17
Drosophila melanogaster      Invertebrate (fruit fly)         19
Chondrus crispus             Red alga                         26
Aspergillus nidulans         Ascomycete fungus                33
Reclinomonas americana       Protozoa                         69
Saccharomyces cerevisiae     Yeast                            75
Suillus grisellus            Basidiomycete fungus             121
Brassica oleracea            Flowering plant (cabbage)        160
Arabidopsis thaliana         Flowering plant (thale cress)    367
Zea mays                     Flowering plant (maize)          570
Cucumis melo                 Flowering plant (melon)          2500

Chloroplast genomes
Pisum sativum                Flowering plant (pea)            120
Marchantia polymorpha        Liverwort                        121
Oryza sativa                 Flowering plant (rice)           136
Nicotiana tabacum            Flowering plant (tobacco)        156
Chlamydomonas reinhardtii    Green alga                       195

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511

Size of Organelle Genomes

DOGMA is for annotating plant chloroplast and animal mitochondrial genomes.

Species / DNA molecules                                  Size (Mb)   Number of genes

Escherichia coli K-12: one circular molecule             4.639       4397

Vibrio cholerae El Tor N16961: two circular molecules
   Main chromosome                                       2.961       2770
   Megaplasmid                                           1.073       1115

Deinococcus radiodurans R1: four circular molecules
   Chromosome 1                                          2.649       2633
   Chromosome 2                                          0.412       369
   Megaplasmid                                           0.177       145
   Plasmid                                               0.046       40

Borrelia burgdorferi B31: seven or eight circular molecules, 11 linear molecules
   Linear chromosome                                     0.911       853
   Circular plasmid cp9                                  0.009       12
   Circular plasmid cp26                                 0.026       29
   Circular plasmid cp32*                                0.032       Not known
   Linear plasmid lp17                                   0.017       25
   Linear plasmid lp25                                   0.024       32
   Linear plasmid lp28-1                                 0.027       32
   Linear plasmid lp28-2                                 0.030       34
   Linear plasmid lp28-3                                 0.029       41
   Linear plasmid lp28-4                                 0.027       43
   Linear plasmid lp36                                   0.037       54
   Linear plasmid lp38                                   0.039       52
   Linear plasmid lp54                                   0.054       76
   Linear plasmid lp56                                   0.056       Not known

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524

Size of Prokaryote Genomes

Species                                        Genome size (Mb)

Fungi
Saccharomyces cerevisiae                       12.1
Aspergillus nidulans                           25.4

Protozoa
Tetrahymena pyriformis                         190

Invertebrates
Caenorhabditis elegans                         97
Drosophila melanogaster                        180
Bombyx mori (silkworm)                         490
Strongylocentrotus purpuratus (sea urchin)     845
Locusta migratoria (locust)                    5000

Vertebrates
Takifugu rubripes (pufferfish)                 400
Homo sapiens                                   3200
Mus musculus (mouse)                           3300

Plants
Arabidopsis thaliana (thale cress)             125
Oryza sativa (rice)                            430
Zea mays (maize)                               2500
Pisum sativum (pea)                            4800
Triticum aestivum (wheat)                      16 000
Fritillaria assyriaca (fritillary)             120 000

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471

Size of Eukaryote Genomes

http://en.wikipedia.org/wiki/Genome_size
http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes

Genome size

Species                     Ploidy   Chromosomes   Size (Mb)   No. genes

Saccharomyces cerevisiae    2        16            12          6,281
Plasmodium falciparum       2        14            23          5,509
Caenorhabditis elegans      2        6             100         21,175
Drosophila melanogaster     2        6             139         15,016
Oryza sativa                2        12            410         30,294
Canis lupus familiaris      2        39            2,445       24,044
Homo sapiens                2        24            3,100       36,036
Zea mays                    2        10            2,046       42,000-56,000 (*)
Protopterus aethiopicus     ?        ?             130,000     ?
Paris japonica              8        40            150,000     ?
Polychaos dubium            ?        ?             670,000     ?

http://www.ncbi.nlm.nih.gov/genome

(*) Haberer et al., Structure and architecture of the maize genome. Plant Physiol. 2005 Dec;139(4):1612-24

Number of Genes

• Regional variation in GC content correlates with genomic content and function, such as transposable element distribution, gene density, gene regulation, methylation, etc.
• Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)

Species                            GC%
Streptomyces coelicolor A3(2)      72
Plasmodium falciparum              20
Arabidopsis thaliana               36
Saccharomyces cerevisiae           38
Homo sapiens                       41 (35 – 60)

Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009

AT/GC content
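As a rough illustration of how the GC% values and regional variation above might be measured, the sketch below computes GC content for a whole sequence and for fixed-size windows. The function names, the window size and the choice to ignore ambiguous bases are assumptions made for this example, not part of the original slides.

```python
def gc_percent(seq: str) -> float:
    """GC content as a percentage of unambiguous bases (A, C, G, T only)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def windowed_gc(seq: str, window: int = 1000, step: int = 1000):
    """Yield (start, GC%) for successive windows to expose regional variation."""
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        yield start, gc_percent(seq[start:start + window])


if __name__ == "__main__":
    toy = "GGGGCCCCAAAATTTT"  # made-up sequence
    print(gc_percent(toy))
    print(list(windowed_gc(toy, window=8, step=8)))
```

Windows with extreme values are the regions where the sequencing biases mentioned above would be expected to matter most.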

• Large genomes generally reflect evolutionary expansion of large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)
• Repeats drive genome mutational processes:
   – Recombination resulting in insertion, deletion, translocation, segmental duplication of DNA
   – Insertional mutagenesis, possibly including de novo creation of genes
   – Insert novel regulatory signals
• Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic as transposons mimic gene structures.

Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 2007;8:241-59.

Repeat Content

• Segmental duplications (i.e. by recombination)
   – Tandem: direct and inverted
• Whole genome duplication & loss, e.g.
   – Ancestral vertebrate: 2 rounds
      • HOX gene clusters…
• Polyploidy: ~70% of all angiosperms
   – Genomic hybridization (allopolyploids)
   – Can lead to immediate and extensive changes in gene expression
   – Mapping of homeologous gene loci can be tricky

Dehal P and Boore JL. 2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10): e314. doi:10.1371

Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141

Genome Duplications/Polyploidy

• All of these genomic variables:
   – Type of organism: i.e. prokaryote versus eukaryote
   – Genome size
   – GC/AT relative content
   – Repeat content
   – Genome duplications and polyploidy
   – Gene content

are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.

The bottom line

Composition of human genome

Human genome: > 3000 Mb
• Genes & gene-related sequences: 1200 Mb
   – Exons: 48 Mb
   – Related sequences: 1152 Mb (pseudogenes, gene fragments, introns & UTRs)
• Intergenic DNA: ~2000 Mb
   – Genome-wide repeats: 1400 Mb
      • LINEs: 640 Mb
      • SINEs: 420 Mb
      • LTR elements: 250 Mb
      • Transposons: 90 Mb
   – Others: >500 Mb
      • Microsatellites: 90 Mb

46% of human genome is repeats

Genome annotation

1. General considerations about genes and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

• Classic approach: search against repeat libraries
• RepeatMasker: http://www.repeatmasker.org/
   – Uses a previously compiled library of repeat families
   – Uses a (user-configured) external sequence search program
   – Computationally intensive but…
   – …the project web site also provides "pre-masked" genomic data for many completed genomes, complete with some statistical characterization.

Genomic (DNA) Sequence Repeat Masking
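To give a feel for what a masked sequence looks like downstream, here is a small sketch that assumes the common soft-masking convention in which repeats are written in lowercase (RepeatMasker can produce this with its `-xsmall` option). The helper names are hypothetical and the sequence is made up.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase), assuming that convention."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower())
    return masked / len(seq)


def hard_mask(seq: str, mask_char: str = "N") -> str:
    """Convert soft-masked (lowercase) bases to a hard mask character."""
    return "".join(mask_char if base.islower() else base for base in seq)


if __name__ == "__main__":
    toy = "ACGTacgtacgtACGT"  # made-up, half of it soft-masked
    print(masked_fraction(toy))
    print(hard_mask(toy))
```

Soft masking keeps the repeat sequence available to aligners, whereas hard masking removes it entirely; which one a gene finder expects is worth checking before annotation.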

Genome annotation

• de novo identification and classification:
   – RECON: http://www.genetics.wustl.edu/eddy/recon
   – RepeatGluer: http://nbcr.sdsc.edu/euler/
   – PILER: http://www.drive5.com/piler
• Repeat databases:
   – Repbase: http://www.girinst.org/repbase/index.html
   – plants: http://plantrepeats.plantbiology.msu.edu/
• Related algorithms:
   – "probability clouds": Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83

More Repeat Masking …

Genome annotation

1. General considerations about genes and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

• Review of differences in prokaryotic and eukaryotic gene organization.
• Understand consequences and challenges for gene finding algorithms for Prokaryotes and Eukaryotes.
• Appreciate HMM as powerful tool (in many areas of computational biology!)
• Be reminded that not all genes encode proteins and predictions of such genes have their own computational challenges.

Objectives

• Which genes are present?
• How did they get there (evolution)?
• Are the genes present in more than one copy?
• Which genes are not there that we would expect to be present?
• What order are the genes in, and does this have any significance?
• How similar is the genome of one organism to that of another?

Genome annotation Questions

• Whole-genome annotation
   – Genome sequence does not give you a list of all genes
• Fully characterizing Yfg ("your favourite gene")
   – Example: a disease is associated with a SNP at a location in the human genome. BLAST finds similarity to a protein-coding gene in the area, but it's only similar to part of the whole protein. What's the whole gene?

Why Gene-finding?

After completing the human genome we faced 3 Gigabytes of this

Not immediately apparent where the genes are…

Prokaryotes
• High gene density
• mRNA transcription and translation are coupled
• Genes are usually contiguous stretches of coding DNA
• mRNAs often polycistronic

Eukaryotes
• Low gene density
• mRNA transcribed then transported to cytoplasm for translation
• Genes' coding DNA often split by non-coding introns
• mRNAs are generally monocistronic

Great real-time transcription-translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls

Raw Biological Materials


• 2000: must be at least 100,000 (rice has ~40,000, C. elegans has ~19,000)
• 2001: only 35,000?
• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)
• 2006: Ensembl NCBI 36 release: 23,710 protein-coding genes, plus 4421 RNA genes (48,851 transcripts)
• Today (Ensembl release 64, Sept 2011): 20,900 coding genes + 14,266 RNA genes, but with alternative splicing these likely produce >100,000 proteins (178,191 currently annotated in Ensembl)

How many genes in human genome?

1 gene in how many base pairs?
a. 1:10,000,000
b. 1:1,000,000
c. 1:100,000 (roughly, for human)
d. 1:10,000 (1:5000 for C. elegans)
e. 1:1000 (roughly, for most bacteria)
f. 1:100
g. 1:10

Gene density
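The ratios above can be checked against the genome sizes and gene counts quoted in the earlier tables (E. coli K-12: 4.639 Mb, 4397 genes; C. elegans: 100 Mb, 21,175 genes; human: 3,100 Mb, 36,036 genes). The sketch below does that arithmetic; the function name is an assumption for illustration.

```python
# (genome size in Mb, number of genes), taken from the tables in this deck
GENOMES = {
    "Escherichia coli K-12": (4.639, 4397),
    "Caenorhabditis elegans": (100, 21175),
    "Homo sapiens": (3100, 36036),
}


def bp_per_gene(size_mb: float, n_genes: int) -> float:
    """Average number of base pairs per annotated gene."""
    return size_mb * 1_000_000 / n_genes


if __name__ == "__main__":
    for species, (size_mb, n_genes) in GENOMES.items():
        print(f"{species}: about 1 gene per {bp_per_gene(size_mb, n_genes):,.0f} bp")
```

With these inputs the results fall close to 1:1000 for E. coli, 1:5000 for C. elegans and somewhat under 1:100,000 for human, in line with answers c to e.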


ab initio gene predictors


Evidence-drivable gene predictor


Annotation pipeline & browser

Steps in genome annotation

• Identify repetitive sequences
• Identify structural RNA-encoding genes (by comparison to known rRNA / tRNA sequences)
• Identify protein-encoding genes
• Identify functions of these genes

Identifying ORFs

• Relatively easy in bacteria: the sequence is scanned for ORFs (sequences between a start and a stop codon) of greater than a fixed length
• More complicated in eukaryotes because of introns.
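The bacterial ORF scan described above might be sketched as follows. This is a minimal illustration, assuming ATG as the only start codon, the standard three stop codons, and a user-chosen minimum length; real prokaryotic gene finders also consider alternative start codons and coding statistics. All names are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 300):
    """Return (strand, start, end) for ORFs of at least min_len bases.

    Coordinates are 0-based, end-exclusive, and relative to the scanned
    strand (for '-', relative to the reverse complement).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC", min_len=15))
```

The same scan applied to a eukaryotic gene would stop at the first in-frame stop codon inside an intron, which is why introns make this approach insufficient there.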

Exons and Introns

• Size distribution of exons varies according to position in the gene. It is also quite different between plants and animals.
• Exons are generally shorter than prokaryotic ORFs, as short as 10 bp.
• Introns can be incredibly long, with some human introns over 400,000 bp. Minimum size is about 50 bp.
• Many genes have alternate splicing patterns: a sequence that is an exon in one tissue might be an intron in another tissue.

Splicing consensus sequences

• 5′ splice site – GU
• 3′ splice site – AG
• 5′-UACUAAC-3′ sequence between 18 and 140 bases upstream of the 3′ splice site (yeast).
• Second type of intron (quite rare): 5′ splice site – AU, 3′ splice site – AC.
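A naive scan for the GU…AG consensus shows why splice signals alone are weak evidence. The sketch below lists every GT…AG pair (DNA alphabet) that meets a minimum intron length; the function name and the default of 50 bp, taken from the minimum intron size quoted above, are assumptions for illustration.

```python
def candidate_introns(seq: str, min_len: int = 50):
    """List (start, end) of every GT...AG span of at least min_len bases.

    Coordinates are 0-based, end-exclusive. RNA input (U) is converted to DNA.
    """
    seq = seq.upper().replace("U", "T")
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    spans = []
    for d in donors:
        for a in acceptors:
            if a + 2 - d >= min_len:  # span runs from the G of GT to the G of AG
                spans.append((d, a + 2))
    return spans


if __name__ == "__main__":
    toy = "AAGT" + "C" * 10 + "AGTT"  # made-up sequence with one plausible intron
    print(candidate_introns(toy, min_len=10))
```

On real genomic sequence the number of such pairs is enormous, so gene finders weight them with position-specific models and surrounding context rather than the dinucleotides alone.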

Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to "learn" how to find a pattern.

A common machine learning approach used in gene discovery (and many other bioinformatics applications) is the hidden Markov model (HMM).

ab initio gene discovery approaches

An example state diagram for an HMM for gene discovery

[Figure: HMM state diagram. States shown: begin gene region, 5' UTR, start translation, single exon, initial exon, donor splice site, intron, acceptor splice site, exon, final exon, stop translation, 3' UTR, end gene region. Emitted symbols: A, T, G, C.]

Use a training set of known genes (from the same or closely related species) to determine transition and emission probabilities.

 ab initio gene discovery—HMMs
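To make the idea of transition and emission probabilities concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a deliberately simplified sketch: the two states and all probabilities are invented for illustration, whereas a real gene finder uses many more states (as in the diagram above) with parameters estimated from a training set.

```python
import math

STATES = ("intergenic", "exon")
# All probabilities below are made up for illustration only.
START = {"intergenic": 0.5, "exon": 0.5}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},        # GC-rich
}


def viterbi(seq: str):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("AAAAAAAAAAGCGCGCGCGC"))
```

The decoded path labels each base with a state, which is how an HMM-based predictor turns raw sequence into a gene model.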

Evidence based Approaches

• Combine gene models with alignment to known ESTs & protein sequences
• EST sequences/RNA-Seq data used for training set/consensus gene model.

Finding non-protein-coding genes

E.g., tRNA, rRNA, miRNA, various other ncRNAs

Harder to find than protein-coding genes. Why?
• Often not poly-A tailed, so they don't end up in cDNA libraries
• No ORF structure
• Constraint on sequence divergence is at the nucleotide, not the protein, level.

How do we find these?
• secondary structure
• homology, especially alignment of related species
• experimentally
   – isolation through non-polyA-dependent cloning methods
   – microarrays

• Standard types of evidence for validation of predictions include:
   – match to previously annotated cDNA
   – match to EST from same organism
   – similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank
   – protein structure prediction match to a PFAM domain

ab initio gene discovery—validating predictions and refining gene models

How gene prediction accuracies are calculated

• Three commonly used measures of gene-finder performance are sensitivity, specificity and accuracy. (Genomics, 1996).

SN = TP / (TP + FN)
SP = TP / (TP + FP)
AC = (SN + SP) / 2
AED = 1 – AC   (annotation edit distance)

• Sensitivity (SN) is the fraction of the reference feature that is predicted by the gene predictor.
• Specificity (SP) is the fraction of the prediction overlapping the reference feature.
• Accuracy (AC): SN and SP are often combined into a single measure called accuracy.

TP = true positive; FN = false negative; FP = false positive

How gene prediction accuracies are calculated

SN = TP / (TP + FN)
SP = TP / (TP + FP)
AC = (SN + SP) / 2
AED = 1 – AC   (annotation edit distance)

[Figure: exon diagrams of reference and predicted gene models, with segment labels of 50 bp, 75 bp and 100 bp; the models shown have AED = 0 and AED = 0.19.]

TP = 75 + 50 = 125; FN = 25 + 50 = 75
SN = 125 / (125 + 75) = 0.625
FP = 0; SP = 125 / (125 + 0) = 1
AC = (0.625 + 1) / 2 = 0.8125
AED = 1 – 0.8125 = 0.1875 ≈ 0.19

Parenthesis value at exon level
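The worked example above can be reproduced with a few lines of code. This sketch simply applies the slide's formulas to base-pair counts; the function name and the dictionary return type are choices made for this illustration.

```python
def accuracy_metrics(tp: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and AED as defined on the slide."""
    sn = tp / (tp + fn)   # fraction of the reference that was predicted
    sp = tp / (tp + fp)   # fraction of the prediction that overlaps the reference
    ac = (sn + sp) / 2
    return {"SN": sn, "SP": sp, "AC": ac, "AED": 1 - ac}


if __name__ == "__main__":
    # Counts from the worked example: TP = 75 + 50, FN = 25 + 50, FP = 0
    print(accuracy_metrics(tp=125, fp=0, fn=75))
```

Note that SP here is what is elsewhere often called precision; the code follows the slide's definition rather than the true-negative-based one.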

AED=0 indicates that the annotation is in perfect agreement with its evidence, whereas AED=1 indicates a complete lack of evidence support for the annotation.

Annotation edit distance (AED)

NATURE REVIEWS, May 2012

Gene prediction & gene annotation

High-quality draft genome

• Obtaining a high-quality draft assembly is a first achievable goal for most genome projects.
   – Scaffold and contig N50s
      • larger than gene size
   – Percent gaps
   – Percent coverage
      • Genome coverage of 90–95% is generally considered to be good, as most genomes contain a considerable fraction of repetitive regions that are difficult to sequence.

When do we start the annotation process?
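N50 is the length L such that contigs (or scaffolds) of length at least L together contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name and the example lengths are made up.

```python
def n50(lengths) -> int:
    """Length L such that sequences of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


if __name__ == "__main__":
    contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # toy values, arbitrary units
    print(n50(contig_lengths))
```

Comparing this value with the typical gene length of the organism gives a quick sense of whether most genes are likely to sit on a single contig or scaffold.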


MAKER  

Workflow: GeneMark-ES → maker1 → SNAP 1st → SNAP 2nd → maker2 → Annotation result

• Repeats from RepeatMasker and the MAKER internal RepeatRunner
• EST alignments from both EXONERATE and BLASTN
• Protein alignments from EXONERATE and BLASTX
• ab initio gene predictions from SNAP, Augustus, FGENESH, and GeneMark …
• Final gene models from MAKER

Maker2 annotation pipeline

• Requirements:
   – Genome assembly (nucleotide FASTA file)
   – CDSs (ESTs or RNA-Seq assembly) from the same species, if possible
   – Protein set from a closely related species, if possible
   – MAKER2 pipeline from http://www.yandell-lab.org/software/maker.html
   – GeneMark-ES gene finder from http://exon.gatech.edu/license_download.cgi

Maker2 annotation pipeline

Run time with cpu=4, 32 Mb of genome:

Run Step 0: GeneMark-ES prediction: Elapsed time: 1:45:08
========================
Run Step 1: Maker1 prediction: Elapsed time: 13:39:08
========================
Run Step 2: SNAP1 prediction: Elapsed time: 13:48:30
========================
Run Step 3: SNAP2 prediction: Elapsed time: 13:50:22
========================
Run Step 4: Maker2 prediction: Elapsed time: 14:51:44
========================
Elapsed time of whole pipeline: 57:54:56

MAKER PIPELINE

Predictor                      Gene counts
Augustus                       7,641
GeneMark-ES                    9,637
FGENESH                        7,302
SNAP                           9,579
After MAKER:
   maker                       7,050
   non_overlapping_ab_initio   2,938

statistics of Gene model

1. BLAST hits of "non_overlapping_ab_initio" against nr (with E-value ≤ 10^-10)

2. Swiss-Prot, which is manually annotated and reviewed.
   – Release 2013_10 of 16-Oct-13 of UniProtKB/Swiss-Prot contains 541561 sequence entries, comprising 192480382 amino acids abstracted from 223284 references.

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

Add other protein datasets  
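The E-value cut-off in step 1 could be applied along the following lines. This sketch assumes the BLAST results were written in the tab-separated tabular format (BLAST+ `-outfmt 6` with default columns), where column 1 is the query ID and column 11 the E-value; the function name and the sample rows are invented for illustration.

```python
def queries_with_hits(lines, max_evalue: float = 1e-10):
    """Return the set of query IDs with at least one hit at or below max_evalue."""
    supported = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            supported.add(fields[0])
    return supported


if __name__ == "__main__":
    # Made-up rows: qseqid sseqid pident length mismatch gapopen
    #               qstart qend sstart send evalue bitscore
    rows = [
        "gene_A\tsp|X1\t85.0\t200\t30\t0\t1\t200\t5\t204\t1e-50\t180",
        "gene_B\tsp|X2\t40.0\t60\t36\t0\t1\t60\t10\t69\t0.002\t35",
    ]
    print(queries_with_hits(rows))
```

Predictions that pass the filter are the ones that would be worth feeding back into the pipeline as additional protein evidence.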

Predictor                      Gene counts
Augustus                       7,641
GeneMark-ES                    9,637
FGENESH                        7,302
SNAP                           9,549
After MAKER:
   maker                       8,088
   non_overlapping_ab_initio   1,742

statistics of Gene model


MAKER-generated annotations, shown in Apollo

Way of representing gene structure: http://www.sequenceontology.org/gff3.shtml

 gff3 file
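A GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with attributes written as semicolon-separated key=value pairs. The sketch below parses one such line; the function name and the example record are made up, and it skips validation beyond the column count.

```python
from urllib.parse import unquote


def parse_gff3_line(line: str) -> dict:
    """Parse one GFF3 feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)  # GFF3 uses percent-encoding
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }


if __name__ == "__main__":
    example = "chr1\tmaker\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=example%20gene"
    print(parse_gff3_line(example))
```

Child features such as mRNA, exon and CDS refer to their parent through a `Parent` attribute, which is how the file encodes gene structure.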

http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

Online Validator

• MAKER's AED score

[Figure: example gene models with AED=0 and AED=0.19]

Annotation edit distance (AED): AED=0 indicates that the annotation is in perfect agreement with its evidence. AED=1 indicates a complete lack of evidence support for the annotation.

 Prediction accuracy?


Predictor                      Gene counts
Augustus                       7,641
GeneMark-ES                    9,637
FGENESH                        7,302
SNAP                           9,549
After MAKER:
   maker                       8,088
   non_overlapping_ab_initio   1,742

Genome annotation