Introduction to Apollo Collaborative genome annotation editing A webinar for the i5K Research Community - Hemiptera Monica Munoz-Torres | @monimunozto Berkeley Bioinformatics Open-Source Projects (BBOP) Environmental Genomics & Systems Biology Division, Lawrence Berkeley National Laboratory i5k Pilot Project Species Calls | 9 February, 2016 http://GenomeArchitect.org

Introduction to Apollo for i5k


Page 1: Introduction to Apollo for i5k

Introduction to Apollo Collaborative genome annotation editing A webinar for the i5K Research Community - Hemiptera

Monica Munoz-Torres | @monimunozto Berkeley Bioinformatics Open-Source Projects (BBOP) Environmental Genomics & Systems Biology Division, Lawrence Berkeley National Laboratory i5k Pilot Project Species Calls | 9 February, 2016

http://GenomeArchitect.org

Page 2: Introduction to Apollo for i5k

Outline

• Today you will discover effective ways to extract valuable information about a genome through curation efforts.

Apollo Collaborative Curation and Interactive Analysis of Genomes

Page 3: Introduction to Apollo for i5k

After this talk you will...

• Better understand ‘curation’ in the context of genome annotation:

assembled genome → automated annotation → manual annotation

•  Become familiar with Apollo’s environment and functionality.

•  Learn to identify homologs of known genes of interest in your newly sequenced genome.

•  Learn how to corroborate and modify automatically annotated gene models using all available evidence in Apollo.

Page 4: Introduction to Apollo for i5k

[Figure: genome project workflow with the stages: experimental design, sampling; sequencing, assembly; automated annotation; manual annotation; official / merged gene set; comparative analyses; synthesis & dissemination.]

This is our focus.

Page 5: Introduction to Apollo for i5k

We must care about curation

Marbach et al. 2011. Nature Methods | Shutterstock.com | Alexander Wild

The gene set of an organism informs a variety of studies:
• Characterization: Gene number, GC%, TEs, repeats.
• Functional assignments.
• Molecular evolution, sequence conservation.
• Gene families.
• Metabolic pathways.
• What makes an organism what it is?

What makes a bee a “bee”?

Page 6: Introduction to Apollo for i5k

Genome Curation

Identifies elements that best represent the underlying biology and eliminates elements that reflect systemic errors of automated analyses.

Assigns function through comparative analysis of similar genome elements from closely related species using literature, databases, and experimental data.

Apollo

Gene Ontology Resources

Page 7: Introduction to Apollo for i5k

A few things to remember when conducting manual annotation

To remember… Biological concepts to better understand manual annotation

BIO-REFRESHER

• KEEP A GLOSSARY HANDY: from contig to splice site
• WHAT IS A GENE? defining your goal
• TRANSCRIPTION: mRNA in detail
• TRANSLATION: reading frames, etc.
• GENOME CURATION: steps involved

Page 8: Introduction to Apollo for i5k

The gene: a “moving target”

“The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.”

Gerstein et al., 2007. Genome Res

Page 9: Introduction to Apollo for i5k

9

"Gene structure" by Daycd- Wikimedia Commons

BIO-REFRESHER

mRNA

•  Although of brief existence, understanding mRNAs is crucial, as they will become the center of your work.

Page 10: Introduction to Apollo for i5k

10 BIO-REFRESHER

Reading frames

• In eukaryotes, only one reading frame per section of DNA is biologically relevant at a time: it has the potential to be transcribed into RNA and translated into protein. This is called the OPEN READING FRAME (ORF).
• ORF = Start signal + coding sequence (divisible by 3) + Stop signal
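The ORF definition above can be turned into a small check. This is only a sketch under stated assumptions: the function name `is_complete_orf` is hypothetical, the start signal is taken to be ATG, the stop signals are the three standard stop codons, and the whole sequence (start and stop included) is expected to be a multiple of 3.

```python
# Standard stop codons (assumption: standard genetic code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_orf(cds: str) -> bool:
    """Return True if cds looks like Start + coding sequence + Stop.

    Checks: length divisible by 3, first codon ATG, last codon a stop,
    and no in-frame stop codon before the last one.
    """
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Any stop codon before the final one would truncate the protein.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])
```

For example, `is_complete_orf("ATGAAATAA")` holds, while a sequence whose length is not a multiple of 3 is rejected.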

Page 11: Introduction to Apollo for i5k

11 BIO-REFRESHER

Splice sites

• The spliceosome catalyzes the removal of introns and the ligation of flanking exons.

• Splicing signals (from the point of view of an intron):
– One splice signal (site) on the 5’ end: usually GT (less common: GC)
– And a 3’ end splice site: usually AG
– Canonical splice sites look like this: …]5’-GT/AG-3’[…
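The splice signals above can be checked on an intron sequence read 5’ to 3’ on the transcript strand. A minimal sketch, assuming the intron is given as a plain DNA string; the function name `classify_intron` and the three labels it returns are illustrative, not part of Apollo.

```python
def classify_intron(intron: str) -> str:
    """Classify an intron by its first and last two bases (5'->3')."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both splice signals")
    donor, acceptor = intron[:2], intron[-2:]
    if acceptor == "AG" and donor == "GT":
        return "canonical"          # 5'-GT ... AG-3'
    if acceptor == "AG" and donor == "GC":
        return "GC donor"           # less common, but known
    return "non-canonical"
```

A GC donor is reported separately because it may be a genuine site that only needs a comment rather than a correction.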

Page 12: Introduction to Apollo for i5k

12 BIO-REFRESHER

Exons and Introns

• Introns can interrupt the reading frame of a gene by inserting a sequence between two consecutive codons

• Between the first and second nucleotide of a codon

• Or between the second and third nucleotide of a codon
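The three intron positions listed above are commonly called phase 0 (between codons), phase 1 (after the first nucleotide of a codon), and phase 2 (after the second). The term "phase" does not appear on the slide; it is used here as standard terminology. A sketch that derives the phase of each intron from the lengths of the coding portions of the exons, using the hypothetical name `intron_phases`:

```python
def intron_phases(coding_exon_lengths):
    """Return the phase of the intron that follows each coding exon.

    coding_exon_lengths: lengths (in bases) of the coding part of each
    exon, in transcript order. The last exon is followed by no intron.
    """
    phases = []
    coding_bases_so_far = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases_so_far += length
        # 0: intron sits between codons; 1 or 2: intron splits a codon.
        phases.append(coding_bases_so_far % 3)
    return phases
```

With coding exon lengths 9, 4 and 5, the first intron falls between codons and the second falls after the first nucleotide of a codon.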

"Exon and Intron classes”. Licensed under Fair use via Wikipedia

Page 13: Introduction to Apollo for i5k

Prediction & Annotation

Page 14: Introduction to Apollo for i5k

14 GENE PREDICTION & ANNOTATION

PREDICTION & ANNOTATION

• Identification and annotation of genome features:
– primarily focuses on protein-coding genes.
– also identifies RNAs (tRNA, rRNA, long and small non-coding RNAs (ncRNA)), regulatory motifs, repetitive elements, etc.
– happens in 2 phases:
1. Computation phase
2. Annotation phase

Page 15: Introduction to Apollo for i5k

15 GENE PREDICTION & ANNOTATION

COMPUTATION PHASE

a. Experimental data are aligned to the genome: expressed sequence tags, RNA-sequencing reads, proteins (also from other species).

b. Gene predictions are generated:
– ab initio: based on nucleotide sequence and composition, e.g. Augustus, GENSCAN, geneid, fgenesh, etc.
– evidence-driven: identifying also domains and motifs, e.g. SGP2, JAMg, fgenesh++, etc.

Result: the single most likely coding sequence, no UTRs, no isoforms.

Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174

Page 16: Introduction to Apollo for i5k

16 GENE PREDICTION & ANNOTATION

ANNOTATION PHASE

Experimental data (evidence) and predictions are synthesized into gene annotations.

Result: gene models that generally include UTRs, isoforms, evidence trails.

Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174

5’  UTR   3’  UTR  

Page 17: Introduction to Apollo for i5k

17

In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation.

CONSENSUS GENE SETS

Gene models may be organized into sets using:
• combiners for automatic integration of predicted sets, e.g. GLEAN, EvidenceModeler
or
• tools packaged into pipelines, e.g. MAKER, PASA, Gnomon, Ensembl, etc.

GENE PREDICTION & ANNOTATION

Page 18: Introduction to Apollo for i5k

ANNOTATION: needs some refinement

No one is perfect, least of all automated annotation.

New technologies bring new challenges:
• Assembly errors can cause fragmented annotations
• Limited coverage makes precise identification a difficult task

Image: www.BroadInstitute.org

Page 19: Introduction to Apollo for i5k

MANUAL ANNOTATION: improving predictions

Precise elucidation of biological features encoded in the genome requires careful examination and review.

Schiex et al. Nucleic Acids Res. 2003; 31(13): 3738-3741

Automated Predictions

Experimental Evidence

Manual Annotation – to the rescue.

cDNAs, HMM domain searches, RNAseq, genes from other species.

Page 20: Introduction to Apollo for i5k

GENOME CURATION: an inherently collaborative task

GENE PREDICTION & ANNOTATION

So many sequences, not enough hands.

Apis mellifera | Alexander Wild | www.alexanderwild.com

Page 21: Introduction to Apollo for i5k

We have provided continuous training and support for hundreds of geographically dispersed scientists to conduct manual annotation efforts in order to recover coding sequences in agreement with all available biological evidence.

21

Lessons learned

APOLLO

•  Collaborative work distills invaluable knowledge.

•  A little training goes a long way! Wet lab scientists can easily learn to maximize the generation of accurate, biologically supported gene models.

Page 22: Introduction to Apollo for i5k

Apollo  

Page 23: Introduction to Apollo for i5k

APOLLO: versatile genome annotation editing

•  Apollo is a web-based genome annotation editor, integrated with JBrowse

•  Supports real time collaboration & generates analysis-ready data

USER-CREATED ANNOTATIONS

EVIDENCE TRACKS

ANNOTATOR PANEL

Page 24: Introduction to Apollo for i5k

BECOMING ACQUAINTED WITH APOLLO 24

General process of curation

1. Select or find a region of interest, e.g. scaffold.

2. Select appropriate evidence tracks to review the gene model.

3. Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.

4. If necessary, adjust the gene model.

5. Check your edited gene model for integrity and accuracy by comparing it with available homologs.

6. Comment and finish.

Page 25: Introduction to Apollo for i5k

Apollo - version at i5K Workspace@NAL

4. Becoming Acquainted with Web Apollo.

The Sequence Selection Window

Page 26: Introduction to Apollo for i5k

Sort

Apollo - version at i5K Workspace@NAL

“Old Track Select Page”

4. Becoming Acquainted with Web Apollo.

Page 27: Introduction to Apollo for i5k

27

APOLLO: annotation editing environment

BECOMING ACQUAINTED WITH APOLLO

[Figure: annotated screenshot of the Apollo editing environment, with the following callouts.]
• Color by CDS frame, toggle strands, set color scheme and highlights.
• Upload evidence files (GFF3, BAM, BigWig), combination track, sequence search track.
• Query the genome using BLAT.
• Navigation and zoom.
• Search for a gene model or a scaffold.
• Get coordinates and “rubber band” selection for zooming.
• Login
• User-created annotations.
• New annotator panel.
• Evidence Tracks
• Stage and cell-type specific transcription data.

http://genomearchitect.org/web_apollo_user_guide

Page 28: Introduction to Apollo for i5k

28 | 28 BECOMING ACQUAINTED WITH APOLLO

USER NAVIGATION

Annotator panel.

• Choose appropriate evidence from list of “Tracks” on annotator panel.

• Select & drag elements from evidence track into the ‘User-created Annotations’ area.

• Hovering over annotation in progress brings up an information pop-up.

• Creating a new annotation

Page 29: Introduction to Apollo for i5k

Adding a gene model

Page 30: Introduction to Apollo for i5k

Adding a gene model

Page 31: Introduction to Apollo for i5k

Adding a gene model

Page 32: Introduction to Apollo for i5k

Editing functionality

Page 33: Introduction to Apollo for i5k

Editing functionality Example: Adding an exon supported by experimental data

•  RNAseq reads show evidence in support of a transcribed product that was not predicted. •  Add exon by dragging up one of the RNAseq reads.

Page 34: Introduction to Apollo for i5k

Editing functionality Example: Adjusting exon boundaries supported by experimental data

Page 35: Introduction to Apollo for i5k

Curating with Apollo

Page 36: Introduction to Apollo for i5k

36 | 36

USER NAVIGATION

BECOMING ACQUAINTED WITH APOLLO

•  ‘Zoom  to  base  level’  reveals  the  DNA  Track.  

Page 37: Introduction to Apollo for i5k

37 | 37

USER NAVIGATION

BECOMING ACQUAINTED WITH APOLLO

•  Color  exons  by  CDS  from  the  ‘View’  menu.  

Page 38: Introduction to Apollo for i5k

38 |

Zoom in/out with keyboard: shift + arrow keys up/down

38

USER NAVIGATION

BECOMING ACQUAINTED WITH APOLLO

• Toggle reference DNA sequence and translation frames in forward strand. Toggle models in either direction.

Page 39: Introduction to Apollo for i5k

annotating simple cases

Page 40: Introduction to Apollo for i5k

“Simple case”:
– the predicted gene model is correct or nearly correct, and
– this model is supported by evidence that completely or mostly agrees with the prediction.
– evidence that extends beyond the predicted model is assumed to be non-coding sequence.

The following are simple modifications.

 

40

ANNOTATING SIMPLE CASES

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Page 41: Introduction to Apollo for i5k

• A confirmation box will warn you if the receiving transcript is not on the same strand as the feature where the new exon originated.

• Check ‘Start’ and ‘Stop’ signals after each edit.

41

ADDING EXONS

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Page 42: Introduction to Apollo for i5k

ADDING UTRs

If transcript alignment data are available & extend beyond your original annotation, you may extend or add UTRs.

1. Right click at the exon edge and ‘Zoom to base level’.

2. Place the cursor over the edge of the exon until it becomes a black arrow, then click and drag the edge of the exon to the new coordinate position that includes the UTR.

To add a new spliced UTR to an existing annotation, also follow the procedure for adding an exon.

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Page 43: Introduction to Apollo for i5k

To modify an exon boundary and match data in the evidence tracks: select both the [offending] exon and the feature with the expected boundary, then right click on the annotation to select ‘Set 3’ end’ or ‘Set 5’ end’ as appropriate.

In some cases all the data may disagree with the annotation; in other cases some data support the annotation and some of the data support one or more alternative transcripts. Try to annotate as many alternative transcripts as are well supported by the data.

43

MATCHING EXON BOUNDARY TO EVIDENCE

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Page 44: Introduction to Apollo for i5k

Non-canonical splice site flags.

Double click: selection of feature and sub-features

Evidence Tracks Area

‘User-created Annotations’ Track

Edge-matching

Apollo’s editing logic (brain):
• selects longest ORF as CDS
• flags non-canonical splice sites

44

ORFs AND SPLICE SITES

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Page 45: Introduction to Apollo for i5k

Non-canonical splices are indicated by an orange circle with a white exclamation point inside, placed over the edge of the offending exon.

Canonical splice sites:

forward strand: 5’-…exon]GT / AG[exon…-3’

reverse strand, not reverse-complemented: 3’-…exon]GA / TG[exon…-5’

45

SPLICE SITES

Zoom to review non-canonical splice site warnings. Although these may not always have to be corrected (e.g. GC donor), they should be flagged with a comment.
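For genes on the reverse strand, the splice signals only read GT/AG after reverse-complementing the forward-strand sequence; read left to right without reversing, the reverse strand shows GA/TG. A sketch of both views, assuming 0-based, half-open intron coordinates on the forward strand; all function names here are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Reverse strand as displayed left to right (3'->5'), not reversed."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Reverse strand read in its own 5'->3' direction."""
    return complement(seq)[::-1]


def splice_dinucleotides(forward_seq: str, start: int, end: int, strand: str):
    """Return (donor, acceptor) read 5'->3' on the transcript strand."""
    intron = forward_seq[start:end].upper()
    if strand == "-":
        intron = reverse_complement(intron)
    return intron[:2], intron[-2:]
```

A reverse-strand intron that appears as CT…AC on the forward strand therefore still carries a canonical GT donor and AG acceptor.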

Exon/intron  splice  site  error  warning  

Curated  model  

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Page 46: Introduction to Apollo for i5k

Apollo calculates the longest possible open reading frame (ORF) that includes canonical ‘Start’ and ‘Stop’ signals within the predicted exons.

If ‘Start’ appears to be incorrect, modify it by selecting an in-frame ‘Start’ codon further upstream or downstream, depending on evidence (proteins, RNAseq).

The correct ‘Start’ may be present outside the predicted gene model, within a region supported by another evidence track.

In very rare cases, the actual ‘Start’ codon may be non-canonical (non-ATG).
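The idea of a longest ORF with canonical ‘Start’ and ‘Stop’ signals can be illustrated on a spliced transcript sequence. This is a sketch, not Apollo's implementation: it assumes ATG starts, the three standard stop codons, and complete ORFs only, and the name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript: str):
    """Return (start, end) of the longest ATG...stop ORF, or None.

    Coordinates are 0-based, half-open, on the spliced transcript.
    All three forward reading frames are scanned.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i          # first in-frame ATG opens the ORF
            elif codon in STOP_CODONS:
                end = i + 3            # include the stop codon
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None           # look for the next ORF in this frame
    return best
```

Choosing a different in-frame ‘Start’ then amounts to picking another ATG inside or upstream of the returned interval, guided by the evidence tracks.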

46

‘Start’ AND ‘Stop’ SITES

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Page 47: Introduction to Apollo for i5k

1. Two exons from different tracks sharing the same start/end coordinates display a red bar to indicate matching edges.

2. Selecting the whole annotation or one exon at a time, use this edge-matching function and scroll along the length of the annotation, verifying exon boundaries against available data. Use square [ ] brackets to scroll from exon to exon. Use curly { } brackets to scroll from annotation to annotation.

3. Check if cDNA / RNAseq reads lack one or more of the annotated exons or include additional exons.
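Edge matching boils down to comparing exon start and end coordinates across tracks. A sketch under the assumption that exons are `(start, end)` tuples in the same coordinate system; the function name `matching_edges` is hypothetical and this is not how Apollo itself is implemented.

```python
def matching_edges(annotation_exons, evidence_exons):
    """For each annotated exon, report whether each edge matches evidence.

    Returns a list of (start, end, start_matches, end_matches) tuples.
    """
    evidence_starts = {start for start, _ in evidence_exons}
    evidence_ends = {end for _, end in evidence_exons}
    return [
        (start, end, start in evidence_starts, end in evidence_ends)
        for start, end in annotation_exons
    ]
```

An exon whose start or end has no match in any evidence feature is a candidate for closer review at base level.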

47

CHECKING EXON INTEGRITY

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Page 48: Introduction to Apollo for i5k

annotating complex cases

Page 49: Introduction to Apollo for i5k

Evidence may support joining two or more different gene models. Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!

1. In ‘User-created Annotations’ area shift-click to select an intron from each gene model and right click to select the ‘Merge’ option from the menu.

2. Drag supporting evidence tracks over the candidate models to corroborate overlap, or review edge matching and coverage across models.

3. Check the resulting translation by querying a protein database e.g. UniProt, NCBI nr. Add comments to record that this annotation is the result of a merge.

Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement without examining each exon at the base level.

COMPLEX CASES merge two gene predictions on the same scaffold

BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES

Page 50: Introduction to Apollo for i5k

One or more splits may be recommended when:
– different segments of the predicted protein align to two or more different gene families
– predicted protein doesn’t align to known proteins over its entire length
– Transcript data may support a split, but first verify whether they are alternative transcripts.

50

COMPLEX CASES split a gene prediction

BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES

Page 51: Introduction to Apollo for i5k

DNA  Track  

‘User-created Annotations’ Track

51

COMPLEX CASES annotate frameshifts and correct single-base errors

Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.

BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES

Page 52: Introduction to Apollo for i5k

52

COMPLEX CASES correcting selenocysteine containing proteins

BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES

Page 53: Introduction to Apollo for i5k

53

COMPLEX CASES correcting selenocysteine containing proteins

BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES

Page 54: Introduction to Apollo for i5k

1. Apollo allows annotators to make single base modifications or frameshifts that are reflected in the sequence and structure of any transcripts overlapping the modification. These manipulations do NOT change the underlying genomic sequence.

2. If you determine that you need to make one of these changes, zoom in to the nucleotide level and right click over a single nucleotide on the genomic sequence to access a menu that provides options for creating insertions, deletions or substitutions.

3. The ‘Create Genomic Insertion’ feature will require you to enter the necessary string of nucleotide residues that will be inserted to the right of the cursor’s current location. The ‘Create Genomic Deletion’ option will require you to enter the length of the deletion, starting with the nucleotide where the cursor is positioned. The ‘Create Genomic Substitution’ feature asks for the string of nucleotide residues that will replace the ones on the DNA track.

4. Once you have entered the modifications, Apollo will recalculate the corrected transcript and protein sequences, which will appear when you use the right-click menu ‘Get Sequence’ option. Since the modification is reflected in all annotations that include the modified region, you should alert the curators of your organism’s database using the ‘Comments’ section to report the CDS edits.

5. In special cases such as selenocysteine-containing proteins (read-throughs), right-click over the offending/premature ‘Stop’ signal and choose the ‘Set readthrough stop codon’ option from the menu.
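The three kinds of modification described in step 3 can be modeled as operations that return an altered copy while the reference sequence stays untouched. This is a sketch of the concept only: the function names are hypothetical, `pos` is assumed to be the 0-based index of the nucleotide under the cursor, and Apollo's own data model is not shown.

```python
def insert_after(seq: str, pos: int, bases: str) -> str:
    """Insert bases to the right of the nucleotide at pos."""
    return seq[:pos + 1] + bases + seq[pos + 1:]


def delete_from(seq: str, pos: int, length: int) -> str:
    """Delete `length` nucleotides, starting with the one at pos."""
    return seq[:pos] + seq[pos + length:]


def substitute_at(seq: str, pos: int, bases: str) -> str:
    """Replace len(bases) nucleotides, starting at pos, with bases."""
    return seq[:pos] + bases + seq[pos + len(bases):]
```

Because Python strings are immutable, each call leaves the original `seq` unchanged, which mirrors the point that these edits do not alter the underlying assembly.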

 54

COMPLEX CASES annotating frameshifts and correcting single-base errors & selenocysteines

BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES

Page 55: Introduction to Apollo for i5k

55 | 55

USER NAVIGATION

BECOMING ACQUAINTED WITH APOLLO

•  Information Editor

Page 56: Introduction to Apollo for i5k

56

The Annotation Information Editor

56

USER NAVIGATION

BECOMING ACQUAINTED WITH APOLLO

Page 57: Introduction to Apollo for i5k

57

The Annotation Information Editor

• Add PubMed IDs
• Include GO terms as appropriate from any of the three ontologies
• Write comments stating how you have validated each model.

57

USER NAVIGATION

BECOMING ACQUAINTED WITH APOLLO

Page 58: Introduction to Apollo for i5k

58 | 58

USER NAVIGATION

BECOMING ACQUAINTED WITH APOLLO

•  Keeping track of each edit

Page 59: Introduction to Apollo for i5k

59

Annotations, annotation edits, and History: stored in a centralized database.

59

USER NAVIGATION

BECOMING ACQUAINTED WITH APOLLO

Page 60: Introduction to Apollo for i5k

Follow the checklist until you are happy with the annotation!

And remember to…
– comment to validate your annotation, even if you made no changes to an existing model. Think of comments as your vote of confidence.
– or add a comment to inform the community of unresolved issues you think this model may have.

Always Remember: Apollo curation is a community effort so please use comments to communicate the reasons for your annotation. Your comments will be visible to everyone.

COMPLETING THE ANNOTATION

BECOMING ACQUAINTED WITH APOLLO

Page 61: Introduction to Apollo for i5k

Checklist  

Page 62: Introduction to Apollo for i5k

• Check ‘Start’ and ‘Stop’ sites.

• Check splice sites: most splice sites display these residues …]5’-GT/AG-3’[…

• Check if you can annotate UTRs, for example using RNA-Seq data:
– align it against relevant genes/gene family
– blastp against NCBI’s RefSeq or nr

• Check for gaps in the genome.

• Additional functionality may be necessary:
– merging 2 gene predictions - same scaffold
– merging 2 gene predictions - different scaffolds
– splitting a gene prediction
– annotating frameshifts
– annotating selenocysteines, correcting single-base and other assembly errors, etc.
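One checklist item, checking for gaps in the genome, can be sketched in a few lines. The assumption here is that assembly gaps appear as runs of N in the scaffold sequence; the function name `find_gaps` and the `min_len` parameter are illustrative.

```python
import re


def find_gaps(seq: str, min_len: int = 1):
    """Return (start, end) for each run of N at least min_len long.

    Coordinates are 0-based, half-open, relative to seq.
    """
    return [
        (match.start(), match.end())
        for match in re.finditer(r"[Nn]+", seq)
        if match.end() - match.start() >= min_len
    ]
```

A gap that overlaps or sits next to a gene model is a hint that exons may be missing or that the model may be fragmented.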

62 | 62

• Add:
– Important project information in the form of comments
– IDs from public databases e.g. GenBank (via DBXRef), gene symbol(s), common name(s), synonyms, top BLAST hits, orthologs with species names, and everything else you can think of, because you are the expert.
– Comments about the kinds of changes you made to the gene model of interest, if any.
– Any appropriate functional assignments, e.g. via BLAST, RNA-Seq data, literature searches, etc.

CHECKLIST for accuracy and integrity

MANUAL ANNOTATION CHECKLIST

Page 63: Introduction to Apollo for i5k

Genome curation with i5k

Page 64: Introduction to Apollo for i5k

64 i5K Workspace@NAL

The collaborative curation process at i5k

1. A computationally predicted consensus gene set has been generated using multiple lines of evidence; e.g. HVIT_v0.5.3-Models

2. i5K Projects will integrate consensus computational predictions with manual annotations to produce an updated Official Gene Set (OGS):

Achtung!
• If it’s not on either track, it won’t make the OGS!
• If it’s there and it shouldn’t be, it will still make the OGS!

Page 65: Introduction to Apollo for i5k

65

The ‘Replace Models’ rules

65

BECOMING ACQUAINTED WITH APOLLO http://tinyurl.com/apollo-i5k-replace

Page 66: Introduction to Apollo for i5k

66 i5K Workspace@NAL

3. In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation. Use your judgment, try choosing a different model to begin the annotation.

4. Isoforms: drag original and alternatively spliced form to ‘User-created Annotations’ area.

5. If an annotation needs to be removed from the consensus set, drag it to the ‘User-created Annotations’ area and label as ‘Delete’ on the Information Editor.

6. Overlapping interests? Collaborate to reach agreement.

7. Follow guidelines for i5K Pilot Species Projects, at http://goo.gl/LRu1VY

The collaborative curation process at i5k

Page 67: Introduction to Apollo for i5k

Example  

Page 68: Introduction to Apollo for i5k

What’s new?... finding inspiration in PubMed.

Example 68

“Molecular analysis of bed bug populations from across the USA and Europe found that >80% and >95% of the respective populations contained V419L and/or L925I mutations in the voltage-gated sodium channel gene, indicating widespread distribution of target-site-based pyrethroid resistance.”

Homalodisca vitripennis | Alexander Wild | www.alexanderwild.com Halyomorpha halys | Fondazione Edmund Mach - Italy

Now for our species of interest. . .

Page 69: Introduction to Apollo for i5k

Example

Example 69

Curation example using the Hyalella azteca genome (amphipod crustacean).

Page 70: Introduction to Apollo for i5k

What do we know about this genome?

• Currently publicly available data at NCBI:
– >37,000 nucleotide seqs → scaffolds, mitochondrial genes
– 344 amino acid seqs → mitochondrion
– 47 ESTs
– 0 conserved domains identified
– 0 “gene” entries submitted

• Data at i5K Workspace@NAL (annotation hosted at USDA):
– 10,832 scaffolds: 23,288 transcripts: 12,906 proteins

Example 70

Page 71: Introduction to Apollo for i5k

PubMed Search: what’s new?

Example 71

Page 72: Introduction to Apollo for i5k

PubMed Search: what’s new?

Example 72

“Ten populations (3 cultures, 7 from California water bodies) differed by at least 550-fold in sensitivity to pyrethroids.”

“By sequencing the primary pyrethroid target site, the voltage-gated sodium channel (vgsc), we show that point mutations and their spread in natural populations were responsible for differences in pyrethroid sensitivity.”

“The finding that a non-target aquatic species has acquired resistance to pesticides used only on terrestrial pests is troubling evidence of the impact of chronic pesticide transport from land-based applications into aquatic systems.”

Page 73: Introduction to Apollo for i5k

How many sequences are there, publicly available, for our gene of interest?

Example 73

And what do we know about them?

• Para (voltage-gated sodium channel alpha subunit; Nasonia vitripennis).

• NaCP60E (Sodium channel protein 60 E; D. melanogaster).
– MF: voltage-gated cation channel activity (IDA, GO:0022843).
– BP: olfactory behavior (IMP, GO:0042048), sodium ion transmembrane transport (ISS, GO:0035725).
– CC: voltage-gated sodium channel complex (IEA, GO:0001518).

Page 74: Introduction to Apollo for i5k

Retrieving sequences for a sequence similarity search.

Example 74

>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDGQMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
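Before pasting a query like the one above into a search form, it can help to confirm that it parses as FASTA and contains only standard amino acid letters. A sketch with hypothetical helper names (`parse_fasta`, `unexpected_residues`); it assumes a protein query and the 20 standard residues.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {record_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # ID is the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def unexpected_residues(protein: str) -> list:
    """Return the sorted letters that are not standard amino acids."""
    return sorted(set(protein) - AMINO_ACIDS)


if __name__ == "__main__":
    query = (
        ">vgsc-Segment3-DomainII\n"
        "RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDGQ"
        "MPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR\n"
    )
    for record_id, sequence in parse_fasta(query).items():
        print(record_id, len(sequence), unexpected_residues(sequence))
```

An empty list from `unexpected_residues` means the query contains no stray characters from copying or formatting.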

Page 75: Introduction to Apollo for i5k

BLAT search

input  

Example 75

>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDGQMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR

Page 76: Introduction to Apollo for i5k

BLAT search

results  

Example 76

• High-scoring segment pairs (hsp) are listed in tabulated format.

• Clicking on one line of results sends you to those coordinates.

Page 77: Introduction to Apollo for i5k

BLAST at i5K https://i5k.nal.usda.gov/blast

Example 77

>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDGQMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR

Page 78: Introduction to Apollo for i5k

BLAST at i5K https://i5k.nal.usda.gov/blast

Example 78

Page 79: Introduction to Apollo for i5k

BLAST at i5K: hsps in “BLAST+ Results” track

Example 79

Page 80: Introduction to Apollo for i5k

Creating a new gene model: drag and drop

Example 80

• Apollo automatically calculates longest ORF.

• In this case, ORF includes the high-scoring segment pairs (hsp), marked here in blue.

• Note that gene is transcribed from reverse strand.

Page 81: Introduction to Apollo for i5k

Available Tracks

Example 81

Page 82: Introduction to Apollo for i5k

Get Sequence

Example 82

http://blast.ncbi.nlm.nih.gov/Blast.cgi

Page 83: Introduction to Apollo for i5k

Also, flanking sequences (other gene models) vs. NCBI nr

Example 83

In  this  case,  two  gene  models  upstream,  at  5’  end.  

BLAST  hsps  

Page 84: Introduction to Apollo for i5k

Review alignments

Example 84

HaztTmpM006234  

HaztTmpM006233  

HaztTmpM006232  

Page 85: Introduction to Apollo for i5k

Hypothesis for vgsc gene model

Example 85

Page 86: Introduction to Apollo for i5k

Editing: merge the three models

Example 86

Merge by dropping an exon or gene model onto another.

or…

Merge by selecting two exons (holding down “Shift”) and using the right click menu.

Page 87: Introduction to Apollo for i5k

Result of merging the gene models:

Example 87

Page 88: Introduction to Apollo for i5k

Editing: correct offending splice site

Example 88

Modify exon / intron boundary:
– Drag the end of the exon to the nearest canonical splice site.
or
– Use right-click menu.

Page 89: Introduction to Apollo for i5k

Editing: set translation start

Example 89

Page 90: Introduction to Apollo for i5k

Editing: delete exon not supported by evidence

Example 90

Delete  first  exon  from  HaztTmpM006233  

Page 91: Introduction to Apollo for i5k

Editing: add an exon supported by RNAseq

Example 91

• RNAseq reads show evidence in support of transcribed product, which was not predicted.
• Add exon at coordinates 97946-98012 by dragging up one of the RNAseq reads.

Page 92: Introduction to Apollo for i5k

Editing: adjust offending splice site using evidence

Example 92

Page 93: Introduction to Apollo for i5k

Editing: adjust other boundaries supported by evidence

Example 93

Page 94: Introduction to Apollo for i5k

Finished model

Example 94

Corroborate integrity and accuracy of the model:
– Start and Stop
– Exon structure and splice sites …]5’-GT/AG-3’[…
– Check the predicted protein product vs. NCBI nr, UniProt, etc.

Page 95: Introduction to Apollo for i5k

Information Editor

• DBXRefs: e.g. NP_001128389.1, N. vitripennis, RefSeq

• PubMed identifier: PMID: 24065824

• Gene Ontology IDs: GO:0022843, GO:0042048, GO:0035725, GO:0001518.

• Comments

• Name, Symbol

• Approve / Delete radio button
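Identifiers typed into the Information Editor are easy to mistype, so a quick format check before entering them can help. A sketch, assuming GO IDs follow the pattern "GO:" plus seven digits and PubMed IDs are plain digit strings; the function name `check_annotation_metadata` is hypothetical and the check is not part of Apollo.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")


def check_annotation_metadata(go_ids, pmid):
    """Return a list of human-readable problems (empty if none found)."""
    problems = [
        f"malformed GO ID: {go_id}"
        for go_id in go_ids
        if not GO_ID.fullmatch(go_id)
    ]
    if not pmid.isdigit():
        problems.append(f"malformed PubMed ID: {pmid}")
    return problems
```

This only checks the shape of each identifier; whether a term is appropriate for the gene model remains the annotator's call.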

Example 95

Comments  (if  applicable)  

Page 96: Introduction to Apollo for i5k

Go  play!  

Page 97: Introduction to Apollo for i5k

PUBLIC DEMO 97 | 97

APOLLO ON THE WEB: instructions

At i5K

1. Register for access to Apollo at the i5K Workspace@NAL at https://i5k.nal.usda.gov/web-apollo-registration

2. Contact the coordinator for each species community to receive more information about how to contribute. Contact info is available on each organism’s page.

Page 98: Introduction to Apollo for i5k

PUBLIC DEMO 98 | 98

APOLLO ON THE WEB: instructions

Public Honey bee demo available at: http://GenomeArchitect.org/WebApolloDemo

Username: [email protected]

Password: demo

Page 99: Introduction to Apollo for i5k

APOLLO: demonstration

PUBLIC DEMO

Demonstration video is available at https://youtu.be/VgPtAP_fvxY

Page 100: Introduction to Apollo for i5k

OUTLINE

Apollo Collaborative Curation and Interactive Analysis of Genomes

• BIO-REFRESHER: biological concepts for curation
• ANNOTATION: automatic predictions
• MANUAL ANNOTATION: necessary, collaborative
• APOLLO: advancing collaborative curation
• EXAMPLE: demos

Page 101: Introduction to Apollo for i5k

Apollo Development

Nathan Dunn Eric Yao

Christine Elsik’s Lab, University of Missouri

Suzi Lewis Principal Investigator

BBOP

Moni Munoz-Torres Colin Diesh Deepak Unni

JBrowse. Ian Holmes’ Lab University of California, Berkeley

Page 102: Introduction to Apollo for i5k

•  Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Apollo and Gene Ontology teams. Suzanna E. Lewis (PI).

•  § Christine G. Elsik (PI). University of Missouri.

•  * Ian Holmes (PI). University of California Berkeley.

•  Arthropod genomics community & i5K Steering Committee.

•  Stephen Ficklin, GenSAS, Washington State University

•  Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI. Also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231

•  For your attention, thank you!

Apollo Nathan Dunn Colin Diesh § Deepak Unni §

Gene Ontology

Chris Mungall

Seth Carbon

Heiko Dietze

BBOP

Learn more about Apollo at http://GenomeArchitect.org

Thank you!

NAL at USDA

Monica Poelchau

Mei-Ju Chen

Christopher Childers

Gary Moore

HGSC at BCM

Stephen Richards

Kim Worley

JBrowse Eric Yao *

Page 103: Introduction to Apollo for i5k