UNIVERSITY OF CALIFORNIA

Web Apollo Tutorial for Medfly Research Community

This is an introduction tutorial for the Medfly research community.

Page 1: Web Apollo Tutorial for Medfly Research Community

Page 2: Web Apollo Tutorial for Medfly Research Community

An introduction to Web Apollo. A webinar for the Ceratitis capitata research community.

Monica Munoz-Torres, PhD | @monimunozto

Berkeley Bioinformatics Open-Source Projects (BBOP), Genomics Division, Lawrence Berkeley National Laboratory

15 July, 2014

Page 3: Web Apollo Tutorial for Medfly Research Community

Outline

1.  What is Web Apollo?

• Definition & working concept.

2.  Our Experience With Community Based Curation.

3.  The Manual Annotation Process.

4.  Becoming acquainted with Web Apollo.

Page 4: Web Apollo Tutorial for Medfly Research Community

During this webinar you will:

•  Learn to identify homologs of known genes of interest in your newly sequenced genome.

•  Become familiar with the environment and functionality of the Web Apollo genome annotation editing tool.

•  Receive a brief introduction to the resources available for the Ceratitis capitata genome.

Page 5: Web Apollo Tutorial for Medfly Research Community

What is Web Apollo?

•  Web Apollo is a web-based, collaborative genomic annotation editing platform.

We need annotation editing tools to modify and refine the precise location and structure of the genome elements that predictive algorithms cannot yet resolve automatically.

Find more about Web Apollo at http://GenomeArchitect.org and in Genome Biol 14:R93 (2013).

Page 6: Web Apollo Tutorial for Medfly Research Community

Brief history of Apollo*:

a. Desktop: one person at a time editing a specific region, annotations saved in local files; slowed down collaboration.

b. Java Web Start: users saved annotations directly to a centralized database; potential issues with stale annotation data remained.

Biologists could finally visualize computational analyses and experimental evidence from genomic features and build manually-curated consensus gene structures. Apollo became a very popular, open source tool (insects, fish, mammals, birds, etc.).

Page 7: Web Apollo Tutorial for Medfly Research Community

Web Apollo •  Browser-based tool integrated with JBrowse.

•  Two new tracks: “Annotation” and “DNA Sequence”

•  Allows for intuitive annotation creation and editing, with gestures and pull-down menus to create and modify transcript and exon structures, insert comments (CV, freeform text), etc.

•  Customizable look & feel.

•  Edits in one client are instantly pushed to all other clients: Collaborative!

Page 8: Web Apollo Tutorial for Medfly Research Community

Working Concept

In the context of gene manual annotation, curation tries to find the best examples and/or eliminate most errors.

To conduct manual annotation efforts:

•  Gather and evaluate all available evidence using quality-control metrics to corroborate or modify automated annotation predictions.

•  Perform sequence similarity searches (phylogenetic framework) and use literature and public databases to:

-  Predict functional assignments from experimental data.

-  Distinguish orthologs from paralogs, and classify gene membership in families and networks.

Automated gene models

Evidence: cDNAs, HMM domain searches, alignments with assemblies or genes from other species.

Manual annotation & curation

Page 9: Web Apollo Tutorial for Medfly Research Community

Dispersed, community-based gene manual annotation efforts.

We continuously train and support hundreds of geographically dispersed scientists from many research communities to perform biologically supported manual annotations using Web Apollo.

– Gatekeepers and monitoring.

– Written tutorials.

– Training workshops and geneborees.

– Personalized user support.

Page 10: Web Apollo Tutorial for Medfly Research Community

What we have learned.

By harvesting expertise from dispersed researchers who assigned functions to predicted and curated peptides, we have developed more interactive and responsive tools, as well as better visualization, editing, and analysis capabilities.

http://people.csail.mit.edu/fredo/PUBLI/Drawing/

Page 11: Web Apollo Tutorial for Medfly Research Community

Collaborative Efforts Improved Automated Annotations*

In many cases, automated annotations have been improved (e.g., Apis mellifera; Elsik et al. BMC Genomics 2014, 15:86).

We also learned about the challenges posed by newer sequencing technologies, e.g.:

– Frameshifts and indel errors

– Split genes across scaffolds

– Highly repetitive sequences

To face these challenges, we train annotators in recovering coding sequences in agreement with all available biological evidence.

Page 12: Web Apollo Tutorial for Medfly Research Community

It is helpful to work together. Scientific community efforts bring together domain-specific and natural history expertise that would otherwise remain disconnected.

Breaking down large amounts of data into manageable portions and mobilizing groups of researchers to extract the most accurate representation of the biology from all available data distills invaluable knowledge from genome analysis.

Page 13: Web Apollo Tutorial for Medfly Research Community

Understanding the evolution of sociality

Comparing the genomes of 7 species of ants contributed to a better understanding of the evolution and organization of insect societies at the molecular level.

Insights drawn mainly from six core aspects of ant biology:

1.  Alternative morphological castes

2.  Division of labor

3.  Chemical communication

4.  Alternative social organization

5.  Social immunity

6.  Mutualism

Libbrecht et al. Genome Biology 2013, 14:212.

Atta cephalotes (above) and Harpegnathos saltator. ©alexanderwild.com

Groups of communities continue to guide our efforts.

Page 14: Web Apollo Tutorial for Medfly Research Community

A little training goes a long way!

With the right tools, wet lab scientists make exceptional curators who can easily learn to maximize the generation of accurate, biologically supported gene models.

Page 15: Web Apollo Tutorial for Medfly Research Community

Manual Annotation

How do we get there?

In a genome sequencing project…

Stages shown: Assembly, Automated Annotation, Manual annotation, Experimental validation.

Page 16: Web Apollo Tutorial for Medfly Research Community

Gene Prediction

Identification of protein-coding genes, tRNAs, rRNAs, regulatory motifs, repetitive elements (masked), etc.

- Ab initio (DNA composition): Augustus, GENSCAN, geneid, fgenesh

- Homology-based: e.g., SGP2, fgenesh++

Nucleic Acids 2003 vol. 31 no. 13 3738-3741

Page 17: Web Apollo Tutorial for Medfly Research Community

Gene Annotation

Integration of data from prediction tools to generate a consensus set of predictions or gene models.

•  Models may be organized using:

-  automatic integration of predicted sets; e.g., GLEAN

-  packaging necessary tools into a pipeline; e.g., MAKER

•  All available biological evidence (e.g. transcriptomes) further informs the annotation process.

In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation; in such cases it is usually better to use an ab initio model to create a new annotation.

Page 18: Web Apollo Tutorial for Medfly Research Community

Manual Genome Annotation

•  Identifies elements that best represent the underlying biology.

•  Eliminates elements that reflect the systemic errors of automated genome analyses.

•  Determines functional roles through comparative analysis of well-studied, phylogenetically similar genome elements using literature, databases, and the researcher’s experience.

Page 19: Web Apollo Tutorial for Medfly Research Community

Curation Process is Necessary

1.  A computationally predicted consensus gene set is generated using multiple lines of evidence.

2.  Manual annotation takes place.

3.  Ideally consensus computational predictions will be integrated with manual annotations to produce an updated Official Gene Set (OGS).

Otherwise, “incorrect and incomplete genome annotations will poison every experiment that uses them”.

- M. Yandell.

Page 20: Web Apollo Tutorial for Medfly Research Community

The Collaborative Curation Process at i5K

1) A computationally predicted consensus gene set has been generated using multiple lines of evidence; e.g. JAMg Consensus Gene Set v1.

2) i5K Projects will integrate consensus computational predictions with manual annotations to produce an updated Official Gene Set (OGS):

»  If it’s not on either track, it won’t make the OGS!

»  If it’s there and it shouldn’t be, it will still make the OGS!

Page 21: Web Apollo Tutorial for Medfly Research Community

Consensus set: reference and start point

•  In some cases the algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation; in such cases, use another model (e.g. the Augustus model) instead to create a new annotation.

•  Isoforms: drag the original and the alternatively spliced form to the ‘User-created Annotations’ area.

•  If an annotation needs to be removed from the consensus set, drag it to the ‘User-created Annotations’ area and label it as ‘Delete’ in the Information Editor.

•  Overlapping interests? Collaborate to reach agreement.

•  Follow guidelines for i5K Pilot Species Projects as shown at http://goo.gl/LRu1VY and download the MedFly Annotation guide from http://goo.gl/YY0tNw

Page 22: Web Apollo Tutorial for Medfly Research Community

Web Apollo

Page 23: Web Apollo Tutorial for Medfly Research Community

The Sequence Selection Window

Screenshot label: Sort

Page 24: Web Apollo Tutorial for Medfly Research Community

Navigation tools: pan and zoom

Search box: go to a scaffold or a gene model.

Grey bar of coordinates indicates location. You can also select here in order to zoom to a sub-region.

‘View’: change color by CDS, toggle strands, set highlight.

‘File’: Upload your own evidence: GFF3, BAM, BigWig, VCF*. Add combination and sequence search tracks.

‘Tools’: Use BLAT to query the genome with a protein or DNA sequence.

Available Tracks

Evidence Tracks Area

‘User-created Annotations’ Track

Login

Graphical User Interface (GUI) for editing annotations

Page 25: Web Apollo Tutorial for Medfly Research Community

Flags non-canonical splice sites.

Selection of features and sub-features

Edge-matching

Evidence Tracks Area

‘User-created Annotations’ Track

The editing logic in the server:

§  selects longest ORF as CDS

§  flags non-canonical splice sites
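The two server-side checks named above can be illustrated with a minimal Python sketch. This is not Web Apollo's own code: the function names, the restriction to ATG-to-stop ORFs on the forward strand, and the GT...AG definition of a canonical intron are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript):
    """Return (start, end) of the longest ATG-to-stop ORF, or None.

    Coordinates are 0-based, end-exclusive, on the spliced transcript.
    Only the forward strand is scanned (an assumption of this sketch).
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


def noncanonical_introns(genome, exons):
    """Return introns whose first and last two bases are not GT...AG.

    `exons` is a sorted list of 0-based, end-exclusive (start, end) pairs
    on the forward strand of `genome` (hypothetical input layout).
    """
    flagged = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        donor = genome[left_end:left_end + 2].upper()
        acceptor = genome[right_start - 2:right_start].upper()
        if (donor, acceptor) != ("GT", "AG"):
            flagged.append((left_end, right_start))
    return flagged
```

For example, `longest_orf("ATGAAATAG")` returns `(0, 9)`. A fuller implementation would also need to handle the reverse strand and ORFs that run off the end of the transcript.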


Page 26: Web Apollo Tutorial for Medfly Research Community

DNA Track

‘User-created Annotations’ Track

There are two new kinds of tracks, for:

§  annotation editing

§  sequence alteration editing

Page 27: Web Apollo Tutorial for Medfly Research Community

Annotations, annotation edits, and History: stored in a centralized database.

Page 28: Web Apollo Tutorial for Medfly Research Community

The Information Editor

•  DBXRefs

•  PubMed IDs

•  GO terms

•  Comments

Page 29: Web Apollo Tutorial for Medfly Research Community

Additional Functionality

In addition to the protein-coding gene annotation that you know and love:

•  Non-coding genes: ncRNAs, miRNAs, repeat regions, and TEs

•  Sequence alterations (less coverage = more fragmentation)

•  Visualization of stage and cell-type specific transcription data as coverage plots, heat maps, and alignments

Page 30: Web Apollo Tutorial for Medfly Research Community

Webservices & additional tools

•  Alignments - Jalview

•  BLAST - blastp

•  Signal Peptide – search using signalP.

•  Just_Annotate_My_proteins: Pick a Gene Ontology, Enzyme, KEGG, etc. term and it gives you a list of genes that have a significant Hidden Markov Model alignment to a SwissProt protein (i.e. only real proteins that have been validated) and that have real experimental evidence (i.e. from the literature) for that term.

The search is conservative and does not allow IEA evidence codes to avoid possibly propagating annotation errors. However, the search is run twice: first every annotated gene is searched against SwissProt. Then a profile alignment is created with the good matches and searched again.
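The conservative filter described above can be sketched in a few lines of Python. This is only an illustration of excluding IEA (Inferred from Electronic Annotation) records, not the tool's own code: the function name, the tuple layout, and the sample records are hypothetical, and only the six standard experimental GO evidence codes are accepted.

```python
# GO evidence codes that denote direct experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def experimentally_supported(annotations, term):
    """Return protein IDs annotated to `term` with experimental evidence.

    `annotations` is an iterable of (protein_id, term_id, evidence_code)
    tuples; IEA and every other non-experimental code is dropped.
    """
    hits = []
    for protein_id, term_id, evidence in annotations:
        if term_id == term and evidence in EXPERIMENTAL_CODES:
            if protein_id not in hits:  # keep each protein once, in order
                hits.append(protein_id)
    return hits


# Hypothetical records, for illustration only.
records = [
    ("P1", "GO:0004984", "IDA"),
    ("P2", "GO:0004984", "IEA"),
    ("P3", "GO:0004984", "IMP"),
    ("P1", "GO:0005886", "IEA"),
]
```

With these records, `experimentally_supported(records, "GO:0004984")` returns `["P1", "P3"]`; the IEA-only protein `P2` is excluded, so an electronically inferred annotation cannot be propagated further.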

Page 31: Web Apollo Tutorial for Medfly Research Community

General Process of Curation

1.  Select a chromosomal region of interest, e.g. scaffold.

2.  Select appropriate evidence tracks.

3.  Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.

-  If yes: select and drag the feature to the ‘User-created Annotations’ area, creating an initial gene model. If necessary use editing functions to adjust the gene model.

-  If not: let’s talk.

4.  Check your edited gene model for integrity and accuracy by comparing it with available homologs.

Always remember: when annotating gene models using Web Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.

Page 32: Web Apollo Tutorial for Medfly Research Community

There are a number of ways to find the gene region you wish to annotate. It depends on what you’re starting with:

a)  The protein sequence from another species

b)  Sequence from a similar gene

c)  You provided Alexie with golden genes and he provided back alignments

d)  You provided Alexie with high quality proteins and/or gene family alignments (multi or single species) and he created domain annotations.

So how do I start curating!?

Option 1 – You have a sequence but don’t know where it is in this genome

1.  You will need to BLAT it

2.  If protein-based BLAT doesn’t find it, you can BLAST it

3.  You can use the i5k BLAST server here: http://i5k.nal.usda.gov/blastn

4.  Or you can use any other tool, for example Geneious

Option 2 – The genome has already been annotated with your sequences and you have an ID

1.  In other words, someone has told you where to look: if you give Alexie profile alignments of your favorite gene family we could do that for you.

2.  Type the ID in the Search box of Web Apollo

•  Web Apollo autocompletes using a case-insensitive search anchored on the left-hand side of the word, e.g. HaGR will show all hagr objects (up to 10)

3.  Choose one of the genes and click Go

You can do that with Domains, Alignments or Gene names provided to you.
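The autocomplete behaviour described in Option 2 (case-insensitive, anchored at the start of the name, at most 10 results) can be sketched as follows. This is an illustration of the matching rule only; the function name and the sample names are hypothetical and it is not Web Apollo's search code.

```python
def autocomplete(query, names, limit=10):
    """Return up to `limit` names that start with `query`, ignoring case."""
    prefix = query.casefold()
    matches = [name for name in names if name.casefold().startswith(prefix)]
    return matches[:limit]
```

For example, `autocomplete("HaGR", ["hagr1", "HaGR2", "xHaGR", "HAGR10"])` returns `["hagr1", "HaGR2", "HAGR10"]`: `xHaGR` is skipped because the match must begin at the left-hand side of the name.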

Option 3 – Get genes based on a GO / EC etc. term

This is a fun, new tool Alexie has made, called Just_Annotate_My_proteins.

Page 33: Web Apollo Tutorial for Medfly Research Community

Example

Live Demonstration using the Apis mellifera genome.

A public Honey Bee Web Apollo Demo is available at http://genomearchitect.org/WebApolloDemo

Page 34: Web Apollo Tutorial for Medfly Research Community

Arthropodcentric Thanks!

AgriPest Base, FlyBase, Hymenoptera Genome Database, VectorBase

Acromyrmex echinatior, Acyrthosiphon pisum, Apis mellifera, Atta cephalotes, Bombus terrestris, Camponotus floridanus, Helicoverpa armigera, Linepithema humile, Manduca sexta, Mayetiola destructor, Nasonia vitripennis, Pogonomyrmex barbatus, Solenopsis invicta, Tribolium castaneum… and you!

Page 35: Web Apollo Tutorial for Medfly Research Community

Thanks!

•  Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Web Apollo and Gene Ontology teams. Suzanna E. Lewis (PI).

•  Christine G. Elsik (PI). § University of Missouri.

•  Ian Holmes (PI). * University of California Berkeley.

•  Arthropod genomics community, i5K Steering Committee, Alexie Papanicolaou at CSIRO, Monica Poelchau at USDA/NAL, fringy Richards at HGSC-BCM, Oliver Niehuis at 1KITE http://www.1kite.org/, BGI, and the Honey Bee Genome Sequencing Consortium.

•  Web Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI, and by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

•  Insect images used with permission: http://AlexanderWild.com and O. Niehuis.

•  For your attention, thank you!

Web Apollo: Gregg Helt, Ed Lee, Colin Diesh §, Deepak Unni §, Rob Buels *

Gene Ontology: Chris Mungall, Seth Carbon, Heiko Dietze

BBOP

Web Apollo: http://GenomeArchitect.org

GO: http://GeneOntology.org

i5K: http://arthropodgenomes.org/wiki/i5K