The i5k Workspace@NAL: a pan-Arthropoda Genome Database
Chris Childers and Monica Poelchau, USDA-ARS, National Agricultural Library
Outline
• Background and overview
• Why join the i5k Workspace?
• What do we need for a project?
• What do we do with your data?
• What don’t we do with your data?
• Our new system for submitting projects and data
Background
• The i5k initiative tasked itself with coordinating the sequencing and assembly of 5000 insect or related arthropod genomes
• International effort to prioritize insect genomes for sequencing; provide guidelines for genome sequencing and curation; and seek funding.
• The i5k Workspace@NAL is available to help any i5k (arthropod) project with genome hosting needs
Genome Project Trajectory
• Research plan
• Generate material for sequencing
• Genome sequencing
• Genome assembly
• Automated annotation of genome assembly
• Biological insights/Publication
• Manual curation
• Official gene set (OGS) generation
• Genome project maintenance
Workspace Project Basics
• The i5k Workspace centers around projects.
  • A project is a collection of data based on the genome assembly of an arthropod
  • All data is used in the context of the genome assembly
• Each project has a project coordinator.
  • Serves as the point of contact for questions about the project
  • Main responsibility: approve or reject new Apollo users
• All of our data is user-submitted
Why join the i5k Workspace?
• Gain access to a large diverse community
  • A diversity of organisms
    • 58 species and counting
    • 20% of the arthropods with genome assemblies at NCBI
  • Large user community with many different interests
    • People versed in the biology of specific systems
    • Experts in a species or group of species
• A common interface for accessing data, tools and search
• Detailed policies on data and project management
  • Helpful if you have data management requirements
  • Data management
    • https://i5k.nal.usda.gov/data-management-policy
  • Long-term project management
    • https://i5k.nal.usda.gov/long-term-i5k-workspace-project-management
What do we need for a project?
• Your project metadata
  • Information about your organism
  • Metadata for submitted data files (the more the better)
    • What tools or methods were used
    • Software versions and options set
    • When and where the data were generated
    • Other information (location collected, life-stage, etc.)
• Your data files
  • Genome assembly needs to be in GenBank/ENA/DDBJ
  • Data should be open access (no private repositories)
  • Additional datasets need to be mapped to the same assembly
What do we do with your data?
• Create resources
  • Organism and gene pages
  • Data downloads
• Integrate your data with our tools
  • Genome browser
  • BLAST, Clustal, HMMer
  • Apollo for gene curation
• Offer post-curation services
  • Annotation QC and Official Gene Set (OGS) creation
  • Update gene pages, Apollo, BLAST with OGS
Workspace@NAL overview (https://i5k.nal.usda.gov/)
Submission
• ‘Frozen’ genome assembly
• Automated annotations
• Ancillary data files (e.g. RNA-Seq alignments)
Resources
• Organism Information Page
• Bulk data downloads
• Tutorials
Tools
• Custom BLAST interface
• Apollo manual curation tool
• JBrowse genome browser
• HMMer, Clustal
Services
• Manual annotation quality control
• Official gene set generation
Challenges
• Non-standard data formatting
• Failure to submit all metadata (e.g. sample origin; analysis methods)
What don’t we do with your data?
• Computationally intense analyses such as
  • Gene prediction
  • Raw RNA-Seq mapping
• We are not a long-term archive or repository
  • NCBI
  • Ag Data Commons
  • Dryad Digital Repository
  • CyVerse Data Commons
  • Many other options available
Criteria for starting a project
• You need to have an arthropod genome assembly, accessioned by NCBI (or another INSDC member)
  • Using GenBank's accession numbers avoids confusion about assembly version
  • The GenBank contamination screen improves the assembly quality
  • Using a stable assembly is beneficial for the labor-intensive community annotation process
Other things to consider before submitting
• All data submitted to the i5k Workspace is public.
  • However, we do state whether Ft. Lauderdale/Toronto agreements of data sharing should apply
• Is your genome an ‘orphan’, or is there another suitable database?
  • We can host genomes that are already hosted elsewhere, and actively communicate with other database providers
  • All manual annotation efforts need to be at one database
Getting an account
• Apply for a dataset submission account: https://i5k.nal.usda.gov/register/project-dataset/account
• Once your account is approved, you can submit projects, assemblies or other datasets
Start an i5k Workspace Project
• Log in
  • https://i5k.nal.usda.gov/user
• From menu, select ‘Data -> Submit data -> Request a new i5k Workspace Project’
  • https://i5k.nal.usda.gov/datasets/request-project
• We’ll review your submission and will get in touch with you
Submit your genome assembly
• All information submitted through this form will be re-formatted for display at the i5k Workspace (except for email address and file checksum)
• From menu, select ‘Data -> Submit data -> Submit a genome assembly’
  • https://i5k.nal.usda.gov/datasets/assembly-data
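The submission form asks for a file checksum. A minimal sketch of computing one before submitting, assuming an MD5 digest is acceptable (the slides do not say which algorithm the form expects); `file_checksum` is a hypothetical helper name, not part of any i5k tool:

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks so large assemblies fit in memory.

    The default algorithm (MD5) is an assumption; pass e.g. "sha256" if required.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use constant regardless of assembly size.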
Submit gene predictions
• All information submitted through this form will be re-formatted for display at the i5k Workspace (except for email address and file checksum)
• Under menu bar, select ‘Data -> Submit data -> Submit Gene Predictions’
  • https://i5k.nal.usda.gov/datasets/gene-prediction
Submit mapped datasets
• All information submitted through this form will be re-formatted for display at the i5k Workspace (except for email address and file checksum)
• Under menu bar, select ‘Data -> Submit data -> Submit a Mapped Dataset’
  • https://i5k.nal.usda.gov/datasets/mapped
Send us your files
• There are currently five ways to share files with us:
  1. Use our data submission forms
  2. Transmit the file via ftp (only for files <2 Gb)
  3. Email it to us (for files <25 Mb only)
  4. Provide us with a URL, if available
  5. Upload the file to CyVerse and share with our organization, “NAL Bioinformatics”
• We prefer that you share your files with us via our data submission forms.
• For more information, see https://i5k.nal.usda.gov/content/sharing-files-us
Other resources at the NAL: the Ag Data Commons
• Hosts any dataset funded by the USDA
  • Landing page
  • Citable DOI
  • https://data.nal.usda.gov/
  • 9 i5k datasets already available
Need more information?
i5k Workspace@NAL:
• https://i5k.nal.usda.gov/
• https://github.com/NAL-i5K/
The i5k initiative:
• New website: http://i5k.github.io/
Official Gene Set creation at the i5k Workspace
• Official Gene Set definition
• Our OGS generation process
  • Manual and community annotation
  • Quality control
  • Merge
  • Release
• Examples and future directions of the OGS generation process
The Official Gene Set – what is it?
• Loose definition: The best known representation of gene models for a genome assembly
• When the i5k Workspace generates an OGS, this is a merge between one gene set (usually computationally predicted), and a set of manually validated annotations (usually from the Apollo software)
Why generate an Official Gene Set?
• This depends on your genome community’s needs.
• If several groups want to perform downstream analyses, it helps to have an authoritative ‘reference gene set’ for your community, rather than multiple competing gene sets
Our OGS generation process
• New public version of program is available: https://github.com/NAL-i5K/GFF3toolkit (Mei-Ju Chen, Li-Mei Chiang)
• The full process is time-consuming, but we are generally available to perform OGS generation for i5k Workspace projects
1. Manual annotation (via Apollo), then manual annotation freeze
2. Error checking and curator fixes
3. Merge with one designated gene set
4. Release Official Gene Set
1. Manual and community annotation
What is manual annotation?
• Manual review and improvement of an existing gene prediction
• Often, but not always: drawing on external evidence (e.g. RNA-Seq, cDNA, genes from other species) to improve a computationally predicted gene model
Structural annotation – e.g. modify exons
Functional annotation – e.g. add name
1. Manual and community annotation
Why manually annotate?
• “Incorrect annotations poison every experiment that makes use of them… Worse still, the poison spreads because incorrect annotations from one organism are often unknowingly used by other projects to help annotate their own genomes.”
  • Yandell and Ence 2012, doi:10.1038/nrg3174
• Link gene models to existing literature and ontologies, providing richer data
• One current ‘model’ of the genome paper often draws heavily from insights confirmed by manual annotation
1. Manual and community annotation
• What is community annotation?
  • Scientists collectively examine and improve gene models (usually computationally predicted)
• Community annotation at the i5k Workspace:
  • Access to a large community of curators
  • Tutorials, guidelines, webinars
  • Registration mechanism for new annotators
  • One-on-one support
  • Over 400 registered annotators have curated over 10,000 gene models using the Apollo software
1. Manual and community annotation – i5k pilot example
Figure: Number of curators per organism. Community size varies among organisms.
Figure: Number of organisms per curator. 35% of curators worked on more than one organism.
1. Manual and community annotation – i5k pilot example
• Three organisms that completed the manual annotation process needed similar proportions of structural changes to their computationally predicted gene models
• Computationally predicted genes often have inaccurate gene structures
• Community annotation can effectively improve gene sets
Organism | Total number of manually annotated models | Proportion of manually annotated models with structural changes
Anoplophora glabripennis [6] | 1144 | 0.75
Cimex lectularius [7] | 1354 | 0.76
Oncopeltus fasciatus | 1518 | 0.76
2. OGS generation – Quality Control
• Manual curation can introduce many errors, even using standard software packages (e.g. Apollo)
• QC program identifies common formatting errors from the manual curation process
  • GitHub repo: https://github.com/NAL-i5K/GFF3toolkit
  • Identifies over 50 error types
• Another in-house pipeline corrects many of these errors
2. OGS generation – Quality Control
• Requires some manual review – can’t be completely automated
  • e.g. did you name your gene model ‘test’ or ‘Contig277’?
• Note that i5k Workspace staff aren’t ‘curators’ in the traditional sense – we do not review the biological validity of any of the community-annotated models.
• The degree of manual review of community annotations is higher if Official Gene Sets are to be submitted to NCBI
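A check for placeholder names such as ‘test’ or ‘Contig277’ can be partly automated to shortlist models for manual review. The sketch below is illustrative only: the pattern and the function name `flag_placeholder_names` are assumptions, not the rules GFF3toolkit or the in-house pipeline actually apply.

```python
import re

# Hypothetical pattern: names that look like placeholders or sequence IDs
# rather than gene names (e.g. "test", "test2", "Contig277", "scaffold12").
PLACEHOLDER_NAME = re.compile(r"^(test\d*|contig\d+|scaffold\d+)$", re.IGNORECASE)


def flag_placeholder_names(names_by_model):
    """Return the sorted IDs of models whose name matches the placeholder pattern."""
    return sorted(
        model_id
        for model_id, name in names_by_model.items()
        if PLACEHOLDER_NAME.match(name.strip())
    )
```

A flagged model still needs a person to decide on the right name, which is why this step cannot be fully automated.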
2. OGS generation – Quality Control
• Diaphorina citri example (Database, doi:10.1093/database/bax032)
  • First round of corrections for community curation:
    • 513 errors in 587 manually annotated models
    • 397 of these errors needed curator feedback
  • Second round of corrections:
    • 15 errors needed annotator feedback
Diagram: error checking and curator fixes
3. OGS generation – Merge
The GFF3toolkit Merge program can identify which gene models in the ‘reference’ gene set should be replaced by gene models in a second gene set (i.e. the manually annotated models) via ‘auto-assignment’
Figure labels: reference gene; manually annotated gene
3. OGS generation – Merge
• Auto-assignment uses both sequence similarity and coordinate overlap
  • Extract CDS and pre-mRNA sequences from mRNA features from both gene sets.
  • Use blastn to determine which sequences from the modified and reference gene set align to each other in their coding sequence.
    • These parameters are used: -evalue 1e-10 -penalty -15 -ungapped
  • If two models pass the alignment step, check that matched models also have coordinate overlap
  • Add a ‘Replace Tag’ with the ID of each overlapping model to the modified gene set.
  • If no reference model overlaps with a new model, then the program will add 'replace=NA'.
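The auto-assignment logic above can be sketched in a few lines. This is a simplified illustration, not GFF3toolkit's implementation: the `Model` class, `blastn_args`, `overlaps` and `assign_replace_tags` are hypothetical names, the blastn hits are assumed to be precomputed ID pairs, and overlap is tested only on sequence ID and coordinates. Only the three blastn parameters come from the slide; the use of `-query`/`-subject` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    model_id: str
    seqid: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int


def blastn_args(query_fasta, subject_fasta):
    """Build a blastn argument list with the parameters quoted on the slide."""
    return ["blastn", "-query", query_fasta, "-subject", subject_fasta,
            "-evalue", "1e-10", "-penalty", "-15", "-ungapped"]


def overlaps(a, b):
    """True if two models share a sequence and at least one base."""
    return a.seqid == b.seqid and a.start <= b.end and b.start <= a.end


def assign_replace_tags(modified, reference, blast_hits):
    """Map each modified model ID to its replace tag.

    blast_hits is a set of (modified_id, reference_id) pairs that passed
    the alignment step; a pair must also overlap in coordinates to count.
    """
    tags = {}
    for m in modified:
        replaced = [r.model_id for r in reference
                    if (m.model_id, r.model_id) in blast_hits and overlaps(m, r)]
        tags[m.model_id] = "replace=" + (",".join(replaced) if replaced else "NA")
    return tags
```

Requiring both an alignment hit and coordinate overlap avoids pairing a curated model with a paralog elsewhere in the assembly.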
3. OGS generation – Merge
• The program determines merge actions based on the Replace Tags:
  1. deletion
  2. simple replacement
  3. new addition
  4. split replacement
  5. merge replacement
• Models from modified manual annotations replace models from reference annotations based on merge actions in step 2.
Figure (merge replacement): reference gene; updated gene
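One way to read the five merge actions is as a classification over the Replace Tags. The sketch below rests on assumptions the slides do not spell out: a model replacing several reference models is a merge, several models replacing the same reference model are a split, and deletions are flagged separately through a hypothetical `marked_for_deletion` set. `classify_merge_actions` is an illustrative name, not a GFF3toolkit function.

```python
from collections import Counter


def classify_merge_actions(replace_tags, marked_for_deletion=frozenset()):
    """Assign a merge action to each modified model.

    replace_tags maps a modified model ID to the list of reference IDs it
    replaces; an empty list stands for 'replace=NA'.
    """
    # How many modified models claim each reference model
    claims = Counter(ref for refs in replace_tags.values() for ref in set(refs))
    actions = {}
    for model_id, refs in replace_tags.items():
        if model_id in marked_for_deletion:
            actions[model_id] = "deletion"
        elif not refs:
            actions[model_id] = "new addition"
        elif len(set(refs)) > 1:
            actions[model_id] = "merge replacement"
        elif claims[refs[0]] > 1:
            actions[model_id] = "split replacement"
        else:
            actions[model_id] = "simple replacement"
    return actions
```

Mixed cases, such as a model that merges two reference genes while another model also claims one of them, are reported as merges here; a real pipeline would need an explicit rule for them.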
3. OGS generation – Merge
• Diaphorina citri example (Database, doi:10.1093/database/bax032)
  1. # genes deleted: 1
  2. # genes with simple replacement: 437
  3. # genes added: 72
  4. # genes split: 38
  5. # genes merged: 31
  6. Total number of genes in OGS: 20,217
3. OGS generation – Merge
• Other software tools can be used to merge gene sets
  • Combiner tools that use ‘weights’ for different input annotations, e.g.
    • EVidenceModeler (EVM, https://evidencemodeler.github.io/)
    • Glean (https://sourceforge.net/projects/glean-gene/)
  • Other overlap-based replacement tools, e.g. Bedtools intersect (http://bedtools.readthedocs.io/en/latest/)
4. OGS generation – Release OGS
• Generate new or maintain old gene model IDs
• Establish release date with genome coordinator
• Generate fasta files
• Add to i5k Workspace@NAL database
• *Submit to NCBI if requested by genome coordinator*
Completed OGS projects using i5k Workspace’s pipeline
• Diaphorina citri OGSv1.0
• Frankliniella occidentalis OGSv1.0
• Hyalella azteca OGSv1.0
• Oncopeltus fasciatus OGSv1.2
• Athalia rosae OGSv1.0
• Orussus abietinus OGSv1.0
• Leptinotarsa decemlineata OGSv1.0
Future updates
• Current improvements:
  • GFF3toolkit support for QC and merge of non-coding transcripts (Li-Mei Chiang)
• Future work:
  • Improve methods for merging multi-isoform models
  • Improve QC process – how to improve communications about errors with annotators
Questions?
i5k Workspace@NAL:
• https://i5k.nal.usda.gov/
• https://github.com/NAL-i5K/
• GFF3toolkit issue tracker: https://github.com/NAL-i5K/GFF3toolkit/issues
• Email: [email protected]
Thank you!
The NAL Team
• Yu-yu Lin
• Chaitanya Gutta
• Li-Mei Chiang
• Yi Hsiao
• Gary Moore
• Susan McCarthy
i5k Workspace alumni
• Chien-Yueh Lee
• Han Lin
• Jun-Wei Lin
• Vijaya Tsavatapalli
• Mei-Ju Chen
• Chao-I Tuan
i5k Workspace@NAL advisory committee
• i5k Coordinating Committee
• i5k Pilot Project
• Apollo & JBrowse Development Teams
• GMOD/Tripal community
• All of our users and contributors!
OGS generation – the GFF3toolkit
The Replaced Models field
• We use the information in this field to generate a merged, non-redundant gene set from the manually curated models and the official or primary gene set
• Your official or primary gene set is listed in the category field of the track selector
• If you don’t know what your project’s gene set is, contact us!
https://i5k.nal.usda.gov/apollo-replaced-models-field-explanations-and-examples
Community annotation lifecycle (end goal: OGS)
1. Genome sequencing, assembly and annotation
2. Community building: conference calls and training
3. Manual annotation via Apollo
4. Manual annotation ‘freeze’
5. General QC (NAL)
6. Official Gene Set generation (merge of manual annotations and reference gene set)