Introduction to Microbial Sequencing
Matthew L. Settles
Genome Center Bioinformatics Core
University of California, [email protected]; [email protected]
General rules for preparing an experiment/samples
• Prepare more samples than you are going to need, i.e. expect that some will be of poor quality or fail
• Preparation stages should occur across all samples at the same time (or as close as possible) and be performed by the same person
• Spend time practicing a new technique so that you can reliably produce the highest-quality product
• Quality should be established using fragment analysis traces (pseudo-gel images, RNA RIN > 7.0)
• DNA/RNA should not be degraded
• 260/280 ratios for RNA should be approximately 2.0 and 260/230 ratios should be between 2.0 and 2.2. Values over 1.8 are acceptable
• Quantity should be determined with a fluorometer, such as a Qubit.
Sample preparation
In high-throughput biological work (microarrays, sequencing, HT genotyping, etc.), what may seem like small technical details introduced during sample extraction/preparation can lead to large changes, or technical bias, in the data.
This is not to say that such effects do not occur in smaller-scale analyses such as Sanger sequencing or qRT-PCR, but in high-throughput work they become more apparent (seen on a global scale) and may cause significant issues during analysis.
Be Consistent
BE CONSISTENT ACROSS ALL SAMPLES!!!
Illumina sequencing
[Figure: Illumina SBS library layout. Labels: P5, BC2, target region, BC1, P7; Read 1 (50-300 bp); Read 2 (50-300 bp); BC1 (8 bp); BC2 (8 bp); insert size; fragment length.]
Illumina MiSeq Sequencing
https://www.illumina.com/systems/sequencing-platforms/miseq/specifications.html
Illumina HiSeq Sequencing
https://www.illumina.com/systems/sequencing-platforms/hiseq-3000-4000/specifications.html
Sequencing Depth
• The first and most basic question is: how many base pairs of sequence data will I get? Factors to consider are:
  1. Number of reads being sequenced
  2. Read length (if paired, consider them as individuals)
  3. Number of samples being sequenced
  4. Expected percentage of usable data
• The number of reads and the read length are best obtained from the manufacturer's website (search for specifications); always use the lower end of the estimate.
Genomic Coverage
Once you have the number of base pairs per sample, you can then determine expected coverage.
Factors to consider then are:
  1. Length of the genome
  2. Any extra-genomic sequence (e.g. mitochondria, virus, plasmids, etc.). For bacteria in particular, these can become a significant percentage
\[ \text{ExpectedCoverage}_{\text{sample}} = \frac{\dfrac{\text{readLength} \times \text{numReads} \times 0.8}{\text{numSamples}} \times \text{num.lanes}}{\text{TotalGenomicContent}} \]
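The coverage formula above can be sketched in Python as follows. This is a minimal illustration and not part of the original slides: the function and parameter names are made up here, the 0.8 usable-data fraction is taken from the formula, and the example numbers are hypothetical placeholders rather than the specifications of any instrument.

```python
def bases_per_sample(read_length, num_reads, num_samples,
                     num_lanes=1, usable_fraction=0.8):
    """Usable base pairs per sample.

    For paired-end runs, count each read of a pair individually
    (num_reads = 2 * number of read pairs).
    """
    return read_length * num_reads * usable_fraction / num_samples * num_lanes


def expected_coverage(read_length, num_reads, num_samples,
                      total_genomic_content, num_lanes=1, usable_fraction=0.8):
    """Expected fold coverage per sample.

    total_genomic_content is the genome length plus any extra-genomic
    sequence (plasmids, virus, etc.), in base pairs.
    """
    bases = bases_per_sample(read_length, num_reads, num_samples,
                             num_lanes, usable_fraction)
    return bases / total_genomic_content


if __name__ == "__main__":
    # Hypothetical run: 25 million read pairs (50 million individual reads)
    # of 250 bp, 24 samples on one lane, 5 Mb of total genomic content.
    cov = expected_coverage(read_length=250, num_reads=50e6,
                            num_samples=24, total_genomic_content=5e6)
    print(f"Expected coverage per sample: {cov:.1f}x")
```

Swapping in the lower end of the manufacturer's read-count estimate keeps the result conservative.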
Metagenomics Sequencing
Considerations (when a literature search turns up nothing)
• Proportion that is host (non-microbial genomic content)
• Proportion that is microbial (genomic content of interest)
• Number of species
• Genome size of each species
• Relative abundance of each species
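The back of the envelope calculation
\[ \text{numReads}_{\text{sample}} = \frac{\text{Coverage} \times \text{AverageGenomeSize}}{\text{ReadLen} \times \text{DilutionFactor} \times (1 - \text{hostProportion})} \times \frac{1}{0.8} \]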
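A small Python sketch of this back-of-the-envelope calculation is given below. It is an illustration only: the function name is made up, and it assumes that DilutionFactor means the fraction of the microbial community made up by the organism of interest (for example 0.05 for a member at 5% relative abundance). The example values are hypothetical.

```python
def metagenome_reads_per_sample(coverage, average_genome_size, read_length,
                                dilution_factor, host_proportion,
                                usable_fraction=0.8):
    """Reads needed per sample to reach `coverage` on a target organism.

    dilution_factor: assumed here to be the relative abundance (0-1] of the
        target organism within the microbial fraction.
    host_proportion: fraction [0-1) of reads expected to come from the host.
    """
    if not 0 < dilution_factor <= 1:
        raise ValueError("dilution_factor must be in (0, 1]")
    if not 0 <= host_proportion < 1:
        raise ValueError("host_proportion must be in [0, 1)")
    # Reads needed if every read came from the target organism ...
    ideal_reads = coverage * average_genome_size / read_length
    # ... inflated for dilution, host contamination, and unusable data.
    return ideal_reads / (dilution_factor * (1 - host_proportion)) / usable_fraction


if __name__ == "__main__":
    # Hypothetical sample: 10x coverage of a 5 Mb genome at 5% abundance,
    # 20% host reads, 150 bp reads.
    n = metagenome_reads_per_sample(coverage=10, average_genome_size=5e6,
                                    read_length=150, dilution_factor=0.05,
                                    host_proportion=0.2)
    print(f"Reads needed per sample: {n:,.0f}")
```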
Sequencing Depth - Counting based experiments
• Coverage is determined differently for "counting" based experiments (RNA-seq, amplicons, etc.), where an expected number of reads per sample is typically more suitable.
• The first and most basic question is: how many reads per sample will I get? Factors to consider are (per lane):
  1. Number of reads being sequenced
  2. Number of samples being sequenced
  3. Expected percentage of usable data
  4. Number of lanes being sequenced
\[ \frac{\text{reads}}{\text{sample}} = \frac{\text{reads.sequenced} \times 0.8}{\text{samples.pooled}} \times \text{num.lanes} \]
• Read length, or SE vs PE, does not factor into sequencing depth.
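The reads-per-sample estimate for counting experiments can be sketched as below. The function names and the per-lane read count in the example are hypothetical; only the 0.8 usable-data factor comes from the formula. The second helper simply inverts the formula to ask how many lanes a target depth would need.

```python
import math


def reads_per_sample(reads_sequenced, samples_pooled,
                     num_lanes=1, usable_fraction=0.8):
    """Expected usable reads per sample; reads_sequenced is per lane."""
    return reads_sequenced * usable_fraction / samples_pooled * num_lanes


def lanes_needed(target_reads_per_sample, reads_sequenced, samples_pooled,
                 usable_fraction=0.8):
    """Smallest whole number of lanes that reaches the target depth."""
    per_lane = reads_per_sample(reads_sequenced, samples_pooled,
                                num_lanes=1, usable_fraction=usable_fraction)
    return math.ceil(target_reads_per_sample / per_lane)


if __name__ == "__main__":
    # Hypothetical lane yielding 300 million reads, shared by 48 samples.
    print(reads_per_sample(300e6, 48))
    print(lanes_needed(18e6, 300e6, 48))
```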
Amplicon Sequencing (Communities, genotyping)
Considerations
• Number of reads being sequenced
• Proportion that is diversity sample (e.g. PhiX)
• Number of samples being pooled in the run
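The back of the envelope calculation
\[ \frac{\text{reads}}{\text{sample}} = \frac{\text{reads\_sequenced} \times (1 - \text{diversity\_sample})}{\text{num\_samples}} \]
Example:
\[ 102{,}000\ \text{reads/sample} = \frac{18 \times 10^{6} \times (1 - 0.15)}{150} \]
Recommendations
• Illumina 'recommends' 100K per sample
• I've used 30K per sample historically; others are fine with 3K per sample
• You really should have as many reads as your experiment needs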
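To connect the amplicon formula with the per-sample targets listed above, here is a short Python sketch. It reproduces the 18 million read, 15% diversity sample, 150 sample example and then inverts the formula to ask how many samples fit in such a run at each target depth. The function names are made up for illustration.

```python
def amplicon_reads_per_sample(reads_sequenced, diversity_fraction, num_samples):
    """Reads per sample once the diversity sample (e.g. PhiX) is set aside."""
    return reads_sequenced * (1 - diversity_fraction) / num_samples


def max_samples_per_run(reads_sequenced, diversity_fraction,
                        target_reads_per_sample):
    """Largest number of samples that still reach the target depth."""
    usable_reads = round(reads_sequenced * (1 - diversity_fraction))
    return usable_reads // target_reads_per_sample


if __name__ == "__main__":
    # Example from the text: 18e6 reads, 15% diversity sample, 150 samples.
    print(round(amplicon_reads_per_sample(18e6, 0.15, 150)))  # 102000

    # Samples per run at the 100K, 30K, and 3K reads-per-sample targets.
    for target in (100_000, 30_000, 3_000):
        print(target, max_samples_per_run(18_000_000, 0.15, target))
```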
How Much? Community Rarefaction curves
• 'Deep' sequence a number of test samples
  • amplicons: ~1M+ reads
  • metagenomics: 1 full HiSeq lane
• Plot rarefaction curves of organism identification to determine if saturation is achieved
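One simple way to compute a rarefaction curve is to subsample the per-read taxon assignments at increasing depths and count the distinct taxa seen. The sketch below uses only the Python standard library; the function name, the toy community, and the choice of random subsampling without replacement are assumptions made for illustration, not the method used in the original slides.

```python
import random


def rarefaction_curve(taxon_labels, depths, n_reps=10, seed=0):
    """Mean number of distinct taxa observed at each subsampling depth.

    taxon_labels: list with one taxon assignment per read.
    depths: subsample sizes to evaluate (capped at the number of reads).
    """
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        depth = min(depth, len(taxon_labels))
        # Repeat the subsampling and average the observed richness.
        richness = [len(set(rng.sample(taxon_labels, depth)))
                    for _ in range(n_reps)]
        curve.append((depth, sum(richness) / n_reps))
    return curve


if __name__ == "__main__":
    # Toy community of 1,000 reads with two rare taxa (D and E).
    reads = ["A"] * 500 + ["B"] * 300 + ["C"] * 150 + ["D"] * 45 + ["E"] * 5
    for depth, mean_taxa in rarefaction_curve(reads, [10, 50, 100, 500, 1000]):
        print(depth, mean_taxa)
```

If the curve flattens at the larger depths, additional sequencing is finding few new taxa; if it is still climbing, the sample has not reached saturation.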
Metagenomics assembly
To determine if you've sequenced 'enough' to re-assemble 'most' of the community members' genetic content, look at what is left over, proportionally.
Amplicons vs. Metagenomics
• Metagenomics
  • Shotgun libraries intended to sequence random genomic sequences from the entire bacterial community.
  • Can be costly per sample ($500 to multiple thousands per sample)
  • Better resolution and sensitivity to characterize the sample
  • Due to cost, can only do relatively few samples
• Amplicon community profiling
  • Sequence only one region of one gene (e.g. 16S, ITS, LSU)
  • Cheap per sample (at scale, down to $20/sample)
  • Due to cost, can do many hundreds of samples and make more global inferences
Community Sequencing Designs
• Taxonomic Identification
  • Amplicon based (e.g. 16S variable regions)
  • Shotgun Metagenomics
• Functional Characterization
  • Shotgun Metagenomics
  • Shotgun Metatranscriptomics (active)
• Genome Assembly, Function and Variation
  • Shotgun Metagenomics
  • Shotgun Metatranscriptomics
Cost Estimation
• DNA/RNA extraction and QA/QC (Bioanalyzer/Gels)
• Metatranscriptomes: Enrichment of RNA of interest and RNA library preparation
  • Library QA/QC (Bioanalyzer and Qubit)
  • Pooling ($10/library)
• Metagenomes: DNA library preparation
  • Library QA/QC (Bioanalyzer and Qubit)
  • Pooling ($10/library)
• Community Profiling: PCR reactions
  • Library QA/QC (Bioanalyzer and Qubit/microplate reader)
  • Pooling
• Sequencing (Number of Lanes/runs)
• Bioinformatics (General rule is to estimate the same amount as data generation, i.e. double your budget)
http://dnatech.genomecenter.ucdavis.edu/prices/
Bioinformatics Costs
Bioinformatics includes:
  1. Storage of data
  2. Access and use of computational resources and software
  3. System administration time
  4. Bioinformatics data analysis time
  5. Back and forth consultation/analysis to extract biological meaning
Rule of thumb: Bioinformatics can and should cost as much as (sometimes more than) the cost of data generation.
Cost Estimation
• Amplicons
  • 384 samples
  • Amplicon generation ($20/sample) = $383/sample = $4,596
  • Sequencing PE300, target 30K reads per sample
  • Bioinformatics
• Metagenome
  • 12 samples (DNA)
  • Expectations: host proportion 40%, use average genome size of E. coli, target the 1% and coverage of 20
  • Sequencing PE150
  • Bioinformatics
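The read requirements behind these two scenarios can be estimated with the back-of-the-envelope formulas from earlier in the deck. The sketch below is illustrative only and rests on several assumptions not stated in the slides: an E. coli genome size of roughly 4.6 Mb, "target the 1%" read as a DilutionFactor of 0.01, and a 15% diversity sample (PhiX) on the amplicon run, borrowed from the earlier amplicon example. It estimates reads, not prices.

```python
USABLE_FRACTION = 0.8        # usable-data factor used in the metagenomics formula
ECOLI_GENOME_SIZE = 4.6e6    # assumption: approximate E. coli genome size in bp
PHIX_FRACTION = 0.15         # assumption: borrowed from the amplicon example


def amplicon_run_reads(num_samples, target_reads_per_sample, diversity_fraction):
    """Total reads to sequence so every sample reaches its target depth
    after the diversity sample is set aside."""
    return num_samples * target_reads_per_sample / (1 - diversity_fraction)


def metagenome_reads_per_sample(coverage, genome_size, read_length,
                                dilution_factor, host_proportion,
                                usable_fraction=USABLE_FRACTION):
    """Individual reads needed per sample (each read of a pair counts once)."""
    return (coverage * genome_size
            / (read_length * dilution_factor * (1 - host_proportion))
            / usable_fraction)


if __name__ == "__main__":
    # Amplicons: 384 samples at 30K reads each, about 13.6 million reads.
    print(f"{amplicon_run_reads(384, 30_000, PHIX_FRACTION):,.0f}")

    # Metagenome: 20x on a 1% member, 40% host, 150 bp reads,
    # about 128 million individual reads per sample.
    per_sample = metagenome_reads_per_sample(20, ECOLI_GENOME_SIZE, 150,
                                             dilution_factor=0.01,
                                             host_proportion=0.4)
    print(f"{per_sample:,.0f} per sample, {12 * per_sample:,.0f} for 12 samples")
```

Dividing the totals by the lower-end read yield of the chosen instrument gives the number of runs or lanes to budget for.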
Take Homes
• Experience and/or literature searches (other people's experiences) will provide the best justification for estimates of needed depth.
• 'Longer' reads are better than short reads.
• Paired-end reads are more useful than single-end reads.
• Libraries can be sequenced again, so do a pilot, perform a preliminary analysis, then sequence more accordingly.