Introduction to Microbial Sequencing

Matthew L. Settles
Genome Center Bioinformatics Core
University of California, Davis
settles@ucdavis.edu; bioinformatics.core@ucdavis.edu

General rules for preparing an experiment/samples

• Prepare more samples than you are going to need, i.e. expect that some will be of poor quality or fail.
• Preparation stages should occur across all samples at the same time (or as close as possible) and be done by the same person.
• Spend time practicing a new technique to produce the highest quality product you can, reliably.
• Quality should be established using fragment analysis traces (pseudo-gel images, RNA RIN > 7.0).
• DNA/RNA should not be degraded.
• 260/280 ratios for RNA should be approximately 2.0 and 260/230 should be between 2.0 and 2.2. Values over 1.8 are acceptable.
• Quantity should be determined with a fluorometer, such as a Qubit.
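As a rough illustration, the quality thresholds in the list above can be turned into a simple check to run before submitting samples. This is only a sketch: `SampleQC` and `qc_flags` are hypothetical names, the example measurements are made up, and it assumes the "values over 1.8 are acceptable" note applies to both absorbance ratios.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """QC measurements for one RNA sample (hypothetical record layout)."""
    name: str
    rin: float        # RNA integrity number from the fragment analysis trace
    a260_280: float   # absorbance ratio 260/280
    a260_230: float   # absorbance ratio 260/230


def qc_flags(sample: SampleQC) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    flags = []
    if sample.rin <= 7.0:
        flags.append("RIN is not above 7.0")
    # Assumption: "over 1.8" is the acceptance floor for both ratios.
    if sample.a260_280 <= 1.8:
        flags.append("260/280 is not above 1.8")
    if sample.a260_230 <= 1.8:
        flags.append("260/230 is not above 1.8")
    return flags


# Made-up measurements, for illustration only.
samples = [
    SampleQC("S1", rin=8.9, a260_280=2.0, a260_230=2.1),
    SampleQC("S2", rin=6.2, a260_280=1.7, a260_230=2.0),
]
for s in samples:
    print(s.name, qc_flags(s) or "OK")
```

A check like this does not replace looking at the traces themselves; it only catches samples that miss the numeric cut-offs.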

Sample preparation

In high-throughput biological work (microarrays, sequencing, HT genotyping, etc.), what may seem like small technical details introduced during sample extraction/preparation can lead to large changes, or technical bias, in the data.

This is not to say that such bias doesn't occur with smaller-scale analyses such as Sanger sequencing or qRT-PCR, but in high-throughput work the effects become more apparent (seen on a global scale) and may cause significant issues during analysis.

Be Consistent

BE CONSISTENT ACROSS ALL SAMPLES!!!

Illumina sequencing

• Illumina SBS

[Figure: Illumina library fragment. Labels: P5, BC2 (8 bp), target region, BC1 (8 bp), P7; Read 1 (50-300 bp); Read 2 (50-300 bp); insert size; fragment length.]

Illumina MiSeq Sequencing
https://www.illumina.com/systems/sequencing-platforms/miseq/specifications.html

Illumina HiSeq Sequencing
https://www.illumina.com/systems/sequencing-platforms/hiseq-3000-4000/specifications.html

Sequencing Depth

• The first and most basic question is: how many base pairs of sequence data will I get? Factors to consider are:
  1. Number of reads being sequenced
  2. Read length (if paired, consider them as individual reads)
  3. Number of samples being sequenced
  4. Expected percentage of usable data
• The number of reads and the read length are best obtained from the manufacturer's website (search for specifications). Always use the lower end of the estimate.
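The four factors listed for sequencing depth combine into a simple product. The sketch below is one way to write it down; `bases_per_sample` is a hypothetical helper name, the default usable fraction of 0.8 is borrowed from the factor used in the formulas elsewhere in these slides, and the run size in the example is made up.

```python
def bases_per_sample(num_reads: float, read_length: int, num_samples: int,
                     usable_fraction: float = 0.8) -> float:
    """Expected usable base pairs per sample.

    num_reads counts individual reads, so a paired-end run with N pairs
    contributes 2 * N reads.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return num_reads * read_length * usable_fraction / num_samples


# Hypothetical run: 10 million read pairs (20 million reads) of 150 bp
# shared evenly across 10 samples.
per_sample = bases_per_sample(20e6, 150, 10)
print(f"{per_sample:.3g} bp per sample")
```

Using the lower end of the manufacturer's read-count range as `num_reads` keeps the estimate conservative.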

Genomic Coverage

Once you have the number of base pairs per sample, you can then determine the expected coverage.

Factors to consider then are:
  1. Length of the genome
  2. Any extra-genomic sequence (i.e. mitochondria, virus, plasmids, etc.). For bacteria in particular, these can become a significant percentage.

$$\text{ExpectedCoverage}_{\text{sample}} = \frac{\dfrac{\text{readLength} \times \text{numReads} \times 0.8}{\text{numSamples}} \times \text{num.lanes}}{\text{TotalGenomicContent}}$$
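The expected-coverage formula can be sketched as a small function. This assumes that the number of reads is a per-lane figure (since the formula multiplies by the number of lanes) and that the 0.8 factor is the expected usable fraction; the function name and the example numbers are hypothetical.

```python
def expected_coverage(read_length: int, num_reads: float, num_samples: int,
                      num_lanes: int, total_genomic_content: float,
                      usable_fraction: float = 0.8) -> float:
    """Expected per-sample coverage (fold) of the total genomic content."""
    usable_bases = (read_length * num_reads * usable_fraction
                    / num_samples * num_lanes)
    return usable_bases / total_genomic_content


# Hypothetical bacterial isolate: 4.8 Mb chromosome plus 0.2 Mb of plasmids,
# 40 samples pooled on one lane yielding 100 million 100 bp reads.
genome_bp = 4_800_000 + 200_000
coverage = expected_coverage(100, 100e6, 40, 1, genome_bp)
print(f"{coverage:.1f}x expected coverage")
```

Including plasmids and other extra-genomic sequence in `total_genomic_content` matters most for small genomes, where they make up a larger share of the sequenced bases.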

Metagenomics Sequencing

Considerations (when a literature search turns up nothing)
• Proportion that is host (non-microbial genomic content)
• Proportion that is microbial (genomic content of interest)
• Number of species
• Genome size of each species
• Relative abundance of each species

The back-of-the-envelope calculation

$$\text{numReads}_{\text{sample}} = \frac{\text{Coverage} \times \text{AverageGenomeSize}}{\text{ReadLen} \times \text{DilutionFactor} \times (1 - \text{hostProportion})} \times \frac{1}{0.8}$$
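A minimal sketch of the metagenomics back-of-the-envelope calculation follows. It rests on two assumptions that should be checked against your own reading of the formula: that the dilution factor is the fraction of the microbial community made up by the organism you want to cover (so it belongs in the denominator together with the non-host fraction), and that 0.8 is the expected usable fraction of reads. The function name and example values are hypothetical.

```python
def metagenome_reads_per_sample(coverage: float, avg_genome_size: float,
                                read_len: int, dilution_factor: float,
                                host_proportion: float,
                                usable_fraction: float = 0.8) -> float:
    """Reads needed per sample to reach `coverage` on a target organism."""
    if not 0 < dilution_factor <= 1:
        raise ValueError("dilution_factor must be in (0, 1]")
    if not 0 <= host_proportion < 1:
        raise ValueError("host_proportion must be in [0, 1)")
    # Fraction of all reads expected to come from the target organism.
    on_target = dilution_factor * (1 - host_proportion)
    return coverage * avg_genome_size / (read_len * on_target) / usable_fraction


# Hypothetical: 10x coverage of a 5 Mb genome at 10% of the community,
# 50% host reads, 100 bp reads.
needed = metagenome_reads_per_sample(10, 5e6, 100, 0.1, 0.5)
print(f"{needed / 1e6:.1f} million reads per sample")
```

Note how quickly the requirement grows: halving either the target abundance or the non-host fraction doubles the number of reads needed.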

Sequencing Depth: counting-based experiments

• Coverage is determined differently for "counting"-based experiments (RNA-seq, amplicons, etc.), where an expected number of reads per sample is typically more suitable.
• The first and most basic question is: how many reads per sample will I get? Factors to consider are (per lane):
  1. Number of reads being sequenced
  2. Number of samples being sequenced
  3. Expected percentage of usable data
  4. Number of lanes being sequenced

$$\text{reads}_{\text{sample}} = \frac{\text{reads.sequenced} \times 0.8}{\text{samples.pooled}} \times \text{num.lanes}$$

• Read length, or SE vs. PE, does not factor into sequencing depth.
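For counting-based experiments the reads-per-sample estimate can be sketched as below. The function name is hypothetical, the per-lane read count in the example is made up, and 0.8 is taken to be the expected usable fraction.

```python
def reads_per_sample(reads_sequenced: float, samples_pooled: int,
                     num_lanes: int = 1, usable_fraction: float = 0.8) -> float:
    """Expected usable reads per sample; read length plays no role here."""
    if samples_pooled <= 0:
        raise ValueError("samples_pooled must be positive")
    return reads_sequenced * usable_fraction / samples_pooled * num_lanes


# Hypothetical: 300 million reads per lane, 24 samples pooled, 2 lanes.
depth = reads_per_sample(300e6, 24, num_lanes=2)
print(f"{depth / 1e6:.0f} million reads per sample")
```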

Amplicon Sequencing (communities, genotyping)

Considerations
• Number of reads being sequenced
• Proportion that is diversity sample (e.g. PhiX)
• Number of samples being pooled in the run

The back-of-the-envelope calculation

$$\text{reads}_{\text{sample}} = \frac{\text{reads\_sequenced} \times (1 - \text{diversity\_sample})}{\text{num\_samples}}$$

Example

$$\frac{18 \times 10^{6} \times (1 - 0.15)}{150} = 102{,}000 \text{ reads per sample}$$

Recommendations
• Illumina 'recommends' 100K reads per sample.
• I've used 30K per sample historically; others are fine with 3K per sample.
• You really should have as many reads as your experiment needs.
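The amplicon back-of-the-envelope calculation can be written as a one-line function. The sketch below reuses the numbers from the slide's own example (18 million reads, 15% diversity sample, 150 samples); the function name is hypothetical.

```python
def amplicon_reads_per_sample(reads_sequenced: float, diversity_fraction: float,
                              num_samples: int) -> float:
    """Reads per sample after setting aside the diversity sample (e.g. PhiX)."""
    if not 0 <= diversity_fraction < 1:
        raise ValueError("diversity_fraction must be in [0, 1)")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return reads_sequenced * (1 - diversity_fraction) / num_samples


# Numbers from the example: 18e6 reads, 15% diversity sample, 150 samples.
print(round(amplicon_reads_per_sample(18e6, 0.15, 150)))  # 102000
```

Comparing the result with the per-sample targets listed under Recommendations shows how many samples a single run can reasonably hold.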

How Much? Community Rarefaction Curves

• 'Deep' sequence a number of test samples
  • Amplicons: ~1M+ reads
  • Metagenomics: 1 full HiSeq lane
• Plot rarefaction curves of organism identification to determine whether saturation is achieved.
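To make the idea of a rarefaction curve concrete, here is a minimal sketch that subsamples a deeply sequenced test sample and counts how many distinct organisms are seen at each depth. It assumes you already have per-taxon read counts; the counts below are made up, `rarefaction_curve` is a hypothetical helper, and a single random ordering is used where a real analysis would usually average over many subsamples.

```python
import random


def rarefaction_curve(taxon_counts: dict[str, int], depths: list[int],
                      seed: int = 0) -> list[tuple[int, int]]:
    """Return (depth, taxa observed) pairs from one random read ordering."""
    # Expand the counts into one label per read, then shuffle once.
    reads = [taxon for taxon, n in taxon_counts.items() for _ in range(n)]
    random.Random(seed).shuffle(reads)

    curve = []
    seen: set[str] = set()
    pos = 0
    for depth in sorted(depths):
        depth = min(depth, len(reads))   # cannot subsample past the data
        seen.update(reads[pos:depth])    # add the newly included reads
        pos = depth
        curve.append((depth, len(seen)))
    return curve


# Made-up counts for one test sample.
counts = {"A": 500, "B": 300, "C": 50, "D": 5}
for depth, n_taxa in rarefaction_curve(counts, [10, 100, 855, 2000]):
    print(depth, n_taxa)
```

If the number of taxa is still climbing at the deepest subsample, saturation has not been reached and the planned per-sample depth is probably too low.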

Metagenomics assembly

To determine whether you've sequenced 'enough' to re-assemble 'most' of the community members' genetic content, look at what is left over, proportionally.

Amplicons vs. Metagenomics

• Metagenomics
  • Shotgun libraries intended to sequence random genomic sequences from the entire bacterial community.
  • Can be costly per sample ($500 to multiple thousands per sample)
  • Better resolution and sensitivity to characterize the sample
  • Due to cost, can only do relatively few samples
• Amplicon community profiling
  • Sequence only one region of one gene (e.g. 16S, ITS, LSU)
  • Cheap per sample (at scale, down to $20/sample)
  • Due to the low cost, can do many hundreds of samples and make more global inferences

Community Sequencing Designs

• Taxonomic Identification
  • Amplicon based (e.g. 16S variable regions)
  • Shotgun Metagenomics
• Functional Characterization
  • Shotgun Metagenomics
  • Shotgun Metatranscriptomics (active)
• Genome Assembly, Function and Variation
  • Shotgun Metagenomics
  • Shotgun Metatranscriptomics

Cost Estimation

• DNA/RNA extraction and QA/QC (Bioanalyzer/gels)
• Metatranscriptomes: enrichment of RNA of interest and RNA library preparation
  • Library QA/QC (Bioanalyzer and Qubit)
  • Pooling ($10/library)
• Metagenomes: DNA library preparation
  • Library QA/QC (Bioanalyzer and Qubit)
  • Pooling ($10/library)
• Community profiling: PCR reactions
  • Library QA/QC (Bioanalyzer and Qubit/microplate reader)
  • Pooling
• Sequencing (number of lanes/runs)
• Bioinformatics (general rule is to estimate the same amount as data generation, i.e. double your budget)

http://dnatech.genomecenter.ucdavis.edu/prices/

Bioinformatics Costs

Bioinformatics includes:
  1. Storage of data
  2. Access to and use of computational resources and software
  3. System administration time
  4. Bioinformatics data analysis time
  5. Back-and-forth consultation/analysis to extract biological meaning

Rule of thumb: bioinformatics can and should cost as much as (sometimes more than) the cost of data generation.
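The rule of thumb can be folded into a simple budget sketch. Only the $10/library pooling charge comes from these slides; the library preparation and sequencing prices below are placeholders, and `estimate_budget` is a hypothetical helper, so substitute current prices from your sequencing core.

```python
def estimate_budget(num_samples: int, library_cost_per_sample: float,
                    pooling_cost_per_library: float,
                    sequencing_cost: float) -> dict[str, float]:
    """Budget with bioinformatics set equal to data generation."""
    data_generation = (num_samples
                       * (library_cost_per_sample + pooling_cost_per_library)
                       + sequencing_cost)
    bioinformatics = data_generation  # rule of thumb: match data generation
    return {
        "data_generation": data_generation,
        "bioinformatics": bioinformatics,
        "total": data_generation + bioinformatics,
    }


# Placeholder prices: $100 library prep and $2,000 sequencing are NOT quotes.
budget = estimate_budget(num_samples=12, library_cost_per_sample=100.0,
                         pooling_cost_per_library=10.0, sequencing_cost=2000.0)
for item, cost in budget.items():
    print(f"{item}: ${cost:,.2f}")
```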

Cost Estimation

• Amplicons
  • 384 samples
  • Amplicon generation ($20/sample) = $383/sample = $4,596
  • Sequencing PE300, target 30K reads per sample
  • Bioinformatics
• Metagenome
  • 12 samples (DNA)
  • Expectations: host proportion 40%, use average genome size of E. coli, target the 1% and coverage of 20
  • Sequencing PE150
  • Bioinformatics
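As a worked sketch of the sequencing depth these two examples imply, the script below plugs the stated expectations into the back-of-the-envelope formulas from earlier slides. Several values are assumptions rather than figures from the slides: an E. coli genome size of roughly 4.6 Mb, a usable fraction of 0.8, "target the 1%" read as a target organism at 1% of the microbial community, and each mate of a PE150 pair counted as an individual 150 bp read.

```python
# Metagenome example: values stated in the slides.
num_samples = 12
host_proportion = 0.40
target_abundance = 0.01   # "target the 1%", used as the dilution factor (assumption)
coverage = 20
read_len = 150            # PE150, mates counted as individual reads

# Assumptions not given in the slides.
ecoli_genome_bp = 4.6e6   # approximate E. coli genome size
usable_fraction = 0.8     # same usable-data factor as the other formulas

on_target = target_abundance * (1 - host_proportion)
reads_per_sample = (coverage * ecoli_genome_bp
                    / (read_len * on_target) / usable_fraction)
total_reads = reads_per_sample * num_samples
print(f"Metagenome: {reads_per_sample / 1e6:.0f} million reads per sample")
print(f"Metagenome: {total_reads / 1e9:.2f} billion reads for {num_samples} samples")

# Amplicon example: 384 samples at a target of 30K reads each,
# before allowing for any diversity sample such as PhiX.
amplicon_reads = 384 * 30_000
print(f"Amplicons: {amplicon_reads / 1e6:.2f} million reads on target")
```

Dividing these totals by the lower-end read yield of the chosen instrument gives the number of lanes or runs to budget for.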

Take Homes

• Experience and/or literature searches (other people's experiences) will provide the best justification for estimates of needed depth.
• 'Longer' reads are better than short reads.
• Paired-end reads are more useful than single-end reads.
• Libraries can be sequenced again, so do a pilot, perform a preliminary analysis, then sequence more accordingly.
