Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 1 of 15
Qualitycontrol
Sequencingtechnologiesarenotperfectandthequalitycontrol(QC)isanessentialsteptoensurethatthedatausedfordownstreamanalysisisnotcompromisedoflow‐qualitysequences,sequenceartifacts,orsequencecontaminationthatmightleadtoerroneousconclusions.TheeasiestwayofQCislookingatsummarystatisticsofthedata.Therearedifferentprogramsthatcanproducethosestatistics.Webapplicationsallowuserstoeasilyshareanddiscusstheresultswithotherpeoplewithouttransferringlargedatafiles.ThefollowingQCstepsareimplementedinandallgraphicsgeneratedbyPRINSEQ(http://prinseq.sourceforge.net).
Content:
• Necessaryresources• UploadingdatatothePRINSEQwebversion
• Numberandlengthofsequences• Basequalities
• GCcontent
• Poly‐A/Ttails• Ambiguousbases
• Sequenceduplications
• Sequencecomplexity• Tagsequences
• Sequencecontamination• Assemblyqualitymeasures
• References
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 2 of 15
Necessaryresources
Hardware
ComputerconnectedtotheInternet
Software
Up‐to‐dateWebbrowser(Firefox,Safari,Chrome,InternetExplorer,…)
Files
FASTAfilewithsequencedataQUALfilewithqualityscores(ifavailable)
FASTQfile(asalternativeformat)
UploadingdatatothePRINSEQwebversion
TouploadanewdatasetinFASTAandQUALformat(orFASTQformat)toPRINSEQ,followthesesteps:
1. Gotohttp://prinseq.sourceforge.net
2. Clickon“UsePRINSEQ”inthetopmenuontheright(thelatestPRINSEQwebversionshouldload)
3. Clickon“Uploadnewdata”
4. SelectyourFASTAandQUALfilesoryourFASTQfileandclick“Submit”
Afterthedataisparsedandprocessedsuccessfully,theuserinterfacewillshowamenuontheleftandamessageinthemainpanelasshownbelow.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 3 of 15
Notes
Afterclickingthesubmitbutton,astatusbar(notprogressbar)willbedisplayeduntilthefileuploadiscompleted.Duringthedataprocessing,severalprogressbarswillshowtheprogressofthedataparsingandstatisticscalculationsteps.
Possibleproblems
1. ThePRINSEQwebinterfacedoesnotload/isnotvisible.
Youonlyseethisandnothingelsehappens:
Solution:MakesurethatyouhaveJavaScriptactivatedinyourbrowser,asthisisrequiredtoloadandusePRINSEQ’swebinterface.
2. Theuploadstatusbardoesnotdisappear.
Afterclickingonthesubmitbuttonyouseethisanditdoesnotdisappear:
Solution:Thefirstthingtocheckisifthefileisstilluploading.Theeasiestwaytodothisisbycheckingtheloadingiconinyourbrowser.
Ifyouseetheloadingicon(right)insteadofthePRINSEQicon(left),yourfileisstilluploadingandyoushouldgiveitmoretime.IfyouseethePRINSEQiconinsteadoftheloadingicon,yourfiledidnotuploadcompletelyandthiscausedanerror.IfyouhaveaslowconnectiontotheInternetortrytouploadlargefiles,theconnectiontothewebservercantimeoutbeforetheuploadwascompleted.Ifyoudidnotuploadcompressedfiles,trytocompressyourfileswithanyofthesupportedcompressionalgorithms(ZIP,GZIP,…).Inrarecases,theissuecanalsobecausedbycertainFirefoxpluginsorextensions.Ifpossible,useanalternativebrowsertotestifthiswasthecase.Ifthebrowsercausedtheproblem,updatingFirefoxandtheplugins/extensiontothelatestversionmightsolvetheproblem.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 4 of 15
Numberandlengthofsequences
Checkthosenumberstomakesureitmatchesapproximatelythemanufacturerestimates.Ifyournumbersareofftoomuch,checktherawdataandfilterstatisticsin"454BaseCallerMetrics"and"454QualityFilterMetrics".
Lengthdistribution
Thelengthdistributioncanbeusedasqualitymeasureforthesequencingrun.Youwouldexpectanormaldistributionforthebestresult.However,mostsequencingresultsshowaslowlyincreasingandthenasteepfallingdistribution.TheplotsinPRINSEQmarkthemeanlength(M)andthelengthforoneandtwostandarddeviations(1SDand2SD),whichcanhelptodecidewheretosetlengththresholdsforthedatapreprocessing.Ifanyofthesequencesislongerthan100bp,thelengthswillbebinnedintheplotsgeneratedbyPRINSEQ.Thenumberofsequencesforeachbinisthenshowninsteadofthenumberofsequencesforasinglelength(valuesmightthereforebebiggerthanshowninthetablefornon‐binnedlengths).
Thefollowingtwodatasetshaveapproximatelythesamenumberofsequences,howeverthelengthdistributionslookdifferent.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 5 of 15
Bothdistributionshavethehighestnumberofsequencesaround500bp,butforthefirstdatasetthemeanofthesequencelengthsishigherandthestandarddeviationislower.Acertainnumberofshorterreadsmightbeexpected,butifthesamplecontainedmainlylongerfragments,thisnumbershouldbelow.
Assumingthatbothsamplescontainedenoughfragmentsofatleast500bpandallfragmentsweresequencedwiththesamenumberofcycles(sequencingflows),wewouldexpectthatthemajorityofthesequenceswouldhaveapproximatelythesamelength.Thehigheramountofshorterreadsintheseconddatasetsuggeststhatthosereadsmighthavebeenoflowerqualityandweretrimmedduringthesignalprocessing.Ifthesamplecontainedmanyshortfragments,theshorterreadsmightbefromthosefragmentsandnotoflowerquality.
Minimumandmaximumreadlength
SequencesintheSFFfilescanbeasshortas40bp(shortersequencesarefilteredduringsignalprocessing).Formultiplexedsamples,theMIDtrimmedsequencescanbeasshortat28bp(assuminga12bpMIDtag).Suchshortsequencescancauseproblemsduring,forexample,databasesearchestofindsimilarsequences.Shortsequencesaremorelikelytomatchatarandompositionbychancethanlongersequencesandmaythereforeresultinfalsepositivefunctionalortaxonomicalassignments.Furthermore,shortsequencesarelikelytobequalitytrimmedduringthesignal‐processingstepandoflowerqualitywithpossiblesequencingerrors.
Insomecases,sequencescanbemuchlongerthanseveralstandarddeviationsabovethemeanlength(e.g.1,500+bpfora500bpmeanlengthwitha100bpstandarddeviation).Thosesequencesshouldbeusedwithcautionastheylikelycontainlongstretchesofhomopolymerrunsasinthefollowingexample.Homopolymersareaknownissueofpyrosequencingtechnologiessuchas454/Roche[1].
aactttaaccttttaaaacccccttaaaaaaactttaaaccccgtaaaccccccgggttt ttttttaaaaaaccgttttttacgggggtttaccccgttttaccggggttttgggggttt taaaaaaaacgggttttaaacgggttaacccccgggttttccgggggtttaaaaagtttt tttaaacgggggttttcccgtaaaaaaaaaaccccgtttaaaaaaaggggttaaaaaaaa aaggggttaaccccccggggtttaaaaaaaaccttttttttttttaaaaaaaacgttttt tttttttaaaaggggttttttttacgggggtaaacgggggggttaaaaaaaaaccccccc cggggggttttaaaaaaaaaacccccggttttaaaaaaccccgttttaacccctttaaaa aaaaaacgggggggttttaaaaaaaaaagggggttttttttttttaaaaacccgttttta aaaaccccccgttttttaacccgggttaaaccccccccgggggggtaaaacccccccccc ggggtaaccccctttttttaaaacccccccccgttttttacccgggggtttttacccccg gggggggtaaaaaaacggggggtttttttttttttaaaaccggggttttttttttttaaa ccccggtttttaaaaaccggtttttaccccggggggttttacccccgggggggggttttt aaaacccccggtttaaaactttaaaaacccgggtaaccccggggttttaaaaaaaaaaaa aaaaccccccccgttaaaaaaaaaaaacccgttttttttttaaaaaaaacccccccccgg ttttaaaaccccccccgggggtttttaccccggggttttaaaaaaaacccgtttaaaaaa accgggttttttaaaggggttttaaacccccccccc
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 6 of 15
Basequalities
AsforSangersequencing,next‐generationsequencersproducedatawithlinearlydegradingqualityacrosstheread.Thequalityscoresfor454/RochesequencersarePhred‐basedsinceversion1.1.03,rangingfrom0to40.Phredvaluesarelog‐scaled,whereaqualityscoreof10representsa1in10chanceofanincorrectbasecallandaqualityscoreof20representsa1in100chanceofanincorrectbasecall.
InPRINSEQ,thequalityscoresareplottedacrossthereadsusingboxplots.Thex‐axisindicatestheabsolutepositionifallreadsarenolongerthan100bpandtherelativeposition(in%ofreadlength)ifanyreadislongerthan100bp.Fordatasetswithanyreadlongerthan100bp,asecondplotshowsbinnedqualityvaluestokeepitsabsolutepositions.Thisplotishelpfultoidentifyqualityscoresattheendoflongerreads,whichwouldotherwisebegroupedwiththeendsoftheshorterreads.ThefollowingexampleshowsthequalityscoresacrossthereadlengthforfragmentssequencedwithGSFLXusingtheTitaniumkit.Thesequenceswithlowqualityscoresattheendsshouldbetrimmedduringdatapreprocessing.
Inadditiontothedecreaseinqualityacrosstheread,regionswithhomopolymerstretcheswilltendtohavelowerqualityscores.Huseetal.[1]foundthatsequenceswithanaveragescorebelow25hadmoreerrorsthanthosewithhigheraverages.Therefore,itishelpfultotakealookattheaverage(ormean)qualityscores.PRINSEQprovidesaplotthatshowsthedistributionofsequencemeanqualityscoresofadataset,asshownbelow.Themajorityofthesequencesshouldhavehighmeanqualityscores.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 7 of 15
Lowqualitysequencescancauseproblemsduringdownstreamanalysis.Mostassemblersoralignersdonottakeintoaccountqualityscoreswhenprocessingthedata.Theerrorsinthereadscancomplicatetheassemblyprocessandmightcausemisassembliesormakeanassemblyimpossible.
GCcontent
TheGCcontentdistributionofmostsamplesshouldfollowanormaldistribution.Insomecases,abi‐modaldistributioncanbeobserved,especiallyformetagenomicdatasets.TheGCcontentplotinPRINSEQmarksthemeanGCcontent(M)andtheGCcontentforoneandtwostandarddeviations(1SDand2SD).ThiscanhelptodecidewheretosettheGCcontentthresholds,ifaGCcontentfilterwillbeapplied.Theplotcanalsobeusedtofindthethresholdsorrangetoselectsequencesfromabi‐modaldistribution.
PolyA/Ttails
Poly‐A/TtailsareconsideredrepeatsofAsorTsatthesequenceends.InPRINSEQ,theminimumlengthofatailis5bpandsequencesthatcontainonlyAsorTsarecountedforbothends.Asmallnumberoftailscanoccurevenaftertrimmingpoly‐A/Ttails.Forexample,asequencethatendswithAAAAATTTTTandthathasbeentrimmedforthepoly‐Twillstillcontainthepoly‐A.
Trimmingpoly‐A/Ttailscanreducethenumberoffalsepositivesduringdatabasesearches,aslongtailstendtoalignwelltosequenceswithlowcomplexityorsequenceswithtails(e.g.viralsequences)inthedatabase.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 8 of 15
Ambiguousbases
SequencescancontaintheambiguousbaseNforpositionsthatcouldnotbeidentifiedasaparticularbase.AhighnumberofNscanbeasignforalowqualitysequenceorevendataset.Ifnoqualityscoresareavailable,thesequencequalitycanbeinferredfromthepercentofNsfoundinasequenceordataset.Huseetal.[1]foundthatthepresenceofanyambiguousbasecallswasasignforoverallpoorsequencequality.
Ambiguousbasescancauseproblemsduringdownstreamanalysis.AssemblerssuchasVelvetandalignerssuchasSHAHA2orBWAusea2‐bitencodingsystemtorepresentnucleotides,asitoffersaspaceefficientwaytostoresequences.Forexample,thenucleotidesA,C,GandTmightbe2‐bitencodedas00,01,10and11.The2‐bitencoding,however,onlyallowstostorethefournucleotidesandanyadditionalambiguousbasecannotberepresented.Thedifferentprogramsdealwiththeproblemindifferentways.Someprogramsreplaceambiguousbaseswitharandombase(e.g.BWA[2])andotherswithafixedbase(e.g.SHAHA2andVelvetreplaceNswithAs[3]).Thiscanresultinmisassembliesorfalsemappingofsequencestoareferencesequenceandtherefore,sequenceswithahighnumberofNsshouldberemovedbeforedownstreamanalysis.
Sequenceduplications
Realorartificial?Assumingarandomsamplingofthegenomicmaterialinanenvironmentsuchasinmetagenomicstudies,readsshouldnotstartatthesamepositionandhavethesameerrors(atleastnotinthenumbersthattheyhavebeenobservedinmostmetagenomes).Gomez‐Alvarezetal.[5]investigatedtheprobleminmoredetailanddidnotfindaspecificpatternorlocationonthesequencingplatethatcouldexplaintheduplications.
Duplicatescanarisewhentherearetoofewfragmentspresentatanystagepriortosequencing,especiallyduringanyPCRstep.Furthermore,thetheoreticalideaofonemicro‐reactorcontainingonebeadfor454/Rochesequencingdoesnotalwaystranslateintopracticewheremanybeadscanbefoundinasinglemicro‐reactor.Unfortunately,artificialduplicatesaredifficulttodistinguishfromexactlyoverlappingreadsthatnaturallyoccurwithindeepsequencesamples.
Thenumberofexpectedsequenceduplicateshighlydependsonthedepthofthelibrary,thetypeoflibrarybeingsequenced(wholegenome,transcriptome,16S,metagenome,...),andthesequencingtechnologyused.Thesequenceduplicatescanbedefinedusingdifferentmethods.Exactduplicatesareidenticalsequencecopies,whereas5'or3'duplicatesaresequencesthatareidenticalwiththe5'or3'endofalongersequence.Consideringthedouble‐strandednatureofDNA,duplicatescouldalsobeconsideredsequencesthatareidenticalwiththereversecomplementofanothersequence.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 9 of 15
ThedifferentplotsinPRINSEQcanbehelpfultoinvestigatethedegreeofsequenceduplicationsinadataset.Thefollowingplotshowsthenumberofsequenceduplicatesfordifferentlengths.Thedistributionofduplicatesshouldbesimilartothelengthdistributionofthedataset.Thenumberof5’duplicatesishigherforshortersequences(asobservedintheexamplebelow),suggestingthatexactsequenceduplicatesmayhavebeentrimmedduringsignalprocessing.
Thenumberofexactduplicatesisoftenhigherthanthenumberof5’and3’duplicatesasinthefollowingexample.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 10 of 15
PRINSEQoffersadditionalplotstoinvestigatethesequenceduplicatesfromdifferentpointsofview.Theplotshowingthesequenceduplicationlevels(withnumberofsequenceswithoneduplicate,twoduplicates,threeduplicates,…)canbeusedtoidentifythedistributionofduplicates(e.g.domanysequenceshaveonlyafewduplicates).Theplotshowingthehighestnumberofduplicatesforasinglesequence(top100)canhelptoindentifyifonlyafewsequenceshavemanyduplicates(e.g.asaresultofspecificPCRamplification)andwhatthehighestduplicationnumbersare.
Dependingonthedatasetanddownstreamanalysis,itshouldbeconsideredtofiltersequenceduplicates.ThemainpurposeofremovingduplicatesistomitigatetheeffectsofPCRamplificationbiasintroducedduringlibraryconstruction.Inaddition,removingduplicatescanresultincomputationalbenefitsbyreducingthenumberofsequencesthatneedtobeprocessedandbyloweringthememoryrequirements.Sequenceduplicatescanalsoimpactabundanceorexpressionmeasuresandcanresultinfalsevariant(SNP)calling.Theexamplebelowshowsthealignmentofsequencesagainstareferencesequence(gray).Thesequenceduplicates(startingatthesameposition)suggestapossiblyfalsefrequencyofbaseCatthepositionmarkedinbold.
Keepinmindthatthenumberofsequenceduplicatesalsodependsontheexperiment.Forshort‐readdatasetswithhighcoveragesuchasinultra‐deepsequencingorgenomere‐sequencingdatasets,eliminatingsingletonscanpresentaneasywayofdramaticallyreducingthenumberoferror‐pronereads.
Sequencecomplexity
Genomesequencescanexhibitintervalswithlow‐complexity,whichmaybepartofthesequencedatasetwhenusingrandomsamplingtechniques.Low‐complexitysequencesaredefinedashavingcommonlyfoundstretchesofnucleotideswithlimitedinformationcontent(e.g.thedinucleotiderepeatCACACACACA).Suchsequencescanproducealargenumberofhigh‐scoringbutbiologicallyinsignificantresultsindatabasesearches.Thecomplexityofasequencecanbeestimatedusingmanydifferentapproaches.PRINSEQcalculatesthesequencecomplexityusingtheDUSTandEntropyapproachesastheypresenttwocommonlyusedexamples.
...ACCACACGTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
TGAACACAGTCTATGAGCATACAGAT...
TGAACACAGTCTATGAGCATACAGAT...
TGAACACAGTCTATGAGCATACAGAT...
TGAACACAGTCTATGAGCATACAGAT...
TGAACACAGTCTATGAGCATACAGAT...
GTGTACATGAACACAGTATATGAGCATACAGAT...
GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 11 of 15
TheDUSTapproachisadaptedfromthealgorithmusedtomasklow‐complexityregionsduringBLASTsearchpreprocessing[6].Thescoresarecomputedbasedonhowoftendifferenttrinucleotidesoccurandarescaledfrom0to100.Higherscoresimplylowercomplexityandcomplexityscoresabove7canbeconsideredlow‐complexity.Asequenceofhomopolymerrepeats(e.g.TTTTTTTTT)hasascoreof100,ofdinucleotiderepeats(e.g.TATATATATA)hasascorearound49,andoftrinucleotiderepeats(e.g.TAGTAGTAGTAG)hasascorearound32.
TheEntropyapproachevaluatestheentropyoftrinucleotidesinasequence.Theentropyvaluesarescaledfrom0to100andlowerentropyvaluesimplylowercomplexity.Asequenceofhomopolymerrepeats(e.g.TTTTTTTTT)hasanentropyvalueof0,ofdinucleotiderepeats(e.g.TATATATATA)hasavaluearound16,andoftrinucleotiderepeats(e.g.TAGTAGTAGTAG)hasavaluearound26.Sequenceswithanentropyvaluebelow70canbeconsideredlow‐complexity.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 12 of 15
Tagsequences
Tagsequencesareartifactsattheendsofsequencereadssuchasmultiplexidentifiers,adapters,andprimersequencesthatwereintroducedduringpre‐amplificationwithprimer‐basedmethods.Thebasefrequenciesacrossthereadspresentaneasywaytocheckfortagsequences.Ifthedistributionseemsuneven(highfrequenciesforcertainbasesoverseveralpositions),itcouldindicatesomeresidualtagsequences.Thefollowingthreeexamplesshowthebasefrequenciesofdatasetswithnotagsequence,multiplexidentifier(MID)tagsequence,andwholetranscriptomeamplified(WTA)tagsequence.
ThosetagsequenceshouldbetrimmedusingaprogramsuchasTagCleaner(http://tagcleaner.sourceforge.net)[4].Theinputtoanysuchtrimmingprogramshouldbeuntrimmedreads(e.g.notqualitytrimmed),asthiswillalloweasierandmoreaccurateidentificationoftagsequences.PRINSEQcanbeusedaftertagsequencetrimmingtocheckifthetagswereremovedsufficiently.Inadditiontothefrequencyplots,PRINSEQestimatesifthedatasetcontainstagsequences.Theprobabilitiesforatagsequenceatthe5’‐or3’‐endrequireacertainnumberofsequences(10,000shouldbesufficient).Apercentagebelow40%doesnotalwayssuggestatagsequence,especiallyifitcannotbeobservedfromthebasefrequencies.Theestimationdoesnotworkforsequencedatasetsthattargetasingleloci(e.g.16S)andshouldonlybeusedforrandomlysequencedsamplessuchasmetagenomes.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 13 of 15
Sequencecontamination
SequencesobtainedfromimpurenucleicacidpreparationsmaycontainDNAfromsourcesotherthanthesample.Thosesequencecontaminationsareaseriousconcerntothequalityofthedatausedfordownstreamanalysis,possiblycausingerroneousconclusions.ThedinucleotideoddsratiosascalculatedbyPRINSEQusetheinformationcontentinthesequencesofadatasetandcanbeusedtoidentifypossiblycontamination[7].Furthermore,dinucleotideabundanceshavebeenshowntocapturethemajorityofvariationingenomesignaturesandcanbeusedtocompareametagenometoothermicrobialorviralmetagenomes.PRINSEQusesprincipalcomponentanalysis(PCA)togroupmetagenomesfromsimilarenvironmentsbasedondinucleotideabundances.Thiscanhelptoinvestigateifthecorrectsamplewassequenced,asviralandmicrobialmetagenomesshowdistinctpatterns.Assamplesmightbeprocessedusingdifferentprotocolsorsequencedusingdifferenttechniques,thisfeatureshouldbeusedwithcaution.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 14 of 15
ThePCAplotsinPRINSEQshowhowtheusermetagenome(representedbyareddot)groupswithothermetagenomes(bluedots).Sincetheplotsaregeneratedformicrobialandviralmetagenomesseparately,theyaremarkedwithanMorV(topleftcorner).Thepercentagesinparenthesisshowtheexplainedvariationinthefirstandsecondprincipalcomponent.Theplotsaregeneratedusingpreprocesseddatafrompublishedmetagenomesthatweresequencedusingthe454/Rochesequencingplatform.Ifsequencescontaintagsequencesoraretargetedtoacertainloci(e.g.16S),thisapproachwillnotbeabletogrouptheuserdatatometagenomesfromthesameenvironment.Theplotaboveshowshowamicrobialmetagenomemightberelatedtoothermicrobialmetagenomes.(Thisplotsuggestthatthemetagenomeislikelyamarinemetagenomesampledinacoastalregion.)
Thefollowingplotsshowhowaviralmetagenomedoesnotgroupwiththemicrobialmetagenomes(left)butcloselywithothermosquitometagenomes(right).
PRINSEQadditionalliststhedinucleotiderelativeabundanceoddsratiosfortheuploadeddataset.AnomaliesintheoddsratioscanbeusedtoidentifydiscrepanciesinmetagenomessuchashumanDNAcontamination(depressionoftheCGdinucleotidefrequency).
Assemblyqualitymeasures
TheNxxcontigsizeisaweightedmedianthatisdefinedasthelengthofthesmallestcontigCinthesortedlistofallcontigswherethecumulativelengthfromthelargestcontigtocontigCisatleastxx%ofthetotallength(sumofcontiglengths).Replacexxbythepreferredvaluesuchas90togettheN90contigsize.ThehighertheNxxvalue,thehighertherateoflongercontigsandthebetterthedataset.Ifthedatasetdoesnotcontaincontigsorscaffolds,thisinformationcanbeignored.
Robert Schmieder - [email protected] Quality control with PRINSEQ
Last modified: January 16, 2011 Page 15 of 15
References
1.HuseS,HuberJ,MorrisonH,SoginM,WelchD:AccuracyandqualityofmassivelyparallelDNApyrosequencing.GenomeBiology2007,8:R143.
2.LiH,DurbinR:FastandaccurateshortreadalignmentwithBurrowsWheelertransform.Bioinformatics2009,25:1754‐1760.
3.LiH,HomerN:Asurveyofsequencealignmentalgorithmsfornextgenerationsequencing.BriefBioinform2010.
4.SchmiederR,LimYW,RohwerF,EdwardsR:TagCleaner:Identificationandremovaloftagsequencesfromgenomicandmetagenomicdatasets.BMCBioinformatics2010,11:341.
5.Gomez‐AlvarezV,TealTK,SchmidtTM:Systematicartifactsinmetagenomesfromcomplexmicrobialcommunities.ISMEJ2009,3:1314‐1317.
6.MorgulisA,GertzEM,SchäfferAA,AgarwalaR:AfastandsymmetricDUSTimplementationtomasklowcomplexityDNAsequences.J.Comput.Biol2006,13:1028‐1040.
7.WillnerD,ThurberRV,RohwerF:Metagenomicsignaturesof86microbialandviralmetagenomes.Environ.Microbiol2009.