Quality control with PRINSEQprinseq.sourceforge.net/Quality_control_with_PRINSEQ.pdf · 2011-01-16 · Quality control Sequencing technologies are not perfect and the quality control

Robert Schmieder - [email protected] Quality control with PRINSEQ

Last modified: January 16, 2011 Page 1 of 15

Qualitycontrol

Sequencingtechnologiesarenotperfectandthequalitycontrol(QC)isanessentialsteptoensurethatthedatausedfordownstreamanalysisisnotcompromisedoflow‐qualitysequences,sequenceartifacts,orsequencecontaminationthatmightleadtoerroneousconclusions.TheeasiestwayofQCislookingatsummarystatisticsofthedata.Therearedifferentprogramsthatcanproducethosestatistics.Webapplicationsallowuserstoeasilyshareanddiscusstheresultswithotherpeoplewithouttransferringlargedatafiles.ThefollowingQCstepsareimplementedinandallgraphicsgeneratedbyPRINSEQ(http://prinseq.sourceforge.net).

Content:

• Necessaryresources• UploadingdatatothePRINSEQwebversion

• Numberandlengthofsequences• Basequalities

• GCcontent

• Poly‐A/Ttails• Ambiguousbases

• Sequenceduplications

• Sequencecomplexity• Tagsequences

• Sequencecontamination• Assemblyqualitymeasures

• References



Necessaryresources

Hardware

ComputerconnectedtotheInternet

Software

Up‐to‐dateWebbrowser(Firefox,Safari,Chrome,InternetExplorer,…)

Files

FASTAfilewithsequencedataQUALfilewithqualityscores(ifavailable)

FASTQfile(asalternativeformat)

UploadingdatatothePRINSEQwebversion

TouploadanewdatasetinFASTAandQUALformat(orFASTQformat)toPRINSEQ,followthesesteps:

1. Gotohttp://prinseq.sourceforge.net

2. Clickon“UsePRINSEQ”inthetopmenuontheright(thelatestPRINSEQwebversionshouldload)

3. Clickon“Uploadnewdata”

4. SelectyourFASTAandQUALfilesoryourFASTQfileandclick“Submit”

Afterthedataisparsedandprocessedsuccessfully,theuserinterfacewillshowamenuontheleftandamessageinthemainpanelasshownbelow.



Notes

Afterclickingthesubmitbutton,astatusbar(notprogressbar)willbedisplayeduntilthefileuploadiscompleted.Duringthedataprocessing,severalprogressbarswillshowtheprogressofthedataparsingandstatisticscalculationsteps.

Possibleproblems

1. ThePRINSEQwebinterfacedoesnotload/isnotvisible.

Youonlyseethisandnothingelsehappens:

Solution:MakesurethatyouhaveJavaScriptactivatedinyourbrowser,asthisisrequiredtoloadandusePRINSEQ’swebinterface.

2. Theuploadstatusbardoesnotdisappear.

Afterclickingonthesubmitbuttonyouseethisanditdoesnotdisappear:

Solution:Thefirstthingtocheckisifthefileisstilluploading.Theeasiestwaytodothisisbycheckingtheloadingiconinyourbrowser.

Ifyouseetheloadingicon(right)insteadofthePRINSEQicon(left),yourfileisstilluploadingandyoushouldgiveitmoretime.IfyouseethePRINSEQiconinsteadoftheloadingicon,yourfiledidnotuploadcompletelyandthiscausedanerror.IfyouhaveaslowconnectiontotheInternetortrytouploadlargefiles,theconnectiontothewebservercantimeoutbeforetheuploadwascompleted.Ifyoudidnotuploadcompressedfiles,trytocompressyourfileswithanyofthesupportedcompressionalgorithms(ZIP,GZIP,…).Inrarecases,theissuecanalsobecausedbycertainFirefoxpluginsorextensions.Ifpossible,useanalternativebrowsertotestifthiswasthecase.Ifthebrowsercausedtheproblem,updatingFirefoxandtheplugins/extensiontothelatestversionmightsolvetheproblem.



Numberandlengthofsequences

Checkthosenumberstomakesureitmatchesapproximatelythemanufacturerestimates.Ifyournumbersareofftoomuch,checktherawdataandfilterstatisticsin"454BaseCallerMetrics"and"454QualityFilterMetrics".

Lengthdistribution

Thelengthdistributioncanbeusedasqualitymeasureforthesequencingrun.Youwouldexpectanormaldistributionforthebestresult.However,mostsequencingresultsshowaslowlyincreasingandthenasteepfallingdistribution.TheplotsinPRINSEQmarkthemeanlength(M)andthelengthforoneandtwostandarddeviations(1SDand2SD),whichcanhelptodecidewheretosetlengththresholdsforthedatapreprocessing.Ifanyofthesequencesislongerthan100bp,thelengthswillbebinnedintheplotsgeneratedbyPRINSEQ.Thenumberofsequencesforeachbinisthenshowninsteadofthenumberofsequencesforasinglelength(valuesmightthereforebebiggerthanshowninthetablefornon‐binnedlengths).

Thefollowingtwodatasetshaveapproximatelythesamenumberofsequences,howeverthelengthdistributionslookdifferent.



Bothdistributionshavethehighestnumberofsequencesaround500bp,butforthefirstdatasetthemeanofthesequencelengthsishigherandthestandarddeviationislower.Acertainnumberofshorterreadsmightbeexpected,butifthesamplecontainedmainlylongerfragments,thisnumbershouldbelow.

Assumingthatbothsamplescontainedenoughfragmentsofatleast500bpandallfragmentsweresequencedwiththesamenumberofcycles(sequencingflows),wewouldexpectthatthemajorityofthesequenceswouldhaveapproximatelythesamelength.Thehigheramountofshorterreadsintheseconddatasetsuggeststhatthosereadsmighthavebeenoflowerqualityandweretrimmedduringthesignalprocessing.Ifthesamplecontainedmanyshortfragments,theshorterreadsmightbefromthosefragmentsandnotoflowerquality.

Minimumandmaximumreadlength

SequencesintheSFFfilescanbeasshortas40bp(shortersequencesarefilteredduringsignalprocessing).Formultiplexedsamples,theMIDtrimmedsequencescanbeasshortat28bp(assuminga12bpMIDtag).Suchshortsequencescancauseproblemsduring,forexample,databasesearchestofindsimilarsequences.Shortsequencesaremorelikelytomatchatarandompositionbychancethanlongersequencesandmaythereforeresultinfalsepositivefunctionalortaxonomicalassignments.Furthermore,shortsequencesarelikelytobequalitytrimmedduringthesignal‐processingstepandoflowerqualitywithpossiblesequencingerrors.

Insomecases,sequencescanbemuchlongerthanseveralstandarddeviationsabovethemeanlength(e.g.1,500+bpfora500bpmeanlengthwitha100bpstandarddeviation).Thosesequencesshouldbeusedwithcautionastheylikelycontainlongstretchesofhomopolymerrunsasinthefollowingexample.Homopolymersareaknownissueofpyrosequencingtechnologiessuchas454/Roche[1].

aactttaaccttttaaaacccccttaaaaaaactttaaaccccgtaaaccccccgggttt ttttttaaaaaaccgttttttacgggggtttaccccgttttaccggggttttgggggttt taaaaaaaacgggttttaaacgggttaacccccgggttttccgggggtttaaaaagtttt tttaaacgggggttttcccgtaaaaaaaaaaccccgtttaaaaaaaggggttaaaaaaaa aaggggttaaccccccggggtttaaaaaaaaccttttttttttttaaaaaaaacgttttt tttttttaaaaggggttttttttacgggggtaaacgggggggttaaaaaaaaaccccccc cggggggttttaaaaaaaaaacccccggttttaaaaaaccccgttttaacccctttaaaa aaaaaacgggggggttttaaaaaaaaaagggggttttttttttttaaaaacccgttttta aaaaccccccgttttttaacccgggttaaaccccccccgggggggtaaaacccccccccc ggggtaaccccctttttttaaaacccccccccgttttttacccgggggtttttacccccg gggggggtaaaaaaacggggggtttttttttttttaaaaccggggttttttttttttaaa ccccggtttttaaaaaccggtttttaccccggggggttttacccccgggggggggttttt aaaacccccggtttaaaactttaaaaacccgggtaaccccggggttttaaaaaaaaaaaa aaaaccccccccgttaaaaaaaaaaaacccgttttttttttaaaaaaaacccccccccgg ttttaaaaccccccccgggggtttttaccccggggttttaaaaaaaacccgtttaaaaaa accgggttttttaaaggggttttaaacccccccccc



Basequalities

AsforSangersequencing,next‐generationsequencersproducedatawithlinearlydegradingqualityacrosstheread.Thequalityscoresfor454/RochesequencersarePhred‐basedsinceversion1.1.03,rangingfrom0to40.Phredvaluesarelog‐scaled,whereaqualityscoreof10representsa1in10chanceofanincorrectbasecallandaqualityscoreof20representsa1in100chanceofanincorrectbasecall.

InPRINSEQ,thequalityscoresareplottedacrossthereadsusingboxplots.Thex‐axisindicatestheabsolutepositionifallreadsarenolongerthan100bpandtherelativeposition(in%ofreadlength)ifanyreadislongerthan100bp.Fordatasetswithanyreadlongerthan100bp,asecondplotshowsbinnedqualityvaluestokeepitsabsolutepositions.Thisplotishelpfultoidentifyqualityscoresattheendoflongerreads,whichwouldotherwisebegroupedwiththeendsoftheshorterreads.ThefollowingexampleshowsthequalityscoresacrossthereadlengthforfragmentssequencedwithGSFLXusingtheTitaniumkit.Thesequenceswithlowqualityscoresattheendsshouldbetrimmedduringdatapreprocessing.

Inadditiontothedecreaseinqualityacrosstheread,regionswithhomopolymerstretcheswilltendtohavelowerqualityscores.Huseetal.[1]foundthatsequenceswithanaveragescorebelow25hadmoreerrorsthanthosewithhigheraverages.Therefore,itishelpfultotakealookattheaverage(ormean)qualityscores.PRINSEQprovidesaplotthatshowsthedistributionofsequencemeanqualityscoresofadataset,asshownbelow.Themajorityofthesequencesshouldhavehighmeanqualityscores.



Lowqualitysequencescancauseproblemsduringdownstreamanalysis.Mostassemblersoralignersdonottakeintoaccountqualityscoreswhenprocessingthedata.Theerrorsinthereadscancomplicatetheassemblyprocessandmightcausemisassembliesormakeanassemblyimpossible.

GCcontent

TheGCcontentdistributionofmostsamplesshouldfollowanormaldistribution.Insomecases,abi‐modaldistributioncanbeobserved,especiallyformetagenomicdatasets.TheGCcontentplotinPRINSEQmarksthemeanGCcontent(M)andtheGCcontentforoneandtwostandarddeviations(1SDand2SD).ThiscanhelptodecidewheretosettheGCcontentthresholds,ifaGCcontentfilterwillbeapplied.Theplotcanalsobeusedtofindthethresholdsorrangetoselectsequencesfromabi‐modaldistribution.

PolyA/Ttails

Poly‐A/TtailsareconsideredrepeatsofAsorTsatthesequenceends.InPRINSEQ,theminimumlengthofatailis5bpandsequencesthatcontainonlyAsorTsarecountedforbothends.Asmallnumberoftailscanoccurevenaftertrimmingpoly‐A/Ttails.Forexample,asequencethatendswithAAAAATTTTTandthathasbeentrimmedforthepoly‐Twillstillcontainthepoly‐A.

Trimmingpoly‐A/Ttailscanreducethenumberoffalsepositivesduringdatabasesearches,aslongtailstendtoalignwelltosequenceswithlowcomplexityorsequenceswithtails(e.g.viralsequences)inthedatabase.



Ambiguousbases

SequencescancontaintheambiguousbaseNforpositionsthatcouldnotbeidentifiedasaparticularbase.AhighnumberofNscanbeasignforalowqualitysequenceorevendataset.Ifnoqualityscoresareavailable,thesequencequalitycanbeinferredfromthepercentofNsfoundinasequenceordataset.Huseetal.[1]foundthatthepresenceofanyambiguousbasecallswasasignforoverallpoorsequencequality.

Ambiguousbasescancauseproblemsduringdownstreamanalysis.AssemblerssuchasVelvetandalignerssuchasSHAHA2orBWAusea2‐bitencodingsystemtorepresentnucleotides,asitoffersaspaceefficientwaytostoresequences.Forexample,thenucleotidesA,C,GandTmightbe2‐bitencodedas00,01,10and11.The2‐bitencoding,however,onlyallowstostorethefournucleotidesandanyadditionalambiguousbasecannotberepresented.Thedifferentprogramsdealwiththeproblemindifferentways.Someprogramsreplaceambiguousbaseswitharandombase(e.g.BWA[2])andotherswithafixedbase(e.g.SHAHA2andVelvetreplaceNswithAs[3]).Thiscanresultinmisassembliesorfalsemappingofsequencestoareferencesequenceandtherefore,sequenceswithahighnumberofNsshouldberemovedbeforedownstreamanalysis.

Sequenceduplications

Realorartificial?Assumingarandomsamplingofthegenomicmaterialinanenvironmentsuchasinmetagenomicstudies,readsshouldnotstartatthesamepositionandhavethesameerrors(atleastnotinthenumbersthattheyhavebeenobservedinmostmetagenomes).Gomez‐Alvarezetal.[5]investigatedtheprobleminmoredetailanddidnotfindaspecificpatternorlocationonthesequencingplatethatcouldexplaintheduplications.

Duplicatescanarisewhentherearetoofewfragmentspresentatanystagepriortosequencing,especiallyduringanyPCRstep.Furthermore,thetheoreticalideaofonemicro‐reactorcontainingonebeadfor454/Rochesequencingdoesnotalwaystranslateintopracticewheremanybeadscanbefoundinasinglemicro‐reactor.Unfortunately,artificialduplicatesaredifficulttodistinguishfromexactlyoverlappingreadsthatnaturallyoccurwithindeepsequencesamples.

Thenumberofexpectedsequenceduplicateshighlydependsonthedepthofthelibrary,thetypeoflibrarybeingsequenced(wholegenome,transcriptome,16S,metagenome,...),andthesequencingtechnologyused.Thesequenceduplicatescanbedefinedusingdifferentmethods.Exactduplicatesareidenticalsequencecopies,whereas5'or3'duplicatesaresequencesthatareidenticalwiththe5'or3'endofalongersequence.Consideringthedouble‐strandednatureofDNA,duplicatescouldalsobeconsideredsequencesthatareidenticalwiththereversecomplementofanothersequence.



ThedifferentplotsinPRINSEQcanbehelpfultoinvestigatethedegreeofsequenceduplicationsinadataset.Thefollowingplotshowsthenumberofsequenceduplicatesfordifferentlengths.Thedistributionofduplicatesshouldbesimilartothelengthdistributionofthedataset.Thenumberof5’duplicatesishigherforshortersequences(asobservedintheexamplebelow),suggestingthatexactsequenceduplicatesmayhavebeentrimmedduringsignalprocessing.

Thenumberofexactduplicatesisoftenhigherthanthenumberof5’and3’duplicatesasinthefollowingexample.



PRINSEQoffersadditionalplotstoinvestigatethesequenceduplicatesfromdifferentpointsofview.Theplotshowingthesequenceduplicationlevels(withnumberofsequenceswithoneduplicate,twoduplicates,threeduplicates,…)canbeusedtoidentifythedistributionofduplicates(e.g.domanysequenceshaveonlyafewduplicates).Theplotshowingthehighestnumberofduplicatesforasinglesequence(top100)canhelptoindentifyifonlyafewsequenceshavemanyduplicates(e.g.asaresultofspecificPCRamplification)andwhatthehighestduplicationnumbersare.

Dependingonthedatasetanddownstreamanalysis,itshouldbeconsideredtofiltersequenceduplicates.ThemainpurposeofremovingduplicatesistomitigatetheeffectsofPCRamplificationbiasintroducedduringlibraryconstruction.Inaddition,removingduplicatescanresultincomputationalbenefitsbyreducingthenumberofsequencesthatneedtobeprocessedandbyloweringthememoryrequirements.Sequenceduplicatescanalsoimpactabundanceorexpressionmeasuresandcanresultinfalsevariant(SNP)calling.Theexamplebelowshowsthealignmentofsequencesagainstareferencesequence(gray).Thesequenceduplicates(startingatthesameposition)suggestapossiblyfalsefrequencyofbaseCatthepositionmarkedinbold.

Keepinmindthatthenumberofsequenceduplicatesalsodependsontheexperiment.Forshort‐readdatasetswithhighcoveragesuchasinultra‐deepsequencingorgenomere‐sequencingdatasets,eliminatingsingletonscanpresentaneasywayofdramaticallyreducingthenumberoferror‐pronereads.

Sequencecomplexity

Genomesequencescanexhibitintervalswithlow‐complexity,whichmaybepartofthesequencedatasetwhenusingrandomsamplingtechniques.Low‐complexitysequencesaredefinedashavingcommonlyfoundstretchesofnucleotideswithlimitedinformationcontent(e.g.thedinucleotiderepeatCACACACACA).Suchsequencescanproducealargenumberofhigh‐scoringbutbiologicallyinsignificantresultsindatabasesearches.Thecomplexityofasequencecanbeestimatedusingmanydifferentapproaches.PRINSEQcalculatesthesequencecomplexityusingtheDUSTandEntropyapproachesastheypresenttwocommonlyusedexamples.

...ACCACACGTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...

TGAACACAGTCTATGAGCATACAGAT...





GTGTACATGAACACAGTATATGAGCATACAGAT...

GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...



TheDUSTapproachisadaptedfromthealgorithmusedtomasklow‐complexityregionsduringBLASTsearchpreprocessing[6].Thescoresarecomputedbasedonhowoftendifferenttrinucleotidesoccurandarescaledfrom0to100.Higherscoresimplylowercomplexityandcomplexityscoresabove7canbeconsideredlow‐complexity.Asequenceofhomopolymerrepeats(e.g.TTTTTTTTT)hasascoreof100,ofdinucleotiderepeats(e.g.TATATATATA)hasascorearound49,andoftrinucleotiderepeats(e.g.TAGTAGTAGTAG)hasascorearound32.

TheEntropyapproachevaluatestheentropyoftrinucleotidesinasequence.Theentropyvaluesarescaledfrom0to100andlowerentropyvaluesimplylowercomplexity.Asequenceofhomopolymerrepeats(e.g.TTTTTTTTT)hasanentropyvalueof0,ofdinucleotiderepeats(e.g.TATATATATA)hasavaluearound16,andoftrinucleotiderepeats(e.g.TAGTAGTAGTAG)hasavaluearound26.Sequenceswithanentropyvaluebelow70canbeconsideredlow‐complexity.



Tagsequences

Tagsequencesareartifactsattheendsofsequencereadssuchasmultiplexidentifiers,adapters,andprimersequencesthatwereintroducedduringpre‐amplificationwithprimer‐basedmethods.Thebasefrequenciesacrossthereadspresentaneasywaytocheckfortagsequences.Ifthedistributionseemsuneven(highfrequenciesforcertainbasesoverseveralpositions),itcouldindicatesomeresidualtagsequences.Thefollowingthreeexamplesshowthebasefrequenciesofdatasetswithnotagsequence,multiplexidentifier(MID)tagsequence,andwholetranscriptomeamplified(WTA)tagsequence.

ThosetagsequenceshouldbetrimmedusingaprogramsuchasTagCleaner(http://tagcleaner.sourceforge.net)[4].Theinputtoanysuchtrimmingprogramshouldbeuntrimmedreads(e.g.notqualitytrimmed),asthiswillalloweasierandmoreaccurateidentificationoftagsequences.PRINSEQcanbeusedaftertagsequencetrimmingtocheckifthetagswereremovedsufficiently.Inadditiontothefrequencyplots,PRINSEQestimatesifthedatasetcontainstagsequences.Theprobabilitiesforatagsequenceatthe5’‐or3’‐endrequireacertainnumberofsequences(10,000shouldbesufficient).Apercentagebelow40%doesnotalwayssuggestatagsequence,especiallyifitcannotbeobservedfromthebasefrequencies.Theestimationdoesnotworkforsequencedatasetsthattargetasingleloci(e.g.16S)andshouldonlybeusedforrandomlysequencedsamplessuchasmetagenomes.



Sequencecontamination

SequencesobtainedfromimpurenucleicacidpreparationsmaycontainDNAfromsourcesotherthanthesample.Thosesequencecontaminationsareaseriousconcerntothequalityofthedatausedfordownstreamanalysis,possiblycausingerroneousconclusions.ThedinucleotideoddsratiosascalculatedbyPRINSEQusetheinformationcontentinthesequencesofadatasetandcanbeusedtoidentifypossiblycontamination[7].Furthermore,dinucleotideabundanceshavebeenshowntocapturethemajorityofvariationingenomesignaturesandcanbeusedtocompareametagenometoothermicrobialorviralmetagenomes.PRINSEQusesprincipalcomponentanalysis(PCA)togroupmetagenomesfromsimilarenvironmentsbasedondinucleotideabundances.Thiscanhelptoinvestigateifthecorrectsamplewassequenced,asviralandmicrobialmetagenomesshowdistinctpatterns.Assamplesmightbeprocessedusingdifferentprotocolsorsequencedusingdifferenttechniques,thisfeatureshouldbeusedwithcaution.



ThePCAplotsinPRINSEQshowhowtheusermetagenome(representedbyareddot)groupswithothermetagenomes(bluedots).Sincetheplotsaregeneratedformicrobialandviralmetagenomesseparately,theyaremarkedwithanMorV(topleftcorner).Thepercentagesinparenthesisshowtheexplainedvariationinthefirstandsecondprincipalcomponent.Theplotsaregeneratedusingpreprocesseddatafrompublishedmetagenomesthatweresequencedusingthe454/Rochesequencingplatform.Ifsequencescontaintagsequencesoraretargetedtoacertainloci(e.g.16S),thisapproachwillnotbeabletogrouptheuserdatatometagenomesfromthesameenvironment.Theplotaboveshowshowamicrobialmetagenomemightberelatedtoothermicrobialmetagenomes.(Thisplotsuggestthatthemetagenomeislikelyamarinemetagenomesampledinacoastalregion.)

Thefollowingplotsshowhowaviralmetagenomedoesnotgroupwiththemicrobialmetagenomes(left)butcloselywithothermosquitometagenomes(right).

PRINSEQadditionalliststhedinucleotiderelativeabundanceoddsratiosfortheuploadeddataset.AnomaliesintheoddsratioscanbeusedtoidentifydiscrepanciesinmetagenomessuchashumanDNAcontamination(depressionoftheCGdinucleotidefrequency).

Assemblyqualitymeasures

TheNxxcontigsizeisaweightedmedianthatisdefinedasthelengthofthesmallestcontigCinthesortedlistofallcontigswherethecumulativelengthfromthelargestcontigtocontigCisatleastxx%ofthetotallength(sumofcontiglengths).Replacexxbythepreferredvaluesuchas90togettheN90contigsize.ThehighertheNxxvalue,thehighertherateoflongercontigsandthebetterthedataset.Ifthedatasetdoesnotcontaincontigsorscaffolds,thisinformationcanbeignored.



References

1.HuseS,HuberJ,MorrisonH,SoginM,WelchD:AccuracyandqualityofmassivelyparallelDNApyrosequencing.GenomeBiology2007,8:R143.

2.LiH,DurbinR:FastandaccurateshortreadalignmentwithBurrowsWheelertransform.Bioinformatics2009,25:1754‐1760.

3.LiH,HomerN:Asurveyofsequencealignmentalgorithmsfornextgenerationsequencing.BriefBioinform2010.

4.SchmiederR,LimYW,RohwerF,EdwardsR:TagCleaner:Identificationandremovaloftagsequencesfromgenomicandmetagenomicdatasets.BMCBioinformatics2010,11:341.

5.Gomez‐AlvarezV,TealTK,SchmidtTM:Systematicartifactsinmetagenomesfromcomplexmicrobialcommunities.ISMEJ2009,3:1314‐1317.

6.MorgulisA,GertzEM,SchäfferAA,AgarwalaR:AfastandsymmetricDUSTimplementationtomasklowcomplexityDNAsequences.J.Comput.Biol2006,13:1028‐1040.

7.WillnerD,ThurberRV,RohwerF:Metagenomicsignaturesof86microbialandviralmetagenomes.Environ.Microbiol2009.

Documents

Quality control with PRINSEQprinseq.sourceforge.net/Quality_control_with_PRINSEQ.pdf · 2011-01-16 · Quality control Sequencing technologies are not perfect and the quality control