Mining hidden information from your 454 data using modular and database-oriented methods
Joachim De Schrijver
Short introduction on 454 sequencing
Variant Identification Pipeline
Possibilities of a DB-oriented pipeline
Examples
◦ Coverage
◦ Improving PCR
◦ Fast Q assessment
◦ Homopolymers
Overview
Roche/454 GS-FLX sequencing:
◦ Pyrosequencing
◦ ± 400,000 reads/run
◦ Average length: 200-250 bp
Applications:
◦ Resequencing: variant identification
◦ De novo (genome) sequencing: assembly of new regions, plasmids or entire genomes
Standard software:
◦ Variants: Amplicon Variant Analyzer (AVA)
◦ Assembly: standard 454 assembler
Introduction (i)
Standard software
◦ + Easy to use
◦ + Reproducible results on similar datasets
◦ + GUI (graphical user interface)
◦ - No answer for ‘non-standard’ questions
   Methylation experiments
   Different types of experiments grouped together
   …
◦ - What about ‘hidden’ information?
   Homopolymer error rates
   Quality score ~ length of sequenced read
   ‘Multirun’ information
   …
Introduction (ii)
Modular and database-oriented pipeline
Modular:
◦ Efficient planning
◦ Scalable
Database (DB):
◦ No loss of data
◦ Grouping several runs together
Variant Identification Pipeline (i)
Basic idea: Data are processed once and stored in the DB. Results (reports) are calculated ‘on the fly’ from the DB data.
◦ Fast & efficient
◦ Calculations only happen once
◦ Everybody can access the database without risk of data modification
◦ Reporting is independent of the data processing
Paper: De Schrijver et al. 2009. Analysing 454 sequences with a modular and database oriented Variant Identification Pipeline
Variant Identification pipeline (ii)
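The ‘process once, report on the fly’ idea can be illustrated with a minimal sketch. It assumes an in-memory SQLite database and a hypothetical `mapped_reads` table. The schema, column names and sample rows are illustrative assumptions, not the actual VIP schema.

```python
import sqlite3

# Hypothetical schema: one row per mapped read (not the actual VIP schema).
SCHEMA = """
CREATE TABLE mapped_reads (
    read_id  TEXT PRIMARY KEY,
    run      TEXT,
    lane     INTEGER,
    mid      TEXT,
    amplicon TEXT,
    length   INTEGER
)
"""

def load_reads(conn, rows):
    """Processing step: store the results once so they never need recomputing."""
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO mapped_reads VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()

def reads_per_run(conn):
    """Reporting step: a read-only query evaluated on demand."""
    cur = conn.execute(
        "SELECT run, COUNT(*) FROM mapped_reads GROUP BY run ORDER BY run"
    )
    return cur.fetchall()

# Made-up example rows, for illustration only.
EXAMPLE_ROWS = [
    ("r1", "run_A", 1, "MID1", "amp1", 240),
    ("r2", "run_A", 1, "MID2", "amp1", 231),
    ("r3", "run_B", 2, "MID1", "amp2", 198),
]

conn = sqlite3.connect(":memory:")
load_reads(conn, EXAMPLE_ROWS)
report = reads_per_run(conn)
```

Because the report is just a `SELECT`, new reports can be added later without touching the processing step or the stored data.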
VIP was originally developed for variant identification
Now being used in:
◦ Amplicon resequencing
◦ De novo shotgun
◦ Methylation
◦ ~ Solexa experiments
‘Hidden’ data can be extracted using intelligent querying strategies
Results per lane/Multiplex MID/run…
Possibilities of a DB oriented pipeline
Coverage can be calculated per
◦ Lane
◦ MID
◦ Amplicon
◦ Base position
Assessment of errors (PCR dropouts vs. human errors)
Example: Detailed coverage
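Coverage at each of these levels could be obtained from one table by changing only the grouping column. The sketch below assumes a hypothetical `alignments` table with one row per mapped read and inclusive reference coordinates. The table name, columns and sample rows are assumptions for illustration only.

```python
import sqlite3
from collections import Counter

# Grouping levels accepted by coverage_by (column names are hypothetical).
ALLOWED_LEVELS = {"lane", "mid", "amplicon"}

def coverage_by(conn, level):
    """Number of mapped reads per lane, MID or amplicon."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported level: {level}")
    # The column name comes from the whitelist above, so it is safe to interpolate.
    cur = conn.execute(
        f"SELECT {level}, COUNT(*) FROM alignments GROUP BY {level} ORDER BY {level}"
    )
    return dict(cur.fetchall())

def coverage_per_base(conn, amplicon):
    """Read depth at every covered reference position of one amplicon."""
    depth = Counter()
    cur = conn.execute(
        "SELECT ref_start, ref_end FROM alignments WHERE amplicon = ?", (amplicon,)
    )
    for start, end in cur:
        for pos in range(start, end + 1):  # coordinates are inclusive
            depth[pos] += 1
    return dict(depth)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alignments "
    "(lane INTEGER, mid TEXT, amplicon TEXT, ref_start INTEGER, ref_end INTEGER)"
)
# Made-up example alignments.
conn.executemany(
    "INSERT INTO alignments VALUES (?, ?, ?, ?, ?)",
    [(1, "MID1", "amp1", 1, 5), (1, "MID2", "amp1", 3, 8), (2, "MID1", "amp2", 1, 4)],
)

per_lane = coverage_by(conn, "lane")
per_mid = coverage_by(conn, "mid")
per_base = coverage_per_base(conn, "amp1")
```

An amplicon with no reads for any MID would point to a PCR dropout, whereas a single MID with no reads across all amplicons would rather suggest a handling error. That interpretation is left to the user and is not encoded in the sketch.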
[Figure: MID frequency (unmapped); x-axis 1 to 12, y-axis 0% to 14%]
Amplicon Resequencing experiment
Goal: Variant identification
Length distributions
◦ Mapped
◦ Unmapped
◦ ‘Short’ mapped
Additional length separation + Improved PCR
Result: Improved efficiency
Example: Improving PCR
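The three length distributions could be tabulated along the following lines. This is a sketch only: the cutoff that makes a mapped read ‘short’ and the bin size are hypothetical parameters, since the slides do not state them, and the reads are made-up.

```python
from collections import Counter, defaultdict

def classify(length, is_mapped, short_cutoff=100):
    """Assign a read to one of the three categories (cutoff is an assumption)."""
    if not is_mapped:
        return "unmapped"
    return "short_mapped" if length < short_cutoff else "mapped"

def length_histograms(reads, bin_size=50, short_cutoff=100):
    """Count reads per length bin, separately for each category.

    `reads` is an iterable of (length, is_mapped) pairs; bins are keyed
    by their lower bound.
    """
    hist = defaultdict(Counter)
    for length, is_mapped in reads:
        category = classify(length, is_mapped, short_cutoff)
        bin_start = (length // bin_size) * bin_size
        hist[category][bin_start] += 1
    return {category: dict(counts) for category, counts in hist.items()}

# Made-up reads: (length in bp, mapped?)
example_reads = [(240, True), (60, True), (45, False), (230, True), (52, False)]
histograms = length_histograms(example_reads)
```

A large share of reads in the unmapped or short mapped bins would be the signal that an additional length separation step is worth considering.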
Can the length of a homopolymer be assessed using the Q score?
Yes, when homopolymer length < 6bp
Example: Homopolymers
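One way to explore this relation is to collect the Q scores of every homopolymer run and average them per run length. The sketch below assumes each read is available as a sequence string with a parallel list of Q scores. The function names and the example read are illustrative, and the sketch only tabulates the data; it does not itself test the < 6bp limit.

```python
from collections import defaultdict
from itertools import groupby

def homopolymer_runs(sequence, qualities, min_length=2):
    """Return (base, run_length, q_scores) for each homopolymer run."""
    runs = []
    pos = 0
    for base, group in groupby(sequence):
        n = len(list(group))
        if n >= min_length:
            runs.append((base, n, qualities[pos:pos + n]))
        pos += n
    return runs

def mean_q_by_length(reads, min_length=2):
    """Average Q score of homopolymer bases, keyed by homopolymer length."""
    totals = defaultdict(lambda: [0, 0])  # length -> [sum of Q, number of bases]
    for sequence, qualities in reads:
        for _base, n, q_scores in homopolymer_runs(sequence, qualities, min_length):
            totals[n][0] += sum(q_scores)
            totals[n][1] += len(q_scores)
    return {n: q_sum / count for n, (q_sum, count) in totals.items()}

# Made-up read: the sequence contains a CC run and a TTT run.
example_seq = "ACCGTTTA"
example_q = [30, 28, 20, 31, 27, 22, 15, 33]

runs = homopolymer_runs(example_seq, example_q)
mean_q = mean_q_by_length([(example_seq, example_q)])
```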
Fast assessment of the quality of a run
Example: Q assessment
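A quick per-run check of this kind can be reduced to the mean Q value at each read position. The sketch below assumes the Q scores of each read are available as a list of integers. The function name and the example values are made-up for illustration.

```python
def mean_q_per_position(quality_lists):
    """Mean Q value at each read position across reads of varying length."""
    sums, counts = [], []
    for qualities in quality_lists:
        for i, q in enumerate(qualities):
            if i == len(sums):  # first read that reaches this position
                sums.append(0)
                counts.append(0)
            sums[i] += q
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

# Made-up Q scores for three reads of different lengths.
example_quals = [
    [40, 38, 30],
    [36, 34],
    [38, 36, 20, 10],
]
profile = mean_q_per_position(example_quals)
```

Plotting `profile` against position gives a Q value ~ position curve for a run, which can then be compared by eye with the curve of a run known to be good.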
[Figure: two plots of Q value ~ position (x-axis: position, y-axis: Q value), with panel captions ‘Lab work OK’ and ‘Errors in lab work’]
Biobix – UGent: Wim Van Criekinge, Tim De Meyer, Geert Trooskens, Tom Vandekerkhove, Leander Van Neste, Gerben Mensschaert
CMG – UZ Gent: Jo Vandesompele, Jan Hellemans, Filip Pattyn, Steve Lefever, Kim Deleeneer, Jean-Pierre Renard
NXT-GNT: Paul Coucke, Sofie Bekaert, Filip Van Nieuwerburgh, Dieter Deforce, Wim Van Criekinge, Jo Vandesompele
Acknowledgements
Questions ?