Mining hidden information from your 454 data using modular and database oriented methods
Joachim De Schrijver
Short introduction on 454 sequencing
Variant Identification Pipeline
Possibilities of a DB oriented pipeline
Examples
  ◦ Coverage
  ◦ Improving PCR
  ◦ Fast Q assessment
  ◦ Homopolymers
Overview
Roche/454 GS-FLX sequencing:
  ◦ Pyrosequencing
  ◦ ± 400,000 reads/run
  ◦ Average length: 200-250 bp
Applications:
  ◦ Resequencing: variant identification
  ◦ De novo (genome) sequencing: assembly of new regions, plasmids or entire genomes
Standard software:
  ◦ Variants: Amplicon Variant Analyzer (AVA)
  ◦ Assembly: standard 454 assembler
Introduction (i)
Standard software
  ◦ + Easy to use
  ◦ + Reproducible results on similar datasets
  ◦ + GUI (graphical user interface)
  ◦ - No answer for ‘non-standard’ questions
      Methylation experiments
      Different types of experiments grouped together
      …
  ◦ - What about ‘hidden’ information?
      Homopolymer error rates
      Quality score ~ length of sequenced read
      ‘Multirun’ information
      …
Introduction (ii)
Modular and database oriented pipeline
Modular:
  ◦ Efficient planning
  ◦ Scalable
Database (DB):
  ◦ No loss of data
  ◦ Grouping several runs together
Variant Identification Pipeline (i)
Basic idea: Data is processed and stored in the DB. Results (reports) are calculated ‘on the fly’ using the DB data.
  ◦ Fast & efficient
  ◦ Calculations only happen once
  ◦ Everybody can access the database without risk of data modification
  ◦ Reporting is independent of the data processing
Paper: De Schrijver et al. 2009. Analysing 454 sequences with a modular and database oriented Variant Identification Pipeline
Variant Identification pipeline (ii)
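The basic idea (process the data once, store it in the DB, calculate reports on the fly) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module and an in-memory database; the table name `reads`, its columns and the example rows are assumptions made for illustration, not the schema used by the VIP.

```python
import sqlite3

def build_db(reads):
    """Store processed reads once.

    `reads` is an iterable of (run, lane, mid, amplicon, length, mapped)
    tuples; this layout is hypothetical.
    """
    con = sqlite3.connect(":memory:")  # in-memory DB, for the sketch only
    con.execute(
        "CREATE TABLE reads (run TEXT, lane INTEGER, mid TEXT, "
        "amplicon TEXT, length INTEGER, mapped INTEGER)"
    )
    con.executemany("INSERT INTO reads VALUES (?, ?, ?, ?, ?, ?)", reads)
    con.commit()
    return con

def report_reads_per_run(con):
    """Report calculated on the fly: total and mapped reads per run."""
    rows = con.execute(
        "SELECT run, COUNT(*), SUM(mapped) FROM reads GROUP BY run ORDER BY run"
    )
    return rows.fetchall()

# Made-up example data
example_reads = [
    ("run1", 1, "MID1", "ampA", 240, 1),
    ("run1", 1, "MID2", "ampA", 90, 0),
    ("run2", 2, "MID1", "ampB", 251, 1),
]
con = build_db(example_reads)
print(report_reads_per_run(con))
```

Because the report is only a query, new reports can be added without touching the processing step, and several runs can be grouped in one table.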
VIP originally developed for variant identification
Now being used in:
  ◦ Amplicon resequencing
  ◦ De novo shotgun
  ◦ Methylation
  ◦ ~ Solexa experiments
‘Hidden’ data can be extracted using intelligent querying strategies
Results per lane/multiplex MID/run…
Possibilities of a DB oriented pipeline
Coverage can be calculated per
  ◦ Lane
  ◦ MID
  ◦ Amplicon
  ◦ Base position
Assessment of errors (PCR dropouts vs. human errors)
Example: Detailed coverage
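The coverage levels listed on this slide can be sketched in a few lines of Python. The read tuples and their field order (lane, MID, amplicon, start, end) are hypothetical; in a DB oriented pipeline the same counts would typically come from grouped queries.

```python
from collections import Counter

# Hypothetical mapped reads: (lane, mid, amplicon, start, end),
# with 1-based inclusive coordinates on the amplicon.
mapped_reads = [
    (1, "MID1", "ampA", 1, 5),
    (1, "MID2", "ampA", 3, 8),
    (2, "MID1", "ampB", 1, 4),
]

def coverage_by(reads, field):
    """Count reads per lane (field 0), MID (field 1) or amplicon (field 2)."""
    return Counter(read[field] for read in reads)

def base_coverage(reads, amplicon):
    """Number of reads overlapping each base position of one amplicon."""
    cov = Counter()
    for _lane, _mid, amp, start, end in reads:
        if amp == amplicon:
            cov.update(range(start, end + 1))
    return cov

print(coverage_by(mapped_reads, 1))
print(sorted(base_coverage(mapped_reads, "ampA").items()))
```

An amplicon with no reads across every MID points towards a PCR dropout, whereas a single MID with no reads is more suggestive of a handling error; telling these apart is the kind of error assessment the slide refers to.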
Figure: MID frequency (unmapped); x-axis 1 to 12, y-axis 0% to 14%
Amplicon resequencing experiment
Goal: variant identification
Length distributions
  ◦ Mapped
  ◦ Unmapped
  ◦ ‘Short’ mapped
Additional length separation + improved PCR
Result: improved efficiency
Example: Improving PCR
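The three length distributions on this slide can be sketched as binned histograms per read class. The 100 bp cutoff for ‘short’ mapped reads and the 50 bp bin size are assumptions; the slides give no values.

```python
from collections import Counter

SHORT_CUTOFF = 100  # assumed threshold in bp, not taken from the slides

def classify(length, mapped):
    """Assign a read to 'unmapped', 'short mapped' or 'mapped'."""
    if not mapped:
        return "unmapped"
    return "short mapped" if length < SHORT_CUTOFF else "mapped"

def length_distributions(reads, bin_size=50):
    """Return {class: Counter({bin_start: n_reads})} for (length, mapped) pairs."""
    dists = {}
    for length, mapped in reads:
        bin_start = (length // bin_size) * bin_size
        dists.setdefault(classify(length, mapped), Counter())[bin_start] += 1
    return dists

# Made-up example reads: (length in bp, mapped?)
reads = [(240, True), (60, True), (45, False), (230, True), (70, False)]
dists = length_distributions(reads)
for read_class, hist in sorted(dists.items()):
    print(read_class, sorted(hist.items()))
```

A large share of short or unmapped reads would be one indication that an additional length separation step is worth trying.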
Can the length of a homopolymer be assessed using the Q score?
Yes, when homopolymer length < 6bp
Example: Homopolymers
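One way to look at the relation between Q score and homopolymer length is to average the Q values of all bases in runs of a given length. The sketch below is only an illustration of that idea under assumed inputs (sequence string plus a list of Q values per read); it is not the method used in the presentation.

```python
from collections import defaultdict
from itertools import groupby

def homopolymer_runs(seq, quals):
    """Yield (base, run_length, Q values of the run) for each run of identical bases."""
    pos = 0
    for base, group in groupby(seq):
        n = len(list(group))
        yield base, n, quals[pos:pos + n]
        pos += n

def mean_q_by_run_length(reads):
    """Average Q value over all bases that sit in runs of the same length."""
    totals = defaultdict(lambda: [0, 0])  # run length -> [sum of Q, number of bases]
    for seq, quals in reads:
        for _base, n, q in homopolymer_runs(seq, quals):
            totals[n][0] += sum(q)
            totals[n][1] += len(q)
    return {n: s / c for n, (s, c) in sorted(totals.items())}

# Made-up example reads: (sequence, Q values)
reads = [("AAAC", [30, 25, 20, 35]), ("GTT", [40, 30, 20])]
print(mean_q_by_run_length(reads))
```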
Fast assessment of the quality of a run
Example: Q assessment
Figure: two plots of Q value ~ position (x-axis: position in the read, y-axis: Q value), labelled ‘Lab work OK’ and ‘Errors in lab work’
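A Q value ~ position profile of this kind can be computed as the mean Q value at each read position. The sketch below assumes each read is available as a list of Q values; the example numbers are made up.

```python
def mean_q_per_position(quality_lists):
    """Mean Q value at each read position, over the reads long enough to reach it."""
    sums, counts = [], []
    for quals in quality_lists:
        for i, q in enumerate(quals):
            if i == len(sums):  # first read that reaches this position
                sums.append(0)
                counts.append(0)
            sums[i] += q
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

# Made-up Q values for three reads of different length
quals = [[40, 38, 30], [36, 34], [38, 30, 20, 10]]
profile = mean_q_per_position(quals)
print(profile)
```

Comparing such profiles between runs gives a fast first impression of run quality before any mapping is done.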
Biobix – UGent: Wim Van Criekinge, Tim De Meyer, Geert Trooskens, Tom Vandekerkhove, Leander Van Neste, Gerben Mensschaert
CMG – UZ Gent: Jo Vandesompele, Jan Hellemans, Filip Pattyn, Steve Lefever, Kim Deleeneer, Jean-Pierre Renard
NXT-GNT: Paul Coucke, Sofie Bekaert, Filip Van Nieuwerburgh, Dieter Deforce, Wim Van Criekinge, Jo Vandesompele
Acknowledgements
Questions?