Upload
miguel-galves
View
166
Download
2
Tags:
Embed Size (px)
DESCRIPTION
New Strategy to detect SNPs
Citation preview
New Strategy to detect SNPs
Miguel Galves
José Augusto Quitzau
Zanoni Dias
Scylla Bioinformatics –Brazil{miguel,jquitzau,zanoni}@scylla.com.br
Agenda
Introduction HIV Dataset Detection Strategy Trimming Procedure Base-Calling Strategies Filter Algorithm Consensus Algorithm Tests Protocol Results Discussion
Introduction
Polymorphism: set of base pair locus at which different alleles exists in individuals in some population– The second most frequent allele must appear in
at least 1% of the individuals SNP: polymorphism in a single base pair
position SNP discovery is very important to
understand complex diseases
HIV Dataset
HIV genetic sequences:– 1302 bp– Well-conserved region
35 batches from 35 individuals:– 6 PCR reads, with average size of 690bp– 1 validated sequence, with manually annotated
SNPs
HIV Reference Sequence
Detection Strategy: Survey
Trimming Procedure Base-Calling Correction SNPs Filter Batch Consensus Algorithm
Trimming Procedure
Low Quality Ends filtering Converts phred’s quality sequence to error
probability sequence: Q = -10 x log10(p)
Subtract 0.05 from all values (Q=13) Maximum Score Subsequence Algorithm
Base Calling: Area Ratio
The base calling is made in 5 Steps:1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the nearest peaks area
5. Calculation of the polymorphic/reference peak area
If the calculated ratio is above a certain threshold, the point is considered a polymorphism.
Base Calling: Area Delimitation
Base Calling: Peak Identification
Base Calling: Average Height Ratio
Almost the same steps:1. Chromatogram area delimitation2. Peak search3. Choice of the nearest peaks4. Calculation of the nearest peaks average height5. Calculation of the polymorphic/reference peak average
height.
Again, if the calculated ratio is above a certain threshold, the point is considered a polymorphism.
Base Calling: Peak Identification
Filter Algorithm
Analyzes each sequence Uses a window based algorithm to eliminate
adjacents SNPs– Window size: 11 bases– Empirical score system assigned to polymorphism
in the window
Consensus Algorithm
Rule-based algorithm– Empirical rules
Analyzes the whole cross section to define a consensus– Take account of nucleotide frequencies and
qualities
Do not create N symbols, nor tri-allelic polymorphisms.
Consensus Algorithm: Example
Sequence 1 A25 C30 C18 C30 A21
Sequence 2 A30 C25 C15 C25 A16
Sequence 3 - M18 A9 C30 -
Sequence 4 - - S12 G17 T18
Consensus A M S S W
Tests Protocol: Third Party Packages
Two external packages used to compare our results:– Polybayes: SNP detection tool based on Bayesian
Methods– Polyphred: SNP detection tool based on chromatogram
analysis
ACE file (contig and consensus) created for each batch using phrap
ACE file analyzed by Polyphred and Polybayes Results viewed with consed
Tests Protocol: Our strategy
Reads trimmed using Maximum Subsequence Algorithm
Base-calling analysis and correction using algorithms describe previously
SNP filtering Multiple alignment
– Reference sequence as anchor
Consensus creation
Third Party Results: Polybayes
Polybayes detected SNPs in only 2 batches out of 35
Batch Existing SNPs
Detected SNPs
Correct SNPs
False Positives
False Negatives
Batch 13 12 1 1 0 11
Batch 15 5 1 0 1 5
Third Party Results: Polyphred
Polyphred detected SNPs in only 4 batches out of 35
Batch Existing SNPs
Detected SNPs
Correct SNPs
False Positives
False Negatives
Batch 07 10 1 0 1 10
Batch 14 4 3 0 3 4
Batch 32 26 1 0 1 26
Batch 35 15 8 1 7 14
Trimming Results
Reads average size:– Before trimming: 690.15bp– After trimming: 374.74bp– Reduction of 45%
Reference sequence average base coverage– Before trimming: 2.69– After trimming: 1.77
Results: True Positive (%) x batch
Results: False Negative (%) x batch
Results: False Positive (%) x batch
Results: Summary
Polybayes Polyphred Area Avg. Height
Avg SD Avg SD Avg SD Avg SD
TP 0.3 1.4 0.2 1.1 75.4 19.2 52.6 21.5
FN 99.7 1.4 99.8 1.1 23.2 18.4 45.6 21.7
DP 0.0 0.0 0.0 0.0 1.4 4.3 1.8 4.0
FP 2.9 16.9 11.1 31.3 393.9 312.3 554.4 511.3
TP + FN + DP = 100%
Discussion
Polybayes and Polyphred need large sets of data to produces good results
Our algorithm produces quite satisfactory results taking into account data characteristics:
– Low average coverage– High amount of low quality bases– High amount of polymorphisms (virus DNA)
Area Ratio strategy produces better results than Average Height strategy
Future Work
Test the algorithms whith larger batches, whith higher average coverage, to improve consensus algorithm
Reproduce the experiments using genetic sequences of more conserved life forms, such as mammals
Acknowledgments