New Strategy to detect SNPs

Miguel Galves

José Augusto Quitzau

Zanoni Dias

Scylla Bioinformatics, Brazil
{miguel,jquitzau,zanoni}@scylla.com.br

Agenda

Introduction
HIV Dataset
Detection Strategy
– Trimming Procedure
– Base-Calling Strategies
– Filter Algorithm
– Consensus Algorithm
Tests Protocol
Results
Discussion

Introduction

Polymorphism: a set of base-pair loci at which different alleles exist among the individuals of a population
– The second most frequent allele must appear in at least 1% of the individuals
SNP: a polymorphism at a single base-pair position
SNP discovery is very important for understanding complex diseases

HIV Dataset

HIV genetic sequences:
– 1302 bp
– Well-conserved region
35 batches from 35 individuals, each with:
– 6 PCR reads, with an average size of 690 bp
– 1 validated sequence, with manually annotated SNPs
HIV reference sequence

Detection Strategy: Survey

Trimming Procedure
Base-Calling Correction
SNPs Filter
Batch Consensus Algorithm

Trimming Procedure

Filters low-quality ends
Converts phred's quality sequence to an error probability sequence: Q = -10 × log10(p)
Subtracts 0.05 (the error probability at Q = 13) from all values
Applies the Maximum Score Subsequence Algorithm
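The trimming steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes the sign convention in which each base scores 0.05 - p, so that bases better than the cutoff contribute positively to the maximum-score search.

```python
def phred_to_error(q):
    """Invert Q = -10 * log10(p) to recover the error probability p."""
    return 10 ** (-q / 10)


def trim(qualities, cutoff=0.05):
    """Return (start, end) of the maximum-score subsequence, end exclusive.

    Assumption: a base scores cutoff - p, so only bases whose error
    probability is below the cutoff add to the running score.
    """
    best_sum, best_range = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(qualities):
        run_sum += cutoff - phred_to_error(q)
        if run_sum <= 0:
            # A non-positive prefix can never help: restart after this base.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_range = run_sum, (run_start, i + 1)
    return best_range


quals = [4, 6, 8, 30, 35, 40, 32, 9, 5]
print(trim(quals))  # (3, 7): the low-quality ends are dropped
```

The scan runs in a single pass over the read, and a read with no base better than the cutoff yields the empty range (0, 0).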

Base Calling: Area Ratio

The base calling is made in 5 steps:
1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the nearest peaks' areas
5. Calculation of the polymorphic/reference peak area ratio
If the calculated ratio is above a certain threshold, the point is considered a polymorphism.
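A rough sketch of the five steps, under several assumptions: each trace is a plain list of signal samples, the delimited area is an index range [lo, hi), a peak is a simple local maximum, and a peak's area is the sum of samples between its neighbouring valleys. All function names and the threshold value are hypothetical.

```python
def find_peaks(trace, lo, hi):
    """Step 2: indices of local maxima inside the delimited area [lo, hi)."""
    return [i for i in range(max(lo, 1), min(hi, len(trace) - 1))
            if trace[i - 1] < trace[i] >= trace[i + 1]]


def nearest_peak(trace, lo, hi, center):
    """Step 3: the peak closest to the expected base position, or None."""
    peaks = find_peaks(trace, lo, hi)
    return min(peaks, key=lambda i: abs(i - center)) if peaks else None


def peak_area(trace, peak):
    """Step 4: sum of samples from the valley left of the peak to the valley right of it."""
    left = peak
    while left > 0 and trace[left - 1] < trace[left]:
        left -= 1
    right = peak
    while right < len(trace) - 1 and trace[right + 1] < trace[right]:
        right += 1
    return sum(trace[left:right + 1])


def area_ratio(ref_trace, alt_trace, lo, hi, center):
    """Step 5: polymorphic peak area divided by reference peak area."""
    ref_peak = nearest_peak(ref_trace, lo, hi, center)
    alt_peak = nearest_peak(alt_trace, lo, hi, center)
    if ref_peak is None or alt_peak is None:
        return 0.0
    ref_area = peak_area(ref_trace, ref_peak)
    return peak_area(alt_trace, alt_peak) / ref_area if ref_area else 0.0


ref = [0, 2, 10, 30, 10, 2, 0]   # trace of the reference base
alt = [0, 1, 5, 12, 5, 1, 0]     # trace of the candidate second base
ratio = area_ratio(ref, alt, lo=1, hi=6, center=3)
is_polymorphic = ratio > 0.3     # 0.3 is an illustrative threshold only
```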

Base Calling: Area Delimitation

Base Calling: Peak Identification

Base Calling: Average Height Ratio

Almost the same steps:
1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the nearest peaks' average heights
5. Calculation of the polymorphic/reference peak average height ratio
Again, if the calculated ratio is above a certain threshold, the point is considered a polymorphism.
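The last two steps of this variant might look like the sketch below. It assumes that "average height" means the mean signal over the samples of a peak and that the peak spans have already been located; both are interpretations, and the names are hypothetical.

```python
def average_height(trace, left, right):
    """Mean signal over one peak, from index left to index right inclusive."""
    samples = trace[left:right + 1]
    return sum(samples) / len(samples)


def height_ratio(ref_trace, alt_trace, ref_span, alt_span):
    """Polymorphic peak average height divided by reference peak average height."""
    ref_height = average_height(ref_trace, *ref_span)
    return average_height(alt_trace, *alt_span) / ref_height if ref_height else 0.0


ref = [0, 2, 10, 30, 10, 2, 0]
alt = [0, 0, 4, 12, 4, 0, 0]
ratio = height_ratio(ref, alt, ref_span=(1, 5), alt_span=(2, 4))
```

Under this reading the average normalises each peak by its width, so a narrow secondary peak scores relatively higher here (about 0.62) than it would on summed area (20/54, about 0.37).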

Base Calling: Peak Identification

Filter Algorithm

Analyzes each sequence
Uses a window-based algorithm to eliminate adjacent SNPs
– Window size: 11 bases
– Empirical score system assigned to the polymorphisms in the window
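The slides do not give the empirical score system, so the sketch below substitutes a simple stand-in: a candidate is scored by the number of other candidates inside the 11-base window centred on it, and is dropped when that count exceeds a limit. The function name, the scoring rule and the `max_neighbors` limit are all assumptions made for illustration.

```python
def filter_snps(candidates, window=11, max_neighbors=1):
    """Keep candidate SNP positions that are not part of a dense cluster.

    Stand-in scoring (not the authors' rules): count the other candidates
    within window // 2 bases on either side of each position.
    """
    half = window // 2
    positions = sorted(candidates)
    kept = []
    for pos in positions:
        neighbors = sum(1 for other in positions
                        if other != pos and abs(other - pos) <= half)
        if neighbors <= max_neighbors:
            kept.append(pos)
    return kept


print(filter_snps([10, 40, 42, 43, 45, 90, 94]))  # [10, 90, 94]
```

The cluster at positions 40 to 45 is removed as a whole, which matches the intent of discarding runs of adjacent calls that are more likely sequencing noise than real polymorphisms.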

Consensus Algorithm

Rule-based algorithm
– Empirical rules
Analyzes the whole cross section to define a consensus
– Takes into account nucleotide frequencies and qualities
Does not create N symbols or tri-allelic polymorphisms

Consensus Algorithm: Example

Sequence 1   A25   C30   C18   C30   A21
Sequence 2   A30   C25   C15   C25   A16
Sequence 3   -     M18   A9    C30   -
Sequence 4   -     -     S12   G17   T18
Consensus    A     M     S     S     W
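One way to read the example is that each cell is a base call followed by its quality, with "-" meaning the read does not cover the column. The sketch below is a simplified stand-in for the empirical rule set, which the slides do not list: it sums quality per nucleotide, keeps the best-supported base, and adds the runner-up when its support reaches a fraction of the leader's. The `min_fraction` value of 0.15 and the function name are assumptions; the value was chosen only so that the sketch agrees with the example above.

```python
# IUPAC codes for single bases and two-base ambiguities (no N, no three-base codes).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT"}
CODE = {frozenset(bases): code for code, bases in IUPAC.items()}


def consensus_column(calls, min_fraction=0.15):
    """calls: non-empty list of (symbol, quality) for the reads covering one column."""
    support = {}
    for symbol, quality in calls:
        # An ambiguity code lends its quality to both bases it stands for.
        for base in IUPAC[symbol]:
            support[base] = support.get(base, 0) + quality
    ranked = sorted(support, key=lambda b: (-support[b], b))
    chosen = {ranked[0]}
    # At most one extra base is added, so the result is never tri-allelic.
    if len(ranked) > 1 and support[ranked[1]] >= min_fraction * support[ranked[0]]:
        chosen.add(ranked[1])
    return CODE[frozenset(chosen)]


columns = [
    [("A", 25), ("A", 30)],
    [("C", 30), ("C", 25), ("M", 18)],
    [("C", 18), ("C", 15), ("A", 9), ("S", 12)],
    [("C", 30), ("C", 25), ("C", 30), ("G", 17)],
    [("A", 21), ("A", 16), ("T", 18)],
]
print("".join(consensus_column(col) for col in columns))  # AMSSW
```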

Tests Protocol: Third Party Packages

Two external packages used to compare our results:
– Polybayes: SNP detection tool based on Bayesian methods
– Polyphred: SNP detection tool based on chromatogram analysis
ACE file (contig and consensus) created for each batch using phrap
ACE file analyzed by Polyphred and Polybayes
Results viewed with consed

Tests Protocol: Our strategy

Reads trimmed using the Maximum Score Subsequence Algorithm
Base-calling analysis and correction using the algorithms described previously
SNP filtering
Multiple alignment
– Reference sequence as anchor
Consensus creation

Third Party Results: Polybayes

Polybayes detected SNPs in only 2 batches out of 35

Batch      Existing SNPs   Detected SNPs   Correct SNPs   False Positives   False Negatives
Batch 13   12              1               1              0                 11
Batch 15   5               1               0              1                 5

Third Party Results: Polyphred

Polyphred detected SNPs in only 4 batches out of 35

Batch      Existing SNPs   Detected SNPs   Correct SNPs   False Positives   False Negatives
Batch 07   10              1               0              1                 10
Batch 14   4               3               0              3                 4
Batch 32   26              1               0              1                 26
Batch 35   15              8               1              7                 14
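The columns of the table are related by two identities: detected = correct + false positives, and existing = correct + false negatives. The short check below confirms this for the Polyphred rows and computes the fraction of existing SNPs recovered per batch; the function names are illustrative, and the slides do not state how the per-batch percentages in the later charts are normalised.

```python
# Rows copied from the Polyphred table:
# batch -> (existing, detected, correct, false positives, false negatives)
ROWS = {
    "07": (10, 1, 0, 1, 10),
    "14": (4, 3, 0, 3, 4),
    "32": (26, 1, 0, 1, 26),
    "35": (15, 8, 1, 7, 14),
}


def is_consistent(existing, detected, correct, false_pos, false_neg):
    """True when the row satisfies both column identities."""
    return detected == correct + false_pos and existing == correct + false_neg


def recovered_fraction(existing, correct):
    """Share of the manually annotated SNPs that the tool found."""
    return correct / existing


for batch, row in ROWS.items():
    existing, _, correct, _, _ = row
    print(batch, is_consistent(*row), f"{recovered_fraction(existing, correct):.1%}")
```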

Trimming Results

Average read size:
– Before trimming: 690.15 bp
– After trimming: 374.74 bp
– Reduction of 45%
Average base coverage of the reference sequence:
– Before trimming: 2.69
– After trimming: 1.77

Results: True Positive (%) x batch

Results: False Negative (%) x batch

Results: False Positive (%) x batch

Results: Summary

      Polybayes       Polyphred       Area             Avg. Height
      Avg     SD      Avg     SD      Avg     SD       Avg     SD
TP    0.3     1.4     0.2     1.1     75.4    19.2     52.6    21.5
FN    99.7    1.4     99.8    1.1     23.2    18.4     45.6    21.7
DP    0.0     0.0     0.0     0.0     1.4     4.3      1.8     4.0
FP    2.9     16.9    11.1    31.3    393.9   312.3    554.4   511.3

TP + FN + DP = 100%

Discussion

Polybayes and Polyphred need large data sets to produce good results
Our algorithm produces quite satisfactory results, taking into account the characteristics of the data:
– Low average coverage
– High number of low-quality bases
– High number of polymorphisms (virus DNA)
The Area Ratio strategy produces better results than the Average Height strategy

Future Work

Test the algorithms with larger batches and higher average coverage, to improve the consensus algorithm
Reproduce the experiments using genetic sequences of more conserved life forms, such as mammals

Acknowledgments

Science