May 25, 2004 2004 GSU Biotech Symposium 1
Minimum PCR Primer Set Selection with Amplification Length and
Uniqueness Constraints
Ion Mandoiu
University of Connecticut
CS&E Department
Combinatorial Optimization Applications in Bioinformatics
• Fast-growing number of applications
– Dynamic Programming & Integer Programming in sequence alignment
– TSP and Euler paths in DNA sequencing
– Integer Programming in Haplotype inference
– Integer Programming & approximation algorithms for efficient pathogen identification (string barcoding)
– …
High-Throughput Assay Design
• New source of combinatorial problems
– Microarray probe selection
– Mask design for Affy arrays
– Universal tag arrays
– Self-assembling microarrays
– Quality control
– …
– This talk: Multiplex PCR primer set selection
• Optimization goals
– Improved speed
– High reliability
– Reduced COST
Primer Pair Selection Problem
[Figure: forward and reverse primers flanking the amplification locus, with the 5' and 3' strand ends labeled and the distance L marked]
• Given:
• Genomic sequence around amplification locus
• Primer length k
• Amplification length upper bound L
• Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperatures, secondary structure, cross hybridization, etc.)
Motivation for Primer Set Selection (1)
• Spotted microarray synthesis [Fernandes and Skiena’02]
– Need unique pair for each amplification product, but primers can be reused to minimize cost
– Potential to reduce #primers from O(n) to O(n^{1/2}) for n products
Motivation for Primer Set Selection (2)
• SNP Genotyping
– Thousands of SNPs that must be genotyped using hybridization-based methods (e.g., SBE)
– Selective PCR amplification needed to improve accuracy of detection steps (whole-genome amplification not appropriate)
– No need for unique amplification!
– Primer minimization is critical
• Fewer primers to buy
• Fewer multiplex PCR reactions
Primer Set Selection Problem
• Given:
• Genomic sequences around each amplification locus
• Primer length k
• Amplification length upper bound L
• Find:
• Minimum size set of primers S of length k such that, for each amplification locus, there are two primers in S hybridizing to the forward and reverse sequences within a distance of L of each other
• For some applications: S should contain a unique pair of primers amplifying each locus
Previous Work (1)
• [Pearson et al. 96][Linhart&Shamir’02][Souvenir et al.’03]
- Separately select forward and reverse primers
- To enforce bound of L on amplification length, select only primers that are within a distance of L/2 of the target SNP
• Ignores half of the feasible primer pairs
• Solution can increase by a factor of O(n) by ignoring them!
• Greedy set cover algorithm gives O(ln n) approximation factor for this formulation
• Cannot approximate better unless P=NP
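The greedy set cover step used in this formulation can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name, the dictionary layout, and the toy primer names are assumptions, and each candidate primer is assumed to be already restricted to positions within L/2 of the loci it serves.

```python
def greedy_set_cover(universe, candidate_sets):
    """Pick sets greedily until every element of `universe` is covered.

    `candidate_sets` maps a (hypothetical) primer name to the set of
    loci it can serve; ties are broken by dictionary order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Choose the primer that covers the most still-uncovered loci.
        best = max(candidate_sets, key=lambda p: len(candidate_sets[p] & uncovered))
        gain = candidate_sets[best] & uncovered
        if not gain:
            raise ValueError("some loci cannot be covered by any candidate")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy instance: five loci and four candidate forward primers.
forward = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {4, 5}, "p4": {1}}
print(greedy_set_cover({1, 2, 3, 4, 5}, forward))
```

In the separate-selection approach the same routine would be run once for forward primers and once for reverse primers.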
Previous Work (2)
• [Fernandes&Skiena’02] model primer selection as a minimum multicolored subgraph problem:
• Vertices of the graph correspond to candidate primers
• There is an edge of color i between primers u and v if they hybridize to the i-th forward and reverse sequences within a distance of L
• Goal is to find minimum size set of vertices inducing edges of all colors
• No non-trivial approximation factor known previously
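To make the minimum multicolored subgraph objective concrete, here is a brute-force sketch for toy instances. It is only an illustration of the problem definition: the function name and the (u, v, color) edge encoding are assumptions, and the search is exponential in the number of candidate primers.

```python
from itertools import combinations


def min_multicolored_subgraph(vertices, colored_edges, n_colors):
    """Exhaustively find a smallest vertex set inducing an edge of every color.

    `colored_edges` is a list of (u, v, color) triples with colors
    0..n_colors-1. Returns None if some color has no usable edge.
    """
    vertices = list(vertices)
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            # Colors of edges whose two endpoints are both selected.
            induced = {c for u, v, c in colored_edges if u in chosen and v in chosen}
            if len(induced) == n_colors:
                return chosen
    return None


# Three products (colors 0, 1, 2) amplified by reusing three primers.
edges = [("a", "b", 0), ("a", "c", 1), ("b", "c", 2), ("d", "e", 2)]
print(min_multicolored_subgraph("abcde", edges, 3))
```

The example shows the primer reuse that motivates the model: three products are covered by three primers instead of six.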
Selection w/o Uniqueness Constraints
• Can be seen as a “simultaneous set covering” problem:
- The ground set is partitioned into n disjoint sets, each with 2L elements
- The goal is to select a minimum number of sets (== primers) that cover at least half of the elements in each partition
• Naïve modifications of the greedy set cover algorithm do not work
• Key idea: use a potential function Φ as a measure of infeasibility: for a partial solution P, Φ(P) = minimum number of elements that still need to be covered
• Initially, Φ = nL
• For feasible solutions, Φ = 0
Potential-Function Driven Greedy
1. Select a primer that decreases the potential function by the largest amount (breaking ties arbitrarily)
2. Repeat until feasibility is achieved
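The two steps above can be sketched on the abstract "simultaneous set covering" view of the problem. This is a minimal sketch under assumptions: the potential is taken to be the total number of elements still missing before every partition is at least half covered, and the function names, data layout, and toy instance are all hypothetical.

```python
def potential(partitions, covered):
    """Phi: number of elements that must still be covered so that at
    least half of every partition is covered."""
    return sum(max(0, (len(part) + 1) // 2 - len(part & covered))
               for part in partitions)


def potential_greedy(partitions, primers):
    """Repeatedly pick the primer that lowers the potential the most.

    `primers` maps a primer name to the set of elements it covers;
    ties are broken by dictionary order.
    """
    covered, chosen = set(), []
    phi = potential(partitions, covered)
    while phi > 0:
        best = min(primers, key=lambda p: potential(partitions, covered | primers[p]))
        new_phi = potential(partitions, covered | primers[best])
        if new_phi == phi:
            raise ValueError("no primer reduces the potential: instance infeasible")
        chosen.append(best)
        covered |= primers[best]
        phi = new_phi
    return chosen


# Toy instance with n = 2 partitions of 2L = 4 elements each (L = 2).
partitions = [{"a1", "a2", "a3", "a4"}, {"b1", "b2", "b3", "b4"}]
primers = {"p": {"a1", "a2", "b1"}, "q": {"b2", "b3"},
           "r": {"a3", "b4"}, "s": {"a1"}}
print(potential(partitions, set()))            # initial potential, n * L
print(potential_greedy(partitions, primers))
```

Recomputing the potential from scratch for every candidate keeps the sketch short; an implementation meant for real SNP sets would update it incrementally.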
• Lemma: Each greedy selection reduces the potential Φ by a factor of at least (1-1/OPT)
• Theorem: The number of primers selected by the greedy algorithm is at most a factor of ln(nL) larger than the optimum
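The theorem follows from the lemma by a standard calculation, sketched here. The notation Φ_t (potential after t greedy selections) is introduced only for this derivation, and the sketch assumes nL ≥ 2 and that the potential takes integer values.

```latex
% \Phi_t: potential after t greedy selections (notation assumed here)
\begin{align*}
\Phi_0 &= nL, \qquad
\Phi_{t+1} \le \Bigl(1 - \tfrac{1}{\mathrm{OPT}}\Bigr)\Phi_t \\
\Phi_t &\le nL\Bigl(1 - \tfrac{1}{\mathrm{OPT}}\Bigr)^{t}
        < nL\, e^{-t/\mathrm{OPT}} \qquad (t \ge 1) \\
t &= \lceil \mathrm{OPT}\cdot\ln(nL) \rceil
   \;\Longrightarrow\; \Phi_t < 1
   \;\Longrightarrow\; \Phi_t = 0
\end{align*}
```

So the greedy algorithm stops after at most ⌈OPT · ln(nL)⌉ selections.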
Selection w/ Uniqueness Constraints
• Can be modeled as a minimum multicolored subgraph problem: add an edge of color i between two primers if they amplify the i-th SNP and do not amplify any other SNP
• Trivial approximation algorithm: select 2 primers for each SNP
• O(n^{1/2}) approximation, since every solution requires at least n^{1/2} primers
• Non-trivial approximation?
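The counting argument behind the trivial bound can be written out as follows, where s (an assumed symbol) is the number of primers in any feasible solution: s primers form at most s(s-1)/2 distinct pairs, and each of the n SNPs needs its own pair.

```latex
\begin{align*}
n \le \binom{s}{2} = \frac{s(s-1)}{2} < \frac{s^2}{2}
  &\;\Longrightarrow\; s > \sqrt{2n} \ge n^{1/2} \\
\frac{2n}{\mathrm{OPT}} < \frac{2n}{\sqrt{2n}} = \sqrt{2n}
  &= O(n^{1/2})
\end{align*}
```

The second line compares the 2n primers of the trivial solution with the lower bound on the optimum.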
Integer Program Formulation
• Variable x_u for every vertex (candidate primer) u
- x_u set to 1 if u is selected, and to 0 otherwise
• Variable y_e for every edge e
- y_e set to 1 if the corresponding primer pair is selected to amplify one of the SNPs
• Objective: minimize the sum of the x_u’s
• Constraints:
- for each i, sum of {y_e : e amplifying SNP i} ≥ 1
- y_e ≤ x_u for every e incident to u
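Written compactly, the integer program reads as below. The symbols V (set of candidate primers) and E_i (set of edges of color i, i.e. primer pairs amplifying SNP i) are notation assumed for this sketch.

```latex
\begin{align*}
\min\ & \sum_{u \in V} x_u \\
\text{s.t.}\ & \sum_{e \in E_i} y_e \ge 1 && i = 1, \dots, n \\
& y_e \le x_u && \text{for every edge } e \text{ and each endpoint } u \text{ of } e \\
& x_u,\ y_e \in \{0, 1\}
\end{align*}
```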
LP-Rounding Algorithm
1. Solve linear programming relaxation
2. Select node u with probability x_u
• Theorem: With probability of at least 1/3, the number of selected nodes is within a factor of O(m^{1/2} ln n) of the optimum, where m is the maximum number of edges sharing the same color.
• For primer selection, m ≤ L^2, so the approximation factor is O(L ln n)
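The rounding step alone can be sketched as follows, given a fractional solution x that is assumed to come from an LP solver (the solve step is not shown). The function name, the edge encoding, and the reporting of uncovered colors are assumptions. The slide does not say how an infeasible outcome is handled, and this sketch makes no claim to reproduce the guarantee in the theorem.

```python
import random


def round_once(x, colored_edges, n_colors, rng):
    """Select each node u independently with probability x[u].

    Returns the selected nodes and the colors (SNPs) that no selected
    pair covers. `colored_edges` holds (u, v, color) triples.
    """
    selected = {u for u, xu in x.items() if rng.random() < xu}
    covered = {c for u, v, c in colored_edges if u in selected and v in selected}
    missing = set(range(n_colors)) - covered
    return selected, missing


# Hypothetical fractional solution and a two-color toy graph.
x = {"a": 1.0, "b": 1.0, "c": 0.0}
edges = [("a", "b", 0), ("b", "c", 1)]
selected, missing = round_once(x, edges, 2, random.Random(0))
print(selected, missing)
```

Passing the random generator as a parameter keeps runs reproducible when the rounding is repeated.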
Experimental Setting
• SNP sets extracted from NCBI databases + randomly generated
• C/C++ code run on a 2.8GHz Dell PowerEdge running Linux
• Compared algorithms
• G-FIX: greedy primer cover algorithm of Pearson et al.
- Primers restricted to be within L/2 of amplified SNPs
• G-VAR: naïve modification of G-FIX
- For each SNP, first selected primer can be L bases away from SNP
- If the first selected primer is L_1 bases away from the SNP, the opposite sequence is truncated to a length of L − L_1
• G-POT: potential function driven greedy algorithm
• MIPS-PT: iterative beam-search heuristic of Souvenir et al. (WABI’03)
Conclusions
• New combinatorial optimization problems arising in the area of high-throughput assay design
• Theoretical insights (such as approximation results) give algorithms with significant practical improvements
• Choosing the proper problem model is critical to solution efficiency
Ongoing Work & Open Problems
• Allow degenerate primers
• Incorporate more biochemical constraints into the model (melting temperature, secondary structure, cross hybridization, etc.)
• Close the gap between the O(ln n) inapproximability bound and the O(L ln n) approximation factor for the minimum multicolored subgraph problem
• Approximation algorithms for partition into multiple multiplexed PCR reactions (Aumann et al. WABI’03)