Effective Heuristics for NP-Hard Problems Arising in Molecular Biology Richard M. Karp Bangalore, January 5, 2011


Page 1: Effective Heuristics for NP-Hard Problems Arising in Molecular Biology

Effective Heuristics for NP-Hard Problems Arising in Molecular Biology

Richard M. Karp

Bangalore, January 5, 2011


NP-Hard Problems

• The P vs. NP problem: Is finding a solution to a combinatorial search problem as easy as checking a solution? The answer is expected to be “No.”

• NP-Hard Problems: Solvable in polynomial time only if P=NP.

• General belief: Solving an NP-hard problem requires worst-case exponential time.


Understanding NP-Hard Problems Through Worst-Case Analysis

• Exact solution methods: exponential running time in worst case.

• Polynomial-time approximation algorithms for optimization problems, yielding a worst-case upper bound on the ratio between the cost of an approximate solution and the cost of an optimal solution. Unfortunately, these guaranteed approximation ratios are unrealistically high.


Probabilistic Analysis and Heuristics

• In probabilistic analysis, problem instances are drawn from simple probability distributions. Often one can prove excellent performance on the average. However, the probability distributions may not correspond to real-life instances.

• Heuristics are typically evaluated empirically on examples drawn from, or representative of, real-life instances. Heuristics are often “unreasonably effective,” for reasons not well understood.


Famous Unreasonably Effective Heuristics

• Large traveling-salesman problems can be solved by quick tour construction methods, local improvement methods or cutting plane methods.

• Local improvement methods find near-optimal solutions to graph bisection problems.

• Huge satisfiability problems are routinely solved rapidly by branch-and-bound methods.

• The greedy set cover algorithm typically gives solutions within a few percent of optimal.
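
As an illustration of the last bullet, here is a minimal sketch of the greedy set cover rule: repeatedly pick the subset that covers the most still-uncovered elements. The function name `greedy_set_cover` and the small instance are assumptions made for this example and do not come from the talk.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Index of the subset that covers the most uncovered elements
        # (ties go to the lowest index).
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        if not uncovered & subsets[best]:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Hypothetical instance: cover {1, ..., 7} with five subsets.
universe = range(1, 8)
subsets = [{1, 2, 3, 4}, {4, 5, 6}, {6, 7}, {1, 5, 7}, {2, 3}]
cover = greedy_set_cover(universe, subsets)  # indices of the chosen subsets
```

On this instance the greedy rule picks subsets 0, 1 and 2, which together cover all seven elements.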


NP-Hard Problems Arising in Molecular Biology and Genetics

• Genome sequencing

• Global alignment of multiple genomes

• Identifying siblings, cousins, second cousins, etc. through comparison of genomes

• Finding protein modules containing specified types of proteins

• Computational discovery of dysregulated pathways in human diseases


Patterns of Inheritance

• In each region of the genome, each individual has two haplotypes, one inherited from each parent. A haplotype is a sequence of alleles.

• The haplotype inherited from a parent is a mosaic of segments inherited from the parent’s two haplotypes. Recombination occurs at the boundaries between segments.

• In a pedigree graph the vertices are individuals and the edges represent parent-child relations.


Reconstructing Pedigrees

• Given the haplotypes of individuals in the current generation, we wish to reconstruct the pedigree that gave rise to that generation and chart the flow of alleles.


Assumptions of a Generative Model

• Monogamy

• Layered structure: each individual and its mate lie in generation g, have parents in generation g-1, and children in generation g+1. Generation 1 is the founding generation.

• The number of children of each couple is drawn from a Poisson distribution with mean 2.

• In each haplotype, sites of recombination occur according to a Poisson process with known rate.
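
The two random ingredients of the model can be sketched with the standard library alone. This is an illustration under stated assumptions, not the speaker's simulator: haplotypes are modeled as equal-length lists of alleles, positions are measured in loci, and the names `sample_poisson`, `sample_recombination_sites` and `inherit`, as well as the rate 0.1, are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_recombination_sites(length, rate, rng):
    """Points of a Poisson process on [0, length): gaps are exponential with the given rate."""
    sites = []
    position = rng.expovariate(rate)
    while position < length:
        sites.append(position)
        position += rng.expovariate(rate)
    return sites


def inherit(hap_a, hap_b, rate, rng):
    """Build a child haplotype as a mosaic of one parent's two haplotypes."""
    assert len(hap_a) == len(hap_b)
    sites = sample_recombination_sites(len(hap_a), rate, rng)
    source = rng.choice([0, 1])  # which parental haplotype the mosaic starts on
    child, next_site = [], 0
    for locus in range(len(hap_a)):
        # Switch source once for every recombination site passed before this locus.
        while next_site < len(sites) and sites[next_site] <= locus:
            source = 1 - source
            next_site += 1
        child.append((hap_a, hap_b)[source][locus])
    return child


rng = random.Random(0)
num_children = sample_poisson(2.0, rng)  # children of one couple, mean 2
child = inherit(["A"] * 20, ["B"] * 20, rate=0.1, rng=rng)
```

Each run of identical letters in `child` is one segment of the mosaic; the boundaries between runs are the sampled recombination sites.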


Working Backwards

• We construct the pedigree generation by generation, working backwards from the current generation.

• It suffices to determine, in each generation, which individuals are siblings.

• Two alleles are identical by descent (IBD) if they are inherited from the same allele in the founding generation.

• To test whether two individuals in generation g are likely to be siblings, we observe the amount of IBD between their descendants in the current generation.


Inferring Siblinghood

• Problem: determine which individuals in generation g are siblings.

• Using IBD, we construct a compatibility graph with a vertex for each individual in generation g, and edges indicating pairs of individuals that are likely to be siblings on the basis of the IBD of their descendants.

• Problem: Infer the siblinghood graph from the compatibility graph.


Inferring Siblinghood

• Because of the monogamy assumption, the siblinghood graph must be a union of cliques.

• Problem: Given a compatibility graph C, determine the “closest” siblinghood graph S.

• The algorithm maintains a partition of the vertices of C. The parts of the partition are called quasi-cliques. The score of a partition is A times the number of edges of C whose end points lie in the same quasi-clique, minus the number of non-edges of C whose end points lie in the same quasi-clique. We seek a partition of maximum score.
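
The score just defined can be computed directly by looking at every pair of vertices that share a quasi-clique. The sketch below assumes the compatibility graph is given as a list of edges and the partition as a list of vertex sets; the name `partition_score` and the toy graph are illustrative only.

```python
from itertools import combinations


def partition_score(edges, parts, A):
    """A * (edges of C inside parts) - (non-edges of C inside parts)."""
    edge_set = {frozenset(e) for e in edges}
    score = 0
    for part in parts:
        for u, v in combinations(part, 2):
            # Each pair inside a quasi-clique earns A if it is an edge, costs 1 otherwise.
            score += A if frozenset((u, v)) in edge_set else -1
    return score


# Toy compatibility graph: a triangle {1, 2, 3} joined to an edge {4, 5} by the edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
good = [{1, 2, 3}, {4, 5}]
merged = [{1, 2, 3, 4, 5}]
score_good = partition_score(edges, good, A=2)      # 4 edges, 0 non-edges -> 8
score_merged = partition_score(edges, merged, A=2)  # 5 edges, 5 non-edges -> 5
```

With A = 2 the two-part partition scores higher than the merged one, because merging pays for five non-edges to gain a single extra edge.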


Justifying the Scoring Function

• Assumptions: The compatibility graph C is obtained by randomly perturbing the siblinghood graph S. S is a random union of disjoint cliques with sizes uniformly distributed between 1 and a parameter t.

• If u is adjacent to v in S then u is adjacent to v in C with probability p; if u is not adjacent to v in S then u is adjacent to v in C with probability q, where q < p.

• Under these assumptions maximizing the score produces a siblinghood graph of maximum conditional probability given C.
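
One way to see why a score of this form arises is to write out the likelihood of C given S. The sketch below is not taken from the talk. It assumes that pairs are perturbed independently, that 0 < q < p < 1, and it looks only at the likelihood, leaving the prior over siblinghood graphs aside. The symbols E_in and N_in (edges and non-edges of C inside the cliques of S) and E_out and N_out (edges and non-edges of C between cliques) are notation introduced here, and the resulting value of A is a consequence of these assumptions rather than a value stated by the speaker.

```latex
\begin{align*}
\Pr[C \mid S] &= p^{E_{\mathrm{in}}} (1-p)^{N_{\mathrm{in}}} \, q^{E_{\mathrm{out}}} (1-q)^{N_{\mathrm{out}}} \\
\log \Pr[C \mid S] &= \text{const} + E_{\mathrm{in}} \log\frac{p}{q} - N_{\mathrm{in}} \log\frac{1-q}{1-p} \\
A &= \frac{\log (p/q)}{\log \bigl( (1-q)/(1-p) \bigr)}
\end{align*}
```

The constant collects the terms that depend only on C, since E_in + E_out and N_in + N_out are the total numbers of edges and non-edges of C. Because q < p, both logarithms are positive, so dividing by the second one preserves the maximizer and gives a score of the form A × E_in − N_in.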


Heuristic Algorithm

• The heuristic algorithm creates an initial partition by greedily constructing disjoint quasi-cliques. It then performs the following local operations to improve the score:

  • Move a vertex

  • Extract a vertex

  • Split a quasi-clique

  • Merge two quasi-cliques

  • Restructure two quasi-cliques adjacent to a vertex v

  • Dynamic programming: given a chain of quasi-cliques, make an optimal simultaneous move of a small set of vertices from each quasi-clique in the chain to its successor quasi-clique.
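
To make the simplest of these operations concrete, the sketch below implements only "move a vertex": it repeatedly moves a vertex to the quasi-clique where it gains the most score, until no single move helps. It is a simplified stand-in and not the algorithm from the talk; the other operations are omitted, and the names `move_gain` and `improve_by_moves`, the adjacency-dictionary input and the starting partition are assumptions of this example.

```python
def move_gain(adj, part_of, parts, v, target, A):
    """Change in score if v leaves its current part and joins parts[target]."""
    def contribution(members):
        # Score of the pairs (v, u) for u in members: +A per edge, -1 per non-edge.
        return sum(A if u in adj[v] else -1 for u in members if u != v)
    return contribution(parts[target]) - contribution(parts[part_of[v]])


def improve_by_moves(adj, parts, A):
    """Apply single-vertex moves until none increases the score."""
    parts = [set(p) for p in parts]
    part_of = {v: i for i, p in enumerate(parts) for v in p}
    improved = True
    while improved:
        improved = False
        for v in adj:
            source = part_of[v]
            gains = {t: move_gain(adj, part_of, parts, v, t, A)
                     for t in range(len(parts)) if t != source}
            if not gains:
                continue
            target = max(gains, key=gains.get)
            if gains[target] > 0:  # only strictly improving moves, so the loop terminates
                parts[source].remove(v)
                parts[target].add(v)
                part_of[v] = target
                improved = True
    return [p for p in parts if p]


# Toy compatibility graph: triangle {1, 2, 3}, edge {4, 5}, linked by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
result = improve_by_moves(adj, [{1, 2}, {3, 4, 5}], A=2)
```

Starting from the deliberately wrong partition {1, 2}, {3, 4, 5}, a single move of vertex 3 recovers {1, 2, 3}, {4, 5}. A local search restricted to moves can get stuck, which is presumably why the talk lists the richer operations as well.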


Performance of the Algorithm

Typically the algorithm produces a partition with a slightly higher score than the “true” partition from which the compatibility graph was generated by perturbation. However, the fraction of vertices placed in the “correct” partition lies between 93% and 98%, depending on the fraction of edges deleted from cliques and the fraction of edges added between cliques in creating the compatibility graph C from the siblinghood graph S.


The Colorful Subgraph Problem

• Input: A graph G and an assignment of a color to each vertex.

• Find, if one exists, a connected subgraph H containing exactly one vertex of each color.

• Optimization version: Minimize a × (number of extra vertices) + b × (number of omitted colors).

• The problem is NP-hard, even on planar graphs.
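
A candidate solution can be checked and scored in a few lines. The sketch below rests on one interpretation that the slide does not spell out: "extra vertices" is taken to mean vertices beyond one per color present in the subgraph. The names `is_connected` and `objective` and the small path graph are hypothetical.

```python
from collections import deque


def is_connected(adj, vertices):
    """Breadth-first search restricted to the chosen vertex set."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in vertices and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == vertices


def objective(adj, color, vertices, a, b):
    """a * (extra vertices) + b * (omitted colors), or None if the set is not connected."""
    if not is_connected(adj, vertices):
        return None
    used = {color[v] for v in vertices}
    extra = len(set(vertices)) - len(used)          # assumed meaning of "extra vertices"
    omitted = len(set(color.values()) - used)
    return a * extra + b * omitted


# Path 1 - 2 - 3 - 4 with colors red, green, red, blue.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
color = {1: "red", 2: "green", 3: "red", 4: "blue"}
exact = objective(adj, color, {2, 3, 4}, a=1, b=5)        # one vertex per color
with_extra = objective(adj, color, {1, 2, 3, 4}, a=1, b=5)  # red appears twice
missing = objective(adj, color, {1, 2}, a=1, b=5)         # blue is omitted
```

Here `{2, 3, 4}` is a colorful connected subgraph with objective 0, while the other two candidates pay a or b respectively.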


Interpretation

• In the protein-protein interaction (PPI) graph of a species the vertices represent proteins and the edges represent pairs of physically interacting proteins.

• Given a connected set X of proteins performing a regulatory function in species A, we seek a similar connected set of proteins in species B. The color of each protein in species B indicates its similarity to a particular protein in X.


Dynamic Programming

• For each vertex v and set of colors S, determine whether there is a tree that contains vertex v, exactly one vertex of each color in S, and no vertices of any other color. The computation is recursive, running through sets S in order of increasing cardinality. The running time is of order n·3^k, where n is the number of vertices and k is the number of colors.
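
The recursion can be sketched with color sets encoded as bitmasks. This is one standard way to organize such a dynamic program and is offered as an assumption, not as the speaker's implementation: a tree for (v, S) is formed by joining a tree for (v, S1) to a tree for (u, S2), where u is a neighbor of v and S1, S2 split S. The function name `colorful_tree_exists` and the input format (adjacency dictionary plus color dictionary) are hypothetical.

```python
def colorful_tree_exists(adj, color):
    """Return True if some tree uses exactly one vertex of every color.

    adj maps each vertex to its neighbors; color maps each vertex to its color.
    """
    colors = sorted(set(color.values()))
    bit = {c: 1 << i for i, c in enumerate(colors)}
    full = (1 << len(colors)) - 1
    # reach[v] holds the color sets S (as bitmasks) for which some tree contains v,
    # exactly one vertex of each color in S, and no vertex of any other color.
    reach = {v: {bit[color[v]]} for v in adj}
    # Increasing numeric order handles every proper subset of a mask before the mask.
    for mask in range(1, full + 1):
        for v in adj:
            own = bit[color[v]]
            if not mask & own or mask == own:
                continue
            # Split mask into `sub` (the part kept at v) and `rest` (hung off a neighbor).
            sub = (mask - 1) & mask
            while sub:
                rest = mask ^ sub
                if sub & own and sub in reach[v] and any(rest in reach[u] for u in adj[v]):
                    reach[v].add(mask)
                    break
                sub = (sub - 1) & mask
    return any(full in reach[v] for v in adj)


# Path 1 - 2 - 3 - 4. With colors r, g, r, b the tree {2, 3, 4} is colorful.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
found = colorful_tree_exists(path, {1: "r", 2: "g", 3: "r", 4: "b"})
# With colors r, g, g, b any connected set joining r and b contains both g vertices.
not_found = colorful_tree_exists(path, {1: "r", 2: "g", 3: "g", 4: "b"})
```

The inner loop enumerates all ways to split a color set into two parts, which is the source of the 3^k factor in the running time.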


Integer Programming plus Constraint Generation

• We may assume that the desired connected subgraph is a tree T

• Variables: x(i) = 1 iff vertex i is included in T; y(e) = 1 iff edge e is included in T

• Constraints:

  Exactly one vertex of each color is included;

  Exactly k-1 edges are included in T, where k is the number of colors;

  If an edge is included then its endpoints are included;

  For each set of colors X, the number of edges of T connecting two vertices whose colors lie in X is at most |X| - 1.
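
Written out as a 0-1 program, the constraints might look as follows. The notation is introduced here for illustration: k is the number of colors (a tree with one vertex per color has k - 1 edges), V_c is the set of vertices of color c, c(i) is the color of vertex i, and E is the edge set of the graph.

```latex
\begin{align*}
\sum_{i \in V_c} x_i &= 1 && \text{for each color } c \\
\sum_{e \in E} y_e &= k - 1 \\
y_e &\le x_i && \text{for each edge } e \text{ and each endpoint } i \text{ of } e \\
\sum_{e = \{i,j\} \in E:\ c(i),\, c(j) \in X} y_e &\le |X| - 1 && \text{for each set of colors } X \\
x_i,\ y_e &\in \{0, 1\}
\end{align*}
```

The last family has one constraint per set of colors, so there are exponentially many of them. This is presumably where constraint generation comes in: start without them and add a violated constraint only when the current solution contains a cycle.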


Performance

• An implementation of integer programming plus constraint generation solves typical instances with 100 vertices in less than a minute.

• Using a heuristic not yet implemented, one can solve typical instances with 100 vertices by hand in 20-30 minutes.


Heuristic Algorithmic Strategy

• Repeat:

  (1) Delete vertices with frequent colors;

  (2) In the remaining graph, select a minimal set of connected components covering all infrequent colors;

  (3) Insert a minimal set of vertices with frequent colors to restore connectedness and cover all colors.
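
Step (1) and the preparation for step (2) can be sketched as follows. The slide does not say when a color counts as frequent, so the threshold `max_count` is an assumption, as are the function name `infrequent_components` and the input format. Selecting the components in step (2) and the reinsertion in step (3) are not shown.

```python
from collections import Counter


def infrequent_components(adj, color, max_count):
    """Delete vertices whose color occurs more than max_count times, then return
    the connected components of what remains, each with the colors it covers."""
    counts = Counter(color.values())
    kept = {v for v in adj if counts[color[v]] <= max_count}
    components, seen = [], set()
    for start in adj:
        if start not in kept or start in seen:
            continue
        # Depth-first search inside the surviving vertices.
        stack, members = [start], {start}
        seen.add(start)
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w in kept and w not in seen:
                    seen.add(w)
                    members.add(w)
                    stack.append(w)
        components.append((members, {color[v] for v in members}))
    return components


# Path 1 - 2 - 3 - 4 - 5 with colors r, g, r, b, r: red is the only frequent color.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
color = {1: "r", 2: "g", 3: "r", 4: "b", 5: "r"}
components = infrequent_components(adj, color, max_count=1)
```

Deleting the three red vertices leaves two single-vertex components, one covering green and one covering blue; step (3) would then reinsert a red vertex such as 3 to reconnect them.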


Example on Grid

[Figure: grid of letters. Letters as extracted: W E L C O M E T O T H E W E B S I T E O F T H E A N N U A L S Y M P O S I U M O N C O M B I N A T O R I A L P A T T E R N M A T]


After One Iteration

[Figure: the grid after one iteration. Letters as extracted: W e B S F t H N N U a L S Y P S I U N C o m B I N R I L P R N]


After Two Iterations

[Figure: the grid after two iterations. Letters as extracted: W E B F T H n A l s Y P I u C O M B]