Computational Molecular Biology

Group Testing – Pooling Designs

My T. [email protected]


Group Testing (GT)

Definition: Given n items with at most d positive ones, identify all positive ones using the minimum number of tests.

Each test is on a subset of items.

Positive test outcome: there exists a positive item in the subset.



An Idea of GT

[Figure: a pool that contains a positive item (+) gives a positive outcome; a pool that contains only negative items (_) gives a negative outcome.]



Example 1 – Sequential Method

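[Figure: sequential testing of items 1 to 9, in which each test is chosen based on earlier outcomes and the candidate set shrinks step by step (1 2 3 4 5, then 4 5).]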



Example 2 – Non-adaptive Method

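        p4  p5  p6
p1       1   2   3
p2       4   5   6
p3       7   8   9

Items 1 to 9 are arranged in a grid: pools p1, p2, p3 are the rows and pools p4, p5, p6 are the columns.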

Non-adaptive group testing is called pooling design in biology
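The grid design above can be sketched in a few lines of Python. This is only an illustration under the assumption that exactly one item is positive; the names `screen_pools` and `decode_grid` and the choice of item 5 as the positive are hypothetical, not part of the slides.

```python
# Items 1..9 are laid out in a 3 x 3 grid; p1..p3 are the rows, p4..p6 the columns.
ROW_POOLS = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
COL_POOLS = [{1, 4, 7}, {2, 5, 8}, {3, 6, 9}]


def screen_pools(pools, positives):
    """Simulate one parallel round: a pool is positive iff it contains a positive item."""
    return [bool(pool & positives) for pool in pools]


def decode_grid(row_outcomes, col_outcomes):
    """Return the item in the positive row pool and the positive column pool.

    Assumes exactly one positive item; with several positives the grid
    outcome can be ambiguous.
    """
    row = ROW_POOLS[row_outcomes.index(True)]
    col = COL_POOLS[col_outcomes.index(True)]
    (item,) = row & col  # a row and a column share exactly one item
    return item


positives = {5}  # hypothetical ground truth, unknown to the decoder
rows = screen_pools(ROW_POOLS, positives)
cols = screen_pools(COL_POOLS, positives)
found = decode_grid(rows, cols)
```

All six pools are fixed before any outcome is known, which is what makes the design non-adaptive.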



Sequential and Non-adaptive

Sequential GT needs fewer tests but takes longer, because the tests are run one after another.

Non-adaptive GT needs more tests but takes less time, because all tests can be run in parallel.

In molecular biology, non-adaptive GT is usually preferred. Why?



Because…

The same library is screened with many different probes. Preparing a pool for the first time is expensive, but once a pool is prepared, it can be screened many times with different probes.

Screening one pool at a time is expensive. Screening pools in parallel with the same probe is cheaper.

There are constraints on pool sizes. If a pool contains too many different clones, a positive pool can become too dilute and be mislabeled as a negative pool.



Pooling Designs

Problem Definition: Given a set of n clones with at most d positive clones, identify all positive clones with the minimum number of tests.

Pool: a subset of clones

Positive pool: a pool that contains at least one positive clone

Clones = Items



Relation to Pooling Designs

A pooling design is represented by a t × n binary matrix M whose rows are the pools p1, p2, …, pt and whose columns are the clones c1, c2, …, cn:

M[i, j] = 1 iff the ith pool contains the jth clone

Testing the t pools gives a t × 1 outcome vector V, where V[i] = 1 iff the ith pool is positive.

Decoding Algorithm: Given M and V, identify all positive clones
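To make the relation between M and V concrete, the outcome vector can be computed by OR-ing the columns of the positive clones. The sketch below is illustrative only: the function name `outcome_vector`, the small 3 × 4 matrix, and the 0-based indexing are assumptions, not taken from the slides.

```python
def outcome_vector(M, positives):
    """Return V, where V[i] = 1 iff pool i contains at least one positive clone.

    M is a list of t rows with n 0/1 entries each; positives is a set of
    0-based column indices.
    """
    return [int(any(row[j] for j in positives)) for row in M]


# Hypothetical 3 x 4 design: rows are pools, columns are clones.
M = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
V = outcome_vector(M, {1})  # the second clone is the only positive
```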



Observation

Rows are pools and columns are clones:

        c1 c2 c3 c4 c5 c6 c7 c8 c9
p1       1  1  1  0  0  0  0  0  0
p2       0  0  0  1  1  1  0  0  0
p3       0  0  0  0  0  0  1  1  1
p4       1  0  0  1  0  0  1  0  0
p5       0  1  0  0  1  0  0  1  0
p6       0  0  1  0  0  1  0  0  1

Observation: All columns are distinct.

To identify up to d positives, all unions of up to d columns should be distinct!

Union of d columns: Boolean sum of these d columns
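The distinct-unions requirement can be checked by brute force on small matrices. The following sketch uses the 6 × 9 matrix of the 3 × 3 grid design; the helper names `column_union` and `unions_up_to_d_distinct` are hypothetical, and the exhaustive search is only practical for small n and d.

```python
from itertools import combinations

# The 6 x 9 matrix of the 3 x 3 grid design: rows are pools, columns are clones.
M = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
]


def column_union(M, cols):
    """Boolean sum (component-wise OR) of the chosen columns, as a tuple."""
    return tuple(int(any(row[j] for j in cols)) for row in M)


def unions_up_to_d_distinct(M, d):
    """Brute-force check that all unions of 1..d columns are pairwise distinct."""
    n = len(M[0])
    seen = set()
    for size in range(1, d + 1):
        for cols in combinations(range(n), size):
            u = column_union(M, cols)
            if u in seen:
                return False
            seen.add(u)
    return True


ok_for_one = unions_up_to_d_distinct(M, 1)  # True: all columns are distinct
ok_for_two = unions_up_to_d_distinct(M, 2)  # False: two pairs share a union
```

The grid fails for d = 2 because clones {c1, c5} and {c2, c4} both make exactly the pools p1, p2, p4, p5 positive, so the two cases cannot be told apart.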



Challenges

Challenge 1: How to construct the binary matrix M such that the unions of any two distinct sets of at most d columns are distinct

Challenge 2: How to design a decoding algorithm with efficient time complexity [O(tn)]



d-separable Matrix

[Figure: a t × n binary matrix with rows p1, …, pt (pools) and columns c1, …, cn (clones).]

All unions of d columns are distinct.



d-separable Matrix

All unions of up to d columns are distinct.

Decoding: O(n^d)



d-disjunct Matrix

Definition: A t × n binary matrix M is a d-disjunct matrix (d < t) if the union of any d columns does not contain any other column.

Example: a 2-disjunct matrix

      1 0 0
M =   0 1 0
      0 0 1
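The definition can be checked directly on small matrices. The sketch below is a brute-force test, not an efficient algorithm; the function name `is_d_disjunct` is hypothetical, and each column is represented by the set of rows in which it has a 1.

```python
from itertools import combinations


def is_d_disjunct(M, d):
    """Brute-force check: no column is contained in the union of any d other columns."""
    t, n = len(M), len(M[0])
    # Represent each column by the set of rows (pools) in which it has a 1.
    supports = [{i for i in range(t) if M[i][j]} for j in range(n)]
    for j in range(n):
        others = [c for c in range(n) if c != j]
        for group in combinations(others, d):
            union = set().union(*(supports[c] for c in group))
            if supports[j] <= union:
                return False
    return True


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
identity_is_2_disjunct = is_d_disjunct(identity, 2)

# Counterexample: the first column is contained in the second one.
nested = [[1, 1], [0, 1]]
nested_is_1_disjunct = is_d_disjunct(nested, 1)
```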



d-disjunct Matrix (cont)

A d-disjunct matrix can efficiently identify up to d positive clones. Why?

Theorem 1: In a d-disjunct matrix, the unions of any two distinct sets of d columns are distinct (thus d-disjunct implies d-separable).

Theorem 2: The number of clones not in negative pools is always at most d.

Corollary 1: The tests with negative outcomes determine all negative clones.

Decoding time complexity: O(tn)



Proof of Theorem 2

Note that an item does not appear in any negative pool iff its column is contained in the union of the (at most d) positive columns.

Therefore, the number of items not appearing in any negative pool is more than d iff there is at least one negative item whose column is contained in the union of the positive columns.

But M is d-disjunct, so no such item exists; hence Theorem 2 follows.



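Decoding Algorithm

Input: d-disjunct matrix M and output vector V

Output: All positive clones

for each clone c in n clones
    if c is in a negative pool
        remove c
return remaining clones

Example (the last column is the output vector V):

        c1 c2 c3 c4 c5 c6    V
p1       1  1  1  0  0  0    1
p2       1  0  0  1  1  0    0
p3       0  1  0  1  0  1    0
p4       0  0  1  0  1  1    1

Pools p2 and p3 are negative, which removes c1, c2, c4, c5, and c6; the remaining clone c3 is positive.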
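The decoding loop translates almost directly into Python. This is a minimal sketch that assumes error-free outcomes and a d-disjunct M with at most d positives; the function name `decode` and the 0-based clone indices are illustrative choices.

```python
def decode(M, V):
    """Return the 0-based indices of clones that appear in no negative pool."""
    n = len(M[0])
    candidates = set(range(n))
    for row, outcome in zip(M, V):
        if outcome == 0:
            # Every clone in a negative pool must itself be negative.
            candidates -= {j for j in range(n) if row[j]}
    return sorted(candidates)


# The 4 x 6 example matrix, with rows p1..p4 and columns c1..c6.
M = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
V = [1, 0, 0, 1]
positives = decode(M, V)  # index 2 corresponds to clone c3
```

Each of the t rows is scanned once across n columns, which matches the O(tn) decoding time stated for d-disjunct matrices.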



Fields

Field: any set of elements that satisfies the field axioms for both addition and multiplication and is a commutative division algebra

E.g.: the complex, rational, and real numbers



Division Algebra



Finite Fields

Finite Field: a field with a finite field order, i.e., number of elements.

The order of a finite field is always a prime or a prime power (power of a prime).

E.g.: 16 = 2^4 is a prime power, whereas 6 and 15 are not.

E.g.: in GF(5), 4 + 3 = 7 is reduced to 2 modulo 5.
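Both examples can be reproduced with a short sketch. The helper names `is_prime_power` and `gf_add` are hypothetical. Note the assumption in `gf_add`: arithmetic modulo p gives GF(p) only when p is prime; a field of prime-power order such as GF(16) is not the integers modulo 16 and needs polynomial arithmetic instead.

```python
def is_prime_power(m):
    """Return True if m = p**r for some prime p and integer r >= 1."""
    if m < 2:
        return False
    p = 2
    while p * p <= m:
        if m % p == 0:
            # p is the smallest prime factor; m is a prime power
            # iff p is its only prime factor.
            while m % p == 0:
                m //= p
            return m == 1
        p += 1
    return True  # no divisor up to sqrt(m), so m itself is prime


def gf_add(a, b, p=5):
    """Addition in GF(p) for a prime p: ordinary addition reduced modulo p."""
    return (a + b) % p


valid_orders = [m for m in (5, 6, 15, 16) if is_prime_power(m)]
total = gf_add(4, 3)  # 4 + 3 = 7, reduced modulo 5
```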



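How to construct a d-disjunct matrix

Consider a finite field GF(q). Choose s, q, k satisfying:

n ≤ q^k and kd ≤ s ≤ q

Step 1: Construct matrix A (s × n) as follows:

for x from 0 to s - 1
    for each polynomial pj of degree less than k
        A[x, pj] = pj(x)

The rows of A are indexed by x = 0, 1, …, s - 1 and the columns by the polynomials p1, p2, …, pn over GF(q); the entry in row x and column pj is pj(x). There are q^k polynomials of degree less than k, which is why n ≤ q^k is required.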



Algorithm (cont)

Step 2: Construct matrix B (t × n) from A (s × n) as follows:

for x from 0 to s - 1
    for y from 0 to q - 1
        for each polynomial pj (column of A)
            if A[x, pj] == y
                B[(x, y), pj] = 1
            else
                B[(x, y), pj] = 0

The rows of B are indexed by the pairs (0, 0), (0, 1), …, (s - 1, q - 1) and the columns by p1, p2, …, pn. The entry in row (x, y) and column pj is 1 if pj(x) = y and 0 if pj(x) ≠ y.
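The two steps can be sketched as follows. Several assumptions are made here that go beyond the slides: q is taken to be prime so that GF(q) is plain arithmetic modulo q, the columns are the q^k polynomials with k coefficients (degree less than k), and the function names and the parameters q = 5, k = 2, s = 4, d = 2 are illustrative only.

```python
from itertools import combinations, product


def build_disjunct_matrix(q, k, s):
    """Build B for a prime q, with one column per polynomial of degree < k over GF(q).

    Rows are indexed by pairs (x, y) with 0 <= x < s and 0 <= y < q, in that
    order; the entry is 1 iff the column's polynomial evaluates to y at x.
    """
    # Each polynomial is a coefficient tuple (a0, a1, ..., a_{k-1}).
    polys = list(product(range(q), repeat=k))

    def evaluate(coeffs, x):
        return sum(a * pow(x, i, q) for i, a in enumerate(coeffs)) % q

    # Step 1: A[x][j] = p_j(x).
    A = [[evaluate(p, x) for p in polys] for x in range(s)]
    # Step 2: B[(x, y)][j] = 1 iff A[x][j] == y.
    return [[int(A[x][j] == y) for j in range(len(polys))]
            for x in range(s) for y in range(q)]


def is_d_disjunct(M, d):
    """Brute-force check that no column lies in the union of d other columns."""
    t, n = len(M), len(M[0])
    supports = [{i for i in range(t) if M[i][j]} for j in range(n)]
    for j in range(n):
        others = [c for c in range(n) if c != j]
        for group in combinations(others, d):
            if supports[j] <= set().union(*(supports[c] for c in group)):
                return False
    return True


q, k, s, d = 5, 2, 4, 2  # satisfies k*d <= s <= q
B = build_disjunct_matrix(q, k, s)
t, n = len(B), len(B[0])
```

With these parameters B has t = qs = 20 rows and n = q^k = 25 columns, and every column contains exactly s ones, one for each value of x.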



Algorithm Analysis

Theorem 3: (Correctness) If kd ≤ s ≤ q, then B (t × n) is d-disjunct.

Theorem 4: The number of tests t obtained from this algorithm is t = qs = O(q^2), where:

q = (2 + o(1)) (d log_2 n) / log_2(d log_2 n)



Errors in Experiments

False negative: a pool contains at least one positive clone but returns a negative outcome.

False positive: a pool contains only negative clones but returns a positive outcome.



An e-Error Correcting Model

Definition: Assume that there are at most e errors in testing. All positive clones can still be identified.

Hamming distance: the Hamming distance between two column vectors is the number of components in which they differ.

e-error-correcting: A matrix is said to be e-error-correcting if the Hamming distance between the unions of any two distinct sets of d columns is at least 2e + 1.



(d,e)-disjunct Matrix

Definition: A t × n binary matrix M is (d, e)-disjunct if, for any column j and any d other columns j1, j2, …, jd, there exist e + 1 rows i0, i1, …, ie such that M[iu, j] = 1 and M[iu, jv] = 0 for u = 0, 1, …, e and v = 1, 2, …, d.
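The definition says that each column must keep at least e + 1 rows of its own against any d other columns. A brute-force check for small matrices might look like the sketch below; the function name `is_de_disjunct` and the doubled identity matrix are hypothetical examples, not from the slides.

```python
from itertools import combinations


def is_de_disjunct(M, d, e):
    """Brute-force check of (d, e)-disjunctness.

    For every column j and every choice of d other columns, column j must
    have at least e + 1 rows with a 1 where all d chosen columns have a 0.
    """
    t, n = len(M), len(M[0])
    supports = [{i for i in range(t) if M[i][j]} for j in range(n)]
    for j in range(n):
        others = [c for c in range(n) if c != j]
        for group in combinations(others, d):
            covered = set().union(*(supports[c] for c in group))
            if len(supports[j] - covered) < e + 1:
                return False
    return True


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
doubled = identity + identity  # every pool of the identity design appears twice

single_ok = is_de_disjunct(identity, 2, 1)  # False: only one private row per column
doubled_ok = is_de_disjunct(doubled, 2, 1)  # True: two private rows per column
```

With e = 0 the condition reduces to ordinary d-disjunctness, so the 3 × 3 identity matrix passes as (2, 0)-disjunct.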



e-Error Correcting

Theorem 5: For every (d,k)-disjunct matrix, the Hamming distance between any two unions of d columns is at least 2k + 2



Theorem 6

Theorem 6: Suppose testing is based on a (d,e)-disjunct matrix. If the number of errors is at most e, then the number of negative pools containing a positive item is always smaller than the number of negative pools containing a negative item



Proof of Theorem 6

Let i be a positive item and j be a negative item. Suppose the number of negative pools containing i is m. Every pool containing i is truly positive, so these m pools must all have received errors. Hence, at most e - m errors turn a negative outcome into a positive outcome. Moreover, if no error occurred, the number of negative pools containing j would be at least e + 1, because M is (d,e)-disjunct. Hence the number of negative pools containing j is at least (e + 1) - (e - m) = m + 1 > m.



Decoding in e-error-correcting

Corollary: By Theorem 6, to decode the positives from tests based on a (d,e)-disjunct matrix, we only need to compute the number of negative pools containing each item and select the d items with the smallest counts. This runs in time O(nt).



Decoding Algorithm with e Errors

T = empty set

for each clone ci (i = 1…n)
    t(ci) = # negative pools containing ci
    T = T ∪ {t(ci)}
end for

Let Td = set of d smallest t(ci) in T

return ci if t(ci) in Td

Time complexity: O(tn)
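A minimal Python sketch of this counting decoder is given below. It assumes exactly d positives, at most e errors, and a (d,e)-disjunct M; the function name `decode_with_errors`, the tripled identity matrix, and the single injected false negative are hypothetical illustrations.

```python
def decode_with_errors(M, V, d):
    """Return the d clones contained in the fewest negative pools (0-based indices)."""
    n = len(M[0])
    # counts[j] = number of negative pools that contain clone j.
    counts = [
        sum(1 for row, outcome in zip(M, V) if outcome == 0 and row[j])
        for j in range(n)
    ]
    ranked = sorted(range(n), key=lambda j: counts[j])
    return sorted(ranked[:d])


# Hypothetical design: the 3 x 3 identity matrix repeated three times, so each
# clone has three pools of its own (this is (1, 2)-disjunct, hence also (1, 1)-disjunct).
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
M = identity * 3

# Clone 0 is the only positive; the error-free outcome would be 1 0 0 1 0 0 1 0 0.
# One false negative is injected in the first pool.
V = [0, 0, 0, 1, 0, 0, 1, 0, 0]
positives = decode_with_errors(M, V, d=1)
```

Here clone 0 lies in one negative pool while clones 1 and 2 each lie in three, so the positive clone is still recovered despite the error.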
