CSE/BIMM/BENG 181, SPRING 2010 SERGEI L KOSAKOVSKY POND [[email protected]] WWW.HYPHY.ORG/PUBS/181/LECTURES
RANDOMIZED ALGORITHMS
Thursday, June 3, 2010
CONCEPTS
Deterministic algorithms will always perform the same series of operations (and produce the same result) on a given input
Randomized algorithms may execute a different sequence of operations, and potentially produce different results, on the same input
Why use randomized algorithms?
Improve the expected run time in the worst case
Accelerate execution at the cost of making errors infrequently
Approximate solutions to difficult problems
WE’VE ALREADY SEEN RANDOMIZED ALGORITHMS
Simulations
Bootstrap
Random starting points in K-means clustering
LINEAR SEARCH
Consider searching an (unsorted) array of N elements for a key.
Data: An array N of N elements and a key k
Result: The index of k in N, or −1 if not found.

for i = 0 to N − 1 do
    if N[i] = k then
        return i
    end
end
return −1
Worst case run time: O(N)
For a given key, consider an adversarial opponent whose job is to make the algorithm perform as poorly as possible, knowing which key we will search for
The adversary can always force the worst case by placing the key in the last position scanned
RANDOMIZED LINEAR SEARCH
Half the time, search left to right; the other half, search right to left.
What is the expected run time for a given key?
If the key is in position i, the left-to-right scan finds it after i comparisons and the right-to-left scan after N + 1 − i, so:
E[run time] = (1/2)·i + (1/2)·(N + 1 − i) = (N + 1)/2
The worst that an adversary can do now is place the key in the middle, because they don’t know which algorithm we will choose.
Data: An array of N elements and a key k
Result: The index of k in N, or −1 if not found.

if Random(0, 1) < 0.5 then
    for i = 0 to N − 1 do
        if N[i] = k then
            return i
        end
    end
else
    for i = N − 1 downto 0 do
        if N[i] = k then
            return i
        end
    end
end
return −1
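A minimal Python rendering of the pseudocode above (a sketch; the function name is ours):

    import random

    def randomized_linear_search(arr, key):
        # Flip a fair coin to choose the scan direction.
        if random.random() < 0.5:
            order = range(len(arr))              # left to right
        else:
            order = range(len(arr) - 1, -1, -1)  # right to left
        for i in order:
            if arr[i] == key:
                return i
        return -1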
COMPARING STRINGS
To decide whether two strings of equal length are equal, we may compare only a sample of the positions (e.g., roughly every 10th position, chosen at random) and report the strings as equal if all sampled positions match.
The expected run time is better than direct comparison, but now there is a possibility of error: reporting that two strings are equal when they are not.
Error rate analysis will depend on the assumptions we make about the text.
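A minimal sketch of such a comparison, assuming we sample a random 10% of the positions (the function name and sampling rate are illustrative, not from the slides):

    import random

    def probably_equal(s, t, fraction=0.1):
        # Assumes len(s) == len(t) > 0; compares a random sample of positions.
        k = max(1, int(len(s) * fraction))
        for i in random.sample(range(len(s)), k):
            if s[i] != t[i]:
                return False  # a mismatch proves the strings differ
        return True           # probably equal; may be wrong (Monte Carlo)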
Las Vegas algorithms are randomized algorithms that always run correctly.
Monte Carlo algorithms, on the other hand, can make errors. Useful Monte Carlo algorithms make errors infrequently.
RANDOMIZED QUICKSORT
Quicksort is a classic example of the divide-and-conquer approach.
Given an array of N elements, select a pivot element p and split the array into two subarrays:
Elements smaller than p
Elements greater than p
Repeat the process recursively in each of the subarrays
Recursion terminates when the size of the subarray is 0 or 1
The sorted list can be assembled as the recursion returns
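A minimal Python sketch of the scheme above, with a randomly chosen pivot (an explicit "equal" bucket handles duplicate values, which the two-way split above glosses over):

    import random

    def randomized_quicksort(arr):
        # Recursion terminates on subarrays of size 0 or 1.
        if len(arr) <= 1:
            return arr
        pivot = random.choice(arr)
        smaller = [x for x in arr if x < pivot]
        equal = [x for x in arr if x == pivot]
        greater = [x for x in arr if x > pivot]
        # Assemble the sorted list as the recursion returns.
        return randomized_quicksort(smaller) + equal + randomized_quicksort(greater)

For example, randomized_quicksort([5, 6, 0, 1, 3, 4, 7, 2]) returns [0, 1, 2, 3, 4, 5, 6, 7].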
QUICKSORT PERFORMANCE
If the pivot splits the elements into proportions c and 1 − c at each step, then the following recursion holds for Quicksort complexity:

T(N) = T(cN) + T((1 − c)N) + O(N)

where the T(cN) and T((1 − c)N) terms sort the two subarrays and the O(N) term splits the array into them.
If we think of the recursion process as a binary tree (each level of recursion adds a new level to the tree), then the total work performed at each level is O(N)
If the tree is reasonably balanced (e.g., c between 0.25 and 0.75), the larger subarray shrinks by a factor of at least 3/4 per level, so the depth is O(log N), giving a total run time of O(N log N)
[Figure: recursion tree for sorting 5 6 0 1 3 4 7 2, choosing the midpoint element as the pivot at each step. Not a very balanced tree.]
[Figure: recursion tree for the same array, 5 6 0 1 3 4 7 2, choosing the best (median) element as the pivot at each step. A better-balanced tree.]
QUICKSORT PERFORMANCE
Poor choice of pivots can lead to quadratic run-time.
Assuming a uniform distribution of values in the array, a pivot picked at random will, 50% of the time, split the array no worse than 25% to 75%.
Hence, on average, good splits are obtained frequently, and the algorithm has an expected run time of O(N log N).
MOTIF FINDING
Given a collection of T strings, find the best pattern of length L that appears in all sequences. “Best” could mean:
Exactly contained
Is close to a specified sequence profile
The idea is to recast the problem in terms of a randomized profile alignment
For example, the first 16-20 nucleotides in HIV-1 protease are highly conserved, and can be thought of as a motif:
[Figure: per-position frequencies of A, C, G, and T (y-axis from 0 to 1) over the first 10 positions of the motif.]
SEQUENCE PROFILE
We can build a probability matrix of finding a given nucleotide in the i-th position of a motif, to allow some mismatches.
E.g., CCCATTAGTC is the consensus decamer at the beginning of HIV-1 protease.
Some positions allow more variability than others: e.g., position 2 vs. position 3.
Pos    A        C        G        T
1      0        0.9998   0.0002   0
2      0        1        0        0
3      0.0172   0.9017   0        0.0811
4      0.9995   0.0005   0        0
5      0.0001   0.0001   0.0001   0.9996
6      0.0245   0.0267   0        0.9487
7      0.9995   0.0005   0        0
8      0        0        1        0
9      0.0005   0.0311   0        0.9684
10     0.0011   0.9953   0        0.0036
SEQUENCE PROFILE
The profile defines a probability distribution that can be used to score other motifs.
To compute the score of a motif, we evaluate the probability that it was generated by the profile.
E.g., using the profile above:
Pr(CCCATTAGTC) = 0.82
Pr(CCTATTAGTC) = 0.07
Pr(AAAAAAAAAA) = 0.00
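A minimal sketch of this scoring, assuming the profile is stored as a dict mapping each nucleotide to its list of per-position probabilities (e.g., profile['C'][0] = 0.9998 for the table above):

    def profile_probability(motif, profile):
        # Multiply the per-position probabilities of the observed letters.
        p = 1.0
        for i, nucleotide in enumerate(motif):
            p *= profile[nucleotide][i]
        return p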
P-MOST PROBABLE L-MER
Given a sequence profile P over L positions and a string of length N, we define the most probable L-mer of the string as the one that has the highest probability as measured by the profile P.
Scan the string left to right, considering all L-mers
Compute the probability of generating a given L-mer using the profile
Select the one with the highest score
Can be computed in O(LN) time
Note: in practice, zero probabilities for some letters in a given position are replaced with small numbers (pseudocounts).
For example, if none of 1,000 training strings had a ‘C’ in the third position, instead of assigning Pr(C, 3) = 0 we may set Pr(C, 3) = 1/1001.
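A sketch of the scan, reusing the scoring idea above (the function name is ours):

    def most_probable_lmer(text, L, profile):
        # Score every L-mer against the profile; O(L*N) overall.
        def score(lmer):
            p = 1.0
            for i, nucleotide in enumerate(lmer):
                p *= profile[nucleotide][i]
            return p
        starts = range(len(text) - L + 1)
        best = max(starts, key=lambda i: score(text[i:i + L]))
        return text[best:best + L]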
P-MOST PROBABLE L-MERS IN MANY SEQUENCES
Find the P-most probable L-mer in each of the sequences:
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtataccttacatc
tgcattcaatagctta
tatcctttccactcac
ctccaaatcctttaca
ggtcatcctttatcct
P =

A  1/2  7/8  3/8  0    1/8  0
C  1/8  0    1/2  5/8  3/8  0
T  1/8  1/8  0    0    1/4  7/8
G  1/4  0    1/8  3/8  1/4  1/8
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtgaaccttacatc
tgcattcaatagctta
tgtcctgtccactcac
ctccaaatcctttaca
ggtctacctttatcct
The P-most probable 6-mers, one per sequence:

1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t

Initial profile P:

A  1/2  7/8  3/8  0    1/8  0
C  1/8  0    1/2  5/8  3/8  0
T  1/8  1/8  0    0    1/4  7/8
G  1/4  0    1/8  3/8  1/4  1/8

Generate an updated profile using the new set of P-most probable L-mers (in the original slide, red marked frequencies that increased and blue those that decreased):

A  5/8  5/8  4/8  0    0    0
C  0    0    4/8  6/8  4/8  0
T  1/8  3/8  0    0    3/8  6/8
G  2/8  0    0    2/8  1/8  2/8
GREEDY PROFILE MOTIF SEARCH
Use P-most probable L-mers to adjust starting positions until we reach a “best” profile; this is the motif:
Select random starting positions.
Create a profile P from the substrings at these starting positions.
Find the P-most probable L-mer a in each sequence and change its starting position to the starting position of a.
Compute a new profile from the new starting positions after each iteration, and proceed until the score can no longer be increased.
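A minimal sketch of this loop (an iteration cap stands in for the score-based stopping rule, and pseudocounts are omitted for brevity; both are our simplifications):

    import random

    def build_profile(lmers, alphabet="ACGT"):
        # Column-wise letter frequencies over the chosen L-mers.
        L, n = len(lmers[0]), len(lmers)
        return {a: [sum(m[i] == a for m in lmers) / n for i in range(L)]
                for a in alphabet}

    def greedy_profile_search(seqs, L, max_iter=1000):
        # Step 1: random starting positions.
        starts = [random.randrange(len(s) - L + 1) for s in seqs]
        for _ in range(max_iter):
            # Step 2: profile from the current substrings.
            profile = build_profile([s[p:p + L] for s, p in zip(seqs, starts)])
            def score(lmer):
                prob = 1.0
                for i, c in enumerate(lmer):
                    prob *= profile[c][i]
                return prob
            # Step 3: move each start to its P-most probable L-mer.
            new = [max(range(len(s) - L + 1), key=lambda i: score(s[i:i + L]))
                   for s in seqs]
            if new == starts:  # converged: no starting position moved
                break
            starts = new
        return starts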
PERFORMANCE?
Since we choose starting positions randomly, there is little chance that our initial guess will be close to an optimal motif, so finding the optimal motif can take a very long time.
Indeed, a single run from random starting positions is unlikely to lead to the correct solution at all.
In practice, this algorithm is run many times, in the hope that some set of random starting positions will be close to the optimal solution simply by chance.
GIBBS SAMPLING
We can improve the algorithm by introducing Gibbs Sampling, an iterative procedure that discards one L-mer after each iteration and replaces it with a new one.
Gibbs sampling proceeds more slowly and chooses new L-mers at random, increasing the odds that it will converge to the correct solution.
Gibbs sampling is a general class of sampling procedures used for approximating complex, difficult-to-compute distributions: in our case, the distribution of profiles P.
HOW GIBBS SAMPLING WORKS
1. Randomly choose starting positions s = (s1, ..., sT) and form the set of L-mers associated with these starting positions.
2. Randomly choose one of the T sequences.
3. Create a profile P from the other T − 1 sequences.
4. For each position in the excluded sequence, calculate the probability that the L-mer starting at that position was generated by P.
5. Choose a new starting position for the excluded sequence at random, based on the probabilities from step 4.
6. Repeat steps 2-5 until there is no improvement.
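A minimal sketch of steps 1-6, with a fixed iteration budget in place of the convergence test and +1 pseudocounts so no L-mer scores exactly zero (both are our assumptions):

    import random

    def gibbs_sampler(seqs, L, n_iter=1000):
        # Step 1: random starting positions, one per sequence.
        starts = [random.randrange(len(s) - L + 1) for s in seqs]
        for _ in range(n_iter):
            x = random.randrange(len(seqs))  # step 2: sequence to exclude
            others = [s[p:p + L]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != x]
            # Step 3: profile from the other T - 1 L-mers, with +1 pseudocounts.
            profile = {a: [(sum(m[i] == a for m in others) + 1) /
                           (len(others) + 4) for i in range(L)]
                       for a in "ACGT"}
            def score(lmer):
                p = 1.0
                for i, c in enumerate(lmer):
                    p *= profile[c][i]
                return p
            # Step 4: probability of the L-mer at every position of the
            # excluded sequence.
            positions = range(len(seqs[x]) - L + 1)
            weights = [score(seqs[x][i:i + L]) for i in positions]
            # Step 5: draw the new start in proportion to those probabilities
            # (random.choices normalizes the weights for us).
            starts[x] = random.choices(list(positions), weights=weights)[0]
        return starts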
GIBBS SAMPLING: AN EXAMPLE
Input:
T = 5 sequences, motif length L = 8
1. GTAAACAATATTTATAGC
2. AAAATTTACCTTAGAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA
STEP 1
Randomly choose starting positions.
1. GTAAACAATATTTATAGC (7)
2. AAAATTTACCTTAGAAGG (11)
3. CCGTACTGTCAAGCGTGG (9)
4. TGAGTAAACGACGTCCCA (4)
5. TACTTAACACCCTGTCAA (1)
STEP 2
Exclude one sequence at random; here, sequence 2 is excluded.
1. GTAAACAATATTTATAGC (7)
2. AAAATTTACCTTAGAAGG (11)
3. CCGTACTGTCAAGCGTGG (9)
4. TGAGTAAACGACGTCCCA (4)
5. TACTTAACACCCTGTCAA (1)
STEP 3
Create the octamer profile from sequences 1, 3, 4, 5:

1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C

A  1/4  2/4  2/4  3/4  1/4  1/4  1/4  2/4
C  0    1/4  1/4  0    0    2/4  0    1/4
T  2/4  1/4  1/4  1/4  2/4  1/4  1/4  1/4
G  1/4  0    0    0    1/4  0    3/4  0

Consensus string: T A A A T C G A
STEP 4
Calculate the probability of every octamer of the excluded sequence (2), AAAATTTACCTTAGAAGG:

Start 1:  AAAATTTA  .000732
Start 2:  AAATTTAC  .000122
Start 3:  AATTTACC  0
Start 4:  ATTTACCT  0
Start 5:  TTTACCTT  0
Start 6:  TTACCTTA  0
Start 7:  TACCTTAG  0
Start 8:  ACCTTAGA  .000183
Start 9:  CCTTAGAA  0
Start 10: CTTAGAAG  0
Start 11: TTAGAAGG  0

Normalize (divide by the sum) to obtain a proper probability distribution.
STEP 5
Select the starting position of the octamer in string 2 using the just-computed probabilities (positions with probability 0 can never be selected):
P(selecting starting position 1) = .706
P(selecting starting position 2) = .118
P(selecting starting position 8) = .176
Go back to step 2 and repeat until the profile P no longer changes.
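A quick arithmetic check of the normalization behind these three values:

    probs = [0.000732, 0.000122, 0.000183]      # nonzero octamer scores
    total = sum(probs)                           # 0.001037
    print([round(p / total, 3) for p in probs])  # [0.706, 0.118, 0.176]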
GIBBS SAMPLER IN PRACTICE
Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (a relative entropy approach).
Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs.
It needs to be run with many randomly chosen seeds to achieve good results.