CSE/BIMM/BENG 181, SPRING 2010 SERGEI L KOSAKOVSKY POND [SPOND@UCSD.EDU] WWW.HYPHY.ORG/PUBS/181/LECTURES

RANDOMIZED ALGORITHMS

Thursday, June 3, 2010


CONCEPTS

Deterministic algorithms will always perform the same series of operations (and produce the same result) on a given input

Randomized algorithms may execute different series of operations, and potentially produce different results, on a given input

Why use randomized algorithms?

Improve the expected run time on worst-case inputs

Accelerate execution at the cost of making errors infrequently

Approximate solutions to difficult problems


WE’VE ALREADY SEEN RANDOM ALGORITHMS

Simulations

Bootstrap

Random starting points in K-means clustering


LINEAR SEARCH

Consider searching an (unsorted) array of N elements for a key.

Data: An array A of N elements and a key k

Result: The index of k in A, or −1 if not found.

    for i = 0 to N − 1 do
        if A[i] = k then
            return i
        end
    end
    return −1

Worst case run time: O(N)

Consider an adversarial opponent who knows the key and whose job is to make the algorithm perform as poorly as possible.

The adversary can always force the worst case by placing the key last.
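The deterministic search and the adversary's strategy can be sketched in Python. This is a minimal illustration, not the lecture's code; the names `linear_search` and `comparisons_used` and the sample data are assumptions made for the example.

```python
def linear_search(items, key):
    """Return the index of key in items, or -1 if it is absent."""
    for i, value in enumerate(items):
        if value == key:
            return i
    return -1


def comparisons_used(items, key):
    """Number of elements inspected by the left-to-right scan."""
    index = linear_search(items, key)
    return len(items) if index == -1 else index + 1


data = [7, 3, 9, 1, 4]
key = 3
# An adversary who knows the key moves it to the last position.
adversarial = [x for x in data if x != key] + [key]
print(comparisons_used(data, key), comparisons_used(adversarial, key))
```

With the key moved to the end, the scan has to inspect all N elements, which is the O(N) worst case.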


RANDOMIZED LINEAR SEARCH

Half the time, search left to right, and the other half: right to left

What is the expected run time for a given key?

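If the key is in position i (counting from 1), then:

E[run time] = (1/2) i + (1/2) (N + 1 − i) = (N + 1)/2

The expected run time is the same for every position, so an adversary who does not know which direction we will choose gains nothing from choosing where to place the key. Placing the key in the middle only makes both directions cost the same.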

Data: An array A of N elements and a key k

Result: The index of k in A, or −1 if not found.

    if Random(0, 1) < 0.5 then
        for i = 0 to N − 1 do
            if A[i] = k then
                return i
            end
        end
    else
        for i = N − 1 downto 0 do
            if A[i] = k then
                return i
            end
        end
    end
    return −1
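A runnable Python sketch of the randomized search, together with a small simulation of its expected cost. The function names, the trial count and the fixed seed are assumptions chosen for the illustration.

```python
import random


def randomized_linear_search(items, key, rng=random):
    """Scan left-to-right or right-to-left, each with probability 1/2.

    Returns (index, comparisons); index is -1 if key is absent.
    """
    n = len(items)
    order = range(n) if rng.random() < 0.5 else range(n - 1, -1, -1)
    for count, i in enumerate(order, start=1):
        if items[i] == key:
            return i, count
    return -1, n


def average_comparisons(items, key, trials=20000, seed=0):
    """Empirical mean number of comparisons over repeated searches."""
    rng = random.Random(seed)
    total = sum(randomized_linear_search(items, key, rng)[1] for _ in range(trials))
    return total / trials


items = list(range(11))          # N = 11, so (N + 1) / 2 = 6
for key in (0, 5, 10):           # first, middle and last positions
    print(key, average_comparisons(items, key))
```

For every key position the empirical average should be close to (N + 1)/2, in line with the expectation computed above.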


COMPARING STRINGS

To decide if two strings of equal length are equal or not, we may choose to compare only a random subset of positions (e.g. about one position in ten) and report the strings as equal if they match on that subset.

The expected run time is better than direct comparison but now there is a possibility of making an error: reporting that two strings are equal when they are not.

Error rate analysis will depend on the assumptions we make about the text.
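A minimal Python sketch of such a comparison. It assumes positions are sampled uniformly and independently (with replacement); the function names and the default of 20 samples are illustrative choices, not part of the lecture.

```python
import random


def probably_equal(a, b, samples=20, rng=random):
    """Monte Carlo equality test on `samples` randomly chosen positions.

    Equal strings are always reported as equal; different strings may be
    reported as equal if no sampled position happens to differ.
    """
    if len(a) != len(b):
        return False
    if not a:
        return True
    for _ in range(samples):
        i = rng.randrange(len(a))
        if a[i] != b[i]:
            return False
    return True


def false_positive_rate(n, d, samples):
    """P(report equal) when exactly d of n positions differ.

    Assumes uniform sampling with replacement: (1 - d/n) ** samples.
    """
    return (1 - d / n) ** samples


print(probably_equal("acgt" * 10, "acgt" * 10))
print(false_positive_rate(n=100, d=10, samples=20))
```

Under this sampling assumption the error rate falls geometrically with the number of sampled positions, but it depends on how many positions actually differ, which is where assumptions about the text come in.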


Las Vegas algorithms are randomized algorithms that always return a correct result; only their run time varies.

Monte Carlo algorithms, on the other hand, can make errors. Useful algorithms make errors infrequently.


RANDOMIZED QUICKSORT

Quicksort is a classic example of the divide-and-conquer approach.

Given an array of N elements, select a pivot element p of this array and split the array into two subarrays:

Elements smaller than p

Elements greater than p

Repeat the process recursively in each of the subarrays

Recursion terminates when the size of the subarray is 0 or 1

The sorted list can be assembled as the recursion returns
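The steps above, with the pivot chosen uniformly at random, can be sketched as follows. This is a simple out-of-place version written for clarity; handling elements equal to the pivot in a separate list is an assumption added here so that duplicates are sorted correctly.

```python
import random


def randomized_quicksort(values, rng=random):
    """Return a sorted copy of values, using a uniformly random pivot."""
    if len(values) <= 1:                          # recursion terminates
        return list(values)
    pivot = values[rng.randrange(len(values))]
    smaller = [v for v in values if v < pivot]    # elements smaller than p
    equal = [v for v in values if v == pivot]     # the pivot and its duplicates
    greater = [v for v in values if v > pivot]    # elements greater than p
    # The sorted list is assembled as the recursion returns.
    return randomized_quicksort(smaller, rng) + equal + randomized_quicksort(greater, rng)


print(randomized_quicksort([5, 6, 0, 1, 3, 4, 7, 2]))
```

The output is always correctly sorted; only the amount of work depends on the random pivot choices, which makes this a Las Vegas algorithm.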


QUICKSORT PERFORMANCE

If the pivot splits the elements into proportions c and 1 − c at each step, then the following recursion holds for Quicksort complexity:

T(N) = T(cN) + T([1 − c]N) + O(N)

where T(cN) is the cost of sorting subarray 1, T([1 − c]N) is the cost of sorting subarray 2, and O(N) is the cost of splitting the array into two subarrays.

If we think of the recursion process as a binary tree (each level of recursion is a new level), then the total work performed at each level is O(N).

If the tree is reasonably balanced (e.g. c stays between 0.25 and 0.75), then its depth will be O(log N), giving a total run time of O(N log N).
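The depth bound can be written out explicitly. This is a short derivation under two stated assumptions: every split has c between 1/4 and 3/4, and the work per level is at most aN for some constant a (the symbol a is introduced here for illustration).

```latex
% Assumptions: 1/4 <= c <= 3/4 at every split; each level costs at most aN.
\begin{align*}
  \text{size of the largest subarray at depth } d &\le \left(\tfrac{3}{4}\right)^{d} N, \\
  \left(\tfrac{3}{4}\right)^{d} N \le 1 &\iff d \ge \log_{4/3} N, \\
  T(N) &\le aN \left( \lceil \log_{4/3} N \rceil + 1 \right) = O(N \log N).
\end{align*}
```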


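Choose midpoint as the pivot

5 6 0 1 3 4 7 2 → 0 | 5 6 3 4 7 2

5 6 3 4 7 2 → 2 | 5 6 4 7

5 6 4 7 → 5 4 | 7

5 4 → 4 | ∅

(Each line shows a subarray and its split into elements smaller | greater than the pivot.)

Not a very balanced tree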


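Choose best element

5 6 0 1 3 4 7 2 → 0 1 2 | 5 6 4 7

0 1 2 → 0 | 2

5 6 4 7 → 4 | 6 7

6 7 → ∅ | 7

(Each line shows a subarray and its split into elements smaller | greater than the pivot.)

A better-balanced tree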


QUICKSORT PERFORMANCE

Poor choice of pivots can lead to quadratic run-time.

Assuming uniform distribution of values in the array, if we pick a pivot at random, then 50% of the time it will split the array no worse than 25% to 75%.

Hence, on average, good splits will be obtained frequently and the algorithm is expected to have O(N log N) run time.
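The 50% figure can be checked with a small simulation. This sketch assumes N distinct elements, so that the rank of a randomly chosen pivot is uniform on 0..N−1; the function name, trial count and seed are illustrative.

```python
import random


def good_split_fraction(n, trials=10000, seed=0):
    """Fraction of random pivots whose split is no worse than 25% / 75%."""
    rng = random.Random(seed)
    good = 0
    for _ in range(trials):
        rank = rng.randrange(n)      # number of elements smaller than the pivot
        if n / 4 <= rank <= 3 * n / 4:
            good += 1
    return good / trials


print(good_split_fraction(1000))
```

The printed fraction should be close to one half, since about half of all possible pivot ranks fall in the middle 50% of the sorted order.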


MOTIF FINDING

Given a collection of T strings, find the best pattern of length L that appears in all sequences. “Best” could mean:

Exactly contained

Is close to a specified sequence profile

The idea is to recast the problem in terms of a randomized profile alignment

For example, the first 16-20 nucleotides in HIV-1 protease are highly conserved, and can be thought of as a motif:

[Figure: frequency (0 to 1) of each nucleotide (A, C, G, T) at positions 1 to 10 of the motif]


SEQUENCE PROFILE

We can build a probability matrix of finding a given nucleotide in the i-th position of a motif, to allow some mismatches.

E.g. CCCATTAGTC is the consensus decamer at the beginning of HIV-1 protease.

Some positions allow more variability than others, e.g. position 2 vs. position 3.

Position   A        C        G        T
1          0        0.9998   0.0002   0
2          0        1        0        0
3          0.0172   0.9017   0        0.0811
4          0.9995   0.0005   0        0
5          0.0001   0.0001   0.0001   0.9996
6          0.0245   0.0267   0        0.9487
7          0.9995   0.0005   0        0
8          0        0        1        0
9          0.0005   0.0311   0        0.9684
10         0.0011   0.9953   0        0.0036


SEQUENCE PROFILE

The profile defines a probability distribution that can be used to score other motifs.

To compute the score of a motif, we evaluate the probability that it was generated by the profile: the product, over positions, of the profile probability of the letter observed at that position.

E.g.

Pr (CCCATTAGTC) = 0.82

Pr (CCTATTAGTC) = 0.07

Pr (AAAAAAAAAA) = 0.00
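These three scores can be reproduced with a few lines of Python. The profile values are copied from the HIV-1 protease table; the names `PROFILE`, `COLUMN` and `motif_probability` are assumptions made for this sketch.

```python
# Rows are motif positions 1..10; columns are A, C, G, T
# (values copied from the HIV-1 protease profile table).
PROFILE = [
    (0,      0.9998, 0.0002, 0),
    (0,      1,      0,      0),
    (0.0172, 0.9017, 0,      0.0811),
    (0.9995, 0.0005, 0,      0),
    (0.0001, 0.0001, 0.0001, 0.9996),
    (0.0245, 0.0267, 0,      0.9487),
    (0.9995, 0.0005, 0,      0),
    (0,      0,      1,      0),
    (0.0005, 0.0311, 0,      0.9684),
    (0.0011, 0.9953, 0,      0.0036),
]
COLUMN = {"A": 0, "C": 1, "G": 2, "T": 3}


def motif_probability(motif, profile=PROFILE):
    """Probability that the profile generates motif (product over positions)."""
    if len(motif) != len(profile):
        raise ValueError("motif and profile must have the same length")
    p = 1.0
    for row, letter in zip(profile, motif):
        p *= row[COLUMN[letter]]
    return p


for motif in ("CCCATTAGTC", "CCTATTAGTC", "AAAAAAAAAA"):
    print(motif, round(motif_probability(motif), 2))
```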



P-MOST PROBABLE L-MER

Given a sequence profile P of length L and a string of length N, we define the P-most probable L-mer of the string as the L-mer that has the highest probability as measured by the profile P.

Scan the string left to right, considering all N − L + 1 L-mers

Compute the probability of generating each L-mer using the profile

Select the one with the highest score

Can be computed in time O(LN)

Note: in practice, zero probabilities for some letters in a given position will be replaced with small numbers.

For example, if none of 1000 training strings had a ‘C’ in the third position, then instead of assigning Pr (C,3) = 0, we may set Pr (C,3) = 1/1001.
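The scan can be sketched in Python as below. The example profile is the 6-column profile P used for the multi-sequence example in this lecture, and the string is the first of those sequences; the function names are assumptions. Zero probabilities are kept as given here, so no pseudocount is applied in this sketch.

```python
# Profile as a dict: letter -> probability at each of the L positions.
P = {
    "a": [1/2, 7/8, 3/8, 0,   1/8, 0],
    "c": [1/8, 0,   1/2, 5/8, 3/8, 0],
    "t": [1/8, 1/8, 0,   0,   1/4, 7/8],
    "g": [1/4, 0,   1/8, 3/8, 1/4, 1/8],
}


def lmer_probability(lmer, profile):
    """Probability that the profile generates lmer."""
    p = 1.0
    for i, letter in enumerate(lmer):
        p *= profile[letter][i]
    return p


def most_probable_lmer(text, profile):
    """Return (start, lmer, probability) of the profile-most probable L-mer.

    Scans all N - L + 1 windows, each scored in O(L): O(LN) overall.
    Ties are broken in favour of the leftmost window.
    """
    length = len(next(iter(profile.values())))
    best_start, best_p = 0, -1.0
    for start in range(len(text) - length + 1):
        p = lmer_probability(text[start:start + length], profile)
        if p > best_p:
            best_start, best_p = start, p
    return best_start, text[best_start:best_start + length], best_p


print(most_probable_lmer("ctataaacgttacatc", P))
```

Start positions in this sketch are 0-based, unlike the 1-based positions used on the slides.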


P-MOST PROBABLE L-MERS IN MANY SEQUENCES

Find the P-most probable L-mer in each of the sequences.

ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtataccttacatc
tgcattcaatagctta
tatcctttccactcac
ctccaaatcctttaca
ggtcatcctttatcct

P =

     1    2    3    4    5    6
A   1/2  7/8  3/8   0   1/8   0
C   1/8   0   1/2  5/8  3/8   0
T   1/8  1/8   0    0   1/4  7/8
G   1/4   0   1/8  3/8  1/4  1/8


ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtgaaccttacatc
tgcattcaatagctta
tgtcctgtccactcac
ctccaaatcctttaca
ggtctacctttatcct

P-most probable L-mers:

1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t

Generate an updated profile using the new set of P-most probable L-mers.

Initial profile

     1    2    3    4    5    6
A   1/2  7/8  3/8   0   1/8   0
C   1/8   0   1/2  5/8  3/8   0
T   1/8  1/8   0    0   1/4  7/8
G   1/4   0   1/8  3/8  1/4  1/8

Updated profile

     1    2    3    4    5    6
A   5/8  5/8  4/8   0    0    0
C    0    0   4/8  6/8  4/8   0
T   1/8  3/8   0    0   3/8  6/8
G   2/8   0    0   2/8  1/8  2/8

(In the original slide, red marks an increase in frequency and blue a decrease in frequency.)


GREEDY PROFILE MOTIF SEARCH

Use P-most probable L-mers to adjust start positions until we reach a “best” profile; this is the motif.

1. Select random starting positions.

2. Create a profile P from the substrings at these starting positions.

3. Find the P-most probable L-mer a in each sequence and change the starting position to the starting position of a.

4. Compute a new profile based on the new starting positions after each iteration and proceed until we cannot increase the score.
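The procedure above can be sketched as follows. Several details are assumptions made for this illustration, since the slides do not fix them: the score is taken to be the consensus score (sum over columns of the count of the most common letter), the profile uses add-one pseudocounts, and all names are hypothetical.

```python
import random

ALPHABET = "acgt"


def build_profile(lmers, pseudocount=1.0):
    """Column-wise letter frequencies, smoothed with a pseudocount (assumed)."""
    length = len(lmers[0])
    total = len(lmers) + pseudocount * len(ALPHABET)
    profile = {a: [pseudocount / total] * length for a in ALPHABET}
    for lmer in lmers:
        for i, letter in enumerate(lmer):
            profile[letter][i] += 1.0 / total
    return profile


def most_probable_start(seq, profile, length):
    """0-based start of the profile-most probable L-mer in seq."""
    def prob(start):
        p = 1.0
        for i, letter in enumerate(seq[start:start + length]):
            p *= profile[letter][i]
        return p
    return max(range(len(seq) - length + 1), key=prob)


def score(lmers):
    """Consensus score (an assumed choice of scoring function)."""
    return sum(max(col.count(a) for a in ALPHABET) for col in zip(*lmers))


def greedy_profile_motif_search(seqs, length, rng=random, max_iter=100):
    starts = [rng.randrange(len(s) - length + 1) for s in seqs]      # step 1
    lmers = [s[i:i + length] for s, i in zip(seqs, starts)]
    best = score(lmers)
    for _ in range(max_iter):
        profile = build_profile(lmers)                               # steps 2 and 4
        new_starts = [most_probable_start(s, profile, length) for s in seqs]  # step 3
        new_lmers = [s[i:i + length] for s, i in zip(seqs, new_starts)]
        if score(new_lmers) <= best:                                 # no improvement
            break
        starts, lmers, best = new_starts, new_lmers, score(new_lmers)
    return starts, lmers, best


seqs = ["ctataaacgttacatc", "atagcgattcgactg", "cagcccagaaccct", "cggtgaaccttacatc"]
print(greedy_profile_motif_search(seqs, 6, random.Random(0)))
```

Because the result depends on the random starting positions, different seeds can return different motifs.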


PERFORMANCE?

Since we choose starting positions randomly, there is little chance that our guess will be close to an optimal motif, meaning it will take a very long time to find the optimal motif.

It is unlikely that the random starting positions will lead us to the correct solution at all.

In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimum solution simply by chance.


GIBBS SAMPLING

We can improve the algorithm by introducing Gibbs sampling, an iterative procedure that discards one L-mer at each iteration and replaces it with a new one.

Gibbs sampling proceeds more slowly and chooses new L-mers at random, increasing the odds that it will converge to the correct solution.

Gibbs sampling is a general class of sampling procedures used for approximating complex, difficult-to-compute distributions: in our case, the distribution of profiles P.


HOW GIBBS SAMPLING WORKS

1. Randomly choose starting positions s = (s1, ..., sT) and form the set of L-mers associated with these starting positions.

2. Randomly choose one of the T sequences.

3. Create a profile P from the L-mers in the other T − 1 sequences.

4. For each position in the excluded sequence, calculate the probability that the L-mer starting at that position was generated by P.

5. Choose a new starting position for the excluded sequence at random based on the probabilities from step 4.

6. Repeat steps 2-5 until there is no improvement.
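A compact Python sketch of these steps. Two simplifications are assumptions of this sketch rather than part of the slides: the loop runs for a fixed number of iterations instead of testing for "no improvement", and the profile uses add-one pseudocounts so that every candidate position keeps a non-zero weight. All names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def build_profile(lmers, pseudocount=1.0):
    """Column-wise letter frequencies with a pseudocount (assumed smoothing)."""
    length = len(lmers[0])
    total = len(lmers) + pseudocount * len(ALPHABET)
    profile = {a: [pseudocount / total] * length for a in ALPHABET}
    for lmer in lmers:
        for i, letter in enumerate(lmer):
            profile[letter][i] += 1.0 / total
    return profile


def lmer_probability(lmer, profile):
    p = 1.0
    for i, letter in enumerate(lmer):
        p *= profile[letter][i]
    return p


def gibbs_sampler(seqs, length, iterations=1000, rng=None):
    rng = rng or random.Random()
    # Step 1: random starting positions (0-based in this sketch).
    starts = [rng.randrange(len(s) - length + 1) for s in seqs]
    for _ in range(iterations):                      # step 6, fixed budget
        j = rng.randrange(len(seqs))                 # step 2: sequence to exclude
        others = [s[i:i + length]
                  for k, (s, i) in enumerate(zip(seqs, starts)) if k != j]
        profile = build_profile(others)              # step 3
        weights = [lmer_probability(seqs[j][p:p + length], profile)   # step 4
                   for p in range(len(seqs[j]) - length + 1)]
        # Step 5: sample the new start in proportion to the weights
        # (random.choices normalizes them internally).
        starts[j] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts, [s[i:i + length] for s, i in zip(seqs, starts)]


seqs = ["GTAAACAATATTTATAGC", "AAAATTTACCTTAGAAGG", "CCGTACTGTCAAGCGTGG",
        "TGAGTAAACGACGTCCCA", "TACTTAACACCCTGTCAA"]
print(gibbs_sampler(seqs, 8, iterations=200, rng=random.Random(0)))
```

Unlike the greedy search, step 5 samples the new position rather than always taking the most probable one, which lets the sampler leave poor starting configurations.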


GIBBS SAMPLING: AN EXAMPLE

Input:

T = 5 sequences, motif length L = 8

1. GTAAACAATATTTATAGC

2. AAAATTTACCTTAGAAGG

3. CCGTACTGTCAAGCGTGG

4. TGAGTAAACGACGTCCCA

5. TACTTAACACCCTGTCAA


STEP 1

Randomly choose starting positions.

1. GTAAACAATATTTATAGC (7)

2. AAAATTTACCTTAGAAGG (11)

3. CCGTACTGTCAAGCGTGG (9)

4. TGAGTAAACGACGTCCCA (4)

5. TACTTAACACCCTGTCAA (1)


STEP 2

Exclude one sequence at random (here, sequence 2)

1. GTAAACAATATTTATAGC (7)

2. AAAATTTACCTTAGAAGG (11)

3. CCGTACTGTCAAGCGTGG (9)

4. TGAGTAAACGACGTCCCA (4)

5. TACTTAACACCCTGTCAA (1)


STEP 3

Create the octamer profile from sequences 1, 3, 4, 5

Seq  1 2 3 4 5 6 7 8
1    A A T A T T T A
3    T C A A G C G T
4    G T A A A C G A
5    T A C T T A A C

     1    2    3    4    5    6    7    8
A   1/4  2/4  2/4  3/4  1/4  1/4  1/4  2/4
C    0   1/4  1/4   0    0   2/4   0   1/4
T   2/4  1/4  1/4  1/4  2/4  1/4  1/4  1/4
G   1/4   0    0    0   1/4   0   2/4   0

Consensus string: T A A A T C G A

STEP 4

Calculate the probability of every octamer from the excluded sequence (2), AAAATTTACCTTAGAAGG

Start  Octamer    Probability
1      AAAATTTA   .000732
2      AAATTTAC   .000122
3      AATTTACC   0
4      ATTTACCT   0
5      TTTACCTT   0
6      TTACCTTA   0
7      TACCTTAG   0
8      ACCTTAGA   .000122
9      CCTTAGAA   0
10     CTTAGAAG   0
11     TTAGAAGG   0

Normalize (divide by the sum) to obtain a proper probability distribution.

STEP 5

Select the starting position of the octamer in string 2 using the just-computed probabilities:

P(selecting starting position 1): .750

P(selecting starting position 2): .125

P(selecting starting position 8): .125

Go back to Step 2 until there is no change in the P-profile.
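Steps 3 to 5 of this example can be recomputed directly from the input sequences and the starting positions chosen in Step 1. This is a minimal sketch; the variable and function names are illustrative, no pseudocounts are used, and exact fractions are used so that rounding does not blur the result.

```python
from fractions import Fraction

SEQS = ["GTAAACAATATTTATAGC", "AAAATTTACCTTAGAAGG", "CCGTACTGTCAAGCGTGG",
        "TGAGTAAACGACGTCCCA", "TACTTAACACCCTGTCAA"]
STARTS = [7, 11, 9, 4, 1]   # 1-based starting positions from Step 1
L = 8                       # motif length
EXCLUDED = 1                # 0-based index of the excluded sequence (sequence 2)

# Step 3: octamers of the four remaining sequences.
lmers = [s[p - 1:p - 1 + L]
         for k, (s, p) in enumerate(zip(SEQS, STARTS)) if k != EXCLUDED]


def column_frequency(letter, i):
    """Profile entry: fraction of the octamers with `letter` in column i."""
    return Fraction(sum(m[i] == letter for m in lmers), len(lmers))


def probability(octamer):
    """Probability that the profile generates the given octamer."""
    p = Fraction(1)
    for i, letter in enumerate(octamer):
        p *= column_frequency(letter, i)
    return p


# Step 4: score every octamer of the excluded sequence, then normalize.
target = SEQS[EXCLUDED]
raw = [probability(target[s:s + L]) for s in range(len(target) - L + 1)]
total = sum(raw)
normalized = [p / total for p in raw]

# Step 5 would sample a start position from `normalized`; here we only
# print the candidates that have non-zero probability.
for pos, (r, q) in enumerate(zip(raw, normalized), start=1):
    if r:
        print(pos, target[pos - 1:pos - 1 + L], float(r), float(q))
```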


GIBBS SAMPLER IN PRACTICE

Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (relative entropy approach).

Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs.

It needs to be run with many randomly chosen seeds to achieve good results.
