CSE/BIMM/BENG 181, SPRING 2010 SERGEI L KOSAKOVSKY POND [[email protected]] WWW.HYPHY.ORG/PUBS/181/LECTURES
RANDOMIZED ALGORITHMS
Thursday, June 3, 2010
CONCEPTS
Deterministic algorithms will always perform the same series of operations (and produce the same result) on a given input
Randomized algorithms may execute a different sequence of operations, and potentially produce different results, on the same input
Why use randomized algorithms?
Improve the expected run time in the worst case
Accelerate execution at the cost of making errors infrequently
Approximate solutions to difficult problems
WE’VE ALREADY SEEN RANDOMIZED ALGORITHMS
Simulations
Bootstrap
Random starting points in K-means clustering
LINEAR SEARCH
Consider searching an (unsorted) array of N elements for a key.
Data: An array N of N elements and a key k
Result: The index of k in N, or −1 if not found.

for i = 0 to N − 1 do
    if N[i] = k then
        return i
    end
end
return −1
Worst case run time: O(N)
For a given key, consider an adversarial opponent whose job is to make the algorithm perform as poorly as possible, knowing which key we will search for
The adversary can always force the worst case by placing the key in the last position scanned
RANDOMIZED LINEAR SEARCH
Half the time, search left to right; the other half, search right to left.
What is the expected run time for a given key?
If the key is in position i, the left-to-right scan finds it after i comparisons and the right-to-left scan after N + 1 − i, so:
E[run time] = (1/2)·i + (1/2)·(N + 1 − i) = (N + 1)/2
The worst that an adversary can do now is place the key in the middle, because they don’t know which algorithm we will choose.
Data: An array of N elements and a key k
Result: The index of k in N, or −1 if not found.

if Random(0, 1) < 0.5 then
    for i = 0 to N − 1 do
        if N[i] = k then
            return i
        end
    end
else
    for i = N − 1 downto 0 do
        if N[i] = k then
            return i
        end
    end
end
return −1
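A minimal Python rendering of the pseudocode above (a sketch; the function name is ours):

    import random

    def randomized_linear_search(arr, key):
        # Flip a fair coin to choose the scan direction.
        if random.random() < 0.5:
            order = range(len(arr))              # left to right
        else:
            order = range(len(arr) - 1, -1, -1)  # right to left
        for i in order:
            if arr[i] == key:
                return i
        return -1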
COMPARING STRINGS
To decide whether two strings of equal length are equal, we may compare only a sample of the positions (e.g., roughly every 10th position, chosen at random) and report the strings as equal if all sampled positions match.
The expected run time is better than direct comparison, but now there is a possibility of error: reporting that two strings are equal when they are not.
Error rate analysis will depend on the assumptions we make about the text.
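A minimal sketch of such a comparison, assuming we sample a random 10% of the positions (the function name and sampling rate are illustrative, not from the slides):

    import random

    def probably_equal(s, t, fraction=0.1):
        # Assumes len(s) == len(t) > 0; compares a random sample of positions.
        k = max(1, int(len(s) * fraction))
        for i in random.sample(range(len(s)), k):
            if s[i] != t[i]:
                return False  # a mismatch proves the strings differ
        return True           # probably equal; may be wrong (Monte Carlo)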
Las Vegas algorithms are randomized algorithms that always run correctly.
Monte Carlo algorithms, on the other hand, can make errors. Useful Monte Carlo algorithms make errors infrequently.
RANDOMIZED QUICKSORT
Quicksort is a classic example of the divide-and-conquer approach.
Given an array of N elements, select a pivot element p and split the array into two subarrays:
Elements smaller than p
Elements greater than p
Repeat the process recursively in each of the subarrays
Recursion terminates when the size of the subarray is 0 or 1
The sorted list can be assembled as the recursion returns
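A minimal Python sketch of the scheme above, with a randomly chosen pivot (an explicit "equal" bucket handles duplicate values, which the two-way split above glosses over):

    import random

    def randomized_quicksort(arr):
        # Recursion terminates on subarrays of size 0 or 1.
        if len(arr) <= 1:
            return arr
        pivot = random.choice(arr)
        smaller = [x for x in arr if x < pivot]
        equal = [x for x in arr if x == pivot]
        greater = [x for x in arr if x > pivot]
        # Assemble the sorted list as the recursion returns.
        return randomized_quicksort(smaller) + equal + randomized_quicksort(greater)

For example, randomized_quicksort([5, 6, 0, 1, 3, 4, 7, 2]) returns [0, 1, 2, 3, 4, 5, 6, 7].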
QUICKSORT PERFORMANCE
If the pivot splits the elements into proportions c and 1 − c at each step, then the following recursion holds for Quicksort complexity:

T(N) = T(cN) + T((1 − c)N) + O(N)

where the T(cN) and T((1 − c)N) terms sort the two subarrays and the O(N) term splits the array into them.
If we think of the recursion process as a binary tree (each level of recursion adds a new level to the tree), then the total work performed at each level is O(N)
If the tree is reasonably balanced (e.g., c between 0.25 and 0.75), the larger subarray shrinks by a factor of at least 3/4 per level, so the depth is O(log N), giving a total run time of O(N log N)
[Figure: recursion tree for sorting 5 6 0 1 3 4 7 2, choosing the midpoint element as the pivot at each step. Not a very balanced tree.]
[Figure: recursion tree for the same array, 5 6 0 1 3 4 7 2, choosing the best (median) element as the pivot at each step. A better-balanced tree.]
QUICKSORT PERFORMANCE
Poor choice of pivots can lead to quadratic run-time.
Assuming a uniform distribution of values in the array, a pivot picked at random will, 50% of the time, split the array no worse than 25% to 75%.
Hence, on average, good splits are obtained frequently, and the algorithm has an expected run time of O(N log N).
MOTIF FINDING
Given a collection of T strings, find the best pattern of length L that appears in all sequences. “Best” could mean:
Exactly contained
Is close to a specified sequence profile
The idea is to recast the problem in terms of a randomized profile alignment
For example, the first 16-20 nucleotides in HIV-1 protease are highly conserved, and can be thought of as a motif:
[Figure: per-position frequencies of A, C, G, and T (y-axis from 0 to 1) over the first 10 positions of the motif.]
SEQUENCE PROFILE
We can build a probability matrix of finding a given nucleotide in the i-th position of a motif, to allow some mismatches.
E.g., CCCATTAGTC is the consensus decamer at the beginning of HIV-1 protease.
Some positions allow more variability than others: e.g., position 2 vs. position 3.
Pos    A        C        G        T
1      0        0.9998   0.0002   0
2      0        1        0        0
3      0.0172   0.9017   0        0.0811
4      0.9995   0.0005   0        0
5      0.0001   0.0001   0.0001   0.9996
6      0.0245   0.0267   0        0.9487
7      0.9995   0.0005   0        0
8      0        0        1        0
9      0.0005   0.0311   0        0.9684
10     0.0011   0.9953   0        0.0036
SEQUENCE PROFILE
The profile defines a probability distribution that can be used to score other motifs.
To compute the score of a motif, we evaluate the probability that it was generated by the profile.
E.g., using the profile above:
Pr(CCCATTAGTC) = 0.82
Pr(CCTATTAGTC) = 0.07
Pr(AAAAAAAAAA) = 0.00
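A minimal sketch of this scoring, assuming the profile is stored as a dict mapping each nucleotide to its list of per-position probabilities (e.g., profile['C'][0] = 0.9998 for the table above):

    def profile_probability(motif, profile):
        # Multiply the per-position probabilities of the observed letters.
        p = 1.0
        for i, nucleotide in enumerate(motif):
            p *= profile[nucleotide][i]
        return p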
P-MOST PROBABLE L-MER
Given a sequence profile P over L positions and a string of length N, we define the most probable L-mer of the string as the one that has the highest probability as measured by the profile P.
Scan the string left to right, considering all L-mers
Compute the probability of generating a given L-mer using the profile
Select the one with the highest score
Can be computed in O(LN) time
Note: in practice, zero probabilities for some letters in a given position are replaced with small numbers (pseudocounts).
For example, if none of 1,000 training strings had a ‘C’ in the third position, instead of assigning Pr(C, 3) = 0 we may set Pr(C, 3) = 1/1001.
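A sketch of the scan, reusing the scoring idea above (the function name is ours):

    def most_probable_lmer(text, L, profile):
        # Score every L-mer against the profile; O(L*N) overall.
        def score(lmer):
            p = 1.0
            for i, nucleotide in enumerate(lmer):
                p *= profile[nucleotide][i]
            return p
        starts = range(len(text) - L + 1)
        best = max(starts, key=lambda i: score(text[i:i + L]))
        return text[best:best + L]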
P-MOST PROBABLE L-MERS IN MANY SEQUENCES
Find the P-most probable L-mer in each of the sequences:
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtataccttacatc
tgcattcaatagctta
tatcctttccactcac
ctccaaatcctttaca
ggtcatcctttatcct
P =

A  1/2  7/8  3/8  0    1/8  0
C  1/8  0    1/2  5/8  3/8  0
T  1/8  1/8  0    0    1/4  7/8
G  1/4  0    1/8  3/8  1/4  1/8
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtgaaccttacatc
tgcattcaatagctta
tgtcctgtccactcac
ctccaaatcctttaca
ggtctacctttatcct
The P-most probable 6-mers, one per sequence:

1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t

Initial profile P:

A  1/2  7/8  3/8  0    1/8  0
C  1/8  0    1/2  5/8  3/8  0
T  1/8  1/8  0    0    1/4  7/8
G  1/4  0    1/8  3/8  1/4  1/8

Generate an updated profile using the new set of P-most probable L-mers (in the original slide, red marked frequencies that increased and blue those that decreased):

A  5/8  5/8  4/8  0    0    0
C  0    0    4/8  6/8  4/8  0
T  1/8  3/8  0    0    3/8  6/8
G  2/8  0    0    2/8  1/8  2/8
GREEDY PROFILE MOTIF SEARCH
Use P-most probable L-mers to adjust starting positions until we reach a “best” profile; this is the motif:
Select random starting positions.
Create a profile P from the substrings at these starting positions.
Find the P-most probable L-mer a in each sequence and change its starting position to the starting position of a.
Compute a new profile from the new starting positions after each iteration, and proceed until the score can no longer be increased.
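A minimal sketch of this loop (an iteration cap stands in for the score-based stopping rule, and pseudocounts are omitted for brevity; both are our simplifications):

    import random

    def build_profile(lmers, alphabet="ACGT"):
        # Column-wise letter frequencies over the chosen L-mers.
        L, n = len(lmers[0]), len(lmers)
        return {a: [sum(m[i] == a for m in lmers) / n for i in range(L)]
                for a in alphabet}

    def greedy_profile_search(seqs, L, max_iter=1000):
        # Step 1: random starting positions.
        starts = [random.randrange(len(s) - L + 1) for s in seqs]
        for _ in range(max_iter):
            # Step 2: profile from the current substrings.
            profile = build_profile([s[p:p + L] for s, p in zip(seqs, starts)])
            def score(lmer):
                prob = 1.0
                for i, c in enumerate(lmer):
                    prob *= profile[c][i]
                return prob
            # Step 3: move each start to its P-most probable L-mer.
            new = [max(range(len(s) - L + 1), key=lambda i: score(s[i:i + L]))
                   for s in seqs]
            if new == starts:  # converged: no starting position moved
                break
            starts = new
        return starts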
PERFORMANCE?
Since we choose starting positions randomly, there is little chance that our initial guess will be close to an optimal motif, so finding the optimal motif can take a very long time.
Indeed, a single run from random starting positions is unlikely to lead to the correct solution at all.
In practice, this algorithm is run many times, in the hope that some set of random starting positions will be close to the optimal solution simply by chance.
GIBBS SAMPLING
We can improve the algorithm by introducing Gibbs Sampling, an iterative procedure that discards one L-mer after each iteration and replaces it with a new one.
Gibbs sampling proceeds more slowly and chooses new L-mers at random, increasing the odds that it will converge to the correct solution.
Gibbs sampling is a general class of sampling procedures used for approximating complex, difficult-to-compute distributions: in our case, the distribution of profiles P.
HOW GIBBS SAMPLING WORKS
1. Randomly choose starting positions s = (s1, ..., sT) and form the set of L-mers associated with these starting positions.
2. Randomly choose one of the T sequences.
3. Create a profile P from the other T − 1 sequences.
4. For each position in the excluded sequence, calculate the probability that the L-mer starting at that position was generated by P.
5. Choose a new starting position for the excluded sequence at random, based on the probabilities from step 4.
6. Repeat steps 2-5 until there is no improvement.
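A minimal sketch of steps 1-6, with a fixed iteration budget in place of the convergence test and +1 pseudocounts so no L-mer scores exactly zero (both are our assumptions):

    import random

    def gibbs_sampler(seqs, L, n_iter=1000):
        # Step 1: random starting positions, one per sequence.
        starts = [random.randrange(len(s) - L + 1) for s in seqs]
        for _ in range(n_iter):
            x = random.randrange(len(seqs))  # step 2: sequence to exclude
            others = [s[p:p + L]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != x]
            # Step 3: profile from the other T - 1 L-mers, with +1 pseudocounts.
            profile = {a: [(sum(m[i] == a for m in others) + 1) /
                           (len(others) + 4) for i in range(L)]
                       for a in "ACGT"}
            def score(lmer):
                p = 1.0
                for i, c in enumerate(lmer):
                    p *= profile[c][i]
                return p
            # Step 4: probability of the L-mer at every position of the
            # excluded sequence.
            positions = range(len(seqs[x]) - L + 1)
            weights = [score(seqs[x][i:i + L]) for i in positions]
            # Step 5: draw the new start in proportion to those probabilities
            # (random.choices normalizes the weights for us).
            starts[x] = random.choices(list(positions), weights=weights)[0]
        return starts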
GIBBS SAMPLING: AN EXAMPLE
Input:
T = 5 sequences, motif length L = 8
1. GTAAACAATATTTATAGC
2. AAAATTTACCTTAGAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA
STEP 1
Randomly choose starting positions.
1. GTAAACAATATTTATAGC (7)
2. AAAATTTACCTTAGAAGG (11)
3. CCGTACTGTCAAGCGTGG (9)
4. TGAGTAAACGACGTCCCA (4)
5. TACTTAACACCCTGTCAA (1)
STEP 2
Exclude one sequence at random; here, sequence 2 is excluded.
1. GTAAACAATATTTATAGC (7)
2. AAAATTTACCTTAGAAGG (11)
3. CCGTACTGTCAAGCGTGG (9)
4. TGAGTAAACGACGTCCCA (4)
5. TACTTAACACCCTGTCAA (1)
STEP 3
Create the octamer profile from sequences 1, 3, 4, 5:

1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C

A  1/4  2/4  2/4  3/4  1/4  1/4  1/4  2/4
C  0    1/4  1/4  0    0    2/4  0    1/4
T  2/4  1/4  1/4  1/4  2/4  1/4  1/4  1/4
G  1/4  0    0    0    1/4  0    3/4  0

Consensus string: T A A A T C G A
STEP 4
Calculate the probability of every octamer of the excluded sequence (2), AAAATTTACCTTAGAAGG:

Start 1:  AAAATTTA  .000732
Start 2:  AAATTTAC  .000122
Start 3:  AATTTACC  0
Start 4:  ATTTACCT  0
Start 5:  TTTACCTT  0
Start 6:  TTACCTTA  0
Start 7:  TACCTTAG  0
Start 8:  ACCTTAGA  .000183
Start 9:  CCTTAGAA  0
Start 10: CTTAGAAG  0
Start 11: TTAGAAGG  0

Normalize (divide by the sum) to obtain a proper probability distribution.
STEP 5
Select the starting position of the octamer in string 2 using the just-computed probabilities (positions with probability 0 can never be selected):
P(selecting starting position 1) = .706
P(selecting starting position 2) = .118
P(selecting starting position 8) = .176
Go back to step 2 and repeat until the profile P no longer changes.
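A quick arithmetic check of the normalization behind these three values:

    probs = [0.000732, 0.000122, 0.000183]      # nonzero octamer scores
    total = sum(probs)                           # 0.001037
    print([round(p / total, 3) for p in probs])  # [0.706, 0.118, 0.176]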
GIBBS SAMPLER IN PRACTICE
Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (a relative entropy approach).
Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs.
It needs to be run with many randomly chosen seeds to achieve good results.