Segment Alignment (SEA)
Yuzhen Ye, Adam Godzik
The Burnham Institute
Outline
• A new look at local structure prediction
• Network matching problem
• Practical issues
• Applications
GSDKKGNGVALMTTLFADN
EEEEEEHHHHHHHH HHHHHH
EEEEEE LLHHHHHHHHLLL
LHHHHHLLLLLLLEEEEEEEEE
LLLLL
Description of local structure: one or many answers?
GSDKKGNGVALMTTLFADN
LLHHHHHHHHLLLEEEEEE A prediction
HHHHHHHHLLLLLHHHHHH Real structure
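As a small illustration of the segment view of local structure, the two strings above can be split into runs of identical states and compared residue by residue. This is only a sketch: the function name `segments` and the (state, start, end) tuple layout are choices made here for illustration, not part of SEA.

```python
from itertools import groupby

def segments(ss):
    """Split a per-residue local structure string into (state, start, end) runs.

    `end` is exclusive, so a run covers residues start .. end-1 (0-based).
    """
    out = []
    pos = 0
    for state, run in groupby(ss):
        length = len(list(run))
        out.append((state, pos, pos + length))
        pos += length
    return out

# Strings taken from the slide (19 residues each).
prediction = "LLHHHHHHHHLLLEEEEEE"
real = "HHHHHHHHLLLLLHHHHHH"

pred_segments = segments(prediction)
real_segments = segments(real)

# Number of residues where the prediction and the real structure agree.
agreement = sum(p == r for p, r in zip(prediction, real))
```

Here the prediction gives four segments and the real structure three, and the two strings agree at 9 of 19 positions, although the central helix is largely recovered.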
Motivation
• A natural description of local structures: keep the segment information of local structures
• Keep uncertainties in local structure predictions: drawbacks of prediction programs and intrinsic uncertainties of local structures in the absence of global interactions
Incorporating protein local structure into protein sequence comparison may help detect distant homologies and improve their alignments (for homology modeling)
Each protein is described as a network of PLSSs (predicted local structure segments)
The protein comparison problem is then equivalent to a network matching problem
Given two networks of PLSSs, find two optimal paths from the source to the sink in each of the networks, whose corresponding PLSSs are most similar to each other.
The alignment does not follow the typical position-by-position alignment mode
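A network of PLSSs can be pictured as a directed acyclic graph whose source-to-sink paths are the alternative segment descriptions of one protein. The sketch below is a minimal illustration under stated assumptions: the `PLSS` tuple, the segment coordinates, and the edge rule (segment b may follow segment a when b starts at or after the end of a) are all hypothetical and are not taken from the SEA implementation.

```python
from collections import namedtuple

# Hypothetical segment record: residues start .. end-1, with a local structure state.
PLSS = namedtuple("PLSS", "start end state")

def build_network(segs):
    """Return successor lists for a DAG of segments with a source and a sink.

    Assumed edge rule: a -> b when b starts at or after the end of a.
    """
    n = len(segs)
    succ = {i: [j for j in range(n) if segs[j].start >= segs[i].end]
            for i in range(n)}
    has_pred = {j for js in succ.values() for j in js}
    # The source feeds every segment that has no predecessor.
    succ["source"] = [i for i in range(n) if i not in has_pred]
    # Every segment without a successor feeds the sink.
    for i in range(n):
        if not succ[i]:
            succ[i] = ["sink"]
    succ["sink"] = []
    return succ

def paths(succ, node="source"):
    """Yield every source-to-sink path as a list of segment indices."""
    if node == "sink":
        yield []
        return
    for nxt in succ[node]:
        for rest in paths(succ, nxt):
            yield ([nxt] if nxt != "sink" else []) + rest

# Overlapping and contradictory predictions for one hypothetical 19-residue protein.
segs = [PLSS(0, 6, "E"), PLSS(0, 5, "H"), PLSS(6, 14, "H"),
        PLSS(8, 16, "H"), PLSS(15, 19, "E")]
network = build_network(segs)
all_paths = list(paths(network))
```

With these five segments the network has six source-to-sink paths, for example `[0, 2, 4]` (strand, helix, strand). Enumerating paths is only practical for toy inputs, which is why the matching problem is solved by dynamic programming rather than by listing paths.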
Solving the network matching problem: dynamic programming
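$$V(i, j) = \max_{\text{all combinations } (i', j'),\; i' \in E(i),\; j' \in E(j)} V(i', j')$$
[Figure: dynamic programming matrix with cell V(i, j) and cells V(i1, j1), V(i1, j2), V(i3, j1), V(i3, j2); example network with segments (i-1)1, (i-1)3, (i-1)4, i1, i2]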
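The recurrence on this slide can be sketched as a dynamic program over pairs of segments, one from each network. This is a simplified sketch, not the SEA implementation. It assumes that E(i) denotes the predecessors of segment i, that a segment similarity score is added at each cell (the score term is not visible in the slide text), that a pairing may start at any cell with value zero, and that gaps are ignored. The names `match_networks`, `preds_a`, `preds_b` and `score` are hypothetical.

```python
def match_networks(preds_a, preds_b, score):
    """Match two segment networks given as predecessor lists.

    preds_x[k] lists the predecessors of segment k; segments must be
    indexed in topological order so predecessors are filled in first.
    Returns the best value and the list of matched (i, j) segment pairs.
    """
    na, nb = len(preds_a), len(preds_b)
    V = [[0.0] * nb for _ in range(na)]
    back = {}
    for i in range(na):
        for j in range(nb):
            best, arg = 0.0, None  # 0.0: start a new pairing here (assumption)
            # Maximize over all combinations of predecessors of i and j.
            for ip in preds_a[i]:
                for jp in preds_b[j]:
                    if V[ip][jp] > best:
                        best, arg = V[ip][jp], (ip, jp)
            V[i][j] = score(i, j) + best
            back[(i, j)] = arg
    # Pick the best cell, then trace back through the stored predecessors.
    end = max(((i, j) for i in range(na) for j in range(nb)),
              key=lambda c: V[c[0]][c[1]])
    pairs = []
    cell = end
    while cell is not None:
        pairs.append(cell)
        cell = back[cell]
    pairs.reverse()
    return V[end[0]][end[1]], pairs

# Toy input: protein A has two contradictory first segments (H or E) followed by H;
# protein B is a single path E then H.
states_a = ["H", "E", "H"]
preds_a = [[], [], [0, 1]]
states_b = ["E", "H"]
preds_b = [[], [0]]

def same_state(i, j):
    """Toy similarity: +1 for identical states, -1 otherwise."""
    return 1 if states_a[i] == states_b[j] else -1

best_value, matched = match_networks(preds_a, preds_b, same_state)
```

In this toy case the best value is 2 with matched pairs `[(1, 0), (2, 1)]`: the matching selects the E alternative for the first segment of protein A, which shows how the alignment and the choice of path are made at the same time.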
Example: (1e68A,1nkl)
Each protein is represented as a collection of potentially overlapping and contradictory PLSSs (a network).
SEA finds an optimal alignment between these two proteins
Simultaneously, SEA identifies the optimal subset of PLSSs (a path in the network) describing each protein.
1e68A: bacteriocin AS-48
1nkl: NK-lysin
Measure        CE   SEA_true  SEA_c30  SEA_c10  SEA_c5  SEA_1d  BLAST  ALIGN  FFAS
Family (409 pairs)
average-shift  -    0.61      0.56     0.56     0.54    0.49    0.44   0.48   0.49
shift>0.9      -    73        69       63       56      47      51     60     43
shift>0.7      -    207       199      192      183     152     146    165    161
shift>0.5      -    282       260      259      251     215     197    228    227
RMSD3.0        257  95        82       82       76      63      77     54     40
RMSD5.0        397  237       184      171      177     147     157    138    118
RMSD8.0        408  294       248      249      249     231     196    206    194
all            409  345       404      398      368     366     232    372    409
Superfamily (225 pairs)
average-shift  -    0.27      0.12     0.12     0.12    0.08    0.09   0.06   0.07
shift>0.9      -    3         3        3        2       0       1      2      1
shift>0.7      -    17        8        9        7       4       10     9      7
shift>0.5      -    54        26       23       21      17      18     18     17
RMSD3.0        55   12        6        6        7       6       8      3      1
RMSD5.0        160  44        16       18       18      11      18     11     1
RMSD8.0        163  69        37       34       41      28      23     22     15
all            166  128       217      204      181     177     41     149    225
General performance of SEA incorporating different local structure diversities
Keeping local structure diversity helps improve alignment quality
Alignment between λ-repressor from E. coli (1lliA) and 434 repressor (1r69)
Stable region
Variable region
Local structure information is crucial for improving alignments, especially in the more divergent regions
1esfA: staphylococcal enterotoxin
2tssA: toxic shock syndrome toxin-1
Practical issue: local structure prediction
• Searching I-site database (web-server or standalone program)
• Our solution: FragLib
– using the sensitive profile-profile alignment program FFAS to predict local structures
Applications
• Distant homology detection
• Local structure prediction
• Improving alignments for protein modeling
Reference: A segment alignment approach to protein comparison (Bioinformatics, April issue)
Web server: http://ffas.ljcrf.edu/sea
Related work
• Spliced sequence alignment
– Gelfand et al., 1996, PNAS; Novichkov et al., 2001
– Assembling genes from alternative exons
• Jumping alignment
– Spang R, Rehmsmeier M, Stoye J. JCB, 2002
– Computes a local alignment of a single sequence and a multiple alignment
– The sequence is at each position aligned to one sequence of the multiple alignment (reference sequence) instead of a profile
• Partial order alignment
– Lee C, Grasso C, Sharlow MF. Bioinformatics, 2002
– Multiple alignment
Acknowledgements
• Dariusz Plewczyński
• Iddo Friedberg
• Łukasz Jaroszewski
• Weizhong Li
• This project is supported by SPAM grant GM63208