Using residue coevolution to retrieve protein homologs (work in progress)
ComPotts
Hugo Talibart, François Coste
Co-evolutionary methods for the prediction and design of protein structure and interactions
CECAM-HQ-EPFL, June 18, 2019
The Dyliss bioinformatics team, http://www.irisa.fr/dyliss
H. Talibart, F. Coste ComPotts June 18, 2019 1 / 35
Bioinformatics at IRISA / Inria Rennes
Symbiose (bioinformatics Irisa/Inria Rennes):
Dyliss research team
Genscale research team
Genouest bioinformatics platform
Seminars: http://symbiose.irisa.fr/symbioseSeminars
Biogenouest: western France life science and environment network
Marine biology, agriculture/food-processing, human health, and bioinformatics.
Motivation
Sequence annotation problem
High-throughput production of raw sequences
Problem
Function(s) of these sequences?
Protein function?
In-vivo / in-vitro experiments
Especially on model organisms:
Gene knockout and other mutations → key sequence(s) for a function
Structure determination → key positions for a function
...
Does not scale well...
To face the (ever-increasing) amount of available sequences, automatic methods are needed: in-silico functional or structural predictions.
Classical approach to predict the function of a new gene sequence
Search for annotated homologs...
Retrieve the homologs of a protein gene
Search for a significant match with:
an (already annotated) protein sequence, e.g. with BLAST [2]
>1shg:A
AKELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
the profile HMM of a protein family, e.g. with HMMER [3] or HH-suite [4]
[2] S. F. Altschul et al. “Basic local alignment search tool”. Journal of molecular biology (1990).
[3] S. R. Eddy. “Profile hidden Markov models”. Bioinformatics 14.9 (1998), pp. 755–763.
[4] M. Steinegger et al. “HH-suite3 for fast remote homology detection and deep protein annotation”. bioRxiv (2019), p. 560029.
Yet...
Retrieve the homologs of a protein gene
Search for a significant match with:
an (already annotated) protein sequence, e.g. with BLAST [2]
the profile HMM of a protein family, e.g. with HMMER [3] or HH-suite [4]
Score each position independently :-(
Our research
Grammatical inference on biological sequences
Automatic characterization of protein sequence families with:
Automata (Protomata-Learner [5, 6])
[Figure: example automaton modelling a protein sequence family (Protomata-Learner)]
Local dependencies :-)
[5] G. Kerbellec. “Apprentissage d’automates modélisant des familles de séquences protéiques”. PhD thesis. Université de Rennes 1, Apr. 2008, p. 139.
[6] A. Bretaudeau et al. “CyanoLyase: a database of phycobilin lyase sequences, motifs and functions”. Nucleic Acids Research 41.Database-Issue (2013), pp. 396–401.
Our research
Grammatical inference on biological sequences
Automatic characterization of protein sequence families with:
Context-free grammars (ReGLiS [7], see also [8])
[Figure: example context-free grammar modelling a protein sequence family (ReGLiS)]
Nested dependencies :-D
[7] F. Coste, G. Garet, and J. Nicolas. “A bottom-up efficient algorithm learning substitutable languages from positive examples”. ICGI. 2014.
[8] W. Dyrka et al. “Estimating probabilistic context-free grammars for proteins using contact map constraints”. PeerJ (2019).
Proteins are 3D objects
Many crossing interactions between amino acids distant in the sequence but close in the structure
The Chomsky Hierarchy
Mutual information in HIV-1 gp120 homologs
Many (crossing) correlations between MSA columns
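As an illustration of the quantity named in this slide's title, here is a minimal sketch of mutual information between two MSA columns. The function names and the toy alignment are hypothetical, and no pseudocounts or sequence reweighting are applied, which real analyses usually add.

```python
import math
from collections import Counter


def column(msa, i):
    """Return column i of an MSA given as a list of equal-length strings."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of the MSA."""
    n = len(msa)
    ci, cj = column(msa, i), column(msa, j)
    fi, fj = Counter(ci), Counter(cj)   # single-column counts
    fij = Counter(zip(ci, cj))          # pair counts
    mi = 0.0
    for (a, b), count in fij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi


# Toy alignment (made up): columns 0 and 1 covary, columns 0 and 2 do not.
toy_msa = ["AKL", "AKV", "CRL", "CRV"]
mi_01 = mutual_information(toy_msa, 0, 1)
mi_02 = mutual_information(toy_msa, 0, 2)
print(mi_01, mi_02)
```

Mutual information alone cannot tell whether two columns covary directly or through a third one, which is the point the next slides address with DCA.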
Direct Coupling Analysis (DCA) to the rescue
Hugo Talibart’s PhD
A recent breakthrough for the prediction of 3D structures by prediction of contacts
Principle: disentangle direct from indirect effects
[Figure: mutual information on HIV-1 gp120 protein]
Idea: use DCA for automatic characterization of protein families:
Identify important (crossing) dependencies with DCA
Build accordingly a syntactic model that can be used in practice...
Choice of DCA method
CCMpred [9]
Best one-model precision for contact prediction [10]
“Structuring” couplings
Figure: Top 25 PSICOV predictions
Figure: Top 25 CCMpred predictions
[9] S. Seemayer, M. Gruber, and J. Söding. “CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations”. Bioinformatics 30.21 (2014), pp. 3128–3130.
[10] S. H. P. de Oliveira, J. Shi, and C. M. Deane. “Comparing co-evolution methods and their application to template-free protein structure prediction”. Bioinformatics 33.3 (2017), pp. 373–381.
DCA workflow
1. Protein sequence query q
1CC8:A|PDBID|CHAIN|SEQUENCE MAEIKHYQFNVVMTCSGCSGAVNKVLTKLEPDVSKIDISLEKQLVDVYT
DCA workflow
2. Retrieve close homologs and build an MSA (e.g. with HHblits [11])
1CC8:A|PDBID|CHAIN|SEQUENCE     MAEIKHYQFNVVMTCSGCSGAVNKVLTKLEPDVSKIDISLEKQLVDVYT
sp|Q54PZ2|ATOX1_DICDI           ....MTYSFFVDMTCGGCSKAVNAILSKIDGVS.NIQIDLENKKVCESS
tr|A0A0C7MWI5|A0A0C7MWI5_9SACH  .STAQHYHFDVVMTCAGCSNAINRVLTRLEPDVSNIEISLEKQTVDVVS
tr|A7TF58|A7TF58_VANPO          .SNDNHYQFEVVMTCSGCSNAVNKALTRLEPDVSNIDISLENQTVDVHS
tr|G0WD69|G0WD69_NAUDC          .MAENHYQFNVVMTCSGCSNAINRVLTKLEPEVSKIDISLEDQTVDVTT
tr|G8ZQK6|G8ZQK6_TORDC          .SQQNHYQFNVVMSCSGCSNAINKVLSRLEPDVSKIETSLDSQTVDVYT
tr|S6E8D5|S6E8D5_ZYGB2          .MSQNHYHFEVVMSCEGCSNAINRVLTKLKPDVSEIRISLENQTVDVYT
tr|J7R785|J7R785_KAZNA          ..MSNHYQFDVVMTCSACSNAISKVLTRMEPEVTKFDVSLEKQTVDVQT
tr|W1QBQ2|W1QBQ2_OGAPD          .MSAKHYKFDVTMACSGCSNAVNRVLTRL.PGVKNVEISLEKQTVDVIS
tr|H2AUI5|H2AUI5_KAZAF          ..MIYCYHFNVVMTCSGCSDAIHRSLSKLGPEVTDIDISLENQYVEVFT
tr|G8JMM3|G8JMM3_ERECY          .MDTKHYQFQVALACSGCVAAVEKALAKLQPDISKFDISLEKQIVDVYT
tr|S9Q3L9|S9Q3L9_SCHOY          ....MKYSFNVVMTCDGCKNAIDRVLNRL..GVDEKEISLEAQEVHVTT
tr|Q01AV4|Q01AV4_OSTTA          ..MSTTVTLRCDFACDGCANAVKRILSKDDA....VRTSVEDKLVVVV.
tr|E5R4F7|E5R4F7_LEPMJ          ..MTHTYKFNVTMTCGGCSGAVERVLRKLE.GVESFNVNLETQTAEVVA
tr|R7Z484|R7Z484_CONA1          .MSEHNYKFNVAMSCGGCSGAVERVLKKLD.GVKSFNVSLDTQTAEIVA
tr|M3CXY4|M3CXY4_SPHMS          .MAEHKYKFNVSMSCGGCSGAIERVLKKLD.GVKEFNVSLETQTAEITT
tr|W9XE16|W9XE16_9EURO          .MSEHHYKFNVTMTCGGCSGAVERVLKKLD.GVKNYTVSLDTQTADVTT
tr|Q5BDJ0|Q5BDJ0_EMENI          .DQEHHYKFNVSMSCGGCSGAVERVLKKLD.GVKSFDVNLDSQTASVVT
tr|W3WZP2|W3WZP2_9PEZI          .ADNHTYKFNVSMSCGGCSGAVDRVLKKLD.GIESYDVSLEKQEATVIA
tr|A0A0D2B224|A0A0D2B224_9PEZI  ..MSHTYKFNVAMSCGGCSGAIDRVLKKLE.GVDKYEVSLEKQTAEVHT
tr|A0A093XHT8|A0A093XHT8_PENMA  .MAEHQYKFNVSMSCGGCSGAVERVLKKLDVGVKSYDVSLESQTATVVA
tr|A0A074WQB6|A0A074WQB6_9PEZI  .MSDHTYNFNITMTCGGCSGAVERVLKKLD.GVKSFDVSLDSQTAFVIT
[11] M. Remmert et al. “HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment”. Nature methods 9.2 (2012), p. 173.
DCA workflow
3. Infer a Potts model from MSA
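$$P(a \mid w, v) = \frac{1}{Z} \exp\left( \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} w_{ij}(a_i, a_j) + \sum_{i=1}^{L} v_i(a_i) \right)$$
$P(a \mid w, v)$: probability of sequence $a = a_1, \dots, a_L$
$Z$: normalization constant
$w_{ij}$: couplings
$v_i$: fields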
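To make the formula concrete, here is a minimal sketch that evaluates it exactly on a toy model small enough to enumerate every sequence. The array layout (`v` of shape (L, q), `w` of shape (L, L, q, q) with only the entries i < j used) and the function names are assumptions made for this illustration, not CCMpred's data format.

```python
import itertools

import numpy as np


def potts_energy(a, v, w):
    """Exponent of the Potts model: sum_i v_i(a_i) + sum_{i<j} w_ij(a_i, a_j)."""
    L = len(a)
    h = sum(v[i, a[i]] for i in range(L))
    h += sum(w[i, j, a[i], a[j]] for i in range(L - 1) for j in range(i + 1, L))
    return float(h)


def potts_probabilities(v, w):
    """Exact P(a | w, v) for every sequence, by brute force (tiny L and q only)."""
    L, q = v.shape
    seqs = list(itertools.product(range(q), repeat=L))
    weights = np.array([np.exp(potts_energy(a, v, w)) for a in seqs])
    Z = weights.sum()  # normalization constant
    return dict(zip(seqs, weights / Z))


# Toy model with random parameters (hypothetical values): L = 3 positions, q = 2 letters.
rng = np.random.default_rng(0)
L, q = 3, 2
v = rng.normal(size=(L, q))
w = rng.normal(size=(L, L, q, q))
probs = potts_probabilities(v, w)
print(len(probs), sum(probs.values()))
```

For real proteins (q around 21, L in the tens or hundreds) Z has q^L terms and cannot be enumerated this way, which is why approximate inference is needed.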
Inference of Potts model from MSA (CCMpred)
Maximise the pseudo-likelihood of the N aligned sequences, i.e.:
$$(w, v) = \arg\max_{w, v} \sum_{n=1}^{N} \sum_{i=1}^{L} \log P(A_i = a_i^n \mid a_1^n, \dots, a_{i-1}^n, a_{i+1}^n, \dots, a_L^n, v, w)$$
(more tractable and still good precision)
while respecting empirical frequencies:
$$P_i(a) = f_i(a)$$
$$P_{ij}(a, b) = f_{ij}(a, b)$$
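The objective above can be evaluated with a few lines of NumPy. This is a minimal sketch of the pseudo-log-likelihood only, not CCMpred's implementation: it assumes `w` is stored as a full (L, L, q, q) array with `w[j, i] == w[i, j].T` and `w[i, i] == 0`, and it leaves out the regularisation and sequence weighting that practical tools typically add.

```python
import numpy as np


def pseudo_log_likelihood(msa, v, w):
    """Sum over sequences n and positions i of log P(A_i = a_i^n | rest of a^n, v, w).

    msa: (N, L) integer array of encoded sequences
    v:   (L, q) fields
    w:   (L, L, q, q) couplings, assumed symmetric: w[j, i] == w[i, j].T, w[i, i] == 0
    """
    N, L = msa.shape
    total = 0.0
    for a in msa:
        for i in range(L):
            # logits[c] = v_i(c) + sum_{j != i} w_ij(c, a_j)
            logits = v[i] + sum(w[i, j][:, a[j]] for j in range(L) if j != i)
            # log of the local normalization constant (log-sum-exp, shifted for stability)
            log_norm = np.log(np.sum(np.exp(logits - logits.max()))) + logits.max()
            total += logits[a[i]] - log_norm
    return float(total)


# Sanity check on a made-up alignment: with all parameters at zero, every
# conditional is uniform, so the value is -N * L * log(q).
toy_msa = np.array([[0, 1, 2], [1, 1, 0]])
q = 3
v0 = np.zeros((3, q))
w0 = np.zeros((3, 3, q, q))
pll = pseudo_log_likelihood(toy_msa, v0, w0)
print(pll)
```

Each conditional only needs a sum over the q letters at one position, which is what makes the objective tractable compared with the full normalization constant Z.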
Using Potts model for contact prediction
4. Contacts in q are predicted using the Frobenius norm of the couplings
$$\lVert w_{ij} \rVert = \sqrt{\sum_{a} \sum_{b} w_{ij}(a, b)^2}$$
A larger norm is interpreted as a likelier contact between positions
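A minimal sketch of this step, assuming the couplings are stored as an (L, L, q, q) NumPy array (a layout chosen for the example). The helper names and the toy values are hypothetical, and any post-processing a real contact predictor applies to the norms is not shown.

```python
import numpy as np


def frobenius_contact_scores(w):
    """Return the (L, L) matrix of Frobenius norms ||w_ij|| from couplings of shape (L, L, q, q)."""
    return np.sqrt((w ** 2).sum(axis=(2, 3)))


def top_contacts(scores, k, min_separation=1):
    """Return the k position pairs (i, j), i < j, with the largest scores."""
    L = scores.shape[0]
    pairs = [(scores[i, j], i, j)
             for i in range(L) for j in range(i + min_separation, L)]
    pairs.sort(reverse=True)
    return [(i, j) for _, i, j in pairs[:k]]


# Toy couplings (made up): strong coupling between positions 0 and 3, weaker between 1 and 2.
w = np.zeros((4, 4, 2, 2))
w[0, 3] = [[3.0, 0.0], [0.0, 4.0]]   # Frobenius norm 5
w[1, 2] = np.ones((2, 2))            # Frobenius norm 2
scores = frobenius_contact_scores(w)
predicted = top_contacts(scores, k=2)
print(predicted)
```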
Using Potts model for homology search
Use the whole Potts model $P_q$ of q instead of Frobenius norms
Use $P_q$ to score each possibly homologous sequence s
Requires computing the best alignment of s in $P_q$
As HHalign does for pairs of HMMs [12], directly align pairs of Potts models
A new tool: ComPotts
($P_q$, $P_s$) → ComPotts → alignment
with two options to get the Potts model $P_s$ of s:
One-hot encoding: $v_i(a_i) = 1$, $w_{ij}(a_i, a_j) = 1$, others are 0
s → one-hot encoding → $P_s$
From close homologs of s, as for $P_q$
s → HHblits → MSA → trimAl → trimmed MSA → CCMpredPy → $P_s$
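The first option (one-hot encoding) is simple enough to sketch directly. The alphabet ordering, array layout and function name below are assumptions made for the illustration; this is not ComPotts code.

```python
import numpy as np

# Assumed alphabet: the 20 standard amino acids, in an arbitrary fixed order.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_potts(seq, alphabet=ALPHABET):
    """Build the one-hot Potts model of a single sequence.

    v[i, a] = 1 iff a is the letter at position i;
    w[i, j, a, b] = 1 iff (a, b) are the letters at positions i < j; all else 0.
    """
    L, q = len(seq), len(alphabet)
    idx = [alphabet.index(c) for c in seq]
    v = np.zeros((L, q))
    w = np.zeros((L, L, q, q))
    for i in range(L):
        v[i, idx[i]] = 1.0
        for j in range(i + 1, L):
            w[i, j, idx[i], idx[j]] = 1.0
    return v, w


v, w = one_hot_potts("ACDA")
print(v.shape, w.shape, v.sum(), w.sum())
```

The second option reuses the same inference pipeline as for $P_q$, so it is not sketched here.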
[12] J. Söding. “Protein homology detection by HMM–HMM comparison”. Bioinformatics 21.7 (2004), pp. 951–960.
ComPotts (Comparing Potts models)
Formulation of Potts model alignment as an Integer Linear Programming (ILP) problem
Based on Inken Wohlers’ solver [13]
[13] I. Wohlers. “Exact Algorithms For Pairwise Protein Structure Alignment”. PhD thesis. Vrije Universiteit, Jan. 2012, pp. 1–147.
Scoring alignment of Potts models A and B
$$s(A, B) = \sum_{i=1}^{L_A} \sum_{k=1}^{L_B} s_v(v_i^A, v_k^B)\, x_{ik} + \sum_{i=1}^{L_A-1} \sum_{j=i+1}^{L_A} \sum_{k=1}^{L_B-1} \sum_{l=k+1}^{L_B} s_w(w_{ij}^A, w_{kl}^B)\, y_{ikjl}$$
where
$x_{ik} = 1$ iff position i of A and position k of B are aligned (otherwise, $x_{ik} = 0$)
$y_{ikjl} = 1$ iff $x_{ik} = 1$ and $x_{jl} = 1$ (otherwise, $y_{ikjl} = 0$)
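The score can be evaluated for a given alignment as sketched below. The similarity functions are passed as parameters; the demo plugs in a dot product and an element-wise product sum as one possible choice. The exhaustive search over order-preserving alignments is only a stand-in for tiny toy models: it is not the ILP solver and would not scale. All function names and the toy model are hypothetical.

```python
import itertools

import numpy as np


def frobenius(X, Y):
    """Sum of element-wise products of two matrices."""
    return float(np.sum(X * Y))


def alignment_score(vA, wA, vB, wB, aligned, s_v=np.dot, s_w=frobenius):
    """s(A, B) for a list of aligned pairs (i, k), i.e. the entries with x_ik = 1."""
    score = sum(s_v(vA[i], vB[k]) for i, k in aligned)
    # y_ikjl = 1 for every two aligned pairs; the sum keeps those with i < j and k < l.
    for (i, k), (j, l) in itertools.permutations(aligned, 2):
        if i < j and k < l:
            score += s_w(wA[i, j], wB[k, l])
    return float(score)


def best_alignment(vA, wA, vB, wB):
    """Brute-force search over order-preserving alignments (toy sizes only)."""
    LA, LB = len(vA), len(vB)
    best = (0.0, [])
    for m in range(1, min(LA, LB) + 1):
        for rows in itertools.combinations(range(LA), m):
            for cols in itertools.combinations(range(LB), m):
                aligned = list(zip(rows, cols))
                s = alignment_score(vA, wA, vB, wB, aligned)
                if s > best[0]:
                    best = (s, aligned)
    return best


def toy_model(letters, q):
    """One-hot style toy model: 1 for the observed letters and letter pairs, 0 elsewhere."""
    L = len(letters)
    v = np.zeros((L, q))
    w = np.zeros((L, L, q, q))
    for i, a in enumerate(letters):
        v[i, a] = 1.0
        for j in range(i + 1, L):
            w[i, j, a, letters[j]] = 1.0
    return v, w


# Aligning a toy model with itself: the identity alignment should win.
vA, wA = toy_model([0, 1, 2], q=3)
score, aligned = best_alignment(vA, wA, vA, wA)
print(score, aligned)
```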
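Choice of $s_v(v_i, v_k)$ and $s_w(w_{ij}, w_{kl})$: scalar products
$$s_v(v_i^A, v_k^B) = \langle v_i^A, v_k^B \rangle$$
→ standard scalar product: $\langle x, y \rangle = \sum_i x_i y_i$
$$s_w(w_{ij}^A, w_{kl}^B) = \langle w_{ij}^A, w_{kl}^B \rangle_F$$
→ Frobenius scalar product: $\langle X, Y \rangle_F = \sum_i \sum_j X_{ij} Y_{ij}$
Geometric insight
$$\langle v_i^A, v_k^B \rangle = \lVert v_i^A \rVert \, \lVert v_k^B \rVert \cos\theta$$
$\lVert v_i^A \rVert$: importance of position i
$\lVert v_k^B \rVert$: importance of position k
$\cos\theta$: similarity measure (θ is the angle between $v_i^A$ and $v_k^B$)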
Natural extension of the 1D score of a sequence
$$P(a \mid w, v) = \frac{1}{Z} \exp\left(H(a \mid v, w)\right)$$
$$H(a \mid v, w) = \sum_{i=1}^{L} v_i(a_i) + \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} w_{ij}(a_i, a_j) = \sum_{i=1}^{L} \langle v_i, e_{a_i} \rangle + \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} \langle w_{ij}, e_{a_i a_j} \rangle_F$$
where $e_{a_i}$ is the vector whose only non-zero entry is a 1 at index $a_i$, and $e_{a_i a_j}$ is the matrix whose only non-zero entry is a 1 at row $a_i$, column $a_j$.
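The identity on this slide can be checked numerically: looking up $v_i(a_i)$ and $w_{ij}(a_i, a_j)$ gives the same value as taking scalar products with one-hot vectors and matrices. A minimal sketch with random toy parameters; the array layout and function names are assumptions made for the example.

```python
import numpy as np


def energy_by_lookup(a, v, w):
    """H(a | v, w) by direct indexing of fields and couplings."""
    L = len(a)
    h = sum(v[i, a[i]] for i in range(L))
    h += sum(w[i, j, a[i], a[j]] for i in range(L - 1) for j in range(i + 1, L))
    return float(h)


def energy_by_scalar_products(a, v, w):
    """H(a | v, w) as scalar products with one-hot vectors e_{a_i} and matrices e_{a_i a_j}."""
    L, q = v.shape
    e = np.eye(q)  # e[c] is the one-hot vector of letter c
    h = sum(np.dot(v[i], e[a[i]]) for i in range(L))
    h += sum(np.sum(w[i, j] * np.outer(e[a[i]], e[a[j]]))
             for i in range(L - 1) for j in range(i + 1, L))
    return float(h)


# Random toy model and sequence (hypothetical values).
rng = np.random.default_rng(1)
L, q = 5, 4
v = rng.normal(size=(L, q))
w = rng.normal(size=(L, L, q, q))
a = rng.integers(0, q, size=L)
h_lookup = energy_by_lookup(a, v, w)
h_products = energy_by_scalar_products(a, v, w)
print(h_lookup, h_products)
```

In other words, scoring a sequence against a Potts model is the special case of the alignment score where the second model is the one-hot model of that sequence.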
First experiments
PDB 1CC8: Atx1 metallochaperone (Saccharomyces cerevisiae)
× one homolog s (sample of 150 sequences, identity with 1CC8: 25%-50%)

One-hot encoding of $P_s$ (timeout: 6 hours)
                        trimmed                      not trimmed
ε = machine epsilon     t ∈ [11s, 6h], avg: 2h       t ∈ [25s, 6h], avg: 4h30
ε = 1 [14]              t ∈ [8s, 6h], avg: 1h30      t ∈ [16s, 6h], avg: 2h

Build $P_s$ from homologs of s (timeout: 6 hours)
                        trimmed                      not trimmed
ε = machine epsilon     t ∈ [3s, 6h], avg: 3 min     t ∈ [3s, 6h], avg: 8 min
ε = 1                   t ∈ [2s, 6s], avg: 5s        t ∈ [3s, 50s], avg: 21s

Tractable time! But not for the simpler models?
Small proteins, easy to align...
[14] ≃ Energy needed to change one a.a. into another
Testing the limits on thioredoxins
Enzymes involved in reduction–oxidation reactions through oxidation of their active site
Figure: 3D structure of thioredoxin-1 (Caenorhabditis elegans) (Q09433)
100 amino acids on average
Between 15 and 20% sequence identity within the family
→ known to be hard to align
An example of failure
$\{v_i^{Q12404}\}_i$ and $\{v_i^{P17967}\}_i$ aligned by ComPotts:
Even well-conserved positions of the active site are not aligned
The trouble with scalar product alone
A well-conserved column i may have a smaller $\lVert v_i \rVert$ than a less conserved column j
It may be more profitable to align many less conserved columns than to align fewer well-conserved columns with each other
An idea
Use rescaling function: $f(x) = \operatorname{sign}(x)\,(e^{|x|} - 1)$
Figure: Before rescaling
Figure: After rescaling
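A minimal sketch of the rescaling function applied element-wise to a field vector. The two toy vectors are made-up values, chosen only to show that the function widens the gap in norm between a vector with one dominant entry and a flatter one.

```python
import numpy as np


def rescale(x):
    """f(x) = sign(x) * (exp(|x|) - 1), applied element-wise."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.expm1(np.abs(x))


# Hypothetical field vectors: one dominated by a single letter, one flatter.
peaked = np.array([3.0, -1.0, -1.0, -1.0])
flat = np.array([1.0, 1.0, -1.0, -1.0])

ratio_before = np.linalg.norm(peaked) / np.linalg.norm(flat)
ratio_after = np.linalg.norm(rescale(peaked)) / np.linalg.norm(rescale(flat))
print(ratio_before, ratio_after)
```

Because f is odd, keeps 0 at 0 and grows exponentially in |x|, large entries are amplified much more than small ones while signs are preserved.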
It’s better :-)
$\{v_i^{Q12404}\}_i$ and $\{v_i^{P17967}\}_i$ aligned by ComPotts (without couplings!)
To be continued...
How to also rescale the couplings $w_{ij}$ consistently?
→ Slightly change the rescaling function: $f(x) = \operatorname{sign}(x)\,(\beta e^{\alpha |x|} - \gamma)$?
Other similarity functions?...
Introduce gap costs
Constrain Potts model inference?
Canonical Potts model?
Better control amplitude of vectors and matrices?
Conclusion so far...
Good news: alignment to Potts model is tractable
A surprise: may require a transformation to a Potts model
A working efficient implementation
Quality of alignment can still be improved...
Thanks for your attention!
Ideas, remarks, suggestions are welcome.
See you next to our poster...
Bibliography
[Alt+90] S. F. Altschul et al. “Basic local alignment search tool”. Journal of molecular biology (1990).
[Edd98] S. R. Eddy. “Profile hidden Markov models”. Bioinformatics 14.9 (1998), pp. 755–763.
[Ste+19] M. Steinegger et al. “HH-suite3 for fast remote homology detection and deep protein annotation”. bioRxiv (2019), p. 560029.
[Ker08] G. Kerbellec. “Apprentissage d’automates modélisant des familles de séquences protéiques”. PhD thesis. Université de Rennes 1, Apr. 2008, p. 139.
[Bre+13] A. Bretaudeau et al. “CyanoLyase: a database of phycobilin lyase sequences, motifs and functions”. Nucleic Acids Research 41.Database-Issue (2013), pp. 396–401.
[CGN14] F. Coste, G. Garet, and J. Nicolas. “A bottom-up efficient algorithm learning substitutable languages from positive examples”. ICGI. 2014.
[Dyr+19] W. Dyrka et al. “Estimating probabilistic context-free grammars for proteins using contact map constraints”. PeerJ (2019).
[SGS14] S. Seemayer, M. Gruber, and J. Söding. “CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations”. Bioinformatics 30.21 (2014), pp. 3128–3130.
[OSD17] S. H. P. de Oliveira, J. Shi, and C. M. Deane. “Comparing co-evolution methods and their application to template-free protein structure prediction”. Bioinformatics 33.3 (2017), pp. 373–381.
[Jon+11] D. T. Jones et al. “PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments”. Bioinformatics 28.2 (2011), pp. 184–190.
[Rem+12] M. Remmert et al. “HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment”. Nature methods 9.2 (2012), p. 173.
[Söd04] J. Söding. “Protein homology detection by HMM–HMM comparison”. Bioinformatics 21.7 (2004), pp. 951–960.
[Woh12] I. Wohlers. “Exact Algorithms For Pairwise Protein Structure Alignment”. PhD thesis. Vrije Universiteit, Jan. 2012, pp. 1–147.