Using residue coevolution to retrieve protein homologs [1]

ComPotts

Hugo Talibart, François Coste

Co-evolutionary methods for the prediction and design of protein structure and interactions
CECAM-HQ-EPFL, June 18, 2019

[1] Work in progress.



The Dyliss bioinformatics team
http://www.irisa.fr/dyliss


Bioinformatics at IRISA / Inria Rennes

Symbiose (bioinformatics at IRISA / Inria Rennes):

  • Dyliss research team
  • Genscale research team
  • Genouest bioinformatics platform

Seminars: http://symbiose.irisa.fr/symbioseSeminars

Biogenouest: western France life science and environment network, covering marine biology, agriculture/food-processing, human health, and bioinformatics.


Motivation


Sequence annotation problem

High-throughput production of raw sequences

Problem: what are the function(s) of these sequences?


Protein function?

In-vivo / in-vitro experiments, especially on model organisms:

  • Gene knockout and other mutations → key sequence(s) for a function
  • Structure determination → key positions for a function
  • ...

This does not scale well... To face the (ever-increasing) amount of available sequences, automatic methods are needed: in-silico functional or structural predictions.

Classical approach to predict the function of a new gene sequence:

Search for annotated homologs...


Retrieve the homologs of a protein gene

Search for a significant match with:

  • an (already annotated) protein sequence, e.g. with BLAST [2]

    >1shg:A
    AKELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD

  • the profile HMM of a protein family, e.g. with HMMER [3] or HH-suite [4]

[2] S. F. Altschul et al. “Basic local alignment search tool”. Journal of molecular biology (1990).
[3] S. R. Eddy. “Profile hidden Markov models”. Bioinformatics 14.9 (1998), pp. 755–763.
[4] M. Steinegger et al. “HH-suite3 for fast remote homology detection and deep protein annotation”. bioRxiv (2019), p. 560029.

Yet...

Both kinds of search score each position independently :-(
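The limitation of position-independent scoring can be made concrete with a minimal sketch. The toy profile, the uniform background frequency and the pseudo-probability below are illustrative assumptions, not values or code from BLAST, HMMER or HH-suite; the only point is that the score is a plain sum over columns.

```python
import math

# Hypothetical toy profile: per-column amino-acid probabilities (assumed values).
PROFILE = [
    {"A": 0.7, "G": 0.2, "S": 0.1},
    {"L": 0.6, "I": 0.3, "V": 0.1},
    {"Y": 0.8, "F": 0.2},
]
BACKGROUND = 0.05  # assumed uniform background frequency (1/20)
PSEUDO = 0.01      # assumed floor for residues never seen in a column


def profile_score(seq, profile=PROFILE):
    """Sum of per-column log-odds: every position is scored on its own."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(
        math.log(col.get(res, PSEUDO) / BACKGROUND)
        for res, col in zip(seq, profile)
    )


if __name__ == "__main__":
    for candidate in ("ALY", "GVF", "AVY"):
        print(candidate, round(profile_score(candidate), 3))
```

Because each term depends on a single column, no term can reward or penalise a pair of residues jointly, which is exactly the information that coevolution carries.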


Our research: grammatical inference on biological sequences

Automatic characterization of protein sequence families with:

Automata (Protomata-Learner [5, 6])

[Figure: example automaton (Protomata-Learner), drawn over conserved sequence blocks such as ALYDYE, DLSFKKGEKM, WW and TKKEGFIP]
Local dependencies :-)

[5] G. Kerbellec. “Apprentissage d’automates modélisant des familles de séquences protéiques”. PhD thesis. Université de Rennes 1, Apr. 2008, p. 139.

[6] A. Bretaudeau et al. “CyanoLyase: a database of phycobilin lyase sequences, motifs and functions”. Nucleic Acids Research 41.Database-Issue (2013), pp. 396–401.


Our research: grammatical inference on biological sequences

Automatic characterization of protein sequence families with:

Context-free grammars (ReGLiS [7], see also [8])

[Figure: example context-free grammar (ReGLiS), drawn over sequence blocks such as ALYDYEA, DLSFKKG, WWKAR and KEGYIPS]
Nested dependencies :-D

[7] F. Coste, G. Garet, and J. Nicolas. “A bottom-up efficient algorithm learning substitutable languages from positive examples”. ICGI. 2014.

[8] W. Dyrka et al. “Estimating probabilistic context-free grammars for proteins using contact map constraints”. PeerJ (2019).


Proteins are 3D objects

Many crossing interactions between amino acids that are distant in the sequence but close in the structure


The Chomsky Hierarchy


Mutual information in HIV-1 gp120 homologs

Many (crossing) correlations between MSA columns
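As a rough illustration of how a correlation between two MSA columns can be measured, here is a minimal sketch of mutual information computed from raw column counts. It is an assumption-laden toy (no gap handling, sequence weighting or pseudocounts) and is not claimed to be the computation behind the HIV-1 gp120 figure.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two MSA columns of equal height."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_i)
    f_i = Counter(col_i)                # single-column counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))   # joint counts of residue pairs
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


if __name__ == "__main__":
    # Columns read top to bottom across four hypothetical aligned sequences.
    print(mutual_information("AAVV", "LLII"))  # perfectly co-varying columns
    print(mutual_information("AAVV", "LILI"))  # independent columns
```

Mutual information flags any correlated pair, including pairs that are only linked through a third position, which is the indirect effect that DCA tries to remove.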


Direct Coupling Analysis to the rescue


Direct Coupling Analysis (DCA) to the rescue (Hugo Talibart’s PhD)

A recent breakthrough for the prediction of 3D structures through the prediction of contacts

Principle: disentangle direct from indirect effects

[Figure: mutual information on the HIV-1 gp120 protein]

Idea: use DCA for the automatic characterization of protein families:

  • Identify important (crossing) dependencies with DCA
  • Build accordingly a syntactic model that can be used in practice...


Choice of DCA method

CCMpred [9]

  • Best one-model precision for contact prediction [10]
  • “Structuring” couplings

[Figure: top 25 PSICOV predictions] [Figure: top 25 CCMpred predictions]

[9] S. Seemayer, M. Gruber, and J. Söding. “CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations”. Bioinformatics 30.21 (2014), pp. 3128–3130.

[10] S. H. P. de Oliveira, J. Shi, and C. M. Deane. “Comparing co-evolution methods and their application to template-free protein structure prediction”. Bioinformatics 33.3 (2017), pp. 373–381.


DCA workflow

1. Protein sequence query q

>1CC8:A|PDBID|CHAIN|SEQUENCE
MAEIKHYQFNVVMTCSGCSGAVNKVLTKLEPDVSKIDISLEKQLVDVYT


DCA workflow

2. Retrieve close homologs and build an MSA (e.g. with HHblits [11])

1CC8:A|PDBID|CHAIN|SEQUENCE MAEIKHYQFNVVMTCSGCSGAVNKVLTKLEPDVSKIDISLEKQLVDVYT
sp|Q54PZ2|ATOX1_DICDI ....MTYSFFVDMTCGGCSKAVNAILSKIDGVS.NIQIDLENKKVCESS
tr|A0A0C7MWI5|A0A0C7MWI5_9SACH .STAQHYHFDVVMTCAGCSNAINRVLTRLEPDVSNIEISLEKQTVDVVS
tr|A7TF58|A7TF58_VANPO .SNDNHYQFEVVMTCSGCSNAVNKALTRLEPDVSNIDISLENQTVDVHS
tr|G0WD69|G0WD69_NAUDC .MAENHYQFNVVMTCSGCSNAINRVLTKLEPEVSKIDISLEDQTVDVTT
tr|G8ZQK6|G8ZQK6_TORDC .SQQNHYQFNVVMSCSGCSNAINKVLSRLEPDVSKIETSLDSQTVDVYT
tr|S6E8D5|S6E8D5_ZYGB2 .MSQNHYHFEVVMSCEGCSNAINRVLTKLKPDVSEIRISLENQTVDVYT
tr|J7R785|J7R785_KAZNA ..MSNHYQFDVVMTCSACSNAISKVLTRMEPEVTKFDVSLEKQTVDVQT
tr|W1QBQ2|W1QBQ2_OGAPD .MSAKHYKFDVTMACSGCSNAVNRVLTRL.PGVKNVEISLEKQTVDVIS
tr|H2AUI5|H2AUI5_KAZAF ..MIYCYHFNVVMTCSGCSDAIHRSLSKLGPEVTDIDISLENQYVEVFT
tr|G8JMM3|G8JMM3_ERECY .MDTKHYQFQVALACSGCVAAVEKALAKLQPDISKFDISLEKQIVDVYT
tr|S9Q3L9|S9Q3L9_SCHOY ....MKYSFNVVMTCDGCKNAIDRVLNRL..GVDEKEISLEAQEVHVTT
tr|Q01AV4|Q01AV4_OSTTA ..MSTTVTLRCDFACDGCANAVKRILSKDDA....VRTSVEDKLVVVV.
tr|E5R4F7|E5R4F7_LEPMJ ..MTHTYKFNVTMTCGGCSGAVERVLRKLE.GVESFNVNLETQTAEVVA
tr|R7Z484|R7Z484_CONA1 .MSEHNYKFNVAMSCGGCSGAVERVLKKLD.GVKSFNVSLDTQTAEIVA
tr|M3CXY4|M3CXY4_SPHMS .MAEHKYKFNVSMSCGGCSGAIERVLKKLD.GVKEFNVSLETQTAEITT
tr|W9XE16|W9XE16_9EURO .MSEHHYKFNVTMTCGGCSGAVERVLKKLD.GVKNYTVSLDTQTADVTT
tr|Q5BDJ0|Q5BDJ0_EMENI .DQEHHYKFNVSMSCGGCSGAVERVLKKLD.GVKSFDVNLDSQTASVVT
tr|W3WZP2|W3WZP2_9PEZI .ADNHTYKFNVSMSCGGCSGAVDRVLKKLD.GIESYDVSLEKQEATVIA
tr|A0A0D2B224|A0A0D2B224_9PEZI ..MSHTYKFNVAMSCGGCSGAIDRVLKKLE.GVDKYEVSLEKQTAEVHT
tr|A0A093XHT8|A0A093XHT8_PENMA .MAEHQYKFNVSMSCGGCSGAVERVLKKLDVGVKSYDVSLESQTATVVA
tr|A0A074WQB6|A0A074WQB6_9PEZI .MSDHTYNFNITMTCGGCSGAVERVLKKLD.GVKSFDVSLDSQTAFVIT

[11] M. Remmert et al. “HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment”. Nature methods 9.2 (2012), p. 173.


DCA workflow

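3. Infer a Potts model from the MSA

$$P(a \mid w, v) = \frac{1}{Z} \exp\left( \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} w_{ij}(a_i, a_j) + \sum_{i=1}^{L} v_i(a_i) \right)$$

where

  • $P(a \mid w, v)$ is the probability of the sequence $a = a_1, \dots, a_L$
  • $Z$ is the normalization constant
  • $w_{ij}$ are the couplings and $v_i$ the fields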
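To make the Potts formula concrete, here is a minimal sketch that evaluates it on a toy model. The two-letter alphabet and the parameter values are assumptions chosen for readability, and the normalization constant Z is obtained by brute-force enumeration, which is only feasible for tiny L.

```python
import math
from itertools import product

# Hypothetical toy model: 2 positions, alphabet {A, B}, one coupling (assumed values).
ALPHABET = "AB"
V = [{"A": 0.0, "B": 0.0}, {"A": 0.0, "B": 0.0}]
W = {(0, 1): {("A", "A"): 1.0, ("A", "B"): 0.0,
              ("B", "A"): 0.0, ("B", "B"): 1.0}}


def hamiltonian(seq, v, w):
    """Sum of the fields v_i(a_i) and of the couplings w_ij(a_i, a_j) for i < j."""
    length = len(seq)
    total = sum(v[i][seq[i]] for i in range(length))
    for i in range(length - 1):
        for j in range(i + 1, length):
            total += w[(i, j)][(seq[i], seq[j])]
    return total


def probability(seq, v, w, alphabet):
    """P(a | w, v) = exp(H(a)) / Z, with Z summed over all sequences of length L."""
    z = sum(math.exp(hamiltonian(s, v, w))
            for s in product(alphabet, repeat=len(seq)))
    return math.exp(hamiltonian(seq, v, w)) / z


if __name__ == "__main__":
    for s in ("AA", "AB", "BA", "BB"):
        print(s, round(probability(s, V, W, ALPHABET), 4))
```

With these toy couplings the sequences AA and BB are more probable than AB and BA, even though the fields alone cannot tell them apart.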


Inference of Potts model from MSA (CCMpred)

Maximise the pseudo-likelihood of the N aligned sequences, i.e.:

$$(w, v) = \operatorname*{argmax}_{w, v} \sum_{n=1}^{N} \sum_{i=1}^{L} \log P(A_i = a_i^n \mid a_1^n, \dots, a_{i-1}^n, a_{i+1}^n, \dots, a_L^n, v, w)$$

(more tractable and still good precision)

while respecting empirical frequencies:

$$P_i(a) = f_i(a)$$

$$P_{ij}(a, b) = f_{ij}(a, b)$$
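The pseudo-likelihood objective can be sketched as follows. This is a minimal illustration of the quantity being maximised, not CCMpred's implementation: it omits regularization, sequence weighting and the optimizer, and the dictionary-based parameter layout is an assumption made for readability.

```python
import math


def log_conditional(seq, i, v, w, alphabet):
    """log P(A_i = a_i | all other positions of seq) under a Potts model (v, w)."""
    def local_energy(c):
        # Field at position i plus couplings between i and every other position.
        energy = v[i][c]
        for j in range(len(seq)):
            if j < i:
                energy += w[(j, i)][(seq[j], c)]
            elif j > i:
                energy += w[(i, j)][(c, seq[j])]
        return energy

    # The normalization only runs over the letters at position i, hence tractable.
    log_z_i = math.log(sum(math.exp(local_energy(c)) for c in alphabet))
    return local_energy(seq[i]) - log_z_i


def pseudo_log_likelihood(msa, v, w, alphabet):
    """Sum of the conditional log-probabilities over sequences and positions."""
    return sum(log_conditional(seq, i, v, w, alphabet)
               for seq in msa for i in range(len(seq)))
```

Each conditional needs a sum over one alphabet only, instead of the sum over all sequences of length L hidden in Z, which is what makes this objective more tractable than the full likelihood.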


Using Potts model for contact prediction

4. Contacts in q are predicted using Frobenius norm of the couplings

$$\|w_{ij}\| = \sqrt{\sum_{a} \sum_{b} w_{ij}(a, b)^2}$$

A larger norm is interpreted as a likelier contact between positions
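A minimal sketch of this contact scoring step, assuming the couplings are stored as a dictionary mapping a position pair (i, j) to a q × q matrix given as nested lists (a layout chosen here for illustration). Contact predictors usually apply further corrections to these raw norms; none is included here.

```python
import math


def frobenius_norm(w_ij):
    """Square root of the sum of squared entries of one coupling matrix."""
    return math.sqrt(sum(x * x for row in w_ij for x in row))


def rank_contacts(w):
    """Return the position pairs (i, j) sorted by decreasing coupling norm."""
    return sorted(w, key=lambda pair: frobenius_norm(w[pair]), reverse=True)


if __name__ == "__main__":
    # Hypothetical couplings for 3 positions and a 2-letter alphabet.
    couplings = {
        (0, 1): [[0.1, 0.0], [0.0, 0.1]],
        (0, 2): [[2.0, -1.0], [-1.0, 2.0]],
        (1, 2): [[0.5, 0.0], [0.0, 0.0]],
    }
    for pair in rank_contacts(couplings):
        print(pair, round(frobenius_norm(couplings[pair]), 3))
```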


Using Potts model for homology search


Use the whole Potts model Pq of q instead of the Frobenius norms

  • Use Pq to score each possibly homologous sequence s
  • This requires computing the best alignment of s to Pq

As HHalign does for pairs of HMMs [12], directly align pairs of Potts models

A new tool: ComPotts

[Diagram: Pq, Ps → ComPotts → alignment]

with two options to get the Potts model Ps of s:

  • One-hot encoding: vi(ai) = 1, wij(ai, aj) = 1, all other parameters are 0

    s → 1-hot encoding → Ps

  • From close homologs of s, as for Pq

    s → HHblits → MSA → trimAl → trimmed MSA → CCMpredPy → Ps

[12] J. Söding. “Protein homology detection by HMM–HMM comparison”. Bioinformatics 21.7 (2004), pp. 951–960.
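The one-hot option can be sketched in a few lines. The function name and the dictionary layout are assumptions for illustration and are not taken from ComPotts; the sketch only encodes the rule stated on the slide (1 for the observed letters, 0 everywhere else).

```python
def one_hot_potts(seq, alphabet):
    """Build a Potts model (v, w) for a single sequence by one-hot encoding.

    v[i][a] is 1 only for the letter observed at position i, and
    w[(i, j)][(a, b)] is 1 only for the pair of letters observed at (i, j).
    """
    length = len(seq)
    v = [{a: 1.0 if a == seq[i] else 0.0 for a in alphabet}
         for i in range(length)]
    w = {
        (i, j): {(a, b): 1.0 if (a, b) == (seq[i], seq[j]) else 0.0
                 for a in alphabet for b in alphabet}
        for i in range(length - 1)
        for j in range(i + 1, length)
    }
    return v, w


if __name__ == "__main__":
    fields, couplings = one_hot_potts("ABC", "ABC")
    print(fields[0])
    print(couplings[(0, 2)][("A", "C")])
```

Such a model carries no information beyond the sequence itself, whereas the second option brings in the variability observed among close homologs of s.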


ComPotts (Comparing Potts models)

Formulation of Potts model alignment as an Integer Linear Programming (ILP) problem

Based on Inken Wohlers’ solver [13]

[13] I. Wohlers. “Exact Algorithms For Pairwise Protein Structure Alignment”. PhD thesis. Vrije Universiteit, Jan. 2012, pp. 1–147.


Scoring alignment of Potts models A and B

$$s(A, B) = \sum_{i=1}^{L_A} \sum_{k=1}^{L_B} s_v(v_i^A, v_k^B)\, x_{ik} + \sum_{i=1}^{L_A-1} \sum_{j=i+1}^{L_A} \sum_{k=1}^{L_B-1} \sum_{l=k+1}^{L_B} s_w(w_{ij}^A, w_{kl}^B)\, y_{ikjl}$$

where

  • $x_{ik} = 1$ iff position i of A and position k of B are aligned (otherwise, $x_{ik} = 0$)
  • $y_{ikjl} = 1$ iff $x_{ik} = 1$ and $x_{jl} = 1$ (otherwise, $y_{ikjl} = 0$)
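The score of a given alignment can be evaluated directly from this definition. The sketch below only computes s(A, B) for an alignment supplied by the caller; it is not the ILP solver, and it assumes that s_v and s_w are plain scalar products over list-based vectors and matrices (a layout chosen for illustration).

```python
def dot(x, y):
    """Standard scalar product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def frobenius(x, y):
    """Frobenius scalar product of two matrices given as nested lists."""
    return sum(dot(row_x, row_y) for row_x, row_y in zip(x, y))


def alignment_score(v_a, w_a, v_b, w_b, aligned):
    """Score of an alignment given as a list of aligned position pairs (i, k).

    The pairs are exactly those with x_ik = 1; every two of them contribute
    one coupling term (y_ikjl = 1).
    """
    for (i, k), (j, l) in zip(aligned, aligned[1:]):
        if not (i < j and k < l):
            raise ValueError("aligned pairs must be strictly increasing in both models")
    score = sum(dot(v_a[i], v_b[k]) for i, k in aligned)
    for n, (i, k) in enumerate(aligned):
        for j, l in aligned[n + 1:]:
            score += frobenius(w_a[(i, j)], w_b[(k, l)])
    return score
```

Finding the alignment that maximises this score is the hard part, since the coupling terms tie the choices made at different positions together; that search is what the ILP formulation addresses.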


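Choice of $s_v(v_i, v_k)$ and $s_w(w_{ij}, w_{kl})$: scalar products

$$s_v(v_i^A, v_k^B) = \langle v_i^A, v_k^B \rangle$$

→ standard scalar product: $\langle x, y \rangle = \sum_i x_i y_i$

$$s_w(w_{ij}^A, w_{kl}^B) = \langle w_{ij}^A, w_{kl}^B \rangle_F$$

→ Frobenius scalar product: $\langle X, Y \rangle_F = \sum_i \sum_j X_{ij} Y_{ij}$

Geometric insight

$$\langle v_i^A, v_k^B \rangle = \|v_i^A\| \, \|v_k^B\| \cos\theta$$

where $\|v_i^A\|$ reflects the importance of position i, $\|v_k^B\|$ the importance of position k, and $\cos\theta$ is a similarity measure ($\theta$ is the angle between $v_i^A$ and $v_k^B$).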


Natural extension of the 1D score of a sequence

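$$P(a \mid w, v) = \frac{1}{Z} \exp\left(H(a \mid v, w)\right)$$

$$H(a \mid v, w) = \sum_{i=1}^{L} v_i(a_i) + \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} w_{ij}(a_i, a_j) = \sum_{i=1}^{L} \langle v_i, e_{a_i} \rangle + \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} \langle w_{ij}, e_{a_i a_j} \rangle_F$$

where $e_{a_i}$ is the one-hot vector with a single 1 at index $a_i$, and $e_{a_i a_j}$ is the one-hot matrix with a single 1 at row $a_i$, column $a_j$ (all other entries are 0).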
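The identity between the table-lookup form of H and its scalar-product form can be checked numerically. In this minimal sketch the residues are integer-coded (0 to q − 1) and the parameters are nested lists; both choices, and all values, are assumptions made for the example.

```python
def one_hot_vector(a, q):
    """Vector of length q with a single 1 at index a."""
    return [1.0 if k == a else 0.0 for k in range(q)]


def one_hot_matrix(a, b, q):
    """q x q matrix with a single 1 at row a, column b."""
    return [[1.0 if (r, c) == (a, b) else 0.0 for c in range(q)] for r in range(q)]


def dot(x, y):
    return sum(s * t for s, t in zip(x, y))


def frobenius(x, y):
    return sum(dot(row_x, row_y) for row_x, row_y in zip(x, y))


def energy_lookup(seq, v, w):
    """H(a) written as sums of v_i(a_i) and w_ij(a_i, a_j)."""
    length = len(seq)
    pairs = [(i, j) for i in range(length - 1) for j in range(i + 1, length)]
    return (sum(v[i][seq[i]] for i in range(length))
            + sum(w[(i, j)][seq[i]][seq[j]] for i, j in pairs))


def energy_inner(seq, v, w, q):
    """H(a) written as scalar products with one-hot vectors and matrices."""
    length = len(seq)
    pairs = [(i, j) for i in range(length - 1) for j in range(i + 1, length)]
    return (sum(dot(v[i], one_hot_vector(seq[i], q)) for i in range(length))
            + sum(frobenius(w[(i, j)], one_hot_matrix(seq[i], seq[j], q))
                  for i, j in pairs))
```

Seen this way, scoring a sequence against a Potts model is the special case of the model-to-model score in which the second model is the one-hot encoding of the sequence.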


First experiments

PDB 1CC8: Atx1 metallochaperone (Saccharomyces cerevisiae)

× one homolog s (150 sequences sampling, identity with 1CC8: 25%-50%)

One-hot encoding of Ps (timeout: 6 hours)

                        trimmed                     not trimmed
ε = machine epsilon     t ∈ [11s, 6h], avg: 2h      t ∈ [25s, 6h], avg: 4h30
ε = 1 [14]              t ∈ [8s, 6h], avg: 1h30     t ∈ [16s, 6h], avg: 2h

Build Ps from homologs of s (timeout: 6 hours)

                        trimmed                     not trimmed
ε = machine epsilon     t ∈ [3s, 6h], avg: 3 min    t ∈ [3s, 6h], avg: 8 min
ε = 1                   t ∈ [2s, 6s], avg: 5s       t ∈ [3s, 50s], avg: 21s

Tractable time! But not for the simpler models?
Small proteins, easy to align...

[14] ≃ Energy needed to change one amino acid into another


Testing the limits on thioredoxins

Enzymes involved in reduction–oxidation reactions through oxidation of their active site

Figure: 3D structure of thioredoxin-1 (Caenorhabditis elegans) (Q09433)

100 amino acids on average

Between 15 and 20% sequence identity within the family

→ known to be hard to align


An example of failure

$\{v_i^{Q12404}\}_i$ and $\{v_i^{P17967}\}_i$ aligned by ComPotts:

Even well-conserved positions of the active site are not aligned


The trouble with scalar product alone

A well-conserved column i may have a smaller ||vi|| than a less conserved column j

It may be more profitable to align many less conserved columns than to align fewer well-conserved columns with each other
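The effect can be reproduced with a small numerical sketch based on the decomposition of the scalar product into norms and cosine. The field vectors below are hypothetical values picked to show the phenomenon (with a 3-letter alphabet for brevity); they do not come from the thioredoxin models.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def cosine(x, y):
    """Similarity part of the scalar product: <x, y> / (||x|| ||y||)."""
    return dot(x, y) / (norm(x) * norm(y))


# Hypothetical pair of identical columns with a modest norm (assumed values).
conserved_a = [1.5, 0.0, 0.0]
conserved_b = [1.5, 0.0, 0.0]

# Hypothetical pair of only partly similar columns with larger norms.
variable_a = [2.0, 2.0, 0.0]
variable_b = [2.0, 0.0, 2.0]

if __name__ == "__main__":
    print("identical pair:", dot(conserved_a, conserved_b), cosine(conserved_a, conserved_b))
    print("dissimilar pair:", dot(variable_a, variable_b), cosine(variable_a, variable_b))
```

The second pair is less similar (smaller cosine) yet obtains the larger scalar product, purely because of its norms, so an optimizer maximising the sum of scalar products may prefer it.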


An idea

Use a rescaling function: $f(x) = \operatorname{sign}(x)\,(e^{|x|} - 1)$

Figure: Before rescaling

Figure: After rescaling
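A minimal sketch of the rescaling function follows. Applying it element-wise to a field vector is an assumption made here for illustration; the slides do not spell out how it is applied.

```python
import math


def rescale(x):
    """f(x) = sign(x) * (exp(|x|) - 1): odd, increasing, and f(0) = 0."""
    # expm1 computes exp(t) - 1 accurately for small t; copysign restores the sign.
    return math.copysign(math.expm1(abs(x)), x)


def rescale_vector(v):
    """Apply the rescaling to each entry of a vector (assumed element-wise use)."""
    return [rescale(x) for x in v]


if __name__ == "__main__":
    print(rescale_vector([0.0, 0.5, 1.0, 3.0, -3.0]))
```

Small values are left almost unchanged while large ones are amplified exponentially, which increases the contrast between strongly and weakly weighted entries.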


It’s better :-)

$\{v_i^{Q12404}\}_i$ and $\{v_i^{P17967}\}_i$ aligned by ComPotts (without couplings!)


To be continued...

  • How to also rescale the couplings wij consistently?
    → Slightly change the rescaling function: $f(x) = \operatorname{sign}(x)\,(\beta e^{\alpha |x|} - \gamma)$?
  • Other similarity functions?...
  • Introduce gap costs
  • Constrain Potts model inference?
    Canonical Potts model?
    Better control the amplitude of vectors and matrices?


Conclusion so far...

  • Good news: alignment to Potts model is tractable
  • A surprise: may require a transformation to a Potts model
  • A working efficient implementation
  • Quality of alignment can still be improved...


Thanks for your attention!

Ideas, remarks, suggestions are welcome.
See you next to our poster...

Bibliography

[Alt+90] S. F. Altschul et al. “Basic local alignment search tool”. Journal of molecular biology (1990).

[Edd98] S. R. Eddy. “Profile hidden Markov models”. Bioinformatics 14.9 (1998), pp. 755–763.

[Ste+19] M. Steinegger et al. “HH-suite3 for fast remote homology detection and deep protein annotation”. bioRxiv (2019), p. 560029.

[Ker08] G. Kerbellec. “Apprentissage d’automates modélisant des familles de séquences protéiques”. PhD thesis. Université de Rennes 1, Apr. 2008, p. 139.

[Bre+13] A. Bretaudeau et al. “CyanoLyase: a database of phycobilin lyase sequences, motifs and functions”. Nucleic Acids Research 41.Database-Issue (2013), pp. 396–401.

[CGN14] F. Coste, G. Garet, and J. Nicolas. “A bottom-up efficient algorithm learning substitutable languages from positive examples”. ICGI. 2014.

[Dyr+19] W. Dyrka et al. “Estimating probabilistic context-free grammars for proteins using contact map constraints”. PeerJ (2019).

[SGS14] S. Seemayer, M. Gruber, and J. Söding. “CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations”. Bioinformatics 30.21 (2014), pp. 3128–3130.

[OSD17] S. H. P. de Oliveira, J. Shi, and C. M. Deane. “Comparing co-evolution methods and their application to template-free protein structure prediction”. Bioinformatics 33.3 (2017), pp. 373–381.

[Jon+11] D. T. Jones et al. “PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments”. Bioinformatics 28.2 (2011), pp. 184–190.

[Rem+12] M. Remmert et al. “HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment”. Nature methods 9.2 (2012), p. 173.

[Söd04] J. Söding. “Protein homology detection by HMM–HMM comparison”. Bioinformatics 21.7 (2004), pp. 951–960.

[Woh12] I. Wohlers. “Exact Algorithms For Pairwise Protein Structure Alignment”. PhD thesis. Vrije Universiteit, Jan. 2012, pp. 1–147.