Upload
deepak
View
31
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Gibbs Clustering Massimo Andreatta , Morten Nielsen CBS, Department of Systems biology DTU, Denmark . Class II MHC binding. MHC class II binds peptides in the class II antigen presentation pathway Binds peptides of length 9-18 (even whole proteins can bind!) Binding cleft is open - PowerPoint PPT Presentation
Citation preview
Gibbs ClusteringMassimo Andreatta, Morten Nielsen
CBS, Department of Systems biologyDTU, Denmark
Class II MHC binding• MHC class II binds peptides in the class II antigen presentation pathway• Binds peptides of length 9-18 (even whole proteins can bind!)• Binding cleft is open• Binding core is 9 aa
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 3
Conventional Gibbs samplingMHC class II binding
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
...
...
...
... DFAAQVDYPSTGLY
9 AAs
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRT ... ... ... ...DFAAQVDYPSTGLY
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 4
Gibbs sampling - sequence alignment
Why sampling? SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
...
...
...
... DFAAQVDYPSTGLY
50 sequences 12 amino acids long
try all possible combinations with a 9-mer overlap
450 ~ 1030 possible combinations
...computationally unfeasible
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 5
Gibbs sampling - sequence alignmentState transition
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 6
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
Accept or reject the move?
Note that the probability of going to the new state depends on the previous state only
Gibbs sampling - sequence alignmentState transition
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 7
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
Gibbs sampling - sequence alignment
Numerical example - 1
Accept move with Prob = 100%
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 8
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
Gibbs sampling - sequence alignment
Numerical example - 2
Accept move with Prob = 63.8%
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 9
Gibbs sampling - sequence alignment
What is the MC temperature?
iteration
T
it’s a scalar decreased during the simulation
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 10
Gibbs sampling - sequence alignment
What is the MC temperature?
iteration
T
it’s a scalar decreased during the simulation E.g. same dE=-0.3 but at different temperatures
t1=0.4
t3=0.02
t2=0.1
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 11
Z
f(Z)
Move freely around states when the system is “warm”, then cool it off to force it into a state of high fitness
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 12
Does it work?
HLA-DRB3*01:01
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 13
It works
High Temperature Low Temperature
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 14
Gibbs sampler. Prediction accuracy
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 15
Next generation peptide-chip technology
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 16
Identification of receptor binding motifs
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 17
Data interpretationPeptide Signal
LMLSLFEQSLSCQAQ 9
QGTDATKSIIFEAER -12
RLEEAQAYLAAGQHD 10
EISELRTKVQEQQKQ 44
FAGAKKIFGSLAFLP 70
VRASSRVSGSFPEDS -7
CKAFFKRSIQGHNDY 86
CEGCKAFFKRSIQGH 100
RLSEADIRGFVAAVV -7
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 18
Or with multiple specificities? Peptide Signal
LMLSLFEQSLSCQAQ 9
QGTDATKSIIFEAER -12
RLEEAQAYLAAGQHD 10
EISELRTKVQEQQKQ 44
FAGAKKIFGSLAFLP 70
VRASSRVSGSFPEDS -7
CKAFFKRSIQGHNDY 86
CEGCKAFFKRSIQGH 100
RLSEADIRGFVAAVV -7
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 19
Gibbs clustering (multiple specificities)
--ELLEFHYYLSSKLNK----------LNKFISPKSVAGRFAESLHNPYPDYHWLRT-------NKVKSLRILNTRRKL-------MMGMFNMLSTVLGVS----AKSSPAYPSVLGQTI--------RHLIFCHSKKKCDELAAK-
----SLFIGLKGDIRESTV----DGEEEVQLIAAVPGK----------VFRLKGGAPIKGVTF---SFSCIAIGIITLYLG-------IDQVTIAGAKLRSLN--WIQKETLVTFKNPHAKKQDV-------KMLLDNINTPEGIIP
Cluster 2
Cluster 1SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRTNKVKSLRILNTRRKLMMGMFNMLSTVLGVSAKSSPAYPSVLGQTIRHLIFCHSKKKCDELAAK
Multiple motifs!
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 20
The algorithm
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLN
WIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRTNKVKSLRILNTRRKLMMGMFNMLSTVLGVSAKSSPAYPSVLGQTI
RHLIFCHSKKKCDELAAK
1. List of peptides
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 21
The algorithm
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTF
SFSCIAIGIITLYLGIDQVTIAGAKLRSLN
WIQKETLVTFKNPHAKKQDV
KMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFA
ESLHNPYPDYHWLRTNKVKSLRILNTRRKL
MMGMFNMLSTVLGVSAKSSPAYPSVLGQTI
RHLIFCHSKKKCDELAAK
1. List of peptides 2. create Nrandom groups
----IDQVTIAGAKLRSLN-WIQKETLVTFKNPHAKKQ
DV--ELLEFHYYLSSKLNK---
--MMGMFNMLSTVLGVS------AKSSPAYPSVLGQTI--
-SLFIGLKGDIRESTV-----SFSCIAIGIITLYLG
KMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
---ESLHNPYPDYHWLRT-RHLIFCHSKKKCDELAAK-----VFRLKGGAPIKGVTF--LNKFISPKSVAGRFA--DGEEEVQLIAAVPGK----
g1 g2 gN
3 Simple shift Remove peptide Phase shift
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 22
The algorithm
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLN
WIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRTNKVKSLRILNTRRKLMMGMFNMLSTVLGVSAKSSPAYPSVLGQTI
RHLIFCHSKKKCDELAAK
1. List of peptides 2. create Nrandom groups
----IDQVTIAGAKLRSLN-WIQKETLVTFKNPHAKKQDV--ELLEFHYYLSSKLNK-----MMGMFNMLSTVLGVS------AKSSPAYPSVLGQTI--
-SLFIGLKGDIRESTV-----SFSCIAIGIITLYLGKMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
---ESLHNPYPDYHWLRT-RHLIFCHSKKKCDELAAK-----VFRLKGGAPIKGVTF--LNKFISPKSVAGRFA--DGEEEVQLIAAVPGK----
g1 g2 gN
MMGMFNMLSTVLGVS
3 3. Finding the optimal configuration (the MC moves)
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 23
The algorithm
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLN
WIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRTNKVKSLRILNTRRKLMMGMFNMLSTVLGVSAKSSPAYPSVLGQTI
RHLIFCHSKKKCDELAAK
1. List of peptides 2. create Nrandom groups
----IDQVTIAGAKLRSLN-WIQKETLVTFKNPHAKKQDV--ELLEFHYYLSSKLNK------AKSSPAYPSVLGQTI--
-SLFIGLKGDIRESTV-----SFSCIAIGIITLYLGKMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
---ESLHNPYPDYHWLRT-RHLIFCHSKKKCDELAAK-----VFRLKGGAPIKGVTF--LNKFISPKSVAGRFA--DGEEEVQLIAAVPGK----
g1 g2 gN
MMGMFNMLSTVLGVS
4. Random shift of the coreMMGMFNMLSTVLGVS
3. Finding the optimal configuration (the MC moves)
5. Score new core to log-odds
matrices6. Accept or reject move
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 24
The algorithm
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTF
SFSCIAIGIITLYLGIDQVTIAGAKLRSLN
WIQKETLVTFKNPHAKKQDV
KMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFA
ESLHNPYPDYHWLRTNKVKSLRILNTRRKL
MMGMFNMLSTVLGVSAKSSPAYPSVLGQTI
RHLIFCHSKKKCDELAAK
1. List of peptides 2. create Nrandom groups
----IDQVTIAGAKLRSLN-WIQKETLVTFKNPHAKKQ
DV--ELLEFHYYLSSKLNK---
--MMGMFNMLSTVLGVS------AKSSPAYPSVLGQTI--
-SLFIGLKGDIRESTV-----SFSCIAIGIITLYLG
KMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
---ESLHNPYPDYHWLRT-RHLIFCHSKKKCDELAAK-----VFRLKGGAPIKGVTF--LNKFISPKSVAGRFA--DGEEEVQLIAAVPGK----
g1 g2 gN
MMGMFNMLSTVLGVS
3 Simple shift Remove peptide Phase shift
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 25
The algorithm
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTF
SFSCIAIGIITLYLGIDQVTIAGAKLRSLN
WIQKETLVTFKNPHAKKQDV
KMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFA
ESLHNPYPDYHWLRTNKVKSLRILNTRRKL
MMGMFNMLSTVLGVSAKSSPAYPSVLGQTI
RHLIFCHSKKKCDELAAK
1. List of peptides 2. create Nrandom groups
----IDQVTIAGAKLRSLN-WIQKETLVTFKNPHAKKQ
DV--ELLEFHYYLSSKLNK------AKSSPAYPSVLGQTI--
-SLFIGLKGDIRESTV-----SFSCIAIGIITLYLG
KMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
---ESLHNPYPDYHWLRT-RHLIFCHSKKKCDELAAK-----VFRLKGGAPIKGVTF--LNKFISPKSVAGRFA--DGEEEVQLIAAVPGK----
g1 g2 gN
MMGMFNMLSTVLGVS
4b. Random shift of the coreMMGMFNMLSTVLGVS
3 Simple shift Remove peptide Phase shift
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 26
The algorithm
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTF
SFSCIAIGIITLYLGIDQVTIAGAKLRSLN
WIQKETLVTFKNPHAKKQDV
KMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFA
ESLHNPYPDYHWLRTNKVKSLRILNTRRKL
MMGMFNMLSTVLGVSAKSSPAYPSVLGQTI
RHLIFCHSKKKCDELAAK
1. List of peptides 2. create Nrandom groups
----IDQVTIAGAKLRSLN-WIQKETLVTFKNPHAKKQ
DV--ELLEFHYYLSSKLNK------AKSSPAYPSVLGQTI--
-SLFIGLKGDIRESTV-----SFSCIAIGIITLYLG
KMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
---ESLHNPYPDYHWLRT-RHLIFCHSKKKCDELAAK-----VFRLKGGAPIKGVTF--LNKFISPKSVAGRFA--DGEEEVQLIAAVPGK----
g1 g2 gN
MMGMFNMLSTVLGVS
4b. Random shift of the coreMMGMFNMLSTVLGVS
5b. Score new core to all log-odds
matrices
E1 E2E3
3 Simple shift Remove peptide Phase shift
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 27
The algorithm
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTF
SFSCIAIGIITLYLGIDQVTIAGAKLRSLN
WIQKETLVTFKNPHAKKQDV
KMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFA
ESLHNPYPDYHWLRTNKVKSLRILNTRRKL
MMGMFNMLSTVLGVSAKSSPAYPSVLGQTI
RHLIFCHSKKKCDELAAK
1. List of peptides 2. create Nrandom groups
----IDQVTIAGAKLRSLN-WIQKETLVTFKNPHAKKQ
DV--ELLEFHYYLSSKLNK------AKSSPAYPSVLGQTI--
-SLFIGLKGDIRESTV-----SFSCIAIGIITLYLG
KMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
---ESLHNPYPDYHWLRT-RHLIFCHSKKKCDELAAK-----VFRLKGGAPIKGVTF--LNKFISPKSVAGRFA--DGEEEVQLIAAVPGK----
-MMGMFNMLSTVLGVS---
g1 g2 gN
MMGMFNMLSTVLGVS
4b. Random shift of the coreMMGMFNMLSTVLGVS
5b. Score new core to all log-odds
matrices
E1 E2E3
6b. Accept or reject move
3 Simple shift Remove peptide Phase shift
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 28
The scoring function
-SLFIGLKGDIRESTV-----SFSCIAIGIITLYLGKMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
-SLFIGLKGDIRESTV---SFSCIAIGIITLYLG--KMLLDNINTPEGIIP----LNKYVHGTWRSILP-----NKVKSLRILNTRRKL-
P = Probability of accepting the
move
Avoid small specialized clusters
(s = 10)Maximize intra cluster similarity
whilst minimize inter cluster similarity
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 29
A mixture of 500 9mer peptides. How many motifs?
OR
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 30
A mixture of 500 9mer peptides. How many motifs?
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 31
Mixture of 100 binders for the two allelesATDKAAAAY A*0101EVDQTKIQY A*0101AETGSQGVY B*4402ITDITKYLY A*0101AEMKTDAAT B*4402FEIKSAKKF B*4402LSEMLNKEY A*0101GELDRWEKI B*4402LTDSSTLLV A*0101FTIDFKLKY A*0101TTTIKPVSY A*0101EEKAFSPEV B*4402AENLWVPVY B*4402
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 32
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
A0101 B4402
G 1
G 2
Mixed
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 33
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
97 3
3 97
A0101 B4402
G 1
G 2
Resolved
Mixed
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 34
Five MHC class I allelesA0101
A0201 A0301 B0702 B4402
G 0
G 1
G 2
G 3
G 4
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 35
Five MHC class I allelesA0101
0 1 76 1 02 4 0 0 955 87 5 1 093 2 19 0 20 6 0 98 3
A0201 A0301 B0702 B4402
G 0
G 1
G 2
G 3
G 4
HLA-B440294%
HLA-B070292%
HLA-A020189%
HLA-A010180%
HLA-A030197%
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 36
The right number of specificities
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 37
How many motifs?
l=0
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 38
How many motifs?
OR
l>0
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 39
Adding in alignment (MHC class II)
HLA-DRB1*03:01HLA-DRB1*04:01
SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRTNKVKSLRILNTRRKLMMGMFNMLSTVLGVSAKSSPAYPSVLGQTIRHLIFCHSKKKCDELAAK
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 40
Dealing with noisy data
• Experimental data often contain false positives
• Outliers do not match any recurrent motif
• Introduce a garbage bin to collect outliers
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 41
Dealing with noisy data
• Introduce a garbage bin to collect outliersSLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRTNKVKSLRILNTRRKLMMGMFNMLSTVLGVSAKSSPAYPSVLGQTIRHLIFCHSKKKCDELAAK
g1 g2 gn trash
Move to the trash cluster peptides that do not match any motif
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 42
Dealing with noisy data
50 random sequences are added to the data set
200 binders to 3 MHC class I alleles
3 alleles HLA-A0101 HLA-B0702 HLA-B4001 Random g0 0 197 2 6 205g1 199 1 0 2 202g2 0 0 196 5 201trash 1 2 2 37 42
200 200 200 50
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 43
Dealing with noisy data
50 random sequences are added to the data set
200 binders to 3 MHC class I alleles
3 alleles HLA-A0101 HLA-B0702 HLA-B4001 Random g0 0 197 2 6 205g1 199 1 0 2 202g2 0 0 196 5 201trash 1 2 2 37 42
200 200 200 50
DHHFTPQIINAFGWENAYSQTSYQYLI
ELPIVTPALADKNLIKCS
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 44
Dealing with noisy data
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 45
SH3 domains
- SH3 domains are small interaction modules abundantly found in eukaryotes
- They preferentially bind proline-rich regions (minimum consensus sequence is PxxP)
- Two main classes of ligands exist:
Surface representation of the Hck SH3 domain bound to an artificial high-affinity peptide ligand, in green. Schmidt et al. (2007)
class I+xPxP
class IIPxPx+
+: R or K: hydrophobicx: any amino acid
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 46
SH3 domains - Gibbs clustering
- Large data set of 2,457 unique peptide sequences* binding to Src SH3 domain
- 12 amino acids long peptides, unaligned with respect to the binding core
WVTAPRSLPVLPGSWVVDISNVEDNYSGNRPLPGIWRSPIVRQLPSLPRALPVMPTNGPMAKSRPLPMVGLVVRPLPSPEGFGQYRGRMLPVIWGTRALPLPRAYEGIVPLLPIRNGAVNPVRVLPNIPLSV...
*Kim, T., et al. (2011) MUSI: an integrated system for identifying multiple specificity from large peptide or nucleic acid data sets, Nucleic Acids Res, 1-8.
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 47
SH3 domains
Gibbs clustering - 1 cluster
class I+xPxP
class IIPxPx+
2,350 sequences(107 to trash)
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 48
SH3 domains
Gibbs clustering - 2 clusters
class I+xPxP
class IIPxPx+ (63 to trash)
466 sequences
Class II
1,928 sequences
Class I
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 49
SH3 domains
Gibbs clustering - 3 clusters
class I+xPxP
class IIPxPx+
1,404 sequences 560 sequences
(48 to trash)
445 sequences
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 50
SH3 domains
class I+xPxPclass II
PxPx+
Optimal solution has two
specificities
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 51
http://www.cbs.dtu.dk/services/GibbsCluster
CENTER FO
R BIOLO
GICAL SEQ
UEN
CE ANALYSIS
Department of Systems BiologyTechnical University of Denmark 52
Acknowledgements
• Massimo Andreatta, PhD (for doing all the work)
• John Sidney (La Jolla Institute for Allergy and Immunology, California, USA) for the binding affinity validation of the predicted MHC class I outliers.
• The European Union 7th Framework Program FP7/2007-2013 [grant number 222773]