CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS
BiC BioCentrum-DTU, Technical University of Denmark
Prediction of significant positions in biological sequences
Ilka Hoof, Ph.D. student
Immunological Bioinformatics
Center for Biological Sequence Analysis
Danmarks Tekniske Universitet
Significant positions?
HIV-1 gp120
PDB: 2NY7
Antibody-binding site?
Significant positions?
HIV-1 protease
PDB: 2CEN
Catalytic efficiency?
Significant positions?
“Which sites in HIV-1 protease contribute significantly to the fitness level of an HIV-1 mutant?”
“Where is the binding site of a specific antibody located on the antigen?”
“Which sites are important for enzymatic activity?”
Given: a multiple sequence alignment and a numerical value associated with each sequence. The values imply a ranking of the sequences.
What we’re interested in: Which positions distinguish high-ranking from low-ranking sequences?
e.g. binders vs. non-binders, high vs. low fitness, high vs. low enzymatic activity
The data we have
The output we want
...how do we get there?
SigniSite 1.0
http://www.cbs.dtu.dk/services/SigniSite/
SigniSite - method
Rank-based statistical test
real-valued data    rank
0.002               1.0
0.084               2.0
0.128               3.0
0.273               4.0
0.593               5.5
0.593               5.5
0.892               7.0
0.923               8.0
0.999               9.0
Tied values (here 0.593) share the average of the ranks they occupy.
Calculate the mean rank for each residue type.
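The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the SigniSite implementation: the function names and the example column "AAAAEAEEE" are hypothetical, and only the nine data values come from the slide.

```python
from collections import defaultdict

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mean_rank_per_residue(column, ranks):
    """Mean rank of the sequences carrying each residue type in one column."""
    groups = defaultdict(list)
    for residue, rank in zip(column, ranks):
        groups[residue].append(rank)
    return {res: sum(r) / len(r) for res, r in groups.items()}

# Values from the slide; the column of residues is a made-up example.
values = [0.002, 0.084, 0.128, 0.273, 0.593, 0.593, 0.892, 0.923, 0.999]
ranks = average_ranks(values)
column = "AAAAEAEEE"  # hypothetical residues at one alignment position
mean_ranks = mean_rank_per_residue(column, ranks)
print(ranks)
print(mean_ranks)
```

In this made-up column, residue E sits mostly in the high-ranking sequences, so its mean rank lies above the overall mean rank of 5.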
SigniSite - the method
What’s the null hypothesis of our statistical test?
The observed mean rank of a residue type does not significantly deviate from the expected mean rank.
What is expected?
We assume a random distribution of the amino acids in the column. Given N sequences, the expected mean rank is $\mu = \frac{N+1}{2}$.
Z score determines significance
Given the shape of the distribution, what’s significant?
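[Figure: distribution of mean ranks with the mean, the standard deviation (sd) and the observed mean rank marked; the area beyond Z = +1.96 corresponds to p < 0.025]
The Z score can be calculated from the mean and the standard deviation:
$Z = \frac{\text{obs. rank} - \text{mean}}{\text{sd}}$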
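As a small sketch of how a Z score maps to a p-value, the following uses only the Python standard library. It assumes the mean ranks are approximately normally distributed; the function names and the numbers in the usage lines are illustrative, not taken from SigniSite.

```python
import math

def z_score(observed_mean_rank, expected_mean_rank, sd):
    """Distance of the observed mean rank from its expectation, in units of sd."""
    return (observed_mean_rank - expected_mean_rank) / sd

def upper_tail_p(z):
    """One-sided p-value P(Z >= z) under the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def two_sided_p(z):
    """Two-sided p-value P(|Z| >= |z|) under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: observed mean rank 7.375, expected 5.0, sd 1.25.
z = z_score(7.375, 5.0, 1.25)
print(z, upper_tail_p(z), two_sided_p(z))
```

A Z score of +1.96 leaves an upper-tail area of about 0.025, which is the threshold shown on the slide; taking both tails together gives the familiar 0.05.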
Z score determines significance
[Figure: observed mean rank for E]
Are the random mean ranks normally distributed?
Same mean, but different standard deviation
Frequencies: 0.5, 0.25, 0.1, 0.05
Mean rank distributions for different frequencies
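The effect can be reproduced with a small simulation. The sketch below assumes N = 200 sequences without ties and draws, for each residue frequency, random subsets of ranks of the corresponding size; the function name, N, the number of trials and the seed are arbitrary choices for illustration.

```python
import random
import statistics

def simulate_mean_ranks(n_sequences, frequency, n_trials=2000, seed=0):
    """Mean and sd of the mean rank of a residue placed at random in a column."""
    rng = random.Random(seed)
    n_residue = max(1, round(frequency * n_sequences))
    all_ranks = list(range(1, n_sequences + 1))
    means = []
    for _ in range(n_trials):
        # Random assignment: the residue lands on n_residue random sequences.
        sample = rng.sample(all_ranks, n_residue)
        means.append(sum(sample) / n_residue)
    return statistics.mean(means), statistics.stdev(means)

N = 200
results = {f: simulate_mean_ranks(N, f) for f in (0.5, 0.25, 0.1, 0.05)}
for f, (mu, sd) in results.items():
    print(f"frequency {f}: mean {mu:.1f}, sd {sd:.1f}")
```

All four simulated means stay close to (N + 1) / 2 = 100.5, while the spread grows as the residue becomes rarer, which is why a single standard deviation cannot be used for every residue type.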
How to estimate the standard deviation?
Our test resembles the Wilcoxon rank-sum statistic: given two samples of size $n_1$ and $n_2$, with $n_1 + n_2 = N$,
let R be the mean rank of sample 1.
The distribution of mean ranks R can be approximated by the normal distribution
with mean
$\mu = \frac{N+1}{2}$
and standard deviation
$\sigma = \sqrt{\frac{n_2 (N+1)}{12\, n_1}}$
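The mean and standard deviation of the mean rank can be checked by brute force for a small case. The sketch below uses the standard Wilcoxon rank-sum results for untied ranks, (N + 1) / 2 and sqrt(n2 (N + 1) / (12 n1)), and compares them with an exhaustive enumeration of all subsets; the function names are illustrative.

```python
import itertools
import math
import statistics

def expected_mean_rank(n_total):
    """Expected mean rank of any subset of N untied ranks: (N + 1) / 2."""
    return (n_total + 1) / 2

def mean_rank_sd(n1, n_total):
    """Standard deviation of the mean rank of a random subset of size n1."""
    n2 = n_total - n1
    return math.sqrt(n2 * (n_total + 1) / (12 * n1))

def exact_mean_rank_moments(n1, n_total):
    """Enumerate every subset of size n1 (only feasible for small N)."""
    ranks = range(1, n_total + 1)
    means = [sum(s) / n1 for s in itertools.combinations(ranks, n1)]
    return statistics.mean(means), statistics.pstdev(means)

exact_mean, exact_sd = exact_mean_rank_moments(n1=3, n_total=8)
formula_mean, formula_sd = expected_mean_rank(8), mean_rank_sd(3, 8)
print(exact_mean, formula_mean)
print(exact_sd, formula_sd)
```

The two formulas are exact for the mean and the standard deviation; the approximation lies only in treating the shape of the distribution as normal.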
Coping with ties
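Formula as before, but weighted with a tie-correction factor T:
$\sigma = \sqrt{T \cdot \frac{n_2 (N+1)}{12\, n_1}}$
where
$T = 1 - \frac{\sum_{i=1}^{m} (t_i^3 - t_i)}{N^3 - N}$
and t is a vector which contains the counts of ties, i.e. $t_i$ is the number of observations sharing the i-th distinct value, and m denotes the number of distinct values in the data set.
Example: all values the same => T = 0; all values different => T = 1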
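A minimal sketch of the tie correction, assuming the standard rank-sum form T = 1 - sum(t_i^3 - t_i) / (N^3 - N), which reproduces both limiting cases named on the slide. The function names are illustrative, and returning T = 1 for fewer than two values is a convention chosen here to avoid dividing by zero.

```python
import math
from collections import Counter

def tie_correction(values):
    """Tie-correction factor T: 1 if all values differ, 0 if all are equal."""
    n = len(values)
    if n < 2:
        return 1.0  # convention chosen for this sketch; N^3 - N would be 0
    tie_counts = Counter(values).values()  # t_i for each distinct value
    return 1 - sum(t**3 - t for t in tie_counts) / (n**3 - n)

def tie_corrected_sd(n1, values):
    """Standard deviation of the mean rank of n1 sequences, corrected for ties."""
    n = len(values)
    n2 = n - n1
    return math.sqrt(tie_correction(values) * n2 * (n + 1) / (12 * n1))

values = [0.002, 0.084, 0.128, 0.273, 0.593, 0.593, 0.892, 0.923, 0.999]
T = tie_correction(values)
print(T, tie_corrected_sd(4, values))
```

With a single pair of tied values among nine, T is only slightly below 1, so the correction matters mainly for data with many repeated values.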
Simple example
[Figure: example data divided into category 1 and category 2]
Simple example
Tie correction vs. no tie correction
[Table: standard deviation and Z score, with and without tie correction]
Multiple testing problem
We perform a significance test for each amino acid type in each column.
Problem: The more hypotheses we test, the higher the probability of obtaining at least one false positive.
Each test is performed with the same type-I error, e.g. $\alpha = 0.05$. For independent tests, the total significance level $\alpha_{tot}$ of m significance tests is then given by
$\alpha_{tot} = 1 - (1 - \alpha)^m$
Examples:
1 test: $\alpha_{tot} = 1 - (1 - 0.05)^{1} = 0.05$
2 tests: $\alpha_{tot} = 1 - (1 - 0.05)^{2} = 0.0975$
100 tests: $\alpha_{tot} = 1 - (1 - 0.05)^{100} \approx 0.99$
Correction for multiple testing necessary!
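The three examples can be recomputed directly. This short sketch assumes independent tests, so that the probability of at least one false positive is 1 - (1 - alpha)^m; the function name is illustrative.

```python
def familywise_error(alpha, m):
    """Probability of at least one false positive among m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 2, 100):
    print(m, round(familywise_error(0.05, m), 4))
```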
How many statistical tests are performed?
One test per amino acid type and column:
$m = \sum_{i=1}^{L} w_i$
where $w_i$ is the number of different amino acids in column i and L is the number of columns in the alignment.
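Counting the tests for a given alignment is a one-pass operation over the columns. In this sketch the alignment is a list of equal-length strings, and gap characters are not counted as a residue type; both the gap handling and the toy alignment are assumptions made for illustration, not documented SigniSite behaviour.

```python
def count_tests(alignment, gap="-"):
    """Number of tests m: sum over columns of the distinct residue types w_i."""
    n_columns = len(alignment[0])
    if any(len(seq) != n_columns for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    m = 0
    for i in range(n_columns):
        residues = {seq[i] for seq in alignment} - {gap}
        m += len(residues)  # w_i for column i
    return m

alignment = ["ACD-", "ACE-", "GCEK"]  # hypothetical toy alignment
m = count_tests(alignment)
print(m)
```

Passing `gap=None` counts the gap symbol as its own type, which raises m by one for every column that contains a gap.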
Correction for multiple testing
Adjusted p-values using Bonferroni’s single-step method:
Multiply all unadjusted p-values by the number of tests m.
Adjusted p-values are given by
$\tilde{p}_j = \min(m\, p_j,\, 1)$ for j = 1, ..., m
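A minimal sketch of the Bonferroni adjustment in Python, with the result capped at 1 so that it remains a valid probability. The function name and the example p-values are illustrative.

```python
def bonferroni(p_values):
    """Bonferroni single-step adjustment: multiply by m, cap at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

adjusted = bonferroni([0.01, 0.04, 0.5])
print(adjusted)
```

A test is then called significant if its adjusted p-value is below the chosen alpha, which is equivalent to comparing the unadjusted p-value with alpha / m.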
Correction for multiple testing
Adjusted p-values using Holm’s step-down method:
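Order the observed unadjusted p-values: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$
Adjusted p-values are given by
$\tilde{p}_{(j)} = \max_{k = 1, \dots, j} \min\big((m - k + 1)\, p_{(k)},\, 1\big)$ for j = 1, ..., m
So, nothing more than:
$\tilde{p}_{(1)} = \min\big(m\, p_{(1)},\, 1\big)$
$\tilde{p}_{(2)} = \max\big(\tilde{p}_{(1)},\, \min((m - 1)\, p_{(2)},\, 1)\big)$
$\vdots$
$\tilde{p}_{(m)} = \max\big(\tilde{p}_{(m-1)},\, \min(p_{(m)},\, 1)\big)$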
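Holm's step-down procedure can be sketched as a single pass over the sorted p-values, keeping a running maximum so that the adjusted values never decrease. The function name and the example p-values are illustrative.

```python
def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for k, idx in enumerate(order):  # k = 0 for the smallest p-value
        candidate = min(1.0, (m - k) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

p = [0.01, 0.04, 0.03, 0.005]
holm_adjusted = holm(p)
print(holm_adjusted)
```

Only the smallest p-value is multiplied by the full m; each later one gets a smaller multiplier, so a Holm-adjusted p-value is never larger than the Bonferroni-adjusted one.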
Application of SigniSite
Ab-binding affinity to HIV-1 gp120
Alignment length: 569 residues
SigniSite web service
SigniSite results
10 significant sites identified.
Holm step-down correction, α = 0.05
Heatmap
SigniSite results
Sequence logos
display Z score for all amino acid types
display Z score only for significant amino acid types
“ordinary” frequency logo
SDPpred
http://math.genebee.msu.ru/~psn/index.htm
Kalinina et al. (2004), Protein Sci 13(2): 443-56
SDPpred
• Categories instead of continuous values
• Mutual information
• Amino acids with similar physico-chemical properties are weakly penalized
• Statistical test: observed mutual inf. = expected mutual inf.?
$I_p = \sum_{i=1}^{N} \sum_{\alpha=1}^{20} f_p(\alpha, i) \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)}$
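The mutual information of one alignment column can be sketched as follows. The sketch reads f_p(α, i) as the fraction of sequences that carry residue α at position p and belong to category i, f_p(α) as the overall fraction with residue α at that position, and f(i) as the fraction of sequences in category i; this reading is an assumption, as the slide does not define the symbols. It uses raw frequencies and the natural logarithm, so it leaves out the weak penalty for physico-chemically similar residues, and it is not the SDPpred implementation.

```python
import math
from collections import Counter

def column_mutual_information(column, categories):
    """Mutual information between residue type and category in one column."""
    n = len(column)
    joint = Counter(zip(column, categories))
    residue_counts = Counter(column)
    category_counts = Counter(categories)
    mi = 0.0
    for (residue, category), count in joint.items():
        f_joint = count / n
        f_residue = residue_counts[residue] / n
        f_category = category_counts[category] / n
        mi += f_joint * math.log(f_joint / (f_residue * f_category))
    return mi

categories = [1, 1, 1, 1, 2, 2, 2, 2]  # hypothetical category labels
separating = column_mutual_information("AAAACCCC", categories)
uninformative = column_mutual_information("ACACACAC", categories)
print(separating, uninformative)
```

A column whose residues split exactly along the categories reaches the maximum for two equally sized categories, log 2, while a column whose residues are spread evenly over both categories scores 0.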
SDPpred - Results
Conclusion
• You can use SigniSite and SDPpred to find sites of interest in your biological data
• Logos are a nice and clear way of displaying sequence information
• Whenever you perform statistical tests, remember the multiple testing problem!