Data Visualization
http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html
Slide 4
Protein Identification
[Figure: workflow diagram. Sample preparation and measurement: Lysis,
Fractionation, Digestion, LC-MS, MS/MS. Database search: Sequence DB,
Pick Protein, Pick Peptide, All Fragment Masses; Compare, Score, Test
Significance. Repeat for all peptides; repeat for all proteins.]
Slide 5
Search Results
Slide 6
Significance Testing
False protein identifications are caused by random matching. An
objective criterion for testing the significance of protein
identification results is necessary. The significance of protein
identifications can be tested once the distribution of scores for
false results is known.
Slide 7
Distribution of Extreme Values
[Figure: panels for a normal and a skewed distribution, each shown
for n=3, n=10 and n=100.]
Slide 8
Significance Testing - Expectation Values
The majority of sequences in a collection will give a score due to
random matching.
Slide 9
Significance Testing - Expectation Values
[Figure: flow diagram. M/Z, Database Search, List of Candidates,
Distribution of Scores for Random and False Identifications,
Extrapolate and Calculate Expectation Values, List of Candidates
With Expectation Values.]
A Few Characteristics of Analytical Measurements
Accuracy: Closeness of agreement between a test result and an
accepted reference value.
Precision: Closeness of agreement between independent test results.
Robustness: Test precision given small, deliberate changes in test
conditions (preanalytic delays, variations in storage temperature).
Lower limit of detection: The lowest amount of analyte that is
statistically distinguishable from background or a negative control.
Limit of quantification: Lowest and highest concentrations of
analyte that can be quantitatively determined with suitable
precision and accuracy.
Linearity: The ability of the test to return values that are
directly proportional to the concentration of the analyte in the
sample.
Slide 13
Measuring Blanks
Slide 14
Coefficient of Variation
[Equations: Sample Mean, Variance, Coefficient of Variation (CV).]
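As a small sketch of the three quantities named above: the CV is the standard deviation divided by the sample mean. The slide's exact formulas did not survive extraction, so the use of the sample variance with an n-1 denominator is an assumption here, and the function name is illustrative.

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / sample mean.

    Assumption: sample variance with an n-1 denominator (statistics.variance).
    """
    sample_mean = statistics.mean(values)
    variance = statistics.variance(values)
    return variance ** 0.5 / sample_mean

# Hypothetical replicate measurements: mean 10, sample variance 1.
cv = coefficient_of_variation([9.0, 10.0, 11.0])
```

The CV is often reported as a percentage, so a value of 0.1 corresponds to 10%.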
Slide 15
Lower Limit of Detection
The lowest amount of analyte that is statistically distinguishable
from background or a negative control.
Two methods to determine the lower limit of detection:
1. Lowest concentration of the analyte where the CV is less than,
for example, 20%.
2. Determine the level of the blank by taking the 95th percentile of
the blank measurements and add a constant times the standard
deviation of the lowest concentration.
K. Linnet and M. Kondratovich, Partly Nonparametric Approach for
Determining the Limit of Detection, Clinical Chemistry 50 (2004)
732-740.
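The two methods above can be sketched as follows. This is an illustration under stated assumptions, not the procedure from the cited paper: the function names and data layout are hypothetical, the 20% CV cutoff is the slide's example value, and the default constant k = 1.645 (the one-sided 95% normal quantile) is an assumed placeholder for the slide's unspecified "constant".

```python
import statistics

def cv(values):
    """Coefficient of variation using the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values)

def lod_by_cv(replicates_by_concentration, max_cv=0.20):
    """Method 1: lowest concentration whose replicate CV is below max_cv."""
    passing = [conc for conc, values in replicates_by_concentration.items()
               if cv(values) < max_cv]
    return min(passing) if passing else None

def lod_by_blank(blank_measurements, lowest_sample_measurements, k=1.645):
    """Method 2: 95th percentile of the blanks plus k times the SD of the
    lowest-concentration sample. The result is in measurement units.
    """
    # With n=20 there are 19 cut points; index 18 is the 95th percentile.
    blank_level = statistics.quantiles(blank_measurements, n=20,
                                       method="inclusive")[18]
    return blank_level + k * statistics.stdev(lowest_sample_measurements)

# Hypothetical data for illustration only.
replicates = {
    1.0: [0.5, 1.5, 1.0, 2.0],
    5.0: [4.5, 5.5, 5.0, 5.2],
    10.0: [9.8, 10.1, 10.3, 9.9],
}
blanks = list(range(1, 21))
lod_1 = lod_by_cv(replicates)
lod_2 = lod_by_blank(blanks, replicates[1.0])
```

Method 1 returns a concentration, while method 2 returns a signal level that still has to be converted to a concentration through the calibration curve.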
Slide 16
Limit of Detection and Linearity
[Figure: Measured Concentration versus Theoretical Concentration.]
Slide 17
Precision and Accuracy
[Figure: Measured Concentration versus Theoretical Concentration.]
Slide 18
A Data Set with Two Samples
Slide 19
A proteomics example: no replicates
Slide 20
A proteomics example: three replicates
[Figure: panels for no replicates and three replicates. Axes shown:
Log 2 Spectrum Count Ratio versus Log 2 Sum Spectrum Count, and
Log 2 Standard Deviation versus Log 2 Average Spectrum Count.]
Slide 21
How Different are Two Measurements?
Slide 22
A Data Set with Seven Samples
[Figure: 3 replicates; 3 replicates + one more replicate a few
months later; normalized.]
Slide 23
A Data Set with Seven Samples
Slide 24
Slide 25
Box Plot
M. Krzywinski & N. Altman, Visualizing samples with box plots,
Nature Methods 11 (2014) 119
Box Plots with All the Data Points
[Figure: box plots with overlaid data points for four distribution
types (complex, normal, skewed, long tails), each at n=5, n=10 and
n=100.]
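The numbers behind a box plot can be computed with a short sketch like the one below. It assumes the common Tukey convention (box from first to third quartile, whiskers to the most extreme points within 1.5 times the interquartile range, remaining points drawn as outliers); the slides do not state which whisker convention was used, and the function name is illustrative.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR convention."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical sample with one extreme value.
summary = box_plot_summary(list(range(1, 12)) + [100])
```

With only a handful of points (n=5), the quartiles are poorly determined, which is one reason to plot the individual data points alongside the box.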
Slide 28
Box Plots, Scatter Plots and Bar Graphs: Normal Distribution
[Figure: three panels with error bars showing standard deviation,
standard deviation and standard error.]
Slide 29
Box Plots, Scatter Plots and Bar Graphs: Skewed Distribution
[Figure: three panels with error bars showing standard deviation,
standard deviation and standard error.]
Slide 30
Box Plots, Scatter Plots and Bar Graphs: Distribution with Fat Tail
[Figure: three panels with error bars showing standard deviation,
standard deviation and standard error.]
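The two kinds of error bars in these panels differ by a factor of the square root of the sample size. A minimal sketch, assuming the sample standard deviation (n-1 denominator) and using a hypothetical function name and data:

```python
import math
import statistics

def error_bar_half_widths(values):
    """Return (standard deviation, standard error of the mean)."""
    sd = statistics.stdev(values)
    se = sd / math.sqrt(len(values))
    return sd, se

# Hypothetical measurements.
sd, se = error_bar_half_widths([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The standard deviation describes the spread of the data and does not shrink with more measurements, while the standard error describes the uncertainty of the mean and does, so a figure should always state which one is drawn.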
Slide 31
Venn Diagrams
Slide 32
TCGA Unsupervised mRNA Expression Analysis
The Cancer Genome Atlas Network, Comprehensive molecular portraits
of human breast tumors. Nature. 490 (7418):61-70.
Slide 33
Correlations between mRNA and protein abundance in TCGA colon tumors
B Zhang et al. Nature 000, 1-6 (2014) doi:10.1038/nature13438
Slide 34
The Effect of Copy Number Alterations
B Zhang et al. Nature 000, 1-6 (2014) doi:10.1038/nature13438
Slide 35
The Effect of Copy Number Alterations
Slide 36
Testing Multiple Hypotheses
Is the concentration of calcium/calmodulin-dependent protein kinase
type II different between the two samples?
Which protein concentrations are different between the two samples?
$p = 2 \times 10^{-6}$
The p-value needs to be corrected to take into account that we
perform many tests.
Bonferroni correction: multiply the p-value by the number of tests
performed (n): $p_{\mathrm{corr}} = p_{\mathrm{uncorr}} \times n$
In this case 3685 proteins are identified, so the Bonferroni
corrected p-value for calcium/calmodulin-dependent protein kinase
type II is $p_{\mathrm{corr}} = 2 \times 10^{-6} \times 3685 = 0.007$
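The arithmetic of the Bonferroni correction can be checked with a few lines. The function name is illustrative, and capping the corrected value at 1 is a common convention that the slide does not mention.

```python
def bonferroni(p_uncorrected, n_tests):
    """Bonferroni-corrected p-value: p times the number of tests, capped at 1."""
    return min(1.0, p_uncorrected * n_tests)

# Values from the slide: p = 2e-6 with 3685 identified proteins.
p_corrected = bonferroni(2e-6, 3685)
```

The corrected value is about 0.007, so this protein would still pass a 0.05 significance threshold after correction.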
Slide 37
Testing Multiple Hypotheses
The p-value distribution is uniform when testing differences between
samples from the same distribution.
[Figure: histograms of p-values (x-axis: p-value, 0 to 1; y-axis:
# of tests) for 100, 1,000 and 10,000 tests. Normal distribution,
sample size = 10.]
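The uniformity of p-values under the null hypothesis can be reproduced with a small simulation. This is a sketch under assumptions: the slides do not say which test was used, so a two-sided z-test with known standard deviation is substituted here to keep the example within the standard library; the function names, seed and bin count are arbitrary.

```python
import math
import random
from statistics import NormalDist, mean

def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value for a difference in means.

    Assumptions: equal sample sizes and a known standard deviation sigma.
    """
    n = len(sample_a)
    z = (mean(sample_a) - mean(sample_b)) / (sigma * math.sqrt(2.0 / n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def null_p_value_histogram(n_tests, sample_size=10, bins=10, seed=0):
    """Histogram of p-values when both samples come from the same N(0, 1)."""
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(n_tests):
        a = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        b = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        p = z_test_p_value(a, b)
        counts[min(int(p * bins), bins - 1)] += 1
    return counts

counts = null_p_value_histogram(10_000)
```

With 10,000 tests each of the ten bins holds roughly 1,000 p-values; with only 100 tests the histogram is much noisier, as in the first panel of the slide.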
Slide 38
Testing Multiple Hypotheses
The p-value distribution is uniform when testing differences between
samples from the same distribution.
[Figure: histograms of p-values (x-axis: p-value, 0 to 1; y-axis:
# of tests) for 100, 1,000 and 10,000 tests. Normal distribution,
sample size = 10; 30 tests from a distribution with a different mean
($\mu_1 - \mu_2 \gg \sigma$).]
Slide 39
Testing Multiple Hypotheses
Controlling for False Discovery Rate (FDR)
[Figure: False Discovery Rate (y-axis: False Rate, 0 to 1) versus
p-value (x-axis, 0 to 1) for 100, 1,000 and 10,000 tests. Normal
distribution, sample size = 10; 30 tests from a distribution with a
different mean ($\mu_1 - \mu_2 \gg \sigma$).]
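Because null p-values are uniform, the number of false positives expected below a p-value threshold is roughly the threshold times the number of tests, which gives a simple FDR estimate. The sketch below is one common estimator, not necessarily the calculation behind the slide; it conservatively treats every test as if it could be null, and the function name and example p-values are hypothetical.

```python
def estimated_fdr(p_values, threshold):
    """Estimate the FDR among tests with p <= threshold.

    Expected false positives are approximated as threshold * number of tests,
    which assumes uniform null p-values and treats all tests as null.
    """
    n_called = sum(1 for p in p_values if p <= threshold)
    if n_called == 0:
        return 0.0
    expected_false = threshold * len(p_values)
    return min(1.0, expected_false / n_called)

# Hypothetical p-values: 30 strong effects plus 70 evenly spread null tests.
p_values = [0.001] * 30 + [(i + 0.5) / 70 for i in range(70)]
fdr_at_006 = estimated_fdr(p_values, 0.06)
```

Here 34 tests fall below the threshold and about 6 false positives are expected, so the estimated FDR is about 0.18.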
Slide 40
Testing Multiple Hypotheses
False Discovery Rate (FDR) and False Negative Rate (FNR)
[Figure: False Discovery Rate and False Negative Rate (y-axis: False
Rate, 0 to 1) versus p-value (x-axis, 0 to 1). Normal distribution,
sample size = 10, 100 tests, 30 tests from a distribution with a
different mean. Panels: $\mu_1 - \mu_2 = 2\sigma$,
$\mu_1 - \mu_2 = \sigma$, $\mu_1 - \mu_2 = \sigma/2$.]
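The trade-off between the two rates can be explored with a simulation in which the truth is known. This is a sketch under assumptions: a two-sided z-test with known standard deviation stands in for whatever test the slides used, FDR is taken as false positives divided by all positives, and FNR is taken as the fraction of true differences that are missed (the slides do not give their definition). All names and default values are illustrative.

```python
import math
import random
from statistics import NormalDist, mean

def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value, assuming equal sample sizes and known sigma."""
    n = len(sample_a)
    z = (mean(sample_a) - mean(sample_b)) / (sigma * math.sqrt(2.0 / n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def simulate_rates(effect, n_tests=100, n_true=30, sample_size=10,
                   threshold=0.05, seed=0):
    """Return (FDR, FNR) at a p-value threshold for simulated data.

    The first n_true tests have means that differ by `effect` (in units of
    sigma = 1); the remaining tests compare samples from the same distribution.
    """
    rng = random.Random(seed)
    tp = fp = fn = 0
    for i in range(n_tests):
        is_true = i < n_true
        shift = effect if is_true else 0.0
        a = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        b = [rng.gauss(shift, 1.0) for _ in range(sample_size)]
        called = z_test_p_value(a, b) <= threshold
        if called and is_true:
            tp += 1
        elif called:
            fp += 1
        elif is_true:
            fn += 1
    fdr = fp / (tp + fp) if tp + fp else 0.0
    fnr = fn / n_true if n_true else 0.0
    return fdr, fnr

fdr_large, fnr_large = simulate_rates(2.0, n_tests=1000, n_true=300)
fdr_small, fnr_small = simulate_rates(0.5, n_tests=1000, n_true=300)
```

With a mean difference of two standard deviations almost every true difference is detected, while at half a standard deviation most are missed at the same threshold.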
Slide 41
Proteomics Informatics Data Analysis and Visualization (Week
13)