Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
jhu
Biostatistics and Computational Biology
Some selected examples . . . and a bit of R and Bioconductor
Ingo Ruczinski
Department of BiostatisticsJohns Hopkins Bloomberg School of Public Health
December 7, 2009
Ingo Ruczinski Biostatistics and Computational Biology 1
jhu
Bioinformatics and computational biology
Wikipedia:
Bioinformatics and computational biology involve the use of
techniques including applied mathematics, informatics, statistics,
computer science, artificial intelligence, chemistry, and biochemistry
to solve biological problems usually on the molecular level.
· · ·Major research efforts in the field include sequence alignment, gene
finding, genome assembly, protein structure alignment, protein
structure prediction, prediction of gene expression and protein-protein
interactions, and the modeling of evolution.
Ingo Ruczinski Biostatistics and Computational Biology 2
jhu
Bioinformatics and computational biology
Wikipedia:
The terms bioinformatics and computational biology are often used
interchangeably. However bioinformatics more properly refers to the
creation and advancement of algorithms, computational and statistical
techniques, and theory to solve formal and practical problems
inspired from the management and analysis of biological data.
Computational biology, on the other hand, refers to hypothesis-driven
investigation of a specific biological problem using computers, carried
out with experimental or simulated data, with the primary goal of
discovery and the advancement of biological knowledge.
Ingo Ruczinski Biostatistics and Computational Biology 3
jhu
Bioinformatics and computational biology
NIH definition of Bioinformatics and Computational Biology:
Bioinformatics and computational biology are rooted in life sciences
as well as computer and information sciences and technologies. Both
of these interdisciplinary approaches draw from specific disciplines
such as mathematics, physics, computer science and engineering,
biology, and behavioral science.
· · ·Bioinformatics applies principles of information sciences and
technologies to make the vast, diverse, and complex life sciences
data more understandable and useful. Computational biology uses
mathematical and computational approaches to address theoretical
and experimental questions in biology. Although bioinformatics and
computational biology are distinct, there is also significant overlap
and activity at their interface.
Ingo Ruczinski Biostatistics and Computational Biology 4
jhu
Bioinformatics and computational biology
NIH definition of Bioinformatics and Computational Biology:
The NIH Biomedical Information Science and Technology InitiativeConsortium agreed on the following definitions of bioinformatics and
computational biology recognizing that no definition could completelyeliminate overlap with other activities or preclude variations in
interpretation by different individuals and organizations.
Bioinformatics: Research, development, or application ofcomputational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those toacquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of
data-analytical and theoretical methods, mathematical modeling and
computational simulation techniques to the study of biological,
behavioral, and social systems.
Ingo Ruczinski Biostatistics and Computational Biology 5
jhu
2D gel electrophoresis
Ingo Ruczinski Biostatistics and Computational Biology 6
jhu
2D gel electrophoresis
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
1300
1200
1100
1000
900
800
700
600
500
400
300
200
100
0
1300
1200
1100
1000
900
800
700
600
500
400
300
200
100
0
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
1300
1200
1100
1000
900
800
700
600
500
400
300
200
100
0
1300
1200
1100
1000
900
800
700
600
500
400
300
200
100
0
Ingo Ruczinski Biostatistics and Computational Biology 7
jhu
2D gel electrophoresis
A:1 A:2 A:3 A:4 A:5 A:6 A:7 A:8 A:9 A:10 A:11 A:12 B:1 B:2 B:3 B:4 B:5 B:6 B:7 B:8 B:9 B:10 B:11 B:12
A B
Ingo Ruczinski Biostatistics and Computational Biology 8
jhu
2D gel electrophoresis
A:1 A:2 A:3 A:4 B:1 B:2 B:3 B:4
A B
Ingo Ruczinski Biostatistics and Computational Biology 9
jhu
2D gel electrophoresis
!20
0
20
40
60
% r
eduction o
f concentr
ation a
s c
om
pare
d to b
ackgro
und 1
st Trimester
3rd
Trimester
Folate Placebo
Ingo Ruczinski Biostatistics and Computational Biology 10
jhu
2D gel electrophoresis
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
1300
1200
1100
1000
900
800
700
600
500
400
300
200
100
0
1300
1200
1100
1000
900
800
700
600
500
400
300
200
100
0
Ingo Ruczinski Biostatistics and Computational Biology 11
jhu
2D gel electrophoresis
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
1300
1200
1100
1000
900
800
700
600
500
400
300
200
100
0
1300
1200
1100
1000
900
800
700
600
500
400
300
200
100
0
790
1056
853
662
1458
948
1026
770
586
768
797
332
248
Ingo Ruczinski Biostatistics and Computational Biology 12
jhu
Karyotypes
Ingo Ruczinski Biostatistics and Computational Biology 13
jhu
Trisomy
Ingo Ruczinski Biostatistics and Computational Biology 14
jhu
Karyotypes
General Cytogenetics Information http://members.aol.com/chrominfo/
Ingo Ruczinski Biostatistics and Computational Biology 15
jhu
FISH
Courtesy of the Pevsner Laboratory
Ingo Ruczinski Biostatistics and Computational Biology 16
jhu
DNA changes
Ingo Ruczinski Biostatistics and Computational Biology 17
jhu
Single nucleotide polymorphisms
Ingo Ruczinski Biostatistics and Computational Biology 18
jhu
SNP chip data
Ingo Ruczinski Biostatistics and Computational Biology 19
jhu
Deletion
0.5
1
22
3
44
5
66
88
0 20 40 60 80 100 120
Chr 11
0 20 40 60 80 100 120
Chr 12
co
py n
um
be
r
Mb
AA
AB
BB
Ingo Ruczinski Biostatistics and Computational Biology 20
jhu
Amplification
1
22
3
44
6
8
10
0 50 100 150
Chr 7
co
py n
um
be
r
Mb
AA
AB
BB
Ingo Ruczinski Biostatistics and Computational Biology 21
jhu
Uniparental isodisomy
11
22
33
44
55
0 50 100 150 200 250
Chr 1
0 50 100 150 200 250
Chr 2
co
py n
um
be
r
Mb
AA
AB
BB
Ingo Ruczinski Biostatistics and Computational Biology 22
jhu
A versus B plots
7 8 9 10 11 12 13
7
8
9
10
11
12
13
SNP_A!4251622
log2(A)
log
2(B)
plate 1
plate 2
20
21
22
23
SNP_A!4251622
log
2(A)+
log
2(B)
pla
te2
pla
te1
plates ordered by median date
Ingo Ruczinski Biostatistics and Computational Biology 23
jhu
A versus B plots
7 8 9 10 11 12 13
7
8
9
10
11
12
13
SNP_A!8348190
A
B
3 copies
2 copies
Ingo Ruczinski Biostatistics and Computational Biology 24
jhu
Trisomy 21
0
1
2
3
4
5
6
Chr 21co
py n
um
be
r
samples
Samples from Aravinda Chakravarti and Betty Doan
Ingo Ruczinski Biostatistics and Computational Biology 25
jhu
Trisomy 21
0
1
2
3
4
5
6
Chr 21
co
py n
um
be
r
samples
Samples from Aravinda Chakravarti and Betty Doan
Ingo Ruczinski Biostatistics and Computational Biology 26
jhu
Prediction regions for copy number
1
2
1
3
1 1
1
1
1
1
1
1
1
1
2
11
1
111
1
1
11
1
2
2
2
1
1
11
1
1
1
11
11
1 1
2
1
2
1
11
11
1
3
1
1
11
1
2
1
1
11
2
1
1
1
2
1
1
1
2
1
1
1
1
2
1
11
11
1
11
1 1
1
1
1
11
11
1
1
1
11
1
1
2
11
1
1
1
1
11
1 1
1
11
1
1
1
2
2
1
2
1
1
1
2
1
1
11
1
2
2
1
3
1
1 11
1
1
1 1
11
3
1
2
1
2
11
1
1
1
1
11
1
1
1
2
2
1
1
1
1
1
1
2
1
111
1
1
11
11
1
2
1
1
1
1
1
1
11
22
1
1
111
11
2
11 1
11
1
3
11
2
22
1
1 11
1
11
1
2
11
1
22
1
1
1
2
1
1
1
2
1
1
11
1
11
1
11
1
2
2
1
11
111
1
2
1
1
11
2
1
11
1
2
1 11
1
11
111
22
1
1
1
2
1
1
1
11
1
1
1
1
1
1
2
1
1
11
1
11
1
1
1
2
1
1
1
1
11
11
1
1
2
1
1
2
2
111
1
1
2
2
2
2
1
11
1
2
11
111
1
1
1
1
1
3
1
1
1
1
2
1
2
1
1
11
1
1
2
1
111
1
2
111
1
1
1
2
1
1
1
1
1
1
1
1
1
1
11
1
1
11
111
1
2
1
1
2
1
2
2
2
11
1
11
3
1
2
1
2
1
2
3
1
2
1
11
1
2
1
1
11
1
2
2
11
1 1
1
11
1
1
1
11
1
1
11
11
1
11
1
1
1
2
1
1
1
1
1
1
1
1
2
1111
1
1
11
1
1
1
1
1 1
1
1
1
11
1
1
11
1
1
1
11
1
11
1
1
2
2
11
2
1
1
1
1
11
1
1
11
1
11
2
1
2
1
1
1
1
11
1
1
1
2
1
11
11
1
11
1
1
1
11
2
1 1
1
1
2
2
1
1
1
2
1
1
11
1
2
1
1
1
1
1
2
1
11
1
2
11
1
1
11
11
1
1
1
1
11 1
1
1
1
2
1
1
1
2
1
2
1
11
1
1
1
11
1
1
1
1
3
1
1
1
1
1
1
1
1
11
2
11
1
1
1
1
11
111
2
1
1
1111
1
1
2
1 1
1
1
1
1
1
1 1
2
11
111
1
1
11
1
1
1
1
3
11
1
1
1
1
1
11
11
2
1
11
1
1
11 1
2
1
11 1
1
1
1
2
1
1
1
2
1
2
11
11
1
1
2
1
1
1
1
1
1
1
111
2
1
2
111
1
1
1
111 1
1
1
2
1
1
11
1
11 1
2
1
1
1
1
1 1
1
11
1
1
1
11
11
2
2
1
2
1
11
11
1
1
1
1
1
1
1
1
1
2
1
1
11
2
1
1 1
1
1 11
1
2
1
1 1
1
11 111 11
1
2
1
1
1
2
2
11
1
1
2
1
1 1
2
11
1
2
111
1
1
1
11
2
1
1
1
1
1
1 111
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
11
1
2
1
1
1
1
1
1
1
1
1
1
1
1 1
1
11 1
1
1
2
1 1
1
1
22
111
1 11
1
2
1
1
2
1
1
1
1
1
1
11
1
1
1
2
1 11
1
1
2
1
11
2
1
11
1 1
1
11
1
1
1
1
2
1
1
1
1
1
11
11
2
11
11
11
1
1
111
1
1
1
12
1
1
2
1
2
2
1
1
2
1
2
11
1
1
11
2
2
1
11
1111
1
1
1 1
1
1
2
1
1
1
1
1
11
1
1 1
2
111
1
11
1
2
2
2
11
3
1
2
1
1
2
11
2
111
1
1
1
1
2
1
1
11
11
1
1
1 1
1
11
1
2
1
1
11
2
1
11
1
2
2
1
1
1
1
1
22
1
1
2
1
1
1
1 11 1
1
11
1
1
1
22
1
2 2
11
1
1
1
2
1
11 1
11
SNP_A!8496017
A
B
Ingo Ruczinski Biostatistics and Computational Biology 27
jhu
De novo deletion
Ingo Ruczinski Biostatistics and Computational Biology 28
jhu
De novo deletion
80.9
1
MI!S
MI!D
UPI!M
UPI!P
BPI
HMM
child
co
py n
um
be
r
!2!1.5!1
!0.50
0.51
mother
co
py n
um
be
r
!2!1.5!1
!0.50
0.51
father
co
py n
um
be
r
!2!1.5!1
!0.50
0.51
15 20 25 30 35 40 45 49
Ingo Ruczinski Biostatistics and Computational Biology 29
jhu
The statistical environment R
R is an environment for data analysis and visualization.
R is both open source and open development.
You can look at the source code and propose changes.
R is not in the public domain.
You are given a license to run the software (currently GPL).
Ingo Ruczinski Biostatistics and Computational Biology 30
jhu
The R software
R is mainly written in C.
R is available for many platforms:
Unix of many flavors, including Linux, Solaris, FreeBSD.
Windows 95 and later.
MacOS X.
Binaries and source code are available from
www.r-project.org.
R “talks” to data bases, programming languages, and other
statistical packages.
R should be source code compatible with most of the
Splus code written.
Ingo Ruczinski Biostatistics and Computational Biology 31
jhu
CRAN
Ingo Ruczinski Biostatistics and Computational Biology 32
jhu
The R package system
Packages are self-contained units of code with
documentation.
The packages are simple to obtain and to understand, and
can easily be updated.
You can write your own packages!
All functions must have examples to run.
There are automatic testing features built in.
Ingo Ruczinski Biostatistics and Computational Biology 33
jhu
CRAN packages
Ingo Ruczinski Biostatistics and Computational Biology 34
jhu
Pros and cons
Advantages
Free
Available for all major platforms
Comprehensive
Powerful graphics
Well-designed programming language
Unlimited extensibility
Widely used by statisticians
Increasingly used for genomic analyses (Bioconductor)
Disadvantages
No dedicated support
Complex syntax
Not point-and-click
Some simple tasks are rather hard
Ingo Ruczinski Biostatistics and Computational Biology 35
jhu
Bioconductor
Ingo Ruczinski Biostatistics and Computational Biology 36
jhu
Bioconductor
Ingo Ruczinski Biostatistics and Computational Biology 37
jhu
JHSPH Biostatistics classes
3 140.615 Biostatistics for Laboratory Scientists I
MWF 10:30 – 11.20 (Ingo Ruczinski)
140.644 Practical Machine Learning
MW 1:30 – 2.50 (Rafael Irizarry)
4 140.616 Biostatistics for Laboratory Scientists II
MWF 10:30 – 11.20 (Ingo Ruczinski)
140.688 Statistics for Genomics
MW 10:30 – 11.50 (Jeff Leek)
Ingo Ruczinski Biostatistics and Computational Biology 38
jhu
Statistics and probability
What is statistics?
Data exploration and analysis.
Quantification of evidence and uncertainty.
Inductive inference with probability.
What is probability?
A branch of mathematics concerning the study of random
processes.
Ingo Ruczinski Biostatistics and Computational Biology 39
jhu
Diagnostics
!
+
!+
TEST
DISEASE
FN TN
TP FP
Ingo Ruczinski Biostatistics and Computational Biology 40
jhu
Diagnostics
!
+
!+
TEST
DISEASE
FN TN
TP FP
Sensitivity ! Pr ( positive test | disease )
Specificity ! Pr ( negative test | no disease )
Positive Predictive Value ! Pr ( disease | positive test )
Negative Predictive Value ! Pr ( no disease | negative test )
Accuracy ! Pr ( correct outcome )
Ingo Ruczinski Biostatistics and Computational Biology 41
jhu
Diagnostics
Assume that some disease has a 0.1% prevalence in the
population. Assume we have a test kit for that disease that
works with 99% sensitivity and 99% specificity. What is the
probability of a person having the disease, given the test result
is positive, if we randomly select a subject from the general
population?
Ingo Ruczinski Biostatistics and Computational Biology 42
jhu
Diagnostics
!
+
!+
TEST
DISEASE
1 98901
99 999
Ingo Ruczinski Biostatistics and Computational Biology 43
jhu
Diagnostics
!
+
!+
TEST
DISEASE
1 98901
99 999
Sensitivity ! 99 / (99+1) = 99%
Specificity ! 98901 / (999+98901) = 99%
Positive Predictive Value ! 99 / (99+999) " 9%
Negative Predictive Value ! 98901 / (1+98901) > 99.9%
Accuracy ! (99+98901) / 100000 = 99%
Ingo Ruczinski Biostatistics and Computational Biology 44
jhu
Calibration
Goal: Determine, by fluoresence, the concentration of quinine
in a sample of tonic water.
1 Obtain a stock solution with known concentration of quinine.
2 Create several dilutions of the stock.
3 Measure fluoresence intensity of each such standard.
4 Measure fluoresence intensity of the unknown.
5 Fit a line to the results for the standards.
6 Use line to estimate quinine concentration in the unknown.
Question: How precise is the resulting estimate?
Ingo Ruczinski Biostatistics and Computational Biology 45
jhu
Calibration
0 5 10 15 20 25 30 35
0
20
40
60
80
Quinine conc (ppb)
Flu
ore
scence inte
nsity
Ingo Ruczinski Biostatistics and Computational Biology 46
jhu
Summarizing data
0
2
4
6
8
10
12
14
A B0
2
4
6
8
10
12
14
A B
Ingo Ruczinski Biostatistics and Computational Biology 47
jhu
Summarizing data
0
2
4
6
8
10
12
14
A B0
2
4
6
8
10
12
14
A B
Ingo Ruczinski Biostatistics and Computational Biology 48
jhu
Summarizing data
0
2
4
6
8
10
12
14
A B0
2
4
6
8
10
12
14
A B
Ingo Ruczinski Biostatistics and Computational Biology 49
jhu
http: //biostat.jhsph.edu/#iruczins/
Ingo Ruczinski Biostatistics and Computational Biology 50