University at Buffalo The State University of New York
Visualization and Microarray
• Complement to numerical analysis
• Offers insightful information
• Detects the structure of dataset
• Early / late stage of data mining
• Challenges of Microarray Visualization
– High dimensionality
– Large data size
– Intuitive layout
– Low time complexity
An Example – Early Stage
General Approaches
• Global Visualizations
– Encode each dimension uniformly by the same visual cue
– Parallel coordinates
General Approaches, cont’d
• Optimal Visualizations
– Estimate the parameters and assess the fit of various spatial distance models for proximity data
– Multidimensional scaling (MDS)
• Sammon’s mapping: topology preservation. Two samples that are close to each other have to stay close when projected.
Sammon’s mapping
• Sammon’s mapping is a classical case of MDS
• MDS optimizes the 2-D presentation to preserve distances in the original N-dimensional space
• Sammon’s mapping iteratively minimizes
$$ E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{(d^{*}_{ij} - d_{ij})^{2}}{d^{*}_{ij}} $$
dij* is the distance between points i and j in the N-dimensional space; dij is the distance between points i and j in the visualization.
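The error E can be evaluated directly from two point configurations. The following is a minimal sketch, assuming Euclidean distances in both spaces and no coincident points in the original space; the function names `pairwise_distances` and `sammon_stress` are illustrative and do not come from the slides.

```python
import math

def pairwise_distances(points):
    """Euclidean distance between every pair of points (equal-length tuples)."""
    return [[math.dist(p, q) for q in points] for p in points]

def sammon_stress(high_dim_points, low_dim_points):
    """Sammon's error E between an original configuration and its projection."""
    d_star = pairwise_distances(high_dim_points)  # dij*, original space
    d = pairwise_distances(low_dim_points)        # dij, visualization
    n = len(high_dim_points)
    total = 0.0  # sum of dij* over all pairs i < j
    error = 0.0  # sum of (dij* - dij)^2 / dij* over all pairs i < j
    for i in range(n):
        for j in range(i + 1, n):
            if d_star[i][j] == 0.0:
                raise ValueError("coincident points make E undefined")
            total += d_star[i][j]
            error += (d_star[i][j] - d[i][j]) ** 2 / d_star[i][j]
    return error / total
```

A projection that preserves every interpoint distance gives E = 0; any distortion gives a positive value, with distortions of small original distances weighted more heavily.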
2D to 1D
A method for achieving this projection
1. D1, D2 and D3 (the interpoint distances in the higher dimensional space) are calculated.
2. P1', P2' and P3' are generated randomly in the lower dimensional space.
3. The mapping error, E, is calculated for all the interpoint distances in the lower dimensional space.
4. The gradient showing the direction which minimizes the error is calculated.
5. The points in the lower dimensional space are moved according to the direction given by the gradient.
6. Steps 3 to 5 are repeated until E is below a given limit.
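The six steps above can be sketched as a small gradient-descent loop. This is an illustration under stated assumptions, not the presenters' implementation: the slides do not specify a step rule, so the sketch uses plain gradient descent with a backtracking step size (the step is halved until the error decreases). The function name `sammon_map` and its parameters are hypothetical, and the input points are assumed to be distinct.

```python
import math
import random

def sammon_map(points, dim=2, tol=1e-6, max_iter=200, seed=0):
    """Project `points` to `dim` dimensions; returns (projection, error history)."""
    rng = random.Random(seed)
    n = len(points)
    # Step 1: interpoint distances in the higher dimensional space.
    d_star = [[math.dist(p, q) for q in points] for p in points]
    c = sum(d_star[i][j] for i in range(n) for j in range(i + 1, n))

    def error(y):
        # Step 3: mapping error E over all pairs i < j.
        return sum((d_star[i][j] - math.dist(y[i], y[j])) ** 2 / d_star[i][j]
                   for i in range(n) for j in range(i + 1, n)) / c

    def gradient(y):
        # Step 4: dE/dy_i = -(2/c) * sum_j (dij* - dij) / (dij* dij) * (y_i - y_j)
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = max(math.dist(y[i], y[j]), 1e-12)  # avoid division by zero
                coeff = -2.0 * (d_star[i][j] - d) / (c * d_star[i][j] * d)
                for k in range(dim):
                    grad[i][k] += coeff * (y[i][k] - y[j][k])
        return grad

    # Step 2: random starting positions in the lower dimensional space.
    y = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    history = [error(y)]
    step = 1.0
    for _ in range(max_iter):
        if history[-1] < tol:  # Step 6: stop once E is below the limit.
            break
        grad = gradient(y)
        # Step 5: move against the gradient; shrink the step until E decreases.
        while step > 1e-12:
            trial = [[y[i][k] - step * grad[i][k] for k in range(dim)]
                     for i in range(n)]
            e = error(trial)
            if e < history[-1]:
                y, step = trial, step * 2.0
                history.append(e)
                break
            step *= 0.5
        else:
            break  # no improving step found; stop at a local minimum
    return y, history
```

Because a move is accepted only when it lowers E, the returned error history is strictly decreasing. The result still depends on the random initialization, which is one of the drawbacks the slides list for this method.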
Sammon’s mapping, cont’d
• Some drawbacks
– Computationally intensive, time complexity O(n^2)
– How to determine the best initialization
– No user interaction is permitted
– Adding new data points requires rerunning the process to get a new minimized projection
– Information loss
General Approaches, cont’d
• Projective Visualizations
– Use projection functions to achieve a low dimensional display
– Radial Visualizations
• RadViz
• Star Coordinates
• VizStruct
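As an illustration of what a projection function looks like, here is a minimal sketch of the RadViz mapping as it is commonly described: each dimension gets an anchor evenly spaced on the unit circle, and a record is placed at the weighted average of the anchors, weighted by its values. The slides do not give this formula, so the anchor layout, the assumption that values are already scaled to [0, 1], and the function name `radviz_point` are all assumptions of this sketch.

```python
import math

def radviz_point(values):
    """Map one record (values scaled to [0, 1]) to a point inside the unit circle."""
    n = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # convention for an all-zero record: the center
    # Anchor j sits at angle 2*pi*j/n; the point is the value-weighted mean anchor.
    x = sum(v * math.cos(2 * math.pi * j / n) for j, v in enumerate(values)) / total
    y = sum(v * math.sin(2 * math.pi * j / n) for j, v in enumerate(values)) / total
    return (x, y)
```

A record that is non-zero in a single dimension lands on that dimension's anchor, while a record with equal values in every dimension lands at the center. Only one pass over the dimensions is needed per record.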
Comparison of Approaches
Approach | Advantages | Disadvantages
Global visualization | Displays all dimensional information, no computation | Severe overlapping, large space to display
Optimal visualization | Achieves optimal result, sound theoretical basis | Lacks user interaction, heavy computation
Projection visualization | Concise display, little computation | Lacks rigorous proof, may not be optimal
Challenges of Microarray Visualization
• High dimensionality
• Large data size
• Intuitive layout
• Low time complexity
Density or Heat Plots
[Figure: heat plot of genes against samples, before IFN and after IFN; color scale from 0 to 1 (increased)]
• Widely used with arrays
• Works well only for structured data
• Quantitative information is lost
• Gets easily cluttered
TreeView Visualization
Principal component analysis (PCA)
• Linear projection of data onto the major principal components defined by the eigenvectors of the covariance matrix.
• PCA is also used for reducing the dimensionality of the data.
• Criterion to be minimised: square of the distance between the original and projected data. This is fulfilled by the Karhunen-Loève transformation:
$$ x' = P x, \qquad C = \frac{1}{n-1} \sum_{i} (x_i - \mu)(x_i - \mu)^{t} $$
where $\mu$ is the sample mean and P is composed of the eigenvectors of the covariance matrix C.
Example: Leukemia data sets by Golub et al.: Classification of ALL and AML
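The projection onto the first principal component can be sketched in a few lines. This is a minimal illustration under assumptions: it uses the sample covariance with a 1/(n-1) factor, and it substitutes power iteration for a full eigendecomposition, so it recovers only the leading eigenvector. The function names are hypothetical; a real analysis would normally call a library eigensolver.

```python
import math

def covariance_matrix(rows):
    """Return (mean, C) with C = 1/(n-1) * sum_i (x_i - mean)(x_i - mean)^t."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return mean, cov

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and renormalize a start vector."""
    d = len(matrix)
    # Arbitrary start vector; the method fails only if this vector is exactly
    # orthogonal to the leading eigenvector.
    v = [float(k + 1) for k in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def first_principal_component(rows):
    """Return the leading eigenvector of C and each row's score along it."""
    mean, cov = covariance_matrix(rows)
    component = leading_eigenvector(cov)
    scores = [sum((r[k] - mean[k]) * component[k] for k in range(len(mean)))
              for r in rows]
    return component, scores
```

For data lying along a single direction, the recovered component is that direction and the scores are the signed distances from the mean. Further components would require deflation or a full eigensolver.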
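Sammon’s mapping
• Non-linear multidimensional scaling such as Sammon’s mapping aims to optimally conserve the distances of a higher-dimensional space in the 2- or 3-dimensional space.
• Mathematically: minimization of the error function E by the steepest descent method:
$$ E = \frac{1}{\sum_{i<j}^{N} D_{ij}} \sum_{i<j}^{N} \frac{(D_{ij} - d_{ij})^{2}}{D_{ij}} $$
where $D_{ij}$ is the distance between points i and j in the original space and $d_{ij}$ is their distance in the projection.
Multi-linear scaling
Example: DLBCL prognosis – cured vs fatal cases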
Our Visualization Approach
Gene Space
Sample Space
Fourier Harmonic Projection
Geometric Interpretation
N-dimensional space Two-dimensional space
An Example of the Mapping
P=[a,a,…a] -> ?
First Fourier Harmonic Projection
N-dimensional space Two-dimensional space
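A mapping from N dimensions to the plane of this kind can be sketched as follows. The slide's formula did not survive extraction, so this sketch assumes that the first Fourier harmonic is the k = 1 coefficient of the discrete Fourier transform, F1(x) = Σ x[n]·exp(-2πi·n/N), with the real and imaginary parts used as the two plot coordinates. The function names are illustrative.

```python
import cmath

def first_fourier_harmonic(x):
    """k = 1 DFT coefficient of the sequence x (assumed definition of F1)."""
    n_dims = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * n / n_dims)
               for n, v in enumerate(x))

def project(x):
    """Map an N-dimensional vector to (real part, imaginary part) of F1(x)."""
    z = first_fourier_harmonic(x)
    return (z.real, z.imag)
```

Under this assumed definition, a constant vector such as P = [a, a, …, a] maps to the origin, because the N complex exponentials sum to zero. The cost is one pass over the N dimensions per point.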
Analytical Properties
Scaling and Transpose Property
Original
Shift
Scaling
Transpose
Time Shifting Property
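The figures for these properties did not survive extraction, so the following is a hedged numerical check rather than a restatement of the slides. It assumes F1 is the k = 1 DFT coefficient, and it reads "scaling" as multiplying every dimension by a constant, "shift" as adding a constant to every dimension, "transpose" as reversing the order of the dimensions, and "time shifting" as a circular delay. Those readings, and the names `f1`, `rotate`, and `property_gaps`, are assumptions of this sketch.

```python
import cmath

def f1(x):
    """k = 1 DFT coefficient of x (assumed definition of the first harmonic)."""
    n_dims = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * n / n_dims)
               for n, v in enumerate(x))

def rotate(x, m):
    """Circular delay by m positions: result[n] = x[(n - m) mod N]."""
    m %= len(x)
    return x[-m:] + x[:-m] if m else list(x)

def property_gaps(x, scale=2.0, offset=5.0, m=1):
    """Absolute gap between both sides of each identity (all near zero)."""
    n_dims = len(x)
    z = f1(x)
    return {
        # Scaling: multiplying x by a constant scales the image by the same factor.
        "scaling": abs(f1([scale * v for v in x]) - scale * z),
        # Shift: adding a constant to every dimension leaves the image unchanged.
        "shift": abs(f1([v + offset for v in x]) - z),
        # Transpose: reversing x mirrors the image, F1 = exp(2*pi*i/N) * conj(z).
        "transpose": abs(f1(x[::-1])
                         - cmath.exp(2j * cmath.pi / n_dims) * z.conjugate()),
        # Time shifting: a delay by m rotates the image by the angle -2*pi*m/N.
        "time_shift": abs(f1(rotate(x, m))
                          - cmath.exp(-2j * cmath.pi * m / n_dims) * z),
    }
```

Calling `property_gaps([0.2, 1.5, 3.0, 2.1, 0.7, 0.1])` returns four values at the level of floating-point rounding. Geometrically, scaling moves a point along its ray from the origin, a constant offset does nothing, reversal reflects the point, and a delay rotates it about the origin.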
Visual Exploration Framework
• Explorative Visualization – Sample space
• Confirmative Visualization – Gene space
VizStruct Architecture
[Diagram components: clients (web browsers); Internet; intranet; web server; Matlab Web Server; Matlab libraries; Matlab applications]
VizStruct User Interface
VizStruct User Interface (3)
Cartesian Plot Polar plot
VizStruct User Interface (2)
EM Mixture Density contour
Sample Classification
Binary Classification
Leukemia-A
72 samples with 7129 genes; 38 (27+11) training, 34 (20+14) testing; hold-out evaluation
Multiple Sclerosis
44 samples, 4132 genes; MS_IFN (28), MS_CON (30); cross-validation evaluation
Binary classification: two sample classes
Evaluation: hold-out and cross-validation
Multiple Classification
Breast Cancer
22 samples with 3226 genes; 3 classes: BRCA1 (7), BRCA2 (8), Sporadic (7); cross-validation evaluation
SRBCT
88 samples with 2308 genes; 4 classes: RMS, BL, NB, EWS; 63 training and 25 testing
Classification Summary
Temporal Pattern (1)
10-OH Nortriptyline; Nortriptyline
Temporal Pattern (2)
• The rat kidney data set of Stuart et al. (2001) contains 873 genes at 7 time points during kidney development
• There are 5 patterns, or gene groups, classified by the authors
• Parallel coordinates show that the actual data conform to the profiles, with some noise
Parallel coordinates for each of the gene groups
Idealized temporal gene expression profiles
Temporal Pattern (3)
Genes having very high relative levels of expression in early development
Genes having a relatively steady increase in expression throughout development
The first Fourier harmonic projection
Genes are somewhat symmetric about the middle time point, i.e., they are transposes of each other
Genes are very similar except the last time point
VizStruct vs. Sammon’s Mapping
[Figure: two scatter plots of 150 numbered points, with panels titled VizStruct and Sammon's Mapping; both panels label the horizontal axis Real Part of F1(x[n]) and the vertical axis Imaginary Part of F1(x[n])]
• VizStruct is similar to Sammon’s mapping
VizStruct - Dimension Tour
Interactively adjust dimension parameters
Manually or automatically
May cause false clusters to break
Create dynamic visualization
Visualized Results for a Time Series Data Set
Interrelated Dimensional Clustering
The approach is applied to classifying multiple-sclerosis patients and IFN-drug treated patients.
– (A) Shows the original distribution of the 28 samples. Each point represents a sample, mapped from the sample's intensity vector over 4132 genes.
– (B) Shows the distribution of the 28 samples on 2015 genes.
– (C) Shows the distribution of the 28 samples on 312 genes.
– (D) Shows the distribution of the same 28 samples after using our approach. We reduce 4132 genes to 96 genes.
References
• Li Zhang, Aidong Zhang, and Murali Ramanathan. VizStruct: Exploratory Visualization for Gene Expression Profiling. Bioinformatics, 20: 85-92, 2004.
• Li Zhang, Chun Tang, Yuqing Song, Aidong Zhang, and Murali Ramanathan. VizCluster and Its Application on Clustering Gene Expression Data. International Journal of Distributed and Parallel Database, 13(1): 73-97, 2003.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Enhanced Visualization of Time Series through Higher Fourier Harmonics. In Proceedings of BIOKDD 2003, Washington DC, August 2003, pp. 49-56.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data. In Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2003), Stanford, CA, August 2003, pp. 131-141.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Visualized Classification of Multiple Sample Types. In Proceedings of BIOKDD 2002, Edmonton, Alberta, Canada, July 2002, pp. 55-62.
• Li Zhang, Chun Tang, Yong Shi, Yuqing Song, Aidong Zhang, and Murali Ramanathan. VizCluster: An Interactive Visualization Approach to Cluster Analysis and Its Application on Microarray Data. In Proceedings of the Second SIAM International Conference on Data Mining (SDM02), Arlington, VA, April 2002, pp. 29-51.