Object Orie’d Data Analysis, Last Time
HDLSS Asymptotics
• In spirit of classical math’l statistics
• But limit as $d \to \infty$, not usual $n \to \infty$
• Saw variation goes into random rotation
• Modulo rotation, have fixed structure
• Convergence to vertices of unit simplex
• Gave statistical insights
(e.g. methods all come together for high d)
HDLSS Asymptotics
Interesting Idea from Travis Gaydos:
Interpret from viewpoint of dual space
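Recall from Aug. 25: for $Z_1, Z_2 \sim N_d(0, I)$
• Distance to origin: $\|Z_j\| = \left( \sum_{i=1}^{d} Z_{i,j}^2 \right)^{1/2} = d^{1/2} + O_p(1)$
• Pairwise Distance: $\|Z_1 - Z_2\| = \left( \sum_{i=1}^{d} (Z_{i,1} - Z_{i,2})^2 \right)^{1/2} = (2d)^{1/2} + O_p(1)$
• Angle from origin: $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{i=1}^{d} Z_{i,1} Z_{i,2}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$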
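A minimal simulation sketch of the three rates recalled above, using only the Python standard library. The function name `hdlss_geometry` and the seed are illustrative choices, not part of the original slides.

```python
import math
import random

def hdlss_geometry(d, rng):
    """Draw Z1, Z2 ~ N_d(0, I); return (|Z1|, |Z1 - Z2|, angle in degrees)."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    cos_angle = sum(a * b for a, b in zip(z1, z2)) / (norm1 * norm2)
    # Guard against rounding slightly outside [-1, 1] (matters for d = 1).
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return norm1, dist, math.degrees(math.acos(cos_angle))

rng = random.Random(0)
for d in (1, 3, 10, 39, 100, 300, 1000):
    norm1, dist, angle = hdlss_geometry(d, rng)
    print(f"d={d:5d}  |Z1|/sqrt(d)={norm1 / math.sqrt(d):.3f}  "
          f"|Z1-Z2|/sqrt(2d)={dist / math.sqrt(2 * d):.3f}  angle={angle:.1f}")
```

As d grows, the two ratios should settle near 1 and the angle near 90 degrees, in line with the stated rates.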
HDLSS Asymptotics – Dual View
Study these for simple Examples: $Z_1, Z_2 \sim N_d(0, I)$
Over range $d = 1, 3, 10, 39, 100, 300, 1000$
Look in dual space:
• Dimension $n = 2$ (easy to visualize)
• Entries of $Z_1$ & $Z_2$ appear as points: $(Z_{i,1}, Z_{i,2})$, $i = 1, \ldots, d$
• Relate HDLSS phenomena to these points
[Figure slides: dual-space views over the range of d]
HDLSS Asymptotics – Dual View
Notes:
• Upper left: Dual view of data
• Upper right: Dual view of squares, $(Z_{i,1}^2, Z_{i,2}^2)$
• Lower: renormalized to see data addition: $(Z_{i,1}^2 / d, \; Z_{i,2}^2 / d)$
• Lower right: study distance to origin
• Lower left: study pairwise distance
HDLSS Asymptotics – Dual View
Summary of insights:
• Distance to Origin
– Study via $\|Z_j\| = \left( \sum_{i=1}^{d} Z_{i,j}^2 \right)^{1/2} = d^{1/2} + O_p(1)$
– Expected to converge to 1: $d^{-1} \sum_{i=1}^{d} Z_{i,j}^2 = 1 + O_p(d^{-1/2})$
– Upper & Lower Right shows squares $(Z_{i,1}^2, Z_{i,2}^2)$
– Average shown in green: $d^{-1} \sum_{i=1}^{d} Z_{i,j}^2$
– Can see convergence to 1 (stable green lines)
HDLSS Asymptotics – Dual View
Summary of insights:
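• Pairwise Distance:
– Study via $\|Z_1 - Z_2\| = \left( \sum_{i=1}^{d} (Z_{i,1} - Z_{i,2})^2 \right)^{1/2} = (2d)^{1/2} + O_p(1)$
– Expected to converge to $\sqrt{2}$: $\left( d^{-1} \sum_{i=1}^{d} (Z_{i,1} - Z_{i,2})^2 \right)^{1/2} = 2^{1/2} + O_p(d^{-1/2})$
– Lower Left shows Sqrt(Average)
– Can see convergence (stable red line)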
HDLSS Asymptotics – Dual View
Summary of insights:
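• Angle from origin:
– Study via $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{i=1}^{d} Z_{i,1} Z_{i,2}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$
– Expected to converge to 0: $d^{-1} \sum_{i=1}^{d} Z_{i,1} Z_{i,2} = 0 + O_p(d^{-1/2})$
– Upper left text shows Convergence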
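The dual-view summaries can be sketched numerically: treat the d coordinate pairs as points in the plane and average over them. This is an illustrative sketch only; `dual_summaries` is a hypothetical helper name, and the plots in the slides are not reproduced.

```python
import math
import random

def dual_summaries(d, rng):
    """Averages over the d dual-space points (Z_{i,1}, Z_{i,2})."""
    points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(d)]
    # Average of squares in each coordinate (distance to origin, expected near 1).
    mean_sq_1 = sum(x * x for x, _ in points) / d
    mean_sq_2 = sum(y * y for _, y in points) / d
    # Square root of the average squared difference (expected near sqrt(2)).
    root_mean_sq_diff = math.sqrt(sum((x - y) ** 2 for x, y in points) / d)
    # Average cross product (expected near 0, i.e. angle near 90 degrees).
    mean_cross = sum(x * y for x, y in points) / d
    return mean_sq_1, mean_sq_2, root_mean_sq_diff, mean_cross

rng = random.Random(0)
for d in (10, 100, 1000):
    print(d, [round(v, 3) for v in dual_summaries(d, rng)])
```

Each summary is a plain average over the dual-space points, which is why the law of large numbers drives the convergence seen in the plots.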
HDLSS Asymptotics – Dual View
Would be interesting to try:
• Study (i.e. explore conditions for):
– Consistency
– Strong Inconsistency
for PCA direction vectors, from this viewpoint
Perhaps other things as well…
NCI 60 Data
Recall from:
• Aug. 28
• Aug. 30
NCI 60 Cancer Cell Lines Microarray Data
• Explored Data Combination
• cDNA & Affymetrix Measurements
• Right answer is known
Interesting Benchmark Data Set
• NCI 60 Cell Lines
– Interesting benchmark, since same cells
– Data Web available:
– http://discover.nci.nih.gov/datasetsNature2000.jsp
– Both cDNA and Affymetrix Platforms
• Different from Breast Cancer Data
– Which had no common samples
NCI 60: Raw Data, Platform Colored
NCI 60: Raw Data
NCI 60: Raw Data, Before DWD Adjustment
NCI 60: Before & After DWD adjustment
NCI 60
Leave out many slides studied on 8/28/07
NCI 60: Fully Adjusted Data, Platform Colored
NCI 60: Fully Adjusted Data, Melanoma Cluster
BREAST.MDAMB435 BREAST.MDN MELAN.MALME3M MELAN.SKMEL2 MELAN.SKMEL5 MELAN.SKMEL28 MELAN.M14 MELAN.UACC62 MELAN.UACC257
NCI 60: Fully Adjusted Data, Leukemia Cluster
LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR
Another DWD Appl’n: Visualization
• Recall PCA limitations
• DWD uses class info
• Hence can “better separate known classes”
• Do this for pairs of classes (DWD just on those, ignore others)
• Carefully choose pairs in NCI 60 data
• Shows Effectiveness of Adjustment
NCI 60: Views using DWD Dir’ns (focus on biology)
DWD Visualization of NCI 60 Data
• Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)
• Using these carefully chosen directions
• Others less clear cut
– NSCLC (at least 3 subtypes)
– Breast (4 published subtypes)
• DWD adjustment was very effective (very few black connectors visible)
DWD Views of NCI 60 Data
Interesting Question:
Which clusters are really there?
Issues:
• DWD great at finding dir’ns of separation
• And will do so even if no real structure
• Is this happening here?
• Or: which clusters are important?
• What does “important” mean?
Real Clusters in NCI 60 Data
Simple Visual Approach:
• Randomly relabel data (Cancer Types)
• Recompute DWD dir’ns & visualization
• Get heuristic impression from this
Deeper Approach
• Formal Hypothesis Testing
(Done later)
Random Relabelling #1
Random Relabelling #2
Random Relabelling #3
Random Relabelling #4
Revisit Real Data
Revisit Real Data (Cont.)
Heuristic Results:
Strong Clust’s   Weak Clust’s   Not Clust’s
Melanoma         CNS            NSCLC
Leukemia         Ovarian        Breast
Renal            Colon
Later: will find way to quantify these ideas
i.e. develop statistical significance
Needed final verification of Cross-platform Normal’n
• Is statistical power actually improved?
• Is there benefit to data combo by DWD?
• More data, more power?
• Will study later
Real Clusters in NCI 60 Data?
From Aug. 30: Simple Visual Approach:
• Randomly relabel data (Cancer Types)
• Recompute DWD dir’ns & visualization
• Get heuristic impression from this
• Some types appeared signif’ly different
• Others did not
Deeper Approach:
Formal Hypothesis Testing
HDLSS Hypothesis Testing
Approach: DiProPerm Test
DIrection – PROjection – PERMutation
Ideas:
• Find an appropriate Direction vector
• Project data into that 1-d subspace
• Construct a 1-d test statistic
• Analyze significance by Permutation
HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test
Context:
Given 2 sub-populations, X & Y
Are they from the same distribution?
Or significantly different?
$H_0: \mathcal{L}_X = \mathcal{L}_Y$ vs. $H_1: \mathcal{L}_X \neq \mathcal{L}_Y$
HDLSS Hypothesis Testing – DiProPerm test
Reasonable Direction vectors:
Mean Difference
SVM
Maximal Data Piling
DWD (used in the following)
Any good discrimination direction…
HDLSS Hypothesis Testing – DiProPerm test
Reasonable Projected 1-d statistics:
Two sample t-test (used here)
Chi-square test for different variances
Kolmogorov - Smirnov
Any good distributional test…
HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test Steps:
1. For original data:
– Find Direction vector
– Project Data, Compute True Test Statistic
2. For (many) random relabellings of data:
– Find Direction vector
– Project Data, Compute Perm’d Test Stat
3. Compare:
– True Stat among population of Perm’d Stat’s
– Quantile gives p-value
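The steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used for the slides: it uses the mean-difference direction rather than DWD, and the unequal-variance (Welch) form of the two-sample t-statistic. All function names are hypothetical.

```python
import math
import random

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_difference_direction(x, y):
    """Unit vector along the difference of class means (stand-in for DWD)."""
    diff = [a - b for a, b in zip(mean_vector(x), mean_vector(y))]
    norm = math.sqrt(sum(v * v for v in diff))
    return [v / norm for v in diff]

def project(rows, direction):
    return [sum(a * b for a, b in zip(row, direction)) for row in rows]

def two_sample_t(a, b):
    """Welch two-sample t-statistic for 1-d samples a and b."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def diproperm(x, y, n_perm=200, seed=0):
    """Return (true statistic, permuted statistics, empirical p-value)."""
    rng = random.Random(seed)

    def statistic(x_rows, y_rows):
        # Steps 1 and 2: find the direction, project, compute the 1-d statistic.
        w = mean_difference_direction(x_rows, y_rows)
        return two_sample_t(project(x_rows, w), project(y_rows, w))

    true_stat = statistic(x, y)
    pooled = x + y
    perm_stats = []
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # random relabelling of the pooled data
        perm_stats.append(statistic(shuffled[:len(x)], shuffled[len(x):]))
    # Step 3: proportion of permuted statistics at least as large as the true one.
    p_value = sum(s >= true_stat for s in perm_stats) / n_perm
    return true_stat, perm_stats, p_value

rng = random.Random(1)
d, n = 50, 10
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y_same = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y_shift = [[rng.gauss(3.0, 1.0) for _ in range(d)] for _ in range(n)]
print("no difference: p =", diproperm(x, y_same)[2])
print("shifted mean:  p =", diproperm(x, y_shift)[2])
```

The direction is recomputed inside every relabelling, so the permuted statistics reflect the same data-driven search as the true statistic.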
HDLSS Hypothesis Testing – DiProPerm test
Remarks:
• Generally can’t use standard null dist’ns…
• e.g. Student’s t-table, for t-statistic
• Because Direction and Projection give nonstandard context
• I.e. violate traditional assumptions
• E.g. DWD finds separating directions
• Giving completely invalid test
• This motivates Permutation approach
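A small sketch of why the t-table fails here, assuming Gaussian data with no true class difference and the mean-difference direction as a simple stand-in for DWD. The helper names and sizes (d = 1000, n = 10 per class) are illustrative assumptions.

```python
import math
import random

def t_stat(a, b):
    """Welch two-sample t-statistic for 1-d samples a and b."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def null_t_stats(d=1000, n=10, seed=0):
    """Both classes come from N_d(0, I), so there is no real difference."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    # Data-driven direction: difference of class means.
    # (No normalization needed: the t-statistic is scale invariant.)
    mean_x = [sum(col) / n for col in zip(*x)]
    mean_y = [sum(col) / n for col in zip(*y)]
    w = [a - b for a, b in zip(mean_x, mean_y)]
    proj_x = [sum(r * v for r, v in zip(row, w)) for row in x]
    proj_y = [sum(r * v for r, v in zip(row, w)) for row in y]
    t_driven = t_stat(proj_x, proj_y)
    # Fixed direction chosen without looking at the data: first coordinate axis.
    t_fixed = t_stat([row[0] for row in x], [row[0] for row in y])
    return t_driven, t_fixed

t_driven, t_fixed = null_t_stats()
print(f"t along data-driven direction: {t_driven:.1f}")
print(f"t along fixed direction:       {t_fixed:.1f}")
```

The data-driven direction gives a large t-statistic even though both classes share one distribution, so comparing it with a t-table would wrongly declare significance.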
DiProPerm Simple Example 1, Totally Separate
Clearly Distinct Populations in This Example
Ignore this “Extreme Labelling” for now
Will become important later…
[Figure slides: DiProPerm Simple Example 1, step by step]
Repeat this 1,000 times
To get:
[Figure slide: DiProPerm Simple Example 1, after 1,000 relabellings]
DiProPerm Simple Example 1, Totally Separate
Results:
Random relabelling gives much smaller T-stats
Quantiles (over 1000 sim’s) give p-val of 0 (no relabelled T-stat as large as the true one)
I.e. Strongly conclusive
Conclude sub-populations are different
DiProPerm Simple Example 2, No Difference
Indistinct subpopulations in this Example
Again ignore “Extreme Labelling” (will study later)
Explore DiProPerm Test
[Figure slides: DiProPerm Simple Example 2, step by step]
DiProPerm Simple Example 2, No Difference
Results:
Random relabelling gives similar T-stats
So ordinary T-stat is not significant
No strong evidence that:
sub-populations are different
Needed final verification of Cross-platform Normal’n
• Is statistical power actually improved?
• Is there benefit to data combo by DWD?
• More data, more power?
• Will study now
Needed final verification of Cross-platform Normal’n
For each cancer type, study each of:
• Combined Data (data top left)
• cDNA Only Data (data top center)
• Affy Only Data (data bottom center)
Test results:
• Combined Data (DiProPerm top right)
• Affy only (DiProPerm bottom right)
• No cDNA (since clearly worse T-stats)
Improved Statistical Power - NCI 60 Melanoma
Improved Statistical Power - NCI 60 Leukemia
Improved Statistical Power - NCI 60 NSCLC
Improved Statistical Power - NCI 60 Renal
Improved Statistical Power - NCI 60 CNS
Improved Statistical Power - NCI 60 Ovarian
Improved Statistical Power - NCI 60 Colon
Improved Statistical Power - NCI 60 Breast
Improved Statistical Power - Summary
Type       cDNA-t   Affy-t   Comb-t   Affy-P   Comb-P
Melanoma   36.8     39.9     51.8     e-7      0
Leukemia   18.3     23.8     27.5     0.12     0.00001
NSCLC      17.3     25.1     23.5     0.18     0.02
Renal      15.6     20.1     22.0     0.54     0.04
CNS        13.4     18.6     18.9     0.62     0.21
Ovarian    11.2     20.8     17.0     0.21     0.27
Colon      10.3     17.4     16.3     0.74     0.58
Breast     13.8     19.6     19.3     0.51     0.16
Needed final verification of Cross-platform Normal’n
Summary of Results:
• T-statistics
– cDNA T-stats always substantially smaller
– So never so powerful as Affy
– Conclude Affy gives stronger results
– But Combined is not comparable
– Because based on different sample sizes
– Need p-value to put on same scale
Needed final verification of Cross-platform Normal’n
Summary of Results:
• P-values
– Combined Results Better for 7 out of 8 cases
– Combined significant, Affy not, in 3 cases (Leukemia, NSCLC, Renal)
– Shows combining platforms often worthwhile (because more data gives more power)
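The two counts in this summary can be checked directly against the p-value columns of the results table. A short sketch, under two labeled assumptions: the table entry "e-7" is read as 1e-7, and significance is judged at the 0.05 level, which the slides do not state.

```python
# (Affy p-value, Combined p-value) per cancer type, copied from the summary table.
# Assumption: the table entry "e-7" is read as 1e-7.
p_values = {
    "Melanoma": (1e-7, 0.0),
    "Leukemia": (0.12, 0.00001),
    "NSCLC": (0.18, 0.02),
    "Renal": (0.54, 0.04),
    "CNS": (0.62, 0.21),
    "Ovarian": (0.21, 0.27),
    "Colon": (0.74, 0.58),
    "Breast": (0.51, 0.16),
}
ALPHA = 0.05  # assumed significance level; not stated in the slides

# Types where the Combined p-value is smaller than the Affy p-value.
combined_better = [t for t, (affy, comb) in p_values.items() if comb < affy]
# Types where Combined is significant at ALPHA but Affy is not.
newly_significant = [t for t, (affy, comb) in p_values.items()
                     if comb < ALPHA <= affy]

print(len(combined_better), "of", len(p_values), "types: Combined p-value smaller")
print("Combined significant, Affy not:", newly_significant)
```

Ovarian is the one type where the Affy-only p-value is smaller than the Combined one.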
Comparison with previous heuristics…
Revisit Real Data (Cont.)
Previous Heuristic Results (from rand re-ord):
Strong Clust’s   Weak Clust’s   Not Clust’s
Melanoma         CNS            NSCLC
Leukemia         Ovarian        Breast
Renal            Colon
Statistically Sign’t (as expected)
Not Sign’t (as expected)
Surprising result (not consistent with vis’n)
Needed final verification of Cross-platform Normal’n
Next time:
• Put sample sizes after each in previous slide (and maybe in table of results) to show how that impacts results
– Note this gives good explanation, as shown in next class meeting…
• Explore with more views, why NSCLC is significant, but Breast is not…