Object Orie’d Data Analysis, Last Time


HDLSS Asymptotics

• In spirit of classical mathematical statistics

• But limit as $d \to \infty$, not usual $n \to \infty$

• Saw variation goes into random rotation

• Modulo rotation, have fixed structure

• Convergence to vertices of unit simplex

• Gave statistical insights

(e.g. methods all come together for high d)

HDLSS Asymptotics

Interesting Idea from Travis Gaydos:

Interpret from viewpoint of dual space

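Recall from Aug. 25: for $Z_1, Z_2 \sim N_d(0, I)$

• Distance to origin: $\|Z_j\| = \left( \sum_{i=1}^{d} Z_{j,i}^2 \right)^{1/2} = d^{1/2} + O_p(1)$

• Pairwise Distance: $\|Z_1 - Z_2\| = \left( \sum_{i=1}^{d} (Z_{1,i} - Z_{2,i})^2 \right)^{1/2} = (2d)^{1/2} + O_p(1)$

• Angle from origin: $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{i=1}^{d} Z_{1,i} Z_{2,i}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$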
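The three limits above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `hdlss_geometry`, the seed, and the particular values of d are illustrative choices, not taken from the slides.

```python
import math
import random


def hdlss_geometry(d, rng):
    """Draw Z1, Z2 ~ N_d(0, I); return (|Z1|, |Z1 - Z2|, angle in degrees)."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    cos_angle = sum(a * b for a, b in zip(z1, z2)) / (norm1 * norm2)
    # Clamp to [-1, 1] to guard against rounding error before acos.
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return norm1, dist, angle


rng = random.Random(0)  # fixed seed (arbitrary) for reproducibility
for d in (10, 100, 1000, 10000):
    norm1, dist, angle = hdlss_geometry(d, rng)
    print(f"d={d:6d}  |Z1|/sqrt(d)={norm1 / math.sqrt(d):.3f}  "
          f"|Z1-Z2|/sqrt(2d)={dist / math.sqrt(2 * d):.3f}  "
          f"angle={angle:.1f} deg")
```

As d grows, the two ratios should settle near 1 and the angle near 90 degrees, in line with the stated rates.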

http://www.stat-or.unc.edu/webspace/courses/marron/UNCstor891OODA-2007/10-25-07.ppt

HDLSS Asymptotics – Dual View

Study these for simple Examples: $Z_1, Z_2 \sim N_d(0, I)$

Over range $d = 1, 3, 10, 39, 100, 300, 1000$

Look in dual space:

• Dimension $n = 2$ (easy to visualize)

• Entries of $Z_1$ & $Z_2$ appear as points: $(Z_{1,i}, Z_{2,i}),\ i = 1, \dots, d$

• Relate HDLSS phenomena to these points









Notes:

• Upper left: Dual view of data

• Upper right: Dual view of squares: $(Z_{1,i}^2, Z_{2,i}^2)$

• Lower: renormalized to see data addition: $(Z_{1,i}^2 / d,\ Z_{2,i}^2 / d)$

• Lower right: study distance to origin

• Lower left: study pairwise distance


Summary of insights:

• Distance to Origin

– Study via $\|Z_j\| = \left( \sum_{i=1}^{d} Z_{j,i}^2 \right)^{1/2} = d^{1/2} + O_p(1)$, i.e. $\frac{1}{d} \sum_{i=1}^{d} Z_{j,i}^2 = 1 + O_p(d^{-1/2})$

– Expected to converge to 1

– Upper & Lower Right shows squares $(Z_{1,i}^2, Z_{2,i}^2)$

– Average shown in green: $\frac{1}{d} \sum_{i=1}^{d} Z_{j,i}^2$

– Can see convergence to 1 (stable green lines)
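The rate for the distance to the origin follows from the central limit theorem. A short derivation, using only that each squared standard normal entry has mean 1 and variance 2 (the indexing $Z_{j,i}$ for coordinate $i$ of vector $Z_j$ is a notational assumption):

```latex
\begin{align*}
E\,Z_{j,i}^2 &= 1, \qquad
\operatorname{Var}\bigl(Z_{j,i}^2\bigr) = E\,Z_{j,i}^4 - 1 = 3 - 1 = 2, \\
\frac{1}{d}\sum_{i=1}^{d} Z_{j,i}^2 &= 1 + O_p\bigl(d^{-1/2}\bigr)
  \quad \text{(the average has standard deviation } \sqrt{2/d}\text{)}, \\
\|Z_j\| &= d^{1/2}\Bigl(1 + O_p\bigl(d^{-1/2}\bigr)\Bigr)^{1/2}
         = d^{1/2}\Bigl(1 + O_p\bigl(d^{-1/2}\bigr)\Bigr)
         = d^{1/2} + O_p(1).
\end{align*}
```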



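• Pairwise Distance:

– Study via $\|Z_1 - Z_2\| = \left( \sum_{i=1}^{d} (Z_{1,i} - Z_{2,i})^2 \right)^{1/2} = (2d)^{1/2} + O_p(1)$, i.e. $\left( \frac{1}{d} \sum_{i=1}^{d} (Z_{1,i} - Z_{2,i})^2 \right)^{1/2} = 2^{1/2} + O_p(d^{-1/2})$

– Expected to converge to $\sqrt{2}$

– Lower Left shows Sqrt(Average)

– Can see convergence (stable red line)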



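• Angle from origin:

– Study via $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{i=1}^{d} Z_{1,i} Z_{2,i}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$ and the renormalized inner product $\frac{1}{d} \sum_{i=1}^{d} Z_{1,i} Z_{2,i} = 0 + O_p(d^{-1/2})$

– Expected to converge to 0

– Upper left text shows Convergence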
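The dual-space summaries can be computed directly from the d points in the plane. The sketch below is an illustration under assumptions: the function name `dual_view_summaries` and the seed are hypothetical, and the list of d values is copied as printed in the slides.

```python
import math
import random


def dual_view_summaries(d, rng):
    """Three renormalized dual-space averages for one draw of Z1, Z2 ~ N_d(0, I)."""
    # Dual view: one point in the plane per coordinate i, namely (Z_{1,i}, Z_{2,i}).
    points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(d)]
    # Average of squares of the first vector (distance to origin), limit 1.
    mean_sq = sum(a * a for a, _ in points) / d
    # Square root of the average squared difference (pairwise distance), limit sqrt(2).
    root_mean_sq_diff = math.sqrt(sum((a - b) ** 2 for a, b in points) / d)
    # Average product (numerator of the cosine of the angle), limit 0.
    mean_product = sum(a * b for a, b in points) / d
    return mean_sq, root_mean_sq_diff, mean_product


rng = random.Random(0)  # fixed seed (arbitrary) for reproducibility
for d in (1, 3, 10, 39, 100, 300, 1000):
    mean_sq, rmsd, mean_prod = dual_view_summaries(d, rng)
    print(f"d={d:5d}  mean square={mean_sq:.3f}  "
          f"sqrt(mean sq diff)={rmsd:.3f}  mean product={mean_prod:+.3f}")
```

For small d the three summaries fluctuate widely; for large d they should stabilize, which is the behavior the green and red lines in the plots display.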


Would be interesting to try:

• Study (i.e. explore conditions for):

– Consistency

– Strong Inconsistency

for PCA direction vectors, from this viewpoint

Perhaps other things as well…

NCI 60 Data

Recall from:

• Aug. 28

• Aug. 30

NCI 60 Cancer Cell Lines Microarray Data

• Explored Data Combination

• cDNA & Affymetrix Measurements

• Right answer is known



Interesting Benchmark Data Set

• NCI 60 Cell Lines

– Interesting benchmark, since same cells

– Data Web available:

– http://discover.nci.nih.gov/datasetsNature2000.jsp

– Both cDNA and Affymetrix Platforms

• Different from Breast Cancer Data

– Which had no common samples


NCI 60: Raw Data, Platform Colored

NCI 60: Raw Data

NCI 60: Raw Data, Before DWD Adjustment

NCI 60: Before & After DWD adjustment

NCI 60

Leave out many slides studied on 8/28/07

NCI 60: Fully Adjusted Data, Platform Colored

NCI 60: Fully Adjusted Data, Melanoma Cluster

BREAST.MDAMB435 BREAST.MDN MELAN.MALME3M MELAN.SKMEL2 MELAN.SKMEL5 MELAN.SKMEL28 MELAN.M14 MELAN.UACC62 MELAN.UACC257

NCI 60: Fully Adjusted Data, Leukemia Cluster

LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR

Another DWD Appl’n: Visualization

• Recall PCA limitations

• DWD uses class info

• Hence can “better separate known classes”

• Do this for pairs of classes (DWD just on those, ignore others)

• Carefully choose pairs in NCI 60 data

• Shows Effectiveness of Adjustment

NCI 60: Views using DWD Dir’ns (focus on biology)

DWD Visualization of NCI 60 Data

• Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)

• Using these carefully chosen directions

• Others less clear cut

– NSCLC (at least 3 subtypes)

– Breast (4 published subtypes)

• DWD adjustment was very effective (very few black connectors visible)

DWD Views of NCI 60 Data

Interesting Question:

Which clusters are really there?

Issues:

• DWD great at finding dir’ns of separation

• And will do so even if no real structure

• Is this happening here?

• Or: which clusters are important?

• What does “important” mean?

Real Clusters in NCI 60 Data

Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

Deeper Approach

• Formal Hypothesis Testing

(Done later)

Random Relabelling #1




Revisit Real Data

Revisit Real Data (Cont.)

Heuristic Results:

Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma          CNS             NSCLC
Leukemia          Ovarian         Breast
Renal             Colon

Later: will find way to quantify these ideas

i.e. develop statistical significance

Needed final verification of Cross-platform Normal’n

• Is statistical power actually improved?

• Is there benefit to data combo by DWD?

• More data ⇒ more power?

• Will study now

Real Clusters in NCI 60 Data?

From Aug. 30: Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

• Some types appeared signif’ly different

• Others did not

Deeper Approach:

Formal Hypothesis Testing


HDLSS Hypothesis Testing

Approach: DiProPerm Test

DIrection – PROjection – PERMutation

Ideas:

• Find an appropriate Direction vector

• Project data into that 1-d subspace

• Construct a 1-d test statistic

• Analyze significance by Permutation

HDLSS Hypothesis Testing – DiProPerm test

DiProPerm Test

Context:

Given 2 sub-populations, X & Y

Are they from the same distribution?

Or significantly different?

$H_0: \mathcal{L}_X = \mathcal{L}_Y$ vs. $H_1: \mathcal{L}_X \neq \mathcal{L}_Y$


Reasonable Direction vectors:

Mean Difference

SVM

Maximal Data Piling

DWD (used in the following)

Any good discrimination direction…


Reasonable Projected 1-d statistics:

Two sample t-test (used here)

Chi-square test for different variances

Kolmogorov - Smirnov

Any good distributional test…


DiProPerm Test Steps:

1. For original data:

• Find Direction vector

• Project Data, Compute True Test Statistic

2. For (many) random relabellings of data:

• Find Direction vector

• Project Data, Compute Perm’d Test Stat

3. Compare:

• True Stat among population of Perm’d Stat’s

• Quantile gives p-value
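The three steps can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation behind the slides: it swaps in the mean-difference direction (one of the "reasonable direction vectors") for DWD, uses a pooled-variance two-sample t statistic, and takes the p-value as the proportion of permuted statistics at least as large as the true one. All function names, the seed, and the simulated data are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(r[i] for r in rows) / len(rows) for i in range(len(rows[0]))]


def t_statistic(x, y):
    """Two-sample t statistic (pooled variance) for 1-d samples x and y."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    pooled_sd = math.sqrt(ss / (nx + ny - 2))
    return (mx - my) / (pooled_sd * math.sqrt(1 / nx + 1 / ny))


def project(rows, direction):
    """Inner product of every row with the direction vector."""
    return [sum(c * v for c, v in zip(direction, row)) for row in rows]


def projected_stat(x_rows, y_rows):
    """Direction (mean difference), projection, then the 1-d t statistic."""
    diff = [a - b for a, b in zip(mean_vector(x_rows), mean_vector(y_rows))]
    length = math.sqrt(sum(c * c for c in diff))
    direction = [c / length for c in diff]
    return t_statistic(project(x_rows, direction), project(y_rows, direction))


def diproperm(x_rows, y_rows, n_perm=200, seed=0):
    """Return (true statistic, permuted statistics, empirical p-value)."""
    rng = random.Random(seed)
    true_stat = projected_stat(x_rows, y_rows)        # step 1
    pooled = list(x_rows) + list(y_rows)
    nx = len(x_rows)
    perm_stats = []
    for _ in range(n_perm):                           # step 2
        rng.shuffle(pooled)                           # random relabelling
        perm_stats.append(projected_stat(pooled[:nx], pooled[nx:]))
    # Step 3: where does the true statistic fall among the permuted ones?
    p_value = sum(s >= true_stat for s in perm_stats) / n_perm
    return true_stat, perm_stats, p_value


# Usage on simulated data: one clearly shifted pair, one pair with no difference.
rng = random.Random(1)
d, n = 50, 20
x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y_shifted = [[rng.gauss(3, 1) for _ in range(d)] for _ in range(n)]
y_null = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
t_sep, _, p_sep = diproperm(x, y_shifted)
t_null, _, p_null = diproperm(x, y_null)
print(f"shifted: t={t_sep:.1f}, p={p_sep:.3f}")
print(f"null:    t={t_null:.1f}, p={p_null:.3f}")
```

Note that the direction is recomputed inside every relabelling. Reusing the original direction for the permuted data would understate how well the direction-finding step can separate arbitrary labels.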


Remarks:

• Generally can’t use standard null dist’ns…

• e.g. Student’s t-table, for t-statistic

• Because Direction and Projection give nonstandard context

• I.e. violate traditional assumptions

• E.g. DWD finds separating directions

• Giving completely invalid test

• This motivates Permutation approach
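A small simulation makes the point concrete. The sketch below assumes the mean-difference direction as a stand-in for DWD and pure-noise data with no real difference between groups; the dimensions, seed, and function names are illustrative. It counts how often the projected t statistic exceeds the usual t-table critical value.

```python
import math
import random

# Two-sided 5% critical value of Student's t with 18 degrees of freedom (n = 10 per group).
T_CRIT_18DF = 2.101


def projected_t(x_rows, y_rows):
    """t statistic of the data projected onto the mean-difference direction."""
    nx, ny, d = len(x_rows), len(y_rows), len(x_rows[0])
    mx = [sum(r[i] for r in x_rows) / nx for i in range(d)]
    my = [sum(r[i] for r in y_rows) / ny for i in range(d)]
    diff = [a - b for a, b in zip(mx, my)]
    length = math.sqrt(sum(c * c for c in diff))
    px = [sum(c * v for c, v in zip(diff, r)) / length for r in x_rows]
    py = [sum(c * v for c, v in zip(diff, r)) / length for r in y_rows]
    mpx, mpy = sum(px) / nx, sum(py) / ny
    ss = sum((v - mpx) ** 2 for v in px) + sum((v - mpy) ** 2 for v in py)
    pooled_sd = math.sqrt(ss / (nx + ny - 2))
    return (mpx - mpy) / (pooled_sd * math.sqrt(1 / nx + 1 / ny))


def null_rejection_rate(d=200, n=10, n_sim=100, seed=0):
    """Fraction of null data sets whose projected t exceeds the t-table value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        # Both groups come from the same N_d(0, I): there is no real difference.
        x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        y = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        if projected_t(x, y) > T_CRIT_18DF:
            rejections += 1
    return rejections / n_sim


rate = null_rejection_rate()
print(f"null rejection rate using the t-table: {rate:.2f}")
```

With d much larger than n, the direction is fitted to the noise, so the rejection rate should come out far above the nominal level even though the groups are identically distributed.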

DiProPerm Simple Example 1, Totally Separate

• Clearly Distinct Populations in This Example

• Ignore this “Extreme Labelling” for now

• Will become important later…






.

.

.

Repeat this 1,000 times

To get:



Results:

Random relabelling gives much smaller Ts

Quantiles (over 1000 sim’s) give p-val of 0

I.e. Strongly conclusive

Conclude sub-populations are different

DiProPerm Simple Example 2, No Difference

• Indistinct subpopulations in this Example

• Again ignore “Extreme Labelling” (will study later)

• Explore DiProPerm Test







Results:

Random relabelling gives similar T-stats

So ordinary T-stat is not significant

No strong evidence that:

sub-populations are different


Normal’n

• Is statistical power actually improved?

• Is there benefit to data combo by DWD?

• More data ⇒ more power?

• Will study now


Normal’n

For each cancer type, study each of:

• Combined Data (data top left)

• cDNA Only Data (data top center)

• Affy Only Data (data bottom center)

Test results:

• Combined Data (DiProPerm top right)

• Affy only (DiProPerm bottom right)

• No cDNA (since clearly worse T-stats)

Improved Statistical Power - NCI 60 Melanoma

Improved Statistical Power - NCI 60 Leukemia

Improved Statistical Power - NCI 60 NSCLC

Improved Statistical Power - NCI 60 Renal

Improved Statistical Power - NCI 60 CNS

Improved Statistical Power - NCI 60 Ovarian

Improved Statistical Power - NCI 60 Colon

Improved Statistical Power - NCI 60 Breast

Improved Statistical Power - Summary

Type        cDNA-t    Affy-t    Comb-t    Affy-P    Comb-P
Melanoma    36.8      39.9      51.8      e-7       0
Leukemia    18.3      23.8      27.5      0.12      0.00001
NSCLC       17.3      25.1      23.5      0.18      0.02
Renal       15.6      20.1      22.0      0.54      0.04
CNS         13.4      18.6      18.9      0.62      0.21
Ovarian     11.2      20.8      17.0      0.21      0.27
Colon       10.3      17.4      16.3      0.74      0.58
Breast      13.8      19.6      19.3      0.51      0.16


Normal’n

Summary of Results:

• T-statistics

– cDNA T-stats always substantially smaller

– So never so powerful as Affy

– Conclude Affy gives stronger results

– But Combined is not comparable

– Because based on different sample sizes

– Need p-value to put on same scale
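The dependence on sample size is visible in the two-sample t statistic itself. The pooled-variance form is assumed here, since the slides do not say which variant was used:

```latex
t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{1/n_X + 1/n_Y}},
\qquad
s_p^2 = \frac{(n_X - 1)\, s_X^2 + (n_Y - 1)\, s_Y^2}{n_X + n_Y - 2}.
```

For a fixed standardized difference $(\bar{x} - \bar{y}) / s_p$, doubling both sample sizes multiplies $t$ by $\sqrt{2}$, so a larger t statistic from the combined data does not by itself indicate a stronger result.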


Normal’n

Summary of Results:

• P-values

– Combined Results Better for 7 out of 8 cases

– Combined significant, Affy not, in 3 cases (Leukemia, NSCLC, Renal)

– Shows combining platforms often worthwhile (because more data gives more power)
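The two counts can be reproduced from the p-value columns of the summary table. In this sketch the 0.05 significance level is an assumption (the slides do not state a level), and the Melanoma Affy entry, which the table gives only as "e-7", is represented by the placeholder 1e-7; any positive value of that order leads to the same counts.

```python
# (cancer type, Affy-only p-value, Combined p-value) from the summary table.
# Placeholder: the Melanoma Affy entry is printed only as "e-7".
P_VALUES = [
    ("Melanoma", 1e-7, 0.0),
    ("Leukemia", 0.12, 0.00001),
    ("NSCLC", 0.18, 0.02),
    ("Renal", 0.54, 0.04),
    ("CNS", 0.62, 0.21),
    ("Ovarian", 0.21, 0.27),
    ("Colon", 0.74, 0.58),
    ("Breast", 0.51, 0.16),
]
ALPHA = 0.05  # assumed significance level, not stated in the slides

# Types where combining platforms gives the smaller p-value.
combined_better = [name for name, affy, comb in P_VALUES if comb < affy]
# Types significant with combined data but not with Affy alone.
only_combined_significant = [
    name for name, affy, comb in P_VALUES if comb < ALPHA <= affy
]

print(len(combined_better), "of", len(P_VALUES), "types: combined p-value smaller")
print("Significant only after combining:", only_combined_significant)
```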

Comparison with previous heuristics…

Revisit Real Data (Cont.)

Previous Heuristic Results (from rand re-ord):

Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma          CNS             NSCLC
Leukemia          Ovarian         Breast
Renal             Colon

Statistically Sign’t (as expected)

Not Sign’t (as expected)

Surprising result (not consistent with vis’n)


Normal’n

Next time:

• Put sample sizes after each in previous slide (and maybe in table of results) to show how that impacts results

– Note this gives good explanation, as shown in next class meeting…

• Explore with more views, why NSCLC is significant, but Breast is not…
