Object Orie’d Data Analysis, Last Time


Page 1: Object Orie’d Data Analysis, Last Time

Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics

• In spirit of classical mathematical statistics

• But limit as $d \to \infty$, not usual $n \to \infty$

• Saw variation goes into random rotation

• Modulo rotation, have fixed structure

• Convergence to vertices of unit simplex

• Gave statistical insights

(e.g. methods all come together for high d)

Page 2: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics

Interesting Idea from Travis Gaydos:

Interpret from viewpoint of dual space

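Recall from Aug. 25: for $Z_1, Z_2 \sim N_d(0, I)$, with $Z_{j,i}$ denoting entry $i$ of $Z_j$:

• Distance to origin: $\|Z_j\| = \left( \sum_{i=1}^{d} Z_{j,i}^2 \right)^{1/2} = d^{1/2} + O_p(1)$

• Pairwise Distance: $\|Z_1 - Z_2\| = \left( \sum_{i=1}^{d} (Z_{1,i} - Z_{2,i})^2 \right)^{1/2} = (2d)^{1/2} + O_p(1)$

• Angle from origin: $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{i=1}^{d} Z_{1,i} Z_{2,i}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$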
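The three limits recalled on this slide can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `hdlss_geometry`, the seed, and the particular values of d are illustrative choices, not taken from the slides.

```python
import numpy as np

def hdlss_geometry(d, rng):
    """Draw Z1, Z2 ~ N_d(0, I); return (norm of Z1, pairwise distance, angle in degrees)."""
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    norm1 = np.linalg.norm(z1)
    pair = np.linalg.norm(z1 - z2)
    cos_angle = z1 @ z2 / (norm1 * np.linalg.norm(z2))
    # Clip guards against tiny floating-point overshoot outside [-1, 1]
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return norm1, pair, angle

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
for d in [10, 100, 1000, 10000]:
    norm1, pair, angle = hdlss_geometry(d, rng)
    # Dividing by d^{1/2} puts the limits at 1 and sqrt(2); the angle limit is 90 degrees
    print(d, norm1 / np.sqrt(d), pair / np.sqrt(d), angle)
```

As d grows, the rescaled norm should settle near 1, the rescaled pairwise distance near $\sqrt{2}$, and the angle near 90 degrees, with fluctuations shrinking at the stated rates.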

Page 3: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Study these for simple Examples: $Z_1, Z_2 \sim N_d(0, I)$

Over range $d = 1, 3, 10, 39, 100, 300, 1000$

Look in dual space:

• Dimension $n = 2$ (easy to visualize)

• Entries of $Z_1$ & $Z_2$ appear as points: $(Z_{1,i}, Z_{2,i})$, $i = 1, \ldots, d$

• Relate HDLSS phenomena to these points

Page 4: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Page 5: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Page 6: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Page 7: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Page 8: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Page 9: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Page 10: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Page 11: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Notes:

• Upper left: Dual view of data

• Upper right: Dual view of squares $(Z_{1,i}^2, Z_{2,i}^2)$

• Lower: renormalized to see data addition: $(Z_{1,i}^2 / d, \; Z_{2,i}^2 / d)$

• Lower right: study distance to origin

• Lower left: study pairwise distance

Page 12: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Summary of insights:

• Distance to Origin

– Study via $\|Z_j\| = \left( \sum_{i=1}^{d} Z_{j,i}^2 \right)^{1/2} = d^{1/2} + O_p(1)$, i.e. $\frac{1}{d} \sum_{i=1}^{d} Z_{j,i}^2 = 1 + O_p(d^{-1/2})$

– Expected to converge to 1

– Upper & Lower Right shows squares $Z_{1,i}^2, Z_{2,i}^2$

– Average shown in green: $\frac{1}{d} \sum_{i=1}^{d} Z_{j,i}^2$

– Can see convergence to 1 (stable green lines)
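The rate for the averaged squares follows from a short moment calculation. This is a sketch for independent standard normal entries, writing $Z_{j,i}$ for entry $i$ of vector $Z_j$ (an indexing convention assumed here):

```latex
\begin{aligned}
E\,Z_{j,i}^2 &= 1, \qquad
\operatorname{Var}\,Z_{j,i}^2 = E\,Z_{j,i}^4 - \left(E\,Z_{j,i}^2\right)^2 = 3 - 1 = 2, \\
\operatorname{Var}\!\left( \frac{1}{d} \sum_{i=1}^{d} Z_{j,i}^2 \right) &= \frac{2}{d}
\quad\Longrightarrow\quad
\frac{1}{d} \sum_{i=1}^{d} Z_{j,i}^2 = 1 + O_p(d^{-1/2}), \\
\|Z_j\| &= d^{1/2} \left( 1 + O_p(d^{-1/2}) \right)^{1/2} = d^{1/2} + O_p(1).
\end{aligned}
```

The last line uses $(1 + x)^{1/2} = 1 + O(x)$ for small $x$, so the norm form and the averaged form of the statement are equivalent.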

Page 13: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Summary of insights:

• Pairwise Distance:

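– Study via $\|Z_1 - Z_2\| = \left( \sum_{i=1}^{d} (Z_{1,i} - Z_{2,i})^2 \right)^{1/2} = (2d)^{1/2} + O_p(1)$, i.e. $\left( \frac{1}{d} \sum_{i=1}^{d} (Z_{1,i} - Z_{2,i})^2 \right)^{1/2} = 2^{1/2} + O_p(d^{-1/2})$

– Expected to converge to $\sqrt{2}$

– Lower Left shows Sqrt(Average)

– Can see convergence (stable red line)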

Page 14: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Summary of insights:

• Angle from origin:

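– Study via $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{i=1}^{d} Z_{1,i} Z_{2,i}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$, i.e. $\frac{1}{d} \sum_{i=1}^{d} Z_{1,i} Z_{2,i} = 0 + O_p(d^{-1/2})$

– Expected to converge to 0

– Upper left text shows Convergence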
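The three dual-view summaries (average of squares, square root of the average squared difference, average product) can be computed directly from the cloud of d points in the plane. A minimal sketch, assuming NumPy; the function name `dual_view_summaries` and the seed are hypothetical, and the values of d are the ones listed on the earlier slide.

```python
import numpy as np

def dual_view_summaries(d, rng):
    """Return the three dual-space averages for Z1, Z2 ~ N_d(0, I)."""
    # Row i of `points` is the dual-space point (Z_{1,i}, Z_{2,i}); shape (d, 2)
    points = rng.standard_normal((d, 2))
    mean_sq = (points ** 2).mean(axis=0)                     # each entry tends to 1
    diff = points[:, 0] - points[:, 1]
    root_mean_sq_diff = np.sqrt((diff ** 2).mean())          # tends to sqrt(2)
    mean_prod = (points[:, 0] * points[:, 1]).mean()         # tends to 0
    return mean_sq, root_mean_sq_diff, mean_prod

rng = np.random.default_rng(0)  # arbitrary seed
for d in [1, 3, 10, 39, 100, 300, 1000]:
    mean_sq, rms_diff, mean_prod = dual_view_summaries(d, rng)
    print(f"d={d:5d}  mean squares={mean_sq.round(3)}  "
          f"sqrt(mean sq diff)={rms_diff:.3f}  mean product={mean_prod:+.3f}")
```

Each HDLSS statement becomes an ordinary law-of-large-numbers statement about these n = 2 dimensional points, which is what makes the dual view easy to visualize.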

Page 15: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Would be interesting to try:

• Study (i.e. explore conditions for):

– Consistency

– Strong Inconsistency

for PCA direction vectors, from this viewpoint

Perhaps other things as well…

Page 16: Object Orie’d Data Analysis, Last Time

NCI 60 Data

Recall from:

• Aug. 28

• Aug. 30

NCI 60 Cancer Cell Lines Microarray Data

• Explored Data Combination

• cDNA & Affymetrix Measurements

• Right answer is known

Page 17: Object Orie’d Data Analysis, Last Time

Interesting Benchmark Data Set

• NCI 60 Cell Lines

– Interesting benchmark, since same cells

– Data Web available: http://discover.nci.nih.gov/datasetsNature2000.jsp

– Both cDNA and Affymetrix Platforms

• Different from Breast Cancer Data

– Which had no common samples

Page 18: Object Orie’d Data Analysis, Last Time

NCI 60: Raw Data, Platform Colored

Page 19: Object Orie’d Data Analysis, Last Time

NCI 60: Raw Data

Page 20: Object Orie’d Data Analysis, Last Time

NCI 60: Raw Data, Before DWD Adjustment

Page 21: Object Orie’d Data Analysis, Last Time

NCI 60: Before & After DWD adjustment

Page 22: Object Orie’d Data Analysis, Last Time

NCI 60

Leave out many slides studied on 8/28/07

Page 23: Object Orie’d Data Analysis, Last Time

NCI 60: Fully Adjusted Data, Platform Colored

Page 24: Object Orie’d Data Analysis, Last Time

NCI 60: Fully Adjusted Data, Melanoma Cluster

BREAST.MDAMB435 BREAST.MDN MELAN.MALME3M MELAN.SKMEL2 MELAN.SKMEL5 MELAN.SKMEL28 MELAN.M14 MELAN.UACC62 MELAN.UACC257

Page 25: Object Orie’d Data Analysis, Last Time

NCI 60: Fully Adjusted Data, Leukemia Cluster

LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR

Page 26: Object Orie’d Data Analysis, Last Time

Another DWD Appl’n: Visualization

• Recall PCA limitations

• DWD uses class info

• Hence can “better separate known classes”

• Do this for pairs of classes (DWD just on those, ignore others)

• Carefully choose pairs in NCI 60 data

• Shows Effectiveness of Adjustment

Page 27: Object Orie’d Data Analysis, Last Time

NCI 60: Views using DWD Dir’ns (focus on biology)

Page 28: Object Orie’d Data Analysis, Last Time

DWD Visualization of NCI 60 Data

• Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)

• Using these carefully chosen directions

• Others less clear cut

– NSCLC (at least 3 subtypes)

– Breast (4 published subtypes)

• DWD adjustment was very effective (very few black connectors visible)

Page 29: Object Orie’d Data Analysis, Last Time

DWD Views of NCI 60 Data

Interesting Question:

Which clusters are really there?

Issues:

• DWD great at finding dir’ns of separation

• And will do so even if no real structure

• Is this happening here?

• Or: which clusters are important?

• What does “important” mean?

Page 30: Object Orie’d Data Analysis, Last Time

Real Clusters in NCI 60 Data

Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

Deeper Approach

• Formal Hypothesis Testing

(Done later)

Page 31: Object Orie’d Data Analysis, Last Time

Random Relabelling #1

Page 32: Object Orie’d Data Analysis, Last Time

Random Relabelling #2

Page 33: Object Orie’d Data Analysis, Last Time

Random Relabelling #3

Page 34: Object Orie’d Data Analysis, Last Time

Random Relabelling #4

Page 35: Object Orie’d Data Analysis, Last Time

Revisit Real Data

Page 36: Object Orie’d Data Analysis, Last Time

Revisit Real Data (Cont.)

Heuristic Results:

Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma          CNS             NSCLC
Leukemia          Ovarian         Breast
Renal             Colon

Later: will find way to quantify these ideas

i.e. develop statistical significance

Page 37: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

• Is statistical power actually improved?

• Is there benefit to data combo by DWD?

• More data ⇒ more power?

• Will study later

Page 38: Object Orie’d Data Analysis, Last Time

Real Clusters in NCI 60 Data?

From Aug. 30: Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

• Some types appeared signif’ly different

• Others did not

Deeper Approach:

Formal Hypothesis Testing

Page 39: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing

Approach: DiProPerm Test

DIrection – PROjection – PERMutation

Ideas:

• Find an appropriate Direction vector

• Project data into that 1-d subspace

• Construct a 1-d test statistic

• Analyze significance by Permutation

Page 40: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing – DiProPerm test

DiProPerm Test

Context:

Given 2 sub-populations, X & Y

Are they from the same distribution?

Or significantly different?

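$H_0: \mathcal{L}_X = \mathcal{L}_Y$ vs. $H_1: \mathcal{L}_X \neq \mathcal{L}_Y$ (where $\mathcal{L}$ denotes the distribution, i.e. law, of each sub-population)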

Page 41: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing – DiProPerm test

Reasonable Direction vectors:

Mean Difference

SVM

Maximal Data Piling

DWD (used in the following)

Any good discrimination direction…

Page 42: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing – DiProPerm test

Reasonable Projected 1-d statistics:

Two sample t-test (used here)

Chi-square test for different variances

Kolmogorov - Smirnov

Any good distributional test…

Page 43: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing – DiProPerm test

DiProPerm Test Steps:

1. For original data:

– Find Direction vector

– Project Data, Compute True Test Statistic

2. For (many) random relabellings of data:

– Find Direction vector

– Project Data, Compute Perm’d Test Stat

3. Compare:

– True Stat among population of Perm’d Stat’s

– Quantile gives p-value
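The three steps can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy: it uses the mean difference direction (one of the directions the slides list as reasonable) in place of DWD, which the slides actually use, and the function names `mean_difference_direction`, `two_sample_t` and `diproperm` are hypothetical.

```python
import numpy as np

def mean_difference_direction(x, y):
    """Unit vector pointing from the mean of y to the mean of x (rows are observations)."""
    w = x.mean(axis=0) - y.mean(axis=0)
    return w / np.linalg.norm(w)

def two_sample_t(a, b):
    """Two-sample t statistic (unequal-variance form) for 1-d samples a and b."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

def diproperm(x, y, n_perm=1000, seed=0):
    """Return (true statistic, array of permuted statistics, empirical p-value)."""
    rng = np.random.default_rng(seed)

    def statistic(a, b):
        w = mean_difference_direction(a, b)  # Direction
        return two_sample_t(a @ w, b @ w)    # Projection, then 1-d statistic

    t_true = statistic(x, y)                 # step 1: original labels
    pooled = np.vstack([x, y])
    n_x = len(x)
    t_perm = np.empty(n_perm)
    for k in range(n_perm):                  # step 2: random relabellings
        idx = rng.permutation(len(pooled))
        # The direction is recomputed for every relabelling
        t_perm[k] = statistic(pooled[idx[:n_x]], pooled[idx[n_x:]])
    # Step 3: fraction of permuted statistics at least as large as the true one
    p_value = np.mean(t_perm >= t_true)
    return t_true, t_perm, p_value

# Usage on simulated data with a clear mean shift (values chosen arbitrarily)
rng = np.random.default_rng(1)
x = rng.standard_normal((20, 50)) + 3.0
y = rng.standard_normal((20, 50))
t_true, t_perm, p = diproperm(x, y, n_perm=200)
print(t_true, p)
```

With this direction the projected mean difference is always positive, so a one-sided comparison is used. Recomputing the direction inside the loop is the essential design point: the permuted statistics then have the same "found a separating direction" advantage as the true one.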

Page 44: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing – DiProPerm test

Remarks:

• Generally can’t use standard null dist’ns…

• e.g. Student’s t-table, for t-statistic

• Because Direction and Projection give nonstandard context

• I.e. violate traditional assumptions

• E.g. DWD finds separating directions

• Giving completely invalid test if the t-table is used

• This motivates Permutation approach
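A small simulation illustrates why the t-table fails. This is a sketch under stated assumptions: both samples come from the same $N_d(0, I)$ distribution, the mean difference direction stands in for DWD, and the sizes n = 20 and d = 1000 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed
n, d = 20, 1000                      # HDLSS: dimension far above sample size
x = rng.standard_normal((n, d))      # both samples from the SAME N_d(0, I)
y = rng.standard_normal((n, d))

# Direction chosen from the data (mean difference, a stand-in for DWD)
w = x.mean(axis=0) - y.mean(axis=0)
w /= np.linalg.norm(w)

px, py = x @ w, y @ w                # 1-d projections onto that direction
se = np.sqrt(px.var(ddof=1) / n + py.var(ddof=1) / n)
t_naive = (px.mean() - py.mean()) / se

print(f"naive t-statistic under the null: {t_naive:.1f}")
```

Even though there is no real difference, the t-statistic comes out far above the usual two-sided 5% cutoff of roughly 2, because the direction was selected to separate the two labelled groups. Comparing against permuted statistics, which enjoy the same selection effect, corrects for this.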

Page 45: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

Clearly Distinct Populations in This Example

Ignore this “Extreme Labelling” for now

Will become important later…

Page 46: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

Page 47: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

Page 48: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

Page 49: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

Page 50: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

.

.

.

Repeat this 1,000 times

To get:

Page 51: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

Page 52: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

Results:

Random relabelling gives much smaller T-stats

Quantiles (over 1000 sim’s) give p-val of 0 (no permuted T-stat reached the true one)

I.e. Strongly conclusive

Conclude sub-populations are different

Page 53: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 2, No Difference

Indistinct subpopulations in this Example

Again ignore “Extreme Labelling” (will study later)

Explore DiProPerm Test

Page 54: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 2, No Difference

Page 55: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 2, No Difference

Page 56: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 2, No Difference

Page 57: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 2, No Difference

Page 58: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 2, No Difference

Page 59: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 2, No Difference

Results:

Random relabelling gives similar T-stats

So ordinary T-stat is not significant

No strong evidence that:

sub-populations are different

Page 60: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

• Is statistical power actually improved?

• Is there benefit to data combo by DWD?

• More data ⇒ more power?

• Will study now

Page 61: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

For each cancer type, study each of:

• Combined Data (data top left)

• cDNA Only Data (data top center)

• Affy Only Data (data bottom center)

Test results:

• Combined Data (DiProPerm top right)

• Affy only (DiProPerm bottom right)

• No cDNA (since clearly worse T-stats)

Page 62: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - NCI 60 Melanoma

Page 63: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - NCI 60 Leukemia

Page 64: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - NCI 60 NSCLC

Page 65: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - NCI 60 Renal

Page 66: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - NCI 60 CNS

Page 67: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - NCI 60 Ovarian

Page 68: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - NCI 60 Colon

Page 69: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - NCI 60 Breast

Page 70: Object Orie’d Data Analysis, Last Time

Improved Statistical Power - Summary

Type        cDNA-t   Affy-t   Comb-t   Affy-P   Comb-P
Melanoma    36.8     39.9     51.8     e-7      0
Leukemia    18.3     23.8     27.5     0.12     0.00001
NSCLC       17.3     25.1     23.5     0.18     0.02
Renal       15.6     20.1     22.0     0.54     0.04
CNS         13.4     18.6     18.9     0.62     0.21
Ovarian     11.2     20.8     17.0     0.21     0.27
Colon       10.3     17.4     16.3     0.74     0.58
Breast      13.8     19.6     19.3     0.51     0.16

Page 71: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

Summary of Results:

• T-statistics

– cDNA T-stats always substantially smaller

– So never so powerful as Affy

– Conclude Affy gives stronger results

– But Combined is not comparable

– Because based on different sample sizes

– Need p-value to put on same scale

Page 72: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

Summary of Results:

• P-values

– Combined Results Better for 7 out of 8 cases

– Combined significant, Affy not, in 3 cases (Leukemia, NSCLC, Renal)

– Shows combining platforms often worthwhile (because more data gives more power)
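The two counts in this summary can be reproduced from the p-value columns of the summary table. A minimal sketch in plain Python; reading the entry "e-7" as 1e-7 and using a 0.05 significance level are both assumptions, since the slides state neither explicitly.

```python
# (type, Affy p-value, Combined p-value) copied from the summary table.
# Assumption: the table entry "e-7" is read as 1e-7.
p_values = [
    ("Melanoma", 1e-7, 0.0),
    ("Leukemia", 0.12, 0.00001),
    ("NSCLC",    0.18, 0.02),
    ("Renal",    0.54, 0.04),
    ("CNS",      0.62, 0.21),
    ("Ovarian",  0.21, 0.27),
    ("Colon",    0.74, 0.58),
    ("Breast",   0.51, 0.16),
]
ALPHA = 0.05  # assumed significance level; not stated on the slides

# Types where combining platforms gives a smaller p-value than Affy alone
combined_better = [t for t, affy, comb in p_values if comb < affy]
# Types significant for Combined but not for Affy alone
newly_significant = [t for t, affy, comb in p_values if comb < ALPHA <= affy]

print(len(combined_better), "of", len(p_values), "types have a smaller combined p-value")
print("significant only after combining:", newly_significant)
```

Under these assumptions the counts match the slide: Ovarian is the one type where the combined p-value is larger, and Leukemia, NSCLC and Renal are the three that become significant only after combining.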

Comparison with previous heuristics…

Page 73: Object Orie’d Data Analysis, Last Time

Revisit Real Data (Cont.)

Previous Heuristic Results (from rand re-ord):

Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma          CNS             NSCLC
Leukemia          Ovarian         Breast
Renal             Colon

Statistically Sign’t (as expected)

Not Sign’t (as expected)

Surprising result (not consistent with vis’n)

Page 74: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

Next time:

• Put sample sizes after each in previous slide (and maybe in table of results) to show how that impacts results

– Note this gives good explanation, as shown in next class meeting…

• Explore with more views, why NSCLC is significant, but Breast is not…