Statistical Analysis of cDNA Microarray Data: Challenges and Solutions


Toni Reverter

CSIRO – Livestock Industries

AAHL Seminar - 12 Dec. 2002

Challenges

Time Dependent, Data Dependent, Human Dependent

Chronology, Paradigm, Skill Integration

Distribution

Source, Size

Logical:
•1800s – DATA
•30-60s – METHODS
•50-70s – SOFTWARE
•1980s – COMPUTER

cDNA

Quantitative: Computer Sci., Statisticians, Mathematicians, …

Non-Q: Biochemists, Physiologists, Pathologists, …

Historical, Excitement, Balance, Interdisciplinary


EGG BANANA

“banana omelette”

Human Dependent

Challenges

Historical

•Traditionally: Statistics grew alongside Agriculture

“Introduction to Statistical Analysis”

•Nowadays: Statistics alongside (Bio)Technology

•Law of Large Numbers
•Central Limit Theorem
•Pythagoras Theorem

SST = SSM + SSE

[Figure: triangle with sides labelled a, b and d]
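The identity SST = SSM + SSE is the Pythagoras theorem applied to the data: the fitted values and the residuals are orthogonal, so their squared lengths add up. A minimal awk sketch, assuming a one-way layout in which the model is simply the group mean; the function name `ss_decompose` and the four data points are made up for illustration and are not from the talk.

```sh
# Decompose the total sum of squares for a one-way layout.
# Input lines: "<group> <value>" (illustrative data, not from the talk).
ss_decompose() {
  awk '
    { n++; g[n] = $1; y[n] = $2; sum += $2; gsum[$1] += $2; gn[$1]++ }
    END {
      mean = sum / n
      for (i = 1; i <= n; i++) {
        fit = gsum[g[i]] / gn[g[i]]   # fitted value = mean of the group
        sst += (y[i] - mean)^2        # total sum of squares
        ssm += (fit - mean)^2         # model sum of squares
        sse += (y[i] - fit)^2         # error sum of squares
      }
      printf "SST=%.2f SSM=%.2f SSE=%.2f\n", sst, ssm, sse
    }'
}

printf 'a 1\na 3\nb 6\nb 10\n' | ss_decompose
```

The three printed values can be checked by eye: the model and error terms sum to the total.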

Hysterical


Human Dependent

Challenges

Excitement (source of)

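Eg. Always log spot intensities and ratios (T. Speed, “Hints and Prejudices”)

•Biochemist: My software does it, therefore it’s great!
•Statistician: Well, I need further evidence to be convinced

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1ex] \ln(x), & \lambda = 0 \end{cases}$$

$$l(\lambda) = -\frac{n}{2}\,\ln\!\left[\frac{1}{n}\sum_{j=1}^{n}\left(x_j^{(\lambda)} - \overline{x^{(\lambda)}}\right)^{2}\right] + (\lambda - 1)\sum_{j=1}^{n}\ln(x_j), \qquad \overline{x^{(\lambda)}} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(\lambda)}$$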

Eg. Keren Byrne’s Data
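Whether the log is the right transformation can be checked against the data rather than assumed. A minimal sketch, assuming the Box-Cox power family x^(λ) and its profile log-likelihood l(λ), in which λ = 0 is the log transformation and λ = 1 leaves the data essentially untransformed. The function name `boxcox_loglik` and the example values are illustrative only; they are not Keren Byrne's data.

```sh
# Box-Cox profile log-likelihood l(lambda) for positive values on stdin.
# Usage: boxcox_loglik LAMBDA < file_with_one_value_per_line
boxcox_loglik() {
  awk -v lam="$1" '
    $1 > 0 { n++; x[n] = $1; sumlog += log($1) }
    END {
      for (j = 1; j <= n; j++) {
        # Power transformation; the log is the limiting case lambda = 0
        t[j] = (lam == 0) ? log(x[j]) : (x[j]^lam - 1) / lam
        tsum += t[j]
      }
      tbar = tsum / n
      for (j = 1; j <= n; j++) ss += (t[j] - tbar)^2
      printf "%.4f\n", -(n / 2) * log(ss / n) + (lam - 1) * sumlog
    }'
}

# Scan a small grid of lambda values on made-up intensities
for lam in -1 -0.5 0 0.5 1; do
  printf '%s %s\n' "$lam" "$(printf '1\n2\n4\n8\n16\n' | boxcox_loglik "$lam")"
done
```

The λ with the largest l(λ) is the transformation best supported by the data; if that value is far from 0, "always log" deserves a second look for that data set.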


Human Dependent

Challenges

Balance

•Too many Statisticians:

Evidence: It takes 1 ship, 10 days to cross the ocean
Question: How many days does it take for 10 ships to cross the ocean?

Evidence: It takes 1 builder, 10 days to build a wall
Question: How many days does it take for 10 builders to build a wall?

Human Dependent

Challenges

Balance

•Too many Statisticians:

PHD SCHOLARSHIP
Statistical Science Program

MATHEMATICAL SCIENCES INSTITUTE
THE AUSTRALIAN NATIONAL UNIVERSITY

Stipend $22,771 (2002 rate, indexed annually, tax free)

A PhD Scholarship (APAI) is being offered by the Mathematical Sciences Institute at The ANU. An ARC Linkage Grant held by Professors Peter Hall (ANU) and Don Poskitt (Monash University), in conjunction with BAE Systems, Melbourne, will fund the scholarship.

The research problem is in the area of stochastic control applied to ship motion, and involves the development and implementation of both parametric and nonparametric methods. The successful applicant will have a strong interest in statistical methodology, computational techniques, theoretical analysis, and the development of statistical research problems.


Human Dependent

Challenges

Balance

•Too many Biochemists:

All patients:

              Treated? No   Treated? Yes
Died? No          100           150
Died? Yes         120           120

Survival Rates:
Treated = 150/270 = 55.56%
Non-Tr = 100/220 = 45.45%
22% Increase!

Women:

              Treated? No   Treated? Yes
Died? No           60            30
Died? Yes         100            60

Survival Rates:
Treated = 30/90 = 33.33%
Non-Tr = 60/160 = 37.50%
12.5% Decrease! (relative to the treated rate; 11.1% relative to the non-treated rate)

Men:

              Treated? No   Treated? Yes
Died? No           40           120
Died? Yes          20            60

Survival Rates:
Treated = 120/180 = 66.67%
Non-Tr = 40/60 = 66.67%
No Difference!
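The reversal on this slide (treatment looks beneficial overall, yet is neutral in men and harmful in women) can be reproduced directly from the counts. A minimal awk sketch using the slide's numbers; the function name `survival_rates` and the input layout (stratum, arm, survived, died) are assumptions made for the example. The pooled "All" rows are obtained by summing the two strata.

```sh
# Survival rate (%) per stratum and arm, plus the pooled rate per arm.
# Input columns: stratum arm survived died
survival_rates() {
  awk '
    { surv[$2] += $3; died[$2] += $4
      printf "%s %s %.2f\n", $1, $2, 100 * $3 / ($3 + $4) }
    END {
      split("Treated Non-Tr", arm, " ")
      for (i = 1; i <= 2; i++)
        printf "All %s %.2f\n", arm[i], \
          100 * surv[arm[i]] / (surv[arm[i]] + died[arm[i]])
    }'
}

survival_rates <<'EOF'
Women Treated 30 60
Women Non-Tr 60 100
Men Treated 120 60
Men Non-Tr 40 20
EOF
```

The pooled comparison is driven by the imbalance between strata: most treated patients are men, who survive at a higher rate regardless of treatment.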

Human Dependent

Challenges

Balance

•Too many Biochemists:

[Figure: scatter plot of y against x showing two clusters of points, annotated with r = 0.87, r = 0.00 and r = 0.00]
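A correlation computed over pooled groups can be large even when there is no association inside any single group. A minimal awk sketch of that effect, assuming two made-up clusters of four points each (the values are illustrative and will not reproduce the slide's r = 0.87); `pearson_r` is a hypothetical helper name.

```sh
# Pearson correlation of (x, y) pairs read from stdin, one pair per line.
pearson_r() {
  awk '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
      cov = sxy - sx * sy / n         # corrected sum of cross products
      vx  = sxx - sx * sx / n         # corrected sum of squares of x
      vy  = syy - sy * sy / n         # corrected sum of squares of y
      printf "%.2f\n", cov / sqrt(vx * vy)
    }'
}

group1='0 0
0 1
1 0
1 1'
group2='5 5
5 6
6 5
6 6'

echo "$group1" | pearson_r                          # within group 1
echo "$group2" | pearson_r                          # within group 2
printf '%s\n%s\n' "$group1" "$group2" | pearson_r   # both groups pooled
```

The pooled coefficient reflects only the distance between the two cluster centres, which is why group structure should be inspected before a correlation is interpreted.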

Human Dependent

Challenges

Interdisciplinary Skills

Minimal knowledge of the application discipline is needed

…failing that, the Statisticians will win, …but with the wrong weapons.

1. Amount of Expression = Amount of Response
2. Same cut-off point to judge all genes
3. Over-emphasis in normalization (Thus, reject “Boutique Arrays”)
4. Over-emphasis in variance stabilization

Human Dependent

Challenges

Interdisciplinary Skills

Minimal knowledge of the application discipline is needed: “Animal Breeding & Genetics”

Ex.1: What’s a Steer?

Ex.2: Ralf Moser’s Data

[Figure: scatter plot with axes “% Lung Disease” and “Wt Gain, Kg”]

Options:
1. % Gain vs. % Disease
2. Medians instead of Means
3. Regression coefficients*

Solutions

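[Diagram: the four conditions O, A, B and AB; residual labels: Disease; Wt Gain, Kg]

O: Control (Untreated)
A: Treatment A
B: Treatment B
AB: Both Treatments

Model:

$$O = \mu, \qquad A = \mu + \alpha, \qquad B = \mu + \beta, \qquad AB = \mu + \alpha + \beta + \alpha\beta$$

The ratio:

$$M_{A.AB} = \log\frac{R_{(A)}}{G_{(AB)}} = \log R_{A} - \log G_{AB}$$

estimates

$$A - AB = -(\beta + \alpha\beta)$$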

Solutions

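$$M = X\theta + E, \qquad \theta = (\alpha,\ \beta,\ \alpha\beta)^{T}, \qquad E = \text{Error}$$

where $M_{X.Y}$ is the log ratio from the slide comparing condition X with condition Y, so that it estimates X - Y:

$$\begin{pmatrix} M_{A.O} \\ M_{O.A} \\ M_{B.O} \\ M_{O.B} \\ M_{AB.O} \\ M_{O.AB} \\ M_{B.A} \\ M_{A.B} \\ M_{AB.A} \\ M_{A.AB} \\ M_{AB.B} \\ M_{B.AB} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 0 \\ 1 & 1 & 1 \\ -1 & -1 & -1 \\ -1 & 1 & 0 \\ 1 & -1 & 0 \\ 0 & 1 & 1 \\ 0 & -1 & -1 \\ 1 & 0 & 1 \\ -1 & 0 & -1 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \\ \alpha\beta \end{pmatrix} + E$$

$$\hat{\theta} = (X^{T}X)^{-1}X^{T}M$$

[Diagram: the four conditions O, A, B and AB]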

Solutions

[Diagrams: three designs connecting O, A, B and AB: Reference, Loop, All-Pairs]

Variance of Estimated Effects (Relative to the All-Pairs)

                    Reference   Loop   All-Pairs
Main effect of A        1       4/3        1
Main effect of B        1        1         1
Interaction AB          3       8/3        2
Contrast A-B            2        1         1
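The relative variances of the three designs follow from the diagonal of the inverse of X'X. A minimal awk sketch under several assumptions that are not stated on the slide: effects are parameterised as (alpha, beta, alpha*beta) relative to the control O; each slide contributes one log ratio with the same error variance; every design is rescaled to 6 slides; results are expressed relative to the All-Pairs main-effect variance (0.5 in these units); and the loop runs O, B, A, AB, O. The function name `design_var` is hypothetical.

```sh
# Relative variances of estimated effects for a two-colour design.
# Each input line is one slide: coefficients of (alpha, beta, alpha*beta)
# in its expected log ratio, e.g. "A vs O" is "1 0 0", "AB vs A" is "0 1 1".
design_var() {
  awk -v total=6 -v base=0.5 '
    { n++; for (i = 1; i <= 3; i++) for (j = 1; j <= 3; j++) a[i, j] += $i * $j }
    END {
      # Inverse of the symmetric 3x3 matrix X^T X via cofactors
      c11 = a[2,2] * a[3,3] - a[2,3] * a[3,2]
      c12 = -(a[2,1] * a[3,3] - a[2,3] * a[3,1])
      c13 = a[2,1] * a[3,2] - a[2,2] * a[3,1]
      det = a[1,1] * c11 + a[1,2] * c12 + a[1,3] * c13
      v11 = c11 / det
      v12 = c12 / det
      v22 = (a[1,1] * a[3,3] - a[1,3] * a[3,1]) / det
      v33 = (a[1,1] * a[2,2] - a[1,2] * a[2,1]) / det
      k = (n / total) / base          # equalise slide count, then rescale
      printf "A=%.3f B=%.3f AB=%.3f A-B=%.3f\n", \
        k * v11, k * v22, k * v33, k * (v11 + v22 - 2 * v12)
    }'
}

printf '1 0 0\n0 1 0\n1 1 1\n' | design_var                           # Reference
printf '0 1 0\n1 -1 0\n0 1 1\n-1 -1 -1\n' | design_var                # Loop
printf '1 0 0\n0 1 0\n1 1 1\n1 -1 0\n0 1 1\n1 0 1\n' | design_var     # All-Pairs
```

Under these assumptions the sketch shows the trade-off the slide points at: the reference design estimates the interaction and the A-B contrast least precisely, because neither is measured directly on any slide.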

Solutions

Probability of both Female?

Case 1. No Information …………………………1/4

Case 2. The one on the left is female …………1/2

Case 3. One of them is female ………….………1/3
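The three answers differ only in which outcomes the available information rules out. A minimal awk sketch that enumerates the four (left, right) combinations, assuming each animal is independently male or female with equal probability; the function name `both_female` and the condition keywords `none`, `left` and `any` are made up for the example.

```sh
# P(both female | information), by enumerating equally likely outcomes.
# cond: none (Case 1) | left (Case 2) | any (Case 3)
both_female() {
  awk -v cond="$1" 'BEGIN {
    split("F M", sex, " ")
    for (i = 1; i <= 2; i++) for (j = 1; j <= 2; j++) {
      l = sex[i]; r = sex[j]
      if (cond == "left" && l != "F") continue              # left one is female
      if (cond == "any" && l != "F" && r != "F") continue   # at least one female
      total++
      if (l == "F" && r == "F") hits++
    }
    printf "%d/%d\n", hits, total
  }'
}

both_female none
both_female left
both_female any
```

Prior information changes the answer by shrinking the set of admissible outcomes, not by changing the outcome of interest.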


Solutions

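$$M = X\theta + E \qquad\Longrightarrow\qquad (X^{T}X)\,\hat{\theta} = X^{T}M$$

3 Equations

$$M = X\theta + Ss + Gg + Ww + E$$

$$\begin{pmatrix} X'X & X'S & X'G & X'W \\ S'X & S'S & S'G & S'W \\ G'X & G'S & G'G & G'W \\ W'X & W'S & W'G & W'W \end{pmatrix} \begin{pmatrix} \hat{\theta} \\ \hat{s} \\ \hat{g} \\ \hat{w} \end{pmatrix} = \begin{pmatrix} X'M \\ S'M \\ G'M \\ W'M \end{pmatrix}$$

> 35,000 Equations !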

Solutions

Clever Programming Tailored to your needs

N=1

for filename in R16T0S1.gpr R16T0S2.gpr R16T24S1.gpr R16T24S2.gpr \
                S32T0S1.gpr S32T0S2.gpr S32T24S1.gpr S32T24S2.gpr
do

  # Get valid readings, compute log ratios
  # Output: ID, background-corrected red and green, and their log2 values
  awk 'NR>30 && $NF>=0 && $4!="no_spot" &&
       substr($4,1,5)!="score" && substr($4,1,5)!="custo" &&
       substr($4,1,6)!="spotre" && $9>$12 && $18>$21 {
         print $4, $9-$12, $18-$21,
               log($9-$12)/log(2.0), log($18-$21)/log(2.0)
       }' "$filename" | sort > junk1

  # Append the log ratio (field 6) and the mean log intensity (field 7)
  awk '$2!=$3 {print $0, $4-$5, 0.5*($4+$5)}' junk1 > junk2

  # Get the median of log ratios
  REC=$(wc -l < junk2 | awk '{print int($1/2)}')
  MED=$(sort -n -k6,6 junk2 | awk -v rec="$REC" 'NR==rec {print $6}')
  echo "Median of file $filename = $MED"

  # Global normalization: subtract the median from each log ratio
  awk -v median="$MED" -v slide="$N" \
      '{print "Slide_"slide, int(slide/2+.5), $1, $6-median}' junk2 |
      sort -k3 > "dat.$N"

  N=$((N + 1))

done

cat dat.1 dat.2 dat.3 dat.4 dat.5 dat.6 dat.7 dat.8 > total.dat

Solutions

Clever Programming Tailored to your needs

[Figure: “Interaction Solutions”, T24 - T0 for Resistant (x axis) against Disease (y axis); both axes run from -4 to 4]

Your Needs: “Important values are…”
1. Away from (0,0)
2. In quadrants 1 and 4.

Generate a new variable:

+1.0*[(R24-R0)+(S0-S24)] if R0<R24 & S0>S24

+0.5*[(R24-R0)+(S24-S0)] if R0<R24 & S0<S24

-0.5*[(R0-R24)+(S0-S24)] if R0>R24 & S0>S24

-1.0*[(R0-R24)+(S24-S0)] if R0>R24 & S0<S24

…then apply model-based clustering.
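The four-case rule above reduces to a signed weight times |R24-R0| + |S24-S0|. A minimal awk sketch, assuming an input file with one gene per line and columns gene, R0, R24, S0, S24; that layout, the function name `interaction_score` and the example values are made up for illustration. Genes with R0 = R24 or S0 = S24 fall outside the four cases and are skipped here.

```sh
# Signed interaction score per gene.
# Input columns: gene R0 R24 S0 S24
interaction_score() {
  awk '{
    r = $3 - $2                       # R24 - R0
    s = $5 - $4                       # S24 - S0
    mag = (r < 0 ? -r : r) + (s < 0 ? -s : s)
    if      (r > 0 && s < 0) score =  1.0 * mag   # R0<R24 & S0>S24
    else if (r > 0 && s > 0) score =  0.5 * mag   # R0<R24 & S0<S24
    else if (r < 0 && s < 0) score = -0.5 * mag   # R0>R24 & S0>S24
    else if (r < 0 && s > 0) score = -1.0 * mag   # R0>R24 & S0<S24
    else next                                     # ties are not covered by the rule
    printf "%s %.2f\n", $1, score
  }'
}

printf 'g1 1 3 2 1\ng2 1 2 1 3\ng3 3 1 2 1\ng4 3 1 1 2\n' | interaction_score
```

The resulting single column per gene is what would then be passed to the clustering step.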


Solutions

Clever Programming Tailored to your needs

[Figure: Differential Expression T24-T0, Susceptible (y axis) against Resistant (x axis); both axes run from -4 to 6]

Solutions

Clever Programming Tailored to your needs

Get to know/use all the available options

1. t-Statistics: Standard, Penalised

2. Clustering: Location-Based (k-Means, …), Model-Based (Mixtures of Distributions)

3. ANOVA (Linear Models)

[Diagram labels: High, Medium, Low; Keren’s, Ralf’s]

Conclusions

Statistical Analysis of cDNA Microarray Data:

GENERAL:
1. Still in its infancy (…possibly even embryonic stage)
2. Many decisions have a heuristic rather than a theoretical foundation
3. No hope for a “One size fits all” software
4. Safer to aim towards “Tailor to one’s needs”
5. Integration of interdisciplinary skills is a must

LIVESTOCK SPECIES:
1. Tailing humans (…at the moment)
2. Strong background knowledge of genetics accumulated
3. Journals will soon be inundated
4. CLI has the opportunity to participate