Transforms and other prestidigitations—or new twists in imputation

Page 1: Transforms and other prestidigitations—or new twists in imputation

Albert R. Stage

Page 2

Imputation:

• To use what we know “everywhere”, which may be useful but is not very interesting in itself: the X’s,

• To fill in detail that is prohibitive to obtain except on a sample: the Y’s,

• By finding surrogates based on similarity of the X’s.
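The three points above can be sketched in a few lines. This is only a minimal illustration, assuming plain Euclidean distance and made-up toy data; the function name `impute_nearest` and the arrays are hypothetical and not part of the MSN software.

```python
import numpy as np

def impute_nearest(x_ref, y_ref, x_target):
    """Copy the Y row of the most similar reference observation to each target."""
    # Squared Euclidean distance between every target row and every reference row.
    diff = x_target[:, None, :] - x_ref[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)
    nearest = np.argmin(d2, axis=1)      # index of the surrogate for each target
    return y_ref[nearest], nearest

# Hypothetical toy data: two X's known everywhere, one Y known on the sample only.
x_ref = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
y_ref = np.array([[10.0], [20.0], [30.0]])
x_target = np.array([[0.9, 1.2], [3.5, 4.1]])

y_imputed, neighbor = impute_nearest(x_ref, y_ref, x_target)
print(neighbor)     # which reference observation serves as surrogate
print(y_imputed)    # the Y values carried over from that surrogate
```

The imputed Y is always a value that was actually observed on some sample unit, which is what distinguishes imputation from a regression prediction.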

Page 3

Topics

• Measures of similarity (a few in particular)

• Alternative MSN distance function leading to some improved estimates

• Transformations that improve resolution
  – On the X-side (known everywhere)
  – On the Y-side (known for sample only)

Page 4

Distance measures for interval and ratio scale variables (Podani 2000)

• Euclidean/Mahalanobis
• Chord
• Angular
• Geodesic
• Manhattan
• Canberra
• Clark
• Bray-Curtis
• Marczewski-Steinhaus
• 1-Kulczynski
• Pinkham-Pearson
• Gleason
• Ellenberg
• Pandeya
• Chi-square
• 1-Correlation
• 1-similarity ratio
• Kendall difference
• Faith intermediate
• Uppsala coefficient

Page 5

Distance measures for binary variables (Podani 2000)

Symmetric for 0/1
• Simple matching
• Euclidean
• Rogers-Tanimoto
• Sokal-Sneath
• Anderberg I
• Anderberg II
• Correlation
• Yule I
• Yule II
• Hamann

Asymmetric for 0/1
• Baroni-Urbani-Buser I
• Baroni-Urbani-Buser II
• Russell-Rao
• Faith I
• Faith II

Ignore 0
• Jaccard
• Sorensen
• Chord
• Kulczynski
• Sokal-Sneath II
• Mountford

Page 6

Distance function in matrix notation

$$D^2_{iu} = \min_i \left[ (X_i - X_u)\, W\, (X_i - X_u)' \right]$$

– Where, for
• Euclidean distance: $W = I$ (identity matrix)
• Mahalanobis distance: $W$ = inverse covariance matrix
• MSN (1995): $W = \Gamma \Lambda^2 \Gamma'$, with:
  $\Gamma$ = matrix of coefficients of canonical variates
  $\Lambda$ = diagonal matrix of canonical correlations
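As a rough sketch of the quadratic-form distance on this slide, the snippet below evaluates (X_i - X_u) W (X_i - X_u)' for the Euclidean and Mahalanobis choices of W. The data and the helper name `quadratic_distance` are hypothetical, and the covariance is estimated from the reference X's only, which is an assumption.

```python
import numpy as np

def quadratic_distance(x_i, x_u, w):
    """Squared distance (x_i - x_u) W (x_i - x_u)' for row vectors."""
    d = x_i - x_u
    return float(d @ w @ d)

# Hypothetical reference X's (rows are observations) and one target observation.
x_ref = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]])
x_u = np.array([2.5, 25.0])

w_euclid = np.eye(2)                                    # W = I
w_mahal = np.linalg.inv(np.cov(x_ref, rowvar=False))    # W = inverse covariance

for name, w in [("Euclidean", w_euclid), ("Mahalanobis", w_mahal)]:
    d2 = np.array([quadratic_distance(x_i, x_u, w) for x_i in x_ref])
    print(name, d2.round(3), "nearest reference:", int(np.argmin(d2)))
```

Only W changes between the distance functions; the neighbor search itself (the minimum over reference observations i) stays the same.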

Page 7

Why Weight with Canonical Analysis?

• Not degraded by non-informative X’s

[Figure: Effect of adding 2 random X's (number of original X's = 21). Vertical axis: RMSE relative to Mahalanobis. Series: Mahalanobis, MSN.]

Page 8

Why Weight with Canonical Analysis?

• Not affected by non-informative Y’s if the number of canonical pairs is determined by a test of significance on rank.

[Figure: Effect of adding 2 random Y's (number of original Y's = 15). Vertical axis: RMSE relative to Mahalanobis. Series: MSN0, MSN0 with 2 random Y's, MSN1, MSN1 with 2 random Y's.]

Page 9

Comparison of MSN Distance Functions

• Moeur and Stage 1995
  – Assumes Y’s are “true”
  – Searches for closest linear combination of Y’s
  – Set of near neighbors sensitive to lower-order canonical correlations

• Stage 2003
  – Assumes Y’s include measurement error
  – Searches for closest linear combination of predicted Y’s
  – Set of near neighbors less sensitive to random elements “swept” into lower-order canonical correlations

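Page 10

New regression alternative:

$$d_{ij}^2 = (X_i - X_j)\, \Gamma \Lambda \left[ (I - \Lambda^2) \right]^{-1} \Lambda \Gamma'\, (X_i - X_j)'$$

$\Lambda$ is the diagonal matrix of canonical correlations $\lambda_{ii}$ for the first $k$ of the $s$ canonical pairs. The weights form a diagonal matrix ($W^{1/2}$) whose entries beyond the $k$-th pair are zero; $k$ is chosen by comparing the cumulative share $\sum_{i=1}^{k} \lambda_{ii} \big/ \sum_{i=1}^{s} \lambda_{ii}$ with the threshold PROPVAR.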
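The two canonical-correlation weightings can be compared numerically. The sketch below is an assumption-laden illustration, not the MSN program's code: it takes the 1995 weight matrix as Γ Λ² Γ' and the regression alternative as Γ Λ² (I − Λ²)⁻¹ Γ', computes the canonical analysis by whitening with Cholesky factors followed by an SVD, and uses simulated data. The function name `canonical_weights` and all inputs are hypothetical.

```python
import numpy as np

def canonical_weights(x, y, k):
    """Weight matrices built from the first k canonical pairs of X and Y."""
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    sxx = xc.T @ xc / (n - 1)
    syy = yc.T @ yc / (n - 1)
    sxy = xc.T @ yc / (n - 1)
    lx = np.linalg.cholesky(sxx)             # sxx = lx @ lx.T
    ly = np.linalg.cholesky(syy)
    # Whitened cross-covariance; its singular values are the canonical correlations.
    m = np.linalg.solve(lx, sxy) @ np.linalg.inv(ly).T
    u, lam, _ = np.linalg.svd(m, full_matrices=False)
    gamma = np.linalg.solve(lx.T, u)[:, :k]  # X-side canonical coefficients
    lam2 = lam[:k] ** 2
    w_1995 = gamma @ np.diag(lam2) @ gamma.T
    w_2003 = gamma @ np.diag(lam2 / (1.0 - lam2)) @ gamma.T
    return w_1995, w_2003, lam[:k]

rng = np.random.default_rng(0)               # simulated, hypothetical data
x = rng.normal(size=(200, 3))
beta = np.array([[1.0, 0.2], [0.5, -1.0], [0.0, 0.3]])
y = x @ beta + rng.normal(size=(200, 2))

w95, w03, lam = canonical_weights(x, y, k=1)
# With a single canonical pair the two matrices differ only by a constant factor.
print(np.allclose(w03, w95 / (1.0 - lam[0] ** 2)))
```

With k = 1 the two weight matrices are proportional, so distances are rescaled by a constant and the ranking of neighbors cannot change; with k > 1 each pair is rescaled by its own factor 1/(1 − λ²), which is where the two functions start to differ.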

Page 11

Effect of change:

• No change if only the first canonical pair is used (the weights are then rescaled by a single constant, so the ranking of neighbors is unchanged).

• The regression alternative gives relatively more weight to the more highly correlated pairs.

• Effects on the root-mean-square error of imputation are mixed, as the following three data sets show:

Page 12

Statistics for three data sets

                        Utah    Tally Lake    User’s Guide
Canonical pairs (s)        9             8               7
Number of Y’s             15             8              17
Number of X’s (p)         12            20               7
Number of obs. (n)      1076           847             197
n/(p*s+s)               13.3          5.04            3.52
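The last row can be recomputed from the rows above it. The sketch below simply evaluates n/(p*s+s) with the values as printed; reading the ratio as observations per fitted coefficient is an interpretation, not something the slide states, and the helper name is hypothetical.

```python
# Values as printed on the slide.
datasets = {
    "Utah": {"s": 9, "p": 12, "n": 1076},
    "Tally Lake": {"s": 8, "p": 20, "n": 847},
    "User's Guide": {"s": 7, "p": 7, "n": 197},
}

def obs_per_coefficient(n, p, s):
    """Evaluate n / (p*s + s) for one data set."""
    return n / (p * s + s)

ratios = {name: obs_per_coefficient(**d) for name, d in datasets.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The Tally Lake and User's Guide ratios reproduce the printed 5.04 and 3.52. The Utah entries as printed give roughly 9.2 rather than 13.3, so one of the Utah figures may have been altered in transcription.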

Page 13

Change in Relative Weights Depends on λ²

                  Utah                 Tally Lake           User’s Guide
Canonical pair    λ²      Rel. Wgt.    λ²      Rel. Wgt.    λ²      Rel. Wgt.
                          New/old              New/old              New/old
1                 0.465   1.00         0.626   1.00         0.691   1.00
2                 0.159   0.64         0.348   0.57         0.454   0.57
3                 0.125   0.61         0.327   0.56         0.247   0.41
4                 0.042   0.56         0.227   0.49         0.219   0.40
Total             0.863                1.861                1.823
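The relative-weight columns can be checked with a little arithmetic. Assuming the new weight on each canonical pair is the old weight divided by (1 − λ²), and scaling so the first pair equals 1, the ratio for pair k is (1 − λ₁²)/(1 − λₖ²). The sketch below applies that to the squared canonical correlations in the table; the function name is hypothetical.

```python
def relative_weight_change(lambda_sq):
    """New/old weight ratio per canonical pair, scaled so the first pair is 1."""
    first = 1.0 - lambda_sq[0]
    return [first / (1.0 - l2) for l2 in lambda_sq]

# Squared canonical correlations for the first four pairs, as tabulated.
table = {
    "Utah": ([0.465, 0.159, 0.125, 0.042], [1.00, 0.64, 0.61, 0.56]),
    "Tally Lake": ([0.626, 0.348, 0.327, 0.227], [1.00, 0.57, 0.56, 0.49]),
    "User's Guide": ([0.691, 0.454, 0.247, 0.219], [1.00, 0.57, 0.41, 0.40]),
}

for name, (lambda_sq, printed) in table.items():
    computed = relative_weight_change(lambda_sq)
    print(name, [round(c, 3) for c in computed], "printed:", printed)
```

The computed ratios agree with the printed ones to within 0.01, the precision to be expected from three-decimal inputs. The larger the first squared correlation, the more the lower-order pairs are down-weighted relative to it.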

Page 14

[Figure: Utah FIA Data. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

Page 15

[Figure: Tally Lake, Montana. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

Page 16

[Figure: MSN User's Guide Example. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

Page 17

Transforming X-variables

• To predict discrete classes of modal species composition (MSC) with Euclidean or Mahalanobis distance.

• To predict continuous variables of species composition

Page 18

Euclidean vs. Cosine (Spectral angle)

[Figure: Target Obs., Ref. A and Ref. B plotted on axes Variable 1 and Variable 2, with the Euclidean distance and the spectral angle marked.]

Page 19

Euclidean distance function with cosine transformation

$$\cos(a_{ij}) = \frac{\sum_{k=1}^{p} x_{ik}\, x_{jk}}{\sqrt{\sum_{k=1}^{p} x_{ik}^2 \; \sum_{k=1}^{p} x_{jk}^2}}$$

Let: $Z_i = X_i / \sqrt{X_i' X_i}$

$$d_{ij}^2 = (Z_i - Z_j)'\, I\, (Z_i - Z_j) = 2\,(1 - \cos(a_{ij}))$$
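The identity on this slide is easy to confirm numerically: once each observation is scaled to unit length, the squared Euclidean distance between two observations equals 2(1 − cos a). A minimal sketch with made-up vectors; the function name `cosine_transform` is hypothetical.

```python
import numpy as np

def cosine_transform(x):
    """Scale each row to unit length: Z_i = X_i / sqrt(X_i' X_i)."""
    return x / np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))

# Hypothetical observations: rows 0 and 1 point the same way, row 2 does not.
x = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, 3.0]])
z = cosine_transform(x)

def squared_distance(a, b):
    return float(np.sum((a - b) ** 2))

cos_02 = float(z[0] @ z[2])              # cosine of the spectral angle
d2_01 = squared_distance(z[0], z[1])     # same direction, different magnitude
d2_02 = squared_distance(z[0], z[2])

print(d2_01, d2_02, 2.0 * (1.0 - cos_02))
```

Rows 0 and 1 differ only in magnitude, so their distance after the transformation is zero. That is the point of the transformation: an ordinary Euclidean (or Mahalanobis) neighbor search applied to Z responds to the direction of the X vector rather than its length.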

Page 20

Effect of using cosine transformation of TM data on classification accuracy*

Attribute                                     Untransformed   Cosine trans.
Plant Assoc. Grp. (Oregon)** (Mahalanobis)    0.340           0.363
Modal Spp. Comp. (Oregon)** (Mahalanobis)     0.276           0.335
Modal Spp. Comp. (Minn.)*** (Euclidean)       0.320           0.328

* Kappa statistics   ** TM data   *** TM+ Enhanced data
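For readers unfamiliar with the accuracy measure in the table, here is a small sketch of a kappa statistic. It assumes the slide means Cohen's kappa (agreement corrected for chance), which the slide does not spell out; the class labels and the function name are hypothetical.

```python
def cohen_kappa(observed, imputed):
    """Agreement between observed and imputed classes, corrected for chance."""
    classes = sorted(set(observed) | set(imputed))
    n = len(observed)
    p_agree = sum(o == m for o, m in zip(observed, imputed)) / n
    # Chance agreement from the marginal class frequencies of each list.
    p_chance = sum(
        (observed.count(c) / n) * (imputed.count(c) / n) for c in classes
    )
    return (p_agree - p_chance) / (1.0 - p_chance)

# Hypothetical class labels for six target observations.
observed = ["A", "A", "B", "B", "C", "C"]
imputed = ["A", "B", "B", "B", "C", "A"]

kappa = cohen_kappa(observed, imputed)
print(round(kappa, 3))
```

A kappa of 0 means agreement no better than chance and 1 means perfect agreement, so the values around 0.3 in the table indicate modest classification accuracy with or without the transformation.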

Page 21

Transforming the Y-variables

• Variance considerations: want homogeneity

• And a logical functional form for Y = f(X)

  – Transformations of species composition
    • Logarithm of species basal area
    • Percent basal area by species
    • Cosine spectral angle
    • Logistic

  – Evaluated by predicting discrete Plant Association Group (PAG), Users’ Guide example data (Oregon)

Page 22

[Figure: Proportion of Species A, Species B. Horizontal axis: Elevation (0 to 120). Vertical axis: proportion (0 to 1). Series: Spp A north, Spp A south, Spp B north, Spp B south.]

Page 23

Composition transformations:

• Logistic:
  $= \ln[(\text{Total BA} - \text{spp BA}) / \text{spp BA}]$
  $= \ln(\text{Total BA} - \text{spp BA}) - \ln(\text{spp BA})$
  Represented in MSN by two separate variables.

• Cosine Spectral Angle:
  $= \text{spp BA} \big/ \sqrt{\sum (\text{spp BA})^2}$
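Both composition transformations are one-liners. The sketch below assumes a plots-by-species array of basal area in which every species is present on every plot and no species makes up a whole plot (otherwise the logarithms are undefined); the data and function names are hypothetical.

```python
import numpy as np

def logistic_terms(spp_ba, total_ba):
    """Two-term logistic form: ln(total - spp) and ln(spp), kept as separate variables."""
    return np.log(total_ba - spp_ba), np.log(spp_ba)

def cosine_spectral(spp_ba):
    """Species BA divided by the root of the summed squared species BA on each plot."""
    return spp_ba / np.sqrt(np.sum(spp_ba ** 2, axis=1, keepdims=True))

# Hypothetical basal area: rows are plots, columns are species.
ba = np.array([[30.0, 10.0, 60.0], [5.0, 5.0, 10.0]])
total = ba.sum(axis=1, keepdims=True)

ln_rest, ln_spp = logistic_terms(ba, total)
logit = ln_rest - ln_spp          # the single-term logistic form
cos_ba = cosine_spectral(ba)

print(logit.round(3))
print(cos_ba.round(3))
```

Keeping the two logarithms as separate Y-variables lets the canonical analysis weight them independently instead of forcing coefficients of +1 and −1.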

Page 24

[Figure: Predicting Plant Assoc. Grp. - Users' Guide data (Std Error Kappa = 0.06). Vertical axis: Kappa statistic (0 to 0.5). Horizontal axis: Transformations of species volumes (Mahal, ln BA, BA%, Cos trans, Logistic).]

Page 25

[Figure: Species volumes transformed to cosine of spectral angle - Tally Lake, Montana. Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Horizontal axis: Variables (T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Series: Cosine transform of spp vol.]

Page 26

[Figure: Augmented by two "instrumental" variables. Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Horizontal axis: Variables (T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Series: Cosine transform of spp vol; adding tot vol and crown cov to spectral transform of spp vol.]

Page 27

[Figure: Gaussian (logarithmic) vs. Logistic. Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Horizontal axis: Variables (T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Series: ln of spp volumes; logistic transform of spp vol (two term).]

Page 28

[Figure: Comparing transformations. Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Horizontal axis: Variables (T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Series: Cosine transform of spp vol; Logistic transform of spp vol (two term).]

Page 29

[Figure: Comparing transformations. Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Horizontal axis: Variables (T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Series: cosine transform of spp vol; adding tot vol and crown cov to spectral transform of spp vol; ln of spp volumes; logistic transform of spp vol (two term).]

Page 30

Implications of transforming

• Imputed value is derived from the neighbor, not directly from the model as in regression.

• Neighbor selection may be improved by transforming Y’s and X’s.

• Multivariate Y’s can resolve some indeterminacies from functions having extreme-value points (maxima or minima).

Page 31

MSN Software Now Includes Alternative Distance Functions:

• Both canonical-correlation based distance functions.

• Euclidean distance on normalized X’s.

• Mahalanobis distance on normalized X’s.

• A user-supplied weight matrix of your own derivation.

• K-nearest neighbors identification.
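To make the last two options concrete, the sketch below finds the k nearest reference observations under an arbitrary, user-supplied weight matrix W. It is an illustration only and does not reproduce the MSN program's interface; the function name `k_nearest` and the data are hypothetical.

```python
import numpy as np

def k_nearest(x_ref, x_target, w, k):
    """Indices and squared distances of the k nearest references for each target."""
    diff = x_target[:, None, :] - x_ref[None, :, :]        # targets x references x variables
    d2 = np.einsum("tri,ij,trj->tr", diff, w, diff)        # quadratic form with weight W
    order = np.argsort(d2, axis=1)[:, :k]
    return order, np.take_along_axis(d2, order, axis=1)

# Hypothetical reference and target X's; W = I gives plain Euclidean distance.
x_ref = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
x_target = np.array([[0.2, 0.1]])
w = np.eye(2)

neighbors, distances = k_nearest(x_ref, x_target, w, k=2)
print(neighbors, distances.round(3))
```

Passing a different W (inverse covariance, or one of the canonical-correlation matrices) changes which references rank nearest without changing the search itself.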

Page 32

So ??

• Of the many methods available for imputation of attributes, no one alternative is clearly superior for all data sets.

Page 33

Software Availability

• E-mail: [email protected]

• On the Web: http://forest.moscowfsl.wsu.edu/gems/msn.html

• In print: Crookston, N.L., Moeur, M. and Renner, D.L. 2002. User’s guide to the Most Similar Neighbor Imputation Program Version 2. Gen. Tech. Rpt. RMRS-GTR-96. Ogden, UT: USDA Rocky Mountain Research Station. 35 p.