Different Kinds of Distance and Statistical Distance


WELCOME TO MY PRESENTATION

ON STATISTICAL DISTANCE

Md. Menhazul Abedin, M.Sc. Student

Dept. of Statistics, Rajshahi University, Mob: 01751385142

Email: menhaz70@gmail.com

Objectives

• To understand the meaning of statistical distance, and its relation to and difference from ordinary (Euclidean) distance

Content

• Definition of Euclidean distance
• Concept & intuition of statistical distance
• Definition of statistical distance
• Necessity of statistical distance
• Concept of Mahalanobis distance (population & sample)
• Distribution of Mahalanobis distance
• Mahalanobis distance in R
• Acknowledgement

Euclidean Distance from origin

[Figure: a point (X, Y) and the origin (0, 0), with horizontal leg X and vertical leg Y]

Euclidean Distance

[Figure: point P(X, Y) and origin O(0, 0)] By Pythagoras, d(O, P) = √(X² + Y²)

Euclidean Distance

Specific points

We see two specific points in each picture. Our problem is to determine the distance between the two points. But how?

Assume that the pictures are placed in a two-dimensional space and the points are joined by a straight line.

Let the 1st point be (x1, y1) and the 2nd point be (x2, y2); then the distance is

d = √((x1 − x2)² + (y1 − y2)²)

What happens when the dimension is three?

Distance in three dimensions

• The points are (x1, x2, x3) and (y1, y2, y3), and the distance is

d = √((x1 − y1)² + (x2 − y2)² + (x3 − y3)²)

For n dimensions it can be written as the following expression, named the Euclidean distance.

P = (x1, x2, …, xp), Q = (y1, y2, …, yp)

d(P, Q) = √((x1 − y1)² + (x2 − y2)² + … + (xp − yp)²)
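As a quick illustration (a sketch, not from the original slides), the formula is one line of R; euclid is a hypothetical helper, checked against the built-in dist():

euclid <- function(x, y) sqrt(sum((x - y)^2))   ## p-dimensional Euclidean distance
P <- c(1, 2, 3); Q <- c(4, 6, 3)
euclid(P, Q)        ## 5
dist(rbind(P, Q))   ## built-in check: also 5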

Properties of Euclidean Distance and Mathematical Distance

• The usual human concept of distance is Euclidean distance.
• Each coordinate contributes equally to the distance.


Mathematicians, generalizing its three properties, define distance on any set:

1) d(P, Q) = d(Q, P)

2) d(P, Q) = 0 if and only if P = Q

3) d(P, Q) ≤ d(P, R) + d(R, Q) for all R (the triangle inequality)

[Figure: points P(x1, y1), Q(x2, y2), and R(z1, z2) illustrating the triangle inequality]

Taxicab Distance: Notion

[Figure: grid paths between two corners. Red: Manhattan distance. Green: diagonal, straight-line distance. Blue, yellow: equivalent Manhattan distances.]

• The Manhattan distance is the simple sum of the horizontal and vertical components, whereas the diagonal distance can be computed by applying the Pythagorean theorem.


• In the figure the Manhattan distance is 12 units.

• The diagonal, straight-line (Euclidean) distance is 6√2 ≈ 8.49 units. We observe that the Euclidean distance is less than the Manhattan distance.

Taxicab/Manhattan Distance: Definition

[Figure: points (p1, p2) and (q1, q2), with horizontal leg │p1 − q1│ and vertical leg │p2 − q2│]

Manhattan Distance

• The taxicab distance between (p1,p2) and (q1,q2) is │p1-q1│+│p2-q2│
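A hedged R sketch of this definition (taxicab is a hypothetical helper, not from the slides; the built-in dist() offers method = "manhattan" as a check):

taxicab <- function(p, q) sum(abs(p - q))   ## │p1-q1│ + │p2-q2│ + ...
taxicab(c(0, 0), c(3, 3))                   ## 6
dist(rbind(c(0, 0), c(3, 3)), method = "manhattan")   ## built-in check: 6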

Relationship between Manhattan & Euclidean distance.

[Figure: street grid with a 7-block route from A to C and a 6-block route from A to B]

• It now seems that the distance from A to C is 7 blocks, while the distance from A to B is 6 blocks.

• Unless we choose to go off-road, B is now closer to A than C.

• Taxicab distance is sometimes equal to Euclidean distance, but otherwise it is greater than Euclidean distance.

Euclidean distance < Taxicab distance. Is it always true? Does it hold for n dimensions?

Proof:

(│a│ + │b│)² = a² + 2│a││b│ + b² ≥ a² + b², since absolute values guarantee a non-negative cross term (addition property of inequality).

Taking square roots, √(a² + b²) ≤ │a│ + │b│.

For high dimensions the same argument holds:

(Σ│xi│)² = Σxi² + 2Σi<j│xi││xj│ ≥ Σxi², which implies √(Σxi²) ≤ Σ│xi│.
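A quick numeric check in R (an illustrative sketch, not part of the original slides):

set.seed(1)
x <- rnorm(5); y <- rnorm(5)
sqrt(sum((x - y)^2))                       ## Euclidean distance
sum(abs(x - y))                            ## Manhattan distance
sqrt(sum((x - y)^2)) <= sum(abs(x - y))    ## TRUE: Euclidean never exceeds Manhattan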


Statistical Distance

• Weight coordinates subject to a great deal of variability less heavily than those that are not highly variable.

[Figure: a scatter of data points and two candidate points at the same Euclidean distance from the origin. Which is nearer to the data set?]

• Here the variability along the x1 axis is greater than the variability along the x2 axis. Is the same distance from the origin meaningful? Ans: no. But how do we take the different variability into account? Ans: give different weights to the axes.


Statistical Distance for Uncorrelated Data

Standardize (weight) each coordinate by its sample standard deviation:

x1* = x1/√s11, x2* = x2/√s22

Then for P = (x1, x2) and O = (0, 0),

d(O, P) = √(x1*² + x2*²) = √(x1²/s11 + x2²/s22)

All points with coordinates (x1, x2) at a constant squared distance c² from the origin must satisfy

x1²/s11 + x2²/s22 = c²

But how to choose c? It's a problem. One choice: pick c so that 95% of the observations fall in this area.
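A small R sketch of this weighted distance (illustrative; the χ² cutoff for c² assumes normal data):

set.seed(42)
x1 <- rnorm(1000, sd = 4); x2 <- rnorm(1000, sd = 1)   ## unequal variability
s11 <- var(x1); s22 <- var(x2)
d2 <- x1^2/s11 + x2^2/s22     ## squared statistical distance from the origin
c2 <- qchisq(0.95, df = 2)    ## c^2 chosen so ~95% of normal points fall inside
mean(d2 <= c2)                ## close to 0.95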


Ellipse of Constant Statistical Distance for Uncorrelated Data

[Figure: ellipse x1²/s11 + x2²/s22 = c² centered at 0, with semi-axes c√s11 along x1 and c√s22 along x2]

• This expression can be generalized: the statistical distance from an arbitrary point P = (x1, x2) to any fixed point Q = (y1, y2) is

d(P, Q) = √((x1 − y1)²/s11 + (x2 − y2)²/s22)

For p dimensions:

d(P, Q) = √((x1 − y1)²/s11 + (x2 − y2)²/s22 + … + (xp − yp)²/spp)

Remark: 1) The distance from P to the origin O is obtained by setting all yi = 0. 2) If all sii are equal, the Euclidean distance formula is appropriate.
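This formula is one line of R; stat_dist below is a hypothetical helper (not from the slides) with per-coordinate variance weights:

stat_dist <- function(x, y, s) sqrt(sum((x - y)^2 / s))   ## s = (s11, ..., spp)
stat_dist(c(2, 2), c(0, 0), s = c(4, 1))   ## the more variable x1 axis counts less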

Scatter Plot for Correlated Measurements

• How do you measure the statistical distance for the above data set?

• Ans: First make it uncorrelated.

• But why, and how?

• Ans: Rotate the axes, keeping the origin fixed.


Rotation of axes keeping the origin fixed

[Figure: original axes x1, x2 and rotated axes x̃1, x̃2 at angle θ, with point P(x1, x2) and construction points M, N, Q, R]

x1 = OM = OR − MR = x̃1 cos θ − x̃2 sin θ …(i)
x2 = MP = QR + NP = x̃1 sin θ + x̃2 cos θ …(ii)

• The solution of the above equations:

x̃1 = x1 cos θ + x2 sin θ
x̃2 = −x1 sin θ + x2 cos θ
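A short R check of the solved equations (an illustrative sketch; the angle and point are made up). Note that the rotation preserves ordinary length:

theta <- pi/6
x <- c(3, 4)                                    ## a point (x1, x2)
xt1 <-  x[1]*cos(theta) + x[2]*sin(theta)       ## rotated coordinate x~1
xt2 <- -x[1]*sin(theta) + x[2]*cos(theta)       ## rotated coordinate x~2
all.equal(sqrt(xt1^2 + xt2^2), sqrt(sum(x^2)))  ## TRUE: length unchanged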

Choice of θ

What will you choose? How will you do it?

Data matrix → centered data matrix → covariance matrix of the data → eigenvectors

θ = angle between the 1st eigenvector and [1, 0], or the angle between the 2nd eigenvector and [0, 1]

Why is θ the angle between the 1st eigenvector and [1, 0], or between the 2nd eigenvector and [0, 1]? Ans: Let B be a (p × p) positive definite matrix with eigenvalues λ1 ≥ λ2 ≥ … ≥ λp > 0 and associated normalized eigenvectors e1, e2, …, ep. Then

max(x≠0) x′Bx/x′x = λ1, attained when x = e1

min(x≠0) x′Bx/x′x = λp, attained when x = ep

max(x ⊥ e1,…,ek) x′Bx/x′x = λ(k+1), attained when x = e(k+1), k = 1, 2, …, p − 1

Choice of θ

#### Exercise 16, page 309: Heights in inches (x) & Weights in pounds (y).
#### An Introduction to Statistics and Probability, M. Nurul Islam ####
x=c(60,60,60,60,62,62,62,64,64,64,66,66,66,66,68,68,68,70,70,70);x
y=c(115,120,130,125,130,140,120,135,130,145,135,170,140,155,150,160,175,180,160,175);y
plot(x,y)
data=data.frame(x,y);data
as.matrix(data)
colMeans(data)
xmv=c(rep(64.8,20));xmv    ### x mean vector
ymv=c(rep(144.5,20));ymv   ### y mean vector
meanmatrix=cbind(xmv,ymv);meanmatrix
cdata=data-meanmatrix;cdata   ### mean-centred data
plot(cdata)
abline(h=0,v=0)
V=eigen(cov(cdata))$vectors;V
as.matrix(cdata)%*%V

cor(cdata)
##################
cov(cdata)
eigen(cov(cdata))
xx1=c(1,0);xx1
xx2=c(0,1);xx2
vv1=eigen(cov(cdata))$vectors[,1];vv1
vv2=eigen(cov(cdata))$vectors[,2];vv2

################
theta = acos( sum(xx1*vv1) / ( sqrt(sum(xx1 * xx1)) * sqrt(sum(vv1 * vv1)) ) );theta
theta = acos( sum(xx2*vv2) / ( sqrt(sum(xx2 * xx2)) * sqrt(sum(vv2 * vv2)) ) );theta
###############
xx= cdata[,1]*cos(1.41784)+cdata[,2]*sin(1.41784);xx
yy=-cdata[,1]*sin(1.41784)+cdata[,2]*cos(1.41784);yy
plot(xx,yy)
abline(h=0,v=0)
V=eigen(cov(cdata))$vectors;V
tdata=as.matrix(cdata)%*%V;tdata   ### transformed data
cov(tdata)
round(cov(tdata),14)
cor(tdata)
plot(tdata)
abline(h=0,v=0)
round(cor(tdata),16)

################ comparison of both methods ############
comparison=tdata - as.matrix(cbind(xx,yy));comparison
round(comparison,4)

########### using the package: md from original data #####
md=mahalanobis(data,colMeans(data),cov(data),inverted=F);md   ## md = Mahalanobis distance
######## Mahalanobis distance from transformed data ########
tmd=mahalanobis(tdata,colMeans(tdata),cov(tdata),inverted=F);tmd
###### comparison ############
md-tmd

Mahalanobis distance: manually

mu=colMeans(tdata);mu
incov=solve(cov(tdata));incov
md1=t(tdata[1,]-mu)%*%incov%*%(tdata[1,]-mu);md1
md2=t(tdata[2,]-mu)%*%incov%*%(tdata[2,]-mu);md2
md3=t(tdata[3,]-mu)%*%incov%*%(tdata[3,]-mu);md3
.............
md20=t(tdata[20,]-mu)%*%incov%*%(tdata[20,]-mu);md20
### md from the package and the manual computation are equal
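The repetitive md1 … md20 lines can be collapsed into one call; a hedged alternative sketch (not in the original slides):

md_all <- sapply(1:nrow(tdata), function(i) t(tdata[i,]-mu) %*% incov %*% (tdata[i,]-mu))
md_all   ## all 20 squared Mahalanobis distances at once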

s1=sd(tdata[,1]);s1
s2=sd(tdata[,2]);s2
xstar=c(tdata[,1])/s1;xstar
ystar=c(tdata[,2])/s2;ystar

md1=sqrt((-1.46787309)^2 + (0.1484462)^2);md1
md2=sqrt((-1.22516896)^2 + (0.6020111)^2);md2
……….
These are not equal to the distances above. Why? We must also take the mean into account.


Statistical Distance under Rotated Coordinate System

For P = (x1, x2) with rotated coordinates (x̃1, x̃2):

d(O, P) = √(x̃1²/s̃11 + x̃2²/s̃22)

where

x̃1 = x1 cos θ + x2 sin θ
x̃2 = −x1 sin θ + x2 cos θ

and s̃11, s̃22 are the sample variances of the x̃1 and x̃2 coordinates.

• After some manipulation this can be written in terms of the original variables:

d(O, P) = √(a11x1² + 2a12x1x2 + a22x2²)

where the coefficients a11, a12, a22 depend on θ, s̃11 and s̃22.

Proof:

d²(O, P) = x̃1²/s̃11 + x̃2²/s̃22

x̃1² = (x1 cos θ + x2 sin θ)² = x1² cos²θ + 2x1x2 cos θ sin θ + x2² sin²θ
x̃2² = (−x1 sin θ + x2 cos θ)² = x1² sin²θ − 2x1x2 sin θ cos θ + x2² cos²θ

Collecting terms:

a11 = cos²θ/s̃11 + sin²θ/s̃22
a22 = sin²θ/s̃11 + cos²θ/s̃22
a12 = cos θ sin θ (1/s̃11 − 1/s̃22)


General Statistical Distance

P = (x1, x2, …, xp), O = (0, 0, …, 0), Q = (y1, y2, …, yp)

d(O, P) = √[a11x1² + a22x2² + … + app xp² + 2a12x1x2 + 2a13x1x3 + … + 2a(p−1)p x(p−1)xp]

d(P, Q) = √[a11(x1 − y1)² + a22(x2 − y2)² + … + app(xp − yp)² + 2a12(x1 − y1)(x2 − y2) + 2a13(x1 − y1)(x3 − y3) + … + 2a(p−1)p (x(p−1) − y(p−1))(xp − yp)]

• The above distances are completely determined by the coefficients (weights) aij. These can be arranged in a rectangular array, the matrix A = (aij);

this array (matrix) must be symmetric positive definite.

Why positive definite? Let A be a positive definite matrix. Then A = C′C for some matrix C, and

x′Ax = x′C′Cx = (Cx)′(Cx) = y′y, where y = Cx.

It obeys all the distance properties, so x′Ax defines a distance; different choices of A give different distances.
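A quick R illustration of the A = C′C factorization via the built-in Cholesky decomposition (a sketch; the 2 × 2 matrix is made up):

A <- matrix(c(2, 1, 1, 2), 2, 2)   ## symmetric positive definite (illustrative)
C <- chol(A)                       ## upper-triangular C with A = C'C
x <- c(1, -1)
t(x) %*% A %*% x                   ## quadratic form x'Ax
sum((C %*% x)^2)                   ## same value: (Cx)'(Cx) = y'y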

• Why a positive definite matrix?
• Ans: Spectral decomposition: the spectral decomposition of a k × k symmetric matrix A is given by

A = λ1e1e1′ + λ2e2e2′ + … + λk ek ek′

• where (λi, ei) are the pairs of eigenvalues and eigenvectors of A.

And if A is positive definite and invertible,

A⁻¹ = (1/λ1)e1e1′ + (1/λ2)e2e2′ + … + (1/λk)ek ek′
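This is easy to verify numerically in R (an illustrative sketch with a made-up matrix):

A <- matrix(c(2, 1, 1, 2), 2, 2)
es <- eigen(A)
lam <- es$values; E <- es$vectors
A_rebuilt <- lam[1]*E[,1]%*%t(E[,1]) + lam[2]*E[,2]%*%t(E[,2])
all.equal(A, A_rebuilt)      ## TRUE: A = sum of lambda_i e_i e_i'
A_inv <- (1/lam[1])*E[,1]%*%t(E[,1]) + (1/lam[2])*E[,2]%*%t(E[,2])
all.equal(solve(A), A_inv)   ## TRUE: the inverse uses 1/lambda_i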

[Figure: ellipse of constant distance with axes along the eigenvectors e1 and e2, with half-lengths c√λ1 and c√λ2]

• Suppose p = 2. The points at constant distance c from the origin satisfy

d²(O, P) = x′Ax = c²

By the spectral decomposition, A = λ1e1e1′ + λ2e2e2′, so

x′Ax = λ1(x′e1)² + λ2(x′e2)² = c²

an ellipse whose axes lie along e1 and e2. Another property: for positive definite A, the ellipse x′A⁻¹x = c² has semi-axes c√λ1 along e1 and c√λ2 along e2.

Thus constant-distance contours are ellipses determined by the eigenvalues and eigenvectors. We use this property in the Mahalanobis distance.


Necessity of Statistical Distance

[Figure: a cluster of points with its center of gravity at the origin O, a point P near the cluster, and another point Q]

• Consider the Euclidean distances from the point Q to the point P and to the origin O.

• Obviously d(Q, P) > d(Q, O).

But P appears to be more like the points in the cluster than does the origin.

If we take into account the variability of the points in the cluster and measure distance by statistical distance, then Q will be closer to P than to O.

Mahalanobis distance

• The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point's distance from a common point. It is a unitless measure introduced by P. C. Mahalanobis in 1936

Intuition of Mahalanobis Distance

• Recall the equation

d(O, P) = √(a11x1² + 2a12x1x2 + a22x2²), i.e. d²(O, P) = x′Ax

where x = (x1, x2)′ and A is the symmetric matrix of weights with entries a11, a12, a22.

The statistical distance is therefore a quadratic form determined by the positive definite matrix A.

Mahalanobis Distance

• Mahalanobis used the inverse of the covariance matrix, Σ⁻¹, instead of A:

d²(O, P) = x′Σ⁻¹x ……………..(1)

• And used the mean vector μ instead of the fixed point y:

d²(P, μ) = (x − μ)′Σ⁻¹(x − μ) ………..(2)
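A tiny R sketch of equations (1)-(2); the covariance matrix and the point are made up for illustration:

Sigma <- matrix(c(4, 2, 2, 3), 2, 2)   ## assumed covariance (illustrative)
mu <- c(0, 0); x <- c(1, 2)
drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))   ## squared Mahalanobis distance, eq. (2)
mahalanobis(x, mu, Sigma)                       ## same value via the built-in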


Mahalanobis Distance

• The above equations are nothing but Mahalanobis Distance ……

• For example, suppose we took a single observation from a bivariate population with variable X and variable Y, and that our two variables had the following characteristics:

[Table: means and covariance of X and Y]

• For the single observation X = 410 and Y = 400, the Mahalanobis distance is

1.825

• Therefore, our single observation lies at a distance of 1.825 standardized units from the mean (the mean is at X = 500, Y = 500).

• If we took many such observations, graphed them, and colored them according to their Mahalanobis values, we would see the elliptical Mahalanobis regions emerge.

• The points are actually distributed along two primary axes:

If we calculate Mahalanobis distances for each of these points and shade them according to their distance value, we see clear elliptical patterns emerge:

• We can also draw actual ellipses at regions of constant Mahalanobis values:

[Figure: concentric ellipses containing about 68%, 95%, and 99.7% of the observations]

• Which ellipse do you choose? Ans: use the 68-95-99.7 rule (if the data are normal):

1) about two-thirds (68%) of the points should be within 1 unit of the origin (along each axis),
2) about 95% should be within 2 units,
3) about 99.7% should be within 3 units.
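A quick simulation check of the rule on one standardized coordinate (an illustrative sketch, not from the slides):

z <- rnorm(1e5)       ## standardized normal coordinate
mean(abs(z) <= 1)     ## ~0.683
mean(abs(z) <= 2)     ## ~0.954
mean(abs(z) <= 3)     ## ~0.997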

Sample Mahalanobis Distance

• The sample Mahalanobis distance is made by replacing Σ by S and μ by x̄:

d² = (X − x̄)′S⁻¹(X − x̄)

Distribution of Mahalanobis Distance

Let X1, X2, …, Xn be independent observations from any population with mean μ and finite (nonsingular) covariance Σ. Then √n(X̄ − μ) is approximately Np(0, Σ), and n(X̄ − μ)′S⁻¹(X̄ − μ) is approximately χ²p, for n − p large. This is nothing but the central limit theorem.

Mahalanobis distance in R

########### Mahalanobis Distance ##########
x=rnorm(100);x
dm=matrix(x,nrow=20,ncol=5,byrow=F);dm   ## dm = data matrix
cm=colMeans(dm);cm                       ## cm = column means
cov=cov(dm);cov                          ## cov = covariance matrix
incov=solve(cov);incov                   ## incov = inverse of covariance matrix

Mahalanobis distance in R

####### MAHALANOBIS DISTANCE: MANUALLY ######
### Mahalanobis distance of the first observation ###
ob1=dm[1,];ob1                    ## first observation
mv1=ob1-cm;mv1                    ## deviation of first observation from the center of gravity
md1=t(mv1)%*%incov%*%mv1;md1      ## Mahalanobis distance of first observation

Mahalanobis distance in R

### Mahalanobis distance of the second observation ###
ob2=dm[2,];ob2                    ## second observation
mv2=ob2-cm;mv2                    ## deviation of second observation from the center of gravity
md2=t(mv2)%*%incov%*%mv2;md2      ## Mahalanobis distance of second observation
................

Mahalanobis distance in R

### Mahalanobis distance of the 20th observation ###
ob20=dm[20,];ob20                 ## 20th observation (row 20, i.e. dm[20,], not dm[,20])
mv20=ob20-cm;mv20                 ## deviation of 20th observation from the center of gravity
md20=t(mv20)%*%incov%*%mv20;md20  ## Mahalanobis distance of 20th observation

Mahalanobis distance in R

####### MAHALANOBIS DISTANCE: PACKAGE ########
md=mahalanobis(dm,cm,cov,inverted=F);md   ## md = Mahalanobis distance
md=mahalanobis(dm,cm,cov);md

Another example

x <- matrix(rnorm(100*3), ncol = 3)
Sx <- cov(x)
D2 <- mahalanobis(x, colMeans(x), Sx)
plot(density(D2, bw = 0.5), main="Squared Mahalanobis distances, n=100, p=3")
qqplot(qchisq(ppoints(100), df = 3), D2,
       main = expression("Q-Q plot of Mahalanobis" * ~D^2 * " vs. quantiles of" * ~ chi[3]^2))
abline(0, 1, col = 'gray')
?? mahalanobis

Acknowledgement

Prof. Mohammad Nasser, Richard A. Johnson & Dean W. Wichern, and others.

THANK YOU ALL

Necessity of Statistical Distance

[Figure: analogy — at home, the mother; in the mess, the female maid; the student in the mess]
