Understanding Data in High Dimensions


  • 7/27/2019 Understanding Data in High Dimensions


Multivariate statistical analysis is concerned with analyzing and understanding data in high dimensions. We suppose that we are given a set {x_i}, i = 1, ..., n, of n observations of a variable vector X in R^p. That is, we suppose that each observation x_i has p dimensions,

x_i = (x_{i1}, x_{i2}, ..., x_{ip}),

and that it is an observed value of a variable vector X ∈ R^p. Therefore, X is composed of p random variables,

X = (X_1, X_2, ..., X_p),

where X_j, for j = 1, ..., p, is a one-dimensional random variable. How do we begin to analyze this kind of data? Before we investigate questions on what inferences we can reach from the data, we should think about how to look at the data. This involves descriptive techniques. Questions that we could answer by descriptive techniques are:

• Are there components of X that are more spread out than others?

• Are there some elements of X that indicate subgroups of the data?

• Are there outliers in the components of X?

• How "normal" is the distribution of the data?

• Are there "low-dimensional" linear combinations of X that show "non-normal" behavior?
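As a concrete illustration of the first three questions, the following sketch (hypothetical data; all names, constants and the seed are chosen only for the example) stores the n observations as rows of an n × p array, measures the spread of each component, and flags componentwise outliers with the 1.5 IQR rule that a boxplot's whiskers use:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: n = 200 observations of a p = 3 dimensional vector X.
# Component 2 is given a larger spread, and component 3 a few artificial outliers.
n, p = 200, 3
X = rng.normal(size=(n, p))
X[:, 1] *= 5.0            # component 2 is more spread out than the others
X[:5, 2] += 10.0          # five outlying values in component 3

# Componentwise spread: the standard deviation of each column X_j.
spread = X.std(axis=0)

# Simple componentwise outlier flag: points beyond 1.5 IQR of the quartiles,
# the same rule a boxplot uses for its whiskers.
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
outliers = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)

print(spread)                 # component 2 clearly dominates
print(outliers.sum(axis=0))   # number of flagged points per component
```

This is a purely componentwise view; the multivariate displays introduced below are needed to see structure that involves several components at once.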

One difficulty of descriptive methods for high-dimensional data is the human perceptual system. Point clouds in two dimensions are easy to understand and to interpret. With modern interactive computing techniques we have the


possibility to see real-time 3D rotations and thus to perceive also three-dimensional data. A "sliding technique" as described in Härdle and Scott (1992) may give insight into four-dimensional structures by presenting dynamic 3D density contours as the fourth variable is changed over its range.

A qualitative jump in presentation difficulties occurs for dimensions greater than or equal to 5, unless the high-dimensional structure can be mapped into lower-dimensional components (Klinke and Polzehl, 1995). Features like clustered subgroups or outliers, however, can be detected using a purely graphical analysis.

In this chapter, we investigate the basic descriptive and graphical techniques allowing simple exploratory data analysis. We begin the exploration of a data set using boxplots. A boxplot is a simple univariate device that detects outliers component by component and that can compare distributions of the data among different groups. Next, several multivariate techniques are introduced (Flury faces, Andrews' curves and parallel coordinate plots) which provide graphical displays addressing the questions formulated above. The advantages and the disadvantages of each of these techniques are stressed.
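Of the multivariate displays just listed, Andrews' curves have a simple closed form: each observation x = (x_1, ..., x_p) is mapped to the function f_x(t) = x_1/√2 + x_2 sin t + x_3 cos t + x_4 sin 2t + ..., plotted over t ∈ [-π, π]. A minimal sketch evaluating this transform (the observation is hypothetical):

```python
import numpy as np

def andrews_curve(x, t):
    """Andrews' curve f_x(t) for one observation x = (x_1, ..., x_p):
    f_x(t) = x_1/sqrt(2) + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + ...
    evaluated at the points in the array t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    f = np.full_like(t, x[0] / np.sqrt(2.0))
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2                                 # frequencies 1, 1, 2, 2, 3, ...
        f += xj * (np.sin(k * t) if j % 2 == 1 else np.cos(k * t))
    return f

t = np.linspace(-np.pi, np.pi, 101)
curve = andrews_curve([1.0, 2.0, 3.0, 4.0], t)   # one curve per observation
```

Plotting one such curve per observation turns each point of R^p into a one-dimensional curve, so similar observations produce similar curves; a drawback, noted in the literature, is that the display depends on the order of the components.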

Two basic techniques for estimating densities are also presented: histograms and kernel densities. A density estimate gives a quick insight into the shape of the distribution of the data. We show that kernel density estimates overcome some of the drawbacks of histograms.
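A minimal numerical sketch of the two density estimators (hypothetical sample; the kernel estimator is written out from its definition rather than taken from a library, and the rule-of-thumb bandwidth is one common choice among several):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)           # hypothetical univariate sample

# Histogram density estimate: counts per bin, normalised to integrate to 1.
counts, edges = np.histogram(x, bins=20, density=True)

def kde(x, grid, h):
    """Gaussian kernel density estimate at the points in `grid`:
    f_h(u) = (1 / (n h)) * sum_i K((u - x_i) / h),
    with K the standard normal density."""
    u = (grid[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.mean(axis=1) / h

grid = np.linspace(-4.0, 4.0, 81)
h = 1.06 * x.std() * len(x) ** (-1 / 5)   # Silverman's rule-of-thumb bandwidth
f_hat = kde(x, grid, h)
```

Unlike the histogram, the kernel estimate is smooth and does not depend on the placement of bin edges, which is one of the drawbacks alluded to above.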

Finally, scatterplots are shown to be very useful for plotting bivariate or trivariate variables against each other: they help to understand the nature of the relationship among variables in a data set and allow detecting groups or clusters of points. Draftsman plots or matrix plots are the visualization of several bivariate


scatterplots on the same display. They help detect structures in conditional dependencies by brushing across the plots.
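The effect of brushing can be imitated numerically: condition on a slice of one variable and compare the strength of a bivariate relationship inside that slice with its strength in the full sample. A sketch with synthetic data constructed so that X1 and X2 are related only where X3 is large (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x3 = rng.uniform(0.0, 1.0, n)
x1 = rng.normal(size=n)
# X2 depends on X1 only where X3 > 0.5, so the structure only
# becomes visible when "brushing" on X3.
x2 = np.where(x3 > 0.5, x1, 0.0) + 0.5 * rng.normal(size=n)

corr_all = np.corrcoef(x1, x2)[0, 1]          # relationship in the full sample
brush = x3 > 0.5                               # the brushed slice of X3
corr_brushed = np.corrcoef(x1[brush], x2[brush])[0, 1]

print(corr_all, corr_brushed)   # the brushed correlation is markedly stronger
```

In an interactive draftsman plot the same comparison is done visually: highlighting points in one panel highlights the same observations in all other panels.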

We have seen in the previous chapters how very simple graphical devices can help in understanding the structure and dependency of data. The graphical tools were based on either univariate (bivariate) data representations or on "slick" transformations of multivariate information perceivable by the human eye. Most of the tools are extremely useful in a modeling step, but unfortunately, do not give the full picture of the data set. One reason for this is that the graphical tools presented capture only certain dimensions of the data and do not necessarily concentrate on those dimensions or subparts of the data under analysis that carry the maximum structural information. In Part III of this book, powerful tools for reducing the dimension of a data set will be presented. In this chapter, as a starting point, simple and basic tools are used to describe dependency. They are constructed from elementary facts of probability theory and introductory statistics (for example, the covariance and correlation between two variables).

In the preceding chapter we saw how the multivariate normal distribution comes into play in many applications. It is useful to know more about this distribution, since it is often a good approximate distribution in many situations.


Another reason for considering the multinormal distribution relies on the fact that it has many appealing properties: it is stable under linear transforms, zero correlation corresponds to independence, the marginals and all the conditionals are also multivariate normal variates, etc. The mathematical properties of the multinormal make analyses much simpler.
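The properties just listed can be stated precisely with the standard formulas. For X a p-dimensional normal vector:

```latex
% Stability under linear transforms: for a (q x p) matrix A and b in R^q,
X \sim N_p(\mu, \Sigma)
  \quad \Longrightarrow \quad
A X + b \sim N_q\!\left(A\mu + b,\; A \Sigma A^{\top}\right).

% Partitioning X = (X_1, X_2) with conformable blocks of \mu and \Sigma,
% the marginal X_1 \sim N_r(\mu_1, \Sigma_{11}) is again normal, and
% \Sigma_{12} = 0 (zero correlation) implies that X_1 and X_2 are independent.
```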

In this chapter we will first concentrate on the probabilistic properties of the multinormal, then we will introduce two "companion" distributions of the multinormal which naturally appear when sampling from a multivariate normal population: the Wishart and the Hotelling distributions.
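For orientation, the two companion distributions can be previewed through their standard definitions, with x_1, ..., x_n an i.i.d. sample, x̄ the sample mean and S the sample covariance matrix:

```latex
% Wishart: for i.i.d. x_i \sim N_p(0, \Sigma), i = 1, \dots, n,
\mathcal{M} = \sum_{i=1}^{n} x_i x_i^{\top} \sim W_p(\Sigma, n).

% Hotelling T^2: for an i.i.d. sample from N_p(\mu, \Sigma),
T^2 = n\,(\bar{x} - \mu)^{\top} S^{-1} (\bar{x} - \mu) \sim T^2(p,\, n-1),
\qquad
\frac{n - p}{p\,(n - 1)}\, T^2 \sim F_{p,\, n-p}.
```

The second relation reduces inference about a multivariate mean to an ordinary F test, which is why the Hotelling distribution is the natural multivariate analogue of Student's t.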