7/27/2019 Understanding Data in High Dimensions
1/4
UNDERSTANDING DATA IN HIGH DIMENSIONS
Multivariate statistical analysis is concerned with analyzing and understanding data in high dimensions. We suppose that we are given a set {x_i}_{i=1}^n of n observations of a variable vector X in R^p. That is, we suppose that each observation x_i has p dimensions,

x_i = (x_{i1}, x_{i2}, ..., x_{ip}),

and that it is an observed value of a variable vector X ∈ R^p. Therefore, X is composed of p random variables,

X = (X_1, X_2, ..., X_p),

where X_j, for j = 1, ..., p, is a one-dimensional random variable. How do we begin to analyze this kind of data? Before we investigate questions on what inferences we can reach from the data, we should think about how to look at the data. This involves descriptive techniques. Questions that we could answer by descriptive techniques are:
- Are there components of X that are more spread out than others?
- Are there some elements of X that indicate subgroups of the data?
- Are there outliers in the components of X?
- How "normal" is the distribution of the data?
- Are there low-dimensional linear combinations of X that show "non-normal" behavior?
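The first and third of these questions already have simple numerical analogues. As a minimal sketch, the spread of each component of X can be summarized by its interquartile range, and component-wise outliers flagged with the usual 1.5 × IQR fences; both the data matrix and the fence rule below are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data matrix: n = 200 observations of a p = 4 dimensional vector X.
# Component 2 (index 1) is deliberately more spread out than the others.
X = rng.normal(size=(200, 4)) * np.array([1.0, 5.0, 1.0, 1.0])
X[0, 2] = 12.0  # plant one artificial outlier in component 3

# Spread of each component X_j: the interquartile range is robust to outliers.
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1

# Flag outliers component by component with 1.5 * IQR fences.
outliers = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
```

Here `iqr` answers "which components are more spread out?" and the column sums of `outliers` answer "are there outliers in the components of X?" one coordinate at a time.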
One difficulty of descriptive methods for high-dimensional data is the human perceptual system. Point clouds in two dimensions are easy to understand and to interpret. With modern interactive computing techniques we have the possibility to see real-time 3D rotations and thus to perceive three-dimensional data as well. A "sliding technique" as described in Härdle and Scott (1992) may give insight into four-dimensional structures by presenting dynamic 3D density contours as the fourth variable is changed over its range.
A qualitative jump in presentation difficulties occurs for dimensions greater than or equal to 5, unless the high-dimensional structure can be mapped into lower-dimensional components (Klinke and Polzehl, 1995). Features like clustered subgroups or outliers, however, can be detected using a purely graphical analysis.
In this chapter, we investigate the basic descriptive and graphical techniques allowing simple exploratory data analysis. We begin the exploration of a data set using boxplots. A boxplot is a simple univariate device that detects outliers component by component and that can compare distributions of the data among different groups. Next, several multivariate techniques are introduced (Flury faces, Andrews' curves and parallel coordinate plots) which provide graphical displays addressing the questions formulated above. The advantages and the disadvantages of each of these techniques are stressed.
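Of the multivariate displays just mentioned, Andrews' curves are the easiest to sketch in code: each observation x = (x_1, ..., x_p) is mapped to the function f_x(t) = x_1/√2 + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + x_5 cos(2t) + ... on t ∈ [−π, π], so that similar observations produce similar curves. A minimal sketch (the sample observation and grid are made up):

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews' curve of one observation x = (x1, ..., xp)
    at the angles t: x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    t = np.asarray(t, dtype=float)
    terms = [np.full_like(t, x[0] / np.sqrt(2))]
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2  # frequencies 1, 1, 2, 2, 3, ...
        terms.append(xj * (np.sin(k * t) if j % 2 == 1 else np.cos(k * t)))
    return np.sum(terms, axis=0)

t = np.linspace(-np.pi, np.pi, 101)
curve = andrews_curve([1.0, 2.0, 3.0], t)
```

Plotting one such curve per observation on the same axes gives the Andrews' plot; clustered subgroups appear as bundles of similar curves.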
Two basic techniques for estimating densities are also presented: histograms and kernel densities. A density estimate gives a quick insight into the shape of the distribution of the data. We show that kernel density estimates overcome some of the drawbacks of histograms.
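A one-dimensional kernel density estimate can be written down directly from its definition, f̂_h(x) = (1/(nh)) Σ_i K((x − x_i)/h). The sketch below uses a Gaussian kernel and an arbitrarily chosen bandwidth h = 0.4 on simulated data (both choices are illustrative):

```python
import numpy as np

def gaussian_kde_1d(data, grid, h):
    """Kernel density estimate with Gaussian kernel K(u) = exp(-u^2/2)/sqrt(2*pi):
    f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h), evaluated at every grid point."""
    data = np.asarray(data, dtype=float)
    u = (grid[:, None] - data[None, :]) / h          # (grid, n) matrix of scaled gaps
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(1)
sample = rng.normal(size=500)          # simulated N(0, 1) data
grid = np.linspace(-4.0, 4.0, 201)
fhat = gaussian_kde_1d(sample, grid, h=0.4)
```

Unlike a histogram, this estimate is smooth and does not depend on the choice of bin origin, which is one of the drawbacks alluded to above.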
Finally, scatterplots are shown to be very useful for plotting bivariate or trivariate variables against each other: they help to understand the nature of the relationship among variables in a data set and allow detecting groups or clusters of points. Draftsman plots or matrix plots are the visualization of several bivariate scatterplots on the same display. They help detect structures in conditional dependencies by brushing across the plots.
We have seen in the previous chapters how very simple graphical devices can help in understanding the structure and dependency of data. The graphical tools were based on either univariate (bivariate) data representations or on "slick" transformations of multivariate information perceivable by the human eye. Most of the tools are extremely useful in a modeling step, but unfortunately do not give the full picture of the data set. One reason for this is that the graphical tools presented capture only certain dimensions of the data and do not necessarily concentrate on those dimensions or subparts of the data under analysis that carry the maximum structural information. In Part III of this book, powerful tools for reducing the dimension of a data set will be presented. In this chapter, as a starting point, simple and basic tools are used to describe dependency. They are constructed from elementary facts of probability theory and introductory statistics (for example, the covariance and correlation between two variables).
In the preceding chapter we saw how the multivariate normal distribution comes into play in many applications. It is useful to know more about this distribution, since it is often a good approximate distribution in many situations. Another reason for considering the multinormal distribution relies on the fact that it has many appealing properties: it is stable under linear transforms, zero correlation corresponds to independence, the marginals and all the conditionals are also multivariate normal variates, etc. The mathematical properties of the multinormal make analyses much simpler.
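The stability under linear transforms can be checked numerically: if X ~ N_p(μ, Σ), then Y = AX + b ~ N_q(Aμ + b, AΣAᵀ). The sketch below draws a large sample and compares the empirical moments of Y with these theoretical values (the parameters μ, Σ, A, b are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative parameters: X ~ N_2(mu, Sigma).
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=50_000)

# Linear transform Y = A X + b; theory: Y ~ N_2(A mu + b, A Sigma A^T).
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
b = np.array([3.0, 0.0])
Y = X @ A.T + b

emp_mean = Y.mean(axis=0)
emp_cov = np.cov(Y, rowvar=False)
theo_mean = A @ mu + b          # = [3, -2]
theo_cov = A @ Sigma @ A.T      # = [[4, 3], [3, 4]]
```

Up to sampling noise, `emp_mean` and `emp_cov` match the theoretical mean and covariance, illustrating the stability property without any further distribution theory.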
In this chapter we will first concentrate on the probabilistic properties of the multinormal; then we will introduce two "companion" distributions of the multinormal which naturally appear when sampling from a multivariate normal population: the Wishart and the Hotelling distributions.
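The Wishart distribution, for instance, arises exactly as described: if the n rows of a data matrix X are i.i.d. N_p(0, Σ), then M = XᵀX follows a Wishart distribution W_p(Σ, n) with E[M] = nΣ. A minimal numerical sketch of this fact (the parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 10, 2
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

def wishart_draw(rng, Sigma, n):
    """One draw of M = X^T X, where the n rows of X are i.i.d. N_p(0, Sigma)."""
    X = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n)
    return X.T @ X

# Check E[M] = n * Sigma by averaging many independent draws.
avg = np.mean([wishart_draw(rng, Sigma, n) for _ in range(20_000)], axis=0)
```

Each draw is a symmetric positive semi-definite p × p matrix, and the average of many draws approaches n Σ, which is the sampling situation in which the Wishart distribution appears.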