Understanding Data in High Dimensions


  • 7/27/2019 Understanding Data in High Dimensions


Multivariate statistical analysis is concerned with analyzing and understanding data in high dimensions. We suppose that we are given a set {x_i}, i = 1, ..., n, of n observations of a variable vector X in R^p. That is, we suppose that each observation x_i has p dimensions,

x_i = (x_{i1}, x_{i2}, ..., x_{ip}),

and that it is an observed value of a variable vector X ∈ R^p. Therefore, X is composed of p random variables,

X = (X_1, X_2, ..., X_p),

where X_j, for j = 1, ..., p, is a one-dimensional random variable. How do we begin to analyze this kind of data? Before we investigate questions on what inferences we can reach from the data, we should think about how to look at the data. This involves descriptive techniques. Questions that we could answer by descriptive techniques are:

• Are there components of X that are more spread out than others?

• Are there some elements of X that indicate subgroups of the data?

• Are there outliers in the components of X?

• How "normal" is the distribution of the data?

• Are there "low-dimensional" linear combinations of X that show "non-normal" behavior?
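As a concrete illustration of the first three questions, the following sketch (hypothetical data; all names, constants and the seed are chosen only for the example) stores the n observations as rows of an n × p array, measures the spread of each component, and flags componentwise outliers with the 1.5 IQR rule that a boxplot's whiskers use:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: n = 200 observations of a p = 3 dimensional vector X.
# Component 2 is given a larger spread, and component 3 a few artificial outliers.
n, p = 200, 3
X = rng.normal(size=(n, p))
X[:, 1] *= 5.0            # component 2 is more spread out than the others
X[:5, 2] += 10.0          # five outlying values in component 3

# Componentwise spread: the standard deviation of each column X_j.
spread = X.std(axis=0)

# Simple componentwise outlier flag: points beyond 1.5 IQR of the quartiles,
# the same rule a boxplot uses for its whiskers.
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
outliers = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)

print(spread)                 # component 2 clearly dominates
print(outliers.sum(axis=0))   # number of flagged points per component
```

This is a purely componentwise view; the multivariate displays introduced below are needed to see structure that involves several components at once.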

One difficulty of descriptive methods for high-dimensional data is the human perceptual system. Point clouds in two dimensions are easy to understand and to interpret. With modern interactive computing techniques we have the


possibility to see real-time 3D rotations and thus to perceive also three-dimensional data. A "sliding technique" as described in Härdle and Scott (1992) may give insight into four-dimensional structures by presenting dynamic 3D density contours as the fourth variable is changed over its range.

A qualitative jump in presentation difficulties occurs for dimensions greater than or equal to 5, unless the high-dimensional structure can be mapped into lower-dimensional components (Klinke and Polzehl, 1995). Features like clustered subgroups or outliers, however, can be detected using a purely graphical analysis.

In this chapter, we investigate the basic descriptive and graphical techniques allowing simple exploratory data analysis. We begin the exploration of a data set using boxplots. A boxplot is a simple univariate device that detects outliers component by component and that can compare distributions of the data among different groups. Next, several multivariate techniques are introduced (Flury faces, Andrews' curves and parallel coordinate plots) which provide graphical displays addressing the questions formulated above. The advantages and the disadvantages of each of these techniques are stressed.
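Of the multivariate displays just listed, Andrews' curves have a simple closed form: each observation x = (x_1, ..., x_p) is mapped to the function f_x(t) = x_1/√2 + x_2 sin t + x_3 cos t + x_4 sin 2t + ..., plotted over t ∈ [-π, π]. A minimal sketch evaluating this transform (the observation is hypothetical):

```python
import numpy as np

def andrews_curve(x, t):
    """Andrews' curve f_x(t) for one observation x = (x_1, ..., x_p):
    f_x(t) = x_1/sqrt(2) + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + ...
    evaluated at the points in the array t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    f = np.full_like(t, x[0] / np.sqrt(2.0))
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2                                 # frequencies 1, 1, 2, 2, 3, ...
        f += xj * (np.sin(k * t) if j % 2 == 1 else np.cos(k * t))
    return f

t = np.linspace(-np.pi, np.pi, 101)
curve = andrews_curve([1.0, 2.0, 3.0, 4.0], t)   # one curve per observation
```

Plotting one such curve per observation turns each point of R^p into a one-dimensional curve, so similar observations produce similar curves; a drawback, noted in the literature, is that the display depends on the order of the components.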

Two basic techniques for estimating densities are also presented: histograms and kernel densities. A density estimate gives a quick insight into the shape of the distribution of the data. We show that kernel density estimates overcome some of the drawbacks of histograms.
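A minimal numerical sketch of the two density estimators (hypothetical sample; the kernel estimator is written out from its definition rather than taken from a library, and the rule-of-thumb bandwidth is one common choice among several):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)           # hypothetical univariate sample

# Histogram density estimate: counts per bin, normalised to integrate to 1.
counts, edges = np.histogram(x, bins=20, density=True)

def kde(x, grid, h):
    """Gaussian kernel density estimate at the points in `grid`:
    f_h(u) = (1 / (n h)) * sum_i K((u - x_i) / h),
    with K the standard normal density."""
    u = (grid[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.mean(axis=1) / h

grid = np.linspace(-4.0, 4.0, 81)
h = 1.06 * x.std() * len(x) ** (-1 / 5)   # Silverman's rule-of-thumb bandwidth
f_hat = kde(x, grid, h)
```

Unlike the histogram, the kernel estimate is smooth and does not depend on the placement of bin edges, which is one of the drawbacks alluded to above.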

Finally, scatterplots are shown to be very useful for plotting bivariate or trivariate variables against each other: they help to understand the nature of the relationship among variables in a data set and allow detecting groups or clusters of points. Draftsman plots or matrix plots are the visualization of several bivariate


scatterplots on the same display. They help detect structures in conditional dependencies by brushing across the plots.
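The effect of brushing can be imitated numerically: condition on a slice of one variable and compare the strength of a bivariate relationship inside that slice with its strength in the full sample. A sketch with synthetic data constructed so that X1 and X2 are related only where X3 is large (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x3 = rng.uniform(0.0, 1.0, n)
x1 = rng.normal(size=n)
# X2 depends on X1 only where X3 > 0.5, so the structure only
# becomes visible when "brushing" on X3.
x2 = np.where(x3 > 0.5, x1, 0.0) + 0.5 * rng.normal(size=n)

corr_all = np.corrcoef(x1, x2)[0, 1]          # relationship in the full sample
brush = x3 > 0.5                               # the brushed slice of X3
corr_brushed = np.corrcoef(x1[brush], x2[brush])[0, 1]

print(corr_all, corr_brushed)   # the brushed correlation is markedly stronger
```

In an interactive draftsman plot the same comparison is done visually: highlighting points in one panel highlights the same observations in all other panels.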

We have seen in the previous chapters how very simple graphical devices can help in understanding the structure and dependency of data. The graphical tools were based on either univariate (bivariate) data representations or on "slick" transformations of multivariate information perceivable by the human eye. Most of the tools are extremely useful in a modeling step, but unfortunately, do not give the full picture of the data set. One reason for this is that the graphical tools presented capture only certain dimensions of the data and do not necessarily concentrate on those dimensions or subparts of the data under analysis that carry the maximum structural information. In Part III of this book, powerful tools for reducing the dimension of a data set will be presented. In this chapter, as a starting point, simple and basic tools are used to describe dependency. They are constructed from elementary facts of probability theory and introductory statistics (for example, the covariance and correlation between two variables).

In the preceding chapter we saw how the multivariate normal distribution comes into play in many applications. It is useful to know more about this distribution, since it is often a good approximate distribution in many situations.


Another reason for considering the multinormal distribution relies on the fact that it has many appealing properties: it is stable under linear transforms, zero correlation corresponds to independence, the marginals and all the conditionals are also multivariate normal variates, etc. The mathematical properties of the multinormal make analyses much simpler.
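The properties just listed can be stated precisely with the standard formulas. For X a p-dimensional normal vector:

```latex
% Stability under linear transforms: for a (q x p) matrix A and b in R^q,
X \sim N_p(\mu, \Sigma)
  \quad \Longrightarrow \quad
A X + b \sim N_q\!\left(A\mu + b,\; A \Sigma A^{\top}\right).

% Partitioning X = (X_1, X_2) with conformable blocks of \mu and \Sigma,
% the marginal X_1 \sim N_r(\mu_1, \Sigma_{11}) is again normal, and
% \Sigma_{12} = 0 (zero correlation) implies that X_1 and X_2 are independent.
```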

In this chapter we will first concentrate on the probabilistic properties of the multinormal, then we will introduce two "companion" distributions of the multinormal which naturally appear when sampling from a multivariate normal population: the Wishart and the Hotelling distributions.
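For orientation, the two companion distributions can be previewed through their standard definitions, with x_1, ..., x_n an i.i.d. sample, x̄ the sample mean and S the sample covariance matrix:

```latex
% Wishart: for i.i.d. x_i \sim N_p(0, \Sigma), i = 1, \dots, n,
\mathcal{M} = \sum_{i=1}^{n} x_i x_i^{\top} \sim W_p(\Sigma, n).

% Hotelling T^2: for an i.i.d. sample from N_p(\mu, \Sigma),
T^2 = n\,(\bar{x} - \mu)^{\top} S^{-1} (\bar{x} - \mu) \sim T^2(p,\, n-1),
\qquad
\frac{n - p}{p\,(n - 1)}\, T^2 \sim F_{p,\, n-p}.
```

The second relation reduces inference about a multivariate mean to an ordinary F test, which is why the Hotelling distribution is the natural multivariate analogue of Student's t.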