Data Prep - Concentration of Values (Excel)



Data Preparation – Concentration of Values ‐1‐ © Spider Financial Corp, 2012

Concentration of Values

In this issue, the fourth tutorial in our data preparation series, we cover data sets whose values are concentrated in a tight range (e.g. proportions) or widely dispersed over several orders of magnitude (e.g. populations, income, rainfall volume).

For this tutorial, we’ll start by going through a few different value concentration cases: constrained values, mean/variance relationship, etc. Next, we explain their impact on analysis and forecasting. Finally, we present common transformation methods and discuss concerns when mapping results from the transformed data scale back to the raw data scale.

Background

Occasionally, we face a time series sample in which values are naturally restricted to a given range. For instance, mortality rates are restricted to between 0 and 1, and a trading strategy with a stop-loss order floors the downside while leaving the upside uncapped; the opposite holds when a limit sell order is used.

Other examples of constrained time series data include the following:

1. Proportions (restricted between 0 and 1, not including the end points)
2. Count data (i.e. integers)
3. Positive values
4. Non-negative values

Furthermore, a data set whose values span several orders of magnitude can prove problematic for modeling and forecasting. Examples of such data include income, population, and rainfall volume.




Why do we care?

First, time series models generally do not assume any bounds or limits on the values that the series can take, so using those models for a constrained data set may yield a poor fit.

Second, having a floor or a ceiling level in the data set affects the symmetry (or lack of skew) of the values around the mean. This phenomenon can also be difficult to capture using time series models.

Third, a relationship between the observation level and the local variance may develop and, for the same reasons as above, we’ll have to stabilize the variance before doing anything else.




Examine for concentration of values

In the field of data mining, values concentration is often referred to as “values clustering”; there is a huge volume of literature on grouping, testing, and analyzing such data.

Fortunately, we may be able to get away with a visual examination of the time plot of the data and/or distribution histogram.

Questions:

(1) Is the variance changing in relation to the observation levels?

(2) Are the data values capped or floor-leveled? Please note that the actual level may not be exact, because of potential slippage.

(3) Does the distribution show a skew in either direction?

In the case of variance stabilization, the aim of a variance-stabilizing transformation is to find a simple function $f$ to apply to the values $\{x_t\}$ in a data set to create new values $y_t = f(x_t)$ such that the variability of the values of $y$ is not related to their mean value.
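Question (1) above can also be checked numerically. The following is a minimal sketch, not part of the original tutorial: it splits a series into consecutive blocks and correlates each block's mean with its standard deviation. The synthetic series, the block size of 30, and the helper names `block_mean_std` and `correlation` are all assumptions made for illustration.

```python
import math
import random
import statistics


def block_mean_std(values, block_size):
    """Split the series into consecutive blocks; return (mean, std dev) per block."""
    pairs = []
    for start in range(0, len(values) - block_size + 1, block_size):
        block = values[start:start + block_size]
        pairs.append((statistics.fmean(block), statistics.stdev(block)))
    return pairs


def correlation(pairs):
    """Pearson correlation between the first and second element of each pair."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic series (assumption): exponential growth with multiplicative noise,
# so the local spread grows with the level.
rng = random.Random(42)
raw = [math.exp(0.01 * t + rng.gauss(0.0, 0.2)) for t in range(600)]
logged = [math.log(x) for x in raw]

raw_corr = correlation(block_mean_std(raw, 30))
log_corr = correlation(block_mean_std(logged, 30))
print(f"level/spread correlation, raw data: {raw_corr:.2f}")
print(f"level/spread correlation, log data: {log_corr:.2f}")
```

A correlation close to 1 on the raw data suggests that the variance moves with the level; a much weaker correlation after the transformation suggests that the relationship has been largely removed.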

We have a values concentration issue in our data... Now what?

Again, the answer is simple: transform the data to a symmetric, homoscedastic distribution.

The difficult question is (Q1): how do I transform the data?

To make things more interesting, the final estimates we obtain are affected when we use transformed data in our analysis rather than the original data. For example, a logarithmic transformation is often useful for inducing symmetry in data that have a positive skew. If we take the forecast mean on the transformed scale and transform it back by taking the antilog, we get the median, which (in this case) is less than the mean forecast of the raw data.
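The gap between the median and the mean can be seen on simulated data. The sketch below is illustrative only: the log-normal sample and its parameters (log-scale mean 1.0, log-scale standard deviation 0.8) are assumptions, not data from this tutorial.

```python
import math
import random
import statistics

# Synthetic positively skewed sample (assumption): log-normal with
# log-scale mean 1.0 and log-scale standard deviation 0.8.
rng = random.Random(7)
raw = [math.exp(rng.gauss(1.0, 0.8)) for _ in range(20000)]

# Average on the log scale, then map back with the antilog.
log_values = [math.log(x) for x in raw]
antilog_of_log_mean = math.exp(statistics.fmean(log_values))

print(f"antilog of mean on log scale: {antilog_of_log_mean:.3f}")
print(f"median of raw data:           {statistics.median(raw):.3f}")
print(f"mean of raw data:             {statistics.fmean(raw):.3f}")
```

The first two figures should be close to each other, while the raw mean should be noticeably larger.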

Q2: how can I construct the forecast (mean) value and confidence interval limits from the transformed forecast data?

Once we transform the data to a symmetric homoscedastic distribution, we construct a confidence interval for the forecast. Next, the confidence interval is transformed back to the original scale using the inverse of the transformation that was applied to the data.

This works well for the interval limits and the median, but not for the mean: back-transforming the mean of the transformed forecast does not, in general, give the mean on the original scale.

Transformation

There are several transformation algorithms to choose from, but care must be taken to choose the one that best treats the root problem. To pick an appropriate algorithm, we need to ask a few questions: (1) are we trying to induce symmetry, (2) are we trying to force a normal-like distribution, or (3) do we wish to stabilize the variance?

1. Logarithmic transformation

$$y_t = \ln(x_t + a), \qquad x_t + a > 0$$

The logarithmic transformation is often used to induce symmetry in the data and stabilize the variance¹. It is often favored because its results are easy to interpret.

The confidence interval limits can be transformed back directly, and the median is preserved by the back-transformation. The average (mean) is not preserved, so we need to compute its value separately.

Note: In the airline passenger problem, we used the log of the passenger data to stabilize the variance. When we constructed the confidence interval, the forecast of the transformed data was, by assumption, Gaussian, so the passenger forecast is log-normally distributed.

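The confidence interval average (or mean) is calculated as follows:

$$x_{T+l} = e^{Y_{T+l} + \frac{\sigma^2}{2}}$$

where $Y_{T+l}$ is the forecast mean of the transformed data at step $T+l$ and $\sigma^2$ is its forecast variance.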
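The back-transformation of a log-scale forecast can be sketched as follows. This is a minimal illustration rather than the tutorial's own implementation: the function name `back_transform_log_forecast`, the optional shift `a`, and the example numbers are assumptions. It relies on the standard log-normal result that the mean is the antilog of the log-scale mean plus half the log-scale variance.

```python
import math
from statistics import NormalDist


def back_transform_log_forecast(y_hat, sigma, confidence=0.95, a=0.0):
    """Map a Gaussian forecast on the log scale, y = ln(x + a), back to the raw scale.

    y_hat -- forecast mean on the log scale
    sigma -- forecast standard deviation on the log scale
    a     -- shift used in the transform (assumed 0 unless stated otherwise)
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return {
        "lower": math.exp(y_hat - z * sigma) - a,
        "median": math.exp(y_hat) - a,
        "mean": math.exp(y_hat + sigma ** 2 / 2.0) - a,
        "upper": math.exp(y_hat + z * sigma) - a,
    }


# Hypothetical forecast on the log scale.
forecast = back_transform_log_forecast(y_hat=6.0, sigma=0.1)
for name, value in forecast.items():
    print(f"{name:>6}: {value:.2f}")
```

The limits and the median come from applying the antilog directly; only the mean needs the extra variance term, and it always sits above the median.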

2. Square root transformation

We use the square root transformation for non-negative valued time series.

The square root (and Anscombe) transformation is often applied to stabilize the mean/variance dependency in Poisson-type data.

The rationale for this originally sprang from the fact that a data set $\{x_t\}$ is a realization of different Poisson distributions (i.e. the distributions each have a different mean value $\mu$); because the variance is identical to the mean in a Poisson distribution, the variance varies with the mean. However, after the simple variance-stabilizing transformation $y_t = \sqrt{x_t}$, the sample variance associated with each observation will be nearly constant.

Please note that the Anscombe transform is essentially a special case of the square root transformation:

$$y = 2\sqrt{x + \tfrac{3}{8}}$$

¹ A variance-stabilizing transformation aims to remove a mean/variance relationship, so that the variance becomes constant relative to the mean.
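The stabilizing effect of the square root and Anscombe transforms can be checked by simulation. The sketch below is an illustration under assumptions: the Poisson means (10, 30, 90), the sample size, and the helper `poisson_sample` (Knuth's multiplication method, used here only to avoid external libraries) are not from the original text.

```python
import math
import random
import statistics


def poisson_sample(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def anscombe(x):
    """Anscombe transform: y = 2 * sqrt(x + 3/8)."""
    return 2.0 * math.sqrt(x + 0.375)


rng = random.Random(2012)
results = {}
for mean in (10, 30, 90):
    counts = [poisson_sample(rng, mean) for _ in range(4000)]
    results[mean] = (
        statistics.variance(counts),                          # raw counts
        statistics.variance([math.sqrt(c) for c in counts]),  # square root
        statistics.variance([anscombe(c) for c in counts]),   # Anscombe
    )
    print(mean, [round(v, 3) for v in results[mean]])
```

The raw variance should track the mean, while the square root variance should stay near 1/4 and the Anscombe variance near 1, regardless of the mean.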

3. Logit transformation

If values are naturally restricted to the range 0 to 1, not including the end points, then a logit transformation may be appropriate. This transformation yields values in the range $(-\infty, \infty)$.

$$\mathrm{Logit}(p_t) = \ln\left(\frac{p_t}{1 - p_t}\right), \qquad 0 < p_t < 1$$
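A minimal sketch of the logit transform and its inverse is shown below. The function names and the example forecast (a proportion of 0.97 with a half-width of 1.5 on the logit scale) are hypothetical and chosen only to show that back-transformed limits stay inside (0, 1).

```python
import math


def logit(p):
    """Map a proportion in (0, 1) to the whole real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(y):
    """Map a real number back to a proportion in (0, 1)."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    exp_y = math.exp(y)  # avoids overflow for large negative y
    return exp_y / (1.0 + exp_y)


# Hypothetical forecast near the upper bound, built on the logit scale.
y_hat = logit(0.97)
half_width = 1.5
lower = inverse_logit(y_hat - half_width)
upper = inverse_logit(y_hat + half_width)
print(f"interval on the original scale: ({lower:.4f}, {upper:.4f})")
```

Because the interval is constructed on the unbounded scale and then mapped back, it is asymmetric around 0.97 but cannot cross 1, which a symmetric interval built directly on the proportion could do.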

4. Multiplicative Inverse (reciprocal) transformation

$$y_t = \frac{1}{x_t}$$

where $x_t \neq 0$.

The multiplicative inverse function is probably the simplest transform function, as it is self-inverse.

5. Power (Box-Cox) Transformation

The power transformations, especially the Box-Cox transformation, are often used in time series analysis to induce symmetry in the data and make it resemble a normal distribution.

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln(y) & \lambda = 0 \end{cases}$$

Note: The logarithmic transform is a special case of a Box‐Cox transformation.

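The only constraint is $y_t > 0$, so for data series with negative values, we can add a constant $\alpha$ such that $y_t + \alpha > 0$:

$$y^{(\lambda)} = \begin{cases} \dfrac{(y + \alpha)^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln(y + \alpha) & \lambda = 0 \end{cases}$$

The optimal parameters $\lambda$ and $\alpha$ can be selected by maximizing the log-likelihood function (LLF) of the transformed data (assuming a Gaussian distribution).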
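The parameter selection can be sketched with a simple grid search. This is an illustration under assumptions, not the tutorial's procedure: the function names, the grid from -2 to 2, and the log-normal test sample are hypothetical; the shift is held fixed rather than optimized; and the objective is the usual Gaussian profile log-likelihood, including the Jacobian term of the transform.

```python
import math
import random


def box_cox(y, lam, alpha=0.0):
    """Shifted Box-Cox transform of a single value; requires y + alpha > 0."""
    shifted = y + alpha
    if shifted <= 0:
        raise ValueError("y + alpha must be positive")
    if lam == 0:
        return math.log(shifted)
    return (shifted ** lam - 1.0) / lam


def profile_llf(values, lam, alpha=0.0):
    """Gaussian profile log-likelihood of lam (additive constants dropped)."""
    n = len(values)
    transformed = [box_cox(v, lam, alpha) for v in values]
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    log_jacobian = (lam - 1.0) * sum(math.log(v + alpha) for v in values)
    return -0.5 * n * math.log(variance) + log_jacobian


def best_lambda(values, alpha=0.0, grid=None):
    """Pick lam from a grid by maximizing the profile log-likelihood."""
    if grid is None:
        grid = [i / 10.0 for i in range(-20, 21)]  # -2.0 ... 2.0 in steps of 0.1
    return max(grid, key=lambda lam: profile_llf(values, lam, alpha))


# Synthetic log-normal sample (assumption): the log transform, i.e. a lambda
# near 0, is the natural choice for such data.
rng = random.Random(123)
sample = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(500)]
print("selected lambda:", best_lambda(sample))
```

A grid search keeps the sketch short and transparent; a numerical optimizer over both parameters would be the natural refinement.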




Note that it is not always necessary or desirable to transform a data set to resemble a normal distribution. However, if symmetry or normality is desired, it can often be induced through one of these power transformations.
