Modeling financial data using clustering and tree-based approaches

Fei CHEN, Stephen FIGLEWSKI, Andreas S. WEIGEND
Leonard N. Stern School of Business
New York University
44 West Fourth Street
New York, NY 10012
EMail: {fchen, sfiglews, aweigend}@stern.nyu.edu

Steven R. WATERHOUSE

Ultimode Systems

Data Mining Tools

EMail: [email protected]

Abstract

This paper compares tree-based approaches to clustering. We model a data set of 3 million T-bond futures transactions using these two techniques and compare their predictive performance on trade profit. We illustrate their respective strengths and weaknesses.

1 Problem

Financial data are usually modeled with supervised methods, where functional dependencies are estimated with respect to explicit targets (such as profit). Unsupervised methods, in contrast, apply in cases where hidden structures in the data need to be discovered without knowledge of such pre-specified targets. This paper investigates these two approaches to financial modeling and demonstrates their respective strengths and weaknesses.


The organization of this paper is as follows. Section 2 discusses the data and pre-processing. Section 3 describes the two modeling approaches. Section 4 presents comparison results. Section 5 concludes.

2 Data

We contrast tree-based methods with clustering on T-bond futures transaction data collected by the Commodity Futures Trading Commission's (CFTC) Computerized Trade Reconstruction (CTR) system. Each transaction record consists of a time, price, volume, a buy or sell indicator, and account and clearing firm numbers.

Figure 1: Local baseline definition of a trade: the identification of two consecutive maxima or minima defines a trade. In this plot, transactions between (a, b) and (c, d) constitute two trades.

Trees and clustering are compared in terms of profit prediction performance. This implies that the base unit of analysis is an individual trade. A trade is defined to be a collection of transactions that spans the period during which the trader enters and exits the market at the same level of overall account position size. Figure 1 illustrates this definition.
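To make this definition concrete, the following minimal sketch groups a chronologically sorted transaction stream into trades by closing a trade whenever an account's running position returns to the level it had when the trade opened. The field names `account` and `signed_size` are hypothetical and not from the paper.

```python
from collections import defaultdict

def split_into_trades(transactions):
    """Group transactions into trades, per account.

    A trade starts when the account's position leaves its current baseline
    and ends when the position returns to that baseline level.
    `transactions` is an iterable of dicts with hypothetical keys
    'account' and 'signed_size' (positive buy, negative sell),
    assumed to be sorted by time.
    """
    position = defaultdict(float)    # running position per account
    baseline = defaultdict(float)    # position level at the current trade's open
    open_trades = defaultdict(list)  # transactions of the currently open trade
    trades = []

    for tx in transactions:
        acct = tx['account']
        if not open_trades[acct]:
            baseline[acct] = position[acct]   # local baseline before the trade
        position[acct] += tx['signed_size']
        open_trades[acct].append(tx)
        if position[acct] == baseline[acct]:  # back to baseline: trade closed
            trades.append(open_trades[acct])
            open_trades[acct] = []
    return trades
```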

Trades are characterized by three types of variables: trade-specific (long/short, relative opening position size, maximum position size, trade length, exposure, profit), market-specific (market volume, volatility and relative price), and previous-trade-specific (profit of the previous trade). For a list of input variables and their formal definitions, see the appendix.

We use Gaussian mixture models for clustering. In order to suit the Gaussian assumptions, we take the logarithm of several variables: maximum position size, exposure, trade length, and volatility at the opening of a trade.

In addition, the profits of trades have to be normalized. This reduces bias towards large and/or long trades. As importantly, normalizing profit decreases the variance of big outliers and helps build more interpretable trees. Regression trees rely on the dependent variable for optimization, so outliers of un-normalized profit present a big problem. Since splits are made to reduce as much deviance as possible, the existence of large outliers leads to trees that absorb outliers in most of their leaves, except for a few leaves containing a disproportionate number of input patterns. Figure 2 shows such a tree. Note that 99% of the data is contained in one leaf, and the regression values in all other leaves lie in the range of the outliers. A tree like this has little predictive power, and does not embody any meaningful decision rules.

The un-normalized profit is extremely heavy-tailed, with a mean of 0.08 and a kurtosis of 12000. Several options exist for normalization: maximum position size, trade exposure, or trade length. We chose maximum position size, justified by an empirical examination which concluded that this normalization reduced kurtosis by a factor of 300 and produced the tree that was most evenly distributed in terms of its terminal leaf sizes (Figure 3).

Our analysis is based on one T-bond futures contract, consisting of 3 million records. After pre-processing, we use a randomly chosen 50% of the data for training and the rest for evaluation.
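As an illustration of the pre-processing described above, the sketch below log-transforms the skewed variables, normalizes profit by maximum position size, and draws a random 50/50 training/evaluation split. The pandas column names ('max_pos', 'exposure', 'length', 'open_volatility', 'profit') are assumptions for illustration, not the paper's actual data layout.

```python
import numpy as np
import pandas as pd

def preprocess(trades: pd.DataFrame, seed: int = 0):
    """Log-transform skewed variables, normalize profit, and split 50/50.

    `trades` is assumed to have one row per trade with hypothetical columns
    'max_pos', 'exposure', 'length', 'open_volatility', 'profit'.
    """
    df = trades.copy()
    for col in ['max_pos', 'exposure', 'length', 'open_volatility']:
        df['log_' + col] = np.log(df[col])            # suit the Gaussian assumption
    df['norm_profit'] = df['profit'] / df['max_pos']  # normalize by max position size

    train = df.sample(frac=0.5, random_state=seed)    # random 50% for training
    test = df.drop(train.index)                       # remainder for evaluation
    return train, test
```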


Figure 2: Tree found with un-normalized profit. Variable legend: maxPos = maximum position size; pExpo = log(volatility exposure); cExpo = log(chronological exposure); chronoLen = log(chronological length); US = long or short. True conditions follow the left binary branch and false conditions the right. The italicized numbers at terminal leaves show leaf size, while regular numbers indicate the regression value. Note two features: almost all data points are contained in the leftmost leaf, and all regression values lie in the outlier range except for the leftmost one. The mean of the unconditional profit is 0.08, with a standard deviation of 20.83.

3 Model Architecture, Objectives and Estimation

With the same objective of establishing a predictive model of T-bond futures trade profit, regression trees and clustering methods present two entirely different approaches to making predictions. On the one hand, regression trees partition the input space according to similar values in the target (e.g. profit) space. In contrast, clustering partitions the input space according to the proximity of input patterns to each other. We discuss these two models below.


Figure 3: Plots of log terminal leaf sizes (versus leaf index) of trees obtained by normalizing profit by maximum position size, by chronological trade exposure, and by trade length (panels: max pos, chro expo, chro len).

3.1 Trees

Tree-based modeling (Breiman et al. [2]) is an alternative to linear models for regression problems. Compared with linear additive models, trees present a collection of decision rules that are more easily interpreted and evaluated, while allowing for more general interactions between independent variables. The set of rules embodied in a tree is determined by recursive partitioning based on a likelihood function. Specifically, deviance is used to determine which partition is most likely given the data.

The likelihood for regression is based on Gaussians. It consists of an error model on the dependent variable, $y_i \sim \mathcal{N}(\mu_i, \sigma^2)$, and a structural component $\mu_i$ dependent upon the input pattern $x^{(i)}$. The deviance for the $i$th pattern is defined as

$D_i(\mu_i; y_i) = (y_i - \mu_i)^2 .$  (1)


This is the log-likelihood scaled by $\sigma^2$, which is assumed constant across all patterns. The mean parameter is also constant at a given node; thus the maximum likelihood estimate for a node is the mean of all the patterns contained within it. The deviance for a node is then defined as

$D(\mu; y) = \sum_i D_i(\mu; y_i) ,$  (2)

and the deviance of a split node is the sum of the deviances of its children nodes,

$D(\mu_L, \mu_R; y) = D(\mu_L; y_L) + D(\mu_R; y_R) .$  (3)

The best split maximizes the difference

$\Delta D = D(\mu; y) - D(\mu_L, \mu_R; y) .$  (4)
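A minimal sketch of the split criterion in equations (1)-(4): for a single input variable, the candidate threshold that maximizes the deviance reduction ΔD is selected. This is a toy illustration in numpy, not the CART implementation used in the paper.

```python
import numpy as np

def deviance(y: np.ndarray) -> float:
    """Node deviance: sum of squared deviations from the node mean (eqs. 1-2)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(x: np.ndarray, y: np.ndarray):
    """Return (threshold, delta_D) maximizing the deviance reduction (eq. 4)."""
    parent = deviance(y)
    best = (None, 0.0)
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        delta = parent - (deviance(left) + deviance(right))  # eqs. (3)-(4)
        if delta > best[1]:
            best = (float(threshold), delta)
    return best
```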

3.2 Clustering

Clustering is an unsupervised learning technique. The inputs are viewed as a set of unlabeled examples forming clusters in the space defined by the attribute variables. The goal is to group patterns that are "close" to each other into clusters, while at the same time maximizing the differences among the clusters (Banfield & Raftery [1]).

We assume that the input patterns are generated by a finite mixture of distributions (Titterington, Smith & Makov [5]). Except for the long/short variable, which is modeled as a binomial, all of the mixture distributions are assumed to be Gaussians. A Gaussian is parameterized by its mean and covariance matrix. We estimate these parameters by maximizing the likelihood of the data given the model.

The probability of observing an input $x$ is $f(x) = \sum_{k=1}^{K} P(x \mid k)\, P(k)$, where $K$ is the number of mixing Gaussians. We assume an independent Gaussian noise model:

$P(x \mid k) = \frac{1}{(2\pi)^{M/2}\, |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right) ,$  (5)

where $M$ is the dimensionality of $x$ and the parameters of the $k$th Gaussian are the mean $\mu_k$ and covariance $\Sigma_k$. The likelihood of the data given the model is

$\mathcal{L} = \prod_{i=1}^{N} f(x^{(i)}) .$  (6)


While a spherical structure of the covariance matrix (an identity matrix scaled by a constant) is clearly too inflexible, full covariance matrices, allowing for correlations between input variables, are not desirable either. We restrict our model to diagonal matrices, with the understanding that the input variables are chosen and rendered as uncorrelated as possible:

$\Sigma_k = \begin{pmatrix} v_1 & 0 & \cdots & 0 \\ 0 & v_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & v_M \end{pmatrix} .$  (7)

Reasonable assumptions have to be made on the priors. We assume uniform priors on the mean and Jeffreys priors on the variance; see Cheeseman & Stutz [3] for more details. The optimal number of clusters depends heavily upon the resolution of measurement error in the input space. In the case of financial data, we use sample variance as a proxy for the measurement errors of the input variables.

In estimating a mixture of distributions, we face two sets of unknowns: the posterior probability of a data point given the model, and the parameters of the model. We estimate them iteratively using the Expectation-Maximization algorithm (Dempster, Laird & Rubin [4]).
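The paper's clustering follows AutoClass-style Bayesian mixture estimation [3]; as a rough stand-in, the sketch below fits a diagonal-covariance Gaussian mixture with the EM algorithm via scikit-learn. The fixed number of components and the feature matrix `X` are assumptions, and the binomial treatment of the long/short variable and the priors described above are not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_diagonal_mixture(X: np.ndarray, n_clusters: int = 25, seed: int = 0):
    """Fit a Gaussian mixture with diagonal covariances by EM.

    X: (n_trades, n_features) array of (log-transformed) clustering variables.
    """
    gmm = GaussianMixture(
        n_components=n_clusters,
        covariance_type='diag',   # restrict Sigma_k to diagonal matrices (eq. 7)
        random_state=seed,
    )
    gmm.fit(X)                    # EM: alternate responsibilities and parameter updates
    responsibilities = gmm.predict_proba(X)  # P(k | x) for each training pattern
    return gmm, responsibilities
```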

3.3 Prediction

Predicting profit using a regression tree is straightforward. An out-of-sample pattern is dropped down the tree until it reaches a terminal leaf by satisfying a chain of conditions. The prediction for the pattern is then the mean profit associated with that leaf.

Predicting using Gaussian mixtures is a little more involved. An out-of-sample pattern $x^{(new)}$ is evaluated by the model, generating a probability vector of $K$ elements, with the $k$th element being the probability, $P(k \mid x^{(new)})$, that the new pattern belongs to cluster $k$. Two kinds of predictions can be made: a mean prediction and a full density estimate of profit given $x^{(new)}$. The mean prediction is simply the weighted sum of the cluster profit means,

$\hat{y}(x^{(new)}) = \sum_{k=1}^{K} P(k \mid x^{(new)})\, \bar{y}_k ,$  (8)

where $\bar{y}_k$ is the mean profit of cluster $k$.


If we assume the profit of each cluster is distributed as a Gaussian, the full density estimate of profit is a mixture of Gaussians, parameterized by the clusters' profit means and variances,

$p(y \mid x^{(new)}) = \sum_{k=1}^{K} P(k \mid x^{(new)})\, \mathcal{N}(y;\, \bar{y}_k,\, \sigma_k^2) .$  (9)
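A minimal sketch of the two prediction modes in equations (8) and (9), assuming a fitted mixture `gmm` (as above), per-cluster profit means and standard deviations estimated from the training data, and scipy for the Gaussian densities. These helper names are illustrative, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def cluster_profit_stats(responsibilities, profit):
    """Responsibility-weighted mean and std of profit for each cluster."""
    weights = responsibilities.sum(axis=0)                  # soft cluster sizes
    means = responsibilities.T @ profit / weights           # \bar{y}_k
    second = responsibilities.T @ (profit ** 2) / weights
    stds = np.sqrt(np.maximum(second - means ** 2, 1e-12))  # sigma_k
    return means, stds

def predict_profit(gmm, X_new, means, stds, y_grid):
    """Mean prediction (eq. 8) and mixture density on y_grid (eq. 9)."""
    resp = gmm.predict_proba(X_new)            # P(k | x) for the new patterns
    mean_pred = resp @ means                   # weighted sum of cluster profit means
    density = resp @ norm.pdf(y_grid[None, :], means[:, None], stds[:, None])
    return mean_pred, density
```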

4 Evaluation

This section shows tree and clustering results obtained from training on a randomly selected 50% of the data set and evaluating on the rest. It has to be noted immediately that a true out-of-sample evaluation would need to be based on different contracts; thus the results reported here might not be generalizable.

4.1 Trees

Figure 4 shows a regression tree built on the sample data. A surprising variable involved in the splits is the long/short dimension. Furthermore, we note that all long trades on average earn a profit while short trades lose. This result suggests that the market price of the contract we are analyzing went up during its trading period, hence long trades tend to do well.

In order to provide a benchmark for the performance of the tree predictions, the same training data set is also modeled with a simple linear regression, with normalized profit as the dependent variable. Figure 5 compares the prediction results of the tree and the linear regression. The quantization in the linear regression prediction is attributed to the presence of the binary long/short variable. As in the case of the tree regression, the linear model picked out this variable as being the most significant in predicting profit (with a coefficient of 0.6 versus essentially 0 for the rest of the variables).

In terms of overall performance, the tree fares much better, with a normalized mean squared error, $\mathrm{NMSE} = \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, of 0.53 versus the linear regression's 0.967. This is understood to be due to the fact that trees handle outliers much more comfortably by considering them in different segments of the model. In contrast, linear regression builds a global model and does not deal with the issue of outliers well at all.
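The NMSE above divides the prediction error by the error of predicting the unconditional mean, so a value near 1 means essentially no predictive power. A minimal sketch:

```python
import numpy as np

def nmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Normalized mean squared error: sum (y - yhat)^2 / sum (y - ybar)^2."""
    return float(np.sum((y_true - y_pred) ** 2)
                 / np.sum((y_true - y_true.mean()) ** 2))

# Example: a constant prediction at the sample mean gives NMSE = 1.0.
```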


Figure 4: A regression tree built on the sample data. Variable legend: US = long or short; Expo = log(exposure in volatility time); diffPrice = price at reversal minus price at opening; maxPos = maximum position size. Italicized numbers are the sizes of the leaves; regular numbers at terminal leaves are regression values. Only 4 variables are significant in predicting profit: long/short, log(volatility exposure), the reversal-minus-open price difference, and log(maximum position size). Note that on average all long trades earn a profit, while all short trades sustain a loss.

4.2 Gaussian Mixtures

Twenty-five clusters were found. Tables 1 and 2 contain, for each cluster, the mean value of each clustering variable.

Figure 6 shows the distributions of the predictions by Gaussian mixture clustering, linear regression, and the tree. Figure 7 compares the predictions of the three methods. Note that profit was not included in training the Gaussian mixtures, which are fit to the input space only. Not surprisingly, out-of-sample performance is lacking (NMSE ≈ 0.993).

The predicted range of profit is quite small (-0.05 to 0.2), although true profit varies much more (-7 to 7).


Figure 5: Prediction performance comparison for normalized profit (profit / maximum position size). The top panel shows predictions of the tree versus the linear regression. The bottom two panels show tree and linear regression predictions plotted against the observed normalized profit.

Considering that the unconditional mean profit of the sample is 0.014, this suggests that the clusters model only a small region in the profit space where most trades lie, making it the most sensible area to base predictions upon. Essentially, we see no predictive power of Gaussian mixtures.

5 Conclusions

This paper compares clustering and tree-based approaches to modeling financial data, and demonstrates that understanding data intelligently necessitates the application of different tools.

Both modeling approaches are local, in the sense of partitioning the input space into regions of similar qualities. On the one hand, trees provide an interpretable method that is suitable for identifying outliers in the dependent-variable space, thus supplying an exploratory data analysis tool and helping discover better formulated hypotheses.


Figure 6: Distributions of normalized profit: observed values, clustering predictions, linear regression predictions, and tree predictions. Note that the tree approximates the range of the true values the closest, giving rise to its high value.

However, the lack of full consideration of the independent variables is the tree-based method's weakness. The strength of clustering, on the other hand, lies in its probabilistic approach to analyzing the input variables. Although clusters are not as interpretable, they reveal hidden structures in the input space, at the same time solving the outlier problem by segmenting outliers away into separate clusters.

It is thus desirable to solve regression problems locally, conditional on the inputs: growing local trees in clusters, or finding local clusters in leaves. Such a blend of supervised and unsupervised learning schemes exists in the hierarchical mixture of experts architecture, the application of which is the natural next step in this project.

References

[1] J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-822, 1992.


Figure 7: Out-of-sample performance of the Gaussian mixture model compared with the tree and the linear regression. The three panels are true normalized profit, tree prediction, and linear regression prediction, each plotted against the cluster prediction.

[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees (CART). Wadsworth, Pacific Grove, CA, 1984.

[3] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and results. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Knowledge Discovery in Data Bases II, chapter 6, pages 153-180. AAAI Press / The MIT Press, Menlo Park, CA, 1995.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.

[5] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley, New York, 1985.

A Feature Vectors

Trades are characterized by trade-specific, market-specific and previous-trade-specific variables. The trade-specific variables reflect the state of each trade. The market-specific (exogenous) variables characterize the environment in which a trade takes place. The previous-trade-specific variable makes a first-order Markov assumption: current trades depend only upon immediately preceding trades.

We first define notation. Let $t$ denote time, $t_o$ the time at which a trade opens, and $t_e$ the time at which it closes. The sign of a trade, long or short, is +1 or -1 respectively. Let $a_t$ denote the signed transaction size at $t$, and $p_t$ the transaction price.

An "active minute" is defined to be a minute in which at least

one transaction takes place in the full data set. Let minj denote

the minimum price of the price series of the 30 active minutes ending

at t. Similarly, maxj denotes the maximum price of the same price

series.Two time scales are used to obtain some of the variables. Chrono-

logical time is measured in minutes. Volatility time is measured by

the cumulative sum of squared relative returns since the first trans-

action in the 9209 contract.The cumulative sum function of &» cumsum(&), is

1=1

When & — at, the cumulative sum becomes the position size at t,

which is denoted by st*Using these definitions, the trade-specific variables for clustering

are:

• maximum absolute position size, max \st\ ;duration of trade

. . , . , . _ \*to\ _ .• opening relative to trade maximum, - ; — r ,

maXduration of trade |«t|

• exposure, the area under position size versus chronological and

volatility time;


• sign of trade, long = +1 and short = -1;

• length of trade, $t_e - t_o$, as measured in volatility time.
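As a concrete reading of these definitions, the sketch below computes the position series as the cumulative sum of signed transaction sizes (eq. 10) and the exposure of a trade as the area under the absolute position versus time. The choice of a piecewise-constant position between transactions and the variable names are assumptions for illustration.

```python
import numpy as np

def position_series(signed_sizes: np.ndarray) -> np.ndarray:
    """Position size s_t as the cumulative sum of signed transaction sizes (eq. 10)."""
    return np.cumsum(signed_sizes)

def exposure(position: np.ndarray, times: np.ndarray) -> float:
    """Exposure of a trade: area under |position| versus time.

    `times` are the transaction times of the trade on the chosen scale
    (chronological minutes or volatility time); the position is assumed
    piecewise constant between transactions.
    """
    dt = np.diff(times)
    return float(np.sum(np.abs(position[:-1]) * dt))
```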

The market-specific variables include information on prices, empirical volatilities, and trading volume.

The price variable is given at the time of opening a trade. It is expressed as the relative location of the current price with respect to the minimum and maximum of the local bar of the last 30 active minutes, and is scaled to lie between -1 and 1 (in cases when the denominator is zero, we set the relative price to zero):

$\mathrm{relPrice}_t = \frac{2\,(p_t - \min_t^{(30)})}{\max_t^{(30)} - \min_t^{(30)}} - 1 .$  (11)

In addition, ratios of moving averages of prices are computed, based on a 60 minute window and a 1440 minute (one day) window. This variable is included at the time of opening of a trade.

Volatility and volume at the opening of a trade are computed with an exponential filter capturing information on the time scale of 30 active minutes (volume and volatility) and 3000 active minutes (volume), the latter corresponding to one week. The exponential filter for the squared relative return is

$z_t = (1 - \lambda)\,(\log p_t - \log p_{t-1})^2 + \lambda\, z_{t-1} ,$  (12)

where the decay parameter $\lambda$ is either 0.936 or 0.999, corresponding to 30 or 3000 active minutes respectively, $z_t$ is the filtered series, and the volatility is $v_t = \sqrt{z_t}$. We analogously filter the volume series.
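A minimal sketch of the exponential filter in equation (12), applied to a price series sampled at active minutes; the decay values follow the text, and the conversion from raw transactions to an active-minute price series is assumed to be handled upstream.

```python
import numpy as np

def exp_filtered_volatility(prices: np.ndarray, decay: float = 0.936) -> np.ndarray:
    """Exponentially filtered volatility v_t = sqrt(z_t), with
    z_t = (1 - decay) * (log p_t - log p_{t-1})^2 + decay * z_{t-1}  (eq. 12).

    decay = 0.936 corresponds to roughly 30 active minutes, 0.999 to 3000.
    """
    sq_returns = np.diff(np.log(prices)) ** 2
    z = np.empty_like(sq_returns)
    z[0] = sq_returns[0]            # initialize the filter at the first squared return
    for t in range(1, len(sq_returns)):
        z[t] = (1 - decay) * sq_returns[t] + decay * z[t - 1]
    return np.sqrt(z)
```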

The set of previous-trade-specific variables currently contains only one member: the profit of the trade immediately preceding the current one within the same account.

Below we give a brief numerical description of the data, conditioned on whether a trade is profitable (Tables 3 and 4).


Table 1: Mean values of the trade-specific variables used in clustering, by cluster. Clusters are arranged in increasing cluster size N. The column labels are: cluster size N, percent long (%L), log(maximum position size) (maxPos), relative open position (entPos), log(chronological exposure) (lcExpo), log(volatility exposure) (lvExpo), log(volatility length) (lvLen), and previous trade profit (prevP). The last column, P, is profit, which was not used in clustering.

        N     %L   maxPos  entPos  lcExpo   lvExpo   lvLen   prevP       P
 1     28   0.43    0.57    0.86    8.72     3.45    1.91    0.00   -0.04
 2     45   0.51    5.17    0.39   14.57  2275.29   14.24  -11.01    0.22
 3     64   0.79    1.62    0.50    4.06     0.09    0.02    0.01    0.03
 4     98   0.60    0.41    0.89    9.36     5.90    4.51   -0.03   -0.06
 5    103   0.41    5.12    0.39   12.01   222.75    1.71  -24.66   -0.01
 6    108   0.69    0.32    0.94    3.08     0.03    0.02    0.04   -0.03
 7    145   0.53    3.59    0.49   13.05   320.04   14.77    0.21    0.19
 8    366   0.51    3.97    0.66    7.25     3.03    0.08    0.70    0.01
 9    368   0.52    4.09    0.51   11.79    84.14    2.03    1.44    0.02
10    400   0.52    1.66    0.67   11.27    51.31   15.91    0.01    0.06
11    447   0.62    0.32    0.93    7.84     1.62    1.26    0.08    0.01
12    544   0.53    1.42    0.76    6.16     1.16    0.45    0.03    0.02
13    655   0.50    3.32    0.54   10.38    18.03    1.13    0.54    0.01
14    761   0.50    0.50    0.91    2.65     0.03    0.02    0.03    0.01
15    765   0.54    0.79    0.75    9.85    14.22    8.54    0.14    0.16
16    974   0.50    0.59    0.89    4.07     0.10    0.06    0.06    0.02
17   1136   0.00    2.31    0.65  -69.07     0.00    0.00    0.02    0.00
18   1379   0.50    0.62    0.85    8.26     2.46    1.48    0.00    0.06
19   1644   0.00    0.55    0.94  -69.07     0.00    0.00    0.04    0.00
20   1801   0.00    0.78    0.91  -69.07     0.00    0.00    0.03    0.00
21   2042   0.57    0.67    0.88    4.14     0.12    0.07    0.02    0.01
22   2235   0.52    2.85    0.59    6.16     1.09    0.12    0.13    0.01
23   3250   0.47    1.88    0.72    6.94     1.62    0.48    0.13    0.01
24   5792   0.47    0.49    0.91    1.77     0.01    0.00    0.03    0.00
25   5958   0.43    1.95    0.62    3.57     0.06    0.01    0.05    0.01


Table 2: Mean values of the market-specific variables used in clustering, by cluster. Column labels are: cluster size N, log(open volatility) (lentVolat), open relative price (entPrice), reversal relative price minus open relative price (diffPrice), open 30-minute market volume (entVol30), open 3000-minute volume (entVol3000), and the ratio of the 60-minute to the 1440-minute open moving-average price (entAvg).

        N   lentVolat  entPrice  diffPrice  entVol30  entVol3000  entAvg
 1     28     -7.74      0.08      0.68       2022        259      0.99
 2     45     -7.83      0.48      0.00       1604        890      1.00
 3     64     -8.05      0.10      0.47       1249        923      1.00
 4     98     -9.08      0.62     -0.15        753        381      1.00
 5    103     -7.83      0.50      0.04       1754        970      1.00
 6    108     -8.05      0.89     -0.65       1325        745      1.00
 7    145     -7.79      0.55     -0.04       1550        770      1.00
 8    366     -7.75      0.51     -0.01       1850        945      1.00
 9    368     -7.89      0.51      0.00       1555        911      1.00
10    400     -7.68      0.58     -0.06       1431        666      1.00
11    447     -8.02      0.90     -0.52       1443        791      1.00
12    544     -7.57      0.45      0.00        840        138      0.99
13    655     -7.96      0.49      0.02       1473        887      1.00
14    761     -7.79      0.44      0.02       1115        458      0.99
15    765     -7.83      0.60     -0.08       1613        912      1.00
16    974     -7.98      0.90     -0.71       1375        934      1.00
17   1136     -7.79      0.54      0.00       1745        936      1.00
18   1379     -7.98      0.18      0.44       1358        898      1.00
19   1644     -7.86      0.77      0.00       1521        920      1.00
20   1801     -7.94      0.20      0.00       1417        836      1.00
21   2042     -8.02      0.13      0.51       1267        898      1.00
22   2235     -7.95      0.18      0.31       1459        905      1.00
23   3259     -7.90      0.86     -0.40       1492        893      1.00
24   5792     -7.89      0.52     -0.01       1485        929      1.00
25   5958     -7.88      0.53     -0.01       1522        915      1.00


Table 3: Trade-specific variables, conditioned upon profitability (+: profitable, -: unprofitable). Mean and standard error (in parentheses) are reported. The variables are (from left to right): long or short (%L), maximum position size, relative open position, chronological length (in minutes), chronological exposure, log(volatility length), log(volatility exposure).

      %L            maxPos         relOpen         cLen         cExpo          lvLen         lvExpo
+    0.56 (0.01)   13.53 (0.18)   0.726 (0.002)   21bU (39)    22044 (1438)   1.12 (0.02)   11.74 (0.82)
-    0.44 (0.01)   14.83 (0.25)   0.721 (0.002)   2281 (40)    24053 (1293)   1.17 (0.02)   12.75 (0.74)

Table 4: Market-specific variables, conditioned upon profitability. The variables are (from left to right): log(open volatility), open relative price, reversal relative price, 30-minute open volume, 3000-minute open volume.

      lentVolat         entPrice        revPrice        entVol30   entVol3000
+    -7.881 (0.004)    0.515 (0.002)   0.511 (0.002)   1487 (5)    854 (1)
-    -7.862 (0.004)    0.508 (0.002)   0.503 (0.002)   1503 (6)    859 (1)


Transactions on Information and Communications Technologies vol 19 © 1998 WIT Press, www.witpress.com, ISSN 1743-3517