Modeling financial data using clustering and tree-based approaches
Fei CHEN, Stephen FIGLEWSKI, Andreas S. WEIGEND
Leonard N. Stern School of Business
New York University
44 West Fourth Street
New York, NY 10012
EMail: {fchen | sfiglews | aweigend}@stern.nyu.edu
Steven R. WATERHOUSE
Ultimode Systems
Data Mining Tools
EMail: [email protected]
Abstract
This paper compares tree-based approaches to clustering. We model a set of 3 million T-bond futures transaction records using these two techniques and compare their predictive performance on trade profit. We illustrate their respective strengths and weaknesses.
1 Problem
Financial data are usually modeled with supervised methods, where functional dependencies are estimated with explicit targets (such as profit). Unsupervised methods, in contrast, apply in cases where hidden structures in the data need to be discovered without knowledge of such pre-specified targets. This paper investigates these two approaches to financial modeling and demonstrates their respective strengths and weaknesses.
Transactions on Information and Communications Technologies vol 19 © 1998 WIT Press, www.witpress.com, ISSN 1743-3517
The organization of this paper is as follows. Section 2 discusses the data and pre-processing. Section 3 describes the two modeling approaches. Section 4 presents comparison results. Section 5 concludes.
2 Data
We contrast tree-based methods with clustering on T-bond futures transaction data collected by the Commodity Futures Trading Commission's (CFTC) Computerized Trade Reconstruction (CTR). Each transaction record consists of a time, price, volume, buy or sell information, and account and clearing firm numbers.
[Figure 1 plot, panel title "defining a trade: local baseline"; horizontal axis: time.]
Figure 1: Local baseline definition of a trade: two consecutive maxima or minima identify a trade. In this plot, the transactions between (a, b) and between (c, d) constitute two trades.
Trees and clustering are compared in terms of profit prediction performance. This implies that the base unit of analysis is an individual trade. A trade is defined to be a collection of transactions that spans the period during which the trader enters and exits the market at the same level of overall account position size. Figure 1 illustrates this definition.
Trades are characterized by three types of variables: trade-specific (long/short, relative opening position size, maximum position size, trade length, exposure, profit), market-specific (market volume, volatility and relative price), and previous-trade-specific (profit of previous trade). For a list of input variables and their formal definitions, see the appendix.
We use Gaussian mixture models for clustering. To suit the Gaussian assumptions, we take the logarithm of several variables: maximum position size, exposure, trade length and volatility at the opening of a trade.
In addition, profits of trades have to be normalized. This reduces the bias towards large and/or long trades. Equally important, normalizing profit decreases the variance of big outliers and helps build more interpretable trees. Regression trees rely on dependent variables for optimization. Thus outliers of un-normalized profit present a big problem. Since splits are made to reduce as much deviance as possible, the existence of large outliers leads to trees that absorb outliers in most of their leaves, except a few containing a disproportionate number of input patterns. Figure 2 shows such a tree. Note that 99% of the data is contained in one leaf, and the regression values in all other leaves lie in the range of outliers. A tree like this has little predictive power, and does not embody any meaningful decision rules.
The unnormalized profit is extremely heavy-tailed, with a mean of 0.08 and a kurtosis of 12000. Several options exist for normalization: maximum position size, trade exposure, or length. We chose maximum position size, justified by an empirical examination which found that this normalization reduced kurtosis by a factor of 300, and produced a tree that was most evenly distributed in terms of its terminal leaf size (Figure 3).
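The effect of normalizing profit by maximum position size on kurtosis can be illustrated with a minimal sketch. The data below are synthetic stand-ins, not the CFTC records, and the function names `kurtosis` and `normalize_profit` are hypothetical; the kurtosis convention (non-excess, so a Gaussian gives 3) is also an assumption, since the paper does not state which one it uses.

```python
import numpy as np

def kurtosis(x):
    """Non-excess sample kurtosis: fourth central moment over squared variance."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**4) / np.mean(centered**2) ** 2)

def normalize_profit(profit, max_pos):
    """Divide each trade's profit by its maximum absolute position size."""
    return np.asarray(profit, dtype=float) / np.asarray(max_pos, dtype=float)

# Synthetic stand-in data (assumption): position sizes are heavy-tailed and
# profit scales with position size, which produces very heavy-tailed raw profit.
rng = np.random.default_rng(0)
max_pos = np.exp(rng.normal(1.0, 1.5, size=100_000))
profit = max_pos * rng.normal(0.0, 1.0, size=max_pos.size)

raw_kurtosis = kurtosis(profit)
normalized_kurtosis = kurtosis(normalize_profit(profit, max_pos))
print(raw_kurtosis, normalized_kurtosis)
```

In this toy setting the normalized profit is Gaussian by construction, so its kurtosis is close to 3 while the raw kurtosis is far larger; real trade data would not normalize this cleanly.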
Our analysis is based on one T-bond futures contract, consisting of 3 million records. After pre-processing, we use a randomly chosen 50% of the data for training, and the rest for evaluation.
[Figure 2: regression tree plot with root split maxPos<7.0; the remaining node labels, leaf sizes and regression values are not recoverable from the extracted text. Legend: maxPos: maximum position size; Expo: log(volatility exposure); cExpo: log(chronological exposure); Len: log(chronological length); L/S: long or short.]
Figure 2: Tree plot found with un-normalized profit. True conditions follow the left binary branch and false the right. The italicized numbers at terminal leaves show leaf size, while regular numbers indicate regression value. Note two features: almost all data points are contained in the leftmost leaf; all regression values lie in the outlier range except for the leftmost one. The mean of the unconditional profit is 0.08, with a standard deviation of 20.83.
3 Model Architecture, Objectives and Estimation
With the same objective of establishing a predictive model of T-bond futures trade profit, regression trees and clustering methods present two entirely different approaches to making predictions. On the one hand, regression trees partition the input space according to similar values in the target (e.g. profit) space. In contrast, clustering partitions the input space according to the proximity of input patterns to each other. We discuss these two models below.
[Figure 3 plot: three panels titled "max pos", "chro expo" and "chro len"; horizontal axis: leaf index.]
Figure 3: Plots of log terminal leaf sizes of trees obtained by normalizing profit by maximum position size versus chronological trade exposure and length.
3.1 Trees
Tree-based modeling (Breiman et al. [2]) is an alternative to linear models for regression problems. Compared with linear additive models, trees present a collection of decision rules that are easier to interpret and evaluate, while allowing for more general interactions between independent variables. The set of rules embodied in a tree is determined by recursive partitioning based on a likelihood function. Specifically, deviance is used to determine which partition is most likely given the data.
The likelihood for regression is based on Gaussians. It consists of an error model on the dependent variable, $y_i \sim N(\mu_i, \sigma^2)$, and a structural component $\mu_i$ that depends on the input pattern $x^{(i)}$. The deviance for the $i$th pattern is defined as
$$D(\mu_i; y_i) = (y_i - \mu_i)^2. \qquad (1)$$
Up to an additive constant, this is minus twice the log-likelihood scaled by $\sigma^2$, which is assumed constant across all patterns. The mean parameter is also constant at a given node, thus the maximum likelihood estimate of $\mu$ for a node is the mean of the target values of all the patterns contained within. The deviance for a node is then defined as
$$D(\mu; y) = \sum_i (y_i - \mu)^2, \qquad (2)$$
and the deviance of a split node is the sum of the deviances of its children nodes,
$$D(\mu_L, \mu_R; y) = \sum_{i \in L} (y_i - \mu_L)^2 + \sum_{i \in R} (y_i - \mu_R)^2. \qquad (3)$$
The best split maximizes the difference
$$\Delta D = D(\mu; y) - D(\mu_L, \mu_R; y). \qquad (4)$$
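The split criterion can be made concrete with a small sketch that searches a single input variable for the threshold giving the largest deviance reduction. This is an illustration under stated assumptions, not the tree software used in the paper: the function names `node_deviance` and `best_split` are hypothetical, candidate thresholds are taken as midpoints between consecutive distinct input values, and only one input variable is considered.

```python
import numpy as np

def node_deviance(y):
    """Node deviance: sum of squared deviations of the targets from the node mean."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(x, y):
    """Return (threshold, deviance reduction) of the best split 'x < threshold'."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    parent = node_deviance(y)
    best_threshold, best_gain = None, 0.0
    values = np.unique(x)
    # Candidate thresholds (assumption): midpoints between consecutive distinct values.
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left = y[x < threshold]
        right = y[x >= threshold]
        # Deviance reduction: parent deviance minus the summed child deviances.
        gain = parent - (node_deviance(left) + node_deviance(right))
        if gain > best_gain:
            best_threshold, best_gain = float(threshold), gain
    return best_threshold, best_gain

print(best_split([1, 2, 3, 4], [0, 0, 10, 10]))
```

A full tree would apply this search to every input variable at every node and recurse on the two children; a single extreme target value can dominate the deviance, which is why un-normalized profit tends to yield leaves that isolate outliers.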
3.2 Clustering
Clustering is an unsupervised learning technique. The inputs are viewed as a set of unlabeled examples forming clusters in the space defined by the attribute variables. The goal is to group patterns that are "close" to each other into clusters, while at the same time maximizing the differences among the clusters (Banfield & Raftery [1]).
We assume that the input patterns are generated by a finite mixture of distributions (Titterington, Smith & Makov [5]). Except for the long/short variable, which is modeled as a binomial, all of the mixture distributions are assumed to be Gaussians. A Gaussian is parameterized by its mean and covariance matrix. We estimate these parameters by maximizing the likelihood of the data given the model.
The probability of observing an input $x$ is $f(x) = \sum_{k=1}^{K} P(x \mid k) P(k)$, where $K$ is the number of mixing Gaussians. We assume an independent Gaussian noise model:
$$f(x \mid \theta_k) = \frac{1}{(2\pi)^{M/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right), \qquad (5)$$
where $M$ is the dimensionality of $x$ and the parameters $\theta_k$ of the $k$th Gaussian are its mean $\mu_k$ and covariance $\Sigma_k$. The likelihood of the data given the model is
$$\mathcal{L} = \prod_{i=1}^{N} f(x^{(i)}) = \prod_{i=1}^{N} \sum_{k=1}^{K} P(x^{(i)} \mid k) P(k), \qquad (6)$$
where $N$ is the number of input patterns.
While a spherical structure of the covariance matrix (identity
matrix scaled by a constant) is clearly too inflexible, full covariance
matrices, allowing for correlations between input variables, are not
desirable either. We restrict our model to diagonal matrices, with
the understanding that the input variables are chosen and rendered
as uncorrelated as possible.
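$$\Sigma_k = \begin{pmatrix} v_1 & 0 & \cdots & 0 \\ 0 & v_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_M \end{pmatrix} \qquad (7)$$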
Reasonable assumptions have to be made on the priors. We assume uniform priors on the mean and Jeffreys priors on the variance; see Cheeseman & Stutz [3] for more details. The optimal number of clusters depends heavily upon the resolution of measurement error in the input space. In the case of financial data, we use sample variance as a proxy for the measurement errors of the input variables.
In estimating a mixture of distributions, we face two sets of unknowns: the posterior probability of a data point given the model, and the parameters of the model. We estimate them iteratively using the Expectation-Maximization algorithm (Dempster, Laird & Rubin [4]).
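The alternation between posterior probabilities and parameters can be sketched as a plain maximum-likelihood EM loop for a Gaussian mixture with diagonal covariances. This is a simplified illustration, not the procedure used in the paper: it omits the priors on means and variances and the binomial long/short variable, and the names `fit_diag_gmm`, `posteriors`, `n_iter` and `min_var` are hypothetical.

```python
import numpy as np

def posteriors(x, weights, means, variances):
    """P(k | x) for each row of x, computed in log space for numerical stability."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - means[None, :, :]
    # Log density of a diagonal Gaussian, summed over the M input dimensions.
    log_pdf = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)
    log_joint = log_pdf + np.log(weights)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)

def fit_diag_gmm(x, n_clusters, n_iter=100, seed=0, min_var=1e-6):
    """Maximum-likelihood EM for a Gaussian mixture with diagonal covariances."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    rng = np.random.default_rng(seed)
    # Initialization (assumption): means at randomly chosen patterns,
    # variances at the overall sample variance, equal mixing proportions.
    means = x[rng.choice(n, size=n_clusters, replace=False)]
    variances = np.tile(x.var(axis=0) + min_var, (n_clusters, 1))
    weights = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        resp = posteriors(x, weights, means, variances)   # E-step
        counts = resp.sum(axis=0) + 1e-12                 # M-step
        weights = counts / n
        means = (resp.T @ x) / counts[:, None]
        second_moment = (resp.T @ x**2) / counts[:, None]
        variances = np.maximum(second_moment - means**2, min_var)
    return weights, means, variances
```

The variance floor `min_var` is a crude substitute for the role a prior plays in keeping components from collapsing onto single points; it is not part of the original model.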
3.3 Prediction
Predicting profit using a regression tree is straightforward. An out-of-sample pattern is dropped down the tree until it reaches a terminal leaf by satisfying a chain of conditions. The prediction for the pattern is then the mean profit associated with that leaf.
Predicting using Gaussian mixtures is a little more involved. An out-of-sample pattern is evaluated by the model, generating a probability vector of $K$ elements, with the $k$th element being the probability, $P(k \mid x^*)$, that the new data pattern $x^*$ belongs to cluster $k$.
Two kinds of predictions can be made: a mean prediction and a full density estimation of profit given $x^*$. The mean prediction is simply the sum of the cluster profit means weighted by these probabilities,
$$\hat{y}(x^*) = \sum_{k=1}^{K} P(k \mid x^*) \, \bar{y}_k, \qquad (8)$$
where $\bar{y}_k$ denotes the mean profit of cluster $k$.
If we assume the profit of each cluster is distributed as a Gaussian, a full density estimation of profit is a mixture of Gaussians, parameterized by the clusters' means and variances,
$$p(y \mid x^*) = \sum_{k=1}^{K} P(k \mid x^*) \, N(y; \bar{y}_k, s_k^2), \qquad (9)$$
where $s_k^2$ denotes the profit variance of cluster $k$.
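Both kinds of mixture prediction can be sketched in a few lines, assuming the cluster membership probabilities for the new pattern and the per-cluster profit means and variances are already available. The function and argument names (`predict_mean`, `predict_density`, `resp`, `cluster_mean`, `cluster_var`) are hypothetical.

```python
import numpy as np

def predict_mean(resp, cluster_mean):
    """Mean prediction: membership-weighted sum of the cluster profit means."""
    return float(np.asarray(resp, dtype=float) @ np.asarray(cluster_mean, dtype=float))

def predict_density(y, resp, cluster_mean, cluster_var):
    """Density of profit y: mixture of per-cluster Gaussians weighted by membership."""
    resp = np.asarray(resp, dtype=float)
    mean = np.asarray(cluster_mean, dtype=float)
    var = np.asarray(cluster_var, dtype=float)
    component = np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return float(resp @ component)

# Toy example with two clusters (values are illustrative only).
print(predict_mean([0.25, 0.75], [0.0, 4.0]))
print(predict_density(0.0, [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))
```

Because the mean prediction is a convex combination of cluster means, it can never leave the range spanned by those means, which bounds the spread of predictions the mixture can produce.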
4 Evaluation
This section shows tree and clustering results obtained by training on a randomly selected 50% of the data set and evaluating on the rest. It has to be noted immediately that a true out-of-sample evaluation needs to be based on different contracts; thus the results reported here might not generalize.
4.1 Trees
Figure 4 shows a regression tree built on the sample data. A surprising variable involved in the splits is the long/short dimension. Furthermore, we note that all long trades on average earn a profit while short trades lose. This result suggests that the market price of the contract we are analyzing went up during its trading period, hence long trades tend to do well.
In order to provide a benchmark for the performance of tree predictions, the same training data set is also modeled with a simple linear regression, with normalized profit being the dependent variable. Figure 5 compares the prediction results of the tree and the linear regression. The quantization in the linear regression prediction is attributed to the presence of the binary long/short variable. As in the case of tree regression, the linear model picked out this variable as being the most significant in predicting profit (with a coefficient of 0.6 vs. essentially 0 for the rest of the variables).
In terms of overall performance, the tree fares much better, with a normalized mean squared error $E_{NMS}$ ($E_{NMS} = \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$) of 0.53 vs. linear regression's 0.967. We attribute this to the fact that trees handle outliers much more comfortably by considering them in different segments of the model. In contrast, linear regression builds a global model and does not deal well with outliers.
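A minimal sketch of the normalized mean squared error used to compare the models, assuming the usual definition: the sum of squared prediction errors divided by the sum of squared deviations of the observations from their mean. The function name `normalized_mse` is hypothetical.

```python
import numpy as np

def normalized_mse(observed, predicted):
    """Squared prediction error relative to that of always predicting the mean."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = np.sum((observed - predicted) ** 2)
    baseline = np.sum((observed - observed.mean()) ** 2)
    return float(error / baseline)

observed = [1.0, 2.0, 3.0, 6.0]
print(normalized_mse(observed, [3.0, 3.0, 3.0, 3.0]))  # predicting the mean everywhere
print(normalized_mse(observed, observed))              # perfect prediction
```

Under this definition a value of 1 means the model does no better than predicting the sample mean for every trade, so values such as 0.967 or 0.993 indicate almost no explanatory power, while 0.53 means roughly half of the variance is accounted for.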
[Figure 4: regression tree plot; the root split is on the long/short variable (short: true branch, long: false branch), with further splits on Expo, diffPrice and maxPos; the node thresholds, leaf sizes and regression values are not reliably recoverable from the extracted text. Legend: L/S: long or short; Expo: log(exposure in volatility time); diffPrice: price at reversal - price at opening; maxPos: maximum position size.]
Figure 4: A tree plot. Italicized numbers are the sizes of the leaves. Regular numbers at terminal leaves are regression values. Only 4 variables are significant in predicting profit: long/short, log(volatility exposure), reversal and open price difference, and log(maximum position size). Note that on average all long trades earn profit, while all short trades sustain a loss.
4.2 Gaussian Mixtures
Twenty-five clusters were found using clustering. Tables 1 and 2 contain, for each cluster, the mean value of each clustering variable.
Figure 6 shows the distributions of predictions by Gaussian mixture clustering, linear regression and tree. Figure 7 compares the predictions of the three methods. Note that profit was not included in training the Gaussian mixtures; clustering is applied to the input space only. Not surprisingly, out-of-sample performance is lacking (normalized mean squared error $E_{NMS} \approx 0.993$).
The predicted range of profit is quite small (-0.05 to 0.2), although true profit varies much more (-7 to 7). Considering that the unconditional mean profit of the sample is 0.014, this suggests that the clusters model only a small region in the profit space where most trades lie, making it the most sensible area to base predictions upon. Essentially we see no predictive power of Gaussian mixtures.
[Figure 5 plot, titled "predictions of profit/(max position size)": top panel, tree versus linear regression predictions; bottom two panels, tree and linear regression predictions against observed value.]
Figure 5: Prediction performance comparison of normalized profit. The top figure shows predictions of tree versus linear regression. The bottom two figures show tree and linear regression predictions plotted against normalized profit.
5 Conclusions
This paper compares clustering and tree-based approaches to modeling financial data, and demonstrates that understanding data intelligently necessitates the application of different tools.
Both modeling approaches are local, in the sense of partitioning input space into regions of similar qualities. On the one hand, trees provide an interpretable method that is suitable for identifying outliers in the dependent variable space, thus supplying an exploratory data analysis tool and helping discover better formulated hypotheses. However, the lack of full consideration of independent variables is the tree-based method's weakness. The strength of clustering, on the other hand, lies in its probabilistic approach to analyzing input variables. Although clusters are not as interpretable, they reveal hidden structures in the input space, at the same time solving the outlier problem by segmenting outliers away into separate clusters.
It is thus desirable to solve regression problems locally, conditional on the inputs: growing local trees in clusters, or finding local clusters in leaves. Such a blend of supervised and unsupervised learning schemes exists in the hierarchical mixture of experts architecture, the application of which is the natural next step in this project.
[Figure 6 plot, titled "distribution of normalized profit": four histograms with logarithmic vertical axes, for observed value, clustering, linear regression and tree.]
Figure 6: Distributions of normalized profit. Note the tree approximates the range of true value the closest, giving rise to its high value.
References
[1] J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821, 1993.
[Figure 7 plot, titled "observed value, tree and lin reg vs cluster prediction": three panels with cluster prediction on the horizontal axis.]
Figure 7: Out-of-sample performance of Gaussian mixture model compared with tree and linear regression. The three figures are true normalized profit, tree prediction, and linear regression prediction plotted against cluster prediction.
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees (CART). Wadsworth, Pacific Grove, CA, 1984.
[3] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and results. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Knowledge Discovery in Data Bases II, chapter 6, pages 153-180. AAAI Press / The MIT Press, Menlo Park, CA, 1995.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.
[5] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley, New York, 1985.
A Feature Vectors
Trades are characterized by trade-specific, market-specific and previous-trade-specific variables. The trade-specific variables reflect the state of each trade. The exogenous (market-specific) variables characterize the working environment in which a trade takes place. The previous-trade-specific variable makes a first-order Markov assumption: current trades only depend upon immediately preceding trades.
We define notation first. Let $t$ denote the time. Let $t_o$ be the time at which a trade opens, and $t_e$ be the time at which it closes. The sign of a trade, long or short, is +1 or -1 respectively. Let $a_t$ denote the signed transaction size at $t$, and $p_t$ be the transaction price.
An "active minute" is defined to be a minute in which at least one transaction takes place in the full data set. Let $\min_t^{(30)}$ denote the minimum price of the price series of the 30 active minutes ending at $t$. Similarly, $\max_t^{(30)}$ denotes the maximum price of the same price series.
Two time scales are used to obtain some of the variables. Chronological time is measured in minutes. Volatility time is measured by the cumulative sum of squared relative returns since the first transaction in the 9209 contract.
The cumulative sum function of $\xi$, cumsum($\xi$), is
$$\mathrm{cumsum}(\xi_t) = \sum_{i=1}^{t} \xi_i. \qquad (10)$$
When $\xi = a_t$, the cumulative sum becomes the position size at $t$, which is denoted by $s_t$.
Using these definitions, the trade-specific variables for clustering are:
• maximum absolute position size, $\max_{\text{duration of trade}} |s_t|$;
• opening relative to trade maximum, $|s_{t_o}| / \max_{\text{duration of trade}} |s_t|$;
• exposure, the area under position size versus chronological and volatility time;
• sign of trade, long = +1 and short = -1;
• length of trade, $t_e - t_o$, as measured in volatility time.
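The trade-specific variables listed above can be sketched for a single trade as follows. Several choices here are assumptions rather than the paper's definitions: position is measured relative to the account level at which the trade starts (so it returns to zero at the close), exposure uses the absolute position held constant between consecutive transactions, time is chronological minutes only, and the function name `trade_features` and its dictionary keys are hypothetical.

```python
import numpy as np

def trade_features(signed_sizes, minutes):
    """Position-based features of one trade from signed transaction sizes and times."""
    a = np.asarray(signed_sizes, dtype=float)
    t = np.asarray(minutes, dtype=float)
    position = np.cumsum(a)            # position size after each transaction
    abs_position = np.abs(position)
    max_pos = abs_position.max()
    # Exposure (assumption): area under |position| versus time, treating the
    # position as a step function between consecutive transactions.
    exposure = float(np.sum(abs_position[:-1] * np.diff(t)))
    return {
        "max_pos": float(max_pos),
        "rel_open": float(abs_position[0] / max_pos),
        "sign": 1 if position[0] > 0 else -1,
        "length": float(t[-1] - t[0]),
        "exposure": exposure,
    }

# A long trade: buy 2, buy 3, sell 1, sell 4 at minutes 0, 10, 30 and 60.
print(trade_features([2, 3, -1, -4], [0, 10, 30, 60]))
```

Replacing the chronological timestamps with cumulative squared relative returns would give the volatility-time versions of length and exposure.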
The market-specific variables include information on prices, empirical volatilities, and trading volume.
The price is given at the time of opening a trade. It is expressed as the relative location of the current price with respect to the minimum and maximum of the local bar of the last 30 minutes, and is scaled to lie between 1 and -1. (In cases where the denominator is zero, we set the relative price to zero.)
[Equation (11), the relative price in terms of $p_t$ and the 30-active-minute minimum and maximum, is not recoverable from the extracted text.]
In addition, ratios of moving averages of prices are computed,
based on a 60 minute window and a 1440 (one day) window. This
variable is included at the time of opening of a trade.
Volatility and volume at the opening of a trade are computed with an exponential filter capturing information on the time scale of 30 active minutes (volume and volatility) and 3000 minutes (volume), corresponding to one week. The exponential filter for the squared relative return is
$$z_t = (1 - \lambda)\,(\log(p_t) - \log(p_{t-1}))^2 + \lambda z_{t-1}, \qquad (12)$$
where the decay parameter, $\lambda$, is either 0.936 or 0.999, corresponding to 30 or 3000 active minutes respectively, $z_t$ is the filtered series, and volatility $v_t = \sqrt{z_t}$. We analogously filter the volume series.
The set of previous-trade-specific variables currently contains only one member: the profit of the trade immediately preceding the current one within the same account.
Below we give a brief numeric description of the data in a table, conditioned on whether a trade is profitable.
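The exponential filter for volatility described in this appendix can be sketched as a simple recursion over squared log returns. The initialization of the filtered series with the first squared return is an assumption (the paper does not state a starting value), and the function name `filtered_volatility` is hypothetical.

```python
import numpy as np

def filtered_volatility(prices, decay):
    """Exponentially filtered volatility from a price series sampled per active minute."""
    prices = np.asarray(prices, dtype=float)
    squared_returns = np.diff(np.log(prices)) ** 2
    z = np.empty_like(squared_returns)
    previous = squared_returns[0]  # assumption: start the filter at the first squared return
    for i, r2 in enumerate(squared_returns):
        # New filtered value: weighted mix of the current squared return and the old value.
        previous = (1.0 - decay) * r2 + decay * previous
        z[i] = previous
    return np.sqrt(z)

print(filtered_volatility([100.0, 100.5, 100.2, 100.9], decay=0.936))
```

A decay close to 1 gives a long memory, so the 0.999 setting responds far more slowly to a burst of large returns than the 0.936 setting.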
Table 1: Means of the trade-specific variables used in clustering. Clusters are arranged in increasing cluster size N. The column labels are: cluster size N, percent long, log(maximum position size), relative open position, log(chronological exposure of trade), log(chronological exposure), log(volatility exposure), previous trade profit. The last column is profit, not used in clustering. Digits marked "?" are illegible in the extracted text.

cluster     N    %L  maxPos  entPos  lcExpo   lvExpo  lvLen   prevP      P
      1    28  0.43    0.57    0.86    8.72     3.45   1.91    0.00  -0.04
      2    45  0.51    5.17    0.39   14.57  2275.29  14.24  -11.01   0.22
      3    64  0.79    1.62    0.50    4.06     0.09   0.02    0.01   0.03
      4    98  0.6?    0.41    0.89    9.36     5.90   4.51   -0.03  -0.06
      5   103  0.41    5.12    0.39   12.01   222.75   1.71  -24.66  -0.01
      6   108  0.69    0.32    0.94    3.08     0.03   0.02    0.04  -0.03
      7   145  0.53    3.59    0.49   13.05   320.04  14.77    0.21   0.19
      8   366  0.51    3.97    0.66    7.25     3.03   0.08    0.70   0.01
      9   368  0.52    4.09    0.51   11.79    84.14   2.03    1.44   0.02
     10   400  0.52    1.66    0.67   11.27    51.31  15.91    0.01   0.06
     11   447  0.62    0.32    0.93    7.84     1.62   1.26    0.08   0.01
     12   544  0.53    1.42    0.76    6.16     1.16   0.45    0.03   0.02
     13   655  0.50    3.32    0.54   10.38    18.03   1.13    0.54   0.01
     14   761  0.50    0.50    0.91    2.65     0.03   0.02    0.03   0.01
     15   765  0.54    0.79    0.75    9.85    14.22   8.54    0.14   0.16
     16   974  0.50    0.59    0.89    4.07     0.10   0.06    0.06   0.02
     17  1136  0.00    2.31    0.65  -69.07     0.00   0.00    0.02   0.00
     18  1379  0.50    0.62    0.85    8.26     2.46   1.48    0.00   0.06
     19  1644  0.00    0.55    0.94  -69.07     0.00   0.00    0.04   0.00
     20  1801  0.00    0.78    0.91  -69.07     0.00   0.00    0.03   0.00
     21  2042  0.57    0.67    0.88    4.14     0.12   0.07    0.02   0.01
     22  2235  0.52    2.85    0.59    6.16     1.09   0.12    0.13   0.01
     23  3250  0.47    1.88    0.72    6.94     1.62   0.48    0.13   0.01
     24  5792  0.47    0.49    0.91    1.77     0.01   0.00    0.03   0.00
     25  5958  0.43    1.95    0.62    3.57     0.06   0.01    0.05   0.01
Table 2: Means of the market-specific variables used in clustering. Column labels are cluster size N, log(open volatility), open relative price, reversal relative price minus open relative price, open 30 minute market volume, open 3000 minute volume, open moving average price ratio of 60 minutes vs. 1440 minutes.

cluster     N  lentVolat  entPrice  diffPrice  entVol30  entVol3000  entAvg
      1    28      -7.74      0.08       0.68      2022         259    0.99
      2    45      -7.83      0.48       0.00      1604         890    1.00
      3    64      -8.05      0.10       0.47      1249         923    1.00
      4    98      -9.08      0.62      -0.15       753         381    1.00
      5   103      -7.83      0.50       0.04      1754         970    1.00
      6   108      -8.05      0.89      -0.65      1325         745    1.00
      7   145      -7.79      0.55      -0.04      1550         770    1.00
      8   366      -7.75      0.51      -0.01      1850         945    1.00
      9   368      -7.89      0.51       0.00      1555         911    1.00
     10   400      -7.68      0.58      -0.06      1431         666    1.00
     11   447      -8.02      0.90      -0.52      1443         791    1.00
     12   544      -7.57      0.45       0.00       840         138    0.99
     13   655      -7.96      0.49       0.02      1473         887    1.00
     14   761      -7.79      0.44       0.02      1115         458    0.99
     15   765      -7.83      0.60      -0.08      1613         912    1.00
     16   974      -7.98      0.90      -0.71      1375         934    1.00
     17  1136      -7.79      0.54       0.00      1745         936    1.00
     18  1379      -7.98      0.18       0.44      1358         898    1.00
     19  1644      -7.86      0.77       0.00      1521         920    1.00
     20  1801      -7.94      0.20       0.00      1417         836    1.00
     21  2042      -8.02      0.13       0.51      1267         898    1.00
     22  2235      -7.95      0.18       0.31      1459         905    1.00
     23  3259      -7.90      0.86      -0.40      1492         893    1.00
     24  5792      -7.89      0.52      -0.01      1485         929    1.00
     25  5958      -7.88      0.53      -0.01      1522         915    1.00
Table 3: Trade-specific variables, conditioned upon profitability (+: profitable, -: unprofitable). Mean and standard error (in parentheses) are reported. The variables are (from left to right): long or short, maximum position size, relative open position, chronological length (in minutes), chronological exposure, log(volatility length), log(volatility exposure). Digits marked "?" are illegible in the extracted text.

     %L           maxPos        relOpen        cLen       cExpo         lvLen        lvExpo
+    0.56 (0.01)  13.53 (0.18)  0.726 (0.002)  21?? (39)  22044 (1438)  1.12 (0.02)  11.74 (0.82)
-    0.44 (0.01)  14.83 (0.25)  0.721 (0.002)  2281 (40)  24053 (1293)  1.17 (0.02)  12.75 (0.74)
Table 4: Market-specific variables, conditioned upon profitability. The variables are (from left to right): log(open volatility), open price, reversal price, 30 minute open volume, 3000 minute open volume.

     lentVolat       entPrice       revPrice       entVol30  entVol3000
+    -7.881 (0.004)  0.515 (0.002)  0.511 (0.002)  1487 (5)  854 (1)
-    -7.862 (0.004)  0.508 (0.002)  0.503 (0.002)  1503 (6)  859 (1)