
    Consistency of Online Random Forests

    Abstract

As a testament to their success, the theory of random forests has long been outpaced by their application in practice. In this paper, we take a step towards narrowing this gap by providing a consistency result for online random forests.

    1. Introduction

Random forests are a class of ensemble method whose base learners are a collection of randomized tree predictors, which are combined through averaging. The original random forests framework described in Breiman (2001) has been extremely influential (Svetnik et al., 2003; Prasad et al., 2006; Cutler et al., 2007; Shotton et al., 2011; Criminisi et al., 2011).

Despite their extensive use in practical settings, very little is known about the mathematical properties of these algorithms. A recent paper by one of the leading theoretical experts states that

Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm (Biau, 2012).

Theoretical work in this area typically focuses on stylized versions of the random forests algorithms used in practice. For example, Biau et al. (2008) prove the consistency of a variety of ensemble methods built by averaging base classifiers. Two of the models they study are direct simplifications of the forest growing algorithms used in practice; the others are stylized neighbourhood averaging rules, which can be viewed as simplifications of random forests through the lens of Lin & Jeon (2002).

In this paper we make further steps towards narrowing the gap between theory and practice. In particular, we present what is to the best of our knowledge the first consistency result for online random forests.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

We show that the theory provides guidance for designing online random forest algorithms. A few simple experiments with our algorithm confirm the requirements for consistency predicted by the theory. The experiments also highlight some theoretical and practical problems that need to be addressed.

    2. Related Work

Different variants of random forests are distinguished by the methods they use for growing the trees. The model described in Breiman (2001) builds each tree on a bootstrapped sample of the training set using the CART methodology (Breiman et al., 1984). The optimization in each leaf that searches for the optimal split point is restricted to a random selection of features, or linear combinations of features.

The framework of Criminisi et al. (2011) operates slightly differently. Instead of choosing only features at random, this framework chooses entire decisions (i.e. both a feature or combination of features and a threshold together) at random and optimizes only over this set. They also offer a variety of different objectives which can be optimized to split each leaf, depending on the task at hand (e.g. classification vs manifold learning). Unlike the work of Breiman (2001), this framework chooses not to include bagging, preferring instead to train each tree on the entire data set and introduce randomness only in the splitting process. The authors argue that without bagging their model obtains max-margin properties.

In addition to the frameworks mentioned above, many practitioners introduce their own variations on the basic random forests algorithm, tailored to their specific problem domain. A variant from Bosch et al. (2007) is especially similar to the technique we use in this paper: when growing a tree the authors randomly select one third of the training data to determine the structure of the tree and use the remaining two thirds to fit the leaf estimators. However, the authors consider this only as a technique for introducing randomness into the trees, whereas in our model the partitioning of data plays a central role in consistency.

In addition to these offline methods, several researchers have focused on building online versions of random forests. Online models are attractive because they do not require that the entire training set be accessible at once. These models are appropriate for streaming settings where training data is generated over time and should be incorporated into the model as quickly as possible. Several variants of online decision tree models are present in the MOA system of Bifet et al. (2010).

The primary difficulty with building online decision trees is their recursive nature. Data encountered once a split has been made cannot be used to correct earlier decisions. A notable approach to this problem is the Hoeffding tree (Domingos & Hulten, 2000) algorithm, which works by maintaining several candidate splits in each leaf. The quality of each split is estimated online as data arrive in the leaf, but since the entire training set is not available these quality measures are only estimates. The Hoeffding bound is employed in each leaf to control the amount of data which must be collected to ensure that the split chosen on the basis of these estimates is the true best split with high probability. Domingos & Hulten (2000) prove that under reasonable assumptions the online Hoeffding tree converges to the offline tree with high probability. The Hoeffding tree algorithm is implemented in the system of Bifet et al. (2010).
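For concreteness, a minimal Python sketch of the kind of Hoeffding-bound test this family of algorithms relies on; the function names, the choice of gain range, and the decision rule shown here are illustrative assumptions rather than details taken from Domingos & Hulten (2000).

    import math

    def hoeffding_radius(value_range: float, delta: float, n: int) -> float:
        # With probability at least 1 - delta, the empirical mean of n observations
        # of a quantity with range `value_range` lies within this radius of its true mean.
        return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

    def safe_to_split(best_gain: float, second_gain: float,
                      value_range: float, delta: float, n: int) -> bool:
        # Commit to the current best candidate split once its observed advantage over
        # the runner-up exceeds the Hoeffding radius (illustrative decision rule).
        return (best_gain - second_gain) > hoeffding_radius(value_range, delta, n)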

Alternative methods for controlling tree growth in an online setting have also been explored. Saffari et al. (2009) use the online bagging technique of Oza & Russel (2001) and control leaf splitting using two parameters in their online random forest. One parameter specifies the minimum number of data points which must be seen in a leaf before it can be split, and another specifies a minimum quality threshold that the best split in a leaf must reach. This is similar in flavor to the technique used by Hoeffding trees, but trades theoretical guarantees for more interpretable parameters.

One active avenue of research in online random forests involves tracking non-stationary distributions, also known as concept drift. Many of the online techniques incorporate features designed for this problem (Gama et al., 2005; Abdulsalam, 2008; Saffari et al., 2009; Bifet et al., 2009; 2012). However, tracking of non-stationarity is beyond the scope of this paper.

The most well known theoretical result for random forests is that of Breiman (2001), which gives an upper bound on the generalization error of the forest in terms of the correlation and strength of trees. Following Breiman (2001), an interpretation of random forests as an adaptive neighborhood weighting scheme was published in Lin & Jeon (2002). This was followed by the first consistency result in this area from Breiman (2004), which proves consistency of a simplified model of the random forests used in practice. In the context of quantile regression the consistency of a certain model of random forests has been shown by Meinshausen (2006). A model of random forests for survival analysis was shown to be consistent in Ishwaran & Kogalur (2010).

Significant recent work in this direction comes from Biau et al. (2008), who prove the consistency of a variety of ensemble methods built by averaging base classifiers, as is done in random forests. A key feature of the consistency of the tree construction algorithms they present is a proposition that states that if the base classifier is consistent then the forest, which takes a majority vote of these classifiers, is itself consistent.

The most recent theoretical study, and the one which achieves the closest match between theory and practice, is that of Biau (2012). The most significant way in which their model differs from practice is that it requires a second data set which is not used to fit the leaf predictors in order to make decisions about variable importance when growing the trees. One of the innovations of the model we present in this paper is a way to circumvent this limitation in an online setting while maintaining consistency.

    3. Random Forests

In this section we briefly review the random forests framework. For a more comprehensive review we refer the reader to Breiman (2001) and Criminisi et al. (2011).

Random forests are built by combining the predictions of several trees, each of which is trained in isolation. Unlike in boosting (Schapire & Freund, 2012), where the base classifiers are trained and combined using a sophisticated weighting scheme, typically the trees are trained independently and the predictions of the trees are combined through a simple majority vote.

There are three main choices to be made when constructing a random tree. These are (1) the method for splitting the leafs, (2) the type of predictor to use in each leaf, and (3) the method for injecting randomness into the trees.

Specifying a method for splitting leafs requires selecting the shapes of candidate splits as well as a method for evaluating the quality of each candidate. Typical choices here are to use axis aligned splits, where data are routed to sub-trees depending on whether or not they exceed a threshold value in a chosen dimension; or linear splits, where a linear combination of features is thresholded to make a decision. The threshold value in either case can be chosen randomly or by optimizing a function of the data in the leafs.

Figure 1. Three potential splits for a leaf node and the class histograms for the children each split would create. The rightmost split creates the purest children and will have the greatest information gain.

In order to split a leaf, a collection of candidate splits is generated and a criterion is evaluated to choose between them. A simple strategy is to choose among the candidates uniformly at random, as in the models analyzed in Biau et al. (2008). A more common approach is to choose the candidate split which optimizes a purity function over the leafs that would be created. Typical choices here are to maximize the information gain, or the Gini gain (Hastie et al., 2013). This situation is illustrated in Figure 1.

The most common choice for predictors in each leaf is to use the majority vote over the training points which fall in that leaf. Criminisi et al. (2011) explore the use of several different leaf predictors for regression and manifold learning, but these generalizations are beyond the scope of this paper. We consider majority vote classifiers in our model.

Injecting randomness into the tree construction can happen in many ways. The choice of which dimensions to use as split candidates at each leaf can be randomized, as well as the choice of coefficients for random combinations of features. In either case, thresholds can be chosen either randomly or by optimization over some or all of the data in the leaf.

Another common method for introducing randomness is to build each tree using a bootstrapped or sub-sampled data set. In this way, each tree in the forest is trained on slightly different data, which introduces differences between the trees.

4. Online Random Forests with Stream Partitioning

In this section we describe the workings of our online random forest algorithm. A more precise (pseudo-code) description of the training procedure can be found in Appendix A.

    4.1. Forest Construction

The random forest classifier is constructed by building a collection of random tree classifiers in parallel. Each tree is built independently and in isolation from the other trees in the forest. Unlike many other random forest algorithms we do not perform bootstrapping or subsampling at this level; however, the individual trees each have their own optional mechanism for subsampling the data they receive.

    4.2. Tree Construction

Each node of the tree is associated with a rectangular subset of R^D, and at each step of the construction the collection of cells associated with the leafs of the tree forms a partition of R^D. The root of the tree is R^D itself. At each step we receive a data point (X_t, Y_t) from the environment. Each point is assigned to one of two possible streams at random with fixed probability. We denote stream membership with the variable I_t ∈ {s, e}. How the tree is updated at each time step depends on which stream the corresponding data point is assigned to.
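A minimal Python sketch of this stream assignment; the 50/50 split probability and the function name are placeholders chosen for illustration, since the actual probability is a parameter of the algorithm.

    import random

    def assign_stream(p_structure: float = 0.5) -> str:
        # Tag an incoming point as a structure point ('s') with fixed probability
        # p_structure, otherwise as an estimation point ('e'). The tag is drawn
        # once per point and never changes.
        return "s" if random.random() < p_structure else "e"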

We refer to the two streams as the structure stream and the estimation stream; points assigned to these streams are structure and estimation points, respectively. These names reflect the different uses of the two streams in the construction of the tree:

Structure points are allowed to influence the structure of the tree partition, i.e. the locations of candidate split points and the statistics used to choose between candidates, but they are not permitted to influence the predictions that are made in each leaf of the tree.

Estimation points are not permitted to influence the shape of the tree partition, but can be used to estimate class membership probabilities in whichever leaf they are assigned to.

Only two streams are needed to build a consistent forest, but there is no reason we cannot have more. For instance, we explored the use of a third stream for points that the tree should ignore completely, which gives a form of online sub-sampling in each tree. We found empirically that including this third stream hurts performance of the algorithm, but its presence or absence does not affect the theoretical properties.

    4.3. Leaf Splitting Mechanism

When a leaf is created the number of candidate split dimensions for the new leaf is set to min(1 + Poisson(λ), D), and this many distinct candidate dimensions are selected uniformly at random. We then collect m candidate splits in each candidate dimension (m is a parameter of the algorithm) by projecting the first m structure points to arrive in the newly created leaf onto the candidate dimensions. We maintain several structural statistics for each candidate split. Specifically, for each candidate split we maintain class histograms for each of the new leafs it would create, using data from the estimation stream. We also maintain structural statistics, computed from data in the structure stream, which can be used to choose between the candidate splits. The specific form of the structural statistics does not affect the consistency of our model, but it is crucial that they depend only on data in the structure stream.

Finally, we require two additional conditions which control when a leaf at depth d is split:

1. Before a candidate split can be chosen, the class histograms in each of the leafs it would create must incorporate information from at least α(d) estimation points.

2. If any leaf receives more than β(d) estimation points, and the previous condition is satisfied for any candidate split in that leaf, then when the next structure point arrives in this leaf it must be split regardless of the state of the structural statistics.

The first condition ensures that leafs are not split too often, and the second condition ensures that no branch of the tree ever stops growing completely. In order to ensure consistency we require that α(d) → ∞ monotonically in d. We also require that α(d) ≤ β(d) for convenience.

When a structure point arrives in a leaf, if the first condition is satisfied for some candidate split then the leaf may optionally be split at the corresponding point. The decision of whether to split the leaf or wait to collect more data is made on the basis of the structural statistics collected for the candidate splits in that leaf.
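A minimal Python sketch of how these two conditions might be checked for a leaf at depth d; the counters and data classes below are illustrative placeholders, not the paper's implementation, and alpha_d and beta_d stand for α(d) and β(d).

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class CandidateSplit:
        left_estimation_count: int = 0    # estimation points seen by the left candidate child
        right_estimation_count: int = 0   # estimation points seen by the right candidate child

    @dataclass
    class Leaf:
        estimation_count: int = 0
        candidate_splits: List[CandidateSplit] = field(default_factory=list)

    def condition_one_holds(split: CandidateSplit, alpha_d: int) -> bool:
        # Condition 1: both candidate children have at least alpha(d) estimation points.
        return (split.left_estimation_count >= alpha_d
                and split.right_estimation_count >= alpha_d)

    def must_split(leaf: Leaf, alpha_d: int, beta_d: int) -> bool:
        # Condition 2: more than beta(d) estimation points have reached the leaf and
        # condition 1 holds for some candidate split, so the next structure point
        # forces a split.
        return (leaf.estimation_count > beta_d
                and any(condition_one_holds(s, alpha_d) for s in leaf.candidate_splits))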

    4.4. Structural Statistics

In each candidate child we maintain an estimate of the posterior probability of each class, as well as the total number of points we have seen fall in the candidate child, both counted from the structure stream. In order to decide if a leaf should be split, we compute the information gain for each candidate split which satisfies condition 1 from the previous section,

$$I(S) = H(A) - \frac{|A'|}{|A|} H(A') - \frac{|A''|}{|A|} H(A'')\,.$$

Here S is the candidate split, A is the cell belonging to the leaf to be split, and A' and A'' are the two leafs that would be created if A were split at S. The function H(A) is the discrete entropy, computed over the labels of the structure points which fall in the cell A.

We select the candidate split with the largest information gain for splitting, provided this split achieves a minimum threshold in information gain, τ. The value of τ is a parameter of our algorithm.
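A minimal Python sketch of this selection rule, computing I(S) from the structure-stream class histograms of the two candidate children; the dictionary-based histogram representation and the helper names are assumptions made for illustration.

    import math
    from typing import Dict, List, Optional, Tuple

    def entropy(histogram: Dict[int, int]) -> float:
        # Discrete entropy of a class histogram (label -> count).
        total = sum(histogram.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total) for c in histogram.values() if c > 0)

    def information_gain(left: Dict[int, int], right: Dict[int, int]) -> float:
        # I(S) = H(A) - |A'|/|A| H(A') - |A''|/|A| H(A''), using structure-stream counts.
        parent = {k: left.get(k, 0) + right.get(k, 0) for k in set(left) | set(right)}
        n_left, n_right = sum(left.values()), sum(right.values())
        n = n_left + n_right
        if n == 0:
            return 0.0
        return entropy(parent) - (n_left / n) * entropy(left) - (n_right / n) * entropy(right)

    def select_split(candidates: List[Tuple[Dict[int, int], Dict[int, int]]],
                     tau: float) -> Optional[int]:
        # Return the index of the candidate split with the largest gain, provided it
        # exceeds the threshold tau; otherwise return None and keep collecting data.
        if not candidates:
            return None
        gains = [information_gain(left, right) for left, right in candidates]
        best = max(range(len(candidates)), key=lambda i: gains[i])
        return best if gains[best] > tau else None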

    4.5. Prediction

At any time the online forest can be used to make predictions for unlabelled data points using the model built from the labelled data it has seen so far. To make a prediction for a query point x at time t, each tree computes, for each class k,

$$\eta^k_t(x) = \frac{1}{N^e(A_t(x))} \sum_{\substack{(X_\tau, Y_\tau) \in A_t(x) \\ I_\tau = e}} \mathbb{I}\{Y_\tau = k\}\,,$$

where A_t(x) denotes the leaf of the tree containing x at time t, and N^e(A_t(x)) is the number of estimation points which have been counted in A_t(x) during its lifetime. Similarly, the sum is over the labels of these points. The tree prediction is then the class which maximizes this value:

$$g_t(x) = \arg\max_k \{\eta^k_t(x)\}\,.$$

The forest predicts the class which receives the most votes from the individual trees.
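A minimal Python sketch of this prediction rule, using a per-leaf estimation-stream histogram and a majority vote over trees; the data structures are illustrative.

    from collections import Counter
    from typing import Dict, List

    def tree_predict(leaf_estimation_histogram: Dict[int, int]) -> int:
        # eta^k_t(x) is proportional to the estimation-stream count of class k in the
        # leaf containing x, so the tree predicts the class with the largest count
        # (ties broken in favour of the smaller class index).
        if not leaf_estimation_histogram:
            return 0  # arbitrary default for a leaf with no estimation points yet
        return max(sorted(leaf_estimation_histogram),
                   key=lambda k: leaf_estimation_histogram[k])

    def forest_predict(tree_votes: List[int]) -> int:
        # The forest predicts the class which receives the most votes from the trees.
        counts = Counter(tree_votes)
        return max(sorted(counts), key=lambda k: counts[k])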

Note that this requires that we maintain class histograms from both the structure and estimation streams separately for each candidate child in the fringe of the tree. The counts from the structure stream are used to select between candidate split points, and the counts from the estimation stream are used to initialize the parameters in the newly created leafs after a split is made.

    4.6. Memory Management

The typical approach to building trees online, which is employed in Domingos & Hulten (2000) and Saffari et al. (2009), is to maintain a fringe of candidate children in each leaf of the tree. The algorithm collects statistics in each of these candidate children until some (algorithm dependent) criterion is met, at which point a pair of candidate children is selected to replace their parent. The selected children become leafs in the new tree, acquiring their own candidate children, and the process repeats. Our algorithm also uses this approach.

The difficulty here is that the trees must be grown breadth first, and maintaining the fringe of potential children is very memory intensive when the trees are large. Our algorithm also suffers from this deficiency, as maintaining the fringe requires O(cmd) statistics in each leaf, where d is the number of candidate split dimensions, m is the number of candidate split points (i.e. md pairs of candidate children per leaf) and c is the number of classes in the problem. The number of leafs grows exponentially fast with tree depth, meaning that for deep trees this memory cost becomes prohibitive.

Offline forests do not suffer from this problem, because they are able to grow the trees depth first. Since they do not need to accumulate statistics for more than one leaf at a time, the cost of computing even several megabytes of statistics per split is negligible. Although the size of the trees still grows exponentially with depth, this memory cost is dwarfed by the savings from not needing to store split statistics for all the leafs.

In practice the memory problem is resolved either by growing small trees, as in Saffari et al. (2009), or by bounding the number of nodes in the fringe of the tree, as in Domingos & Hulten (2000). Other models of streaming random forests, such as those discussed in Abdulsalam (2008), build trees in sequence instead of in parallel, which reduces the total memory usage.

Our algorithm makes use of a bounded fringe and adopts the technique of Domingos & Hulten (2000) to control the policy for adding and removing leafs from the fringe.

In each tree we partition the leafs into two sets: we have a set of active leafs, for which we collect split statistics as described in earlier sections, and a set of inactive leafs for which we store only two numbers. We call the set of active leafs the fringe of the tree, and describe a policy for controlling how inactive leafs are added to the fringe.

In each inactive leaf A_t we store the following two quantities:

p̂(A_t), which is an estimate of p(A_t) = P(X ∈ A_t), and

ê(A_t), which is an estimate of e(A_t) = P(g_t(X) ≠ Y | X ∈ A_t).

Both of these are estimated based on the estimation points which arrive in A_t during its lifetime. From these two numbers we form the statistic ŝ(A) = p̂(A)ê(A) (with corresponding true value s(A) = p(A)e(A)), which is an upper bound on the improvement in error rate that can be obtained by splitting A.

Membership in the fringe is controlled by ŝ(A). When a leaf is split it relinquishes its place in the fringe and the inactive leaf with the largest value of ŝ(A) is chosen to take its place. The newly created leafs from the split are initially inactive and must compete with the other inactive leafs for entry into the fringe.
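A minimal Python sketch of this fringe policy; the leaf fields and the promotion function are illustrative, with s_hat standing for ŝ(A) = p̂(A)ê(A).

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class InactiveLeaf:
        p_hat: float  # estimate of P(X falls in this leaf)
        e_hat: float  # estimate of the error rate of the leaf's current prediction

        @property
        def s_hat(self) -> float:
            # Upper bound on the improvement in error rate obtainable by splitting.
            return self.p_hat * self.e_hat

    def promote_to_fringe(inactive: List[InactiveLeaf]) -> InactiveLeaf:
        # When a split frees a slot in the fringe, activate the inactive leaf with
        # the largest s_hat; newly created leafs start inactive and must compete.
        return max(inactive, key=lambda leaf: leaf.s_hat)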

Unlike Domingos & Hulten (2000), who use this technique only as a heuristic for managing memory use, we incorporate the memory management directly into our analysis. The analysis in Appendix B shows that our algorithm, including a limited size fringe, is consistent.

    5. Theory

In this section we state our main theoretical results and give an outline of the strategy for establishing consistency of our online random forest algorithm. In the interest of space and clarity we do not include proofs in this section. Unless otherwise noted, the proofs of all claims appear in Appendix B.

We denote the classifier created by our online random forest algorithm from t data points as g_t. As t varies we obtain a sequence of classifiers, and we are interested in showing that the sequence {g_t} is consistent, or more precisely that the probability of error of g_t converges in probability to the Bayes risk, i.e.

$$L(g_t) = P(g_t(X, Z) \neq Y \mid D_t) \to L^*,$$

as t → ∞. Here (X, Y) is a random test point and Z denotes the randomness in the tree construction algorithm. D_t is the training set (of size t) and the probability in the convergence is over the random selection of D_t. The Bayes risk is the probability of error of the Bayes classifier, which is the classifier that makes predictions by choosing the class with the highest posterior probability,

$$g^*(x) = \arg\max_k P(Y = k \mid X = x)\,,$$

(where ties are broken in favour of the smaller index). The Bayes risk L(g^*) = L^* is the minimum achievable risk of any classifier for the distribution of (X, Y). In order to ease notation, we drop the explicit dependence on D_t in the remainder of this paper. More information about this setting can be found in Devroye et al. (1996).

    Our main result is the following theorem:

Theorem 1. Suppose the distribution of X has a density with respect to the Lebesgue measure and that this density is bounded from above and below. Then the online random forest classifier described in this paper is consistent.

The first step in proving Theorem 1 is to show that the consistency of a voting classifier, such as a random forest, follows from the consistency of the base classifiers. We prove the following proposition, which is a straightforward generalization of a proposition from Biau et al. (2008), who prove the same result for two class ensembles.

Proposition 2. Assume that the sequence {g_t} of randomized classifiers is consistent for a certain distribution of (X, Y). Then the voting classifier g^(M)_t, obtained by taking the majority vote over M (not necessarily independent) copies of g_t, is also consistent.

With Proposition 2 established, the remainder of the effort goes into proving the consistency of our tree construction.

The first step is to separate the stream splitting randomness from the remaining randomness in the tree construction. We show that if a classifier is conditionally consistent based on the outcome of some random variable, and the sampling process for this random variable generates acceptable values with probability 1, then the resulting classifier is unconditionally consistent.

Proposition 3. Suppose {g_t} is a sequence of classifiers whose probability of error converges conditionally in probability to the Bayes risk L* for a specified distribution on (X, Y), i.e.

$$P(g_t(X, Z, I) \neq Y \mid I) \to L^*$$

for all I ∈ ℐ, and that ν is a distribution on I. If ν(ℐ) = 1 then the probability of error converges unconditionally in probability, i.e.

$$P(g_t(X, Z, I) \neq Y) \to L^*\,.$$

In particular, {g_t} is consistent for the specified distribution.

Proposition 3 allows us to condition on the random variables {I_t} (t = 1, 2, . . .) which partition the data stream into structure and estimation points in each tree. Provided that the random partitioning process produces acceptable sequences with probability 1, it is sufficient to show that the random tree classifier is consistent conditioned on such a sequence. In particular, in the remainder of the argument we assume that {I_t} is a fixed, deterministic sequence which assigns infinitely many points to each of the structure and estimation streams. We refer to such a sequence as a partitioning sequence.

Figure 2. The dependency structure of our algorithm. S represents the randomness in the structure of the tree partition, E represents the randomness in the leaf estimators and I represents the randomness in the partitioning of the data stream. E and S are independent conditioned on I.

The reason this is useful is that conditioning on a partitioning sequence breaks the dependence between the structure of the tree partition and the estimators in the leafs. This is a powerful tool because it gives us access to a class of consistency theorems which rely on this type of independence. However, before we are able to apply these theorems we must further reduce our problem to proving the consistency of estimators of the posterior distribution of each class.

Proposition 4. Suppose we have regression estimates η^k_t(x) for each class posterior η^k(x) = P(Y = k | X = x), and that these estimates are each consistent. The classifier

$$g_t(x) = \arg\max_k \{\eta^k_t(x)\}$$

(where ties are broken in favour of the smaller index) is consistent for the corresponding multiclass classification problem.

Proposition 4 allows us to reduce the consistency of the multiclass classifier to the problem of proving the consistency of several two class posterior estimates. Given a set of classes {1, . . . , c} we can re-assign the labels using the map (X, Y) → (X, I{Y = k}) for any k ∈ {1, . . . , c} in order to get a two class problem where P(Y = 1 | X = x) in this new problem is equal to η^k(x) in the original multiclass problem. Thus to prove consistency of the multiclass classifier it is enough to show that each of these two class posteriors is consistent. To this end we make use of the following theorem from Devroye et al. (1996).

Theorem 5. Consider a partitioning classification rule which builds a prediction η_t(x) of η(x) = P(Y = 1 | X = x) by averaging the labels in each cell of the partition. If the labels of the voting points do not influence the structure of the partition then

$$\mathbb{E}\left[ |\eta_t(x) - \eta(x)| \right] \to 0$$

provided that

1. diam(A_t(X)) → 0 in probability,

2. N^e(A_t(X)) → ∞ in probability.

    Proof. See Theorem 6.1 in Devroye et al. (1996).

Here A_t(X) refers to the cell of the tree partition containing a random test point X, and diam(A) indicates the diameter of the set A. The diameter is defined as the maximum distance between any two points falling in A,

$$\mathrm{diam}(A) = \sup_{x, y \in A} \|x - y\|\,.$$

The quantity N^e(A_t(X)) is the number of points contributing to the estimation of the posterior at X.

This theorem places two requirements on the cells of the partition. The first condition ensures that the cells are sufficiently small that small details of the posterior distribution can be represented. The second condition requires that the cells be large enough that we are able to obtain high quality estimates of the posterior probability in each cell.

The leaf splitting mechanism described in Section 4.3 ensures that the second condition of Theorem 5 is satisfied. However, showing that our algorithm satisfies the first condition requires significantly more work. The chief difficulty lies in showing that every leaf of the tree will be split infinitely often in probability. Once this claim is established a relatively straightforward calculation shows that the expected size of each dimension of a leaf is reduced each time it is split.

So far we have described the approach to proving consistency of our algorithm with an unbounded fringe. If the tree is small (i.e. never has more leafs than the maximum fringe size) then the analysis is unchanged. However, since our trees are required to grow to unbounded size this is not possible. To handle this case we derive an upper bound on the time required for an inactive leaf to enter the fringe. Once the leaf enters the fringe it remains there until it is split, and the analysis from the unbounded fringe case applies.

These details are somewhat technical, so we refer the interested reader to Appendix B for more information, as well as the proofs of the propositions stated in this section.

    6. Experiments

In this section we demonstrate some empirical results on simple problems in order to illustrate the properties of our algorithm. We also provide a comparison to an existing online random forest algorithm. Following the review process we plan to release code to reproduce all of the experiments in this section.


Figure 3. Prediction accuracy of the forest and the trees it averages on a simple mixture of Gaussians problem. The horizontal line shows the accuracy of the Bayes classifier on this problem. We see that the accuracy of the forest consistently dominates the expected accuracy of the trees. The forest in this example contains 100 trees. Error regions show one standard deviation computed over 10 runs.

    6.1. Advantage of a Forest

Our first experiment demonstrates that although the individual trees are consistent classifiers, empirically the performance of the forest is significantly better than each of the trees for problems with finite data. We demonstrate this on a synthetic five class mixture of Gaussians problem with significant class overlap and variation in prior weights.

From Figure 3 it is clear that the forest converges much more quickly than the individual trees. Result profiles of this kind are common in the boosting and random forests literature; however, in practice one often uses inconsistent base classifiers in the ensemble (e.g. boosting with decision stumps or random forests where the trees are grown to full size). This experiment demonstrates that although our base classifiers provably converge, empirically there is still a benefit from averaging in finite time.

    6.2. Growing leaves

Our next experiment demonstrates the importance of the condition that α(d) → ∞, i.e. having the number of data points in each leaf grow over time. We demonstrate this using a synthetic two class distribution specifically designed to exhibit problems when α(d) does not grow.

Figure 4. Excess error above the Bayes risk for a simple synthetic problem. The solid line shows the excess error for a forest where each tree is built to full depth (α(d) = 1). The dashed line shows a forest where each tree requires 2d examples in a leaf at level d in order to split (α(d) = 2d). Both forests contain 100 trees.

In the distribution we construct, P(X = x) is uniform on the unit square in R^2, and the posterior P(Y = 1 | X = x) = 0.5001 for all x. Figure 4 shows the excess error of two forests trained on several data sets of different sizes sampled from this distribution. In one of the forests the trees are grown to full depth, while in the other we force the size of the leafs to increase with their depth in the tree.
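A minimal Python sketch of a generator for this distribution (uniform X on the unit square, constant posterior 0.5001); the sample size argument is a placeholder.

    import random
    from typing import List, Tuple

    def sample_synthetic(n: int, p_one: float = 0.5001) -> List[Tuple[Tuple[float, float], int]]:
        # X is uniform on [0, 1]^2 and the label is 1 with probability p_one
        # independently of X, so the Bayes classifier predicts class 1 everywhere
        # and the Bayes risk is 1 - p_one.
        data = []
        for _ in range(n):
            x = (random.random(), random.random())
            y = 1 if random.random() < p_one else 0
            data.append((x, y))
        return data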

As can be seen in Figure 4, building trees to full depth prevents the forest from making progress towards the Bayes error over a huge range of data set sizes, whereas the forest composed of trees with growing leafs steadily decreases its excess error.

Admittedly, this scenario is quite artificial, and it can be difficult to find real problems where the difference is so pronounced. It is still an open question as to whether a forest can be made consistent by averaging over an infinite number of trees of full depth (although see Breiman (2004) and Biau (2012) for results in this direction). The purpose of this example is to show that in the common scenario where the number of trees is a fixed parameter of the algorithm, having leafs that grow over time is important.


Figure 5. Comparison between offline random forests and our online algorithm on the USPS data set. The online forest uses 10 passes through the data set. The third line is our implementation of the algorithm from Saffari et al. (2009); the performance shown here is identical to what they report. Error regions show one standard deviation computed over 10 runs.

    6.3. Comparison to Offline

In our third experiment, we demonstrate that our online algorithm is able to achieve similar performance to an offline implementation of random forests and also compare to an existing online random forests algorithm on a small non-synthetic problem.

In particular, we demonstrate this on the USPS data set from the LibSVM repository (Chang & Lin, 2011).

We have chosen the USPS data for this experiment because it allows us to compare our results directly to those of Saffari et al. (2009), whose algorithm is very similar to our own. In the interest of comparability we also use a forest of 100 trees and set the minimum information gain threshold (τ in our model) to 0.1. We show results from both online algorithms with 10 passes through the data.

Figure 5 shows that we are able to achieve performance very similar to the offline random forest on the full data. The performance we achieve is identical to the performance reported by Saffari et al. (2009) on this data set.

    6.4. Kinect application

For our final experiment we evaluate our online random forest algorithm on the challenging computer vision problem of predicting human body part labels from a depth image. Our procedure closely follows the work of Shotton et al. (2011) which is used in the commercially successful Kinect system. Applying the same approach as Shotton et al. (2011), our online classifier predicts the body part label of a single pixel P in a depth image. To predict all the labels of a depth image, the classifier is applied to every pixel in parallel.

Figure 6. Left: Depth, ground truth body parts and predicted body parts. Right: A candidate feature specified by two offsets.

For our dataset, we generate pairs of 640x480 resolution depth and body part images by rendering random poses from the CMU mocap dataset. The 19 body parts and one background class are represented by 20 unique color identifiers in the body part image. Figure 6 (left) visualizes the raw depth image, ground truth body part labels and body parts predicted by our classifier for one pose. During training, we sample 50 pixels without replacement for each body part class from each pose, thus producing 1000 data points for each depth image. During testing we evaluate the prediction accuracy of all non background pixels, as this provides a more informative accuracy metric since most of the pixels are background and are relatively easy to predict. For this experiment we use a stream of 1000 poses for training and 500 poses for testing.

Each node of each decision tree computes the depth difference between two pixels described by two offsets from P (the pixel being classified). At training time, candidate pairs of offsets are sampled from a 2-dimensional Gaussian distribution with variance 75.0. The offsets are scaled by the depth of the pixel P to produce depth invariant features. Figure 6 (right) visualizes a candidate feature for the pixel in the green box. The resulting feature value is the depth difference between the pixel in the red box and the pixel in the white box.
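A minimal Python sketch of this depth-difference feature; the array layout, bounds handling and function signature are assumptions for illustration, not taken from Shotton et al. (2011).

    import numpy as np

    def depth_feature(depth: np.ndarray, pixel: tuple, offset_u: tuple, offset_v: tuple) -> float:
        # Depth difference between two probe pixels whose offsets from `pixel` are
        # scaled by the depth at `pixel`, which makes the feature depth invariant.
        row, col = pixel
        d = float(depth[row, col])

        def probe(offset):
            r = int(round(row + offset[0] / d))
            c = int(round(col + offset[1] / d))
            r = min(max(r, 0), depth.shape[0] - 1)  # clamp to the image bounds
            c = min(max(c, 0), depth.shape[1] - 1)
            return float(depth[r, c])

        return probe(offset_u) - probe(offset_v)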

Figure 7. Comparison of our online algorithm with Saffari et al. (2009) on the Kinect application; our algorithm does significantly better with less memory.

In this experiment we construct a forest of 25 trees with 2000 candidate offsets (λ), 10 candidate splits (m) and a minimum information gain of 0.01 (τ). For Saffari et al. (2009) we set the number of sample points required to split to 10, and for our own algorithm we set α(d) = 10 · (1.01^d) and β(d) = 4 · α(d). With this parameter setting each active leaf stores 20 × 10 × 2000 = 400,000 statistics, which requires 1.6MB of memory. By limiting the fringe to 1000 active leaves our algorithm requires 1.6GB of memory for leaf statistics. To limit the maximum memory used by Saffari et al. (2009) we set the maximum depth to 8, which uses up to 25 × 2^8 = 6400 active leaves and requires up to 10GB of memory for leaf statistics.
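A small Python check of this memory accounting, following the O(cmd) count from Section 4.6 and assuming 4 bytes per stored statistic (the byte size is our assumption, not stated in the paper).

    c_classes, m_splits, d_offsets = 20, 10, 2000      # c, m and candidate offsets per leaf
    stats_per_leaf = c_classes * m_splits * d_offsets  # O(cmd) = 400,000 statistics
    bytes_per_leaf = stats_per_leaf * 4                # about 1.6 MB per active leaf
    fringe_bytes = 1000 * bytes_per_leaf               # about 1.6 GB for a 1000-leaf fringe
    capped_leaves = 25 * 2 ** 8                        # 25 trees at depth 8 -> 6400 leaves
    capped_bytes = capped_leaves * bytes_per_leaf      # about 10 GB for the depth-capped baseline
    print(stats_per_leaf, bytes_per_leaf, fringe_bytes, capped_leaves, capped_bytes)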

Figure 7 shows that our algorithm achieves significantly better accuracy while requiring less memory. However, our algorithm does not do as well when seeing a small number of data points. This is likely a result of separating data points into structure and estimation streams and not including all leaves in the active set.

    7. Discussion and Future Work

In this paper we described an algorithm for building online random forests and showed that our algorithm is consistent. To the best of our knowledge this is the first consistency result for online random forests.

The theory guides certain choices made when designing our algorithm, notably that it is necessary for the leafs in each tree to increase in size over time. Our experiments on simple problems confirm that this requirement is important.

Growing trees online in the obvious way requires large amounts of memory, since the trees must be grown breadth first and each leaf must store a large number of statistics related to its potential children. We incorporated a memory management technique from Domingos & Hulten (2000) in order to limit the number of leafs in the fringe of the tree. This refinement is important, since it enables our algorithm to grow large trees. The analysis shows that our algorithm is still consistent with this refinement.

Finally, our current algorithm is restricted to axis aligned splits. Many implementations of random forests use more elaborate split shapes, such as random linear or quadratic combinations of features. These strategies can be highly effective in practice, especially in sparse or high dimensional settings. Understanding how to maintain consistency in these settings is another potentially interesting direction of inquiry.


    References

H. Abdulsalam. Streaming Random Forests. PhD thesis, Queen's University, 2008.

G. Biau. Analysis of a Random Forests model. JMLR, 13:1063-1095, 2012.

G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. JMLR, 9:2015-2033, 2008.

A. Bifet, G. Holmes, and B. Pfahringer. MOA: Massive Online Analysis, a framework for stream classification and clustering. In Workshop on Applications of Pattern Analysis, pp. 3-16, 2010.

A. Bifet, E. Frank, G. Holmes, and B. Pfahringer. Ensembles of Restricted Hoeffding Trees. ACM Transactions on Intelligent Systems and Technology, 3(2):1-20, February 2012.

A. Bifet, G. Holmes, and B. Pfahringer. New ensemble methods for evolving data streams. In ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, 2009.

A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In International Conference on Computer Vision, pp. 1-8, 2007.

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

L. Breiman. Consistency for a Simple Model of Random Forests. Technical report, University of California at Berkeley, 2004.

L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. CRC Press LLC, Boca Raton, Florida, 1984.

C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

G. Cormode. Sketch techniques for approximate query processing. Synopses for Approximate Query Processing: Samples, Histograms, Wavelets and Sketches, Foundations and Trends in Databases, 2011.

G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58-75, April 2005.

A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81-227, 2011.

D. Cutler, T. Edwards, and K. Beard. Random forests for classification in ecology. Ecology, 88(11):2783-92, November 2007.

L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, USA, 1996.

P. Domingos and G. Hulten. Mining high-speed data streams. In International Conference on Knowledge Discovery and Data Mining, pp. 71-80. ACM, 2000.

J. Gama, P. Medas, and P. Rodrigues. Learning decision trees from dynamic data streams. In ACM Symposium on Applied Computing, SAC '05, pp. 573-577, New York, NY, USA, 2005. ACM.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 10 edition, 2013.

H. Ishwaran and U. Kogalur. Consistency of random survival forests. Statistics and Probability Letters, 80:1056-1064, 2010.

Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Technical Report 1055, University of Wisconsin, 2002.

N. Meinshausen. Quantile regression forests. JMLR, 7:983-999, 2006.

N. Oza and S. Russel. Online Bagging and Boosting. In Artificial Intelligence and Statistics, volume 3, 2001.

A. Prasad, L. Iverson, and A. Liaw. Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction. Ecosystems, 9(2):181-199, March 2006. ISSN 1432-9840.

A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof. On-line random forests. In International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1393-1400. IEEE, 2009.

R. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, Cambridge, Massachusetts, 2012.

J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. CVPR, pp. 1297-1304, 2011.

V. Svetnik, A. Liaw, C. Tong, J. Culberson, R. Sheridan, and B. Feuston. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947-58, 2003.


    A. Algorithm pseudo-code

Candidate split dimension: A dimension along which a split may be made. Each leaf selects min(1 + Poisson(λ), D) of these when it is created.

Candidate split point: One of the first m structure points to arrive in a leaf.

Candidate split: A combination of a candidate split dimension and a position along that dimension to split. These are formed by projecting each candidate split point into each candidate split dimension.

Candidate children: Each candidate split in a leaf induces two candidate children for that leaf. These are also referred to as the left and right child of that split.

N^e(A) is a count of estimation points in the cell A, and Y^e(A) is the histogram of labels of these points in A.

N^s(A) is a count of structure points in the cell A, and Y^s(A) is the histogram of labels of these points in A.
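A minimal Python sketch of the per-leaf bookkeeping these symbols describe, with each histogram stored as a label-to-count mapping; the container itself is illustrative.

    from collections import Counter
    from dataclasses import dataclass, field

    @dataclass
    class LeafStatistics:
        n_estimation: int = 0                                    # N^e(A)
        y_estimation: Counter = field(default_factory=Counter)   # Y^e(A)
        n_structure: int = 0                                     # N^s(A)
        y_structure: Counter = field(default_factory=Counter)    # Y^s(A)

        def update_estimation(self, label: int) -> None:
            # Mirrors Algorithm 9: increment the estimation count and histogram.
            self.n_estimation += 1
            self.y_estimation[label] += 1

        def update_structure(self, label: int) -> None:
            # Mirrors Algorithm 10: increment the structure count and histogram.
            self.n_structure += 1
            self.y_structure[label] += 1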

Algorithm 1 BuildTree

Require: Initially the tree has exactly one leaf (TreeRoot) which covers the whole space
Require: The dimensionality of the input, D. Parameters λ, m and τ.
  SelectCandidateSplitDimensions(TreeRoot, min(1 + Poisson(λ), D))
  for t = 1 . . . ∞ do
    Receive (Xt, Yt, It) from the environment
    At ← leaf containing Xt
    if It = estimation then
      UpdateEstimationStatistics(At, (Xt, Yt))
      for all S ∈ CandidateSplits(At) do
        for all A ∈ CandidateChildren(S) do
          if Xt ∈ A then
            UpdateEstimationStatistics(A, (Xt, Yt))
          end if
        end for
      end for
    else if It = structure then
      if At has fewer than m candidate split points then
        for all d ∈ CandidateSplitDimensions(At) do
          CreateCandidateSplit(At, d, π_d Xt)
        end for
      end if
      for all S ∈ CandidateSplits(At) do
        for all A ∈ CandidateChildren(S) do
          if Xt ∈ A then
            UpdateStructuralStatistics(A, (Xt, Yt))
          end if
        end for
      end for
      if CanSplit(At) then
        if ShouldSplit(At) then
          Split(At)
        else if MustSplit(At) then
          Split(At)
        end if
      end if
    end if
  end for


Algorithm 2 Split
Require: A leaf A
  S ← BestSplit(A)
  A' ← LeftChild(A)
  SelectCandidateSplitDimensions(A', min(1 + Poisson(λ), D))
  A'' ← RightChild(A)
  SelectCandidateSplitDimensions(A'', min(1 + Poisson(λ), D))
  return A', A''

Algorithm 3 CanSplit
Require: A leaf A
  d ← Depth(A)
  for all S ∈ CandidateSplits(A) do
    if SplitIsValid(A, S) then
      return true
    end if
  end for
  return false

Algorithm 4 SplitIsValid
Require: A leaf A
Require: A split S
  d ← Depth(A)
  A' ← LeftChild(S)
  A'' ← RightChild(S)
  return N^e(A') ≥ α(d) and N^e(A'') ≥ α(d)

Algorithm 5 MustSplit
Require: A leaf A
  d ← Depth(A)
  return N^e(A) ≥ β(d)

Algorithm 6 ShouldSplit
Require: A leaf A
  for all S ∈ CandidateSplits(A) do
    if InformationGain(S) > τ then
      if SplitIsValid(A, S) then
        return true
      end if
    end if
  end for
  return false

Algorithm 7 BestSplit
Require: A leaf A
Require: At least one valid candidate split exists for A
  best split ← none
  for all S ∈ CandidateSplits(A) do
    if InformationGain(A, S) > InformationGain(A, best split) then
      if SplitIsValid(A, S) then
        best split ← S
      end if
    end if
  end for
  return best split

Algorithm 8 InformationGain
Require: A leaf A
Require: A split S
  A' ← LeftChild(S)
  A'' ← RightChild(S)
  return Entropy(Y^s(A)) − (N^s(A')/N^s(A)) · Entropy(Y^s(A')) − (N^s(A'')/N^s(A)) · Entropy(Y^s(A''))

Algorithm 9 UpdateEstimationStatistics
Require: A leaf A
Require: A point (X, Y)
  N^e(A) ← N^e(A) + 1
  Y^e(A) ← Y^e(A) + Y

Algorithm 10 UpdateStructuralStatistics
Require: A leaf A
Require: A point (X, Y)
  N^s(A) ← N^s(A) + 1
  Y^s(A) ← Y^s(A) + Y


    B. Proof of Consistency

    B.1. A note on notation

A will be reserved for subsets of R^D, and unless otherwise indicated it can be assumed that A denotes a cell of the tree partition. We will often be interested in the cell of the tree partition containing a particular point, which we denote A(x). Since the partition changes over time, and therefore the shape of A(x) changes as well, we use a subscript to disambiguate: A_t(x) is the cell of the partition containing x at time t. Cells in the tree partition have a lifetime which begins when they are created as a candidate child to an existing leaf and ends when they are themselves split into two children. When referring to a point X_τ ∈ A_t(x) it is understood that τ is restricted to the lifetime of A_t(x).

We treat cells of the tree partition and leafs of the tree defining it interchangeably, denoting both with an appropriately decorated A.

N generally refers to the number of points of some type in some interval of time. The various decorations the N receives specify which particular type of point or interval of time is being considered. A superscript always denotes type, so N^k refers to a count of points of type k. Two special types, e and s, are used to denote estimation and structure points, respectively. Pairs of subscripts are used to denote time intervals, so N^k_{a,b} denotes the number of points of type k which appear during the time interval [a, b]. We also use N as a function whose argument is a subset of R^D in order to restrict the counting spatially: N^e_{a,b}(A) refers to the number of estimation points which fall in the set A during the time interval [a, b]. We make use of one additional variant of N as a function when its argument is a cell in the partition: when we write N^k(A_t(x)), without subscripts on N, the interval of time we count over is understood to be the lifetime of the cell A_t(x).

    B.2. Preliminaries

Lemma 6. Suppose we partition a stream of data into c parts by assigning each point (X_t, Y_t) to part I_t ∈ {1, . . . , c} with fixed probability p_k, meaning that

$$N^k_{a,b} = \sum_{t=a}^{b} \mathbb{I}\{I_t = k\}\,. \qquad (1)$$

Then with probability 1, N^k_{a,b} → ∞ for all k ∈ {1, . . . , c} as b − a → ∞.

Proof. Note that P(I_t = 1) = p_1 and these events are independent for each t. By the second Borel-Cantelli lemma, the probability that the events in this sequence occur infinitely often is 1. The cases for I_t ∈ {2, . . . , c} are similar.

Lemma 7. Let X_t be a sequence of iid random variables with distribution μ, let A be a fixed set such that μ(A) > 0 and let {I_t} be a fixed partitioning sequence. Then the random variable

$$N^k_{a,b}(A) = \sum_{a \leq t \leq b \,:\, I_t = k} \mathbb{I}\{X_t \in A\}$$

is Binomial with parameters N^k_{a,b} and μ(A). In particular,

$$P\left( N^k_{a,b}(A) \leq \frac{\mu(A)}{2} N^k_{a,b} \right) \leq \exp\left( -\frac{\mu(A)^2}{2} N^k_{a,b} \right)\,,$$

which goes to 0 as b − a → ∞, where N^k_{a,b} is the deterministic quantity defined as in Equation 1.

Proof. N^k_{a,b}(A) is a sum of iid indicator random variables so it is Binomial. It has the specified parameters because it is a sum over N^k_{a,b} elements and P(X_t ∈ A) = μ(A). Moreover, E[N^k_{a,b}(A)] = μ(A) N^k_{a,b}, so by Hoeffding's inequality we have that

$$P\left( N^k_{a,b}(A) \leq \mathbb{E}\left[N^k_{a,b}(A)\right] - \epsilon N^k_{a,b} \right) = P\left( N^k_{a,b}(A) \leq N^k_{a,b}(\mu(A) - \epsilon) \right) \leq \exp\left( -2\epsilon^2 N^k_{a,b} \right)\,.$$

Setting ε = μ(A)/2 gives the desired result.


    B.3. Proof of Proposition 2

Proof. Let g*(x) denote the Bayes classifier. Consistency of {g_t} is equivalent to saying that E[L(g_t)] = P(g_t(X, Z) ≠ Y) → L*. In fact, since P(g_t(X, Z) ≠ Y | X = x) ≥ P(g*(X) ≠ Y | X = x) for all x ∈ R^D, consistency of {g_t} means that for μ-almost all x,

$$P(g_t(X, Z) \neq Y \mid X = x) \to P(g^*(X) \neq Y \mid X = x) = 1 - \max_k \{\eta^k(x)\}\,.$$

Define the following two sets of indices

$$G = \{k \mid \eta^k(x) = \max_{k'} \{\eta^{k'}(x)\}\}\,,$$

$$B = \{k \mid \eta^k(x) < \max_{k'} \{\eta^{k'}(x)\}\}\,.$$

Then

$$P(g^{(M)}_t(X, Z^M) \neq Y \mid X = x) = \sum_k P(g^{(M)}_t(X, Z^M) = k \mid X = x)\, P(Y \neq k \mid X = x)$$

$$\leq \left(1 - \max_k \{\eta^k(x)\}\right) \sum_{k \in G} P(g^{(M)}_t(X, Z^M) = k \mid X = x) + \sum_{k \in B} P(g^{(M)}_t(X, Z^M) = k \mid X = x)\,,$$

which means it suffices to show that P(g^(M)_t(X, Z^M) = k | X = x) → 0 for all k ∈ B. However, using Z^M to denote M (possibly dependent) copies of Z, for all k ∈ B,

$$P(g^{(M)}_t(x, Z^M) = k) = P\left( \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = k\} > \max_{c \neq k} \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = c\} \right)$$

$$\leq P\left( \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = k\} \geq 1 \right)\,.$$

By Markov's inequality,

$$P\left( \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = k\} \geq 1 \right) \leq \mathbb{E}\left[ \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = k\} \right] = M\, P(g_t(x, Z) = k) \to 0\,.$$

    B.4. Proof of Proposition 3

    Proof. The sequence in question is uniformly integrable, so it is sufficient to show that E [P (gt(X,Z,I) = Y | I)] L implies the result, where the expectation is taken over the random selection of training set.

    We can write

    P (gt(X,Z,I) = Y) = E [P (gt(X,Z,I) = Y | I)]

    =

    I

    P (gt(X,Z,I) = Y | I) (I) +

    IcP (gt(X,Z,I) = Y | I) (I)

    By assumption (Ic) = 0, so we have

    limt

    P (gt(X,Z,I) = Y) = limt

    I

    P (gt(X,Z,I) = Y | I) (I)

  • 7/28/2019 Online Random Forest

    16/23

    1650165116521653165416551656165716581659166016611662166316641665166616671668

    1669167016711672167316741675167616771678167916801681

    1682168316841685168616871688168916901691169216931694

    1695169616971698169917001701170217031704

    1111111111111111111

    1111111111111

    1111111111111

    1111111111

    Consistency of Online Random Forests

    Since probabilities are bounded in the interval [0, 1], the dominated convergence theorem allows us to exchangethe integral and the limit,

    =

    I

    limt

    P (gt(X,Z,I) = Y | I) (I)

    and by assumption the conditional risk converges to the Bayes risk for all I I, so

    = LI

    (I)

    = L

    which proves the claim.

    B.5. Proof of Proposition 4

    Proof. By definition, the rule

    g(x) = arg maxk

    {k(x)}

    (where ties are broken in favour of smaller k) achieves the Bayes risk. In the case where all the k(x) are equalthere is nothing to prove, since all choices have the same probability of error. Therefore, suppose there is at leastone k such that k(x) < g(x)(x) and define

    m(x) = g(x)(x) maxk

    {k(x) | k(x) < g(x)(x)}

    mt(x) = g(x)t (x) max

    k{kt (x) |

    k(x) < g(x)(x)}

    The function m(x) 0 is the margin function which measures how much better the best choice is than the secondbest choice, ignoring possible ties for best. The function mt(x) measures the margin of gt(x). Ifmt(x) > 0 thengt(x) has the same probability of error as the Bayes classifier.

    The assumption above guarantees that there is some such that m(x) > . Using C to denote the number ofclasses, by making t large we can satisfy

    P

    |kt (X) k(X)| < /2

    1 /C

    since kt is consistent. Thus

    P

    C

    k=1

    |kt (X) k(X)| < /2

    1 K+

    Ck=1

    P

    |kt (X) k(X)| < /2

    1

    So with probability at least 1 we have

    mt(X) = g(X)t maxk {

    kt (X) |

    k

    (X) < g(X)

    (X)}

    (g(X) /2) maxk

    {kt (X) + /2 | k(X) < g(x)(X)}

    = g(X) maxk

    {k(X) | k(X) < g(X)(X)}

    = m(X)

    > 0

    Since is arbitrary this means that the risk ofgt converges in probability to the Bayes risk.

  • 7/28/2019 Online Random Forest

    17/23

    1760176117621763176417651766176717681769177017711772177317741775177617771778

    1779178017811782178317841785178617871788178917901791

    1792179317941795179617971798179918001801180218031804

    1805180618071808180918101811181218131814

    1111111111111111111

    1111111111111

    1111111111111

    1111111111

    Consistency of Online Random Forests

    A

    A

    A

    A A

    A

    d

    Figure 8. This Figure shows the setting of Proposition 8. Conditioned on a partially built tree we select an arbitrary leafat depth d and an arbitrary candidate split in that leaf. The proposition shows that, assuming no other split for A isselected, we can guarantee that the chosen candidate split will occur in bounded time with arbitrarily high probability.

    B.6. Proof of Theorem 1

    The proof of Theorem 1 is built in several pieces.

    Proposition 8. Fix a partitioning sequence. Let t0 be a time at which a split occurs in a tree built using this

    sequence, and let gt0 denote the tree after this split has been made. IfA is one of the newly created cells ingt0 then we can guarantee that the cell A is split before time t > t0 with probability at least 1 by making tsufficiently large.

    Proof. Let d denote the depth of A in the tree gt0 and note that (A) > 0 with probability 1 since X has adensity. This situation is illustrated in Figure 8. By construction, if the following conditions hold:

    1. For some candidate split in A, the number of estimation points in both children is at least (d),

    2. The number of estimation points in A is at least (d),

    then the algorithm must split A when the next structure point arrives. Thus in order to force a split we needthe following sequence of events to occur:

    1. A structure point must arrive in A to create a candidate split point.

    2. The above two conditions must be satisfied.

    3. Another structure point must arrive in A to force a split.

    It is possible for a split to be made before these events occur, but assuming a split is not triggered by some othermechanism we can guarantee that this sequence of events will occur in bounded time with high probability.

    Suppose a split is not triggered by a different mechanism. Define E0 to be an event that occurs at t0 withprobability 1, and let E1 E2 E3 be the times at which the above numbered events occur. Each of theseevents requires the previous one to have occurred and moreover, the sequence has a Markov structure, so fort0 t1 t2 t3 = t we have

    P (E1 t E2 t E3 t | E0 = t0) P (E1 t1 E2 t2 E3 t3 | E0 = t0)

    = P (E1 t1 | E0 = t0)P (E2 t2 | E1 t1)P (E3 t3 | E2 t2)

    P (E1 t1 | E0 = t0)P (E2 t2 | E1 = t1)P (E3 t3 | E2 = t2) .

    We can rewrite the first and last term in more friendly notation as

    P (E1 t1 | E0 = t0) = P

    Nst0,t1(A) 1

    ,

    P (E3 t3 | E2 = t2) = P

    Nst2,t3(A) 1

    .

  • 7/28/2019 Online Random Forest

    18/23

    1870187118721873187418751876187718781879188018811882188318841885188618871888

    1889189018911892189318941895189618971898189919001901

    1902190319041905190619071908190919101911191219131914

    1915191619171918191919201921192219231924

    1111111111111111111

    1111111111111

    1111111111111

    1111111111

    Consistency of Online Random Forests

    tt0

    E0 E1 E2 E3

    t1 t0t2 t1

    t3 t2

    Figure 9. This Figure diagrams the structure of the argument used in Propositions 8 and 9. The indicated intervals areshow regions where the next event must occur with high probability. Each of these intervals is finite, so their sum is also

    finite. We find an interval which contains all of the events with high probability by summing the lengths of the intervalsfor which we have individual bounds.

    Lemma 7 allows us to lower bound both of these probabilities by 1 for any > 0 by making t1 t0 and t3 t2large enough that

    Nst0,t1 2

    (A)max

    1, (A)1 log

    1

    and

    Ns

    t2

    ,t3

    2

    (A)max1, (A)1 log1

    respectively. To bound the centre term, recall that (A) > 0 and (A) > 0 with probability 1, and (d) (d)so

    P (E2 t2 | E1 = t1) P

    Net1,t2(A) (d) Net1,t2(A

    ) (d)

    P

    Net1,t2(A) (d)

    + P

    Net1,t2(A

    ) (d)

    1 ,

    and we can again use Lemma 7 lower bound this by 1 by making t2 t1 sufficiently large that

    Net1,t2 2

    min{(A), (A)}max

    (d), min{(A), (A)}1 log

    2

    Thus by setting = 1 (1 )1/3 can ensure that the probability of a split before time t is at least 1 if wemake

    t = t0 + (t1 t0) + (t2 t1) + (t3 t2)

    sufficiently large.

    Proposition 9. Fix a partitioning sequence. Each cell in a tree built based on this sequence is split infinitelyoften in probability. i.e for any x in the support of X,

    P (At(x) has been split fewer than K times) 0

    as t for all K.

    Proof. For an arbitrary point x in the support ofX, let Ek denote the time at which the cell containing x is splitfor the kth time, or infinity if the cell containing x is split fewer than k times (define E0 = 0 with probability1). Now define the following sequence:

    t0 = 0

    ti = min{t |P (Ei t | Ei1 = ti1) 1 }

  • 7/28/2019 Online Random Forest

    19/23

    1980198119821983198419851986198719881989199019911992199319941995199619971998

    1999200020012002200320042005200620072008200920102011

    2012201320142015201620172018201920202021202220232024

    2025202620272028202920302031203220332034

    2222222222222222222

    2222222222222

    2222222222222

    2222222222

    Consistency of Online Random Forests

    and set T = tk. Proposition 8 guarantees that each of the above tis exists and is finite. Compute,

    P (Ek T) = P

    k

    i=1

    [Ei T]

    P

    k

    i=1

    [Ei ti]

    =

    ki=1

    P

    Ei ti |

    j 0 we can choose T to guarantee P (Ek T)

    1 by setting = 1 (1 )1/k

    and applying the above process. We can make this guarantee for any k whichallows us to conclude that P (Ek t) 1 as t for all k as required.

    Proposition 10. Fix a partitioning sequence. Let At(X) denote the cell of gt (built based on the partitioningsequence) containing the point X. Then diam(At(X)) 0 in probability as t .

    Proof. Let Vt(x) be the size of the first dimension of At(x). It suffices to show that E [Vt(x)] 0 for all x in thesupport of X.

    Let X1, . . . , X m |At(x) for some 1 m m denote the samples from the structure stream that are used

    to determine the candidate splits in the cell At(x). Use d to denote a projection onto the dth coordinate, andwithout loss of generality, assume that Vt = 1 and 1Xi Uniform[0, 1]. Conditioned on the event that the firstdimension is cut, the largest possible size of the first dimension of a child cell is bounded by

    V

    = max(

    m

    maxi=1 1Xi, 1

    m

    mini=1 1Xi) .

    Recall that we choose the number of candidate dimensions as min(1 + Poisson(), D) and select that number ofdistinct dimensions uniformly at random to be candidates. Define the following events:

    E1 = {There is exactly one candidate dimension}

    E2 = {The first dimension is a candidate}

    Then using V to denote the size of the first dimension of the child cell,

    E [V] E [I {(E1 E2)c} + I {E1 E2} V

    ]

    = P (Ec1) + P (Ec2|E1)P (E1) + P (E2|E1)P (E1)E [V

    ]

    = (1 e) + (1 1

    d)e +

    1

    deE [V]

    = 1 e

    D+

    e

    DE [V]

    = 1 e

    D+

    e

    DE

    max(

    mmaxi=1

    1Xi, 1 m

    mini=1

    1Xi)

    = 1 e

    D+

    e

    D

    2m + 1

    2m + 2

    = 1 e

    2D(m + 1)

  • 7/28/2019 Online Random Forest

    20/23

    2090209120922093209420952096209720982099210021012102210321042105210621072108

    2109211021112112211321142115211621172118211921202121

    2122212321242125212621272128212921302131213221332134

    2135213621372138213921402141214221432144

    2222222222222222222

    2222222222222

    2222222222222

    2222222222

    Consistency of Online Random Forests

    Iterating this argument we have that after K splits the expected size of the first dimension of the cell containingx is upper bounded by

    1

    e

    2D(m + 1)

    K

    so it suffices to have K in probability, which we know to be the case from Proposition 9.

    Proposition 11. Fix a partitioning sequence. In any tree built based on this sequence, Ne(At(X)) inprobability.

    Proof. It suffices to show that Ne(At(x)) for all x in the support ofX. Fix such an x, by Proposition 9 wecan make the probability At(x) is split fewer than K times arbitrarily small for any K. Moreover, by constructionimmediately after the K-th split is made the number of estimation points contributing to the prediction at x isat least (K), and this number can only increase. Thus for all K we have that P (Ne(At(x)) < (K)) 0 ast as required.

    We are now ready to prove our main result. All the work has been done, it is simply a matter of assembling thepieces.

    Proof (of Theorem 1). Fix a partitioning sequence. Conditioned on this sequence the consistency of each of theclass posteriors follows from Theorem 5. The two required conditions where shown to hold in Propositions 10and 11. Consistency of the multiclass tree classifier then follows by applying Proposition 4.

    To remove the conditioning on the partitioning sequence, note that Lemma 6 shows that our tree generationmechanism produces a partitioning sequence with probability 1. Apply Proposition 3 to get unconditionalconsistency of the multiclass tree.

    Proposition 2 lifts consistency of the trees to consistency of the forest, establishing the desired result.

    B.7. Extension to a Fixed Size Fringe

    Proving consistency is preserved with a fixed size fringe requires more precise control over the relationship

    between the number of estimation points seen in an interval, Net0,t, and the total number of splits which haveoccurred in the tree, K. The following two lemmas provide the control we need.

    Lemma 12. Fix a partitioning sequence. If K is the number of splits which have occurred at or before time tthen for all M > 0

    P (K M) 0

    in probability as t .

    Proof. Denote the fringe at time t with Ft which has max size |F|, and the set of leafs at time t as Lt with size|Lt|. If |Lt| < |F| then there is no change from the unbounded fringe case, so we assume that |Lt| |F| so thatfor all t there are exactly |F| leafs in the fringe.

    Suppose a leaf A1 Ft0 for some t0 then for every > 0 there is a finite time t1 such that for all t t1

    P (A1 has not been split before time t)

    |F|

    Now fix a time t0 and > 0. For each leaf Ai Ft0 we can choose ti to satisfy the above bound. Set t = maxi tithen the union bound gives

    P (K |F| at time t)

  • 7/28/2019 Online Random Forest

    21/23

    2200220122022203220422052206220722082209221022112212221322142215221622172218

    2219222022212222222322242225222622272228222922302231

    2232223322342235223622372238223922402241224222432244

    2245224622472248224922502251225222532254

    2222222222222222222

    2222222222222

    2222222222222

    2222222222

    Consistency of Online Random Forests

    Iterate this argument M/|F| times with = / M/|F| and apply the union bound again to get that forsufficiently large t

    P (K M)

    for any > 0.

    Lemma 13. Fix a partitioning sequence. If K is the number of splits which have occurred at or before time t

    then for any t0 > 0, K/Net0,t 0 as t .

    Proof. First note that Net0,t = Ne0,t N

    e0,t01 so

    K

    Net0,t=

    K

    Ne0,t Ne0,t01

    and since Ne0,t01 is fixed it is sufficient to show that K/N0,t 0. In the following we write N = Ne0,t to lighten

    the notation.

    Define the cost of a tree T as the minimum value of N required to construct a tree with the same shape as T.The cost of the tree is governed by the function (d) which gives the cost of splitting a leaf at level d. The costof a tree is found by summing the cost of each split required to build the tree.

    Note that no tree on K splits is cheaper than a tree of max depth d = log2(K) with all levels full (exceptpossibly the last, which may be partially full). This is simple to see, since (d) is an increasing function of d,meaning it is never more expensive to add a node at a lower level than a higher one. Thus we assume wlog thatthe tree is full except possibly in the last level.

    When filling the dth layer of the tree, each split requires at least 2(d + 1) points because a split creates twonew leafs at level d + 1. This means that for K in the range [2d, 2d+1 1] (the range of splits which fill up leveld), K can increase at a rate which is at most 1 /2(d + 1) with respect to N. This also tells us that filling thedth level of the tree requires that N increase by at least 2d(d) = 2d1 2(d) (filling the dth level correspondsto splitting each of the 2d1 leafs on the d 1th level at a cost of 2(d) each). This means that filling d levelsof the tree requires at least

    Nd =d

    k=12k(k)

    points. When N = Nd, K is at most 2d 1 because that is the number of splits in a full binary tree of depth d.

    The above argument gives a collection of linear upper bounds on K in terms ofN. We know that the maximumgrowth rate is linear between (Nd, 2

    d 1) and (Nd+1, 2d+1 1) so for all d we can find that since

    (2d+1 1) (2d 1)

    (Nd+1) (Nd)=

    2d+1 2dd+1k=1 2

    k(k) d

    k=1 2k(k)

    =2d

    2d+1(d + 1)=

    1

    2(d + 1)

    we have that for N and d,

    K 1

    2(d + 1)N + C(d)

    where C(d) is given by

    C(d) = 2d 1 12

    dk=1

    2k(k)

    (d + 1)

    From this we have

    K

    N

    1

    2(d + 1)+

    1

    N

    2d 1

    1

    2

    dk=1

    2k(k)

    (d + 1)

    which holds for all d and N, so if we choose d to make 1/(d +1) /2 and then pick N such that C(d)/N /2we have K/N for arbitrary > 0 which proves the claim.

  • 7/28/2019 Online Random Forest

    22/23

    2310231123122313231423152316231723182319232023212322232323242325232623272328

    2329233023312332233323342335233623372338233923402341

    2342234323442345234623472348234923502351235223532354

    2355235623572358235923602361236223632364

    2222222222222222222

    2222222222222

    2222222222222

    2222222222

    Consistency of Online Random Forests

    K

    Ne0,t0

    21 1

    22 1

    23 1

    C(2)

    12(1)

    12(2)

    12(3)

    Figure 10. Diagram of the bound in Lemma 13. The horizontal axis is the number of estimation points seen at time t andthe vertical axis is the number of splits. The first bend is the earliest point at which the root of the tree could be split,which requires 2(1) points to create 2 new leafs at level 1. Similarly, the second bend is the point at which all leafs atlevel 1 have been split, each of which requires at least 2(2) points to create a pair of leafs at level 2.

    In order to show that our algorithm remains consistent with a fixed size fringe we must ensure that Proposition 8does not fail in this setting. Interpreted in the context of a finite fringe, Proposition 8 says that any cell in thefringe will be split in finite time. This means that to ensure consistency we need only show that any inactivepoint will be added to the fringe in finite time.

    Remark 14. If s(A) = 0 for any leaf then we know that e(A) = 0, since (A) > 0 by construction. If e(A) = 0then P (g(X) = Y | X A) = 0 which means that any subdivision of A has the same asymptotic probability oferror as leaving A in tact. Our rule never splits A and thus fails to satisfy the shrinking leaf condition, but ourpredictions are asymptotically the same as if we had divided A into arbitrarily many pieces so this doesnt matter.

    Proposition 15. Every leaf with s(A) > 0 will be added to the fringe in finite time with high probability.

    Proof. Pick an arbitrary leaf A. We know from Hoeffdings inequality that

    P (p(A) (A) ) exp

    2|A|2

    exp

    2(d)2

    and

    P (p(A) (A) + ) exp

    2|A|2

    exp

    2(d)2

    Now pick an arbitrary time t0 and condition on everything before t0. For an arbitrary node A RD, if A is a

    child of A then we know that if {Ui}Dmi=1 are iid on [0, 1] then

    E [(A)] (A)EDm

    maxi=1

    (max(Ui, 1 Ui))

    = (A)2Dm + 1

    2Dm + 2since there are at most D candidate dimensions and each one accumulates at most m candidate splits. So ifAK

    is any leaf created by K splits of A then

    E

    (AK)

    (A)

    2Dm + 1

    2Dm + 2

    KNotice that since we have conditioned on the tree at t0 so,

    E

    p(AK)

    = EE

    p(AK) | (AK)

    = E

    (AK)

  • 7/28/2019 Online Random Forest

    23/23

    2420242124222423242424252426242724282429243024312432243324342435243624372438

    2439244024412442244324442445244624472448244924502451

    2452245324542455245624572458245924602461246224632464

    2465246624672468246924702471247224732474

    2222222222222222222

    2222222222222

    2222222222222

    2222222222

    Consistency of Online Random Forests

    And we can bound p(AK) with

    P

    p(AK) (A)

    2Dm + 1

    2Dm + 2

    K+

    exp

    2|AK|2

    Set (2K+1|L|)1 = exp 2|AK|2 and invert the bound so we have

    P

    p(AK) (A)

    2Dm + 1

    2Dm + 2

    K+

    1

    2|AK|log

    2K+1|L|

    2K+1|L|

    Pick an arbitrary leaf A0 which is in the tree at time t0. We can use the same approach to find a lower boundon s(A0):

    P

    s(A0) s(A0)

    1

    2|A0|log

    2K+1|L|

    2K+1|L|

    To ensure that s(A0) p(AK) ( s(AK)) fails to hold with probability at most 2K|L|1 we must choose k

    and t to make

    s(A0) (A)

    2Dm + 1

    2Dm + 2

    K

    +

    1

    2|AK|log

    2K+1|L|

    +

    1

    2|A0|log

    2K+1|L|

    The first term goes to 0 as K . We know that |AK| (K) so the second term also goes to 0 providedthat K/(K) 0, which we require.

    The third term goes to 0 if K/|A0| 0. Recall that |A0| = Net0,t(A0) and for any > 0

    P

    Net0,t(A) N

    et0,t(A)

    1

    2Net0,tlog

    1

    From this we see it is sufficient to have K/Net0,t 0 which we established in a lemma.

    In summary, there are |L| leafs in the tree at time t0 and each of them generates at most 2K different AKs.Union bounding over all these leafs and over the probability of Net0,t(A0) growing sublinearly in N

    et0,t we have

    that, conditioned on the event that A0 has not yet been split, A0 is the leaf with the highest value of s withprobability at least 1 in finite time. Since and are arbitrary we are done.