Echo State Hoeffding Tree Learning

Page 1: Echo State Hoeffding Tree Learning

Echo State Hoeffding Tree Learning

Diego Marrón ([email protected])
Jesse Read ([email protected])
Albert Bifet ([email protected])
Talel Abdessalem ([email protected])
Eduard Ayguadé ([email protected])
José R. Herrero ([email protected])

ACML 2016

Hamilton, New Zealand

Page 2: Echo State Hoeffding Tree Learning


Introduction

• Real-time classification of Big Data streams is becoming essential in a variety of application domains.

• Real-time classification imposes some challenges:
  • Deal with potentially infinite streams
  • Strong temporal dependences
  • React to changes in the stream
  • Response time and memory are bounded

Page 3: Echo State Hoeffding Tree Learning

Real-Time Classification

• In real-time classification:
  • The Hoeffding Tree (HT) is the state-of-the-art streaming decision tree
  • HTs are powerful and easy to deploy (no hyper-parameters to tune)
  • But they are unable to capture strong temporal dependences

• Recurrent Neural Networks (RNNs) are very popular nowadays

Page 4: Echo State Hoeffding Tree Learning

Recurrent Neural Networks

• Recurrent Neural Networks (RNNs) are the state of the art in handwriting recognition, speech recognition, and natural language processing, among others

• They are able to capture time dependences

• But their use for data streams is not straightforward:
  • Very sensitive to the hyper-parameter configuration
  • Training requires many iterations over the data...
  • ...and a large amount of time

Page 5: Echo State Hoeffding Tree Learning

RNN: Echo State Network

• A type of Recurrent Neural Network

• Echo State Layer (ESL):
  • Dynamics driven only by the input
  • Requires very few computations
  • Easy-to-understand hyper-parameters
  • Can capture time dependences

• The ESN also requires the hyper-parameters needed by the NN

• Gradient descent methods have slow convergence

Page 6: Echo State Hoeffding Tree Learning

Contribution

• Objectives:
  • Model the evolution of the stream over time
  • Reduce the number of hyper-parameters
  • Reduce the number of samples needed to learn

• In this work we present the ESHT:
  • A combination of an HT and an ESL
  • Learns temporal dependences in data streams in real time
  • Requires fewer hyper-parameters than the ESN

Page 7: Echo State Hoeffding Tree Learning

ESHT

• Echo State Layer (ESL):
  • Encodes time dependences
  • Only needs two hyper-parameters:
    • Alpha (α): weights the importance of past events in X(n) against new ones
    • Density: Wres is a sparse matrix with the given density

• FIMT-DD: a Hoeffding tree for regression
  • Works out of the box: no hyper-parameter tuning

• A sketch of the state update we assume the ESL performs is shown below
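
The slides do not spell out the update equation, so the following is a minimal sketch assuming the ESL follows the common leaky echo-state recurrence X(n) = (1 − α)·X(n−1) + α·tanh(Win·u(n) + Wres·X(n−1)). The weight ranges, the tanh nonlinearity and the spectral-radius rescaling are assumptions, as is the convention for which term α multiplies:

    import numpy as np

    def make_reservoir(n_neurons, n_inputs, density, spectral_radius=0.9, seed=42):
        """Build the input and reservoir matrices; Wres keeps only a `density`
        fraction of its weights (all numeric choices here are assumptions)."""
        rng = np.random.default_rng(seed)
        w_in = rng.uniform(-0.5, 0.5, (n_neurons, n_inputs))
        w_res = rng.uniform(-0.5, 0.5, (n_neurons, n_neurons))
        w_res *= rng.random((n_neurons, n_neurons)) < density   # sparsify
        w_res *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w_res)))
        return w_in, w_res

    def esl_update(x, u, w_in, w_res, alpha):
        """Leaky update: alpha balances the previous state X(n-1) against the
        activation produced by the new input u(n)."""
        return (1.0 - alpha) * x + alpha * np.tanh(w_in @ u + w_res @ x)

Under this convention, α = 1.0 (the setting used later for emailFilter) makes the state fully driven by the new activation, with the past entering only through Wres·X(n−1).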

Page 8: Echo State Hoeffding Tree Learning

ESHT: Evaluation Methodology

• We propose the ESHT to learn functions over character streams:
  • Counter (skipped in this presentation)
  • lastIndexOf
  • emailFilter

• lastIndexOf evaluation:
  • Study the effects of the hyper-parameters α and density:
    • Alpha (α): weights the importance of past events in X(n) against new ones
    • Density: Wres is a sparse matrix with the given density
  • Uses 1,000 neurons in the ESL

• emailFilter evaluation:
  • We focus on the speed of learning
  • Uses the outcomes of the previous evaluations to configure the ESHT for this task

• Metrics (see the sketch after this list):
  • Cumulative loss
  • We consider a prediction an error if |y_t − ŷ_t| ≥ 0.5
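
The slides state the error threshold but not the exact quantity accumulated; a plausible reading, consistent with the fractional loss totals on the results slide, is the running absolute error, with the 0.5 threshold defining accuracy. A minimal sketch under that assumption:

    def update_metrics(loss, errors, y_true, y_pred):
        """One test-then-train step of metric tracking (an assumed formulation):
        accumulate the absolute error, and count a misclassification whenever
        the prediction is off by 0.5 or more."""
        residual = abs(y_true - y_pred)
        loss += residual                       # cumulative loss
        errors += 1 if residual >= 0.5 else 0  # used to derive accuracy (%)
        return loss, errors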

Page 9: Echo State Hoeffding Tree Learning

Input format

• The input is a vector of floats
• Number of attributes = number of input symbols
• The attribute representing the current symbol is set to 0.5
• All other attributes are set to zero
• A sketch of this encoding is shown below
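
A minimal sketch of this encoding (the helper name and the alphabet-as-list representation are ours):

    import numpy as np

    def encode_symbol(symbol, alphabet):
        """One attribute per symbol in the alphabet: the current symbol's
        attribute is set to 0.5, every other attribute to 0.0."""
        u = np.zeros(len(alphabet))
        u[alphabet.index(symbol)] = 0.5
        return u

    # encode_symbol('b', ['a', 'b', 'c']) -> array([0. , 0.5, 0. ])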

Page 10: Echo State Hoeffding Tree Learning

LastIndexOf

• Counts the number of time steps since the current symbol was last observed

• The input stream is randomly generated
• We use 2, 3 and 4 symbols
• A reference implementation of the target function is sketched below
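
A sketch of this label generator as we read it from the description above (the label for a symbol's first occurrence is not stated; emitting the current time step is our assumption):

    def last_index_of_labels(stream):
        """For each incoming symbol, emit the number of time steps since that
        same symbol was last observed."""
        last_seen = {}
        for t, symbol in enumerate(stream):
            yield t - last_seen.get(symbol, 0)  # first occurrence: assumption
            last_seen[symbol] = t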

Page 11: Echo State Hoeffding Tree Learning

LastIndexOf: Vector vs Scalar Input

• Vector input improves accuracy in all cases
• Especially with 4 symbols

[Figure: accuracy (%) vs. α at density 0.4, comparing scalar and vector input for 2, 3 and 4 symbols]

Page 12: Echo State Hoeffding Tree Learning

LastIndexOf: Alpha and Density vs Accuracy

• Lower values of alpha (α) yield low accuracy

• There is no clear correlation between accuracy and density

[Figures: left, accuracy (%) vs. alpha (α) for 2, 3 and 4 symbols at densities 0.1 and 0.4; right, accuracy (%) vs. density for α from 0.2 to 1.0]

Page 13: Echo State Hoeffding Tree Learning

EmailFilter

• ESHT configuration:
  • ESL: 4,000 neurons
  • α = 1.0 and density = 0.1

• Outputs the length on the next space character (see the sketch below)
• Dataset: 20 Newsgroups
  • Extracted 590 characters and repeated them 8 times
  • To reduce memory usage we used an input vector of 4 symbols
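
The slide only says the target "outputs the length on the next space character"; one plausible reading, shown purely as an assumption, is that the label is the length of the token that just ended, emitted on the space that terminates it, and 0 otherwise:

    def email_filter_labels(chars):
        """Hypothetical label generator: 0 while a token is being read, and the
        length of the finished token on the space that terminates it."""
        length = 0
        for c in chars:
            if c == ' ':
                yield length
                length = 0
            else:
                length += 1
                yield 0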

Page 14: Echo State Hoeffding Tree Learning

EmailFilter: Recurrence vs Non Recurrence

• Non-recurrent methods (FIMT-DD and NN) fail to capture temporal dependences

• The NN defaults to the majority class

Algorithm   Density   α     Learning rate   Loss      Accuracy (%)
FIMT-DD     -         -     -               4,119.7   91.61
NN          -         -     0.8             2,760     97.80
ESN1        0.2       1.0   0.1             1,032     98.47
ESN2        0.7       1.0   0.1             850       98.47
ESHT        0.1       1.0   -               180       99.75

Page 15: Echo State Hoeffding Tree Learning

EmailFilter: ESN vs ESHT

• After 500 samples the ESHT loss is close to 0 (and reaches zero loss after 1,000 samples)

[Figure: cumulative loss vs. # samples for ESN1, ESN2 and ESHT]

Page 16: Echo State Hoeffding Tree Learning

Conclusions and Future Work

• Conclusions:
  • We presented the ESHT to learn temporal dependences in data streams in real time
  • The ESHT requires fewer hyper-parameters than the ESN
  • Our proof-of-concept implementation is able to learn faster than an ESN (most of them at the first attempt)

• Future Work:
  • We are currently reimplementing our prototype so we can test larger input sequences
  • We need to study the effects of the initial state vanishing in long sequences


Page 17: Echo State Hoeffding Tree Learning

Thank you

Page 18: Echo State Hoeffding Tree Learning

Echo State Hoeffding Tree Learning

Diego Marrón ([email protected])
Jesse Read ([email protected])
Albert Bifet ([email protected])
Talel Abdessalem ([email protected])
Eduard Ayguadé ([email protected])
José R. Herrero ([email protected])

ACML 2016

Hamilton, New Zealand

Page 19: Echo State Hoeffding Tree Learning

ESHT: Module Architecture

• In each evaluation we use the following architecture (sketched below)
• The label generator implements the function to be learnt
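
The architecture diagram did not survive extraction, so the wiring below is a hypothetical sketch of the evaluation loop implied by the slides, reusing the helpers sketched earlier (encode_symbol, esl_update, update_metrics, and a label generator); the tree object with predict/learn methods is a stand-in for FIMT-DD, and its method names are ours:

    import numpy as np

    def run_evaluation(symbols, labels, alphabet, tree, w_in, w_res, alpha):
        """Test-then-train loop: encode each symbol, expand it through the ESL,
        predict with the regression tree, score, and then train on the label."""
        x = np.zeros(w_res.shape[0])          # ESL state X(n), initially zero
        loss, errors = 0.0, 0
        for symbol, y in zip(symbols, labels):
            u = encode_symbol(symbol, alphabet)
            x = esl_update(x, u, w_in, w_res, alpha)
            y_hat = tree.predict(x)           # hypothetical FIMT-DD interface
            loss, errors = update_metrics(loss, errors, y, y_hat)
            tree.learn(x, y)
        return loss, errors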


Page 20: Echo State Hoeffding Tree Learning

Counter: Introduction

• A randomly generated stream of zeros and ones

• The input is a scalar

• Two variants (sketched below):
  • Option 1: outputs the cumulative count
  • Option 2: outputs the total count on the next zero
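
For concreteness, the two label generators as we read them (whether Option 2 resets its count after emitting it, and what it emits on a one, are assumptions):

    def counter_option1(bits):
        """Option 1: at every step, emit the cumulative count of ones seen so far."""
        count = 0
        for b in bits:
            count += b
            yield count

    def counter_option2(bits):
        """Option 2: emit the running count of ones when a zero arrives,
        then reset it (the reset is an assumption)."""
        count = 0
        for b in bits:
            if b == 1:
                count += 1
                yield 0
            else:
                yield count
                count = 0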


Page 21: Echo State Hoeffding Tree Learning

Counter: Cumulative Loss

• After 200 samples the loss is stable

[Figure: cumulative loss vs. # samples for Op1 (density=0.3, α=1.0), Op1 (density=1.0, α=0.7), Op2 (density=0.8, α=1.0) and Op2 (density=0.8, α=0.7)]


Page 22: Echo State Hoeffding Tree Learning

Counter: Alpha and Density vs Accuracy

[Figures: left, accuracy (%) vs. alpha (α); right, accuracy (%) vs. density (%)]


Page 23: Echo State Hoeffding Tree Learning

EmailFilter: ASCII to 4 symbols Table

ASCII Domain                           4-Symbols Domain
Original Symbols     Target Symbol     Target Symbol Index
[\t \n \r]+          single space      0
[a-zA-Z0-9]          x                 1
@                    @                 2
.                    .                 3
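
A sketch of this mapping (the function name is ours; dropping characters outside the four classes is an assumption, since the slide does not say how they are handled):

    import re

    def to_four_symbols(text):
        """Collapse whitespace runs to a single space, then map each character
        to its 4-symbol index: space -> 0, alphanumeric -> 1, '@' -> 2, '.' -> 3."""
        text = re.sub(r'[\t\n\r ]+', ' ', text)
        specials = {' ': 0, '@': 2, '.': 3}
        out = []
        for c in text:
            if c.isalnum():
                out.append(1)
            elif c in specials:
                out.append(specials[c])
            # any other character is dropped (assumption)
        return out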
