
Online Outlier Detection for Time Series

Tingyi Zhu

September 14, 2016


Outline

Overview of Online Machine Learning

Online Time Series Outlier Detection

Demo


Overview of Online Machine Learning


Introduction

Online learning

- a method of machine learning in which the data is accessed in sequential order

- the model is trained in consecutive steps

- the predictor is updated at each step

Batch (offline) learning

- learns on the entire training data set

- generates the best predictor at once


Batch Learning


Online Learning


Two scenarios where online algorithms are useful:

When dealing with a large dataset, it is computationally infeasible to train over the entire dataset

When new data is constantly being generated and predictions are needed instantly; cases include

- In financial markets: stock price prediction, portfolio selection

- Real-time recommendation

- Online ad placement

- Anomaly (outlier) detection


Statistical Learning Model

X : space of inputs

Y : space of outputs

To learn the function f : X → Y

Loss function: V : Y × Y → R

V(f(x), y) measures the difference between the predicted value f(x) and the true value y

Goal: Minimize the expected risk

min_f E[V(f(x), y)]


Given the training set

(x_1, y_1), . . . , (x_n, y_n)

Then minimize the empirical loss

min_f Σ_{i=1}^n V(f(x_i), y_i)


Online Model

Pure online learning model:

- At time t + 1, learning of f_{t+1} depends on the new input (x_{t+1}, y_{t+1}) and the previous best predictor f_t

- Needs extra stored information, whose size is usually independent of the training data size

Hybrid online learning model

- Learning of f_{t+1} depends on f_t and all previous data (x_1, y_1), . . . , (x_{t+1}, y_{t+1})

- The extra storage requirement is no longer constant; it depends on the training data size

- Takes less time to compute than batch learning


Example: Linear Least Square

x_j ∈ R^d, y_j ∈ R

Learning a linear function: f(x) = ⟨w, x⟩

Parameters to learn: w ∈ R^d

Square loss: V(f(x), y) = (f(x) − y)^2

Minimize the empirical loss:

Q(w) = Σ_{j=1}^n Q_j(w) = Σ_{j=1}^n V(⟨w, x_j⟩, y_j) = Σ_{j=1}^n (x_j^T w − y_j)^2


Batch Learning

Let X be the i × d data matrix

Y be the i × 1 vector of target values after the arrival of the first i datapoints

The covariance matrix Σ_i = X^T X

The best solution of w is

w = (X^T X)^{-1} X^T Y = Σ_i^{-1} Σ_{j=1}^i x_j y_j

Recompute the solution after the arrival of every datapoint
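A minimal numpy sketch of this naive refit-everything approach (my own illustration; the toy data and dimensions are made up, not from the talk):

```python
import numpy as np

def batch_ls(X, Y):
    # w = (X^T X)^{-1} X^T Y, via a linear solve rather than an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Toy stream: recompute the solution after the arrival of every datapoint.
rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(100, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=100)
for i in range(d, 101):
    w = batch_ls(X[:i], Y[:i])  # O(i d^2 + d^3) work at step i
```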


Complexity of Batch Learning

After the arrival of the ith datapoint,

Calculating Σ_i takes O(id^2) time

Inverting the d × d matrix Σ_i takes O(d^3) time

Computation time at the ith step: O(id^2 + d^3)

Total time with n total data points in the dataset:

Σ_i O(id^2 + d^3) = O(n^2 d^2 + nd^3)


Improving on Batch Learning

If we store the matrix Σ_i between steps:

Updating Σ at each step,

Σ_{i+1} = Σ_i + x_{i+1} x_{i+1}^T

only takes O(d^2) time

Total time reduces to Σ_i O(d^2 + d^3) = O(nd^3)

Additional storage space of O(d^2) is needed to store Σ_i
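A sketch of this rank-one update in numpy (my own illustration; the tiny ridge term added for early-step invertibility is an assumption, not from the slides):

```python
import numpy as np

d = 5
Sigma = np.zeros((d, d))  # running X^T X
c = np.zeros(d)           # running X^T Y = sum_j x_j y_j

def add_point_and_solve(x, y):
    global Sigma, c
    Sigma += np.outer(x, x)  # Sigma_{i+1} = Sigma_i + x x^T, O(d^2) time
    c += x * y
    # The solve still costs O(d^3) per step; the jitter keeps Sigma invertible
    return np.linalg.solve(Sigma + 1e-9 * np.eye(d), c)
```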


Online Learning: Recursive Least Square

Initialization: w_0 = 0 ∈ R^d, Γ_0 = I ∈ R^{d×d}

The LS solution can be computed by the following iteration:

Γ_i = Γ_{i−1} − (Γ_{i−1} x_i x_i^T Γ_{i−1}) / (1 + x_i^T Γ_{i−1} x_i)

w_i = w_{i−1} − Γ_i x_i (x_i^T w_{i−1} − y_i)

The w computed above is the same as the batch-learning solution, which can be proved by induction on i

Complexity: O(nd^2) for n steps

Requires storage of the matrix Γ, which is constant at O(d^2)
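A minimal Python sketch of the recursion above (an illustration, not the speaker's code):

```python
import numpy as np

class RecursiveLS:
    # Recursive least squares: O(d^2) time per update, O(d^2) storage for Gamma.
    def __init__(self, d):
        self.w = np.zeros(d)    # w_0 = 0
        self.Gamma = np.eye(d)  # Gamma_0 = I

    def update(self, x, y):
        Gx = self.Gamma @ x     # Gamma_{i-1} x_i
        # Gamma_i = Gamma_{i-1} - Gamma_{i-1} x x^T Gamma_{i-1} / (1 + x^T Gamma_{i-1} x)
        self.Gamma -= np.outer(Gx, Gx) / (1.0 + x @ Gx)
        # w_i = w_{i-1} - Gamma_i x_i (x_i^T w_{i-1} - y_i)
        self.w -= (self.Gamma @ x) * (x @ self.w - y)
        return self.w
```

Each update is a Sherman–Morrison step on Γ, which is why no O(d^3) inversion is ever needed.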


For the case when Σ_i is not invertible, consider the regularized version of the loss function

Q(w) = Σ_{j=1}^n (x_j^T w − y_j)^2 + λ‖w‖_2^2

The same recursive algorithm works with a minor change in initialization:

Γ_0 = (I + λI)^{−1}


Stochastic Gradient Descent

Recall standard gradient descent (also called batch gradient descent), which deals with the problem of minimizing an objective function Q(w)

The solution w is found through the iteration

w := w − η∇Q(w)

When the objective has the form of a sum, Q(w) = Σ_{j=1}^n Q_j(w),

the standard gradient descent performs the iteration:

w := w − η Σ_{j=1}^n ∇Q_j(w)
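For comparison with the SGD variant that follows, a small sketch of this full-gradient iteration for the squared loss (the step size η and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def batch_gd(X, Y, eta=1e-3, steps=1000):
    # Each iteration sums the gradients of all n summands: O(nd) work per step
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - Y)  # sum_j grad Q_j(w), Q_j(w) = (x_j^T w - y_j)^2
        w -= eta * grad
    return w
```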


Each summand function Q_j is typically associated with the j-th observation in the data set

When the sample size n is large, standard GD requires expensive evaluations of the gradients of all summand functions, which is redundant because it computes gradients of very similar functions at each update

SGD: update the solution by computing the gradient of a single Q_j

w := w − η∇Q_j(w)


SGD for Online Learning

Linear least squares:

min_w Q(w) = Σ_{j=1}^n Q_j(w) = Σ_{j=1}^n V(⟨w, x_j⟩, y_j) = Σ_{j=1}^n (x_j^T w − y_j)^2

After the arrival of the ith datapoint (x_i, y_i), update the solution by

w_i = w_{i−1} − γ_i ∇V(⟨w, x_i⟩, y_i) = w_{i−1} − γ_i x_i (x_i^T w_{i−1} − y_i)

which is similar to the updating procedure in recursive least squares:

w_i = w_{i−1} − Γ_i x_i (x_i^T w_{i−1} − y_i)
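As a one-line sketch, the SGD update for this loss (the factor 2 from the gradient is absorbed into γ, matching the slide's form):

```python
import numpy as np

def sgd_step(w, x, y, gamma):
    # w_i = w_{i-1} - gamma_i * x_i (x_i^T w_{i-1} - y_i): O(d) work per datapoint
    return w - gamma * x * (x @ w - y)
```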


The complexity of the SGD algorithm for n steps reduces to O(nd)

The storage requirement is constant at O(d)

Consistency of SGD:

When the learning rate γ decreases at an appropriate rate, SGD shows the same convergence behavior as batch gradient descent

Averaging SGD: choose γ_i = 1/√i and keep track of

w̄_n = (1/n) Σ_{i=1}^n w_i

which is consistent.
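A sketch of averaging SGD under these choices (my own illustration; `stream` is any iterable of (x, y) pairs):

```python
import numpy as np

def averaged_sgd(stream, d):
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for i, (x, y) in enumerate(stream, start=1):
        w = w - (1.0 / np.sqrt(i)) * x * (x @ w - y)  # gamma_i = 1/sqrt(i)
        w_bar += (w - w_bar) / i  # running mean of the iterates w_1, ..., w_i
    return w_bar
```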


Figure: SGD fluctuation (Source: Wikipedia)

SGD performs frequent updates and is faster

But the high variance of the updates causes heavy fluctuation

This complicates convergence to the exact minimum


Mini-Batch Gradient Descent

Use b datapoints in each iteration

A generalization of, and compromise between, batch GD and SGD

Given b datapoints (x_{i+1}, y_{i+1}), . . . , (x_{i+b}, y_{i+b}):

w := w − γ · (1/b) Σ_{k=i+1}^{i+b} x_k (x_k^T w − y_k)

b typically ranges from 2 to 100

For online learning, b is much smaller than n
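A sketch of one mini-batch update for the squared loss (illustration only; `X_batch` stacks the b inputs as rows):

```python
import numpy as np

def minibatch_step(w, X_batch, Y_batch, gamma):
    # Averaged gradient over the b points in the batch
    b = len(Y_batch)
    return w - gamma * (X_batch.T @ (X_batch @ w - Y_batch)) / b
```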


Other Topics in Online Learning

Incremental (Online) PCA

Incremental (Online) SVD

Adversarial Model

etc.


Online Time Series Outlier Detection


The Problem

Given: a time series x_1, . . . , x_t, with new data points coming in as time goes on

Goal: upon arrival of a new data point, determine whether it is an outlier

Applications:

- Financial markets: abrupt changes in stock price

- Health data: blood pressure, heart beats

- Intrusion detection: sudden increase in login attempts


The Framework

Initial series x_1, . . . , x_n as the training sample

Use a prediction algorithm, which can be adapted to the online setting, to predict x_{n+1}

Set criteria for identifying outliers or outlier events (subsequences), based on the prediction x̂_{n+1} and the new data point x_{n+1} that comes in

Update the training series to x_1, . . . , x_n, x_{n+1}, and predict x_{n+2} (a sketch of this loop follows)
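A schematic sketch of the loop; `fit`, `predict`, and `is_outlier` are hypothetical placeholders for a concrete model and outlier criterion:

```python
def online_detect(stream, train, fit, predict, is_outlier):
    # Naive version: retrains on the full (growing) series after every point.
    model = fit(train)
    flags = []
    for x_new in stream:
        x_hat = predict(model)              # forecast the next point
        flags.append(is_outlier(x_hat, x_new))
        train.append(x_new)                 # the training series grows by one
        model = fit(train)                  # retrain (or update online)
    return flags
```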


Challenge:

The size of the training sample grows over time, leading to longer training times and a failure to detect outliers on time

Solutions:

Use a sliding window: drop the earliest data from the training sample while adding the new data, keeping the size of the training sample fixed (see the sketch below)

Use online learning techniques to update the model, avoiding retraining on the whole sample.
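For the sliding-window solution, Python's deque gives the behavior directly (window size 500 is an arbitrary illustrative choice):

```python
from collections import deque

# Appending to a full deque with maxlen silently drops the oldest point,
# so the training-sample size stays fixed.
window = deque(maxlen=500)
for x_new in (0.1, 0.4, 0.2):
    window.append(x_new)
    # retrain or update the model on list(window) here
```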


Classical Online Time Series Prediction Model


Training of A_{O,B} — a machine learning problem

Linear model: methods in previous section

Nonlinear model: support vector regression, online SVR [Ma and Perkins, 2003]


Support Vector Regression

Training set: {(x_i, y_i), i = 1, . . . , l}, where x_i ∈ R^B, y_i ∈ R

Construct the regression function

f(x) = W^T Φ(x) + b

where Φ(x) maps the input x to a vector in feature space

W and b are obtained by solving the standard ε-insensitive SVR problem:

min_{W, b, ξ, ξ*} (1/2)‖W‖^2 + C Σ_{i=1}^l (ξ_i + ξ_i^*)

s.t. y_i − W^T Φ(x_i) − b ≤ ε + ξ_i, W^T Φ(x_i) + b − y_i ≤ ε + ξ_i^*, ξ_i, ξ_i^* ≥ 0

The width of the band (between the dashed lines) is 2ε/‖W‖, so we minimize ‖W‖^2

The objective also penalizes data points whose y-value differs from f(x) by more than ε


The dual optimization problem (in its standard form):

max_{α, α*} −(1/2) Σ_{i,j} (α_i − α_i^*)(α_j − α_j^*) Q_ij − ε Σ_i (α_i + α_i^*) + Σ_i y_i (α_i − α_i^*)

s.t. Σ_i (α_i − α_i^*) = 0, 0 ≤ α_i, α_i^* ≤ C

where Q_ij = Φ(x_i)^T Φ(x_j) = K(x_i, x_j).

K(·, ·) is a predetermined nonlinear kernel function, and the solution of the SVR can be written as

f(x) = Σ_{i=1}^l (α_i − α_i^*) K(x_i, x) + b
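Evaluating this solution is straightforward; a sketch with a Gaussian RBF kernel (one common choice; the slides leave K generic, and `coef[i]` stands for α_i − α_i^*):

```python
import numpy as np

def rbf_kernel(xi, x, sigma=1.0):
    # K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2)); sigma is an assumed hyperparameter
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def svr_predict(x, support_vectors, coef, b, kernel=rbf_kernel):
    # f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b
    return sum(c * kernel(xi, x) for c, xi in zip(coef, support_vectors)) + b
```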


Online SVR: [Ma, Theiler and Perkins, 2003]

The trained SVR function

f(x) = Σ_{i=1}^l (α_i − α_i^*) K(x_i, x) + b

Update the coefficients (α_i − α_i^*) and b whenever a new data point is added

Faster than the batch SVR algorithm


Outlier Detection Criteria

An example [Ma and Perkins, 2003]:

Occurrence at time t: O(t) = 1{x_t − x̂_t ∉ (−ε, ε)}

- 1{·}: indicator function

- x̂_t: prediction of x_t

- ε: tolerance width

A surprise is observed at t if O(t) = 1

Event at time t: E_n(t) = [O(t), O(t + 1), · · · , O(t + n − 1)]^T

- n: event duration

If |E_n(t)| > h, then E_n(t) is defined as a novel event (outlier)

- | · |: the 1-norm

- h: a fixed lower bound
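These criteria are easy to state in code; a minimal sketch (my own reading of the definitions above, with `x_hat` the sequence of predictions):

```python
import numpy as np

def occurrences(x, x_hat, eps):
    # O(t) = 1 when the prediction error falls outside (-eps, eps)
    return (np.abs(np.asarray(x) - np.asarray(x_hat)) >= eps).astype(int)

def novel_events(O, n, h):
    # E_n(t) = [O(t), ..., O(t + n - 1)]; flag t when the 1-norm of E_n(t) exceeds h
    return [t for t in range(len(O) - n + 1) if O[t:t + n].sum() > h]
```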


Demo
