
Online Outlier Detection for Time Series

Tingyi Zhu

September 14, 2016


Outline

Overview of Online Machine Learning

Online Time Series Outlier Detection

Demo


Overview of Online Machine Learning


Introduction

Online learning

- a method of machine learning in which the data is accessed in sequential order

- the model is trained in consecutive steps

- the predictor is updated at each step

Batch (offline) learning

- learns on the entire training data set

- generates the best predictor at once


Batch Learning


Online Learning


Two scenarios where online algorithms are useful:

When dealing with a large dataset, it is computationally infeasible to train over the entire dataset

When new data is constantly being generated and predictions are needed instantly; cases include

- In financial markets: stock price prediction, portfolio selection

- Real-time recommendation

- Online ad placement

- Anomaly (outlier) detection


Statistical Learning Model

X : space of inputs

Y : space of outputs

To learn the function f : X → Y

Loss function: V : Y × Y → R

V(f(x), y) measures the difference between the predicted value f(x) and the true value y

Goal: Minimize the expected risk

min_f E[V(f(x), y)]


Given the training set

(x_1, y_1), . . . , (x_n, y_n)

Then minimize the empirical loss

min_f Σ_{i=1}^n V(f(x_i), y_i)


Online Model

Pure online learning model:

- At time t + 1, learning of f_{t+1} depends on the new input (x_{t+1}, y_{t+1}) and the previous best predictor f_t

- Needs extra stored information, whose size is usually independent of the training data size

Hybrid online learning model

- Learning of f_{t+1} depends on f_t and all previous data (x_1, y_1), . . . , (x_{t+1}, y_{t+1})

- The extra storage requirement is no longer constant; it depends on the training data size

- Takes less time to compute than batch learning


Example: Linear Least Square

x_j ∈ R^d, y_j ∈ R

Learning a linear function: f(x) = ⟨w, x⟩

Parameters to learn: w ∈ R^d

Square loss: V(f(x), y) = (f(x) − y)^2

Minimize the empirical loss:

Q(w) = Σ_{j=1}^n Q_j(w) = Σ_{j=1}^n V(⟨w, x_j⟩, y_j) = Σ_{j=1}^n (x_j^T w − y_j)^2


Batch Learning

Let X be the i × d data matrix

Y be the i × 1 vector of target values after the arrival of the first i datapoints

The covariance matrix Σ_i = X^T X

The best solution of w is

w = (X^T X)^{-1} X^T Y = Σ_i^{-1} Σ_{j=1}^i x_j y_j

Recompute the solution after the arrival of every datapoint
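A minimal numpy sketch of this naive refit-everything approach (my own illustration; the toy data and dimensions are made up, not from the talk):

```python
import numpy as np

def batch_ls(X, Y):
    # w = (X^T X)^{-1} X^T Y, via a linear solve rather than an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Toy stream: recompute the solution after the arrival of every datapoint.
rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(100, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=100)
for i in range(d, 101):
    w = batch_ls(X[:i], Y[:i])  # O(i d^2 + d^3) work at step i
```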


Complexity of Batch Learning

After the arrival of the ith datapoint,

Calculating Σ_i takes O(id^2) time

Inverting the d × d matrix Σ_i takes O(d^3) time

Computation time at the ith step: O(id^2 + d^3)

Total time with n total data points in the dataset:

Σ_i O(id^2 + d^3) = O(n^2 d^2 + nd^3)


Improving on Batch Learning

If we store the matrix Σ_i between steps:

Updating Σ at each step,

Σ_{i+1} = Σ_i + x_{i+1} x_{i+1}^T

only takes O(d^2) time

Total time reduces to Σ_i O(d^2 + d^3) = O(nd^3)

Additional storage space of O(d^2) is needed to store Σ_i
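A sketch of this rank-one update in numpy (my own illustration; the tiny ridge term added for early-step invertibility is an assumption, not from the slides):

```python
import numpy as np

d = 5
Sigma = np.zeros((d, d))  # running X^T X
c = np.zeros(d)           # running X^T Y = sum_j x_j y_j

def add_point_and_solve(x, y):
    global Sigma, c
    Sigma += np.outer(x, x)  # Sigma_{i+1} = Sigma_i + x x^T, O(d^2) time
    c += x * y
    # The solve still costs O(d^3) per step; the jitter keeps Sigma invertible
    return np.linalg.solve(Sigma + 1e-9 * np.eye(d), c)
```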


Online Learning: Recursive Least Square

Initialization: w_0 = 0 ∈ R^d, Γ_0 = I ∈ R^{d×d}

The LS solution can be computed by the following iteration:

Γ_i = Γ_{i−1} − (Γ_{i−1} x_i x_i^T Γ_{i−1}) / (1 + x_i^T Γ_{i−1} x_i)

w_i = w_{i−1} − Γ_i x_i (x_i^T w_{i−1} − y_i)

The w computed above is the same as the batch-learning solution, which can be proved by induction on i

Complexity: O(nd^2) for n steps

Requires storage of the matrix Γ, which is constant at O(d^2)
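A minimal Python sketch of the recursion above (an illustration, not the speaker's code):

```python
import numpy as np

class RecursiveLS:
    # Recursive least squares: O(d^2) time per update, O(d^2) storage for Gamma.
    def __init__(self, d):
        self.w = np.zeros(d)    # w_0 = 0
        self.Gamma = np.eye(d)  # Gamma_0 = I

    def update(self, x, y):
        Gx = self.Gamma @ x     # Gamma_{i-1} x_i
        # Gamma_i = Gamma_{i-1} - Gamma_{i-1} x x^T Gamma_{i-1} / (1 + x^T Gamma_{i-1} x)
        self.Gamma -= np.outer(Gx, Gx) / (1.0 + x @ Gx)
        # w_i = w_{i-1} - Gamma_i x_i (x_i^T w_{i-1} - y_i)
        self.w -= (self.Gamma @ x) * (x @ self.w - y)
        return self.w
```

Each update is a Sherman–Morrison step on Γ, which is why no O(d^3) inversion is ever needed.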


For the case when Σ_i is not invertible, consider the regularized version of the loss function

Q(w) = Σ_{j=1}^n (x_j^T w − y_j)^2 + λ‖w‖_2^2

The same recursive algorithm works with a minor change in initialization:

Γ_0 = (I + λI)^{−1}


Stochastic Gradient Descent

Recall standard gradient descent (also called batch gradient descent), which deals with the problem of minimizing an objective function Q(w)

The solution w is found through the iteration

w := w − η∇Q(w)

When the objective has the form of a sum, Q(w) = Σ_{j=1}^n Q_j(w),

the standard gradient descent performs the iteration:

w := w − η Σ_{j=1}^n ∇Q_j(w)
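For comparison with the SGD variant that follows, a small sketch of this full-gradient iteration for the squared loss (the step size η and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def batch_gd(X, Y, eta=1e-3, steps=1000):
    # Each iteration sums the gradients of all n summands: O(nd) work per step
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - Y)  # sum_j grad Q_j(w), Q_j(w) = (x_j^T w - y_j)^2
        w -= eta * grad
    return w
```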


Each summand function Q_j is typically associated with the j-th observation in the data set

When the sample size n is large, standard GD requires expensive evaluations of the gradients of all summand functions, which is redundant because it computes gradients of very similar functions at each update

SGD: update the solution by computing the gradient of a single Q_j

w := w − η∇Q_j(w)


SGD for Online Learning

Linear least squares:

min_w Q(w) = Σ_{j=1}^n Q_j(w) = Σ_{j=1}^n V(⟨w, x_j⟩, y_j) = Σ_{j=1}^n (x_j^T w − y_j)^2

After the arrival of the ith datapoint (x_i, y_i), update the solution by

w_i = w_{i−1} − γ_i ∇V(⟨w, x_i⟩, y_i) = w_{i−1} − γ_i x_i (x_i^T w_{i−1} − y_i)

which is similar to the updating procedure in recursive least squares:

w_i = w_{i−1} − Γ_i x_i (x_i^T w_{i−1} − y_i)
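As a one-line sketch, the SGD update for this loss (the factor 2 from the gradient is absorbed into γ, matching the slide's form):

```python
import numpy as np

def sgd_step(w, x, y, gamma):
    # w_i = w_{i-1} - gamma_i * x_i (x_i^T w_{i-1} - y_i): O(d) work per datapoint
    return w - gamma * x * (x @ w - y)
```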


The complexity of the SGD algorithm for n steps reduces to O(nd)

The storage requirement is constant at O(d)

Consistency of SGD:

When the learning rate γ decreases at an appropriate rate, SGD shows the same convergence behavior as batch gradient descent

Averaging SGD: choose γ_i = 1/√i and keep track of

w̄_n = (1/n) Σ_{i=1}^n w_i

which is consistent.
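A sketch of averaging SGD under these choices (my own illustration; `stream` is any iterable of (x, y) pairs):

```python
import numpy as np

def averaged_sgd(stream, d):
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for i, (x, y) in enumerate(stream, start=1):
        w = w - (1.0 / np.sqrt(i)) * x * (x @ w - y)  # gamma_i = 1/sqrt(i)
        w_bar += (w - w_bar) / i  # running mean of the iterates w_1, ..., w_i
    return w_bar
```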


Figure: SGD fluctuation (Source: Wikipedia)

SGD performs frequent updates and is faster

But the high variance of the updates causes heavy fluctuation

This complicates convergence to the exact minimum


Mini-Batch Gradient Descent

Use b datapoints in each iteration

A generalization of, and compromise between, batch GD and SGD

Given b datapoints (x_{i+1}, y_{i+1}), . . . , (x_{i+b}, y_{i+b}):

w := w − γ · (1/b) Σ_{k=i+1}^{i+b} x_k (x_k^T w − y_k)

b typically ranges from 2 to 100

For online learning, b is much smaller than n
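A sketch of one mini-batch update for the squared loss (illustration only; `X_batch` stacks the b inputs as rows):

```python
import numpy as np

def minibatch_step(w, X_batch, Y_batch, gamma):
    # Averaged gradient over the b points in the batch
    b = len(Y_batch)
    return w - gamma * (X_batch.T @ (X_batch @ w - Y_batch)) / b
```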


Other Topics in Online Learning

Incremental (Online) PCA

Incremental (Online) SVD

Adversarial Model

etc.


Online Time Series Outlier Detection


The Problem

Given: a time series x_1, . . . , x_t, with new data points coming in as time goes on

Goal: upon arrival of a new data point, determine whether it is an outlier

Applications:

- Financial markets: abrupt changes in stock price

- Health data: blood pressure, heart beats

- Intrusion detection: sudden increase in login attempts


The Framework

Initial series x_1, . . . , x_n as the training sample

Use a prediction algorithm, which can be adapted to the online setting, to predict x_{n+1}

Set criteria for identifying outliers or outlier events (subsequences), based on the prediction x̂_{n+1} and the new data point x_{n+1} that comes in

Update the training series to x_1, . . . , x_n, x_{n+1}, and predict x_{n+2} (a sketch of this loop follows)
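A schematic sketch of the loop; `fit`, `predict`, and `is_outlier` are hypothetical placeholders for a concrete model and outlier criterion:

```python
def online_detect(stream, train, fit, predict, is_outlier):
    # Naive version: retrains on the full (growing) series after every point.
    model = fit(train)
    flags = []
    for x_new in stream:
        x_hat = predict(model)              # forecast the next point
        flags.append(is_outlier(x_hat, x_new))
        train.append(x_new)                 # the training series grows by one
        model = fit(train)                  # retrain (or update online)
    return flags
```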


Challenge:

The size of the training sample grows over time, leading to longer training times and a failure to detect outliers on time

Solutions:

Use a sliding window: drop the earliest data from the training sample while adding the new data, keeping the size of the training sample fixed (see the sketch below)

Use online learning techniques to update the model, avoiding retraining on the whole sample.
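For the sliding-window solution, Python's deque gives the behavior directly (window size 500 is an arbitrary illustrative choice):

```python
from collections import deque

# Appending to a full deque with maxlen silently drops the oldest point,
# so the training-sample size stays fixed.
window = deque(maxlen=500)
for x_new in (0.1, 0.4, 0.2):
    window.append(x_new)
    # retrain or update the model on list(window) here
```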


Classical Online Time Series Prediction Model


Training of A_{O,B} — a machine learning problem

Linear model: methods in previous section

Nonlinear model: support vector regression, online SVR [Ma and Perkins, 2003]


Support Vector Regression

Training set: {(x_i, y_i), i = 1, . . . , l}, where x_i ∈ R^B, y_i ∈ R

Construct the regression function

f(x) = W^T Φ(x) + b

where Φ(x) maps the input x to a vector in feature space

W and b are obtained by solving the standard ε-insensitive SVR problem:

min_{W, b, ξ, ξ*} (1/2)‖W‖^2 + C Σ_{i=1}^l (ξ_i + ξ_i^*)

s.t. y_i − W^T Φ(x_i) − b ≤ ε + ξ_i, W^T Φ(x_i) + b − y_i ≤ ε + ξ_i^*, ξ_i, ξ_i^* ≥ 0

The width of the band (between the dashed lines) is 2ε/‖W‖, so we minimize ‖W‖^2

The objective also penalizes data points whose y-value differs from f(x) by more than ε


The dual optimization problem (in its standard form):

max_{α, α*} −(1/2) Σ_{i,j} (α_i − α_i^*)(α_j − α_j^*) Q_ij − ε Σ_i (α_i + α_i^*) + Σ_i y_i (α_i − α_i^*)

s.t. Σ_i (α_i − α_i^*) = 0, 0 ≤ α_i, α_i^* ≤ C

where Q_ij = Φ(x_i)^T Φ(x_j) = K(x_i, x_j).

K(·, ·) is a predetermined nonlinear kernel function, and the solution of the SVR can be written as

f(x) = Σ_{i=1}^l (α_i − α_i^*) K(x_i, x) + b
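Evaluating this solution is straightforward; a sketch with a Gaussian RBF kernel (one common choice; the slides leave K generic, and `coef[i]` stands for α_i − α_i^*):

```python
import numpy as np

def rbf_kernel(xi, x, sigma=1.0):
    # K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2)); sigma is an assumed hyperparameter
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def svr_predict(x, support_vectors, coef, b, kernel=rbf_kernel):
    # f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b
    return sum(c * kernel(xi, x) for c, xi in zip(coef, support_vectors)) + b
```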


Online SVR: [Ma, Theiler and Perkins, 2003]

The trained SVR function

f(x) = Σ_{i=1}^l (α_i − α_i^*) K(x_i, x) + b

Update the coefficients (α_i − α_i^*) and b whenever a new data point is added

Faster than the batch SVR algorithm


Outlier Detection Criteria

An example [Ma and Perkins, 2003]:

Occurrence at time t: O(t) = 1{x_t − x̂_t ∉ (−ε, ε)}

- 1{·}: indicator function

- x̂_t: prediction of x_t

- ε: tolerance width

A surprise is observed at t if O(t) = 1

Event at time t: E_n(t) = [O(t), O(t + 1), · · · , O(t + n − 1)]^T

- n: event duration

If |E_n(t)| > h, then E_n(t) is defined as a novel event (outlier)

- | · |: the 1-norm

- h: a fixed lower bound
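These criteria are easy to state in code; a minimal sketch (my own reading of the definitions above, with `x_hat` the sequence of predictions):

```python
import numpy as np

def occurrences(x, x_hat, eps):
    # O(t) = 1 when the prediction error falls outside (-eps, eps)
    return (np.abs(np.asarray(x) - np.asarray(x_hat)) >= eps).astype(int)

def novel_events(O, n, h):
    # E_n(t) = [O(t), ..., O(t + n - 1)]; flag t when the 1-norm of E_n(t) exceeds h
    return [t for t in range(len(O) - n + 1) if O[t:t + n].sum() > h]
```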


Demo
