

Faster and Parameter-Free Discord Search in Quasi-Periodic Time Series

Wei Luo and Marcus Gallagher

The University of Queensland, Australia
{luo,marcusg}@itee.uq.edu.au

Abstract. Time series discord has proven to be a useful concept for time-series anomaly identification. To search for discords, various algorithms have been developed. Most of these algorithms rely on pre-building an index (such as a trie) for subsequences. Users of these algorithms are typically required to choose optimal values for word-length and/or alphabet-size parameters of the index, which are not intuitive. In this paper, we propose an algorithm to directly search for the top-K discords, without the requirement of building an index or tuning external parameters. The algorithm exploits quasi-periodicity present in many time series. For quasi-periodic time series, the algorithm gains significant speedup by reducing the number of calls to the distance function.

Keywords: Time Series Discord, Minimax Search, Time Series Data Mining, Anomaly Detection, Periodic Time Series.

1 Introduction

Periodic and quasi-periodic time series appear in many data mining applications, often due to internal closed-loop regulation or external phase-locking forces on the data sources. A temporary deviation of a time series from a periodic or quasi-periodic pattern constitutes a major type of anomaly in many applications. For example, an electrocardiography (ECG) recording is nearly periodic, as is one's heartbeat. Figure 1 shows an ECG signal where a disruption of periodicity is highlighted. This disruption of periodicity actually indicates a Premature Ventricular Contraction (PVC) arrhythmia [3]. As another example, Figure 4 shows the number of beds occupied in a tertiary hospital. The time series suggests a weekly pattern: busy weekdays followed by quieter weekends. If the weekly pattern is disrupted, then chaos often follows, with elective surgeries being canceled and the emergency department being overcrowded, greatly impacting patient satisfaction and health care quality.

Time series discord captures the idea of anomalous subsequences in time series and has proven to be useful in a diverse range of applications (see for example [5,1,11]). Intuitively, a discord of a time series is a subsequence with the largest distance from all other non-overlapping subsequences in the time series. Similarly, the 2nd discord is a subsequence with the second largest distance from all other non-overlapping subsequences. More generally, one can search for the top-K discords [1]. Finding the discord for a time series in general requires comparisons among $O(m^2)$ pairwise distances, where $m$ is the length of the time series. Despite past efforts in building heuristics (e.g., [5,1]), searching for the discord still requires expensive computation, making real-time interaction with domain experts difficult. In addition, most existing algorithms are based on the idea of indexing subsequences with a data structure such as a trie. Such data structures often have unintuitive parameters (e.g., word length and alphabet size) to tune. This means time-consuming trial and error that compromises the efficiency of the algorithms.

J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part II, LNAI 6635, pp. 135–148, 2011. © Springer-Verlag Berlin Heidelberg 2011

Fig. 1. An ECG time series that demonstrates periodicity, baseline shift, and a discord (x-axis: index, 0 to 5000; the 1st discord is marked). The time series is the second-lead signal from dataset xmitdb_x108_0 of [6]. According to [3], the ECG was recorded at a sampling frequency of 360 Hz. The unit of measurement is unknown to the authors.

Fig. 2. Illustration of Proposition 1 (x-axis: $p$, 0 to 5000; y-axis: distance). The blue solid line represents the true $d$ for the time series xmitdb_x108_0 (with subsequence length 360). The red dashed line represents an estimate $\hat{d}$ for $d$. Although at many locations $\hat{d}$ is very different from $d$, the maximum of $\hat{d}$ coincides with the maximum of $d$.

Keogh, Lin, and Fu first defined time series discords and proposed a search algorithm named HOT SAX in [5]. A memory-efficient search algorithm was also proposed later [11]. HOT SAX builds on the idea of discretizing and indexing time series subsequences. To select the lengths for index keys, wavelet decomposition can be used ([2,1]). Most recently, adaptive discretization has been proposed to improve the index for efficient discord search ([8]). In this paper, we propose a fast algorithm to find the top-K discords in a time series without prebuilding an index or tuning parameters. For periodic or quasi-periodic time series, the algorithm finds the discord with much less computation, compared to results previously reported in the literature (e.g., [5]). After finding the 1st discord, our algorithm finds subsequent discords with even less computation, often 50% less. We tested our algorithm with a collection of datasets from [6] and [4]. The diversity of the collection shows that the definition of "quasi-periodicity" can be very relaxed for our algorithm to achieve search efficiency. Periodicity of a time series can be easily assessed through visual inspection. The experiments with artificially generated non-periodic random walk time series showed increased running time, but the algorithm is still hundreds of times faster than the brute-force search, without tuning any parameter.

The paper is organized as follows. Section 2 reviews the definition of time-series discord and existing algorithms for discord search. Section 3 introduces our direct search algorithm and explains the ideas behind it. Section 4 presents an empirical evaluation of the new algorithm and a comparison with the results of HOT SAX from [5]. Section 5 concludes the paper.

2 Time Series Discords

This section reviews the definition of time-series discord and major search algorithms.

Notation. In this paper, $T \triangleq (t_1, \ldots, t_m)$ denotes a time series of length $m$. In addition, $T[p; n]$ denotes the length-$n$ subsequence of $T$ with beginning position $p$. The distance between two length-$n$ subsequences $T[p; n]$ and $T[q; n]$ is denoted $\mathrm{dist}_{T,n}(p, q)$. Following [5], we consider by default the Euclidean distance between two standardized subsequences: all subsequences are standardized to have a mean of 0 and a standard deviation of 1. Nevertheless, the results in this paper apply to other definitions of distance. Given a subsequence $T[p; n]$, the minimum distance between $T[p; n]$ and any non-overlapping subsequence $T[q; n]$ is denoted $d_{p,n}$ (i.e., $d_{p,n} = \min_{q : |p-q| \ge n} \mathrm{dist}_{T,n}(p, q)$). As $n$ is a constant, we often write $d_p$ for $d_{p,n}$. Finally, we use $d$ to denote the vector $(d_1, d_2, \ldots, d_{m-n+1})$ and use $\hat{d}$ and $\hat{d}_p$ to denote estimates for $d$ and $d_p$ respectively.

For a time series of length $m$, there are at most $\frac{1}{2}(m - n - 1)(m - n - 2) + 1$ distinct $\mathrm{dist}_{T,n}(p, q)$ values. In particular, $\mathrm{dist}_{T,n}(p, q) = \mathrm{dist}_{T,n}(q, p)$ and $\mathrm{dist}_{T,n}(p, p) = 0$. Figure 3 shows a heatmap of the distances $\mathrm{dist}_{T,n}(p, q)$ for all $p$ and $q$ values of the time series xmitdb_x108_0 (see Figure 1).
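The notation above can be made concrete with a minimal Python sketch. It assumes 0-based positions (the text uses 1-based positions), the population standard deviation for standardization, and a convention that a constant window standardizes to all zeros; none of these details are fixed by the text. The names `znorm`, `dist`, and `nn_dist` are hypothetical.

```python
import math

def znorm(x):
    """Standardize x to mean 0 and standard deviation 1.

    Assumption: population standard deviation; a constant window maps to zeros.
    """
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [0.0] * len(x) if sd == 0.0 else [(v - mu) / sd for v in x]

def dist(T, n, p, q):
    """dist_{T,n}(p, q): Euclidean distance between standardized T[p; n] and T[q; n]."""
    a, b = znorm(T[p:p + n]), znorm(T[q:q + n])
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def nn_dist(T, n, p):
    """d_p: smallest distance from T[p; n] to any non-overlapping subsequence."""
    return min(dist(T, n, p, q)
               for q in range(len(T) - n + 1) if abs(p - q) >= n)
```

On an exactly periodic toy series such as `[0, 1, 2, 1] * 6` with `n = 4`, every subsequence has an identical non-overlapping copy, so `nn_dist` is zero everywhere.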

Fig. 3. Distribution of $\mathrm{dist}_{T,360}(p, q)$, where $T$ is the time series xmitdb_x108_0

Fig. 4. Hourly bed occupancy in a tertiary hospital for two months (x-axis: date, Sep to Nov; y-axis: number of occupied beds, ticks at 750, 850, and 950)


The following definition is reformulated from [5].

Definition 1 (Discord). Let $T$ be a sequence of length $m$. A subsequence $T[p^{(1)}; n]$ is the first discord (or simply the discord) of length $n$ for $T$ if

$$p^{(1)} = \operatorname*{argmax}_{p} \{ d_p : 1 \le p \le m - n + 1 \}. \tag{1}$$

Intuitively, a discord is the most "isolated" length-$n$ subsequence in the space $\mathbb{R}^n$. Subsequent discords (the second discord, the third discord, and so on) of a time series are defined inductively as follows.

Definition 2. Let $T[p^{(1)}; n], T[p^{(2)}; n], \ldots, T[p^{(k-1)}; n]$ be the top $k - 1$ discords of length $n$ for a time series $T$. Subsequence $T[p^{(k)}; n]$ is the $k$-th discord of length $n$ for $T$ if

$$p^{(k)} = \operatorname*{argmax}_{p} \{ d_p : |p - p^{(i)}| \ge n \text{ for all } i < k \}.$$

Note that the values for both $n$ and $k$ should be determined by the application; they are independent of the search algorithm. If a user were looking for the three most unusual weeks in the bed occupancy example (Figure 4), $k$ would be 3 and $n$ would be $7 \times 24 = 168$, assuming the time series is sampled hourly. Strictly speaking, the discord is not well defined, as there may be more than one location $p$ that maximizes $d_p$ (i.e., $d_{p_1} = d_{p_2} = \max_p \{ d_p : 1 \le p \le m - n + 1 \}$). But the ambiguity rarely matters in most applications, especially when the top-K discords are searched in a batch. In this paper, we shall follow the existing literature [5] and assume that all $d_p$'s have distinct values.
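Definitions 1 and 2 translate directly into a brute-force search, sketched below in Python as a reference point only (it is not the algorithm proposed in this paper). The sketch assumes 0-based positions, population-standard-deviation standardization, `len(T) >= 3 * n` so that every position has a non-overlapping partner, and it breaks ties by taking the leftmost maximizer; `top_k_discords` is a hypothetical name.

```python
import math

def dist(T, n, p, q):
    """Euclidean distance between standardized T[p; n] and T[q; n] (0-based)."""
    def z(x):
        mu = sum(x) / len(x)
        sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
        return [0.0] * len(x) if sd == 0.0 else [(v - mu) / sd for v in x]
    a, b = z(T[p:p + n]), z(T[q:q + n])
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def top_k_discords(T, n, k):
    """Return (start positions of the top-k discords, the full d vector)."""
    last = len(T) - n
    # d[p] = distance from T[p; n] to its nearest non-overlapping subsequence.
    d = [min(dist(T, n, p, q) for q in range(last + 1) if abs(p - q) >= n)
         for p in range(last + 1)]
    found = []
    for _ in range(k):
        # Definition 2: the k-th discord must not overlap earlier discords.
        allowed = [p for p in range(last + 1)
                   if all(abs(p - f) >= n for f in found)]
        if not allowed:
            break
        found.append(max(allowed, key=lambda p: d[p]))
    return found, d
```

For example, on `T = [0, 1, 2, 1] * 12` with the single value `T[22]` replaced by 10 and `n = 4`, the first discord starts somewhere in positions 19 to 22, the only windows that contain the spike.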

The discord has a formulation similar to the minimax problem in game theory. Note that

$$\max_{p} \min_{q} \{ \mathrm{dist}_{T,n}(p, q) : |p - q| \ge n \} \le \min_{p} \max_{q} \{ \mathrm{dist}_{T,n}(p, q) : |p - q| \ge n \}.$$

According to Sion's minimax theorem [9], the equality holds if $\mathrm{dist}_{T,n}(p, \cdot)$ is quasi-concave on $q$ for every $p$ and $\mathrm{dist}_{T,n}(\cdot, q)$ is quasi-convex on $p$ for every $q$. Figure 3 indicates, however, that in general $\mathrm{dist}_{T,n}(p, \cdot)$ is not quasi-concave, nor is $\mathrm{dist}_{T,n}(\cdot, q)$ quasi-convex, and no global saddle point exists. That suggests searching for discords requires a strategy different from those used in game theory. In the worst case, searching for the discord has the complexity $O(m^2)$, essentially requiring brute-force computation of the pairwise distances of all length-$n$ subsequences of the time series. When $m = 10^4$, that means 100 million calls to the distance function. Nevertheless, the following sufficient condition for the discord suggests a search strategy better than the brute-force computation.

Observation 1. Let $T$ be a time series. A subsequence $T[p^*; n]$ is the discord of length $n$ if there exists $d^*$ such that

$$\forall q : |p^* - q| > n \Rightarrow \mathrm{dist}_{T,n}(p^*, q) \ge d^*, \text{ and} \tag{2}$$

$$\forall p \ne p^*,\ \exists q : (|p - q| > n) \wedge (\mathrm{dist}_{T,n}(p, q) < d^*). \tag{3}$$


In general, there are infinitely many $d^*$ that satisfy Clause (2) and Clause (3). Suppose we have a good guess $d^*$. Clause (3) implies that a false candidate for the discord can be refuted, potentially in fewer than $m$ steps. Clause (2) implies that, given that all false candidates have been refuted, the true candidate for the discord can be verified in $m - n + 1$ steps. Hence in the best case, $(m - n + 1) + (m - 1) = 2m - n$ calls to the distance function are sufficient to verify the discord. To estimate $d^*$, we can start with the value of $d_p$ where $p$ is a promising candidate for the discord, and later increase the guess to a larger value $d_{p'}$ if $p'$ is not refuted (i.e., $\mathrm{dist}_{T,n}(p', q) > d_p$ for every non-overlapping $q$) and becomes the next candidate. This hill-climbing process goes on until all but one of the subsequences are refuted with the updated value of $d^*$.
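To give a feel for the gap between the best case and brute force, the counts can be worked out for one hypothetical configuration. The value $m = 10^4$ is the one used in the text; pairing it with $n = 360$ is an assumption made here purely for illustration.

```latex
\begin{align*}
\text{best case:}\quad & (m - n + 1) + (m - 1) = 2m - n = 2 \cdot 10^{4} - 360 = 19{,}640 \text{ calls},\\
\text{brute force:}\quad & O(m^{2}) \approx (10^{4})^{2} = 10^{8} \text{ calls}.
\end{align*}
```

Under these assumed values the best case is roughly four orders of magnitude cheaper, which is why a good initial guess for $d^*$ matters so much.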

This idea forms the basis of most existing discord search algorithms (e.g., HOT SAX in [5] and WAT in [1]); the common structure of these algorithms is shown in Figure 5. With this base algorithm, the efficiency of a search then depends on the order of subsequences in the Outer and Inner loops (see lines 2 and 3). Intuitively, the Outer loop should rank $p$ according to the singularity of subsequence $T[p; n]$; the Inner loop should rank $q$ according to the proximity between subsequences $T[p; n]$ and $T[q; n]$. Both HOT SAX and WAT adopt the following strategy. First, all subsequences of length $n$ are discretized and compressed into shorter strings. Then the strings are indexed with a suffix trie; in the ideal situation, subsequences close in distance also share an index key or occupy neighboring index keys in the trie. This is not so different from the idea of hashing to achieve $O(1)$ search time. In the end, all subsequences will be indexed into a number of buckets on the terminal nodes. The hope is that, with careful selection of string length and alphabet size, the discord will fall into a bucket containing very few subsequences while a non-discord subsequence will fall into a bucket shared with similar subsequences. Then the uneven distribution of subsequences among the buckets can be exploited to devise an efficient ordering for the Outer and Inner loops.

1: Select a $p_0$ and let $d^* \leftarrow d_{p_0}$ and $p^* \leftarrow p_0$. {Initialization}
2: for all the remaining locations $p$ ordered by certain heuristic Outer do {Outer Loop}
3:   for all locations $q$ ordered by some heuristic Inner such that $|p - q| \ge n$ do {Inner Loop}
4:     if $\mathrm{dist}_{T,n}(p, q) < d^*$ then
5:       According to Clause (3) in Observation 1, $T[p; n]$ cannot be the discord; break to next $p$.
6:     end if
7:   end for
8:   if $\min_q \mathrm{dist}_{T,n}(p, q) > d^*$ then
9:     As Clause (2) in Observation 1 is not met, update $d^* \leftarrow \min_q \mathrm{dist}_{T,n}(p, q)$ and $p^* \leftarrow p$.
10:  end if
11: end for
12: return $p^*$ and $d^*$

Fig. 5. Base algorithm for HOT SAX and WAT
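The base algorithm of Figure 5 can be sketched in Python as follows. This is only an illustration of the early-abandoning structure: the Outer and Inner orderings are assumed to be plain left-to-right scans rather than the index-driven heuristics of HOT SAX or WAT, positions are 0-based, standardization uses the population standard deviation, and the name `base_discord_search` is hypothetical. Starting with $d^* = -\infty$ is equivalent to taking the first position as $p_0$.

```python
import math

def dist(T, n, p, q):
    """Euclidean distance between standardized T[p; n] and T[q; n] (0-based)."""
    def z(x):
        mu = sum(x) / len(x)
        sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
        return [0.0] * len(x) if sd == 0.0 else [(v - mu) / sd for v in x]
    a, b = z(T[p:p + n]), z(T[q:q + n])
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def base_discord_search(T, n):
    """Return (p*, d*, number of distance calls) following Figure 5."""
    last = len(T) - n
    best_p, best_d, calls = None, -math.inf, 0
    for p in range(last + 1):          # Outer loop (assumed order: left to right)
        nearest, refuted = math.inf, False
        for q in range(last + 1):      # Inner loop (assumed order: left to right)
            if abs(p - q) < n:
                continue               # skip overlapping subsequences
            calls += 1
            d_pq = dist(T, n, p, q)
            if d_pq < best_d:          # Clause (3): p cannot be the discord
                refuted = True
                break
            nearest = min(nearest, d_pq)
        if not refuted and best_d < nearest < math.inf:
            best_p, best_d = p, nearest   # p becomes the best candidate so far
    return best_p, best_d, calls
```

Because a candidate is abandoned as soon as one sufficiently close neighbor is found, the number of distance calls depends heavily on how early the true discord is visited, which is exactly the role of the Outer and Inner heuristics.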

This ingenious approach, however, has two drawbacks. First, one needs to select optimal parameters that balance the index size and the bucket size, which are critical to the search efficiency. For example, to use HOT SAX, one needs to set the alphabet size and the word size for the discretized subsequences [5, Section 4.2]; WAT automates the selection of word size, but still requires setting the alphabet size [1, Section 3.2]. Such parameters are not always intuitive to a user; the difficulty of building a usable trie is discussed in [11, Section 2]. Second, the above approach uses a fixed or random order in the outer loop to search for all top-K discords. A dynamic ordering for the outer loop could potentially make better use of the information gained in the previous search steps. Also, it is not clear how knowledge gained in finding the $k$-th discord can help in finding the $(k + 1)$-th discord. In [1, Section 3.6], partial information about $d$ is cached so that the inner loop may break quickly. But as caching works at the "easy" part of the search space, where $d_p$ is small, it is not clear how much computation is saved.

In the following section, we address the above issues by proposing a direct way to search for multiple discords. In particular, our algorithm requires no ancillary index (and hence no parameters to tune), and the algorithm reuses the knowledge gained in searching for the first $k$ discords to speed up the search for the $(k + 1)$-th discord.

3 Direct Discord Search

In Definition 1, the formula $p^{(1)} = \operatorname{argmax}_p \{ d_p : 1 \le p \le m - n + 1 \}$ suggests a direct way to search for the discord with the following two steps:

Step 1: Compute an estimate $\hat{d}_p$ of $d_p$ for each $p$.
Step 2: Let $p^* \triangleq \operatorname{argmax}_p \{ \hat{d}_p : 1 \le p \le m - n + 1 \}$, and verify that $T[p^*; n]$ is the discord.

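Step 2 can be carried out by testing the condition $d_{p^*} \ge \max_p \hat{d}_p$, as justified by the following proposition.

Proposition 1. Let $\hat{d}$ be an estimate of $d$ such that $\hat{d} \succeq d$ (i.e., $\hat{d}_p \ge d_p$ for every $p$). If $d_{p^*} \ge \max_p \hat{d}_p$, then $d_{p^*} \ge \max_p d_p$.

Proof. With $\hat{d} \succeq d$, we have $d_{p^*} \ge \max_p \hat{d}_p \ge \max_p d_p$.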

Proposition 1 gives a sufficient condition for verifying the discord of a time series. It shows that $\hat{d}$ does not have to be close to $d$ at every location $p$. To verify the discord, it suffices to have $\hat{d} \succeq d$ and $\max d \ge \max \hat{d}$. This point is illustrated in Figure 2.

To estimate $d_p = \min_q \mathrm{dist}(p, q)$ in Step 1, we can use $\hat{d}_p \triangleq \min_{q \in Q_p} \mathrm{dist}(p, q)$. Here $Q_p$ is a subset of $\{ q : |p - q| > n \}$; hence $\hat{d}_p \ge d_p$. As $Q_p$ includes more locations, the error $\hat{d}_p - d_p$ becomes smaller. If $Q_p = \{ q : |p - q| > n \}$, then $\hat{d}_p - d_p = 0$. By controlling the size of $Q_p$, we can control the accuracy of $\hat{d}_p$ for different $p$. Therefore Proposition 1 justifies the search strategy shown in Figure 6.

For top-K discord search, the while loop (Lines 2-10) is repeated K times (with proper bookkeeping to exclude overlapping subsequences). As the $\hat{d}_p$'s keep decreasing during the computation, each repetition starts with a better estimate $\hat{d}$ in Line 3.


1: For each $p$, estimate $\hat{d}_p \triangleq \min_{q \in Q_p} \mathrm{dist}(p, q)$, where $Q_p$ is a subset of $\{ q : |p - q| > n \}$.
2: while the discord has not been found do
3:   $p^* \leftarrow \operatorname{argmax}_p \{ \hat{d}_p \}$.
4:   Compute $d_{p^*} \triangleq \min_q \mathrm{dist}(p^*, q)$.
5:   if $d_{p^*} > \hat{d}_p$ for all $p \ne p^*$ then
6:     return $p^*$ as the discord starting location.
7:   else
8:     Decrease $\hat{d}$ by enlarging the $Q_p$'s.
9:   end if
10: end while

Fig. 6. Base algorithm for direct discord search
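The loop of Figure 6 can be sketched in Python under several simplifying assumptions. The initial estimate uses a single sampled location per position ($|Q_p| = 1$, taking $q = p + n$ or $q = p - n$), the refinement in Line 8 only replaces $\hat{d}_{p^*}$ by its exact value rather than traversing neighbors, ties are accepted with $\ge$ instead of the strict inequality in Line 5, positions are 0-based, and the series is assumed to satisfy `len(T) >= 3 * n`. The name `direct_discord_search` is hypothetical.

```python
import math

def dist(T, n, p, q):
    """Euclidean distance between standardized T[p; n] and T[q; n] (0-based)."""
    def z(x):
        mu = sum(x) / len(x)
        sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
        return [0.0] * len(x) if sd == 0.0 else [(v - mu) / sd for v in x]
    a, b = z(T[p:p + n]), z(T[q:q + n])
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def direct_discord_search(T, n):
    """Return (p*, d_{p*}, number of distance calls) following Figure 6."""
    if len(T) < 3 * n:
        raise ValueError("this sketch assumes len(T) >= 3 * n")
    last = len(T) - n
    positions = range(last + 1)
    # Line 1: cheap upper bounds on d_p from one non-overlapping location each.
    est = [dist(T, n, p, p + n if p + n <= last else p - n) for p in positions]
    calls = len(est)
    while True:
        p_star = max(positions, key=est.__getitem__)              # Line 3
        others = [q for q in positions if abs(p_star - q) >= n]
        exact = min(dist(T, n, p_star, q) for q in others)        # Line 4
        calls += len(others)
        est[p_star] = exact
        # Line 5 (Proposition 1): exact d_{p*} dominates every upper bound.
        if all(exact >= est[p] for p in positions if p != p_star):
            return p_star, exact, calls
        # Line 8 (simplified): only est[p_star] was tightened; try again.
```

Each pass either returns or turns one more upper bound into an exact value, so the loop stops after at most one pass per position plus a final confirming pass. On series where only a few estimates are large, far fewer exact computations are needed than in a brute-force scan.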

3.1 Efficient Way to Estimate d

Figure 2 suggests that to find the discord, it is not necessary to have a highly accurate estimate of $d_p$ for every $p$. Instead, a highly accurate $\hat{d}_p$ is needed only when $d_p$ is relatively large. To minimize the total computation cost, we should distribute computational resources according to the importance of $d_p$.

We propose three operations to estimate $d_p$, with increasing levels of computation cost.

1. Traversing: Suppose that $\mathrm{dist}(p, q_p)$ is known to be small for some $q_p$. For small integers $k$, let $Q_{p+k}$ contain one location $q_p + k$. Intuitively, traversing translates to searching along the 45-degree lines in Figure 3.

2. Sampling: Let $Q_p$ be a set of locations if $d_p$ is likely to be large or knowledge of $d_p$ is unavailable. We shall see a way to construct such $Q_p$ using the periodicity of the time series.

3. Exhausting: Let $Q_p$ be all possible locations if the exact value of $d_p$ is required.

Note that the most expensive Exhausting operation is needed only in verifyingthe discord (Line 4 in Figure 6).

The Traversing operation can be justified with the following argument. For a relatively large $n$, $\mathrm{dist}_{T,n}(p, q) \approx \mathrm{dist}_{T,n}(p + 1, q + 1)$. If $\mathrm{dist}_{T,n}(p, q_p)$ is small, then $\mathrm{dist}_{T,n}(p + 1, q_p + 1)$ is likely to be small as well. The argument can be "telescoped" to other $k$ values as long as $k / n$ is small enough. This is demonstrated in Figure 7, where local minima for the $s_p$'s tend to cluster around some "sweet spots" (the red circle). Therefore, in Traversing, a good estimate $\hat{d}_p = \mathrm{dist}_{T,n}(p, q)$ suggests a "sweet spot" $q$ around which good estimates $\hat{d}_{p+k}$ for neighboring positions $(p + k)$ can be found.

The Sampling operation may be implemented with local search from a set of random starting points. But when the time series is nearly periodic or quasi-periodic, a more efficient implementation exists. This will be discussed in the next section.

3.2 Quasi-Periodic Time Series

Suppose a time series $T$ is nearly periodic with a period $l$ (i.e., $t_p \approx t_{p \pm k \cdot l}$). Then $\mathrm{dist}_{T,n}(p, p \pm k \cdot l) \approx 0$ implies $d_p = \min_{q : |p - q| \ge n} \mathrm{dist}_{T,n}(p, q) \approx 0$ as long as $k \cdot l \ge n$ for some $k$. Small distances associated with multiples of the time-series period can be seen in Figure 7, at locations around $p + 360$ and $p + 2 \times 360$ for each $p$ in $\{10, 20, \ldots, 100\}$.

Fig. 7. Distance profiles $\mathrm{dist}(p, \cdot)$ of time series xmitdb_x108_0 (x-axis: $q$, 0 to 1000; y-axis: $\mathrm{dist}(p, q)$, 0 to 30). Each line plots the sequence $s_p = (\mathrm{dist}_{T,n}(p, 1), \ldots, \mathrm{dist}_{T,n}(p, 1000))$ for some $p$, where $n = 360$. The 10 lines in the plot correspond to $p$ being $10, 20, \ldots, 100$ respectively.

Fig. 8. Locations of the $q_p$'s for time series xmitdb_x108_0. Each location $(p, q_p)$ is colored according to the value $d_p$. Dashed lines are a period (360) apart. Hence if a location $(p, q_p)$ falls on a dashed line, then $q_p - p$ is a multiple of the period 360.

With this observation, the following heuristic can be used to implement the Sampling operation for nearly periodic time series: a location $q$ multiple periods away from $p$ is likely to be near a local minimum for $\{ \mathrm{dist}_{T,n}(p, q) : q \}$. Figure 8 shows the location $q_p = \operatorname{argmin}_q \{ \mathrm{dist}_{T,n}(p, q) \}$ for all locations $p$ of the time series in Figure 1. It shows that in most cases a minimum location $q_p$ is roughly a multiple of the period away from $p$.

There are a number of ways to estimate the period of a time series. For example, the autocorrelation function (see Figure 9) and phase coherence analysis [7] are often used to estimate the period.

As suggested in Figure 7, the gaps between local minima of a distance profile $\{ \mathrm{dist}_{T,n}(p, q) : 1 \le q \le m - n + 1 \}$ approximate the period of a time series, because $\mathrm{dist}_{T,n}(p, p + k \cdot l) \approx 0$. We use this observation to estimate the period in this paper (see Figure 11). Figure 10 shows the collection of gaps $\{ \Delta_k \}$ for local minima of $\{ \mathrm{dist}_{T,n}(1000, q) : q \}$, where $T$ is the time series in Figure 1 and $n = 360$. Taking the median of $\{ \Delta_k \}$ gives the estimate 354 for the period of the time series. Note that the period needs to be estimated only once (with the distance profile $\{ \mathrm{dist}_{T,n}(p, q) : 1 \le q \le m - n + 1 \}$ for only one location $p$). Hence it takes only $m - n$ calls to the distance function to estimate the period of a time series of length $m$. As a by-product, the exact value of $d_p$ is also obtained.


Fig. 9. Autocorrelation function of time series xmitdb_x108_0 (x-axis: lag, 0 to 5000; y-axis: ACF). The plot shows multiple peaks corresponding to multiples of the period.

Fig. 10. The density plot for gaps between local minima and the estimated period for time series xmitdb_x108_0 (x-axis: gaps between neighboring local minima, 300 to 700; y-axis: density; the estimated period is marked)

1: Randomly pick a location $p$.
2: Compute $\mathrm{dist}(p, q)$ for every $q$.
3: $c_p \leftarrow$ the lower 5% quantile of all distances calculated in the previous step.
4: $Q \leftarrow$ all local minima $q$ such that $\mathrm{dist}(p, q) \le c_p$.
5: Sort $Q$ in increasing order $(Q_1, Q_2, \ldots, Q_{|Q|})$.
6: $\Delta_k = Q_{k+1} - Q_k$ for all $k < |Q|$.
7: $l \leftarrow$ the median of $\{ \Delta_k : k < |Q| \}$.
8: return $l$ as the estimated period.

Fig. 11. Estimating period with the median gap between two neighboring local minima
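The procedure in Figure 11 can be sketched in Python as below. The sketch takes an already computed distance profile (the values $\mathrm{dist}(p, q)$ for one fixed $p$) as input, so Lines 1 and 2 are left to the caller. The exact quantile convention and the rule for detecting a local minimum (strictly below the left neighbor, not above the right neighbor, interior points only) are assumptions, as is the name `estimate_period`.

```python
import math
import statistics

def estimate_period(profile, quantile=0.05):
    """Median gap between low local minima of a distance profile (Figure 11).

    profile[q] is assumed to hold dist(p, q) for one fixed location p.
    Returns None when fewer than two qualifying minima are found.
    """
    ordered = sorted(profile)
    # Line 3: lower 5% quantile of the computed distances (assumed convention).
    c_p = ordered[max(0, math.ceil(quantile * len(ordered)) - 1)]
    # Line 4: interior local minima whose value does not exceed c_p.
    minima = [q for q in range(1, len(profile) - 1)
              if profile[q] < profile[q - 1]
              and profile[q] <= profile[q + 1]
              and profile[q] <= c_p]
    # Lines 5-7: minima are already in increasing order; take the median gap.
    gaps = [b - a for a, b in zip(minima, minima[1:])]
    return statistics.median(gaps) if gaps else None
```

As a usage illustration, a synthetic triangular profile `[min(q % 50, 50 - q % 50) for q in range(1000)]` has its low minima at multiples of 50, so the function returns 50 for it.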

3.3 Implementation of the Search Strategy

With heuristics for both Traversing and Sampling, Figure 12 implements Line 1 in Figure 6. The procedure uses a sequential covering strategy to estimate $d_p$ for each $p$. In each iteration (the while loop from Line 2 to Line 10), a Sampling operation is done to find a "sweet spot". Then a Traversing operation exploits that location to cover as many neighboring locations as possible.

The verification stage of our algorithm (Lines 2-10 in Figure 6) consists of a while loop which resembles the outer loop in HOT SAX and WAT. But here the order of locations is dynamic, determined by the ever-improving estimate $\hat{d}$. Line 8 in Figure 6 further improves $\hat{d}$ when the initial guess for the discord turns out to be incorrect. The improvement can be achieved by traversing with a better starting location $q_{p^*}$ produced in Line 4 of Figure 6 (see Figure 13). As suggested by Figure 8, the "best" locations tend to cluster along the 45-degree lines. Moreover, the large value of the initial estimate $\hat{d}_{p^*}$ suggests that the neighborhood of $p^*$ is a high-payoff region for further refinement of $\hat{d}$. As the traversing is done locally, the improvement step is relatively fast compared to the initial estimation step for $\hat{d}$.

To sum up, we have described a new algorithm for discord search that consists of an estimation stage followed by a verification stage. The estimation stage achieves efficiency by dynamically differentiating locations $p$ according to their potential influence on $\max_p d_p$. Further reduction in computation cost comes from the periodicity of a time series. In general, the Traversing heuristic works best when a time series is smooth (or equivalently densely sampled), while the Sampling heuristic works best when the periodicity of the time series is pronounced. The algorithm is guaranteed to halt and to return the discord by Observation 1. The efficiency of the algorithm is evaluated in the following section.

1: traversed[$p$] $\leftarrow$ FALSE and mindist[$p$] $\leftarrow \infty$, for each $p$.
2: while traversed[$p$] = FALSE for some $p$ do
3:   Randomly pick a location $p$ from $\{ p : \text{traversed}[p] = \text{FALSE} \}$.
4:   $Q \leftarrow \{ q : |p - q| = k \cdot l \text{ for some integer } k \}$.
5:   Do local search for the optimal $q_p$ with starting points in $Q$. {Sampling}
6:   $c_p \leftarrow$ the lower 5% quantile of all distances calculated in the previous step.
7:   Find the largest numbers $L$ and $R$ such that $\mathrm{dist}(p - i, q_p - i) < c_p$ for all $i \le L$ and $\mathrm{dist}(p + i, q_p + i) < c_p$ for all $i \le R$. {Traversing}
8:   traversed[$p - L : p + R$] $\leftarrow$ TRUE.
9:   mindist[$p - L : p + R$] $\leftarrow \{ \mathrm{dist}(p + i, q_p + i) : -L \le i \le R \}$.
10: end while

Fig. 12. Implementation of $\hat{d}$ estimation (Line 1 in Figure 6)

1: Let $q_p$ be the minimum location returned with Exhausting.
2: $c_p \leftarrow$ the lower 5% quantile of all distances calculated in Exhausting.
3: Find the largest numbers $L$ and $R$ such that mindist[$p - L : p + R$] $\ge d_p$, $\mathrm{dist}(p - i, q_p - i) < c_p$ for all $i \le L$, and $\mathrm{dist}(p + i, q_p + i) < c_p$ for all $i \le R$.
4: mindist[$p - L : p + R$] $\leftarrow \{ \mathrm{dist}(p + i, q_p + i) : -L \le i \le R \}$.

Fig. 13. Traversing with a better starting point to improve $\hat{d}$
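The estimation stage of Figure 12 can be sketched in Python as follows, with several departures that are assumptions of this sketch rather than details from the paper. The local search in Line 5 is replaced by simply taking the best sampled location, the Traversing test uses $\le c_p$ instead of the strict inequality so that exact ties on toy data still traverse, all non-overlapping locations are sampled when no location lies a whole number of periods away, positions are 0-based, and `len(T) >= 3 * n` is assumed. The name `estimate_d` is hypothetical.

```python
import math
import random

def dist(T, n, p, q):
    """Euclidean distance between standardized T[p; n] and T[q; n] (0-based)."""
    def z(x):
        mu = sum(x) / len(x)
        sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
        return [0.0] * len(x) if sd == 0.0 else [(v - mu) / sd for v in x]
    a, b = z(T[p:p + n]), z(T[q:q + n])
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def estimate_d(T, n, period, seed=0):
    """Return upper bounds on every d_p via Sampling and Traversing (Figure 12)."""
    rng = random.Random(seed)
    last = len(T) - n
    est = [math.inf] * (last + 1)
    traversed = [False] * (last + 1)
    while not all(traversed):
        p = rng.choice([i for i, done in enumerate(traversed) if not done])
        # Sampling: non-overlapping locations a whole number of periods from p.
        Q = [q for q in range(p % period, last + 1, period) if abs(p - q) >= n]
        if not Q:  # assumed fallback: sample every non-overlapping location
            Q = [q for q in range(last + 1) if abs(p - q) >= n]
        sampled = {q: dist(T, n, p, q) for q in Q}
        q_p = min(sampled, key=sampled.get)
        ordered = sorted(sampled.values())
        c_p = ordered[max(0, math.ceil(0.05 * len(ordered)) - 1)]
        est[p], traversed[p] = min(est[p], sampled[q_p]), True
        # Traversing: move p and q_p together while the distance stays small.
        for step in (-1, 1):
            i = step
            while 0 <= p + i <= last and 0 <= q_p + i <= last:
                d_i = dist(T, n, p + i, q_p + i)
                if d_i > c_p:
                    break
                est[p + i], traversed[p + i] = min(est[p + i], d_i), True
                i += step
    return est
```

Every entry of the result is a distance to some non-overlapping subsequence, so it is an upper bound on the corresponding $d_p$, which is the property that Proposition 1 requires of $\hat{d}$.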

4 Empirical Evaluation

In this section, we first compare the performance of our direct-discord-search algorithm with the results reported for HOT SAX in [5]. We then report the performance of our algorithm on a collection of time series which are publicly available. Following the tradition established in [5] and [1], the efficiency of our algorithm was measured by the number of calls to the distance function, as opposed to wall clock or CPU time. Since our algorithm entails no overhead of constructing an index (in contrast to the algorithms in [5] and [1]), the number of calls to the distance function is roughly proportional to the total computation time involved. As shown in [2] and [1], the performance of HOT SAX depends on the parameters selected. Here we assume that the metrics reported in [5] were based on optimal parameter values.

To compare to HOT SAX, we use the dataset qtdbsel102 from [6]. Although several datasets were used in [5] to evaluate the performance of HOT SAX, this is the only one readily available to us. The dataset qtdbsel102 contains two time series of length 45,000; we use the first one as the two are highly correlated.


Fig. 14. Search costs for the direct search algorithm and HOT SAX (x-axis: length of time series, 1k to 32k; y-axis: number of calls to the distance function; legend: HOT SAX, direct search: 1st discord, direct search: 2nd discord, direct search: 3rd discord). For HOT SAX, the mean numbers of distance calls were visually estimated from [5]; interval estimates were used to account for potential estimation error.

Fig. 15. Time series nprs44 and its $d$ vector (x-axis: index, 0 to 25000; the 1st, 2nd, and 3rd discords are marked)

Following [5], we created random excerpts of length $\{ 1000 \times 2^k : 0 \le k \le 5 \}$ from the original time series$^1$. For each length configuration, 100 random excerpts were created and the top 3 discords of length 128 were searched for.

Table 1 shows the mean and the standard error for the numbers of calls to the distance function. The rightmost column of the table contains the mean performance metric visually estimated from Figure 13 of [5]. Similar information is also visualized in Figure 14. The figure plots the numbers of calls to the distance function for 6 × 3 × 100 runs of the direct-discord-search algorithm. Each point corresponds to one run of discord search; horizontal jitter was applied to reduce overlaps among points. The dashed intervals estimate the average number of calls to the distance function by HOT SAX. Loess lines for the costs of searching for the top-3 discords are also plotted. We can see that for the 1st discord, the average number of calls by the direct search algorithm (the red line) is roughly linear in the size of the time series excerpts. Moreover, these numbers are significantly smaller than the numbers reported for HOT SAX (summarized with the dashed intervals). For subsequent discords, the average numbers of calls to the distance function (the blue line and the green line) decrease significantly, due to information gained from prior computation. The metrics for the second and the third discords also show larger variance: some points are significantly higher or lower than the loess lines. A likely cause is that the complete time series contains only a limited number of truly anomalous subsequences (discords): when a random excerpt of the time series includes only one (or two) of these discords, searching for the second (or the third, respectively) discord will be difficult. (Note that the plot uses log scales for both the x and y axes.)

1 Experiments on the length 64,000 were not carried out because qtdbsel102 has only 45,000 points and we chose not to pad the time series with hypothetical values.


Table 1. Numbers of calls to the distance function with random excerpts from qtdbsel102, for the direct-discord-search algorithm and HOT SAX

Time Series   Direct Search Cost (Standard Error)                            Aver. Cost for HOT SAX
Length        1st discord          2nd discord          3rd discord          (visual estimates)
1,000         4,020 (1,441)        1,072 (705)          998 (690)            16,000 to 40,000
2,000         11,159 (4,641)       4,120 (2,532)        3,493 (2,780)        40,000 to 100,000
4,000         30,938 (12,473)      13,963 (10,633)      13,399 (12,473)      60,000 to 160,000
8,000         77,381 (33,064)      29,711 (32,651)      38,632 (40,974)      100,000 to 160,000
16,000        168,277 (70,071)     94,855 (107,128)     141,038 (143,553)    250,000 to 400,000
32,000        365,900 (184,540)    198,797 (95,960)     105,911 (107,992)    400,000 to 1 × 10^6

In the second set of experiments, we search for the top 3 discords for a collection of time series from [6]² and [4], using the proposed algorithm. For time series from [6], the discord lengths are chosen to be consistent with configurations used in [5]. The results are shown in Table 2. Many of these datasets, in particular 2h_radioactivity, demonstrate little periodicity. The results show that our algorithm has reasonable performance even for such time series.

Table 2. Numbers of calls to the distance function for top-3 discord search

Time Series        Length    Discord    Search Cost (Standard Error)
                             Length     1st discord         2nd discord         3rd discord
nprs44             24,085    320        249,283 (12,454)    231,350 (19,949)    208,539 (34,640)
nprs43             18,012    320        188,095 (11,820)    24,588 (2,785)      109,147 (29,516)
power data         35,000    750        158,235 (13,546)    34,680 (874)        37,992 (3,460)
chfdbchf15         15,000    256        79,683 (5,606)      21,400 (2,224)      134,734 (18,967)
2h_radioactivity   4,370     128        157,495 (8,799)     20,286 (5,725)      16,463 (4,657)

In Table 2, the results for the time series nprs44 are particularly interesting. For nprs44, no significant reduction in computation is observed for computing the 2nd and the 3rd discords. To find out why, we plot the time series and the estimated d vector in Figure 15. The figure shows that the 2nd and the 3rd discords are not noticeably different from other subsequences.

Completely nonperiodic case. Completely nonperiodic time series rarely exist in applications, and they can be easily identified through visual inspection of the time series or their autocorrelation function. In the unlikely situation where our algorithm is blindly applied to a completely nonperiodic time series, a bad estimate of the period will reduce the efficiency of the algorithm. To demonstrate this, we generate two random walk time series T with $t_p = \sum_{i=1}^{p} Z_i$, where the $Z_i$ are independent normally distributed random variables with mean 0 and variance 1

2 For datasets containing more than one time series, we take the first one in each datafile.


Fig. 16. Random walk time series used in the experiments for completely nonperiodic data: (a) Random Walk 1; (b) Random Walk 2 (x-axis: Index)

Table 3. Number of calls to the distance function for top-3 discord search (random walk time series)

Time Series      Length    Discord    Direct Search Cost (Standard Error)
                           Length     1st                 2nd                 3rd
random walk 1    15,000    256        136,395 (7,410)     54,994 (10,144)     34,355 (7,696)
random walk 2    30,000    128        441,685 (35,695)    329,380 (50,432)    636,930 (164,842)

(see Figure 16). Random walk time series are interesting in two respects: first, a random walk time series is completely nonperiodic; second, every subsequence of a random walk can be regarded as equally anomalous.
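The random walk construction, together with the autocorrelation check mentioned at the start of this subsection, can be sketched as follows. The function names, the seeds, and the particular sample autocorrelation estimator are assumptions made for illustration; only the definition of the walk as a cumulative sum of independent standard normal steps comes from the text.

```python
import random
from itertools import accumulate


def random_walk(length, seed=None):
    """Return t_1..t_length with t_p = Z_1 + ... + Z_p, where Z_i ~ N(0, 1)."""
    rng = random.Random(seed)
    steps = (rng.gauss(0.0, 1.0) for _ in range(length))
    return list(accumulate(steps))


def autocorrelation(series, lag):
    """Sample autocorrelation at `lag` (a common biased estimator).

    Raises ZeroDivisionError for a constant series.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denominator = sum(c * c for c in centered)
    numerator = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return numerator / denominator


# Lengths match the two series in Table 3; the seeds are arbitrary.
walk_1 = random_walk(15_000, seed=1)
walk_2 = random_walk(30_000, seed=2)

# A periodic series shows autocorrelation peaks recurring at multiples of
# its period; a random walk typically shows a slow decay with no such peaks.
print([round(autocorrelation(walk_1, lag), 3) for lag in (100, 200, 400)])
```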

We applied the algorithm to find the top-3 discords in the two random-walk time series. The results are shown in Table 3. Without tuning any parameter, the algorithm is still hundreds of times faster than the brute-force computation of all pair-wise distances.
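To put the brute-force baseline in perspective, the sketch below counts how many distance computations an exhaustive search would need. It assumes that brute force evaluates each unordered pair of non-overlapping subsequences exactly once (start positions at least one discord length apart); the function name is hypothetical, and the exact accounting used for the comparison in the text is not specified.

```python
def brute_force_pair_count(m, n):
    """Count unordered pairs of length-n subsequences of a length-m series
    whose start positions differ by at least n (that is, non-overlapping)."""
    num_subsequences = m - n + 1
    # Number of valid partners for the first subsequence.
    free = num_subsequences - n
    if free <= 0:
        return 0
    # Sum over gaps d = n .. num_subsequences - 1 of (num_subsequences - d).
    return free * (free + 1) // 2


# (series length, discord length, reported calls for the 1st discord),
# taken from the Table 3 rows for random walk 1 and random walk 2.
rows = [(15_000, 256, 136_395), (30_000, 128, 441_685)]
for m, n, reported_calls in rows:
    pairs = brute_force_pair_count(m, n)
    print(m, n, pairs, round(pairs / reported_calls))
```

Under this counting assumption, the ratio for the first discord comes out in the high hundreds to roughly one thousand for both series.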

To sum up, our experiments show clear performance improvement on quasi-periodic time series by the proposed direct discord-search algorithm. Our algorithm also demonstrates consistent performance across a broad range of time series, with varying degrees of periodicity.

5 Conclusions and Future Work

The paper has introduced a parameter-free algorithm for top-K discord search. When a time series is nearly periodic or quasi-periodic, the algorithm demonstrated significant reduction in computation time. Many applications generate quasi-periodic time series, and the assumption of quasi-periodicity can be assessed by simple visual inspection. Therefore our algorithm has wide applicability.

Our results have shown that periodicity is a useful feature in time-series anomaly detection. More theoretical study is needed to better understand the effect of periodicity on the search space of time-series discords. We are also interested in knowing to what extent the results in this paper can be generalized to chaotic time series [10].


One limitation of the proposed algorithm is that the time series needs to fit into the main memory. Hence the algorithm requires O(m) memory. One future direction is to explore disk-aware approximations to the direct-discord-search algorithm. When the time series is too large to fit into the main memory, one needs to minimize the number of disk scans as well as the number of calls to the distance function (see [11]).

Another direction is to explore alternative ways of estimating the d vector so that the number of iterations for refining d is minimized. We are also looking for ways to extend the algorithm so that the periodicity assumption can be removed.

Acknowledgment

Support for this work was provided by an Australian Research Council Linkage Grant (LP 0776417). We would like to thank anonymous reviewers for their helpful comments.

References

1. Bu, Y., Leung, T.W., Fu, A.W.C., Keogh, E., Pei, J., Meshkin, S.: WAT: Finding top-k discords in time series database. In: Proceedings of 7th SIAM International Conference on Data Mining (2007)

2. Fu, A.W.-c., Leung, O.T.-W., Keogh, E.J., Lin, J.: Finding time series discords based on Haar transform. In: Li, X., Zaïane, O.R., Li, Z.-h. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 31–41. Springer, Heidelberg (2006)

3. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000), Circulation Electronic Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215

4. Hyndman, R.J.: Time Series Data Library, http://www.robjhyndman.com/TSDL (accessed on April 15, 2010)

5. Keogh, E., Lin, J., Fu, A.: HOT SAX: Efficiently finding the most unusual time series subsequence. In: Proc. of the 5th IEEE International Conference on Data Mining, pp. 226–233 (2005)

6. Keogh, E., Lin, J., Fu, A.: The UCR Time Series Discords Homepage, http://www.cs.ucr.edu/~eamonn/discords/

7. Lindstrom, J., Kokko, H., Ranta, E.: Detecting periodicity in short and noisy time series data. Oikos 78(2), 406–410 (1997)

8. Pham, N.D., Le, Q.L., Dang, T.K.: HOT aSAX: A novel adaptive symbolic representation for time series discords discovery. In: Nguyen, N.T., Le, M.T., Świątek, J. (eds.) ACIIDS 2010. LNCS, vol. 5990, pp. 113–121. Springer, Heidelberg (2010)

9. Sion, M.: On general minimax theorems. Pacific J. Math. 8(1), 171–176 (1958)

10. Sprott, J.C.: Chaos and time-series analysis. Oxford Univ. Pr., Oxford (2003)

11. Yankov, D., Keogh, E., Rebbapragada, U.: Disk aware discord discovery: Finding unusual time series in terabyte sized datasets. Knowledge and Information Systems 17(2), 241–262 (2008)