iSAX: Indexing and Mining Terabyte Sized Time Series

Time Series Data Mining Group


Jin Shieh, Eamonn Keogh

Computer Science & Eng. Dept.

University of California, Riverside


Outline

• Introduction
  – Motivating example

• iSAX representation
  – Indexing time series

• Experimental evaluation

• Conclusion


Introduction

• Our work extends a popular symbolic representation of time series to allow for the indexing and retrieval of millions of time series

• Symbolic Aggregate approXimation (SAX)
  – Represent a time series T of length n in w-dimensional space using PAA
    • Where the i-th element of $\bar{T}$ is:

      $\bar{t}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} T_j$

  – Then discretize $\bar{T}$ into a vector of symbols
    • Breakpoints map to a small alphabet a of symbols

[Figure: three panels showing a time series T of length 16, its PAA(T,4) approximation, and iSAX(T,4,4) with binary symbols 00, 01, 10, 11]
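The PAA and discretization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the usual SAX convention of z-normalized input and breakpoints placed at standard-normal quantiles, and the function names (`znormalize`, `paa`, `breakpoints`, `sax`) are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def znormalize(ts):
    """Rescale to zero mean and unit variance (assumed preprocessing)."""
    mu, sigma = mean(ts), pstdev(ts)
    return [(x - mu) / sigma for x in ts]

def paa(ts, w):
    """Mean of each of w equal-width segments; assumes len(ts) is divisible by w."""
    seg = len(ts) // w
    return [sum(ts[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def breakpoints(cardinality):
    """cardinality - 1 cut points splitting N(0, 1) into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / cardinality) for k in range(1, cardinality)]

def sax(ts, w, cardinality):
    """Map each PAA value to the index of the region it falls in (0 = lowest)."""
    bps = breakpoints(cardinality)
    return [sum(1 for b in bps if b <= v) for v in paa(ts, w)]

# A length-16 step series reduced to a 4-symbol word at cardinality 4.
series = znormalize([3.0] * 4 + [1.0] * 4 + [-1.0] * 4 + [-3.0] * 4)
word = sax(series, w=4, cardinality=4)
print([format(s, "02b") for s in word])  # ['11', '10', '01', '00']
```

Writing each symbol as an integer whose binary form is the label makes the later bit-level operations on iSAX words straightforward.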


Introduction (cont.)

• SAX is lower bounding

– Given two SAX representations $T^a$ and $S^a$, a lower bound to the Euclidean distance is:

  $MINDIST(T^a, S^a) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} dist(t_i, s_i)^2}$

– $dist(t_i, s_i)$ is the smallest distance between the breakpoints that characterize each symbol, 0 if they overlap
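As a rough sketch of how MINDIST can be evaluated, the Python below computes it for two SAX words of the same cardinality. It assumes symbols are integers 0..a-1 ordered from lowest to highest region and that breakpoints are standard-normal quantiles; the helper names are illustrative only.

```python
import math
from statistics import NormalDist

def breakpoints(cardinality):
    """Assumed Gaussian breakpoints: cardinality - 1 equiprobable cut points."""
    nd = NormalDist()
    return [nd.inv_cdf(k / cardinality) for k in range(1, cardinality)]

def symbol_dist(a, b, bps):
    """Gap between the regions of symbols a and b; 0 if equal or adjacent."""
    lo, hi = min(a, b), max(a, b)
    if hi - lo <= 1:
        return 0.0
    # Lower edge of the higher region minus upper edge of the lower region.
    return bps[hi - 1] - bps[lo]

def mindist(word_t, word_s, n, cardinality):
    """Lower bound on the Euclidean distance between two length-n series."""
    w = len(word_t)
    bps = breakpoints(cardinality)
    total = sum(symbol_dist(a, b, bps) ** 2 for a, b in zip(word_t, word_s))
    return math.sqrt(n / w) * math.sqrt(total)

# Adjacent symbols contribute nothing, so these two words have MINDIST 0.
print(mindist([3, 3, 0, 0], [2, 2, 1, 1], n=16, cardinality=4))
```

Because equal and adjacent symbols have touching regions, only symbols at least two regions apart contribute to the bound.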


Motivating Example

• Why not just index using SAX?

• For example: index 1,000,000 time series using SAX
  – Choose SAX parameters
    • cardinality = 8, word length = 4
    • $8^4$ = 4,096 possible SAX word labels
  – Place time series which map to the same label in the same file on disk
  – Compute label for query and retrieve matching file
    • Time series in file likely to be good approximate matches
  – Average label occupancy 1,000,000 / 4,096 ≈ 244 (reasonable)


Motivating Example (cont.)

• In practice, the distribution of time series to SAX word labels is not uniform!
  – Some labels are empty
  – Others hold a disproportionate percentage of the dataset

• Ideal condition: given a threshold th, the number of entries n mapped to each label satisfies 1 ≤ n ≤ th
  – Favor larger n

• How can we achieve this? We need to make SAX more flexible
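The skew described above can be checked with a small simulation. The sketch below is scaled down from the slide's setting (5,000 random walks, cardinality 4, word length 4, so $4^4$ = 256 labels) and assumes z-normalized series with Gaussian breakpoints; all names and parameters are illustrative.

```python
import random
from collections import Counter
from statistics import NormalDist, mean, pstdev

def random_walk(n, rng):
    """Cumulative sum of n standard-normal steps."""
    x, walk = 0.0, []
    for _ in range(n):
        x += rng.gauss(0, 1)
        walk.append(x)
    return walk

def sax_label(ts, w, card):
    """SAX word of a series, z-normalizing each segment mean on the fly."""
    mu, sigma = mean(ts), pstdev(ts)
    seg = len(ts) // w
    nd = NormalDist()
    bps = [nd.inv_cdf(k / card) for k in range(1, card)]
    word = []
    for i in range(w):
        v = (sum(ts[i * seg:(i + 1) * seg]) / seg - mu) / sigma
        word.append(sum(1 for b in bps if b <= v))
    return tuple(word)

def label_occupancy(num_series, n, w, card, seed=0):
    """Count how many series map to each SAX label."""
    rng = random.Random(seed)
    return Counter(sax_label(random_walk(n, rng), w, card) for _ in range(num_series))

counts = label_occupancy(num_series=5000, n=64, w=4, card=4)
print("labels used:", len(counts), "of", 4 ** 4)
print("uniform average:", 5000 / 4 ** 4, "largest label:", max(counts.values()))
```

Some labels can never occur for z-normalized data: the segment means average to zero, so a word made only of the highest symbol is impossible. Comparing the largest occupancy with the uniform average shows how far a fixed SAX labeling is from the ideal 1 ≤ n ≤ th.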


iSAX Representation

• SAX uses a single hard-coded cardinality
  – Unable to differentiate only on dimensions of interest

• We will show that the indexing problem can be solved if we extend SAX to allow:
  – Different cardinalities within a single word
  – Comparison of words with different cardinalities

• We call this extension indexable SAX (iSAX)


iSAX Representation (cont.)

• Multi-resolution property
  – Readily convert to any lower resolution that differs by a power of two

• Lower bounding distance between iSAX words enforced through examination of both sets of breakpoints

• iSAX offers a bit-aware, quantized, multi-resolution representation with variable granularity

{12,13, 6, 1} = {1100,1101,0110,0001}
{ 6, 6, 3, 0} = {110 ,110 ,011 ,000 }
{ 3, 3, 1, 0} = {11  ,11  ,01  ,00  }
{ 1, 1, 0, 0} = {1   ,1   ,0   ,0   }
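The conversion in the table above amounts to dropping trailing bits, which the following Python sketch illustrates. It also shows one way to compare symbols of different cardinalities by looking at the value interval each one covers. That interval-gap formulation is an assumption made for illustration (with Gaussian breakpoints), not necessarily the exact procedure used by the authors, and all names are hypothetical.

```python
from statistics import NormalDist

def reduce_bits(symbol, bits, new_bits):
    """Drop the trailing (bits - new_bits) bits to get the coarser symbol."""
    return symbol >> (bits - new_bits)

def region(symbol, bits):
    """Value interval [lo, hi) covered by a symbol at cardinality 2**bits."""
    card = 2 ** bits
    nd = NormalDist()
    lo = float("-inf") if symbol == 0 else nd.inv_cdf(symbol / card)
    hi = float("inf") if symbol == card - 1 else nd.inv_cdf((symbol + 1) / card)
    return lo, hi

def region_gap(a, b):
    """Gap between two (symbol, bits) regions; 0 if they touch or overlap."""
    (alo, ahi), (blo, bhi) = region(*a), region(*b)
    return max(0.0, alo - bhi, blo - ahi)

word = [12, 13, 6, 1]  # 4 bits per symbol, i.e. cardinality 16
for new_bits in (3, 2, 1):
    print(new_bits, [reduce_bits(s, 4, new_bits) for s in word])

# Symbol 1100 (cardinality 16) lies inside 110 (cardinality 8), so the gap is 0.
print(region_gap((12, 4), (6, 3)))
```

Because the breakpoints of a lower cardinality are a subset of those of any higher power-of-two cardinality, a coarse symbol's interval is exactly the union of the intervals of the finer symbols that share its bit prefix.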


Indexing with iSAX

• Split a set of time series represented by a common iSAX word into mutually exclusive subsets (using multi-resolution property):
  – Increase cardinality along d dimensions, where w is the word length and 1 ≤ d ≤ w
  – Fan-out rate bounded by $2^d$

• Iterative doubling
  – Given a base cardinality b, cardinality at the i-th increase is $b \cdot 2^i$
  – Breakpoints align (overlap) across successive cardinalities

• Allows for index structures which are hierarchical, with non-overlapping regions, and a controlled fan-out rate
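A split of this kind can be sketched by appending one bit to the symbols along the chosen dimensions. The Python below is an illustrative sketch that represents an iSAX word as a tuple of hypothetical `(symbol, bits)` pairs; it is not taken from the authors' code.

```python
from itertools import product

def cardinality_after(b, i):
    """Cardinality after i doublings of the base cardinality b."""
    return b * 2 ** i

def split_word(word, dims):
    """Children obtained by doubling the cardinality along the given dimensions.

    Each child appends one extra bit (0 or 1) to the symbols at positions
    `dims`, so len(dims) = d yields 2**d mutually exclusive children.
    """
    children = []
    for new_bits in product((0, 1), repeat=len(dims)):
        child = list(word)
        for d, bit in zip(dims, new_bits):
            symbol, bits = child[d]
            child[d] = ((symbol << 1) | bit, bits + 1)
        children.append(tuple(child))
    return children

# The word {1, 1, 0, 0} at cardinality 2, split along its first dimension.
word = ((1, 1), (1, 1), (0, 1), (0, 1))
for child in split_word(word, dims=[0]):
    print(child)
```

Splitting along a single dimension gives a binary fan-out, while splitting along all w dimensions at once gives up to $2^w$ children.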


Indexing with iSAX (cont.)

• Simple tree-based index (base cardinality b, word length w, threshold th)
  – Hierarchically subdivides SAX space until the number of entries in each subspace falls within th
    • Leaf nodes point to index files on disk
    • Internal nodes designate a split in SAX space

• Approximate Search
  – Similar time series often represented by same iSAX word
  – Traverse index until leaf
    • Match iSAX representation at each level
    • Apply heuristics if no match

• Exact Search
  – Leverage approximate search
  – Prune search space
    • Lower bounding distance
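The tree index and approximate search can be illustrated with a small in-memory sketch. Several choices here are assumptions rather than details from the slides: words are stored at a fixed `MAX_BITS` resolution, leaves hold Python lists instead of files on disk, an over-full leaf is split along the dimension that currently uses the fewest bits, and a failed match simply returns nothing instead of applying heuristics. Exact search is omitted.

```python
MAX_BITS = 8  # hypothetical: every series is stored with 8 bits per dimension

def at_resolution(full_word, bits_per_dim):
    """Reduce a MAX_BITS-per-symbol word to the given per-dimension bit counts."""
    return tuple(s >> (MAX_BITS - b) for s, b in zip(full_word, bits_per_dim))

class Node:
    def __init__(self, bits):
        self.bits = bits        # bits used along each dimension at this node
        self.entries = []       # leaf payload (stands in for an index file on disk)
        self.children = None    # None for a leaf; dict of reduced word -> Node otherwise
        self.child_bits = None  # resolution at which this node's children are keyed

class ISaxTree:
    def __init__(self, w, base_bits, th):
        self.th = th
        self.root = Node((0,) * w)
        self.root.children = {}
        self.root.child_bits = (base_bits,) * w  # base cardinality b = 2**base_bits

    def insert(self, full_word, item):
        node = self.root
        while node.children is not None:
            key = at_resolution(full_word, node.child_bits)
            if key not in node.children:
                node.children[key] = Node(node.child_bits)
            node = node.children[key]
        node.entries.append((full_word, item))
        if len(node.entries) > self.th:
            self._split(node)

    def _split(self, leaf):
        # Assumed policy: refine the dimension that currently has the fewest bits.
        d = min(range(len(leaf.bits)), key=lambda i: leaf.bits[i])
        if leaf.bits[d] >= MAX_BITS:
            return  # no resolution left; the leaf stays over-full
        leaf.child_bits = tuple(b + (i == d) for i, b in enumerate(leaf.bits))
        pending, leaf.entries, leaf.children = leaf.entries, [], {}
        for full_word, item in pending:
            self.insert(full_word, item)  # re-descends into the new children

    def approximate_search(self, full_word):
        node = self.root
        while node.children is not None:
            key = at_resolution(full_word, node.child_bits)
            if key not in node.children:
                return []  # the slides fall back on heuristics here; omitted
            node = node.children[key]
        return [item for _, item in node.entries]

tree = ISaxTree(w=2, base_bits=1, th=2)
tree.insert((0b00000000, 0b00000000), "a")
tree.insert((0b00000001, 0b00000000), "b")
tree.insert((0b01000000, 0b00000000), "c")  # third entry exceeds th and forces a split
print(tree.approximate_search((0b00000010, 0b00000000)))  # ['a', 'b']
```

Each split produces children whose regions partition the parent's region, so the regions at any level of the tree never overlap.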


Experimental Evaluation

• We conduct experiments to identify characteristics of the iSAX representation:

– Tightness of the lower bound

– Indexing performance on massive datasets

– Applicability to data-mining algorithms


Tightness of Lower Bounds

• TLB = LowerBoundDist(T’,S’) / EuclideanDist(T,S)

• For a given dataset
  – Time series length [480, 960, 1440, 1920]
  – Bytes available for representation [16, 24, 32, 40]
  – Results similar across thirty datasets

[Figure: TLB on the Koski ECG dataset for iSAX, DCT, APCA, DFT, PAA/DWT, CHEB, and IPLA, plotted against time series length (480 to 1920) and bytes available (16 to 40)]
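The TLB ratio can be computed for any pair of series once a lower-bounding distance is available. The Python sketch below uses plain SAX with a single cardinality and Gaussian breakpoints as the lower bound, which is a simplification of the byte-budgeted iSAX setting measured in the slides. The data is synthetic and the helper names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def znormalize(ts):
    mu, sigma = mean(ts), pstdev(ts)
    return [(x - mu) / sigma for x in ts]

def sax(ts, w, bps):
    """SAX word from segment means; assumes len(ts) is divisible by w."""
    seg = len(ts) // w
    means = [sum(ts[i * seg:(i + 1) * seg]) / seg for i in range(w)]
    return [sum(1 for b in bps if b <= v) for v in means]

def mindist(ta, sa, n, bps):
    """SAX lower bound on the Euclidean distance of two length-n series."""
    total = 0.0
    for a, b in zip(ta, sa):
        lo, hi = min(a, b), max(a, b)
        if hi - lo > 1:
            total += (bps[hi - 1] - bps[lo]) ** 2
    return math.sqrt(n / len(ta)) * math.sqrt(total)

def tlb(t, s, w, card):
    """Tightness of lower bound: lower-bounding distance over true distance."""
    nd = NormalDist()
    bps = [nd.inv_cdf(k / card) for k in range(1, card)]
    lower = mindist(sax(t, w, bps), sax(s, w, bps), len(t), bps)
    return lower / math.dist(t, s)

def random_walk(n, rng):
    x, walk = 0.0, []
    for _ in range(n):
        x += rng.gauss(0, 1)
        walk.append(x)
    return walk

rng = random.Random(0)
t = znormalize(random_walk(64, rng))
s = znormalize(random_walk(64, rng))
print(tlb(t, s, w=8, card=16))
```

A TLB of 1 would mean the lower bound equals the true distance; values closer to 1 allow more pruning during search.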


Indexing Performance on Massive Datasets

• Indexed random walk datasets of [1, 2, 4, 8] million time series of length 256
  – Parameters: b = 4, w = 8, th = 100
  – Generated [39,255, 57,365, 92,209, 162,340] index files

• Approximate Search (1000 queries):

[Figure: Percentage of queries vs. size of random walk database (1M, 2M, 4M, 8M). Series: outside top 1000; at least 1 from top 100; at least 1 from top 10; 1 from top 1 (true nearest neighbor)]

• Exact Search (100 queries):

Avg. Time/Query (min)       1M       2M       4M       8M
Exact Search                3.8      5.8      9.0      14.1
Sequential Scan             71.5     104.8    168.8    297.6

Avg. Disk Accesses/Query    1M       2M       4M       8M
Exact Search                2115.3   3172.5   4925.3   7719.1
Sequential Scan             39255    57365    92209    162340


Data Mining

• Definition: Time Series Set Difference, TSSD(A,B). Given two collections of time series A and B, the time series set difference is the subsequence in A whose distance from its nearest neighbor in B is maximal

• Electrocardiogram dataset from a 45-year-old male subject with suspected sleep-disordered breathing
  – 7.2 hours as reference set B (1,000,000 time series)
  – 8 minutes 39 seconds as “novel” set A (20,000 time series) where the patient woke up

[Figure: The Time Series Set Difference discovered between ECGs recorded during a waking cycle and the previous 7.2 hours (respiration pattern change in accordance with change in sleep stages)]


Data Mining (cont.)

• Solutions:
  1) Sequential scan of A across B
  2) Exact search for each entry in A using the index on B
  3) Leverage approximate and exact search
     • Order A by approximate search distance in a queue
     • Perform exact search using index on B in descending distance
       – Suspend if distance becomes lower than next entry in the queue
       – If search completes, return as TSSD

                      Distance Computations    Disk Accesses    Est. Time
1) Sequential Scan    20,000,000,000           31,196           6.25 days
2) Exact Search       325,604,200              5,676,400        1.04 days
3) Leveraged          2,365,553                43,779           34 minutes
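The idea behind solution 3 can be sketched with a priority queue keyed by an upper bound on each entry's nearest-neighbor distance. The Python below is a simplified variant, not the procedure from the slides: it refines one whole entry at a time instead of suspending an exact search midway, and both search functions are hypothetical stand-ins (plain scans) for the index-based approximate and exact searches.

```python
import heapq
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def approx_nn_dist(q, B, sample=3):
    """Stand-in for approximate search: distance to the best of a few
    candidates, which is an upper bound on the true NN distance."""
    return min(euclid(q, b) for b in B[:sample])

def exact_nn_dist(q, B):
    """Stand-in for index-based exact search (here: a plain scan)."""
    return min(euclid(q, b) for b in B)

def tssd(A, B):
    """Entry of A whose nearest neighbor in B is farthest, with that distance."""
    # Max-heap (negated keys) ordered by the current bound on each NN distance.
    heap = [(-approx_nn_dist(a, B), i, False) for i, a in enumerate(A)]
    heapq.heapify(heap)
    while heap:
        neg_d, i, is_exact = heapq.heappop(heap)
        if is_exact:
            # An exact distance that tops every remaining upper bound is maximal.
            return A[i], -neg_d
        heapq.heappush(heap, (-exact_nn_dist(A[i], B), i, True))
    return None

B = [[0, 0], [1, 1], [5, 5], [10, 10]]
A = [[0, 1], [5, 6], [20, 20]]
print(tssd(A, B))
```

Entries whose approximate distance is already small never reach the top of the queue, so they are never refined exactly, which is where the savings over running an exact search for every entry in A come from.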


Conclusion

• Introduced the iSAX representation and showed how it can be used for indexing time series

• Demonstrated scalability and efficacy on massive datasets

• Showed how approximate and exact search can be used in conjunction to produce exact results on data mining problems


THANK YOU!
