Upload
anitra
View
31
Download
3
Tags:
Embed Size (px)
DESCRIPTION
i SAX: Indexing and Mining Terabyte Sized Time Series. Jin Shieh, Eamonn Keogh Computer Science & Eng. Dept. University of California, Riverside. Outline. Introduction Motivating example i SAX representation Indexing time series Experimental evaluation Conclusion. 3. 3. iSAX( T ,4,4). - PowerPoint PPT Presentation
Citation preview
Time Series Data Mining GroupTime Series Data Mining Group
iSAX: Indexing and Mining Terabyte Sized Time Series
Jin Shieh, Eamonn Keogh
Computer Science & Eng. Dept.
University of California, Riverside
Time Series Data Mining GroupTime Series Data Mining Group
Outline
• Introduction– Motivating example
• iSAX representation– Indexing time series
• Experimental evaluation
• Conclusion
Time Series Data Mining GroupTime Series Data Mining Group
Introduction• Our work extends a popular symbolic
representation of time series to allow for the indexing and retrieval of millions of time series
• Symbolic Aggregate approXimation (SAX) – Represent a time series T of length n in w-
dimensional space using PAA• Where the ith element of is:
– Then discretize into a vector of symbols • Breakpoints map to a small alphabet a of symbols
i
ijjn
wi
wn
wn
Tt1)1(
-3
-2
-1
0
1
2
3
4 8 12 160
00
01
10
11
iSAX(T,4,4)
-3
-2
-1
0
1
2
3
4 8 12 160 4 8 12 160
A time series T PAA(T,4)
-3
-2
-1
0
1
2
3
Time Series Data Mining GroupTime Series Data Mining Group
Introduction (cont.)
• SAX is lower bounding
– Given a SAX representations Ta, Sa a lower bound to the Euclidean distance is:
MINDIST(Ta, Sa)
– dist(ti,si) is the smallest distance between the breakpoints that characterize each symbol, 0 if they overlap
w
i iiwn stdist
1
2),(
Time Series Data Mining GroupTime Series Data Mining Group
Motivating Example• Why not just index using SAX?
• For example: index 1,000,000 time series using SAX– Choose SAX parameters
• cardinality = 8, wordlength = 4• 84 = 4,096 possible SAX word labels
– Place time series which map to the same label in the same file on disk
– Compute label for query and retrieve matching file• Time series in file likely to be good approximate
matches
– Average label occupancy 1,000,000/4,096 = ~244 (reasonable)
Time Series Data Mining GroupTime Series Data Mining Group
Motivating Example (cont.)
• In practice, the distribution of time series to SAX word labels is not uniform!– Empty – Disproportionate percentage of the dataset
• Ideal condition: We want to give a threshold th, and have the number of entries n mapped to a label to be 1 ≤ n ≤ th – Favor larger n
• How can we achieve this? We need to make SAX more flexible
Time Series Data Mining GroupTime Series Data Mining Group
iSAX Representation
• SAX uses a single hard-coded cardinality– Unable to differentiate only on dimensions
of interest
• We will show that the indexing problem can be solved if we extend SAX to allow:– Different cardinalities within a single word– Comparison of words with different
cardinalities
• We call this extension indexable SAX (iSAX)
Time Series Data Mining GroupTime Series Data Mining Group
iSAX Representation (cont.)
• Multi-resolution property– Readily convert to any lower resolution that differs
by a power of two
• Lower bounding distance between iSAX words enforced through examination of both sets of breakpoints
• iSAX offers a bit aware, quantized, multi-resolution representation with variable granularity
{12,13, 6, 1} = {1100,1101,0110,0001}{ 6, 6, 3, 0} = {110 ,110 ,011 ,000 }{ 3, 3, 1, 0} = {11 ,11 ,01 ,00 }{ 1, 1, 0, 0} = {1 ,1 ,0 ,0 }
Time Series Data Mining GroupTime Series Data Mining Group
Indexing with iSAX
• Split a set of time series represented by a common iSAX word into mutually exclusive subsets (using multi-resolution property):– Increase cardinality along dimensions d, word length w,
1 ≤ d ≤ w– Fan-out rate bound by 2d
• Iterative doubling– Given a base cardinality b, cardinality at i-th increase is
b*2i
– Alignment of breakpoints overlap
• Allows for index structures which are hierarchical, with non-overlapping regions, and a controlled fan-out rate
Time Series Data Mining GroupTime Series Data Mining Group
Indexing with iSAX (cont.)• Simple tree-based index (base cardinality b, word length w, threshold
th) – Hierarchically subdivides SAX space until entries in each
subspace falls within th• Leaf nodes point to index files on disk• Internal nodes designate a split in SAX space
• Approximate Search– Similar time series often represented by same iSAX word– Traverse index until leaf
• Match iSAX representation at each level• Apply heuristics if no match
• Exact Search– Leverage approximate search– Prune search space
• Lower bounding distance
Time Series Data Mining GroupTime Series Data Mining Group
Experimental Evaluation
• We conduct experiments to identify characteristics of the iSAX representation:
– Tightness of the lower bound
– Indexing performance on massive datasets
– Applicability to data-mining algorithms
Time Series Data Mining GroupTime Series Data Mining Group
Tightness of Lower Bounds
• TLB = LowerBoundDist(T’,S’) / EuclideanDist(T,S)
• For a given dataset– Time series length [480, 960, 1440, 1920] – Bytes available for representation [16, 24,
32, 40]– Results similar across thirty datasets
480
960
1440
1920
16 bytes24 bytes
32 bytes40 bytes
0
0.2
0.4
0.6
0.8
iSAX, DCT, ACPA, DFT, PAA/DWT, CHEB, IPLA
0 500 1000
Koski ECG
TL
B
Koski ECG dataset
Time series Length
Byt
es A
vaila
ble
Time Series Data Mining GroupTime Series Data Mining Group
Indexing Performance on Massive Datasets• Indexed random walk datasets of [1, 2, 4, 8] million time
series of length 256– Parameters: b = 4, w = 8, th = 100– Generated [39,255, 57,365, 92,209, 162,340] index files
• Approximate Search (1000 queries):
Exact Search (100 queries):
At least 1 from top 100
1m 2m 4m 8m0
20
40
60
80
100
At least 1 from top 10
1 from top 1 (true nearest neighbor)Outside top 1000
Size of Random Walk Database
Percentage of Queries
Avg. Time/Query (min)
1M 2M 4M 8M
Exact Search 3.8 5.8 9.0 14.1
Sequential Scan
71.5 104.8 168.8 297.6
Avg. Disk Accesses/Query
1M 2M 4M 8M
Exact Search 2115.3 3172.5
4925.3
7719.1
Sequential Scan 39255 57365 92209 162340
Time Series Data Mining GroupTime Series Data Mining Group
Data Mining• Definition: Time Series Set Difference (TSSD)
(A,B). Given two collections of time series A and B, the time series set difference is the subsequence in A whose distance from its nearest neighbor in B is maximal
• Electrocardiogram dataset from a 45 year old male subject with suspected sleep-disordered breathing – 7.2 hours as reference set B (1,000,000 time series)– 8 minutes 39 seconds as “novel” set A (20,000 time
series) where the patient woke up0 50 100 150 200 250
-4
0
4
109 90
The Time Series Set Difference discovered between ECGs recorded during a waking cycle and the previous 7.2 hours (respiration pattern change in accordance with change in sleep stages)
Time Series Data Mining GroupTime Series Data Mining Group
Data Mining (cont.)• Solutions:
1) Sequential scan A across B 2) Exact search each entry in A using index on B3) Leverage approximate and exact search
• Order A by approximate search distance in a queue• Perform exact search using index on B in descending
distance– Suspend if distance becomes lower than next entry in
the queue– If search completes, return as TSSD
Distance Computations Disk Accesses Est. Time
1) Sequential Scan 20,000,000,000 31,196 6.25 days
2) Exact Search 325,604,200 5,676,400 1.04 days
3) Leveraged 2,365,553 43,779 34 minutes
Time Series Data Mining GroupTime Series Data Mining Group
Conclusion
• Introduced the iSAX representation and shown how it can be used for indexing time series
• Demonstrated scalability and efficacy on massive datasets
• Showed how approximate and exact search can be used in conjunction to produce exact results on data mining problems
Time Series Data Mining GroupTime Series Data Mining Group
THANK YOU!