University of Macau
Discovering Longest-lasting Correlation in Sequence Databases
Yuhong Li
Department of Computer and Information Science
University of Macau, Macau
LCS: Motivation
■ Typical Analysis Query
Find the most correlated stock to GOOG for every 3 months in 2008 - 2011?
It is hard to define a proper length.
Example: ρ = 0.985 over [4/3/2009, 4/6/2009]
Correlation scale: -1 (perfect negative correlation), 0 (no correlation), +1 (perfect positive correlation)
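For reference, ρ on these slides reads as the Pearson correlation coefficient, which ranges from -1 to +1. A minimal sketch of that coefficient follows; the function name `pearson` is chosen here for illustration and does not come from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances are left unnormalised: the 1/n factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A linearly increasing copy of a sequence gives +1, a linearly decreasing copy gives -1.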
LCS: Motivation
■ Longest-lasting Correlated Subsequences
Find the longest-lasting period in which a stock performs similarly to GOOG in 2008-2011? (e.g., ρ = 0.95)
Example: ρ = 0.95 over [28/2/2008, 5/4/2010]
It does not require a given length.
A correlation threshold seems a more natural parameter.
Baseline Solution
Figure: the query sequence is compared with every sequence in the sequence database, for each length and each offset (os = 0, os = 1, …), against the correlation threshold.
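The baseline can be sketched as an exhaustive scan. This is an illustrative reading of the slide, not the authors' code: it assumes the query and every database sequence have the same length, that windows are aligned (same offset and length on both sides), and that a hypothetical `min_len` parameter excludes trivially short windows. The names `baseline_lcs` and `pearson` are invented for this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation, or None when a window is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def baseline_lcs(query, database, threshold, min_len=3):
    """Return (length, offset, sequence_index) of the longest aligned window
    whose correlation with the query reaches the threshold, or None."""
    m = len(query)
    for length in range(m, min_len - 1, -1):      # longest windows first
        for offset in range(0, m - length + 1):   # os = 0, 1, ...
            q = query[offset:offset + length]
            for idx, seq in enumerate(database):
                rho = pearson(q, seq[offset:offset + length])
                if rho is not None and rho >= threshold:
                    return length, offset, idx    # first hit is a longest one
    return None
```

Scanning lengths from longest to shortest lets the search stop at the first window that meets the threshold, but in the worst case every window of every sequence is still visited.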
Challenges
■ Computing LCS is very time consuming
  Number of subsequences: m(m+1)/2 per sequence, O(nm²) in total.
  Time complexity is O(nm³) when every correlation is computed from scratch.
  E.g., stock dataset, n = 2187, m = 1037: discovering LCS takes about 2 minutes.
Subsequences o(offset, length), enumerated for sequence 1, … , sequence n:
o1(0, m)
o1(0, m-1) o1(1, m-1)
o1(0, m-2) o1(1, m-2) o1(2, m-2)
o1(0, m-3) o1(1, m-3) o1(2, m-3) o1(3, m-3)
o1(0, m-4) o1(1, m-4) o1(2, m-4) o1(3, m-4) o1(4, m-4)
…
on(0, m)
on(0, m-1) on(1, m-1)
on(0, m-2) on(1, m-2) on(2, m-2)
on(0, m-3) on(1, m-3) on(2, m-3) on(3, m-3)
on(0, m-4) on(1, m-4) on(2, m-4) on(3, m-4) on(4, m-4)
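The size of the search space can be read off the triangular enumeration. The derivation below is a sketch under two assumptions not stated on the slide: o(s, l) denotes the subsequence starting at offset s with length l, and one correlation over a window of length l costs on the order of l operations when computed from scratch.

```latex
% Windows per sequence: length m-k admits k+1 offsets (k = 0, ..., m-1).
\[
  \sum_{k=0}^{m-1} (k+1) \;=\; \frac{m(m+1)}{2}
\]
% From-scratch cost per sequence: (m-l+1) windows of length l, each costing l.
\[
  \sum_{l=1}^{m} (m-l+1)\, l \;=\; \frac{m(m+1)(m+2)}{6}
\]
% Stock dataset from the slide: n = 2187, m = 1037.
\[
  n \cdot \frac{m(m+1)}{2} \;=\; 2187 \cdot 538{,}203 \;=\; 1{,}177{,}049{,}961 \;\approx\; 1.18 \times 10^{9}
  \text{ candidate windows}
\]
```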
Main Idea
■ Time Series are Long → Dimensionality Reduction
  Thousands of dimensions.
  Dimensionality reduction obeys the upper bounding lemma.
  Figure: correlation on raw subsequences (dim = m, computing costs O(m)) ≤ bound from the PAA representation (dim = 3, computing costs O(3)).
■ Huge Search Space → Batch Pruning
  Group similar subsequences: intra-object grouping and inter-object grouping.
■ Unpruned Subsequences → Further Refinement
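One way to see why a reduced representation can bound the correlation from above: for z-normalised sequences of length m, ρ = 1 - d²/(2m), where d is the Euclidean distance, and PAA distances lower-bound d. The sketch below illustrates this standard PAA bound, not necessarily the exact lemma used by the authors. It assumes population standard deviation for z-normalisation and a window length divisible by the number of segments; all function names are invented for illustration.

```python
import math

def znorm(x):
    """Z-normalise with the population standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        raise ValueError("constant sequence cannot be z-normalised")
    return [(v - mean) / std for v in x]

def paa(x, segments):
    """Piecewise Aggregate Approximation: mean of each equal-size segment."""
    if len(x) % segments != 0:
        raise ValueError("length must be divisible by the number of segments")
    size = len(x) // segments
    return [sum(x[i * size:(i + 1) * size]) / size for i in range(segments)]

def correlation(x, y):
    """Exact Pearson correlation via z-normalised dot product, O(m)."""
    zx, zy = znorm(x), znorm(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def correlation_upper_bound(x, y, segments=3):
    """Upper bound on the correlation computed from PAA features only.
    Once the features are stored, the bound itself costs O(segments)."""
    px, py = paa(znorm(x), segments), paa(znorm(y), segments)
    d2 = sum((a - b) ** 2 for a, b in zip(px, py))
    return 1.0 - d2 / (2.0 * segments)
```

If the upper bound is already below the correlation threshold, the exact O(m) computation for that pair can be skipped.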
LCS: Diamond Cover Index
■ Intra-object grouping
  Grouping similar subsequences in a sequence object.
o(0, m)
o(0, m-1) o(1, m-1)
o(0, m-2) o(1, m-2) o(2, m-2)
o(0, m-3) o(1, m-3) o(2, m-3) o(3, m-3)
o(0, m-4) o(1, m-4) o(2, m-4) o(3, m-4) o(4, m-4) ...
Figure: in the PAA feature space (axes f1, f2), the subsequences o(0, m), o(0, m-1), o(1, m-1), o(1, m-2) are grouped together, and minDist is measured to the query subsequences q(0, m), q(0, m-1), q(1, m-1), q(1, m-2).
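The grouping idea can be sketched with a minimum bounding rectangle (MBR) over the PAA features of neighbouring subsequences, plus a minDist from a query feature point to that rectangle. This is a simplified illustration: it takes the feature vectors as given and uses a point-to-rectangle distance, which may differ from the minDist defined by the authors. The names `mbr` and `min_dist` are hypothetical.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of feature vectors: (low corner, high corner)."""
    dims = range(len(points[0]))
    low = [min(p[d] for p in points) for d in dims]
    high = [max(p[d] for p in points) for d in dims]
    return low, high

def min_dist(query_point, box):
    """Smallest Euclidean distance from a point to any point inside the box."""
    low, high = box
    total = 0.0
    for q, lo, hi in zip(query_point, low, high):
        if q < lo:
            gap = lo - q
        elif q > hi:
            gap = q - hi
        else:
            gap = 0.0          # the query lies within the box on this axis
        total += gap * gap
    return math.sqrt(total)
```

Because minDist never exceeds the distance to any grouped feature vector, one comparison can rule out the whole group at once.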
LCS: Diamond Cover Index
■ Inter-object Grouping
  Exploiting similarity between sequence objects.
  Grouping the diamond MBRs of different objects into higher level MBRs → Compact MBRs.
Figure: the diamond MBRs MD(0,m) of objects o1, o2 and o9 are grouped into the compact MBR MC(0,m), which is identified by its diamond ID.
■ DCI is the collection of the compact MBRs.
  Memory efficient.
  Offers good pruning ability.
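A rough sketch of the inter-object step: MBRs that cover the same diamond in different objects are merged into one enclosing rectangle, stored under the diamond ID. How the authors select which objects to merge is not stated on the slide; here all given boxes are simply merged. The names `merge_mbrs`, `min_dist` and the dictionary layout are assumptions made for illustration.

```python
import math

def merge_mbrs(boxes):
    """Smallest rectangle enclosing all given (low, high) rectangles."""
    lows = [low for low, _ in boxes]
    highs = [high for _, high in boxes]
    low = [min(col) for col in zip(*lows)]
    high = [max(col) for col in zip(*highs)]
    return low, high

def min_dist(query_point, box):
    """Smallest Euclidean distance from a point to any point inside the box."""
    low, high = box
    total = 0.0
    for q, lo, hi in zip(query_point, low, high):
        gap = max(lo - q, 0.0, q - hi)   # zero when q lies inside [lo, hi]
        total += gap * gap
    return math.sqrt(total)

# Hypothetical index layout: diamond ID -> compact MBR over several objects.
diamond_mbrs = {
    "o1": ([0.0, 0.0], [1.0, 1.0]),
    "o2": ([2.0, 1.0], [3.0, 4.0]),
}
index = {("diamond", 0): merge_mbrs(list(diamond_mbrs.values()))}
```

The merged rectangle's minDist is never larger than that of any member rectangle, so pruning on the compact MBR never discards a group that a member MBR would have kept.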
LCS: Subsequence Refinement
Figure: minDist between the grouped subsequences o(0, m), o(0, m-1), o(1, m-1), o(1, m-2) and the subsequences q(0, m-1), q(1, m-1), q(1, m-2) of the query q.
■ Incremental correlation computation
  Reduces the correlation cost of each subsequence pair.
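One common way to make correlation incremental is to keep prefix sums of x, y, x², y² and xy, so that the correlation of any aligned window follows from five subtractions. The sketch below illustrates that general technique and is not claimed to be the authors' SKIP method. The class name `RunningCorrelation` is invented, and for very long series with large values the prefix sums may lose floating-point precision.

```python
import math

class RunningCorrelation:
    """Correlation of any aligned window in O(1) after O(m) preprocessing."""

    def __init__(self, x, y):
        if len(x) != len(y):
            raise ValueError("sequences must have equal length")
        # prefix[k] holds the sum over the first k elements.
        self.sx, self.sy = [0.0], [0.0]
        self.sxx, self.syy, self.sxy = [0.0], [0.0], [0.0]
        for a, b in zip(x, y):
            self.sx.append(self.sx[-1] + a)
            self.sy.append(self.sy[-1] + b)
            self.sxx.append(self.sxx[-1] + a * a)
            self.syy.append(self.syy[-1] + b * b)
            self.sxy.append(self.sxy[-1] + a * b)

    def corr(self, offset, length):
        """Pearson correlation of x[offset:offset+length] and the same window of y."""
        end = offset + length
        sx = self.sx[end] - self.sx[offset]
        sy = self.sy[end] - self.sy[offset]
        sxx = self.sxx[end] - self.sxx[offset]
        syy = self.syy[end] - self.syy[offset]
        sxy = self.sxy[end] - self.sxy[offset]
        cov = sxy - sx * sy / length
        var_x = sxx - sx * sx / length
        var_y = syy - sy * sy / length
        if var_x <= 0 or var_y <= 0:
            return None            # constant window: correlation undefined
        return cov / math.sqrt(var_x * var_y)
```

With this layout, moving from o(s, l) to a neighbouring window such as o(s+1, l-1) needs no rescan of the raw values.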
LCS: Experimental Evaluation
■ Programming Language: C++
■ Machine: Ubuntu 12.04, 4GB RAM
■ Datasets
  RAND: randomly generated sequences.
  STOCK: 2187 quoted companies in NYSE from 2008 to 2012.
  TAO: sea surface temperatures, 28399 sequences of length 1008.
LCS: Experimental Evaluation
SOTA: state-of-the-art method in distance calculation.
SKIP: incremental correlation computation.
SOTA+DCI, SKIP+DCI: DCI versions of SOTA and SKIP respectively.
Figure: results on the Stock dataset and the TAO dataset.
At least one order of magnitude faster than the SOTA adaptation.
Thanks
Q & A