Page 1: University of Macau Discovering Longest-lasting Correlation in Sequence Databases Yuhong Li yb27407@umac.mo Department of Computer and Information Science

University of Macau

Discovering Longest-lasting Correlation in Sequence Databases

Yuhong Li

yb27407@umac.mo

Department of Computer and Information Science

University of Macau, Macau

Page 2

LCS: Motivation

■ Typical Analysis Query

Find the stock most correlated with GOOG in every 3-month window of 2008-2011.

Figure example: ρ = 0.985 on [4/3/2009, 4/6/2009].

It is hard to define a proper window length.

Correlation scale: -1 (perfect negative correlation), 0 (no correlation), +1 (perfect positive correlation).
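The fixed-length query on this slide can be sketched as follows. This is only an illustration under assumptions: ρ is taken to be the Pearson correlation, windows are non-overlapping, and the names `pearson` and `most_correlated_per_window` are hypothetical, not from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences, a value in [-1, 1]."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant sequences
    return cov / math.sqrt(var_x * var_y)

def most_correlated_per_window(query, database, window):
    """For each non-overlapping window, return (start, name, rho) of the best match.

    `database` maps a name to a sequence of the same length as `query`.
    """
    results = []
    for start in range(0, len(query) - window + 1, window):
        q = query[start:start + window]
        scored = (
            (name, pearson(q, seq[start:start + window]))
            for name, seq in database.items()
        )
        name, rho = max(scored, key=lambda pair: pair[1])
        results.append((start, name, rho))
    return results
```

The answer depends entirely on the chosen `window`, which is the difficulty the slide points out.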

Page 3

LCS: Motivation

■ Longest-lasting Correlated Subsequences

Find the longest-lasting period in which a stock performs similarly to GOOG in 2008-2011 (e.g., with a correlation threshold of 0.95).

Figure example: ρ = 0.95 on [28/2/2008, 5/4/2010].

It does not require a given length.

A correlation threshold seems a more natural parameter.

Page 4

Baseline Solution

[Figure: the query sequence and the sequence database, compared at each length and at offsets os = 0 and os = 1, against the correlation threshold.]
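A minimal brute-force sketch of such a baseline, under stated assumptions: all sequences have the same length as the query, subsequences are compared at the same offset, ρ is the Pearson correlation, and lengths below 2 are skipped because correlation is undefined there. The names `baseline_lcs` and `min_len` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant sequences
    return cov / math.sqrt(var_x * var_y)

def baseline_lcs(query, database, threshold, min_len=2):
    """Return (length, offset, name) of the longest aligned subsequence pair
    whose correlation reaches `threshold`, or None if there is none."""
    m = len(query)
    for length in range(m, min_len - 1, -1):       # try the longest length first
        for offset in range(0, m - length + 1):    # every offset at this length
            q = query[offset:offset + length]
            for name, seq in database.items():     # every sequence object
                if pearson(q, seq[offset:offset + length]) >= threshold:
                    return length, offset, name
    return None
```

Because lengths are tried from longest to shortest, the first hit is a longest-lasting answer; in the worst case every offset, length, and object is examined.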

Page 5

Challenges

■ Computing LCS is very time consuming

Number of subsequences: […]. Time complexity is […].

E.g., on the stock dataset (n = 2187, m = 1037), discovering LCS takes about 2 minutes.

Candidate subsequences for sequence 1, … , sequence n:

o1(0, m)

o1(0, m-1) o1(1, m-1)

o1(0, m-2) o1(1, m-2) o1(2, m-2)

o1(0, m-3) o1(1, m-3) o1(2, m-3) o1(3, m-3)

o1(0, m-4) o1(1, m-4) o1(2, m-4) o1(3, m-4) o1(4, m-4)

…

on(0, m)

on(0, m-1) on(1, m-1)

on(0, m-2) on(1, m-2) on(2, m-2)

on(0, m-3) on(1, m-3) on(2, m-3) on(3, m-3)

on(0, m-4) on(1, m-4) on(2, m-4) on(3, m-4) on(4, m-4)

Page 6

Main Idea

■ Time series are long → Dimensionality reduction

Thousands of dimensions.

Dimensionality reduction obeys the upper bounding lemma.

Raw subsequences, dim = m: correlation computing costs O(m).

PAA representation, dim = 3: correlation computing costs O(3).

■ Huge search space → Batch pruning

Group similar subsequences:

Intra-object grouping

Inter-object grouping

■ Unpruned subsequences → Further refinement
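One common way to read the upper bounding property is sketched below, as an assumption rather than the authors' exact formulation. For z-normalized sequences of length m, ρ = 1 - d²/(2m), where d is the Euclidean distance; the PAA distance never exceeds d, so it yields a value that never falls below the true correlation. The function names are hypothetical, the length is assumed to be divisible by the number of segments, and sequences are assumed to be non-constant.

```python
import math

def z_normalize(x):
    """Shift to mean 0 and scale to population standard deviation 1."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / std for v in x]

def paa(x, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-size segment."""
    size = len(x) // segments
    return [sum(x[i * size:(i + 1) * size]) / size for i in range(segments)]

def correlation(x, y):
    """Exact Pearson correlation, computed from all m values."""
    zx, zy = z_normalize(x), z_normalize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def correlation_upper_bound(x, y, segments):
    """Upper bound on the correlation, computed from `segments` PAA values each."""
    m = len(x)
    px = paa(z_normalize(x), segments)
    py = paa(z_normalize(y), segments)
    # Squared PAA distance, scaled so that it lower-bounds the squared
    # Euclidean distance between the z-normalized sequences.
    lower_bound_sq = (m / segments) * sum((a - b) ** 2 for a, b in zip(px, py))
    return 1.0 - lower_bound_sq / (2.0 * m)
```

If the bound is already below the correlation threshold, the pair can be discarded without touching the raw values. In practice the PAA vectors would be computed once and stored rather than rebuilt per comparison.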

Page 7

LCS: Diamond Cover Index

■ Intra-object grouping

Grouping similar subsequences in a sequence object.

o(0, m)

o(0, m-1) o(1, m-1)

o(0, m-2) o(1, m-2) o(2, m-2)

o(0, m-3) o(1, m-3) o(2, m-3) o(3, m-3)

o(0, m-4) o(1, m-4) o(2, m-4) o(3, m-4) o(4, m-4) ...

[Figure: PAA feature space (axes f1, f2) showing the group q(0, m), q(0, m-1), q(1, m-1), q(1, m-2), the group o(0, m), o(0, m-1), o(1, m-1), o(1, m-2), and the minDist between them.]
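The grouping idea can be illustrated with plain axis-aligned minimum bounding rectangles (MBRs) over PAA points. This is a generic sketch, not the diamond-shaped structure itself; `mbr` and `min_dist` are hypothetical names, and the points are made-up 2-D values.

```python
import math

def mbr(points):
    """Axis-aligned minimum bounding rectangle (low, high) of a group of points."""
    dims = range(len(points[0]))
    low = [min(p[d] for p in points) for d in dims]
    high = [max(p[d] for p in points) for d in dims]
    return low, high

def min_dist(box_a, box_b):
    """Smallest Euclidean distance between any point of box_a and any point of box_b."""
    (low_a, high_a), (low_b, high_b) = box_a, box_b
    total = 0.0
    for la, ha, lb, hb in zip(low_a, high_a, low_b, high_b):
        gap = max(lb - ha, la - hb, 0.0)  # 0 when the boxes overlap on this axis
        total += gap * gap
    return math.sqrt(total)

# Hypothetical PAA points of grouped subsequences of an object and of the query.
object_group = mbr([(0.0, 0.0), (1.0, 1.0), (1.0, 0.0)])
query_group = mbr([(4.0, 5.0), (5.0, 5.0)])
group_distance = min_dist(object_group, query_group)
```

Since `min_dist` never exceeds the distance between any pair of members, a single comparison of two groups can rule out every subsequence pair they contain.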

Page 8

LCS: Diamond Cover Index

■ Inter-object grouping

Exploiting similarity between sequence objects.

Grouping the diamond MBRs of different objects into higher-level MBRs → compact MBRs.

[Figure: the diamond MBRs MDô1(0,m), MDô2(0,m), MDô9(0,m) grouped into the compact MBR MCô(0,m), labelled with the diamond ID of MCô(0,m).]

DCI is the collection of the compact MBRs.

Memory efficient.

Offers good pruning ability.
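A hedged sketch of why a higher-level MBR helps: one box that encloses the MBRs of several objects lets a single distance test cover all of them. The slide does not say which MBRs are combined, so the grouping below is arbitrary, and `enclose` and `point_to_box` are hypothetical names.

```python
import math

def enclose(boxes):
    """Smallest axis-aligned box (low, high) containing every box in `boxes`."""
    dims = range(len(boxes[0][0]))
    low = [min(box[0][d] for box in boxes) for d in dims]
    high = [max(box[1][d] for box in boxes) for d in dims]
    return low, high

def point_to_box(point, box):
    """Smallest Euclidean distance from `point` to any location inside `box`."""
    low, high = box
    total = 0.0
    for p, lo, hi in zip(point, low, high):
        gap = max(lo - p, p - hi, 0.0)  # 0 when the point lies within this axis range
        total += gap * gap
    return math.sqrt(total)

# Made-up MBRs of three different objects, merged into one compact MBR.
member_boxes = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([0.5, -1.0], [2.0, 0.5]),
    ([0.2, 0.1], [0.8, 0.9]),
]
compact = enclose(member_boxes)
query_point = (5.0, 5.0)
compact_distance = point_to_box(query_point, compact)
```

The distance to the compact box never exceeds the distance to any member box, so if the compact box is far enough from the query, all members can be skipped at once. Only the compact boxes need to be stored, which is consistent with the memory claim on the slide.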

Page 9

LCS: Subsequence Refinement

[Figure: minDist between the group o(0, m), o(0, m-1), o(1, m-1), o(1, m-2) and the query group q, q(0, m-1), q(1, m-1), q(1, m-2).]

■ Incremental correlation computation

Reduce the correlation cost from […] to […].

Each subsequence pair costs […].
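One standard way to make correlation incremental is to keep running sums, so that the correlation of any aligned subsequence pair is obtained from a fixed number of lookups. The sketch below illustrates that general technique and is not claimed to be the method used in the talk; the class name `PrefixCorrelation` is hypothetical.

```python
import math

class PrefixCorrelation:
    """Prefix sums over two equal-length sequences x and y."""

    def __init__(self, x, y):
        self.sx, self.sy = [0.0], [0.0]
        self.sxx, self.syy, self.sxy = [0.0], [0.0], [0.0]
        for a, b in zip(x, y):
            self.sx.append(self.sx[-1] + a)
            self.sy.append(self.sy[-1] + b)
            self.sxx.append(self.sxx[-1] + a * a)
            self.syy.append(self.syy[-1] + b * b)
            self.sxy.append(self.sxy[-1] + a * b)

    def correlation(self, offset, length):
        """Pearson correlation of x[offset:offset+length] and y[offset:offset+length]."""
        end = offset + length
        sx = self.sx[end] - self.sx[offset]
        sy = self.sy[end] - self.sy[offset]
        sxx = self.sxx[end] - self.sxx[offset]
        syy = self.syy[end] - self.syy[offset]
        sxy = self.sxy[end] - self.sxy[offset]
        cov = sxy - sx * sy / length
        var_x = sxx - sx * sx / length
        var_y = syy - sy * sy / length
        if var_x <= 0 or var_y <= 0:
            return 0.0  # convention chosen here for constant subsequences
        return cov / math.sqrt(var_x * var_y)
```

Building the sums takes one pass over the pair of sequences; each later query reads ten stored values regardless of the subsequence length. Sums of squares can lose precision on long or large-valued series, so a production version would need care there.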

Page 10

LCS: Experimental Evaluation

■ Programming Language: C++

Machine: Ubuntu 12.04, 4GB RAM

■ Datasets

RAND: Randomly generated sequences.

STOCK: 2187 quoted companies on the NYSE from 2008 to 2012.

TAO: Sea surface temperatures, 28399 sequences of length 1008.

Page 11

LCS: Experimental Evaluation

SOTA: state-of-the-art method in distance calculation.

SKIP: incremental correlation computation.

SOTA+DCI, SKIP+DCI: DCI versions of SOTA and SKIP, respectively.

[Figure: results on the Stock dataset and the TAO dataset.]

At least one order of magnitude faster than the SOTA adaptation.

Page 12

Thanks

Q & A
