
E.G.M. Petrakis Searching Signals and Patterns


Searching Signals and Patterns

Given a query Q and a collection of N objects O1, O2, …, ON, search for exact or approximate matches

The ideal method should be:
- Fast: faster than sequential scanning
- Correct: returns all qualifying objects
- Dynamic: allows insertions, deletions, and updates


Similarity Queries

Range queries: find all objects O within distance e of the query: D(Q,O) < e, where the distance D and the tolerance e are user-defined

Nearest Neighbor (NN): find the k most similar objects

All-pairs (“spatial join”) queries: find all pairs of objects Oi, Oj within distance e of each other: D(Oi,Oj) < e
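The three query types can be illustrated with a sequential-scan baseline, the method an index is supposed to beat. This is a minimal sketch, assuming objects are equal-length numeric tuples and D is the Euclidean distance; the function names are hypothetical and chosen only for this illustration.

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(query, objects, e, dist=euclidean):
    """Indices of all objects within distance e of the query."""
    return [i for i, o in enumerate(objects) if dist(query, o) < e]

def knn_query(query, objects, k, dist=euclidean):
    """Indices of the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, range(len(objects)),
                           key=lambda i: dist(query, objects[i]))

def all_pairs(objects, e, dist=euclidean):
    """All index pairs (i, j), i < j, whose objects lie within distance e."""
    n = len(objects)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist(objects[i], objects[j]) < e]

objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
print(range_query((0.0, 0.0), objects, 1.5))
print(knn_query((0.0, 0.0), objects, 2))
print(all_pairs(objects, 1.1))
```

Every query here touches all N objects (all N(N-1)/2 pairs for the spatial join), which is the cost that indexing tries to avoid.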


Similarity Queries (cont'd)

Whole matching: the whole query Q matches an object Oi
- the image is 512x512, the query is 512x512

Partial matching: the query specifies only a part of an object
- find parts of objects that match the query
- the images are 512x512, the query is 32x32


Object Types

1D signals:
- time sequences
- scientific data
- digitized voice or music

2D signals:
- digitized images (gray scale, color)
- video clips

General objects: text, multimedia documents


Applications

In many applications, searching for similar patterns helps in prediction, decision making, data mining, etc.
- Financial
- Marketing & production of 1D signals
- Scientific databases
- DNA/genome databases
- Audio databases
- Image and video databases


Queries

Find companies whose stock prices move similarly or with similar pattern of growth

Find products with similar selling patterns

Find if a musical score is similar to one of the copyrighted scores

Find images that look like a sunset

Find X-rays showing a lung tumor


Indexing [Agrawal et al. 93]

To search faster than sequential scanning, the objects are indexed

- Extract f features from each object and apply a SAM (spatial access method) to index the object
- Search the SAM to retrieve promising objects
- Clean up the response

The indexing method must be correct (i.e., have no “misses”), have small space overhead, and be dynamic


Mapping Objects to Space

Objects are mapped to points

A query Q becomes a sphere with radius e


Mapping Objects to Points

F( ): mapping function
Df: object distance in feature space
D: object distance in actual space

How should F( ) and Df be selected?

Ideally, Df(Q,Oj) = D(Q,Oj): the mapping preserves the distances

At the very least, the mapping should guarantee no misses


GEMINI [Faloutsos 96]

GEMINI: Generic Multimedia Indexing

1. Define F( ): mapping of objects to f features (objects become vectors)
2. Determine the distance function Df in the f-dimensional feature space
3. Guarantee correctness: prove that Df <= D
4. Apply a SAM (e.g., R-tree) to index the f-dimensional vectors
5. Apply the search algorithm to eliminate false drops


Search Algorithm

Problem: retrieve all objects satisfying D(Q,O) < e

1. Retrieve the points satisfying Df(Q,O) < e
2. Retrieve the actual objects S
3. Keep only those satisfying D(Q,S) < e (discard false alarms)
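The search algorithm above is a filter-and-refine loop, which might be sketched as follows. The sketch makes several assumptions that are not in the slides: objects are numeric tuples, D and Df are both Euclidean, the hypothetical mapping `first_coords` simply keeps the first f coordinates (so Df can never exceed D), and a linear scan over the feature vectors stands in for the SAM.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def first_coords(obj, f=2):
    """Hypothetical F(): keep the first f coordinates of the object."""
    return obj[:f]

def filter_and_refine(query, objects, e, F=first_coords, D=euclidean, Df=euclidean):
    """Range query D(query, O) < e answered through the feature space.

    Returns (answers, candidates), both lists of indices into objects.
    """
    fq = F(query)
    # Filter step: a linear scan over feature vectors stands in for the SAM.
    candidates = [i for i, o in enumerate(objects) if Df(fq, F(o)) < e]
    # Refine step: compute the actual distance and drop the false alarms.
    answers = [i for i in candidates if D(query, objects[i]) < e]
    return answers, candidates

objects = [
    (1.0, 1.0, 1.0, 1.0),
    (1.0, 1.0, 4.0, 4.0),   # close in feature space, far in actual space
    (9.0, 9.0, 1.0, 1.0),
]
query = (1.0, 1.2, 1.0, 1.1)
answers, candidates = filter_and_refine(query, objects, e=0.5)
print(candidates, answers)
```

In this example the second object passes the filter but is discarded by the refinement step: it is a false alarm, which costs time but not correctness.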


Lower Bounding

Lemma: to guarantee no false dismissals, F( ) should satisfy Df(Q,Oi) <= D(Q,Oi) for all Q, Oi

Proof: we must show that if an object qualifies for the query, D(Q,Oi) <= e, then it is also retrieved in the feature space, Df(Q,Oi) <= e. Since Df(Q,Oi) <= D(Q,Oi) and D(Q,Oi) <= e, we have Df(Q,Oi) <= e


Indexing 1D Signals

Find all signals S = (s1,s2,…,sn) within distance e from Q = (q1,q2,…,qn): D(Q,S) < e

si, qi: amplitudes at time i

D is defined as the Euclidean distance

$$D(Q,S) = \sqrt{\sum_{i=1}^{n} |s_i - q_i|^2}$$

Apply GEMINI

But how are F( ) and Df( ) defined?


Definition of F, Df

DFT maps signals s = (s1,s2,…,sn) to the frequency spectrum S = (S1,S2,…,Sn)

F( ) takes the first fc Fourier coefficients
- fc: “cut-off” frequency (e.g., fc = 5)

Signals become points in an f = 2fc dimensional space (because the coefficients Si are complex numbers)

Df is defined as

$$D_f(Q,S) = \sqrt{\sum_{i=1}^{f_c} |S_i - Q_i|^2}$$

where S, Q are the DFTs of the two signals
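A possible realization of F( ) and Df is sketched below. It assumes a unitary DFT (scaled by 1/sqrt(n), so that Parseval's theorem holds without extra constants) computed directly from its definition; the function names `dft`, `features` and `feature_distance` are hypothetical. Each complex coefficient contributes its real and imaginary parts, which is where f = 2fc comes from.

```python
import cmath
import math

def dft(signal):
    """Unitary DFT (1/sqrt(n) scaling) of a real or complex sequence."""
    n = len(signal)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        for k in range(n)
    ]

def features(signal, fc):
    """F(): first fc DFT coefficients as a real vector of length 2*fc."""
    vec = []
    for coeff in dft(signal)[:fc]:
        vec.extend((coeff.real, coeff.imag))
    return vec

def feature_distance(fa, fb):
    """Df: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fa, fb)))

s = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
q = [1.0, 2.0, 3.5, 4.0, 3.0, 1.5, 1.0, 0.5]
fs, fq = features(s, fc=2), features(q, fc=2)
print(len(fs), feature_distance(fs, fq))
```

The direct O(n^2) transform keeps the sketch dependency-free; an FFT routine with the same scaling would be the usual choice for long signals.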


Df Lower Bounds D

Let S, Q be the DFTs of s, q

Parseval's theorem: the energy in the time and frequency domains is the same

$$\sum_{i=1}^{n} |s_i|^2 = \sum_{i=1}^{n} |S_i|^2$$

Because the DFT is linear, this implies that

$$\|S - Q\|^2 = \|s - q\|^2$$

and Df(Q,S) <= D(q,s), because Df is computed using only fc <= n of the terms
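The lower bound can be spelled out as a chain of (in)equalities. This is a sketch assuming Euclidean distances and a unitary DFT (scaled by 1/sqrt(n)); the explicit summation limits are written out here for illustration.

```latex
\begin{aligned}
D_f(Q,S)^2 &= \sum_{i=1}^{f_c} |S_i - Q_i|^2
  && \text{(first } f_c \text{ coefficients only)} \\
&\le \sum_{i=1}^{n} |S_i - Q_i|^2
  && \text{(the dropped terms are non-negative)} \\
&= \sum_{i=1}^{n} |s_i - q_i|^2
  && \text{(Parseval applied to } s - q \text{, using linearity of the DFT)} \\
&= D(q,s)^2
\end{aligned}
```

Taking square roots of both ends gives Df <= D, which is the condition required by the lower-bounding lemma.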


Experiments

- Faster than sequential scanning for all set sizes
- With more coefficients the search is slower but more accurate
- The trade-off reaches an equilibrium for f = 3 or 4


Intuition

For the majority of 1D signals there will be a few frequencies with high amplitudes

If we index only the first few coefficients (fc < 5 or 10), we shall have only a few false drops

R-trees can handle up to 20 dimensions for point data


NN Queries [Korn et al. 98]

Find the k-NNs of query Q:

1. Search the SAM to find the k-NNs using Df [e.g., Rous95]
2. Compute D for all these k objects
3. Let E = max{D(q,si)}, 1 <= i <= k
4. Issue a range query Df(q,s) <= E on the SAM and retrieve a new set of objects
5. Compute their actual distances D(q,s)
6. Output the nearest k objects
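The six steps can be traced in a short sketch. It assumes Euclidean D and Df, a hypothetical mapping `first_coords` that keeps only the first coordinate (truncation lower-bounds the Euclidean distance), and linear scans in place of the SAM; none of these choices come from the slides.

```python
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def first_coords(obj, f=1):
    """Hypothetical F(): truncation, which lower-bounds the Euclidean distance."""
    return obj[:f]

def knn_via_features(q, objects, k, F=first_coords, D=euclidean, Df=euclidean):
    feats = [F(o) for o in objects]
    fq = F(q)
    ids = range(len(objects))
    # Step 1: k-NN in feature space (a linear scan stands in for the SAM).
    first_k = heapq.nsmallest(k, ids, key=lambda i: Df(fq, feats[i]))
    # Steps 2-3: actual distances of these k objects; E is the largest one.
    E = max(D(q, objects[i]) for i in first_k)
    # Step 4: range query of radius E in feature space.
    candidates = [i for i in ids if Df(fq, feats[i]) <= E]
    # Steps 5-6: rank the candidates by actual distance and keep the best k.
    return heapq.nsmallest(k, candidates, key=lambda i: D(q, objects[i]))

objects = [(0.0, 9.0), (0.1, 8.0), (1.0, 0.0), (2.0, 0.5), (7.0, 0.0)]
q = (0.0, 0.0)
result = knn_via_features(q, objects, k=2)
print(result)
```

In this example the two nearest objects in feature space (indices 0 and 1) are far away in actual space, so step 1 alone would give the wrong answer; the range query of step 4 recovers the true neighbors.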


Correctness of NN Algorithm

Lemma: the algorithm has no misses

Proof: let sk be the actual k-th NN of q and sl the actual l-th NN (l <= k), so that D(q,sl) <= D(q,sk); we show that sl is retrieved too
- If the algorithm did not retrieve sl, then the range query (step 4) has missed it: Df(q,sl) > E
- From lower bounding: D(q,sl) >= Df(q,sl) > E (*)
- However, D(q,sk) <= E, because the k objects of step 1 all lie within actual distance E of q
- Combined with (*), D(q,sl) > D(q,sk), which contradicts D(q,sl) <= D(q,sk)


Partial Matching [Faloutsos 94]

Problem: given N data sequences S1,S2,…,SN and a query Q, locate the data subsequences that match a query subsequence
- locate stock prices with similar monthly patterns of growth
- extract f features, apply a SAM, etc.


Methodology

Locate matching windows of length w on each signal (length(S) - w + 1 positions)

Assume a minimum query length w
- the method handles any query of length w or more
- shorter queries are of no interest

Longer queries are split into queries of length w
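The window bookkeeping can be made concrete with a small sketch. The helper names are hypothetical, and the treatment of a query whose length is not a multiple of w (trailing samples are left out of the pieces) is an assumption of this example, not something the slides specify.

```python
def sliding_windows(signal, w):
    """All length-w windows of a signal, one per start offset."""
    return [signal[i:i + w] for i in range(len(signal) - w + 1)]

def split_query(query, w):
    """Consecutive, non-overlapping length-w pieces of a query.

    Trailing samples that do not fill a whole piece are left out
    (an assumption of this sketch).
    """
    return [query[i:i + w] for i in range(0, len(query) - w + 1, w)]

signal = [3, 1, 4, 1, 5, 9, 2, 6]
windows = sliding_windows(signal, w=3)
pieces = split_query([1, 2, 3, 4, 5, 6, 7], w=3)
print(len(windows), pieces)
```

A signal of length 8 with w = 3 yields 8 - 3 + 1 = 6 windows, matching the position count given above.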


Splitting a Query

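Mapping sequences S = (s1,s2,s3) and S’ = (s’1,s’2) and query Q = (q1,q2)

[Figure: the points s1, s2, s3, s’1, s’2 and the query points q1, q2, each with radius e, in the feature space with axes F1 and F2]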


Indexing Subsequences

I-naive method: index all w-trails
- Inefficient in terms of space and speed
- 1:f increase in storage; tall, slow R-tree

ST-index: index the w-trails in groups
- Subsequent trails are similar
- Grouping in the f-dimensional feature space
- Index rectangles containing similar trails


Grouping of Subsequences

Organize the w-trails in the f space in rectangles so that disk accesses are minimized
- Fixed number of points per rectangle, but which is the optimal number?
- Smaller rectangles, fewer disk accesses
- a rectangle L = (l1,l2,…,ln) causes Π(li+0.5) accesses
- an m-point rectangle causes Π(li+0.5)/m accesses per point


I-Adaptive Algorithm

Map the points of the w-trails into rectangles in the f space

Assign the first point of a w-trail to a rectangle

For each successive point: if it increases the cost of the rectangle, start a new rectangle; otherwise include it in the same rectangle
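The greedy grouping might look like the sketch below. It assumes that "cost" means the per-point access estimate Π(li+0.5)/m for a rectangle with side lengths li holding m points, and that feature coordinates are normalized to the unit cube; the function names are hypothetical.

```python
def marginal_cost(lo, hi, m):
    """Estimated accesses per point: prod(l_i + 0.5) / m for side lengths l_i."""
    cost = 1.0
    for a, b in zip(lo, hi):
        cost *= (b - a) + 0.5
    return cost / m

def group_trail(points):
    """Greedily split a trail of feature points into rectangles.

    Returns a list of (lo, hi, count) bounding rectangles in trail order.
    """
    rects = []
    lo = hi = None
    count = 0
    for p in points:
        if count == 0:
            lo, hi, count = list(p), list(p), 1
            continue
        new_lo = [min(a, x) for a, x in zip(lo, p)]
        new_hi = [max(b, x) for b, x in zip(hi, p)]
        if marginal_cost(new_lo, new_hi, count + 1) > marginal_cost(lo, hi, count):
            # The point makes the rectangle costlier per point: close it.
            rects.append((lo, hi, count))
            lo, hi, count = list(p), list(p), 1
        else:
            lo, hi, count = new_lo, new_hi, count + 1
    if count:
        rects.append((lo, hi, count))
    return rects

trail = [(0.10, 0.10), (0.11, 0.10), (0.12, 0.11), (0.90, 0.90), (0.91, 0.90)]
rects = group_trail(trail)
print([r[2] for r in rects])
```

Nearby points keep lowering the per-point cost and are absorbed; the jump to (0.90, 0.90) would inflate the rectangle, so a new one is started there.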


Naïve Method

Fixed number of points per rectangle


I-Adaptive Method

Variable number of points per rectangle

Smaller rectangles, fewer disk accesses


Range Queries [Petrakis 02]

Input: query Q, distances D, Df, tolerance e

Output: signals S satisfying D(Q,S) <= e

1. Decompose Q = (q1,q2,…,qn)
2. Apply the range query Df(qi,sj) <= e for each qi and store the results in Ai
3. Compute $A = \bigcup_{i=1}^{n} A_i$
4. For each S in A compute D(Q,S)
5. Output the sequences satisfying D(Q,S) <= e
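One way to read these steps for 1D sequences is sketched below, under several assumptions that the slides do not state: the query is decomposed into consecutive pieces of length w, the candidate sets Ai are combined by union, a candidate is a (sequence id, offset) alignment, D is the Euclidean distance between the query and the aligned subsequence, and the plain Euclidean distance between a piece and a window stands in for Df (it cannot exceed D, because it sums only some of the same squared terms). Linear scans replace the SAM.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def subsequence_range_query(query, sequences, w, e):
    """Find (sequence id, offset) alignments with D(query, subsequence) <= e."""
    n = len(query)
    # Step 1: decompose the query into consecutive length-w pieces.
    pieces = [(off, query[off:off + w]) for off in range(0, n - w + 1, w)]
    candidates = set()
    for off, piece in pieces:
        # Step 2: range query for this piece over all length-w windows.
        for sid, seq in enumerate(sequences):
            for j in range(len(seq) - w + 1):
                if euclidean(piece, seq[j:j + w]) <= e:
                    start = j - off  # alignment implied by this match
                    if 0 <= start <= len(seq) - n:
                        candidates.add((sid, start))  # step 3: union
    # Steps 4-5: compute the actual distance and keep the qualifying ones.
    return [(sid, start) for sid, start in sorted(candidates)
            if euclidean(query, sequences[sid][start:start + n]) <= e]

sequences = [
    [0, 0, 1, 2, 3, 4, 0, 0],
    [5, 5, 5, 5, 5, 5],
    [1, 2, 9, 9, 3, 4],
]
query = [1, 2, 3, 4]
answers = subsequence_range_query(query, sequences, w=2, e=0.5)
print(answers)
```

The third sequence matches each query piece somewhere, so it enters the candidate set, but no single alignment matches the whole query and it is discarded in the final step.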


NN Queries [Petrakis 02]

Input: query Q, distances D, Df, number k

Output: the k sequences most similar to Q

1. Decompose Q = (q1,q2,…,qn)
2. Apply a k-NN query for each qi
   - Retrieve k distinct w-trails (incremental k-NN search) [Hjaltason 99]
   - Compute ei, their maximum distance from Q
3. Compute e = min{ei}
4. Apply a range query D(Q,S) <= e
5. Output the k sequences closest to Q


References

R. Agrawal, C. Faloutsos, A. Swami, “Efficient Similarity Search in Sequence Databases”, Proc. of FODO Conf., Oct. 1993

C. Faloutsos, M. Ranganathan, Y. Manolopoulos, “Fast Subsequence Matching in Time-Series Databases”, Proc. of ACM SIGMOD, May 1994

P. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, Z. Protopapas, “Fast and Effective Retrieval of Medical Tumor Shapes”, IEEE TKDE, Vol. 11, 1998

E.G.M. Petrakis, “Fast Retrieval by Spatial Structure in Image DataBases”, Journal of Visual Languages and Computing, 2002 (to appear)

N. Roussopoulos, S. Kelley, F. Vincent, “Nearest-Neighbor Queries”, Proc. of ACM SIGMOD, May 1995

G. R. Hjaltason, H. Samet, “Distance Browsing in Spatial Databases”, ACM Trans. on Database Systems, 24(2):265–318, 1999