E.G.M. Petrakis Searching Signals and Patterns
Searching Signals and Patterns
Given a query Q and a collection of N objects O1, O2, …, ON, search for the objects that match Q exactly or approximately
The ideal method should be:
  Fast: faster than sequential scanning
  Correct: returns all qualifying objects
  Dynamic: allows for insertions, deletions and updates
Similarity Queries
Range queries: find all objects O within distance e from the query: D(Q,O) < e, where the distance D and the tolerance e are user defined
Nearest Neighbor (NN) queries: find the k objects most similar to the query
All-pairs (“spatial join”) queries: find all pairs of objects Oi, Oj within distance e of each other: D(Oi,Oj) < e
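The three query types can be illustrated with a plain sequential scan, which is the baseline that an index is meant to beat. This is only a sketch: the Euclidean distance and the function names `range_query`, `knn_query` and `all_pairs_query` are assumptions made for the example, since the slides leave D user defined.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Assumed distance D; any user-defined distance could be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(query, objects, eps, dist=euclidean):
    """Indices of all objects O with D(Q, O) < eps."""
    return [i for i, obj in enumerate(objects) if dist(query, obj) < eps]


def knn_query(query, objects, k, dist=euclidean):
    """Indices of the k objects closest to the query."""
    ranked = sorted(range(len(objects)), key=lambda i: dist(query, objects[i]))
    return ranked[:k]


def all_pairs_query(objects, eps, dist=euclidean):
    """Index pairs (i, j), i < j, with D(Oi, Oj) < eps."""
    return [(i, j) for i, j in combinations(range(len(objects)), 2)
            if dist(objects[i], objects[j]) < eps]


objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 0.5)]
query = (0.0, 0.0)

print(range_query(query, objects, 1.0))   # [0, 3]
print(knn_query(query, objects, 2))       # [0, 3]
print(all_pairs_query(objects, 1.2))      # [(0, 1), (0, 3), (1, 3)]
```

Every call here touches all N objects (or all pairs), which is the cost the indexing methods in the following slides try to avoid.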
Similarity Queries (cont'd)
Whole matching: the whole query Q matches an object Oi
  e.g., the image is 512x512 and the query is 512x512
Partial matching: the query specifies only a part of an object; find the parts of objects that match the query
  e.g., the images are 512x512 and the query is 32x32
Object Types
1D signals:
  time sequences
  scientific data
  digitized voice or music
2D signals:
  digitized images (gray scale, color)
  video clips
General objects: text, multimedia documents
Applications
In many applications, searching for similar patterns helps in prediction, decision making, data mining, etc.
  Financial, marketing & production: 1D signals
  Scientific databases
  DNA/genome databases
  Audio databases
  Image and video databases
Queries
Find companies whose stock prices move similarly or with similar pattern of growth
Find products with similar selling patterns
Find if a musical score is similar to one of the copyrighted scores
Find images that look like a sunset
Find X-rays showing a lung tumor
Indexing [Agrawal et al. 93]
To search faster than sequential scanning, the objects are indexed:
  Extract f features from each object and index the resulting feature vector with a Spatial Access Method (SAM)
  Search the SAM to retrieve promising objects
  Clean up the response
The indexing method must be correct (i.e., have no “misses”), have small space overhead and be dynamic
Mapping Objects to Space
Objects are mapped to points in the feature space
A query Q becomes a sphere with radius e
Mapping Objects to Points
F( ): mapping function
Df: object distance in feature space
D: object distance in actual space
How should F( ) and Df be selected?
Ideally, Df(Qi,Oj) = D(Qi,Oj): the mapping preserves the distances
The mapping should guarantee no misses
GEMINI [Faloutsos 96]
GEMINI: Generic Multimedia Indexing
1. Define F( ): the mapping of objects to f features (objects become vectors)
2. Determine the distance function Df in the f-dimensional feature space
3. Guarantee correctness: prove that Df <= D
4. Apply a SAM (e.g., R-tree) to index the f-dimensional vectors
5. Apply the Search Algorithm to eliminate false drops
Search Algorithm
Problem: retrieve all objects O satisfying D(Q,O) < e
1. Retrieve the points satisfying Df(Q,O) < e in the feature space
2. Retrieve the actual objects S that correspond to these points
3. Keep only those satisfying D(Q,S) < e (discard false alarms)
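The filter-and-refine idea behind the search algorithm can be sketched in a few lines. The choices below are assumptions made for illustration only: the hypothetical `extract_features` keeps the first f coordinates of each object (dropping terms can never increase a Euclidean distance, so Df <= D holds), and a linear scan over the feature points stands in for the SAM.

```python
import math


def dist(a, b):
    """Euclidean distance, used both for D and for Df in this sketch."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def extract_features(obj, f=2):
    """Hypothetical F(): keep the first f coordinates (a lower-bounding mapping)."""
    return tuple(obj[:f])


def filter_and_refine(query, objects, eps, f=2):
    q_feat = extract_features(query, f)
    # Stand-in for the SAM: a plain list of feature points.
    index = [extract_features(obj, f) for obj in objects]
    # Filter step: retrieve the points with Df < eps.
    candidates = [i for i, p in enumerate(index) if dist(q_feat, p) < eps]
    # Refine step: fetch the actual objects and discard false alarms.
    answers = [i for i in candidates if dist(query, objects[i]) < eps]
    return candidates, answers


objects = [(0, 0, 0, 0), (0.5, 0, 0, 0), (0, 0, 3, 0), (4, 4, 0, 0)]
query = (0, 0, 0, 0)
candidates, answers = filter_and_refine(query, objects, eps=1.0)
print(candidates)  # [0, 1, 2]  (object 2 is a false alarm)
print(answers)     # [0, 1]
```

Object 2 looks close in the feature space but is far away in the actual space, so the refine step removes it. No qualifying object is lost, because the feature distance never exceeds the actual one.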
Lower Bounding
Lemma: to guarantee no false dismissals, F( ) should satisfy Df(Q,Oi) <= D(Q,Oi) for all Q, Oi
Proof: we must show that every object that qualifies for the query is also retrieved in the feature space. If Oi qualifies, then D(Q,Oi) <= e; since Df(Q,Oi) <= D(Q,Oi), we have Df(Q,Oi) <= e, so Oi is retrieved
Indexing 1D Signals
Find all signals S = (s1,s2,…,sn) within distance e from Q = (q1,q2,…,qn): D(Q,S) < e
si, qi: amplitudes at time i
D is defined as the Euclidean distance: $D(Q,S) = \left( \sum_{i=1}^{n} |s_i - q_i|^2 \right)^{1/2}$
Apply GEMINI
But how are F( ) and Df( ) defined?
Definition of F, Df
The DFT maps a signal s = (s1,s2,…,sn) to its frequency spectrum S = (S1,S2,…,Sn)
F( ) keeps the first fc Fourier coefficients
  fc: “cut-off” frequency (e.g., fc = 5)
Signals become points in an f = 2fc dimensional space (because the coefficients Si are complex numbers)
Df is defined as $D_f(Q,S) = \left( \sum_{i=1}^{f_c} |S_i - Q_i|^2 \right)^{1/2}$
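A minimal sketch of this feature extraction follows. It assumes a DFT normalized by 1/sqrt(n), so that signal energy is preserved, and a Euclidean distance in both spaces; the helper names `dft`, `features` and `euclid` are made up for the example, and the naive O(n^2) transform is used only to keep the block free of dependencies.

```python
import cmath
import math


def dft(signal):
    """Naive DFT with the assumed 1/sqrt(n) normalization (energy preserving)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        / math.sqrt(n)
        for k in range(n)
    ]


def features(signal, fc):
    """F(): first fc coefficients, flattened into a point with 2*fc real coordinates."""
    point = []
    for c in dft(signal)[:fc]:
        point.extend((c.real, c.imag))
    return point


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


s = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
q = [1.5, 2.5, 2.5, 3.5, 3.0, 1.0, 1.0, 0.5]

d_true = euclid(s, q)                            # D in the time domain
d_feat = euclid(features(s, 2), features(q, 2))  # Df on 2*fc = 4 coordinates
print(len(features(s, 2)), d_feat <= d_true)
```

With this normalization the feature distance can only underestimate the true distance, which is the property the indexing scheme relies on.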
Df Lower Bounds D
Let S, Q be the DFTs of s, q
Parseval's theorem: the energy in the time and frequency domains is the same: $\sum_{i=1}^{n} |s_i|^2 = \sum_{i=1}^{n} |S_i|^2$
This implies that $\|s - q\|^2 = \|S - Q\|^2$, and therefore Df(Q,S) <= D(q,s), because Df is computed using only fc <= n of the terms
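Spelled out, the lower-bounding argument is a short chain of inequalities. The derivation below assumes that D is the Euclidean distance between the signals s and q, that Df is the Euclidean distance over the first $f_c$ coefficients of their DFTs S and Q, and that the DFT is normalized so that Parseval's theorem holds without a scaling factor.

```latex
\begin{aligned}
D_f(Q,S)^2 &= \sum_{i=1}^{f_c} |S_i - Q_i|^2 \\
           &\le \sum_{i=1}^{n} |S_i - Q_i|^2
             && \text{(more non-negative terms, since } f_c \le n\text{)} \\
           &= \sum_{i=1}^{n} |s_i - q_i|^2
             && \text{(Parseval applied to } s - q\text{, whose DFT is } S - Q\text{)} \\
           &= D(q,s)^2
\end{aligned}
```

Taking square roots gives $D_f(Q,S) \le D(q,s)$, which is the condition required for no false dismissals.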
Experiments
Faster than sequential scanning for all data set sizes
With more coefficients the search is slower but more accurate
The trade-off reaches an equilibrium for f = 3 or 4
Intuition
For the majority of 1D signals there will be a few frequencies with high amplitudes
If we index only the first fc coefficients (fc < 5 or 10), we shall have only a few false drops
R-trees can handle up to 20 dimensions for point data
NN Queries [Korn et al. 98]
Find the k-NNs of query Q:
1. Search the SAM to find the k-NNs of Q using Df [e.g., Rous95]
2. Compute the actual distance D for all these k objects
3. Let E = max{D(q,si)}, 1 <= i <= k
4. Issue a range query Df(q,s) <= E on the SAM and retrieve a new set of objects
5. Compute their actual distances D(q,s)
6. Output the nearest k objects
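The steps of this k-NN algorithm can be traced with a small sketch. As before, the feature mapping is a hypothetical one (keep the first f coordinates, which lower-bounds the Euclidean distance), and sorted or linear scans over the feature points stand in for the SAM searches; the names `feat` and `knn_search` are assumptions.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def feat(obj, f=1):
    """Hypothetical lower-bounding feature mapping: first f coordinates."""
    return obj[:f]


def knn_search(query, objects, k, f=1):
    q_feat = feat(query, f)
    feats = [feat(obj, f) for obj in objects]
    ids = range(len(objects))
    # Step 1: k-NN in the feature space using Df.
    by_feat = sorted(ids, key=lambda i: dist(q_feat, feats[i]))[:k]
    # Steps 2-3: actual distances of these k objects; E is the largest.
    big_e = max(dist(query, objects[i]) for i in by_feat)
    # Step 4: range query with radius E in the feature space.
    candidates = [i for i in ids if dist(q_feat, feats[i]) <= big_e]
    # Steps 5-6: rank the candidates by actual distance and keep the nearest k.
    return sorted(candidates, key=lambda i: dist(query, objects[i]))[:k]


objects = [(0.1, 5.0), (0.2, 4.0), (1.0, 0.0), (1.5, 0.1), (9.0, 9.0)]
query = (0.0, 0.0)
result = knn_search(query, objects, k=2)
print(result)  # [2, 3]
```

In this example the two nearest points in the feature space (objects 0 and 1) are not the true nearest neighbors. The range query with radius E brings in objects 2 and 3, and the final ranking by actual distance returns them.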
Correctness of NN Algorithm
Lemma: the algorithm has no misses
Proof: let sl be the l-th actual NN of q (l <= k) and suppose that the algorithm did not retrieve it
Then the range query (step 4) has missed it: Df(q,sl) > E
From lower bounding: D(q,sl) >= Df(q,sl) > E (*)
However, the k objects of step 1 all satisfy D(q,s) <= E, so the k-th actual NN sk satisfies D(q,sk) <= E and therefore D(q,sl) <= D(q,sk) <= E, which contradicts (*)
Partial Matching [Faloutsos 94]
Problem: given N data sequences S1, S2, …, SN and a query Q, locate the data subsequences that match a query subsequence
  e.g., locate stock prices with similar monthly patterns of growth
  extract f features, apply a SAM, etc.
Methodology
Slide a matching window of length w over each signal S (length(S) - w + 1 positions)
Assume a minimum query length w
  the method handles any query of length w or more
  shorter queries are of no interest
Longer queries are split into queries of length w (w-queries)
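The two operations on this slide, sliding a window of length w over a signal and cutting a long query into w-queries, can be sketched as follows. The function names are made up for the example, and dropping a trailing remainder shorter than w is an assumption, since the slides do not say how it is handled.

```python
def sliding_windows(signal, w):
    """All windows of length w: one per position, len(signal) - w + 1 in total."""
    return [signal[i:i + w] for i in range(len(signal) - w + 1)]


def split_query(query, w):
    """Consecutive, non-overlapping pieces of length w.

    Assumption: a trailing remainder shorter than w is dropped.
    """
    return [query[i:i + w] for i in range(0, len(query) - w + 1, w)]


signal = [3, 1, 4, 1, 5, 9]
print(sliding_windows(signal, 3))
# [[3, 1, 4], [1, 4, 1], [4, 1, 5], [1, 5, 9]]

print(split_query([1, 2, 3, 4, 5, 6, 7], 3))
# [[1, 2, 3], [4, 5, 6]]
```

Each window would then be mapped to a point in the feature space, so a signal of length 6 with w = 3 contributes 4 points.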
Splitting a Query
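Mapping sequences S = (s1,s2,s3) and S' = (s'1,s'2) and query Q = (q1,q2)
[Figure: the points s1, s2, s3, s'1, s'2 and the query points q1, q2 plotted in the feature space with axes F1 and F2; a radius e is marked at each of q1 and q2]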
Indexing Subsequences
I-naive method: index all w-trails
  Inefficient in terms of space and speed
  1:f increase in storage; tall, slow R-tree
ST-index: index the w-trails in groups
  Successive trails are similar
  Grouping in the f-dimensional feature space
  Index rectangles containing similar trails
Grouping of Subsequences
Organize the w-trails in the f-dimensional space in rectangles so that disk accesses are minimized
  Fixed number of points per rectangle, but which is the optimal number?
  Smaller rectangles, fewer disk accesses
A rectangle with sides L = (l1,l2,…,ln) causes Π(li+0.5) accesses
A rectangle with m points causes Π(li+0.5)/m accesses per point
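The two cost formulas are easy to evaluate directly. The sketch below simply codes them as written on the slide; the function names are made up, and `math.prod` requires Python 3.8 or later.

```python
import math


def rectangle_cost(sides):
    """Estimated accesses for a rectangle with side lengths l1..ln: prod(li + 0.5)."""
    return math.prod(l + 0.5 for l in sides)


def cost_per_point(sides, m):
    """Estimated accesses per point when the rectangle holds m points."""
    return rectangle_cost(sides) / m


print(rectangle_cost((1.0, 2.0)))      # 3.75
print(cost_per_point((1.0, 2.0), 5))   # 0.75
```

The per-point figure makes the trade-off visible: a rectangle is worth enlarging only while the extra points it absorbs outweigh the growth of its sides.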
I-Adaptive Algorithm
Map the points of the w-trails into rectangles in the f-dimensional space
Assign the first point of a w-trail to a rectangle
For each successive point: if it increases the cost of the rectangle, start a new rectangle; otherwise include it in the same rectangle
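One possible reading of this greedy rule is sketched below. It assumes that the "cost of the rectangle" is the per-point cost Π(li+0.5)/m of the bounding rectangle of the points grouped so far; the slides do not pin this down, so the comparison used here, like the names `per_point_cost` and `group_trail`, is an assumption.

```python
import math


def per_point_cost(lo, hi, m):
    """Assumed cost: prod(side + 0.5) / m for the bounding rectangle [lo, hi]."""
    return math.prod((h - l) + 0.5 for l, h in zip(lo, hi)) / m


def group_trail(points):
    """Greedily split a trail of feature points into runs, one rectangle per run."""
    if not points:
        return []
    groups = []
    current = [points[0]]
    lo, hi = list(points[0]), list(points[0])
    for p in points[1:]:
        new_lo = [min(a, b) for a, b in zip(lo, p)]
        new_hi = [max(a, b) for a, b in zip(hi, p)]
        grown = per_point_cost(new_lo, new_hi, len(current) + 1)
        if grown > per_point_cost(lo, hi, len(current)):
            # Adding p would raise the cost: close this rectangle, start a new one.
            groups.append(current)
            current, lo, hi = [p], list(p), list(p)
        else:
            current.append(p)
            lo, hi = new_lo, new_hi
    groups.append(current)
    return groups


trail = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 3.0)]
groups = group_trail(trail)
print([len(g) for g in groups])  # [3, 2]
```

The jump from (0.2, 0.1) to (3.0, 3.0) would blow up the rectangle, so a new one is started there. The number of points per rectangle therefore varies with the shape of the trail.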
Naïve Method
Fixed number of points per rectangle
I-Adaptive Method
Variable number of points per rectangle
Smaller rectangles, fewer disk accesses
Range Queries [Petrakis 02]
Input: query Q, distances D, Df, tolerance e
Output: signals S satisfying D(Q,S) <= e
1. Decompose Q into w-queries: Q = (q1,q2,…,qn)
2. For each qi, apply the range query Df(qi,sj) <= e and store the results in Ai
3. Compute $A = \bigcup_{i=1}^{n} A_i$
4. For each S in A compute D(Q,S)
5. Output the sequences satisfying D(Q,S) <= e
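The range query can be sketched end to end under some loudly stated assumptions. The candidate sets Ai are combined by union; D(Q,S) is taken to be the smallest Euclidean distance between Q and any window of S of the same length; Df is the plain Euclidean distance between a query piece and a window of length w; and linear scans stand in for the index. None of these choices is confirmed by the slides, and all function names are hypothetical.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def windows(seq, w):
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def best_match_distance(query, seq):
    """Assumed D(Q, S): best alignment of Q against a same-length window of S."""
    return min((euclid(query, win) for win in windows(seq, len(query))),
               default=math.inf)


def subsequence_range_query(query, sequences, w, eps):
    # Step 1: decompose Q into pieces of length w.
    pieces = [query[i:i + w] for i in range(0, len(query) - w + 1, w)]
    candidates = set()
    for piece in pieces:
        # Step 2: A_i = sequences having some window within eps of this piece.
        a_i = {idx for idx, seq in enumerate(sequences)
               if any(euclid(piece, win) <= eps for win in windows(seq, w))}
        # Step 3: union of the A_i.
        candidates |= a_i
    # Steps 4-5: compute D(Q, S) for the candidates and keep the qualifying ones.
    return sorted(idx for idx in candidates
                  if best_match_distance(query, sequences[idx]) <= eps)


sequences = [
    [0, 1, 2, 3, 4, 0],
    [1, 2, 9, 9, 9, 9],
    [5, 5, 5, 5, 5, 5],
]
result = subsequence_range_query([1, 2, 3, 4], sequences, w=2, eps=0.5)
print(result)  # [0]
```

Sequence 1 matches the first piece (1, 2) and so enters the candidate set, but it is discarded in the last step because the whole query does not match it.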
NN Queries [Petrakis 02]
Input: query Q, distances D, Df, number k
Output: the k sequences most similar to Q
1. Decompose Q = (q1,q2,…,qn)
2. Apply a k-NN query for each qi
   Retrieve k distinct w-trails (incremental k-NN search) [Hjaltason 99]
   Compute ei, their max distance from Q
3. Compute e = min{ei}
4. Apply a range query D(Q,S) <= e
5. Output the k sequences closest to Q
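A brute-force sketch of these steps follows. It assumes that D(Q,S) is the smallest Euclidean distance between Q and any same-length window of S, that each k-NN query ranks sequences by their closest window to the piece qi, and that sorted scans replace the incremental k-NN search on the index. These are assumptions for illustration, and the function names are hypothetical.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def windows(seq, w):
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def best_match_distance(query, seq):
    """Assumed D(Q, S): best alignment of Q against a same-length window of S."""
    return min((euclid(query, win) for win in windows(seq, len(query))),
               default=math.inf)


def subsequence_knn(query, sequences, w, k):
    ids = range(len(sequences))
    # Step 1: decompose Q into pieces of length w.
    pieces = [query[i:i + w] for i in range(0, len(query) - w + 1, w)]
    radii = []
    for piece in pieces:
        def piece_distance(idx):
            return min((euclid(piece, win) for win in windows(sequences[idx], w)),
                       default=math.inf)
        # Step 2: k distinct sequences nearest to this piece, then e_i, the
        # largest actual distance from Q among them.
        nearest = sorted(ids, key=piece_distance)[:k]
        radii.append(max(best_match_distance(query, sequences[i]) for i in nearest))
    # Step 3: the tightest radius that still covers k sequences.
    eps = min(radii)
    # Step 4: range query D(Q, S) <= e (a linear scan in this sketch).
    in_range = [i for i in ids if best_match_distance(query, sequences[i]) <= eps]
    # Step 5: output the k closest sequences.
    return sorted(in_range, key=lambda i: best_match_distance(query, sequences[i]))[:k]


sequences = [
    [0, 1, 2, 3, 4, 0],
    [1, 2, 9, 9, 9, 9],
    [5, 5, 5, 5, 5, 5],
    [1, 2, 3, 5, 0, 0],
]
result = subsequence_knn([1, 2, 3, 4], sequences, w=2, k=2)
print(result)  # [0, 3]
```

Each ei is the largest actual distance within some set of k sequences, so it is at least the distance of the true k-th neighbor; taking the minimum keeps the final range query as small as possible without losing any of the k answers.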
References
R. Agrawal, C. Faloutsos, A. Swami, “Efficient Similarity Search in Sequence Databases”, Proc. of FODO Conf., Oct. 1993
C. Faloutsos, M. Ranganathan, Y. Manolopoulos, “Fast Subsequence Matching in Time-Series Databases”, Proc. of ACM SIGMOD, May 1994
P. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, Z. Protopapas, “Fast and Effective Retrieval of Medical Tumor Shapes”, IEEE TKDE, Vol. 10, 1998
E.G.M. Petrakis, “Fast Retrieval by Spatial Structure in Image DataBases”, Journal of Visual Languages and Computing, 2002 (to appear)
N. Roussopoulos, S. Kelley, F. Vincent, “Nearest Neighbor Queries”, Proc. of ACM SIGMOD, May 1995
G. R. Hjaltason, H. Samet, “Distance Browsing in Spatial Databases”, ACM Trans. on Database Systems, 24(2):265–318, 1999