Efficient Similarity Queries via Lossy Compression
Idoia Ochoa, Amir Ingber and Tsachy Weissman
Electrical Engineering Department, Stanford University
Introduction and Problem Formulation
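Preliminaries
Given two sequences x and y, we measure their similarity with a distortion function d(x, y).
- Example: the length-10 sequences X and Y listed under Applications differ in 3 positions, so their Hamming distortion is 3/10.
Two sequences are D-similar if d(x, y) < D.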
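As a quick check of the definition, the following minimal sketch computes the normalized Hamming distortion of the poster's example pair X, Y and tests D-similarity. The function names `hamming_distortion` and `is_d_similar` are illustrative and do not come from the poster.

```python
def hamming_distortion(x: str, y: str) -> float:
    """Normalized Hamming distortion: fraction of positions where x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    if not x:
        raise ValueError("sequences must be non-empty")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def is_d_similar(x: str, y: str, D: float) -> bool:
    """D-similarity as defined on the poster: d(x, y) < D (strict inequality)."""
    return hamming_distortion(x, y) < D


# The example pair from the poster (spaces removed).
X = "ACGGTTACCG"
Y = "ACTGATAACG"

print(hamming_distortion(X, Y))  # 3 mismatches out of 10 positions
print(is_d_similar(X, Y, 0.4))   # similar at threshold 0.4
print(is_d_similar(X, Y, 0.3))   # not similar at 0.3, since the inequality is strict
```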
Constraints
For each sequence x in the database, store a signature T(x).
Given a query sequence y, find the sequences in the database that are D-similar to y, based only on their signature T(x).
Apply a decision rule that ensures no false negatives and minimizes false positives: g(T(x), y) = maybe for all x, y s.t. d(x, y) < D.
Problem Description
Compress sequences in a database so that similarity queries can still be performed efficiently on the compressed database.
We consider queries of the form: Which sequences in the database are similar to a given sequence y?
False negatives are not allowed.
Importance
The amount of data stored in databases is growing exponentially.
Executing queries is time-consuming and challenging.
Solutions
Due to the smaller size, the compressed database can be stored in several locations:
- Easier and faster access.
- More queries can be performed.
Applications
Databases consisting of genomic data:
- GenBank: almost 200 million DNA sequences.
- BIOZON: More than 100 million records.
Similarity queries are important in genomics.
For example, in molecular phylogenetics, relationships among species are established by the similarity between their respective DNA sequences.
X = A C G G T T A C C G
Y = A C T G A T A A C G
Theoretical Framework
For a given similarity threshold D, there is a tradeoff between compression rate and reliability.
Let X and Y be independent random vectors, drawn from $P_X$.
Definitions
1. A rate R is said to be D-achievable if there exists a sequence of rate-R admissible schemes $(T^{(n)}, g^{(n)})$ (admissible meaning no false negatives) s.t. $\lim_{n \to \infty} \Pr\{g^{(n)}(T^{(n)}(X), Y) = \text{maybe}\} = 0$.
2. The identification rate $R_{ID}(D)$ is the infimum of D-achievable rates.
3. The identification exponent $E_{ID}(R)$ is defined as $\limsup_{n \to \infty} -\frac{1}{n} \log \Pr\{g^{(n)}(T^{(n)}(X), Y) = \text{maybe}\}$.
Fundamental limits
For symmetric sources with Hamming distortion, $R_{ID}(D)$ and $E_{ID}(R)$ can be explicitly characterized.
Proposed Architecture
Compression Scheme: T(x) = (i, d(x, x´)), where i is the index produced by the encoder and x´ is its reconstruction
Based on fixed-length lossy compressors
- Encoding function $f_n: x \to [1:2^{nR}]$
- Decoding function $g_n: [1:2^{nR}] \to x'$
and side-information d(x, x´).
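A minimal sketch of how such a signature could be computed, under assumptions that are not from the poster: the lossy compressor is a random codebook of 2^(nR) uniform 4-ary sequences, the encoder does an exhaustive nearest-codeword search, and indices start at 0 rather than 1. The names `make_codebook` and `signature` are hypothetical.

```python
import random

ALPHABET = "ACGT"


def hamming_distortion(x, y):
    """Normalized Hamming distortion between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def make_codebook(n, rate, seed=0):
    """Assumed codebook: 2^(n*rate) i.i.d. uniform sequences of length n."""
    rng = random.Random(seed)
    size = max(1, round(2 ** (n * rate)))
    return ["".join(rng.choice(ALPHABET) for _ in range(n)) for _ in range(size)]


def signature(x, codebook):
    """T(x) = (i, d(x, x')), where x' = codebook[i] is the codeword closest to x."""
    i = min(range(len(codebook)), key=lambda j: hamming_distortion(x, codebook[j]))
    return i, hamming_distortion(x, codebook[i])


codebook = make_codebook(n=10, rate=0.5)  # 2^5 = 32 codewords
i, d = signature("ACGGTTACCG", codebook)
print(f"index i = {i}, side-information d(x, x') = {d}")
```

Exhaustive search is only feasible for toy block lengths; it is used here purely to make the roles of the index i and the side-information d(x, x´) concrete.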
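Decision Rule:
g(T(x), y) = maybe if d - D < d(x´, y) < d + D, where d = d(x, x´) is the stored side-information; otherwise g(T(x), y) = no.
For any distortion satisfying the triangle inequality, the above decision rule guarantees zero false negatives: if d(x, y) < D, then d(x´, y) ≤ d(x´, x) + d(x, y) < d + D and d(x´, y) ≥ d(x, x´) - d(x, y) > d - D (the second bound also uses the symmetry of d).
We want to minimize the probability of maybe.
Rate improvement:
We quantize the side-information d(x´, x) by using the k-means algorithm, and modify the decision rule accordingly.
Simulation Results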
Databases:
1. 1000 i.i.d. uniform 4-ary sequences of length 100.
2. 1000 DNA sequences of length 100 taken from BIOZON (empirical distribution: pA = 0.25, pC = 0.23, pG = 0.29, pT = 0.23).
Results:
We show the resulting P[maybe] for both databases and two approximations.
For D = 0.1, we get a probability of maybe of 0.001 with a reduction in size of 83.5%. For D = 0.2 and R = 0.47, the probability of maybe is 0.01.
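The quantity P[maybe] and the no-false-negatives guarantee can be illustrated with a toy simulation. This is a sketch under assumptions that are not from the poster: a random codebook with exhaustive nearest-codeword encoding, uniform 4-ary sequences, and small parameters (n = 12, R = 0.5, D = 0.2) chosen so the search stays feasible. It is not meant to reproduce the figures reported above, and it omits the k-means quantization of the side-information. Distances are kept as integer mismatch counts so that the comparisons are exact.

```python
import random

ALPHABET = "ACGT"


def mismatches(x, y):
    """Unnormalized Hamming distance (an integer, so comparisons stay exact)."""
    return sum(a != b for a, b in zip(x, y))


def random_seq(rng, n):
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def signature(x, codebook):
    """T(x) = (i, mismatches(x, x')), with x' = codebook[i] the nearest codeword."""
    i = min(range(len(codebook)), key=lambda j: mismatches(x, codebook[j]))
    return i, mismatches(x, codebook[i])


def decide(sig, y, codebook, threshold):
    """'maybe' iff d - D < d(x', y) < d + D, i.e. |d(x', y) - d| < D (in counts)."""
    i, d = sig
    return "maybe" if abs(mismatches(codebook[i], y) - d) < threshold else "no"


def simulate(n=12, rate=0.5, D=0.2, db_size=200, n_queries=50, seed=1):
    rng = random.Random(seed)
    threshold = D * n  # D-similar means mismatches(x, y) < D * n
    codebook = [random_seq(rng, n) for _ in range(round(2 ** (n * rate)))]
    database = [random_seq(rng, n) for _ in range(db_size)]
    sigs = [signature(x, codebook) for x in database]

    # Independent random queries: estimate P[maybe].
    maybes = 0
    for _ in range(n_queries):
        y = random_seq(rng, n)
        maybes += sum(decide(s, y, codebook, threshold) == "maybe" for s in sigs)

    # Planted queries: near-copies of database entries must never get "no".
    false_negatives = 0
    for x, s in zip(database, sigs):
        y = list(x)
        for pos in rng.sample(range(n), 2):
            y[pos] = rng.choice(ALPHABET)  # at most 2 mismatches introduced
        y = "".join(y)
        if mismatches(x, y) < threshold and decide(s, y, codebook, threshold) == "no":
            false_negatives += 1

    return maybes / (db_size * n_queries), false_negatives


p_maybe, false_negatives = simulate()
print(f"estimated P[maybe] = {p_maybe:.3f}, false negatives = {false_negatives}")
```

The false-negative count is zero by construction, because the rule follows from the triangle inequality. The estimated P[maybe] depends strongly on the block length, the rate and the codebook.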