Efficient Similarity Queries via Lossy Compression

Idoia Ochoa, Amir Ingber and Tsachy Weissman Electrical Engineering Department, Stanford University

Introduction

Problem Formulation

Preliminaries

Given two sequences x and y, we measure their similarity with a distortion function d(x, y).

- Example: the sequences X and Y listed below differ in 3 of their 10 positions, so their Hamming distortion is 3/10.

Two sequences are D-similar if d(x, y) < D
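As a minimal sketch of these two definitions, the snippet below computes the normalized Hamming distortion and the D-similarity test for the example pair X, Y from the poster. The function names `hamming_distortion` and `is_d_similar` are illustrative choices, not names used by the authors.

```python
def hamming_distortion(x, y):
    """Fraction of positions at which two equal-length, non-empty sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def is_d_similar(x, y, D):
    """x and y are D-similar if d(x, y) < D (strict inequality, as in the text)."""
    return hamming_distortion(x, y) < D


x = "ACGGTTACCG"
y = "ACTGATAACG"
print(hamming_distortion(x, y))   # 3 mismatches out of 10 positions
print(is_d_similar(x, y, D=0.4))  # the pair is D-similar for any D > 3/10
```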

Constraints

- For each sequence x in the database, store a signature T(x).

- Given a query sequence y, find the sequences in the database that are D-similar to y, based only on their signatures T(x).

- Apply a decision rule g(T(x), y), taking the values "no" or "maybe", that ensures no false negatives and minimizes false positives: g(T(x), y) = maybe for all x, y s.t. d(x, y) < D.

Problem Description

Compress sequences in a database so that similarity queries can still be performed efficiently on the compressed database.

We consider queries of the form: Which sequences in the database are similar to a given sequence y?

False negatives are not allowed.

Importance

The amount of data stored in databases is growing exponentially.

Executing queries is time-consuming and challenging.

Solutions

Due to the smaller size, the compressed database can be stored in several locations:

- Easier and faster access.

- More queries can be performed.

Applications

Databases consisting of genomic data:

- GenBank: almost 200 million DNA sequences.

- BIOZON: More than 100 million records.

Similarity queries are important in genomics.

For example, in molecular phylogenetics, relationships among species are established by the similarity between their respective DNA sequences.

X = A C G G T T A C C G

Y = A C T G A T A A C G

Theoretical Framework

For a given similarity threshold D, there is a tradeoff between compression rate and reliability.

Let X and Y be independent random vectors, drawn from $P_X$.

Definitions

1. A rate R is said to be D-achievable if there exists a sequence of rate-R admissible schemes $(T^{(n)}, g^{(n)})$ s.t. $\lim_{n \to \infty} \Pr\{g^{(n)}(T^{(n)}(X), Y) = \text{maybe}\} = 0$.

2. The identification rate $R_{ID}(D)$ is the infimum of D-achievable rates.

3. The identification exponent $E_{ID}(R)$ is defined as $\limsup_{n \to \infty} -\frac{1}{n} \log \Pr\{g^{(n)}(T^{(n)}(X), Y) = \text{maybe}\}$.

Fundamental limits

For symmetric sources with Hamming distortion, $R_{ID}(D)$ and $E_{ID}(R)$ can be explicitly characterized.

Proposed Architecture

Compression Scheme: T(x) = (i, d(x, x´))

Based on fixed-length lossy compressors, where i is the index assigned to x and x´ is its reconstruction:

- Encoding function $f_n : x \to [1 : 2^{nR}]$

- Decoding function $g_n : [1 : 2^{nR}] \to x'$

and side-information d(x, x´).
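The signature can be sketched as follows. The poster does not say which fixed-length lossy compressor is used, so this example substitutes a hypothetical random codebook with nearest-codeword encoding. The names `make_codebook` and `signature`, the toy parameters, and the 0-based index (the text uses [1 : 2^{nR}]) are all assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"


def hamming_distortion(x, y):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def make_codebook(n, rate, seed=0):
    """Hypothetical random codebook: about 2^(n*rate) codewords of length n."""
    rng = random.Random(seed)
    size = max(1, round(2 ** (n * rate)))
    return ["".join(rng.choice(ALPHABET) for _ in range(n)) for _ in range(size)]


def signature(x, codebook):
    """Return T(x) = (i, d(x, x')), where x' = codebook[i] is a nearest codeword."""
    i = min(range(len(codebook)), key=lambda j: hamming_distortion(x, codebook[j]))
    return i, hamming_distortion(x, codebook[i])


codebook = make_codebook(n=10, rate=0.5)  # 2^(10 * 0.5) = 32 codewords
i, d = signature("ACGGTTACCG", codebook)
print(i, d)
```

Only the pair (i, d) is stored per database entry; the codebook is shared, so the decoder can recover x´ from i alone.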

Simulation Results

Decision Rule:

g(T(x), y) = maybe if d - D < d(x´, y) < d + D, and no otherwise, where d = d(x, x´) is the stored side-information.

For any distortion satisfying the triangle inequality, the above decision rule guarantees zero false negatives.
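The guarantee follows in one step from the triangle inequality, assuming (as holds for Hamming distortion) that the distortion is also symmetric. Writing d = d(x, x´), for any pair with d(x, y) < D:

```latex
\begin{align*}
d(x', y) &\le d(x', x) + d(x, y) < d + D, \\
d(x', y) &\ge d(x, x') - d(x, y) > d - D.
\end{align*}
```

Every D-similar pair therefore lands in the interval on which the rule answers "maybe".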

We want to minimize the probability of maybe.
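A small numerical check of the decision rule is sketched below. It uses arbitrary reconstructions x´ in place of a real compressor, which is an assumption of this example: the zero-false-negative property depends only on the triangle inequality, not on how x´ is chosen. The helper names `decide`, `mutate`, and `count_false_negatives` are hypothetical.

```python
import random

ALPHABET = "ACGT"


def hamming_distortion(x, y):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def decide(d_x_xhat, xhat, y, D):
    """Return 'maybe' if d - D < d(x', y) < d + D, and 'no' otherwise."""
    d_xhat_y = hamming_distortion(xhat, y)
    return "maybe" if d_x_xhat - D < d_xhat_y < d_x_xhat + D else "no"


def mutate(seq, k, rng):
    """Copy seq and change exactly k randomly chosen positions."""
    out = list(seq)
    for pos in rng.sample(range(len(out)), k):
        out[pos] = rng.choice([s for s in ALPHABET if s != out[pos]])
    return out


def count_false_negatives(trials=2000, n=20, D=0.2, seed=1):
    """Count D-similar pairs (x, y) that the rule wrongly answers 'no' for."""
    rng = random.Random(seed)
    similar = misses = 0
    for _ in range(trials):
        x = [rng.choice(ALPHABET) for _ in range(n)]
        y = mutate(x, rng.randint(0, n // 2), rng)
        xhat = mutate(x, rng.randint(0, n // 2), rng)  # stand-in reconstruction
        if hamming_distortion(x, y) < D:
            similar += 1
            if decide(hamming_distortion(x, xhat), xhat, y, D) == "no":
                misses += 1
    return similar, misses


similar, misses = count_false_negatives()
print(similar, misses)
```

The count of missed similar pairs should be zero for any choice of x´; what a good compressor improves is how often dissimilar pairs are answered "no".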

Rate improvement:

We quantize the side-information d(x´, x) by using the k-means algorithm, and modify the decision rule accordingly.
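The poster does not spell out how the decision rule is modified. One conservative possibility, sketched here purely as an assumption, is to store only the cluster index of d(x´, x) together with each cluster's minimum and maximum, and to answer "maybe" whenever any value in that range would have. The plain one-dimensional Lloyd iteration and the names `kmeans_1d`, `quantize`, and `decide_quantized` are illustrative, not the authors' implementation.

```python
def nearest(v, centroids):
    """Index of the centroid closest to v (ties go to the lower index)."""
    return min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))


def kmeans_1d(values, k, iters=50):
    """Plain Lloyd iterations on scalars; assumes 1 <= k <= len(values)."""
    srt = sorted(values)
    # Initialize the centroids at evenly spaced order statistics.
    centroids = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(v, centroids)].append(v)
        new = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids


def quantize(values, centroids):
    """Return cluster indices plus each cluster's [min, max] (None if empty)."""
    idx = [nearest(v, centroids) for v in values]
    k = len(centroids)
    lo = [min((v for v, j in zip(values, idx) if j == c), default=None) for c in range(k)]
    hi = [max((v for v, j in zip(values, idx) if j == c), default=None) for c in range(k)]
    return idx, lo, hi


def decide_quantized(cell_lo, cell_hi, d_xhat_y, D):
    """'maybe' whenever some d in [cell_lo, cell_hi] satisfies d - D < d(x', y) < d + D."""
    return "maybe" if cell_lo - D < d_xhat_y < cell_hi + D else "no"


side_info = [0.10, 0.12, 0.11, 0.30, 0.32, 0.31, 0.50, 0.52]  # toy d(x', x) values
centroids = kmeans_1d(side_info, k=3)
idx, lo, hi = quantize(side_info, centroids)
print(idx, lo, hi)
```

Widening the interval to the whole cluster keeps the zero-false-negative guarantee at the cost of more "maybe" answers, in exchange for storing a short cluster index instead of the exact distortion.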

Databases:

1. 1000 i.i.d. uniform 4-ary sequences of length 100.

2. 1000 DNA sequences of length 100 taken from BIOZON (empirical distribution: pA = 0.25, pC = 0.23, pG = 0.29, pT = 0.23).

Results:

We show the resulting P[maybe] for both databases and two approximations.

For D = 0.1, we get a probability of maybe of 0.001 with a reduction in size of 83.5%. For D = 0.2 and R = 0.47, the probability of maybe is 0.01.

Efficient Similarity Queries via Lossy Compression · 2014. 3. 31. · Executing queries is timely...
