Near Neighbor Problem Made Fair
Sariel Har-Peled, UIUC
Sepideh Mahabadi, TTIC
Nearest Neighbor Problems
• Nearest Neighbor: given a set of objects, find the closest one to the query object.
• Near Neighbor: given a set of objects, find one that is close enough to the query object.
There are many applications of NN: searching for the closest object.
Near Neighbor
Dataset of n points P in a metric space, e.g. ℝ^d, and a parameter r
A query point q comes online
Goal:
• Find a point p* in the r-neighborhood of q
• Do it in sub-linear time and small space
All existing algorithms for this problem
• either have space or query time depending exponentially on the dimension d,
• or assume certain properties about the data, e.g., bounded intrinsic dimension.
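The problem itself is trivially solvable in linear time; the point of the data structures in this talk is to beat that baseline. A minimal sketch (the function name and the 1-D example are illustrative, not from the talk):

```python
def near_neighbor_linear_scan(P, q, r, dist):
    """O(n) baseline: return any point of P within distance r of q, else None.
    The data structures discussed in the talk aim to answer the same query in
    sub-linear time with small space."""
    for p in P:
        if dist(p, q) <= r:
            return p
    return None
```

For example, `near_neighbor_linear_scan([0, 5, 10], 4.9, 1, lambda a, b: abs(a - b))` returns 5.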
Approximate Near Neighbor
Dataset of n points P in a metric space, e.g. ℝ^d, and a parameter r
A query point q comes online
Goal:
• Find a point p* in the r-neighborhood
• Do it in sub-linear time and small space
• Approximate Near Neighbor:
  – Report a point within distance cr, for c > 1
  – For Hamming (and Manhattan) metrics the query time is n^O(1/c) [IM98], and for Euclidean it is n^O(1/c²) [AI08]
Fair Near Neighbor
Sample a neighbor of the query uniformly at random
Individual fairness: every neighbor has the same chance of being reported. This removes the bias inherent in the NN data structure.
Applications:
• Removing noise, k-NN classification
• Anonymizing the data
• Counting the neighborhood size
Fair Near Neighbor
Dataset of n points P in a metric space, e.g. ℝ^d, and a parameter r
A query point q comes online
Goal:
• Return each point p in the neighborhood of q with uniform probability
• Do it in sub-linear time and small space
Approximate Fair Near Neighbor
Dataset of n points P in a metric space, e.g. ℝ^d, and a parameter r
A query point q comes online
Goal of Approximate Fair NN:
• Any point p in B(q, r) is reported with "almost uniform" probability μ(p), i.e.,
  1 / ((1 + ε) |B(q, r)|) ≤ μ(p) ≤ (1 + ε) / |B(q, r)|
Results on (1 + ε)-Approximate Fair NN
S_ANN and Q_ANN are the space and query time of standard ANN
Approximate neighborhood: a set S such that B(q, r) ⊆ S ⊆ B(q, cr)

  Domain                                           Guarantee        Space      Query
  Exact neighborhood B(q, r)                       w.h.p.           O(S_ANN)   Õ(Q_ANN · |B(q, cr)| / |B(q, r)|)
  Approximate neighborhood B(q, r) ⊆ S ⊆ B(q, cr)  in expectation   O(S_ANN)   Õ(Q_ANN)

• The dependence on ε is O(log(1/ε))
• Experiments
• A recent paper [Aumüller, Pagh, Silvestri '19] defines the same notion
Locality Sensitive Hashing (LSH)
One of the main approaches to solve the Nearest Neighbor problems
Hashing scheme s.t. close points have higher probability of collision than far points
Hash functions: g_1, ..., g_L
• each g_i is an independently chosen hash function
• if ‖p − p′‖ ≤ r, then p and p′ collide w.p. ≥ P_near
• if ‖p − p′‖ ≥ cr, then p and p′ collide w.p. ≤ P_far, with P_near > P_far
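As a concrete illustration, here is a minimal bit-sampling LSH for Hamming distance (the classical family from [IM98]; the parameter choices and helper names are ours, for illustration only):

```python
import random

def make_hash(dim, k, seed):
    # one g_i: the concatenation of k randomly sampled bit positions
    rng = random.Random(seed)
    coords = rng.sample(range(dim), k)
    return lambda p: tuple(p[j] for j in coords)

def build_tables(points, dim, k, L):
    # L independent hash functions g_1, ..., g_L, one table per function;
    # points at small Hamming distance agree on the sampled bits (and hence
    # collide) with higher probability than far points
    fns = [make_hash(dim, k, seed=i) for i in range(L)]
    tables = [{} for _ in range(L)]
    for idx, p in enumerate(points):
        for g, table in zip(fns, tables):
            table.setdefault(g(p), []).append(idx)
    return fns, tables

def query_buckets(q, fns, tables):
    # the buckets B_1(q), ..., B_L(q) whose union roughly covers q's neighborhood
    return [table.get(g(q), []) for g, table in zip(fns, tables)]
```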
Locality Sensitive Hashing (LSH)
Retrieval: [Indyk, Motwani '98]
• The union of the query's buckets, ∪_i B_i(q), is roughly the neighborhood of q
• How to report a uniformly random neighbor from the union of these buckets?
• Collecting all the points might take O(n) time
Approaches
How to output a random neighbor from ∪_i B_i(q)?

Approach 1: Uniform/Uniform
1. Choose a uniformly random bucket
2. Choose a uniformly random point in the bucket

Approach 2: Weighted/Uniform
1. Choose a random bucket proportional to its size
2. Choose a random point in the bucket
Each point p in the neighborhood is picked w.p. proportional to its degree D(p), the number of buckets that p appears in.
Approach 3: Optimal
1. Choose a random bucket proportional to its size
2. Choose a random point p in the bucket (so p is picked w.p. proportional to its degree D(p))
3. Keep p with probability 1/D(p), otherwise repeat
• Uniform probability
• Need to spend O(L) time to find the degree
• Might need E[D(p)] = O(L) samples, so the total time is O(L²)
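The three steps above can be sketched as follows. This toy version computes exact degrees up front for clarity, which is precisely the O(L)-per-point cost the talk goes on to avoid; the buckets are assumed to be pre-filtered to true neighbors of q:

```python
import random

def sample_uniform_neighbor(buckets, rng):
    """Approach 3: size-weighted bucket choice, uniform point in the bucket,
    then rejection with probability 1 - 1/D(p) to cancel the degree bias."""
    sizes = [len(b) for b in buckets]
    if sum(sizes) == 0:
        return None
    # exact degrees, for illustration only (costs O(sum of bucket sizes))
    degree = {}
    for b in buckets:
        for p in b:
            degree[p] = degree.get(p, 0) + 1
    while True:
        b = rng.choices(buckets, weights=sizes)[0]  # step 1
        p = rng.choice(b)                           # step 2: p drawn w.p. D(p)/total
        if rng.random() < 1.0 / degree[p]:          # step 3: accept w.p. 1/D(p)
            return p                                # overall: uniform over neighbors
```

Each accepted sample is uniform over the distinct points in the union, no matter how many buckets a point appears in.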
Approximating the degree D(p)
Sample O(L / (D(p) · ε²)) buckets out of the L buckets to (1 + ε)-approximate the degree.
Still, if the degree is low, this takes O(L) samples.
Case 1: small degree D(p):
• More samples are required to estimate the degree
• Reject with lower probability → fewer queries of this type
Case 2: large degree D(p):
• Fewer samples are required to estimate the degree
• Reject with higher probability → more queries of this type
This decreases the O(L²) runtime to Õ(L), but with a large dependency on ε, of the form O(1/ε²).
Via a different sampling approach we show how to reduce the dependency to logarithmic, O(log(1/ε)).
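A sketch of the degree-estimation step (names are ours; this is the simple with-replacement estimator, not the refined sampler that achieves the O(log(1/ε)) dependence):

```python
import random

def estimate_degree(p, buckets, num_samples, rng):
    """Estimate D(p): sample buckets uniformly with replacement, count how
    often p is a member, and rescale by L / num_samples. Taking
    num_samples = Theta(L / (D(p) * eps^2)) concentrates the estimate
    within a (1 + eps) factor."""
    L = len(buckets)
    hits = sum(p in rng.choice(buckets) for _ in range(num_samples))
    return hits * L / num_samples
```

Representing buckets as sets makes each membership test O(1). Note the chicken-and-egg issue visible in the sample bound: the right number of samples depends on D(p) itself, which is what the small-degree/large-degree case analysis above addresses.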
Experiments
Setup:
• Take MNIST as the data set
• Ask a query several times and compute the empirical distribution of the neighbors
• Compute the statistical distance of the empirical distribution to the uniform distribution
Comparison:
• Our algorithm performs 2.5 times worse than the optimal algorithm, while the other two approaches perform 7 and 10 times worse than the optimal.
• It is four times faster than the optimal, but 15 times slower than the other two.
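The fairness metric in this setup can be computed directly; a minimal sketch of the total variation (statistical) distance between the empirical distribution of returned neighbors and the uniform one (the function name is ours):

```python
from collections import Counter

def tv_distance_to_uniform(samples, support_size):
    """Total variation distance between the empirical distribution of
    `samples` and the uniform distribution over `support_size` neighbors."""
    n = len(samples)
    u = 1.0 / support_size
    counts = Counter(samples)
    dist = sum(abs(c / n - u) for c in counts.values())
    dist += (support_size - len(counts)) * u  # neighbors never sampled
    return dist / 2.0
```

A perfectly fair sampler drives this toward 0 as the number of repeated queries grows; a maximally biased one approaches 1.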
Conclusion

  Domain                                           Guarantee        Space      Query
  Exact neighborhood B(q, r)                       w.h.p.           O(S_ANN)   Õ(Q_ANN · |B(q, cr)| / |B(q, r)|)
  Approximate neighborhood B(q, r) ⊆ S ⊆ B(q, cr)  in expectation   O(S_ANN)   Õ(Q_ANN)

• The dependence on the parameter ε matches the standard Near Neighbor.
• We get an independent near neighbor each time we draw a sample.
• More generally, the approach works for sampling from a sub-collection of sets.
Open problem:
• Finding the optimal dependency on the density parameter |B(q, cr)| / |B(q, r)|
Thanks! Questions?