MUFIN: Multi Feature Indexing Network 1
Multi Feature Indexing Network (MUFIN): Similarity Search Platform for Many Applications
Pavel Zezula
Faculty of Informatics
Masaryk University, Brno
23.1.2012
Outline of the talk
• Why similarity
• Principles of metric similarity searching
• The MUFIN approach
• Demo applications
• Future directions
Real-Life Motivation
The social psychology view
• Any event in the history of an organism is, in a sense, unique.
• Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity.
• Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc.
Contemporary Networked Media
The digital data view
• Almost everything that we see, read, hear, write, measure, or observe can be digital.
• Users autonomously contribute to production of global media and the growth is exponential.
• Sites like Flickr, YouTube, and Facebook host user-contributed content for a variety of events.
• The elements of networked media are related by numerous multifaceted links of similarity.
Examples with Similarity
• Does the computer disk of a suspected criminal contain illegal multimedia material?
• What are the stocks with similar price histories?
• Which companies advertise their logos in the live TV broadcast of a football match?
• Is the situation on the web getting close to any of the network attacks that caused significant damage in the past?
Challenge
• Networked media is getting close to the human “fact-bases”
  – the gap between physical and digital has blurred
• Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections.
WHY? It is the similarity which is in the world revealing.
Limitations: Data Types
We have
• Attributes
  – Numbers, strings, etc.
• Text (text-based)
  – Documents, annotations
We need
• Multimedia
  – Image, video, audio
• Security
  – Biometrics
• Medicine
  – EKG, EEG, EMG, EMR, CT, etc.
• Scientific data
  – Biology, chemistry, physics, life sciences, economics
• Others
  – Motion, emotion, events, etc.
Limitations: Models of Similarity
We have
• Simple geometric models, typically vector spaces
We need
• More complex models
• Non-metric models
• Asymmetric similarity
• Subjective similarity
• Context-aware similarity
• Complex similarity
• Etc.
Limitations: Queries
We have
• Simple queries
  – Nearest neighbor
  – Range
We need
• More query types
  – Reverse NN, distinct NN, similarity join
• Other similarity-based operations
  – Filtering, classification, event detection, clustering, etc.
• Similarity algebra
  – May become the basis of a “Similarity Data Management System”
Limitations: Implementation Strategies
We have
• Centralized or parallel processing
We need
• Scalable and distributed architectures
• MapReduce-like approaches
• P2P architectures
• Cloud computing
• Self-organized architectures
• Etc.
Search Strategy Evolution
Scalability
● data volume: exponential
● number of users (queries)
● variety of data types
● multi-lingual, multi-feature, multi-modal queries
Determinism
► exact match → similarity
► precise → approximate
► same answer → good answer; recommendation
► fixed query → personalized; context aware
► fixed infrastructure → dynamic mapping; mobile devices
[Figure: grade axis from high (well established) to low (cutting-edge research) over the architectures centralized, parallel, distributed, peer-to-peer, self-organized]
Similarity Data Management System
[Figure labels: similarity, effectiveness, efficiency, stimuli, extraction, matching, evaluation, execution, algebra]
Metric Search Grows in Popularity
• Hanan Samet: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006
• P. Zezula, G. Amato, V. Dohnal, and M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006
The MUFIN Approach
MUFIN: MUlti-Feature Indexing Network
[Figure: MUFIN search engine with three axes: data & queries, index structure, infrastructure]
• Extensibility: metric space
• Scalability: P2P structure
• Independence: infrastructure as a service
Extensibility: Metric Abstraction of Similarity
• Metric space: M = (D,d)
  – D: domain
  – d(x,y): distance function
• For all x, y, z ∈ D:
  – d(x,y) ≥ 0 (non-negativity)
  – d(x,y) = 0 ⇔ x = y (identity)
  – d(x,y) = d(y,x) (symmetry)
  – d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
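The four postulates can be checked empirically on a finite sample of objects. The sketch below is only an illustration: the function names and the sample points are assumptions, and passing the check on a sample does not prove that a function is a metric on the whole domain.

```python
from itertools import product


def check_metric_postulates(objects, d, tol=1e-9):
    """Return the names of the postulates that d violates on the sample."""
    violated = set()
    for x, y in product(objects, repeat=2):
        if d(x, y) < -tol:
            violated.add("non-negativity")
        if (d(x, y) <= tol) != (x == y):
            violated.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(objects, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violated.add("triangle inequality")
    return violated


def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def squared_euclidean(x, y):
    # Not a metric: it breaks the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(check_metric_postulates(points, euclidean))
print(check_metric_postulates(points, squared_euclidean))
```

The squared Euclidean distance is included because it is a common example of a dissimilarity that is symmetric and non-negative but still fails the triangle inequality.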
Examples of Distance Functions
• Lp Minkowski distance (for vectors)
  – L1, city-block distance: $L_1(x,y) = \sum_{i=1}^{n} |x_i - y_i|$
  – L2, Euclidean distance: $L_2(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
  – L∞, infinity (maximum) distance: $L_\infty(x,y) = \max_{i=1}^{n} |x_i - y_i|$
• Edit distance (for strings)
  – minimal number of insertions, deletions, and substitutions
  – d(‘application’, ‘applet’) = 6
• Jaccard’s coefficient (for sets A, B): $d(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
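A minimal Python sketch of these three distance functions follows. The function names are illustrative assumptions; the edit distance uses the standard dynamic-programming recurrence with unit costs.

```python
def minkowski(x, y, p):
    """L_p distance between two equally long vectors; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def edit_distance(s, t):
    """Minimal number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


print(edit_distance("application", "applet"))
```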
Examples of Distance Functions
• Mahalanobis distance
  – for vectors with correlated dimensions
• Hausdorff distance
  – for sets with elements related by another distance
• Earth mover's distance
  – primarily for histograms (sets of weighted features)
• and many others
Similarity Search Problem
• For X ⊆ D in metric space M, pre-process X so that similarity queries are executed efficiently.
No total ordering exists!
Similarity Queries
• Range query
• Nearest neighbor query
• Similarity join
• Combined queries
• Complex queries
Similarity Range Query
• range query
  – R(q,r) = { x ∈ X | d(q,x) ≤ r }
… all museums up to 2km from my hotel …
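As a baseline, the definition of R(q,r) translates directly into a sequential scan. This is a sketch of the query semantics only, not of how an index evaluates it; the function name is an assumption.

```python
def range_query(X, d, q, r):
    """Return every object of X within distance r of the query object q."""
    return [x for x in X if d(q, x) <= r]


# Museums as positions (in km) along a street, hotel at position 0.
museums = [0.5, 1.9, 2.0, 3.1]
print(range_query(museums, lambda a, b: abs(a - b), 0.0, 2.0))
```

The scan computes d once per stored object, which is exactly the cost that metric index structures try to avoid.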
Nearest Neighbor Query
• the nearest neighbor query
  – NN(q) = x
  – x ∈ X, ∀y ∈ X: d(q,x) ≤ d(q,y)
• k-nearest neighbor query
  – k-NN(q,k) = A
  – A ⊆ X, |A| = k
  – ∀x ∈ A, ∀y ∈ X − A: d(q,x) ≤ d(q,y)
… five closest museums to my hotel …
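The k-NN definition can likewise be sketched as a sequential scan that keeps the k objects with the smallest distances. The function name is an assumption; ties are broken by input order.

```python
import heapq


def knn_query(X, d, q, k):
    """Return the k objects of X closest to q (fewer if X has fewer than k objects)."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))


positions = [5, 1, 9, 3, 7]
print(knn_query(positions, lambda a, b: abs(a - b), 4, 2))
```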
Similarity Join Queries
• similarity join of two data sets: $J(X, Y, \mu) = \{(x, y) \in X \times Y : d(x, y) \le \mu\}$, where $X \subseteq D$, $Y \subseteq D$, $\mu \ge 0$
• similarity self join: X = Y
… pairs of hotels and museums which are five minutes' walk apart …
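A nested-loop sketch of the similarity join, assuming a threshold named mu and plain Python lists for X and Y (both names are illustrative):

```python
def similarity_join(X, Y, d, mu):
    """Return all pairs (x, y) from X × Y whose distance is at most mu."""
    return [(x, y) for x in X for y in Y if d(x, y) <= mu]


hotels = [0, 10]
museums = [1, 4, 12]
print(similarity_join(hotels, museums, lambda a, b: abs(a - b), 2))
```

For a self join the same function can be called with X passed twice, in which case every object is also paired with itself.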
Combined Queries
• Range + Nearest neighbors: $kNN(q, r) = \{R \subseteq X, |R| \le k \wedge \forall x \in R, y \in X - R : d(q,x) \le d(q,y) \wedge d(q,x) \le r\}$
• Nearest neighbor + similarity joins
  – by analogy
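One way to read the combined query is "the k nearest neighbors, but only among objects within radius r". A sequential-scan sketch under that reading (the function name is an assumption):

```python
import heapq


def knn_range_query(X, d, q, k, r):
    """Return at most k nearest neighbors of q, none farther than r."""
    within = [x for x in X if d(q, x) <= r]
    return heapq.nsmallest(k, within, key=lambda x: d(q, x))


print(knn_range_query([1, 2, 3, 10], lambda a, b: abs(a - b), 0, 3, 2.5))
```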
Complex Queries
• Find the best matches of circular shape objects with red color
• The best match for circular shape or for red color need not be the best match for the combination
• A0 algorithm
• Threshold algorithm
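The threshold algorithm can be sketched as follows. The sketch assumes that every feature provides a score in [0, 1] for every object (higher is better), that the aggregation function is monotone (min by default), and that random access to scores is available through dictionaries. All names and the sample scores are illustrative assumptions.

```python
import heapq


def threshold_algorithm(score_tables, k, aggregate=min):
    """Top-k objects by aggregated score over several per-feature score tables."""
    # One list per feature, sorted by descending score (sorted access).
    sorted_lists = [sorted(t.items(), key=lambda kv: kv[1], reverse=True)
                    for t in score_tables]
    best = {}  # object -> aggregated score
    for row in zip(*sorted_lists):  # one step of sorted access in every list
        for obj, _ in row:
            if obj not in best:     # random access to all features of obj
                best[obj] = aggregate(t[obj] for t in score_tables)
        # No unseen object can score higher than the aggregated last-seen scores.
        threshold = aggregate(score for _, score in row)
        top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])


shape = {"a": 0.9, "b": 0.8, "c": 0.3}   # similarity to "circular"
color = {"a": 0.2, "b": 0.7, "c": 0.95}  # similarity to "red"
print(threshold_algorithm([shape, color], k=1))
```

In this example object a is the best match for shape and c is the best match for color, yet b wins the combined query, and the algorithm stops after two rounds of sorted access.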
Partitioning Principles
• Given a set X ⊆ D in M=(D,d), basic partitioning principles have been defined:
  – Ball partitioning
  – Generalized hyper-plane partitioning
  – Excluded middle partitioning
  – Clustering
Ball Partitioning
• Inner set: { x ∈ X | d(p,x) ≤ dm }
• Outer set: { x ∈ X | d(p,x) > dm }
[Figure: ball with pivot p and radius dm]
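Ball partitioning can be sketched in a few lines, assuming the pivot p and the radius dm are already chosen (how to choose them is not covered here; the function name is an assumption):

```python
def ball_partition(X, d, p, dm):
    """Split X into objects inside and outside the ball centered at pivot p."""
    inner = [x for x in X if d(p, x) <= dm]
    outer = [x for x in X if d(p, x) > dm]
    return inner, outer


print(ball_partition([1, 2, 3, 4, 5], lambda a, b: abs(a - b), 3, 1))
```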
Generalized Hyper-plane
• { x ∈ X | d(p1,x) ≤ d(p2,x) }
• { x ∈ X | d(p1,x) > d(p2,x) }
[Figure: two pivots p1 and p2 separated by the generalized hyper-plane]
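A sketch of generalized hyper-plane partitioning with two given pivots (the function name is an assumption):

```python
def hyperplane_partition(X, d, p1, p2):
    """Assign each object to the closer of the two pivots; ties go to p1."""
    closer_to_p1 = [x for x in X if d(p1, x) <= d(p2, x)]
    closer_to_p2 = [x for x in X if d(p1, x) > d(p2, x)]
    return closer_to_p1, closer_to_p2


print(hyperplane_partition([0, 1, 2, 3, 4], lambda a, b: abs(a - b), 1, 3))
```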
Excluded Middle Partitioning
• Inner set: { x ∈ X | d(p,x) ≤ dm − ρ }
• Outer set: { x ∈ X | d(p,x) > dm + ρ }
• Excluded set: otherwise
[Figure: ball with pivot p and radius dm, surrounded by an excluded ring of width 2ρ]
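A sketch of excluded middle partitioning, with the ring half-width written as rho (the function name is an assumption):

```python
def excluded_middle_partition(X, d, p, dm, rho):
    """Split X into inner, outer, and excluded sets around the ball (p, dm)."""
    inner, outer, excluded = [], [], []
    for x in X:
        dist = d(p, x)
        if dist <= dm - rho:
            inner.append(x)
        elif dist > dm + rho:
            outer.append(x)
        else:
            excluded.append(x)
    return inner, outer, excluded


print(excluded_middle_partition(range(7), lambda a, b: abs(a - b), 0, 3, 1))
```

By the triangle inequality, any inner object and any outer object are more than 2ρ apart, so a range query with radius r ≤ ρ can contain objects from at most one of the two sets (plus the excluded set).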
Clustering
• Cluster data into sets
  – bounded by a ball region
  – { x ∈ X | d(p_i,x) ≤ r_i^c }
Scalability: Peer-to-Peer Indexing
• Local search: M-tree, D-Index, M-Index
• Native metric techniques: GHT*, VPT*
• Transformation techniques: M-CAN, M-Chord
The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]
1) Paged organization
2) Dynamic
3) Suitable for arbitrary metric spaces
4) I/O and CPU optimization: computing d can be time-consuming
The M-tree Idea
• Depending on the metric, the “shape” of index regions changes
[Figure: regions A to F of a two-level tree drawn under different metrics]
Metric: L2 (Euclidean), L1 (city-block), L∞ (max-metric), weighted-Euclidean, quadratic form
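M-tree: Example
[Figure: M-tree over objects o1 to o11]
• Root entries (object, covering radius, distance to parent):
  – o1 4.5 -.-
  – o2 6.9 -.-
• Inner entries (object, covering radius, distance to parent):
  – under o1: o1 1.4 0.0; o10 1.2 3.3; o7 1.3 3.8
  – under o2: o2 2.9 0.0; o4 1.6 5.3
• Leaf entries (object, distance to parent):
  – under o1: o1 0.0; o6 1.4
  – under o10: o10 0.0; o3 1.2
  – under o7: o7 0.0; o5 1.3; o11 1.0
  – under o2: o2 0.0; o8 2.9
  – under o4: o4 0.0; o9 1.6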
M-tree family
• Bulk loading
• Slim-tree
• Multi-way insertion
• PM-tree
• M2-tree
• etc.
D-Index [Dohnal, Gennaro, Zezula, MTA 2002]
4 separable buckets at the first level
2 separable buckets at the second level
exclusion bucket of the whole structure
D-index: Insertion
D-index: Range Search
Implementation Postulates of Distributed Indexes
• dynamism – nodes can be added and removed
• no hot-spots – no centralized nodes, no flooding by messages (transactions)
• update independence – network update at one site does not require an immediate change propagation to all the other sites
Distributed Similarity Search Structures
• Native metric structures:
  – GHT* (Generalized Hyperplane Tree)
  – VPT* (Vantage Point Tree)
• Transformation approaches:
  – M-CAN (Metric Content Addressable Network)
  – M-Chord (Metric Chord)
GHT* Address Search Tree
• Based on the Generalized Hyperplane Tree [Uhl91]
  – two pivots for binary partitioning
[Figure: space partitioned by pivot pairs (p1, p2), (p3, p4), (p5, p6) and the corresponding binary tree]
GHT* Address Search Tree
• Inner node
  – two pivots (reference objects)
• Leaf node
  – BID pointer to a bucket if data is stored on the current peer
  – NNID pointer to a peer if data is stored on a different peer
[Figure: AST with inner nodes (p1, p2), (p3, p4), (p5, p6) and leaves BID1, BID2, BID3, NNID2 pointing to Peer 2]
GHT* Address Search Tree
[Figure: address search trees at Peer 1, Peer 2, and Peer 3]
GHT* Range Query
• Range query R(q,r)
  – traverse peer's own AST
  – search buckets for all BIDs found
  – forward query to all NNIDs found
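The AST traversal for a range query can be sketched as below. The class names, the leaf encoding, and the pruning conditions (the usual generalized hyper-plane test derived from the triangle inequality) are assumptions made for illustration, not the MUFIN implementation; searching the buckets and forwarding messages are left out.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    kind: str   # "BID" (bucket on this peer) or "NNID" (another peer)
    ident: int


@dataclass
class Inner:
    p1: object
    p2: object
    left: object    # subtree for objects with d(p1, x) <= d(p2, x)
    right: object   # subtree for objects with d(p1, x) > d(p2, x)


def traverse_ast(node, d, q, r):
    """Return (BIDs to search locally, NNIDs to forward the query to)."""
    if isinstance(node, Leaf):
        return ({node.ident}, set()) if node.kind == "BID" else (set(), {node.ident})
    bids, nnids = set(), set()
    d1, d2 = d(q, node.p1), d(q, node.p2)
    if d1 - r <= d2 + r:   # the query ball may reach the left partition
        b, n = traverse_ast(node.left, d, q, r)
        bids |= b
        nnids |= n
    if d1 + r >= d2 - r:   # the query ball may reach the right partition
        b, n = traverse_ast(node.right, d, q, r)
        bids |= b
        nnids |= n
    return bids, nnids


ast = Inner(p1=0, p2=10, left=Leaf("BID", 1), right=Leaf("NNID", 2))
print(traverse_ast(ast, lambda a, b: abs(a - b), 1, 1))
print(traverse_ast(ast, lambda a, b: abs(a - b), 5, 1))
```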
AST: Logarithmic replication
• Full AST on every peer is space consuming
  – replication of pivots grows in a linear way
• Store only a part of the AST:
  – all paths to local buckets
• Deleted sub-trees:
  – replaced by NNID of the leftmost peer
[Figure: full AST with pivots p1 to p14 and leaves BID1, NNID2 to NNID8, next to the reduced AST with pivots (p1, p2), (p3, p4), (p7, p8) and leaves BID1, NNID3, NNID5]
AST: Logarithmic Replication (cont.)
• Resulting tree
  – replication of pivots grows in a logarithmic way
[Figure: reduced AST with pivots (p1, p2), (p3, p4), (p7, p8) and leaves BID1, NNID2, NNID3, NNID5]
VPT* Structure
• Similar to the GHT*, but ball partitioning is used for the AST
• Based on the Vantage Point Tree [Yia93]
  – inner nodes have one pivot and a radius
  – different traversing conditions
[Figure: balls (p1, r1), (p2, r2), (p3, r3) and the corresponding tree]
M-Chord: The Metric Chord
• Transform metric space to one-dimensional domain
  – Use M-Index, a generalized version of the iDistance
• Divide the domain into intervals
  – assign each interval to a peer
• Use the Chord P2P protocol for navigation
• Alternatively, the Skip graphs distributed protocol can be used
M-Chord: Indexing the Distance
• iDistance: indexing technique for vector domains
  – cluster analysis = centers = reference points $p_i$
  – assign iDistance keys to objects $x \in C_i$: $iDist(x) = d(p_i, x) + i \cdot c$
  – range query R(q,r): identify intervals of interest
• Generalization to metric spaces
  – select pivots $\{p_0, \ldots, p_n\}$, then partition: Voronoi-style
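The key assignment and the "intervals of interest" can be sketched as follows. The sketch assumes that each object belongs to the cluster of its closest pivot and that the constant c exceeds every distance in the data set, so that keys of different clusters do not overlap. The function names are assumptions, and real iDistance implementations prune clusters more aggressively.

```python
def idistance_key(x, pivots, d, c):
    """Key of x: distance to its closest pivot p_i, shifted by i * c."""
    i = min(range(len(pivots)), key=lambda j: d(pivots[j], x))
    return d(pivots[i], x) + i * c


def query_intervals(q, r, pivots, d, c):
    """Key intervals that can contain objects of the range query R(q, r)."""
    intervals = []
    for i, p in enumerate(pivots):
        dq = d(p, q)
        # |d(p_i, x) - d(p_i, q)| <= d(q, x) <= r by the triangle inequality.
        low, high = max(dq - r, 0.0), min(dq + r, c)
        if low <= high:
            intervals.append((i * c + low, i * c + high))
    return intervals


dist = lambda a, b: abs(a - b)
pivots = [0.0, 10.0]
print(idistance_key(9.0, pivots, dist, 100.0))
print(query_intervals(8.0, 2.0, pivots, dist, 100.0))
```

Every object within distance r of q has a key inside one of the returned intervals, so only those key ranges need to be searched.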
M-Chord: Chord Protocol
• Peer-to-peer navigation protocol
• Peers are responsible for intervals of keys
• $\log(n)$ hops to localize a node storing a key
• M-Chord takes the iDistance domain and makes it uniform with a function h: $mchord(x) = h(d(p_i, x) + i \cdot c)$
• Use Chord on this domain
M-Chord: Range Query
• Node Nq initiates the search
• Determine intervals
  – generalized iDistance
• Forward requests to peers on intervals
• Search in the nodes
  – using local organization
• Merge the received partial answers
M-CAN: The Metric CAN
• Based on the Content-Addressable Network (CAN)
  – a DHT navigating in an N-dimensional vector space
• The idea:
  1. Map the metric space to a vector space
    – given N pivots p1, p2, …, pN, transform every object o into the vector F(o)
  2. Use CAN to
    – distribute the vector space zones among the nodes
    – navigate in the network
CAN: Principles & Navigation
• CAN: the principles
  – the space is divided into zones
  – each node “owns” a zone
  – nodes know their neighbors
• CAN: the navigation
  – greedy routing
  – in every step, move to the neighbor closer to the target location
[Figure: 2-dimensional vector space divided into zones 1 to 6, with a target point (x,y)]
M-CAN: Contractiveness & Filtering
• Use L∞ as the distance measure
  – the mapping F is contractive: $L_\infty(F(x), F(y)) \le d(x, y)$
• More pivots ⇒ better filtering
  – but CAN routing works better with fewer dimensions
• Additional filtering
  – some pivots are used only for filtering data (inside the explored nodes)
  – they are not used for mapping into the CAN vector space
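The pivot mapping and the filtering it enables can be sketched as below, assuming F(o) = (d(p1,o), …, d(pN,o)). The function names and the sample data are illustrative assumptions; CAN routing itself is not modeled.

```python
import math


def pivot_map(o, pivots, d):
    """F(o): vector of distances from o to each pivot."""
    return tuple(d(p, o) for p in pivots)


def l_inf(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def filtered_range_query(X, d, q, r, pivots):
    """Range query that discards objects using the contractive mapping first."""
    fq = pivot_map(q, pivots, d)
    # Contractiveness: L_inf(F(x), F(q)) > r implies d(x, q) > r.
    candidates = [x for x in X if l_inf(pivot_map(x, pivots, d), fq) <= r]
    # Candidates still have to be verified with the original distance.
    return [x for x in candidates if d(q, x) <= r]


X = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 0.0)]
pivots = [(0.0, 0.0), (10.0, 0.0)]
print(filtered_range_query(X, math.dist, (1.0, 0.0), 1.5, pivots))
```

In practice the vectors F(x) would be computed once at insertion time, so the filter step costs no extra evaluations of d on the stored objects.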
Infrastructure Independence: MESSIF
Metric Similarity Search Implementation Framework
• Metric space (D,d)
  – Vectors: Lp and quadratic form
  – Strings: (weighted) edit and protein sequence
• Operations
  – Insert, delete, range query, k-NN query, incremental k-NN
• Storage
  – Volatile memory
  – Persistent memory
• Centralized index structures
• Distributed index structures
• Communication
  – Net
• Performance statistics
Applications: a Word Cloud
Concepts of the Image Search
[Figure: a query image compared with the image base: similar?]
Images and their Descriptors
[Figure: image level (R, G, B channels) and descriptor level]
CoPhIR: Content-based Photo Image Retrieval
• Largest publicly available collection of high-quality image metadata: 106 million images
• Each image contains:
  – Five MPEG-7 VDs: Scalable Color, Color Structure, Color Layout, Edge Histogram, Homogeneous Texture
  – Other textual information: title, tags, comments, etc.
• Photos have been crawled from the Flickr photo-sharing site.
http://cophir.isti.cnr.it/
100M images + metadata + MPEG-7 VDs
Image Search Demo
http://mufin.fi.muni.cz/imgsearch/
[Figure: MUFIN search engine with axes data & queries, index structure, infrastructure]
• Scalability: M-Chord + M-Index
• Extensibility: CoPhIR
  – edge histogram, color structure, scalable color, homogeneous texture, color layout
• Infrastructure: 6 x IBM server x3400, 2 servers used
MUFIN demos
• http://mufin.fi.muni.cz/imgsearch/similar
• http://www.pixmac.com/
• http://mufin.fi.muni.cz/twenga/random
• http://mufin.fi.muni.cz/fingerprints/random
• http://mufin.fi.muni.cz/subseq/random
• http://mufin.fi.muni.cz/plugins/annotation
MUFIN Future Research Directions
• MUFIN - a universal similarity search technology
• Research directions in:
  – Core technology
  – Applications
  – A style of computing
[Figure: MUFIN search engine with axes data & queries, index structure, infrastructure]
• Scalability: P2P structures
• Extensibility: metric space
• Performance tuning
MUFIN Future Research Directions
October 28, 2011
[Figure: MUFIN search engine with axes data & queries, index structure, infrastructure, annotated with the directions below]
• More scalable, reliable, robust
  – Multi-layer architectures
  – Self-organizing architectures
• New query types
  – Flexible sub-sequence matching
  – Efficient multi-feature processing
• New style of computing
  – Cloud computing
  – Similarity search as a service
Major Applications
– Images:
  • Sub-image retrieval
  • Ranking
  • Annotation
  • Categorization
  • Benchmarking
– Biometrics:
  • Face recognition
  • Fingerprint recognition
  • Gait recognition
– Signals:
  • Audio recognition
  • Time series similarity
– Videos:
  • Event detection
A New Style of Computing
• From the project-oriented approach towards a similarity cloud for multimedia findability through similarity searching
Advantages:
– Cloud makes similarity search accessible to common users
– Computational resources are shared – users don’t need to maintain any hardware infrastructure
– Users don’t need to care for the OS, security, software platform, etc.