SEMWEB 1
MUFIN Basics
MUFIN team
Faculty of Informatics, Masaryk University
Brno, Czech Republic
SEMWEB 3
The thesis (intellectual proposition)
Search systems are becoming more and more complex.
Future search systems will be born on the divergence of:
scale and determinism
SEMWEB 4
Trends in Scalability of Search
• data volume - exponential growth
• number of users - increasing fast
• variety of data types - digital databases
• multi-queries - lingual, feature, modal
SEMWEB 5
Trends in Determinism of Search
• Exact match ► Similarity
• Precise answer ► Approximate answer
• Unvaried answer ► Satisfactory answer (advice, recommendation)
• Fixed query ► Personalized, context aware, proximate
• Dedicated hardware ► Dynamic mapping, mobile devices, infrastructure services
SEMWEB 6
Search systems
Scalability
● data volume – exponential growth
● number of users (queries) – increasing
● variety of data types – digitization
● multi-lingual (feature, modal) queries
Determinism
● exact match ► similarity
● precise ► approximate
● unvaried answer ► good answer; advice
● fixed query ► personalized; context aware
● fixed infrastructure ► dynamic mapping; mobile
[Figure: grade (low to high) of search-system architectures over time: centralized, parallel, distributed, peer-to-peer, self-organized; ranging from well established to cutting-edge research]
SEMWEB 7
The MUFIN Approach
[Figure: the MUFIN approach to SEARCH, spanning data & queries, index structure, and infrastructure]
Extensibility: metric space
  Minkowski distance
  Edit distance
  Jaccard’s coef.
  Mahalanobis distance
  Hausdorff distance
  etc.
Scalability: P2P structure
Cloud computing: infrastructure as a service
MUFIN: MUlti-Feature Indexing Network
SEMWEB 8
EXTENSIBILITY
Metric Space: Abstraction of Similarity
Metric space: M = (D,d)
D – domain
distance function d(x,y)
∀x,y,z ∈ D:
d(x,y) ≥ 0 - non-negativity
d(x,y) = 0 ⇔ x = y - identity
d(x,y) = d(y,x) - symmetry
d(x,y) ≤ d(x,z) + d(z,y) - triangle inequality
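The four postulates can be spot-checked on a finite sample. The sketch below is illustrative only: `check_metric` and the two example distance functions are hypothetical names, not part of MUFIN, and passing the check on a sample does not prove that a function is a metric on the whole domain.

```python
from itertools import product
import math

def check_metric(points, d, tol=1e-9):
    """Return the sorted names of the postulates that d violates on the sample."""
    violated = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:
            violated.add("non-negativity")
        if (abs(d(x, y)) <= tol) != (x == y):
            violated.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violated.add("triangle inequality")
    return sorted(violated)

def euclidean(x, y):
    return math.dist(x, y)

def squared_euclidean(x, y):
    # Not a metric: squaring breaks the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))

sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(check_metric(sample, euclidean))
print(check_metric(sample, squared_euclidean))
```

On this sample the Euclidean distance passes all four checks, while the squared Euclidean distance fails the triangle inequality, so it could not be plugged into a metric index without losing correctness guarantees.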
SEMWEB 9
Why Can the Metric Approach be Useful
Many application areas:
  biology, security
  audio-visual, geo. search
  software copy detection
  data cleaning, integration, etc.
Query by example paradigm
  one query image contains a lot of information
  one image is worth 1000 words
  advantage for mobile devices – minimal clicking
SEMWEB 10
Metric Search Grows in Popularity
Hanan Samet: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006
P. Zezula, G. Amato, V. Dohnal, and M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006
SEMWEB 11
Examples of Distance Functions
Lp Minkowski distance of order p
  L1 – city-block distance: $L_1(x,y) = \sum_{i=1}^{n} |x_i - y_i|$
  L2 – Euclidean distance: $L_2(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
  L∞ – infinity: $L_\infty(x,y) = \max_{i=1}^{n} |x_i - y_i|$
edit distance (for strings)
  minimal number of insertions, deletions and substitutions
  d(‘application’, ‘applet’) = 6
Jaccard’s coefficient (for sets A,B): $d(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
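The distance functions on this slide can be written in a few lines each. This is a minimal sketch with hypothetical function names, using the standard unit-cost (Levenshtein) formulation of edit distance; it is meant to make the definitions concrete, not to reproduce any MUFIN code.

```python
import math

def minkowski(x, y, p):
    """L_p distance; p = math.inf gives the maximum (L-infinity) distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(v ** p for v in diffs) ** (1.0 / p)

def edit_distance(s, t):
    """Minimal number of unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))           # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; two empty sets are at distance 0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(minkowski((0, 0), (3, 4), 1))
print(minkowski((0, 0), (3, 4), 2))
print(minkowski((0, 0), (3, 4), math.inf))
print(edit_distance("application", "applet"))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```

The edit distance call reproduces the slide's example value for ‘application’ and ‘applet’.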
SEMWEB 12
Examples of Distance Functions
Mahalanobis distance
  for vectors with correlated dimensions
Hausdorff distance
  for sets with elements related by another distance
Earth mover's distance
  primarily for histograms (sets of weighted features)
and many others
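To illustrate "sets with elements related by another distance", here is a small sketch of the Hausdorff distance between two finite point sets. The function name and the choice of Euclidean ground distance are assumptions made for the example; any metric on the elements could be passed instead.

```python
import math

def hausdorff(A, B, ground=math.dist):
    """Hausdorff distance between finite, non-empty sets A and B."""
    def directed(X, Y):
        # Farthest that any element of X is from its nearest element of Y.
        return max(min(ground(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0)]
print(hausdorff(A, B))
```

The two directed distances generally differ (here the point (4, 0) of B is far from every point of A), which is why the maximum of both directions is taken to obtain a symmetric function.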
SEMWEB 13
Image MUFIN overlay
A demo on the CoPhIR 50M dataset (280-dimensional vectors)
Five combined MPEG7 global descriptors:
  Color Structure, max. dist.: 40, weight: 3
  Color Layout, max. dist.: 300, weight: 2
  Scalable Color, max. dist.: 3000, weight: 2
  Edge Histogram, max. dist.: 68, weight: 4
  Homogeneous Texture, max. dist.: 25, weight: 0.5
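One plausible way to combine the five descriptors is a weighted sum of per-descriptor distances, each normalized by its maximum distance. The slide lists only the maxima and weights, so the aggregation formula below is an assumption, and the dictionary keys and the `combined_distance` name are hypothetical.

```python
# (max. distance, weight) per descriptor, taken from the slide.
DESCRIPTORS = {
    "ColorStructure":     (40.0, 3.0),
    "ColorLayout":        (300.0, 2.0),
    "ScalableColor":      (3000.0, 2.0),
    "EdgeHistogram":      (68.0, 4.0),
    "HomogeneousTexture": (25.0, 0.5),
}

def combined_distance(per_descriptor):
    """Weighted sum of normalized distances (assumed aggregation, see lead-in)."""
    total = 0.0
    for name, (max_dist, weight) in DESCRIPTORS.items():
        total += weight * per_descriptor[name] / max_dist
    return total

# Hypothetical per-descriptor distances between two images.
example = {"ColorStructure": 10.0, "ColorLayout": 150.0, "ScalableColor": 600.0,
           "EdgeHistogram": 17.0, "HomogeneousTexture": 5.0}
print(combined_distance(example))
```

If each per-descriptor distance is itself a metric, a sum with non-negative weights also satisfies the metric postulates, so a combination of this kind can still be indexed by metric techniques.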
SEMWEB 14
Face search
Face search demo – 6k images with people
  face detection – 10k detected faces
  face description – 64-dimensional vectors
  face comparison – advanced face descriptor (MPEG7)
Based on publicly available software
SEMWEB 15
SCALABILITY
Structured P2P networks
Objectives
  To scale to contemporary audio-visual data volumes and query execution throughput, i.e.:
  billions of objects
  online response time
  hundreds of queries per sec.
A peer
  Contains metric objects, can issue and answer queries, and knows a few other peers
SEMWEB 16
Why structured P2P in MUFIN
Structured P2P networks employ a globally consistent protocol to ensure that any peer can efficiently route a search to some peer that has the desired data
Structured P2P networks are used in MUFIN for:
  no bottleneck, no central component
  multiple access points to the network
  distribution of workload – parallel query execution
  dynamic structure of peers – (controlled) resilience, join, leave
  mechanisms for fault tolerance, replication and load balancing
SEMWEB 17
P2P Architecture of MUFIN
• Native metric techniques: GHT*, VPT*
• Transformation techniques: MCAN, M-Chord
(Skip-Graphs, Kademlia, etc.)
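Transformation techniques map metric objects into a domain that an existing P2P overlay can partition and route on. The sketch below assumes an iDistance-style pivot mapping to one-dimensional keys, with contiguous key intervals assigned to peers. It is not MUFIN's code: all names are hypothetical, overlay routing is omitted entirely, and the peers are plain in-memory lists.

```python
import math
import random

def build_peers(objects, pivots, d, c, n_peers):
    """Key each object by its closest pivot, then split the sorted keys among peers."""
    entries = []
    for x in objects:
        i = min(range(len(pivots)), key=lambda j: d(pivots[j], x))
        entries.append((i * c + d(pivots[i], x), i, x))  # (key, pivot id, object)
    entries.sort(key=lambda e: e[0])
    size = math.ceil(len(entries) / n_peers)
    return [entries[k:k + size] for k in range(0, len(entries), size)]

def range_query(peers, pivots, d, c, q, r, eps=1e-9):
    """Return (objects within distance r of q, indices of contacted peers)."""
    hits, contacted = [], set()
    for i, p in enumerate(pivots):
        dq = d(p, q)
        # Triangle inequality: a match x in cluster i has |d(p, x) - dq| <= r.
        lo = i * c + max(0.0, dq - r) - eps
        hi = min(i * c + dq + r, (i + 1) * c) + eps
        if lo > hi:
            continue
        for n, peer in enumerate(peers):
            if peer[-1][0] < lo or peer[0][0] > hi:
                continue  # this peer's key interval misses [lo, hi]
            contacted.add(n)
            for key, owner, x in peer:
                # Candidates from the key interval are verified with the real distance.
                if owner == i and lo <= key <= hi and d(q, x) <= r:
                    hits.append(x)
    return hits, contacted

random.seed(0)
objects = [(random.random(), random.random()) for _ in range(500)]
pivots = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
C = 2.0  # separation constant, larger than any distance in the unit square
peers = build_peers(objects, pivots, math.dist, C, n_peers=8)
hits, contacted = range_query(peers, pivots, math.dist, C, q=(0.5, 0.5), r=0.1)
print(len(hits), "objects found on", len(contacted), "of", len(peers), "peers")
```

The key interval only yields candidates; the final `d(q, x) <= r` check removes false positives, so the result equals that of a sequential scan while peers whose key interval misses the query can be skipped.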
SEMWEB 18
P2P Architecture of MUFIN
Peers are not necessarily computers
The peer size determines a lower bound on the query response time
A peer’s data can be searched by:
  Filtering
  M-tree
  D-index
  I-distance
  Etc.
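As an illustration of local search inside one peer, the sketch below reads "Filtering" as pivot filtering, which is an assumption: distances from every object to a few pivots are precomputed, and the triangle inequality gives a lower bound that lets many objects be discarded without computing the real distance. The class and method names are hypothetical.

```python
import math
import random

class PivotFilter:
    def __init__(self, objects, pivots, d):
        self.objects, self.pivots, self.d = objects, pivots, d
        # Precomputed object-to-pivot distances, one row per object.
        self.table = [[d(x, p) for p in pivots] for x in objects]

    def range_search(self, q, r, eps=1e-12):
        """Return (objects within r of q, number of real distance computations)."""
        dq = [self.d(q, p) for p in self.pivots]
        hits, computed = [], 0
        for x, row in zip(self.objects, self.table):
            # |d(q, p) - d(x, p)| <= d(q, x) for every pivot p.
            lower_bound = max(abs(a - b) for a, b in zip(dq, row))
            if lower_bound > r + eps:
                continue  # safely discarded without computing d(q, x)
            computed += 1
            if self.d(q, x) <= r:
                hits.append(x)
        return hits, computed

random.seed(1)
data = [(random.random(), random.random()) for _ in range(1000)]
index = PivotFilter(data, pivots=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], d=math.dist)
found, computed = index.range_search((0.5, 0.5), 0.05)
print(len(found), "hits using", computed, "distance computations out of", len(data))
```

The saving matters most when the distance function is expensive, since the per-object filter costs only a few subtractions on precomputed values.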
SEMWEB 19
Scalability test
  1M: 50 peers – memory based
  10M: 500 peers – memory based
  50M: 2000 peers – disk based
Effectiveness improves with data volume
Efficiency
  lower-bounded by the peer size (20k, 20k, 25k)
  does not change significantly
SEMWEB 20
Infrastructure as a Service
Why:
  Performance tuning
    Query response time
    Query execution throughput
  Performance adjustment
    Different performance requirements (day – night, weekend – working days)
  Experimental trials
    Test an application
    Purchase new hardware
  Availability - reliability
SEMWEB 21
MUFIN Hardware Mapping
10M network, 500 peers, memory-based
Batch of 250 queries started from 10 peers

        Sequential from 1 peer               Parallel from 10 peers
        single query [ms]                    single query [ms]
CPUs    max     min     avg     total [s]    max     min     avg     queries/s   total [s]
1       3248    169     1472    379          69692   169     12810   0.66        380
2       2320    168     780     203          24483   165     6654    1.34        186
4       1847    162     468     122          10355   165     3265    2.66        94
8       1806    170     324     87           5736    181     1787    5.10        49
16      1605    183     259     67           2691    184     958     9.26        27
SEMWEB 22
MUFIN Overview
[Figure: MUFIN architecture]
Data: feature extraction, external index; insert and delete of features
Index: Peer-to-Peer Networks, multi-overlay structure
Query forms
• range
• k-nearest
• complex
Query strategies
• precise
• approximate
• social
Interfaces: Web service
Universal
• batch, telnet, GUI
Specialized
• image web interface
SEMWEB 23
MUFIN plugin
News web-sites contain images
  CNN, BBC, SEZNAM, iDNES
Photography collection of US National Parks
  TERRA GALLERIA
Image text search
  Google, Yahoo, Yandex, Ask, Seznam, Rajče, exalead