29
Algorithms for Building Highly Scalable Distributed Data Storages Alexander Ponomarenko 2nd International Scientific Conference “SCIENCE OF THE FUTURE” September 20-23 2016, Kazan, Russia.

Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

AlgorithmsforBuildingHighlyScalableDistributedDataStorages

AlexanderPonomarenko2ndInternationalScientificConference“SCIENCEOFTHEFUTURE”

September20-232016,Kazan,Russia.

Page 2: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

HierarchyvsHeterarchyData

Page 3: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

HierarchyvsHeterarchyData

Page 4: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

HierarchyvsHeterarchyData

Page 5: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

HierarchyvsHeterarchyData

Page 6: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

DHTprotocolsandimplementations

• Aeropike• ApacheCassandra• BATONOverlay• MainlineDHT-StandardDHTusedbyBitTorrent(basedonKademlia)• CAN(ContentAddressableNetwork)• Chord• Koorde• Kademlia• Pastry• P-Grid• Riak• Tapestry• TomP2P• Voldemort

6

Page 7: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

ApplicationsemployingDHTs

• BTDigg:BitTorrentDHTsearchengine• cjdns:rou\ngengineformesh-basednetworks• CloudSNAP:adecentralizedwebapplica\ondeploymentpla]orm• Codeen:webcaching• CoralContentDistribu\onNetwork• FAROO:peer-to-peerWebsearchengine• Freenet:acensorship-resistantanonymousnetwork• GlusterFS:adistributedfilesystemusedforstoragevirtualiza\on• GNUnet:Freenet-likedistribu\onnetworkincludingaDHTimplementa\on• Hazelcast:Open-sourcein-memorydatagrid• I2P:Anopen-sourceanonymouspeer-to-peernetwork.• I2P-Bote:serverlesssecureanonymouse-mail.• JXTA:open-sourceP2Ppla]orm• OracleCoherence:anin-memorydatagridbuiltontopofaJavaDHTimplementa\on• Retroshare:aFriend-to-friendnetwork[17]• YaCy:adistributedsearchengine• Tox:aninstantmessagingsystemintendedtofunc\onasaSkypereplacement• Twister:amicrobloggingpeer-to-peerpla]orm• PerfectDark:apeer-to-peerfile-sharingapplica\onfromJapan

7

Page 8: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

StructuredPeer-to-PeerNetworks:ChordProtocol

Searchingofkey54staringfrom«N8».Routingtableofnode«N8»

Eachnode,n,maintainsaroutingtablewith(atmost)mentries,calledthefingertable.Thei-thentryinthetableatnodencontainstheidentityofthefirstnode,s,thatsucceedsnbyatleast2^(i-1)ontheidentifiercircle,i.e.,s=successor(n+2^(i-1)),where1<=i<=m

Distancefunction:d(x,y)=(y–x)mod2^m

Page 9: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

StructuredPeer-to-PeerNetworks:Kademlia

IdentifierspaceofKademlia

MaymounkovP.,MazieresD.Kademlia:Apeer-to-peerinformationsystembasedonthexormetric//Peer-to-PeerSystems.–SpringerBerlinHeidelberg,2002.–С.53-65.

Distancefunction:d(x,y)=xxory

Page 10: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

toOvercomeDHTDisadvantages

10

• DHTusesverysimpledistancefunctions• Hashingdestroyssemanticofthedata• It’shardtoperformcomplexqueries

Usenearestneighboursearchinhighdimensionalmetricspaceinsteadofexactsearch

Page 11: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

11

Let–domain-distancefunc1onwhichsa1sfiesproper1es:

– strictposi1veness:d(x,y)>0�x≠y,– symmetry:d(x,y)=d(y,x),– reflexivity:d(x,x)=0,– triangleinequality:d(x,y)+d(y,z)≥d(x,z).

NearestNeighborSearch

GivenafinitesetX={p1,…,pn}ofnpointsinsomemetricspace(D,d),needtobuildadatastructureonXsothatforagivenquerypointq∈Donecanfindapointp∈Xwhichminimizesd(p,q)withasfewdistancecomputa<onsaspossible

);0[: +∞→× RDDdD

Page 12: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

12

ExamplesofDistanceFunc3ons

•  LpMinkovskidistance(forvectors)•  L1–city-blockdistance

•  L2–Euclideandistance

•  L∞ –infinity

•  Editdistance(forstrings)•  minimalnumberofinser3ons,dele3onsandsubs3tu3ons•  d(‘applica3on’,‘applet’)=6

•  Jaccard’scoefficient(forsetsA,B)

∑=

−=n

iii yxyxL

11 ||),(

( )∑=

−=n

iii yxyxL

1

22 ),(

ii

n

iyxyxL −=

=∞ max),(

1

( )∪∩BA

BABAd −=1,

Page 13: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

13

2(| ( 1, 2) | | ( 1, 2) |)( 1, 2)(| ( 1) | | ( 1) |) (| ( 2) | | ( 2) |)

V G G E G Gsim G GV G E G V G E G

+=

+ ⋅ +

( 1, 2) 1 ( 1, 2)d G G sim G G= −

MaxCommonSubgraphSimilarity

Page 14: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Kleinberg’sNavigableSmallWorld

[KleinbergJ.Thesmall-worldphenomenon:Analgorithmicperspective//Proceedingsofthethirty-secondannualACMsymposiumonTheoryofcomputing.–ACM,2000.–С.163-170.]

Page 15: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

VoroNet,RayNet:AscalableobjectnetworkbasedonVoronoitessellations

BeaumontO.etal.VoroNet:AscalableobjectnetworkbasedonVoronoitessellations.–2006.

Distancefunction: 2 22),( yxyxd +=

Page 16: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

MetrizedSmallWorldAlgorithmu=1

u=2

u=3+

+=

Navigablesmallworld

“Toplevel”– first(oldest)elements

“Bottom”level- allelements

u=log(N)

u=log(N)-1

queryelement

R1 R2

[Y.Malkov,A.Ponomarenko,A.Logvinov,andV.Krylov,“Scalabledistributedalgorithmforapproximatenearestneighborsearchprobleminhighdimensionalgeneralmetricspaces,”inSimilaritySearchandApplications.Springer,2012,pp.32–147][Y.Malkov,A.Ponomarenko,A.Logvinov,andV.Krylov,“Approximatenearestneighboralgorithmbasedonnavigablesmallworldgraphs,”InformationSystems,vol.45,2014,pp.61–68.][PonomarenkoA.Query-BasedImprovementProcedureandSelf-AdaptiveGraphConstructionAlgorithmforApproximateNearestNeighborSearch//InternationalConferenceonSimilaritySearchandApplications.–SpringerInternationalPublishing,2015.–С.314-319.]

Page 17: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Booleannon-linearprogrammingformulationforoptimalgraphstructure

17

Page 18: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

0.6

0.5

0.52 0.52

0.4

0.3

0.42

0.42 0.42

0.41

0.2

0.2

0.20.2

0.41

0.3

Query

EntryPoint

Searchbygreedyalgorithm

Page 19: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Constructionalgorithm

0.5

0.8

0.90.4

0.7 0.3

0.7

0.90.2

Page 20: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Datasets

• CoPhIR(L2)isthecollectionof208-dimensionalvectorsextractedfromimagesinMPEG7format.

• SIFTisapartoftheTexMexdatasetcollectionavailablehttp://corpus-texmex.irisa.frIthasonemillion128-dimensionalvectors.EachvectorcorrespondstodescriptorextractedfromimagedatausingScaleInvariantFeatureTransformation(SIFT)

• Unfi64issyntheticdatasetof64-dimensionalvectors.Thevectorsweregeneratedrandomly,independentlyanduniformlyintheunithypercube.

Page 21: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Performanceofa10-NNsearchfor:plotsinthesamecolumncorrespondtothesamedataset

2L

[PonomarenkoA.etal.ComparativeAnalysisofDataStructuresforApproximateNearestNeighborSearch//DATAANALYTICS2014,TheThirdInternationalConferenceonDataAnalytics.–2014.–С.125-130.]

Page 22: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

KL-divergence: ∑=i

ii y

xxyxd log),( Final16,Final64,andFinal256:aresetsof0.5milliontopic

histogramsgeneratedusingtheLatentDirichletAllocation(LDA).

Page 23: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Wikipediadataset

1, 2, ,( , ,..., )j j j n jd w w w=

1, 2, ,( , ,..., )q q n qq w w w=

, ,1

2 2, ,

1 1

( , )|| || || ||

n

i j i qj i

j n nj

i j i qi i

w wd qsim d q

d qw w

=

= =

⋅= =

∑ ∑

VectorSpaceModel

Wikipedia(cosinesimilarity):isadatasetthatcontains3.2millionvectorsrepresentedinasparseformat.Thissethasanextremelyhighdimensionality(morethan100thousandelements).Yet,thevectorsaresparse:Onaverageonlyabout600elementsarenon-zero.

Page 24: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Dis

tanc

e C

ompu

tatio

ns

0

450000

900000

1350000

1800000

Number of Elements0 1000000 2000000 3000000 4000000

perm. vp-tree perm. incr. sort. msw perm. inv. index

ScalingofmethodsonWikipediadataset

Wikipediaisdatasetthatcontains3.2millionvectorsrepresentedinasparseformat.EachvectorcorrespondstothefrequencytermvectoroftheWikipediapageextractedusingthegensimlibrary.Thissethasanextremelyhighdimensionality(morethan100thousandelements).

Recall=0.9

Page 25: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Dis

tanc

e C

ompu

tatio

ns

0

10000

20000

30000

40000

Number of objects10000 100000 1000000 10000000

ScalingofMSWdatastructure

Page 26: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

Summingup• Algorithmisverysimple• Algorithmusesonlydistancevaluesbetweentheobjects,makingitsuitablefor

arbitraryspaces.• Proposeddatastructurehasnorootelement.• Alloperations(additionandsearch)useonlylocalinformationandcanbeinitiated

fromanyelementthatwaspreviouslyaddedtothestructure.• Accuracyoftheapproximatesearchcanbetunedwithoutrebuildingdatastructure• Algorithmhighscalablebothin

sizeanddatadimensionality

Goodbaseforbuildingmanyreal-worldextremedatasetsizehighdimensionalitysimilaritysearchapplications

Page 27: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

27

hwps://github.com/searchivarius/NonMetricSpaceLibhwps://github.com/aponom84/MetrizedSmallWorld

SourceCode

Page 28: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

28

Questions?

Page 29: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate

29

Questions?

WhyCERNdoesn’tuseDHT?