1
Distributed Adaptive Radix Tree for Efficient Metadata Search on HPC Systems Abstract Ø Affix-based keyword search is fundamental to metadata search on HPC systems. Ø Building inverted index to boost the performance of affix- based keyword search is a common practice but is challenging in distributed system like HPC. Ø We propose Distributed Adaptive Radix Tree (DART) to address the challenge of building such index in distributed environment. Ø Our experimental evaluation exhibits the effectiveness and efficiency of DART. Methods Ø DART Partition Tree Initialization While initializing DART on character set with k characters, we have to make sure the total number of leaf nodes in DART is larger than the total number of machines M. Thus, we initialize the height of the DART partition tree d as: Then, we have the following partition tree: Ø DART Base Virtual Node Selection Then, for a given string, starting from the root node of the partition tree, we locate the leaf node by following the path containing the sequence of first d characters in the given string. We choose such node as the base virtual node. Ø DART Alternative Virtual Node Selection For load balance purpose, we choose an alternative virtual node. We defined a different function for selecting the alternative virtual node. Ø DART Eventual Node Selection After selecting the base virtual node and alternative node, we detect the number of keywords indexed on them, and select the one with less keywords indexed as the eventual node. Ø DART Operation Complexity References: [1] W. Zhang, H. Tang, S. Byna and Y. Chen. DART: Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems. Accepted to appear in the Proc. of The 27th International Conference on Parallel Architectures and Compilation Techniques (PACT'18), 2018. (acceptance rate: 36/126=28.6%) [2] H. Tang, S. Byna, B. Dong, J. Liu, and Q. Koziol. 2017. SoMeta: Scalable Object-Centric Metadata Management for High Performance Computing. In Cluster Computing (CLUSTER), 2017 IEEE International Conference on. IEEE, 359–369. Background Ø Building inverted index for keyword search problem is a common practice. Ø In distributed system, building inverted index for keywords involves two different paradigms – document-partitioned approach and term-partitioned approach. Ø Document-partitioned approach suffers from heavy overhead introduced by query broadcasting. Ø “Full String Hashing” in term-based approach still suffers from query broadcasting in prefix search and suffix search Ø “Initial Hashing” in term-based approach suffers from imbalanced workload introduced by uneven keyword distribution led by different leading character. Ø Also, uneven query workload on popular keyword is noticeable. Acknowledgement: This research is supported in part by the National Science Foundation under grant CNS-1338078, IIP-1362134, CCF-1409946, and CCF-1718336. This work is supported in part by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC0205CH11231. (Project: Proactive Data Containers, Program manager: Dr. Lucy Nowell). This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility. Wei Zhang , Houjun Tang , Suren Byna , Yong Chen Department of Computer Science, Texas Tech University. Lawrence Berkeley National Laboratory. Search Performance at Scale Ø Search Throughput DART is able to achieve efficient affix-based keyword search at scale. As compared to two DHT cases, DART performs up to 55 times better in prefix search and suffix search while maintaining comparable performance in exact search and infix search. Ø Query Latency As the number of server increases, DART is able to maintain decent query latency which remains under 250 microseconds for prefix search, suffix search and exact search. For infix query, the query latency on DART still remains within 20 milliseconds on 256 servers, which is the largest scale we have ever tested. Load Balance at Scale DART is able to maintain load balance at scale. With different number of servers and different dataset, the dispersion of keyword distribution and query request distribution of DART is always between “full string hashing” and “initial hashing”. A ={A, B, C} {B,BB,BBB*} {CBC*} {AB, ABB*} C B A C B A A C B C A C B B B A C B A A C B C A C B A B A C B A A C B C A C B DART Partition Tree 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Prefix Search Throughput(TPS) Suffix Search Throughput(TPS) Exact Search Throughput(TPS) Infix Search Throughput(TPS) 124k 124k 125k 125k 126k 126k 0 1 2 3 4 5 6 7 8 9 a b c d e f 0 10k 20k 30k 40k 50k 60k ABCDEFGHI JKLMNOPQRSTUVWXYZ UUID DICT 0 500k 1000k 1500k 2000k 2500k 3000k 3500k ! #$ % & ' ( ) * + , - 0123456789: ; < > ? @ [ \ ] ^_`abcdefghi jklmnopqrs tuvwxyz {|} WIKI Requested Keywords Number of Request on each WIKI Keyword CV of UUID Keyword Distribution CV of DICT Keyword Distribution CV of WIKI Keyword Distribution CV of WIKI Request Distribution Conclusion Ø We have developed a trie-based inverted indexing technique, called DART, by which efficient affix-based keyword search can be performed. Ø Our evaluation result shows that DART can successfully facilitate affix-based keyword search and achieve up to 55 times better search performance than “Full String Hashing” DHT use case, while maintaining load balance introduced by imbalanced keyword distribution by different leading characters and uneven query workload of different keyword.

Distributed Adaptive Radix Tree for Efficient Metadata Search on … · 2020. 2. 5. · Contract No. DE-AC0205CH11231. (Project: Proactive Data Containers, Program manager: Dr. Lucy

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Distributed Adaptive Radix Tree for Efficient Metadata Search on … · 2020. 2. 5. · Contract No. DE-AC0205CH11231. (Project: Proactive Data Containers, Program manager: Dr. Lucy

Distributed Adaptive Radix Tree for Efficient Metadata Search on HPC Systems

Abstract

Ø Affix-based keyword search is fundamental to metadata

search on HPC systems.

Ø Building inverted index to boost the performance of affix-

based keyword search is a common practice but is

challenging in distributed system like HPC.

Ø We propose Distributed Adaptive Radix Tree (DART) to

address the challenge of building such index in distributed

environment.

Ø Our experimental evaluation exhibits the effectiveness and

efficiency of DART.

MethodsØ DART Partition Tree Initialization

While initializing DART on character set with k characters, we have tomake sure the total number of leaf nodes in DART is larger than thetotal number of machines M. Thus, we initialize the height of theDART partition tree d as:

Then, we have the following partition tree:

Ø DART Base Virtual Node Selection

Then, for a given string, starting from the root node of the partition tree,we locate the leaf node by following the path containing the sequenceof first d characters in the given string. We choose such node as thebase virtual node.

Ø DART Alternative Virtual Node Selection

For load balance purpose, we choose an alternative virtual node. Wedefined a different function for selecting the alternative virtual node.

Ø DART Eventual Node Selection

After selecting the base virtual node and alternative node, we detect thenumber of keywords indexed on them, and select the one with lesskeywords indexed as the eventual node.

Ø DART Operation Complexity

References:[1] W. Zhang, H. Tang, S. Byna and Y. Chen. DART: Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems. Accepted to appear in the Proc. of The 27th International Conference on Parallel Architectures and Compilation Techniques (PACT'18), 2018. (acceptance rate: 36/126=28.6%)

[2] H. Tang, S. Byna, B. Dong, J. Liu, and Q. Koziol. 2017. SoMeta: Scalable Object-Centric Metadata Management for High Performance Computing. In Cluster Computing (CLUSTER), 2017 IEEE International Conference on. IEEE, 359–369.

BackgroundØ Building inverted index for keyword search problem is a common

practice.

Ø In distributed system, building inverted index for keywordsinvolves two different paradigms – document-partitioned approachand term-partitioned approach.

Ø Document-partitioned approach suffers from heavy overheadintroduced by query broadcasting.

Ø “Full String Hashing” in term-based approach still suffers fromquery broadcasting in prefix search and suffix search

Ø “Initial Hashing” in term-based approach suffers from imbalancedworkload introduced by uneven keyword distribution led bydifferent leading character.

Ø Also, uneven query workload on popular keyword is noticeable. Acknowledgement:This research is supported in part by the National Science Foundation under grant CNS-1338078, IIP-1362134, CCF-1409946, and CCF-1718336.

This work is supported in part by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC0205CH11231. (Project: Proactive Data Containers, Program manager: Dr. Lucy Nowell).

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility.

Wei Zhang✞, Houjun Tang✣, Suren Byna✣, Yong Chen✞✞Department of Computer Science, Texas Tech University. ✣Lawrence Berkeley National Laboratory.

Search Performance at ScaleØ Search ThroughputDART is able to achieve efficient affix-based keyword search at scale. Ascompared to two DHT cases, DART performs up to 55 times better in prefixsearch and suffix search while maintaining comparable performance inexact search and infix search.

Ø Query LatencyAs the number of server increases, DART is able to maintain decent querylatency which remains under 250 microseconds for prefix search, suffixsearch and exact search. For infix query, the query latency on DART stillremains within 20 milliseconds on 256 servers, which is the largest scale wehave ever tested.

Load Balance at ScaleDART is able to maintain load balance at scale. With different numberof servers and different dataset, the dispersion of keyword distributionand query request distribution of DART is always between “full stringhashing” and “initial hashing”.

A={A, B, C}

{B,BB,BBB*}

{CBC*}

{AB, ABB*}

C B

AC

B A

AC

B

C

A

CB

B

B

AC

B

A

ACB

C

AC

B

AB

A C B

A

AC

B

C

ACB

DARTPartition Tree

0

1

23 4 5

6

7

8

910

1112

13

14

1516

17181920

21

2223

2425

26

Prefix Search Throughput(TPS) Suffix Search Throughput(TPS)

Exact Search Throughput(TPS) Infix Search Throughput(TPS)

124k

124k

125k

125k

126k

126k

0 1 2 3 4 5 6 7 8 9 a b c d e f0

10k

20k

30k

40k

50k

60k

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

UUID DICT

0500k1000k1500k2000k2500k3000k3500k

! # $% & ' ( ) * + , - 0 1 2 3 4 5 6 7 8 9 : ; < > ? @ [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }

WIKI Requested Keywords

Number of Request on each WIKI Keyword

CV of UUID Keyword Distribution CV of DICT Keyword Distribution

CV of WIKI Keyword Distribution CV of WIKI Request Distribution

ConclusionØ We have developed a trie-based inverted indexing technique,

called DART, by which efficient affix-based keyword searchcan be performed.

Ø Our evaluation result shows that DART can successfullyfacilitate affix-based keyword search and achieve up to 55times better search performance than “Full String Hashing”DHT use case, while maintaining load balance introduced byimbalanced keyword distribution by different leadingcharacters and uneven query workload of different keyword.