Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Distributed Adaptive Radix Tree for Efficient Metadata Search on HPC Systems
Abstract
Ø Affix-based keyword search is fundamental to metadata
search on HPC systems.
Ø Building inverted index to boost the performance of affix-
based keyword search is a common practice but is
challenging in distributed system like HPC.
Ø We propose Distributed Adaptive Radix Tree (DART) to
address the challenge of building such index in distributed
environment.
Ø Our experimental evaluation exhibits the effectiveness and
efficiency of DART.
MethodsØ DART Partition Tree Initialization
While initializing DART on character set with k characters, we have tomake sure the total number of leaf nodes in DART is larger than thetotal number of machines M. Thus, we initialize the height of theDART partition tree d as:
Then, we have the following partition tree:
Ø DART Base Virtual Node Selection
Then, for a given string, starting from the root node of the partition tree,we locate the leaf node by following the path containing the sequenceof first d characters in the given string. We choose such node as thebase virtual node.
Ø DART Alternative Virtual Node Selection
For load balance purpose, we choose an alternative virtual node. Wedefined a different function for selecting the alternative virtual node.
Ø DART Eventual Node Selection
After selecting the base virtual node and alternative node, we detect thenumber of keywords indexed on them, and select the one with lesskeywords indexed as the eventual node.
Ø DART Operation Complexity
References:[1] W. Zhang, H. Tang, S. Byna and Y. Chen. DART: Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems. Accepted to appear in the Proc. of The 27th International Conference on Parallel Architectures and Compilation Techniques (PACT'18), 2018. (acceptance rate: 36/126=28.6%)
[2] H. Tang, S. Byna, B. Dong, J. Liu, and Q. Koziol. 2017. SoMeta: Scalable Object-Centric Metadata Management for High Performance Computing. In Cluster Computing (CLUSTER), 2017 IEEE International Conference on. IEEE, 359–369.
BackgroundØ Building inverted index for keyword search problem is a common
practice.
Ø In distributed system, building inverted index for keywordsinvolves two different paradigms – document-partitioned approachand term-partitioned approach.
Ø Document-partitioned approach suffers from heavy overheadintroduced by query broadcasting.
Ø “Full String Hashing” in term-based approach still suffers fromquery broadcasting in prefix search and suffix search
Ø “Initial Hashing” in term-based approach suffers from imbalancedworkload introduced by uneven keyword distribution led bydifferent leading character.
Ø Also, uneven query workload on popular keyword is noticeable. Acknowledgement:This research is supported in part by the National Science Foundation under grant CNS-1338078, IIP-1362134, CCF-1409946, and CCF-1718336.
This work is supported in part by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC0205CH11231. (Project: Proactive Data Containers, Program manager: Dr. Lucy Nowell).
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility.
Wei Zhang✞, Houjun Tang✣, Suren Byna✣, Yong Chen✞✞Department of Computer Science, Texas Tech University. ✣Lawrence Berkeley National Laboratory.
Search Performance at ScaleØ Search ThroughputDART is able to achieve efficient affix-based keyword search at scale. Ascompared to two DHT cases, DART performs up to 55 times better in prefixsearch and suffix search while maintaining comparable performance inexact search and infix search.
Ø Query LatencyAs the number of server increases, DART is able to maintain decent querylatency which remains under 250 microseconds for prefix search, suffixsearch and exact search. For infix query, the query latency on DART stillremains within 20 milliseconds on 256 servers, which is the largest scale wehave ever tested.
Load Balance at ScaleDART is able to maintain load balance at scale. With different numberof servers and different dataset, the dispersion of keyword distributionand query request distribution of DART is always between “full stringhashing” and “initial hashing”.
A={A, B, C}
{B,BB,BBB*}
{CBC*}
{AB, ABB*}
C B
AC
B A
AC
B
C
A
CB
B
B
AC
B
A
ACB
C
AC
B
AB
A C B
A
AC
B
C
ACB
DARTPartition Tree
0
1
23 4 5
6
7
8
910
1112
13
14
1516
17181920
21
2223
2425
26
Prefix Search Throughput(TPS) Suffix Search Throughput(TPS)
Exact Search Throughput(TPS) Infix Search Throughput(TPS)
124k
124k
125k
125k
126k
126k
0 1 2 3 4 5 6 7 8 9 a b c d e f0
10k
20k
30k
40k
50k
60k
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
UUID DICT
0500k1000k1500k2000k2500k3000k3500k
! # $% & ' ( ) * + , - 0 1 2 3 4 5 6 7 8 9 : ; < > ? @ [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | }
WIKI Requested Keywords
Number of Request on each WIKI Keyword
CV of UUID Keyword Distribution CV of DICT Keyword Distribution
CV of WIKI Keyword Distribution CV of WIKI Request Distribution
ConclusionØ We have developed a trie-based inverted indexing technique,
called DART, by which efficient affix-based keyword searchcan be performed.
Ø Our evaluation result shows that DART can successfullyfacilitate affix-based keyword search and achieve up to 55times better search performance than “Full String Hashing”DHT use case, while maintaining load balance introduced byimbalanced keyword distribution by different leadingcharacters and uneven query workload of different keyword.