October 13-16, 2016 • Austin, TX
Shenghua Wan
Sr Software Engineer, @WalmartLabs [email protected]
Solr Distributed Indexing in WalmartLabs
Background
• Search big data, part of the Polaris Search Team at WalmartLabs
• Audience management, Acxiom Inc.
• HPC computational scientist, UT Southwestern Medical Center
Our perspective:
• To help make Solr indexing more scalable
• From a big data engineer's perspective
• Solr/Lucene internals are not covered in this talk
Problem Definition
• Input: 96 gzipped XML files
• Output: 3 shards of binary indexes, one for every 32 XML files
• Dedicated indexing servers do not scale
• Indexing time in the dev environment was at least 4 hours -> slowed down development iteration
Existing “Wheels” for Solr Distributed Indexing
• “Indexing Files via Solr and Java MapReduce” (Adam Smieszny, since 2012)
• LuceneIndexOutputFormat (Twitter’s Elephant-Bird, since 2013)
• MapReduceIndexerTool (Mark Miller, since late 2013)
Existing “Wheels” for Solr Distributed Indexing
☐ “Indexing Files via Solr and Java MapReduce” (Adam Smieszny, since 2012)
☐ LuceneIndexOutputFormat (Twitter’s Elephant-Bird, since 2013)
✓ MapReduceIndexerTool (Mark Miller, since late 2013). This tool is closest to our use case.
Start from MapReduceIndexerTool
Anatomy of this tool:
• MorphlineMapper: uses Morphlines to convert a document into a SolrInputDocument
• SolrRecordWriter: creates an embedded Solr instance to index the documents
• TreeMergeRecordWriter: merges multiple binary indexes into one
References:
1. https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
2. https://github.com/markrmiller/solr-map-reduce-example
Our Challenges
• Not using SolrCloud
• Not using ZooKeeper
• Solr version 4.0 (at the time we did the experiments)
• Environment
  • Hadoop version 1
  • MapR File System
  • XML input format
• Easy to maintain and debug
• Documentation: a runnable example with source code is the best. Thanks to https://github.com/markrmiller/solr-map-reduce-example.
Customize Design to Our Use Case
Breaking it down into two fundamental utilities:
• Index Generator: replace Morphlines with XmlInputFormat from Apache Mahout, and reuse SolrOutputFormat
• Index Merger: reuse TreeMergeOutputFormat
References:
1. https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
2. https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
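The core idea behind Mahout's XmlInputFormat is a boundary scan: each record is the byte span between a configured start tag and end tag, so one XML document becomes one map input. A simplified, in-memory stand-in for that scan (the real class does the same thing over HDFS split boundaries; `XmlRecordScanner` and its tag values are our illustration, not Mahout code):

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordScanner {
    // Collect every substring running from startTag through the next endTag.
    // This mirrors how XmlInputFormat carves <doc>...</doc> records out of
    // a large XML file without a full XML parse.
    public static List<String> scan(String data, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = data.indexOf(startTag, from);
            if (s < 0) break;
            int e = data.indexOf(endTag, s);
            if (e < 0) break;               // unterminated record: stop scanning
            records.add(data.substring(s, e + endTag.length()));
            from = e + endTag.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<docs><doc><id>1</id></doc><doc><id>2</id></doc></docs>";
        for (String r : scan(xml, "<doc>", "</doc>")) {
            System.out.println(r);
        }
    }
}
```

Because records never span a tag pair, the mapper can hand each one straight to an XML-to-SolrInputDocument converter without any cross-record state.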
Customize Design to Our Use Case – cont.
Breaking it down into two fundamental utilities:
• Index Generator
• Index Merger
More complicated logic can be built on top of these two simple map-only jobs.
Where is reduce? Our use case does not need it; we want it lean and fast. But you may need it.
Experiments and Observations
• Index Generation
  ✓ CPU-bound
  ✓ Can easily scale and run in parallel
  ✓ Map-only wins by 12-15% over Map-Reduce in our experiments
  ✓ ~5GB of decompressed XML documents indexed within 10 minutes using 7x3 mappers
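A back-of-envelope check of those numbers (inputs taken from the slide above) shows why the job is CPU-bound rather than IO-bound: the per-mapper rate is far below what any disk can deliver sequentially.

```java
public class IndexingThroughput {
    // Aggregate indexing throughput in MB/s from the slide's figures.
    public static double aggregateMBps(double gb, double minutes) {
        return gb * 1024 / (minutes * 60);
    }

    public static void main(String[] args) {
        double agg = aggregateMBps(5.0, 10.0); // ~5 GB in 10 minutes
        double perMapper = agg / (7 * 3);      // 7x3 mappers
        System.out.printf("aggregate %.1f MB/s, per mapper %.2f MB/s%n", agg, perMapper);
    }
}
```

That works out to roughly 8.5 MB/s across the job and well under 1 MB/s per mapper, so adding mappers (more CPU) is what speeds up index generation.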
Experiments and Observations – cont.
• Index Merging
  ✓ IO-bound: disk and network, but network was our pain point
  ✓ Two stages: logical merge and optimize
    o Logical merge: file movement
    o Optimize: reduces the number of index segments
Experiments and Observations – cont.
n-Way Merge: merging n roughly same-size shards into 1
Nothing suspicious.
Experiments and Observations – cont.
n-Way Merge: merging n roughly same-size shards into 1
Why does the time go up sharply all of a sudden?
• Too many shards
• Resource contention
Experiments and Observations – cont.
n-Way Merge: merging n roughly same-size shards into 1
Optimize time >> logical merge time, by 5x-8x (the 64-way case is an exception, considered an outlier because of the shared environment)
New Challenges
After contacting the cluster owner team, we were told that the interconnect of that cluster, which consists of almost five dozen nodes, is 1Gb/s Ethernet.
Experiments and Observations – cont.
How about a “tree”-structured merge?
Seems attractive.
Experiments and Observations – cont.
Comparing hierarchical merge and n-way merge total time
Kind of unexpected.
Experiments and Observations
Comparing hierarchical merge and n-way merge total time
Relatively isolated environment: no network involved, but disk IO (4 cores x 2 threads)
• Hierarchical merge: 4 small reads + 2 large reads
• n-way merge: 4 small reads
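One way to see why the hierarchical merge lost: every level of the tree re-reads and re-writes data that a flat n-way merge touches only once. A toy I/O model (our own back-of-envelope assumption, not a measurement from the cluster):

```java
public class MergeIoModel {
    // Total bytes read + written to merge n shards of s bytes each,
    // merging k shards at a time per pass. k = n is a flat n-way merge
    // (one pass); k = 2 is a binary tree merge (log2(n) passes).
    public static long totalIo(int n, long s, int k) {
        long total = 0;
        long shards = n;
        long size = s;
        while (shards > 1) {
            total += 2L * shards * size;        // read every shard, write merged output
            long merged = (shards + k - 1) / k; // outputs produced this pass
            size = size * shards / merged;      // merged shards grow accordingly
            shards = merged;
        }
        return total;
    }

    public static void main(String[] args) {
        long s = 1L << 30; // 1 GiB per shard (illustrative)
        System.out.println("flat 8-way merge IO:  " + totalIo(8, s, 8));
        System.out.println("binary tree merge IO: " + totalIo(8, s, 2));
    }
}
```

For 8 shards, the binary tree moves 3x the bytes of the flat merge, which matches the extra "large reads" in the comparison above.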
Lessons Learnt
• Index generation in parallel is easy
• Merging is not
• N-way merging of all shards at once is better than hierarchical merging
• Data locality is key
Our Solutions
• Plan A: “Hey, Sir/Madam, could you please get us a 48Gb/s InfiniBand network ASAP? Or 10Gb/s is also fine.”
• Plan B: A small dedicated indexing Hadoop cluster (starting from one node)
Our Solutions
A small dedicated indexing Hadoop cluster (starting from one node)

Environment       Disk IO (MB/s)
Shared            ~44
Mac Pro (SSD)     ~250
Dedicated         ~202
Dedicated cluster:
• 1 node
• 32 cores
• 128GB memory
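The disk IO column above can be reproduced roughly with a sequential-write micro-benchmark like the sketch below (a hypothetical tool, not the one behind the slide's numbers; the OS page cache inflates results, so we sync before stopping the clock):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskWriteBench {
    // Write totalMB of 1 MiB blocks sequentially and return MB/s.
    public static double writeMBps(Path file, int totalMB) throws IOException {
        byte[] block = new byte[1 << 20]; // 1 MiB of zeros
        long start = System.nanoTime();
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            for (int i = 0; i < totalMB; i++) out.write(block);
            out.getFD().sync();            // force data to disk before timing stops
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return totalMB / seconds;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bench", ".dat");
        System.out.printf("~%.0f MB/s sequential write%n", writeMBps(tmp, 64));
        Files.delete(tmp);
    }
}
```

Run it on the merge output directory of each candidate environment; anything in the ~44 MB/s range will make the optimize stage dominate.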
Tips
Tunable parameters:
• Split size (Map-Reduce)
• Batch size (Solr index)
• RAM buffer size (Solr index)
• Max number of segments (Solr index)
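The Solr-side knobs live in solrconfig.xml. A hypothetical Solr 4.x fragment for the RAM buffer setting (the value is illustrative, not our production configuration; tune it per hardware):

```xml
<!-- solrconfig.xml: a bigger RAM buffer means fewer flushes,
     hence fewer small segments to merge and optimize later -->
<indexConfig>
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexConfig>
```

Split size is set on the Hadoop job, and the max number of segments is the argument passed to the optimize step during merging.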
Opportunities
There are some parts missing in our tool. Our use case allows us to skip them, but you may want to have them:
1. Reduce functions (deduplication, other processing logic)
2. Try Spark or an equivalent (the bottleneck is the embedded Solr instance when merging)
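If you do add a reduce phase, deduplication is the canonical use: group documents by their unique key and emit one per key. A plain-Java stand-in for that reducer logic (class and field names are hypothetical; in MapReduce the framework does the grouping for you):

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    // Keep the first document seen for each value of keyField,
    // preserving input order for the survivors.
    public static Collection<Map<String, String>> dedup(
            List<Map<String, String>> docs, String keyField) {
        Map<String, Map<String, String>> byKey = new LinkedHashMap<>();
        for (Map<String, String> d : docs) byKey.putIfAbsent(d.get(keyField), d);
        return byKey.values();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
                Map.of("id", "1", "title", "a"),
                Map.of("id", "1", "title", "a-dup"),
                Map.of("id", "2", "title", "b"));
        System.out.println(dedup(docs, "id").size());
    }
}
```

The same shape (group by key, fold the group) covers most "other processing logic" a reduce phase would add, at the cost of a full shuffle that the map-only design avoids.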
Thanks! We are hiring!
Questions?