October 13-16, 2016 • Austin, TX
Shenghua Wan
Sr Software Engineer, @WalmartLabs [email protected]
Solr Distributed Indexing in WalmartLabs
Background
• Search big data, part of the Polaris Search Team at WalmartLabs
• Audience management, Acxiom Inc.
• HPC computational scientist, UT Southwestern Medical Center
Our perspective:
• To help make Solr indexing more scalable
• From a big data engineer's perspective
• Solr/Lucene internals are not covered in this talk
Problem Definition
• Input: 96 gzipped XML files
• Output: 3 shards of binary indexes, one for every 32 XML files
• Dedicated indexing servers do not scale
• Indexing time in the dev environment was at least 4 hours -> slowed down development iteration
Existing “Wheels” for Solr Distributed Indexing
• “Indexing Files via Solr and Java MapReduce” (Adam Smieszny, since 2012)
• LuceneIndexOutputFormat (Twitter’s Elephant-Bird, since 2013)
• MapReduceIndexerTool (Mark Miller, since late 2013)
Existing “Wheels” for Solr Distributed Indexing
☐ “Indexing Files via Solr and Java MapReduce” (Adam Smieszny, since 2012)
☐ LuceneIndexOutputFormat (Twitter’s Elephant-Bird, since 2013)
✓ MapReduceIndexerTool (Mark Miller, since late 2013). This tool is closest to our use case.
Start from MapReduceIndexerTool
Anatomy of this tool:
• MorphlineMapper: uses Morphlines to convert a document into a SolrInputDocument
• SolrRecordWriter: creates an embedded Solr instance to index the documents
• TreeMergeRecordWriter: merges multiple binary indexes into one
References:
1. https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
2. https://github.com/markrmiller/solr-map-reduce-example
Our Challenges
• Not using SolrCloud
• Not using ZooKeeper
• Solr version 4.0 (at the time we did the experiments)
• Environment
  • Hadoop version 1
  • MapR File System
  • XML input format
• Easy to maintain and debug
• Documentation: a runnable example with source code is the best. Thanks to https://github.com/markrmiller/solr-map-reduce-example.
Customize Design to Our Use Case
Breaking it down into two fundamental utilities:
• Index Generator: replace Morphlines with XmlInputFormat from Apache Mahout, and reuse SolrOutputFormat
• Index Merger: reuse TreeMergeOutputFormat
References:
1. https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
2. https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
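The core idea behind Mahout's XmlInputFormat is a boundary scan: each record is the byte span between a configured start tag and end tag, so one XML document becomes one map input. A simplified, in-memory stand-in for that scan (the real class does the same thing over HDFS split boundaries; `XmlRecordScanner` and its tag values are our illustration, not Mahout code):

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordScanner {
    // Collect every substring running from startTag through the next endTag.
    // This mirrors how XmlInputFormat carves <doc>...</doc> records out of
    // a large XML file without a full XML parse.
    public static List<String> scan(String data, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = data.indexOf(startTag, from);
            if (s < 0) break;
            int e = data.indexOf(endTag, s);
            if (e < 0) break;               // unterminated record: stop scanning
            records.add(data.substring(s, e + endTag.length()));
            from = e + endTag.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<docs><doc><id>1</id></doc><doc><id>2</id></doc></docs>";
        for (String r : scan(xml, "<doc>", "</doc>")) {
            System.out.println(r);
        }
    }
}
```

Because records never span a tag pair, the mapper can hand each one straight to an XML-to-SolrInputDocument converter without any cross-record state.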
Customize Design to Our Use Case – cont.
Breaking it down into two fundamental utilities:
• Index Generator
• Index Merger
More complicated logic can be built on top of these two simple map-only jobs.
Where is reduce? Our use case does not need it; we want it lean and fast. But you may need it.
Experiments and Observations
• Index Generation
  ✓ CPU-bound
  ✓ Can easily scale and run in parallel
  ✓ Map-only wins by 12-15% over Map-Reduce in our experiments
  ✓ ~5GB of decompressed XML documents indexed within 10 minutes using 7x3 mappers
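A back-of-envelope check of those numbers (inputs taken from the slide above) shows why the job is CPU-bound rather than IO-bound: the per-mapper rate is far below what any disk can deliver sequentially.

```java
public class IndexingThroughput {
    // Aggregate indexing throughput in MB/s from the slide's figures.
    public static double aggregateMBps(double gb, double minutes) {
        return gb * 1024 / (minutes * 60);
    }

    public static void main(String[] args) {
        double agg = aggregateMBps(5.0, 10.0); // ~5 GB in 10 minutes
        double perMapper = agg / (7 * 3);      // 7x3 mappers
        System.out.printf("aggregate %.1f MB/s, per mapper %.2f MB/s%n", agg, perMapper);
    }
}
```

That works out to roughly 8.5 MB/s across the job and well under 1 MB/s per mapper, so adding mappers (more CPU) is what speeds up index generation.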
Experiments and Observations – cont.
• Index Merging
  ✓ IO-bound: disk and network, but network was our pain point
  ✓ Two stages: logical merge and optimize
    o Logical merge: file movement
    o Optimize: reduces the number of index segments
Experiments and Observations – cont.
n-Way Merge: merging n roughly same-size shards into 1
Nothing suspicious.
Experiments and Observations – cont.
n-Way Merge: merging n roughly same-size shards into 1
Why does the time go up sharply all of a sudden?
• Too many shards
• Resource contention
Experiments and Observations – cont.
n-Way Merge: merging n roughly same-size shards into 1
Optimize time >> logical merge time, by 5x-8x (the 64-way case is an exception, considered an outlier because of the shared environment)
New Challenges
After contacting the cluster owner team, we were told that the interconnect of that cluster, which consists of almost five dozen nodes, is 1Gb/s Ethernet.
Experiments and Observations – cont.
How about a “tree”-structured merge?
Seems attractive.
Experiments and Observations – cont.
Comparing hierarchical merge and n-way merge total time
Kind of unexpected.
Experiments and Observations
Comparing hierarchical merge and n-way merge total time
Relatively isolated environment: no network involved, but disk IO (4 cores x 2 threads)
• Hierarchical merge: 4 small reads + 2 large reads
• n-way merge: 4 small reads
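One way to see why the hierarchical merge lost: every level of the tree re-reads and re-writes data that a flat n-way merge touches only once. A toy I/O model (our own back-of-envelope assumption, not a measurement from the cluster):

```java
public class MergeIoModel {
    // Total bytes read + written to merge n shards of s bytes each,
    // merging k shards at a time per pass. k = n is a flat n-way merge
    // (one pass); k = 2 is a binary tree merge (log2(n) passes).
    public static long totalIo(int n, long s, int k) {
        long total = 0;
        long shards = n;
        long size = s;
        while (shards > 1) {
            total += 2L * shards * size;        // read every shard, write merged output
            long merged = (shards + k - 1) / k; // outputs produced this pass
            size = size * shards / merged;      // merged shards grow accordingly
            shards = merged;
        }
        return total;
    }

    public static void main(String[] args) {
        long s = 1L << 30; // 1 GiB per shard (illustrative)
        System.out.println("flat 8-way merge IO:  " + totalIo(8, s, 8));
        System.out.println("binary tree merge IO: " + totalIo(8, s, 2));
    }
}
```

For 8 shards, the binary tree moves 3x the bytes of the flat merge, which matches the extra "large reads" in the comparison above.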
Lessons Learnt
• Index generation in parallel is easy
• Merging is not
• N-way merging of all shards at once is better than hierarchical merging
• Data locality is key
Our Solutions
• Plan A: “Hey, Sir/Madam, could you please get us a 48Gb/s InfiniBand network ASAP? Or 10Gb/s is also fine.”
• Plan B: A small dedicated indexing Hadoop cluster (starting from one node)
Our Solutions
A small dedicated indexing Hadoop cluster (starting from one node)

Environment       Disk IO (MB/s)
Shared            ~44
Mac Pro (SSD)     ~250
Dedicated         ~202
Dedicated cluster:
• 1 node
• 32 cores
• 128GB memory
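The disk IO column above can be reproduced roughly with a sequential-write micro-benchmark like the sketch below (a hypothetical tool, not the one behind the slide's numbers; the OS page cache inflates results, so we sync before stopping the clock):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskWriteBench {
    // Write totalMB of 1 MiB blocks sequentially and return MB/s.
    public static double writeMBps(Path file, int totalMB) throws IOException {
        byte[] block = new byte[1 << 20]; // 1 MiB of zeros
        long start = System.nanoTime();
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            for (int i = 0; i < totalMB; i++) out.write(block);
            out.getFD().sync();            // force data to disk before timing stops
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return totalMB / seconds;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bench", ".dat");
        System.out.printf("~%.0f MB/s sequential write%n", writeMBps(tmp, 64));
        Files.delete(tmp);
    }
}
```

Run it on the merge output directory of each candidate environment; anything in the ~44 MB/s range will make the optimize stage dominate.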
Tips
Tunable parameters:
• Split size (Map-Reduce)
• Batch size (Solr index)
• RAM buffer size (Solr index)
• Max number of segments (Solr index)
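The Solr-side knobs live in solrconfig.xml. A hypothetical Solr 4.x fragment for the RAM buffer setting (the value is illustrative, not our production configuration; tune it per hardware):

```xml
<!-- solrconfig.xml: a bigger RAM buffer means fewer flushes,
     hence fewer small segments to merge and optimize later -->
<indexConfig>
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexConfig>
```

Split size is set on the Hadoop job, and the max number of segments is the argument passed to the optimize step during merging.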
Opportunities
There are some parts missing in our tool. Our use case allows us to skip them, but you may want to have them:
1. Reduce functions (deduplication, other processing logic)
2. Try Spark or an equivalent (the bottleneck is the embedded Solr instance when merging)
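If you do add a reduce phase, deduplication is the canonical use: group documents by their unique key and emit one per key. A plain-Java stand-in for that reducer logic (class and field names are hypothetical; in MapReduce the framework does the grouping for you):

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    // Keep the first document seen for each value of keyField,
    // preserving input order for the survivors.
    public static Collection<Map<String, String>> dedup(
            List<Map<String, String>> docs, String keyField) {
        Map<String, Map<String, String>> byKey = new LinkedHashMap<>();
        for (Map<String, String> d : docs) byKey.putIfAbsent(d.get(keyField), d);
        return byKey.values();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
                Map.of("id", "1", "title", "a"),
                Map.of("id", "1", "title", "a-dup"),
                Map.of("id", "2", "title", "b"));
        System.out.println(dedup(docs, "id").size());
    }
}
```

The same shape (group by key, fold the group) covers most "other processing logic" a reduce phase would add, at the cost of a full shuffle that the map-only design avoids.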
Thanks! We are hiring!
Questions?