Upload
lucidworks
View
100
Download
3
Embed Size (px)
Citation preview
O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A
Rebalance API for SolrCloud
Nitin Sharma - Senior Software Engineer, Netflix Suruchi Shah - Software Engineer, Bloomreach
3
Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○ Query Performance ○ Redistribution of Data ○ Removing/Replacing Nodes ○ Indexing Performance
➢ Allocation Strategies
➢ Open Source ➢ Summary
4
Motivation ● Harder to scale & guarantee SLA (Availability, Query Performance, Data Freshness ) for a multi-tenant, cross datacenter
search architecture based on solrcloud
● Scaling issues related to Query Performance:
■ Inability to auto scale Solr serving with increasing index sizes
■ Dynamic Shard Setup based on index size
● SLA issues due to Availability:
■ Nightmare to manipulate cluster/collection setup with expanding clusters and frequent node replacements (AWS issues)
■ Flexible Replica Allocation based on custom strategies to guarantee 99.995% availability
● Data Freshness (aka Indexing Performance) hiccups:
■ Break the tight coupling between Indexing and Serving latency (Tp 95) SLAs
● Generic framework for Solr SLA management that can be open-sourced
5
Rebalance API ● Fine-grained SLA management in Solr
● Smarter Index, Cluster & Data Management for SolrCloud
● Forms the basis for Solr Auto Scale
● Admin handler in Solr
● 2 levels of abstraction
● Scaling Strategies
○ Aid with shard, replica and cluster manipulation techniques to guarantee SLA
○ Zero Downtime operations
○ Tunable for Availability, Performance or Cost
● Allocation Strategies
○ Decides Core Placement
○ Tunable for Availability, Performance or Cost
Rebalance API
Scaling Strategies Allocation Strategies
Auto Shard
Redistribute
Smart Merge
Replace
Scale Up
Scale Down
Least Used
Unused
AZ Aware
6
Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○ Query Performance ○ Redistribution of Data ○ Removing/Replacing Nodes ○ Indexing Performance
➢ Allocation Strategies
➢ Open Source ➢ Summary
7
01 Query Performance Issues
● Indexing doubles documents - Per shard latency goes up
● Tp 95 shoots up ● No way to change shards
dynamically ● Delete, Recreate, Re-index ● Availability goes down
Node 1
Zk Ensemble
Node 2
Solr Collection A
Indexing 2x documents
Shard 1 Shard 2
Query Tp95
Shard 3 Shard 4
8
03 Rebalance Auto Shard
● Re-sharding existing collection to any number of destination shards. (e.g can help with reducing latency)
● Includes re-distributing the index and configs consistently.
● Zero downtime - No query failures
● Avoiding any heavy re-indexing processes.
● Can be size based as well
● Sample API Call:
/admin/collections?action=REBALANCE &scaling_strategy=AUTO_SHARD &collection=collection_name&num_shards=number of shards &size_cap=1G &allocation_strategy=least_used
9
Solr Collection A
Node 1
Zk Ensemble
Node 2
Shard 1 Shard 2
Rebalance Auto Shard - Overview
Shard 3 Shard 4
10
01 Auto Shard (1) - Internals - Simple Strategy
● Merge documents from all shards - Lucene library based
● Split the merged shard into desired
number ● Auto Zk update for shard ranges ● Heavyweight but works ● Based on size, could take in 20-30 of
minutes to complete
Solr Collection A
Shard 1 Shard 2
Merged Shard (Temp)
Shard 1 Shard 2 Shard 3 Shard 4
Solr Collection A’
merge
Even Split
11
01 Auto Shard (2) - Internals - Smart Split Strategy
● Identify minimum number of splits ● Split shards in parallel to required desired
setup ● Relatively high performance ○ 2 Tb index from 2 to 4 shards - 2.5 mins
● Auto Zk Update
Solr Collection A
Shard 1 Shard 2
Shard 1_1
Shard 1_2
Shard 2_1
Shard 2_2
Solr Collection A
Shard 1 Shard 2 Shard 3 Shard 4
Solr Collection A
Smart Split Renamed Shards
12
01 Solution : Auto Sharding
● Dynamically Increase/Decrease shards
● E.g. Increase shards from 2 to 4 to reduce latency
● E.g. Tp 95 reduced from > 1 sec to 250
ms.
Solr Collection A
Node 1
Zk Ensemble
Node 2
Shard 1 Shard 2
Shard 3 Shard 4
13
03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○ Query Performance ○ Redistribution of Data ○ Removing/Replacing Nodes ○ Indexing Performance ○ Data Consistency
➢ Allocation Strategies
➢ Open Source ➢ Summary
14
01 Data distribution issues
● Adding a new node - Does nothing ● No Redistribution of solr cores ● Machines running out of disk space -
heavier collections need to be moved out ● Problem amplifies at large scale - 100s of
nodes, 1000s of collections ● Manual Management of core placement
becomes an issue
Default Solr Behavior
Node 1
Zk Ensemble
Node 2
Node 3
Core A2
Core B1
Core D2
Core D1
Core A1
Core B2
Core C1
Core C2
Node 4
15
01 Re-distribute Strategy - Internals
● Internal topology construction from ZK
● Desired Core placement computation - External
or Trigger based
● Migration of cores within the cluster
● Knobs to control min/max to reduce cluster
load
● Zero downtime - No query failures and
resiliency to node failures
● API Call: /admin/collections?
action=REBALANCE&scaling_strategy=REDIST
RIBUTE
Compute & Redistribute
Node 1
Zk Ensemble
Node 2
Node 3
Core A2
Core B1
Core D1
Core A1
Core B2
Core C1
Node 1
Zk Ensemble
Node 2
Node 3
Core A2
Core B1
Core D1
Core A1
Core B2
Core C1
16
01 Solution: Auto Redistribution of Data
Redistribute
● Adding new node -
triggers redistribution
● Respects the core
placement allocation
strategy
● Zero downtime
Node 1
Zk Ensemble
Node 2
Node 3
Core A2
Core B1
Core D2
Core D1
Core A1
Core B2
Core C1
Core C2
Node 4
Node 1
Zk Ensemble
Node 2
Node 3
Core A2
Core B1
Core D2
Core D1
Core A1
Core B2
Core C1
Core C2
Node 4
17
03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○ Query Performance ○ Redistribution of Data ○ Removing/Replacing Nodes ○ Indexing Performance
➢ Allocation Strategies
➢ Open Source ➢ Summary
18
01 Replace Solr Nodes Default Solr Behavior
● A node might die, need to be replaced,
decommissioned
● Default behavior - Do nothing
● Can cause downtime - Heavy cores on
the nodes
● Problem exacerbated with 1000s of
nodes/collections
Node 1
Zk
Ense
mbl
e
Node 2
Node 3
Core A2
Core B1
Core D1
Core A1
Core B2
Core C1
19
01 Replace Nodes with Rebalance
● Read the Topology of cluster
● Migrate replicas from node about to die to
new node
● Zero downtime
● API Call:
○ /admin/collections?
action=REBALANCE&scaling_strategy=REPLACE
&collection=collectionName&source_node=so
urce_host &dest_node=dest_host
Node 1
Zk
Ense
mbl
e
Node 2
Node 3
Core A2
Core B1
Core D1
Core A1
Core B2
Core C1
Node 4
Core B1
Core A1
Replaced Node
20
03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○ Query Performance ○ Redistribution of Data ○ Removing/Replacing Nodes ○ Indexing Performance
➢ Allocation Strategies
➢ Open Source ➢ Summary
21
01 Indexing Performance ● Higher the shards, faster the indexing
(parallelism) ● Faster indexing - Data Freshness SLA ● Solr - # shards is the same for indexing
vs serving ● Shard setup - tweaked for serving
query performance ● E.g. ○ Indexing 100M docs in 2 shards - 2
hours ○ Serving 100M docs in 2 shards - Tp
95 < 100 ms
Shard 1 Shard 2 Shard 3 Shard 4
Indexing
500M Documents
Performance Hit
Serving Queries
22
01 Indexing Performance - Smart Merge
● Separate Indexing shard setup vs serving
● More shards for indexing - Merged into lesser shards for serving
● Post Indexing issue API to merge into serving collection
● API Call: ○ /admin/collections?
action=Rebalance&scaling_strategy=SMART_MERGE_DISTRIBUTED&collection=collectionName&num_shards=numRequiredShards
● Parallel Merge
● Zero downtime
23
01 Indexing Performance - Smart Merge
● Index vs Serving has different collections
● Indexed Collection
merged into Serving - Using smart merge call
● Indexing can be tuned
independently for performance
● Serving SLA unaffected
Shard 1 Shard 2 Shard 3 Shard 4
Indexing
500M Documents
Serving Queries
Shard 1
Shard 4
Shard 5
Shard 8
Shard 9
Shard 12
Shard 13
Shard 16… … … …
Collection A_Indexing
Collection A
Parallel merge Parallel merge Parallel merge Parallel merge
24
Rebalance API ● Fine-grained SLA management in Solr
● Smarter Index, Cluster & Data Management for SolrCloud
● Forms the basis for Solr Auto Scale
● Admin handler in Solr
● 2 levels of abstraction
● Scaling Strategies
○ Aid with shard, replica and cluster manipulation techniques to guarantee SLA
○ Zero Downtime operations
○ Tunable for Availability, Performance or Cost
● Allocation Strategies
○ Decides Core Placement
○ Tunable for Availability, Performance or Cost
Rebalance API
Scaling Strategies Allocation Strategies
Auto Shard
Redistribute
Smart Merge
Replace
Scale Up
Scale Down
Least Used
Unused
AZ Aware
25
01 Allocation Strategies
● Abstracts out the core placement methodology
● Least Used Strategy - Pick the node that has the least amount of cores
● AZ aware Strategy - Pick the node that is in a different availability zone than the other
cores for a given collection
● Unused Strategy - Pick the node that does not have any cores for a given collection
● All of them are compatible with all scaling strategies
26
01 Open Source
● Fully open sourced - SOLR-9241
(4.6.1).
● Contributed patch works on top of
4.6.1 and tested up to 4.10
● SOLR-9241 (epic) - patches/features
on master. Has sub patches
○ SOLR-93{16-21}
○ SOLR-9407
● Actively working with community to
get the rest of the API 6+ compatible.
27
01 Summary ● Harder to scale & guarantee SLA (Availability, Query Performance, Data Freshness ) for a multi-tenant, cross datacenter search
architecture based on solrcloud
● Rebalance API
○ Scaling Strategies - How to scale?
○ Allocation Strategies - Where to place cores?
● Forms the basis for Solr Auto Scale
● Zero Downtime operations for
○ Dynamically changing shard setup
○ Decoupling indexing SLA from Serving
○ Replacing Nodes
○ Auto -Redistributing data with cluster expansion
● Open Source
28
01 Speakers
Nitin Sharma https://www.linkedin.com/in/knitinsharma [email protected]
Suruchi Shah https://www.linkedin.com/in/suruchishah [email protected]