Presented by Erick Erickson, Lucid Imagination - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 The next major release of Solr (4.0) will include "SolrCloud", which provides new distributed capabilities for both in-house and externally-hosted Solr installations. Among the new capabilities are: Automatic Distributed Indexing, High Availability and Failover, Near Real Time searching and Fault Tolerance. This talk will focus, at a high level, on how these new capabilities impact the design of Solr-based search applications primarily from infrastructure and operational perspectives.
How SolrCloud Changes the User Experience in a Sharded Environment
Lucene Revolution, 9-May-2012
Erick Erickson, Lucid Imagination
2
! “Erick is just some guy, you know”
• Your geekiness score is increased if you know where that quote comes from, and your age is hinted at
! 30+ years in the programming business, mostly as a developer
! Currently employed by Lucid Imagination in Professional Services
• I get to see how various organizations interpret “search”, and I’m amazed at the different problems Solr is used to solve
! Solr/Lucene committer
! [email protected]
! Sailor; anybody need crew for a sailboat delivery?
Who am I?
What we’ll cover
3
! Briefly, what else is coming in 4.0
! SolrCloud (NOT Solr-in-the-cloud), upcoming in 4.0
• What it is
• Why you may care
! Needs SolrCloud addresses
• DR/HA
• Distributed indexing
• Distributed searching
! I’m assuming basic familiarity with Solr
I’m not the implementer, Mark is
4
! Well, Mark Miller and others
! Mark’s talk (tomorrow) is a deeper technical dive; I recommend it highly
• Anything I say that contradicts anything Mark says, believe Mark
− After all, he wrote much of the code
! Mark insisted on the second slide after this one
When and Where can we get 4.0?
7
! When will it be released? Hopefully 2012
• Open Source; have you ever tried herding cats?
• Alpha/Beta releases are planned, which is unusual
• 3.6 is probably the last 3.x release
! How usable are nightly builds?
• LucidWorks Enterprise runs on trunk, so trunk is quite stable and in production
! There’s lots of new stuff!
• “unstable” doesn’t really mean unstable code
− Changing APIs; the index format may change
! Nightly builds: https://builds.apache.org//view/S-Z/view/Solr/
! Source code and build instructions: http://wiki.apache.org/solr/HowToContribute
Cool stuff in addition to SolrCloud in 4.0
8
Other cool 4.0 (trunk) features
9
! Similarity calculations decoupled from Lucene
! Scoring is pluggable; there are several different OOB implementations now (e.g. BM25)
! FST (Finite State Automata/Transducer) based work: speed and size improvements. http://www.slideshare.net/otisg/finite-state-queries-in-lucene
! FST-based fuzzy queries, 100x faster (McCandless’ blog)
! You can plug in your own index codec; see pulsing and SimpleTextCodec. This is really your own index format
• Can be done on a per-field basis
• Text output as an example
! Much more efficient in-memory structures
! NRT (Near Real Time) searching and “soft commits”
! Spatial (LSP) rather than the spatial contrib
More cool new features
10
! Added PivotFacetComponent for hierarchical faceting; see Yonik’s presentation, “Useful URLs” section
! Pseudo-join queries; see Yonik’s presentation URL in the “Useful URLs” section
! New Admin UI
! Can’t over-emphasize the importance of CHANGES.txt
• Solr
• Lucene
• Please read them when upgrading. Really
SolrCloud setup and use
11
What is SolrCloud
12
! SolrCloud is a set of new distributed capabilities in Solr that:
• Automatically distributes updates (i.e. indexed documents) to the appropriate shard
• Uses transaction logs for robust update recovery
• Automatically distributes searches in a sharded environment
• Automatically assigns replicas to shards when available
• Supports Near Real Time (NRT) searching
• Uses Zookeeper as the repository for cluster state
Common pain points (why you may care)
13
! Every large organization seems to have a recurring set of issues:
• Sharding: you have to do it yourself, usually through SolrJ or similar
• Capacity expansion: what to do when you need more capacity
• System status: getting alerts when machines die
• Replication: configuration
• Finding recently-indexed data: everyone wants “real time”
− Often not as important as people think, but...
• Inappropriate configuration
− Trying for “real time” by replicating every 5 seconds
− Committing every document/second/packet
− Mismatched schema or config files on masters and slaves
Common Pain Points (Why you may care)
14
! Maintaining different configuration files (and coordinating them) for masters and slaves
! SolrCloud addresses most of these. ! SolrCloud is currently “a work in progress”
Typical sharding setup
! Multiple indexers
! Query slaves
• 1 or more per indexer
! Yes, you can shard & distribute
(Diagram labels: Indexing, Searching, Load Balancer)
Steps to set this up
16
! Figure out how many shards are required
! Configure all masters, which may be complex
• Point your indexing at the appropriate master
! Configure all slaves
• Configure distributed searching
• Make sure the slaves point at the correct master
• Find out where you mis-configured something, e.g. “I’m getting duplicate documents”... because you indexed the same doc to two shards?
• Deal with your manager wanting to know why the doc she just indexed isn’t showing up in the search (replication delay)
• Rinse, repeat...
How is this different with SolrCloud?
17
! Decide how many shards you need
! Ask the ops folks how many machines you can have
! Start your servers:
• On the Zookeeper machine(s): java -Dbootstrap_confdir=./solr/conf -DzkRun -DnumShards=### -jar start.jar
• On all the other machines: java -DzkHost=<ZookeeperMachine:port>[,<ZookeeperMachine:port>...] -jar start.jar
! Index any way you want
• To any machine you want, perhaps in parallel
! Send searches to any machine you want
! Note: the demo uses embedded Zookeeper
• Most production installations will probably use “ensembles”
Diving a little deeper (indexing)
18
Diving a little deeper (indexing)
19
! How are shard machines assigned?
• It’s magic, ask Mark
• As each machine is started, it’s assigned shard N+1 until numShards is reached
• The information is recorded in Zookeeper, where it’s available to all
! How are leaders elected?
• Initially, on a first-come-first-served basis, so at initial setup each shard machine will be a leader (numShards == number of available machines)
! How are replicas assigned?
• See above (magic), but conceptually it’s on a “round robin” basis
• As each machine is started for the first time, it’s assigned to the shard with the fewest replicas (tie-breaking on lowest shard ID)
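The assignment behavior just described can be sketched in a few lines of Python. This is an illustrative toy, not Solr’s actual code; all names here are hypothetical:

```python
# Toy model of SolrCloud node-to-shard assignment, per the slide above:
# the first numShards nodes each become the leader of a new shard; later
# nodes become replicas of the shard with the fewest replicas, breaking
# ties on the lowest shard ID.

def assign_node(cluster, num_shards, node):
    """Assign a newly started node to a shard; returns the shard id."""
    if len(cluster) < num_shards:
        shard_id = len(cluster) + 1            # shard N+1 until numShards reached
        cluster.setdefault(shard_id, []).append((node, "leader"))
    else:
        # "round robin": fewest replicas wins, lowest shard id breaks ties
        shard_id = min(cluster, key=lambda s: (len(cluster[s]), s))
        cluster[shard_id].append((node, "replica"))
    return shard_id

cluster = {}
for n in ["node1", "node2", "node3", "node4", "node5"]:
    assign_node(cluster, num_shards=3, node=n)
# nodes 1-3 become leaders of shards 1-3; nodes 4 and 5 become
# replicas of shards 1 and 2 respectively
```

Running this with five nodes and numShards=3 reproduces the sequence shown on the “Assigning machines” slides that follow.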
Assigning machines
20
-DnumShards=3 -Dbootstrap_confdir=./solr/conf -DzkHost=<host>:<port>[,<host>:<port>]
Leader shard1
ZK Host(s)
Assigning machines
21
-DzkHost=<host>:<port>[,<host>:<port>]
Leader shard2
Leader shard1
ZK Host(s)
Assigning machines
22
-DzkHost=<host>:<port>[,<host>:<port>]
At this point you can index and search; you have one machine per shard
Leader shard2
Leader shard3
Leader shard1
ZK Host(s)
Assigning machines
23
-DzkHost=<host>:<port>[,<host>:<port>]
Leader shard2
Leader shard3
Leader shard1
Replica shard1
ZK Host(s)
Assigning machines
24
-DzkHost=<host>:<port>[,<host>:<port>]
Leader shard2
Leader shard3
Replica shard2
Leader shard1
Replica shard1
ZK Host(s)
Assigning machines
25
-DzkHost=<host>:<port>[,<host>:<port>]
Leader shard2
Leader shard3
Replica shard2
Leader shard1
Replica shard1
Replica shard3
ZK Host(s)
Diving a little deeper (indexing)
26
! Let’s break this up a bit
! There really aren’t any masters/slaves in SolrCloud
• There are “leaders” and “replicas”. Leaders are automatically elected
− A leader is just a replica with some coordination responsibilities for the associated replicas
• If a leader goes down, one of the associated replicas is elected as the new leader
• You don’t have to do anything for this to work
! When you send a document to a machine for indexing, the code (DistributedUpdateProcessor) does several things:
• If I’m a replica, forward the request to my leader
• If I’m a leader:
− determine which shard each document should go to and forward the doc (in batches of 10 presently) to that shard’s leader
− index any documents for this shard to myself and my replicas
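The routing logic above can be sketched as follows. This is a toy model, not Solr’s DistributedUpdateProcessor; Python’s built-in hash() stands in for Solr’s document routing hash, and the batch size of 10 comes from the slide:

```python
# Toy model of update routing in SolrCloud, per the slide above:
# a replica forwards the whole request to its leader; a leader buckets
# documents by shard and forwards each bucket in batches of 10.

BATCH_SIZE = 10      # slide: docs are forwarded "in batches of 10 presently"
NUM_SHARDS = 3

sent = []            # records (shard, batch) pairs, standing in for HTTP calls

def shard_for(doc_id):
    """Pick a shard for a document id; hash() is a stand-in here."""
    return hash(doc_id) % NUM_SHARDS

def handle_update(node, docs):
    """What a node does with an incoming update request."""
    if node["role"] == "replica":
        handle_update(node["leader"], docs)    # replica: forward to my leader
        return
    batches = {}
    for doc in docs:                           # leader: bucket docs by shard
        batches.setdefault(shard_for(doc["id"]), []).append(doc)
    for shard, shard_docs in batches.items():  # forward 10 at a time
        for i in range(0, len(shard_docs), BATCH_SIZE):
            sent.append((shard, shard_docs[i:i + BATCH_SIZE]))

leader = {"role": "leader"}
replica = {"role": "replica", "leader": leader}
handle_update(replica, [{"id": str(i)} for i in range(25)])
# every doc ends up routed to exactly one shard, no batch exceeds 10
```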
Diving a little deeper (indexing)
27
! When new machines are added and get assigned to a shard:
• Probably an old-style replication will occur initially; it’s most efficient for bulk updates
− This doesn’t require user intervention
• Any differences between the replication and the current state of the leader will be replayed from the transaction log until the new machine’s index is identical to the leader’s
• When this is complete, search requests are forwarded to the new machine
Diving a little deeper (indexing)
28
! Transaction log, huh?
! A record of updates is kept in the “transaction log”. This allows for more robust indexing
• Any time the indexing process is interrupted, any uncommitted updates can be replayed from the transaction log
! Synchronizing replicas has some heuristics applied:
• If there are “a lot” of updates (currently 100) to be synchronized, then an old-style replication is triggered
• Otherwise, the transaction log is “replayed” to synchronize the replica
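That heuristic is simple enough to sketch. The threshold of 100 comes from the slide; the function and field names are hypothetical, not Solr’s:

```python
# Toy model of the replica-recovery heuristic above: replay the
# transaction log for small gaps, fall back to old-style (full index)
# replication for large ones.

REPLAY_THRESHOLD = 100   # slide: "a lot" of updates is currently 100

def synchronize(replica_version, leader_log):
    """Return which recovery strategy a catching-up replica would use."""
    missed = [u for u in leader_log if u["version"] > replica_version]
    if len(missed) > REPLAY_THRESHOLD:
        return "replicate"   # too far behind: copy the whole index
    return "replay"          # close enough: replay missed updates from the log

log = [{"version": v} for v in range(1, 251)]   # leader has seen 250 updates
assert synchronize(240, log) == "replay"        # 10 missed -> replay the log
assert synchronize(100, log) == "replicate"     # 150 missed -> full replication
```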
Diving a little deeper (indexing)
29
! “Soft commits”, huh?
! Solr 4.0 introduces the idea of “soft commits” to handle “near real time” searching
• Historically, Solr required a “commit” to close segments. At that point:
− New searchers were opened so those documents could be seen
− Slaves couldn’t search new documents until after replication
! Think of soft commits as adding documents to an in-memory, writeable segment
• On a hard commit, the currently-open segment is closed and the in-memory structures are reset
! Soft commits can happen as often as every second
! Soft commits (and NRT) are used by SolrCloud, but can be used outside of the SolrCloud framework
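The mental model above can be made concrete with a small sketch. This is not Lucene’s internals, just the soft/hard commit distinction as described: a soft commit makes buffered documents searchable without touching disk, while a hard commit also closes the in-memory segment:

```python
# Toy model of soft vs. hard commits, per the slide above.

class Index:
    def __init__(self):
        self.segments = []       # durable, closed segments ("on disk")
        self.buffer = []         # in-memory, writeable "segment"
        self.searchable = []     # what the current searcher can see

    def add(self, doc):
        self.buffer.append(doc)  # not visible until some commit happens

    def soft_commit(self):
        # reopen the searcher over segments + buffer; nothing hits disk
        self.searchable = [d for s in self.segments for d in s] + list(self.buffer)

    def hard_commit(self):
        # close the in-memory segment, reset it, then reopen the searcher
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer = []
        self.soft_commit()

idx = Index()
idx.add("doc1")
assert "doc1" not in idx.searchable   # invisible before any commit
idx.soft_commit()
assert "doc1" in idx.searchable       # visible, but still only in memory
idx.hard_commit()
assert idx.segments == [["doc1"]] and idx.buffer == []
```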
Diving a little deeper (searching) and all the rest
30
Diving a little deeper (searching)
31
! Searching “just happens”
• There’s no distinction between masters and slaves, so any request can be sent to any machine in the cluster
! Searching is NRT. Since replication isn’t as significant now, this is automatic
• There is a small delay while the documents are forwarded to all the replicas
! Shard information does not need to be configured in Solr configuration files
Diving a little deeper (the rest)
32
! Capacity expansion ! System status ! Replication ! NRT ! Zookeeper
Capacity expansion
33
! Whew! Let’s say that you have your system running just fine, and you discover that you are running close to the edge of your capacity. What do you need to do to expand capacity?
• Install Solr on N more machines
• Start them up with the -DzkHost parameter
• Register them with your fronting load balancer
• Sit back and watch the magic
! Well, what about reducing capacity?
• Shut the machines down
System Status
34
! There is a new Admin UI that graphically shows the state of your cluster, especially active machines
! But overall, sending alerts etc. isn’t in place today, although it’s under discussion
Replication
35
! But we’ve spent a long time understanding replication!
! Well, it’s largely irrelevant now. When using SolrCloud, replication is handled automatically
• This includes machines being temporarily down. When they come back up, SolrCloud re-synchronizes them with the leader and forwards queries to them after they are synchronized
• This includes temporary glitches (say your network burps)
Finding Recently-indexed Docs (NRT)
36
! NRT has been a long time coming, but it’s here
! It’s Near Real Time because there are still slight delays from two sources:
• Until a “soft commit” happens, which can be every second
• Some propagation delay while incoming index requests are:
− perhaps forwarded to the shard leader
− forwarded to the proper shard
− forwarded to the replicas from the shard leader
• But these delays probably won’t be noticed
Zookeeper
37
! ZooKeeper is “a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.”
! A lot of complexity for maintaining Solr installations is solved with Zookeeper
! Zookeeper is the repository for cluster state information ! See: http://zookeeper.apache.org/
Using Zookeeper with SolrCloud
38
! The -DzkRun flag (in the demo) causes an embedded Zookeeper server to run in that Solr server
• Simple to use in the tutorials, but probably not the right option for production
• An enterprise installation will probably run Zookeeper as an “ensemble”, external to the Solr servers
! Zookeeper works on a quorum model where N/2+1 Zookeepers must be running
• It’s best to run an odd number of them (and three or more!) to avoid Zookeeper being a single point of failure
! Yes, setting up Zookeeper and making SolrCloud aware of it is an added bit of complexity, but TANSTAAFL (more age/geek points if you know where that comes from)
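The quorum arithmetic above explains the odd-number advice. A quick sketch (the function names are just for illustration):

```python
# The N/2+1 quorum rule from the slide above: an ensemble of N ZooKeeper
# servers needs floor(N/2)+1 running, so an even-sized ensemble tolerates
# no more failures than the next smaller odd-sized one.

def quorum(n):
    """Minimum number of servers that must be up in an ensemble of n."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many servers can die before the ensemble loses quorum."""
    return n - quorum(n)

for n in (1, 2, 3, 4, 5):
    print(f"{n} servers: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that 3 servers tolerate 1 failure while 4 servers still tolerate only 1, which is why odd-sized ensembles of three or more are the usual recommendation.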
Gotchas
39
! This is new and changing
• Optimistic locking is not fully in place yet
• At least one machine per shard must be running
! _version_ is a magic field, don’t change it
! It’s a whole new world; some of your infrastructure is obsolete
! We’re on the front end of the learning curve
! Some indexing speed penalty
! This is trunk; index formats may change, etc.
Useful URLs
40
! The Solr Wiki: http://wiki.apache.org/solr/
! Source code, builds, etc.: http://wiki.apache.org/solr/HowToContribute
! Main Solr/Lucene website: http://lucene.apache.org/
! Really good blogs:
• Simon Willnauer: http://www.searchworkings.org/blog/-/blogs/
• Mike McCandless: http://blog.mikemccandless.com/
• Lucid Imagination: http://www.lucidimagination.com/blog/
! Lucene Spatial Playground/Spatial4J: http://code.google.com/p/lucene-spatial-playground/
More useful URLs
41
! DocumentsWriterPerThread (DWPT) writeup (Simon Willnauer): http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
! FST and fuzzy queries 100x faster: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
! SolrCloud: http://wiki.apache.org/solr/SolrCloud
• NOT Solr-in-the-cloud
! Lucene JIRA: https://issues.apache.org/jira/browse/LUCENE
! Solr JIRA: https://issues.apache.org/jira/browse/SOLR
Even more useful URLs
42
! Yonik Seeley presentations: http://people.apache.org/~yonik/presentations/
• See particularly the Lucene Revolution 2011 presentation re: pivot faceting
! Grant Ingersoll’s memory estimator prototype (trunk): http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/
! Memory improvements: http://www.lucidimagination.com/blog/2012/04/06/memory-comparisons-between-solr-3x-and-trunk/
! Zookeeper: http://zookeeper.apache.org/
Thank You, Questions? Erick Erickson [email protected]