Upload
trongho1
View
219
Download
0
Embed Size (px)
Citation preview
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
1/34
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
2/34
OpenLogic, Inc.
Agenda
Introduction
The Problem
The Solution
Lessons Learned
Public Clouds &Big Data
Final Thoughts
Q & A
2
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
3/34
OpenLogic, Inc.
Introduction
Rod CopeCTO & Founder of OpenLogic
25 years of software development experience
IBM Global Services, Anthem, General Electric
OpenLogicOpen Source Provisioning, Support, and Governance Solutions
Certified library w/SLA support on 500+ Open Source packageshttp://olex.openlogic.com
Over 200 Enterprise customers
3
http://olex.openlogic.com/http://olex.openlogic.com/http://olex.openlogic.com/8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
4/34
OpenLogic, Inc.
The Problem
Big DataAll the worlds Open SourceSoftware
Metadata, code, indexes
Individual tables contain manyterabytes
Relational databases arentscale-free
Growing every day
Need real-time random access to all data
Long-running and complex analysis jobs
4
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
5/34
OpenLogic, Inc.
The Solution
Hadoop, HBase, and SolrHadoop distributed file system
HBase NoSQL data store column-oriented
Solr search server based on Lucene
All are scalable, flexible, fast, well-supported,used in production environments
And a supporting cast of thousandsStargate, MySQL, Rails, Redis, Resque,
Nginx, Unicorn, HAProxy, Memcached,Ruby, JRuby, CentOS,
5
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
6/34
OpenLogic, Inc.6
Solution Architecture
Live
replication
Livereplication
Livereplication(3x)
Livereplication
Data LANInternet Application LANCaching and load balancing not shown
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
7/34OpenLogic, Inc.
Getting Data out of HBase
HBase!NoSQLThink hash table, not relational database
Scanning vs. querying
How do find my data if primary key wont cut it?
Solr to the rescueVery fast, highly scalable search server with built-in sharding
and replication based on Lucene
Dynamic schema, powerful query language, faceted search,
accessible via simple REST-like web API w/XML, JSON, Ruby,and other data formats
7
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
8/34OpenLogic, Inc.
Solr
Sharding
Query any server it executes the same query against all otherservers in the group
Returns aggregated result to original caller
Async replication (slaves poll their masters)
Can use repeaters if replicating across data centersOpenLogic
Solr farm, sharded, cross-replicated, fronted with HAProxyLoad balanced writes across masters, reads across masters and slaves
Be careful not to over-commit
Billions of lines of code in HBase, all indexed in Solr for real-timesearch in multiple ways
Over 20 Solr fields indexed per source file
8
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
9/34OpenLogic, Inc.
Solr Implementation Sharding + Replication
9
Masters
Slaves
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
10/34OpenLogic, Inc.
Write Flow
10
Masters
Slaves
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
11/34OpenLogic, Inc.
Read Flow
11
Masters
Slaves
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
12/34OpenLogic, Inc.
Delete Flow
12
Masters
Slaves
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
13/34OpenLogic, Inc.
Write Flow - Failover
13
Masters
Slaves
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
14/34
OpenLogic, Inc.
Read Flow - Failover
14
Masters
Slaves
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
15/34
OpenLogic, Inc.
Loading Big Data
Experiment with different Solr merge factors
During huge loads, it can help to use a higher factor for loadperformance
Minimize index manipulation gymnastics
Start with something like 25
When youre done with the massive initial load/import, turn itback down for search performanceMinimize number of queries
Start with something like 5
Example:
curl http://solr1:8080/solr/master/update?optimize=true&maxSegments=5
This can take a few minutes, so you might need to adjust various timeouts
Note that a small merge factor will hurt indexing performance if youneed to do massive loads on a frequent basis or continuous indexing
15
http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=58/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
16/34
OpenLogic, Inc.
Loading Big Data (cont.)
Test your write-focused load balancingLook for large skews in Solr index size
Note: you may have to commit, optimize, write again, andcommit before you can really tell
Make sure your replication slaves are keeping upUsing identical hardware helpsIf index directories dont look the same, something is wrong
16
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
17/34
OpenLogic, Inc.
Loading Big Data (cont.)
Dont commit to Solr too frequentlyIts easy to auto-commit or commit after every record
Doing this 100s of times per second will take Solr down,especially if you have serious warm up queries configured
Avoid putting large values in HBase (> 5MB)Works, but may cause instability and/or performance issuesRows and columns are cheap, so use more of them instead
17
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
18/34
OpenLogic, Inc.
Loading Big Data (cont.)
Dont use a single machine to load the cluster
You might not live long enough to see it finish
At OpenLogic, we spread raw source data acrossmany machines and hard drives via NFS
Be very careful with NFS configuration can hang machines
Load data into HBase via Hadoop map/reduce jobsTurn off WAL for much better performance
put.setWriteToWAL(false)
Index in Solr as you goGood way to test your load balancing
write schemes and replication set up
This will find your weak spots!
18
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
19/34
OpenLogic, Inc.
Scripting Languages Can Help
Writing data loading jobs can be tedious
Scripting is faster and easier than writing Java
Great for system administration tasks, testing
Standard HBase shell is based on JRuby
Very easy Map/Reduce jobs with J/Ruby andWukong
Used heavily at OpenLogic
Productivity of RubyPower of Java Virtual Machine
Ruby on Rails, Hadoop integration, GUI clients
19
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
20/34
OpenLogic, Inc.
Java (27 lines)
20
public class Filter { public static void main( String[] args ) {
List list = new ArrayList(); list.add( "Rod" );list.add( "Neeta" );
list.add( "Eric" );list.add( "Missy" );
Filter filter = new Filter();
List shorts = filter.filterLongerThan( list, 4 ); System.out.println( shorts.size() );
Iterator iter = shorts.iterator(); while ( iter.hasNext() ) { System.out.println( iter.next() ); }}
public List filterLongerThan( List list, int length ) { List result = new ArrayList(); Iterator iter = list.iterator(); while ( iter.hasNext() ) { String item = (String) iter.next(); if ( item.length()
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
21/34
OpenLogic, Inc.
Scripting languages (4 lines)
Groovy
list = ["Rod", "Neeta", "Eric", "Missy"]shorts = list.find_all { |name| name.size 2
-> Rod Eric
list = ["Rod", "Neeta", "Eric", "Missy"]shorts = list.findAll { name -> name.size() println name }
-> 2 -> Rod Eric
JRuby
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
22/34
OpenLogic, Inc.
Cutting Edge
HadoopSPOF around Namenode, append functionality
HBaseBackup, replication, and indexing solutions
in fluxSolr
Several competing solutions around cloud-likescalability and fault-tolerance, includingZooKeeper and Hadoop integration
No clear winner, none quite ready for production
22
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
23/34
OpenLogic, Inc.
Configuration is Key
Many moving parts
Its easy to let typos slip throughConsider automated configuration
via Chef, Puppet, or similar
Pay attention to the details
Operating system max open files,sockets, and other limits
Hadoop and HBase configurationhttp://wiki.apache.org/hadoop/Hbase/Troubleshooting
Solr merge factor and normsDont starve HBase or Solr for memory
Swapping will cripple your system
23
http://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshooting8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
24/34
OpenLogic, Inc.
Commodity Hardware
Commodity hardware != 3 year old desktop
Dual quad-core, 32GB RAM, 4+ disks
Dont bother with RAID on Hadoop data disksBe wary of non-enterprise drives
Expect ugly hardware issues at some point
24
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
25/34
OpenLogic, Inc.
OpenLogics Hadoop Implementation
Environment100+ CPU cores
100+ Terabytes of disk
Machines dont have identity
Add capacity by plugging innew machines
Why not Amazon EC2?Great for computational bursts
Expensive for long-term storage of Big DataNot yet consistent enough for mission-critical usage of HBase
25
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
26/34
OpenLogic, Inc.
OpenLogics Hadoop and Solr Deployment
Dual quad-core and dual hex-core
Dell boxes32-64GB RAM
ECC (highly recommended by Google)
6 x 2TB enterprise hard drives
RAID 1 on two of the drivesOS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc.
Key source data backups
Hadoop datanode gets remaining drivesRedundant enterprise switches
Dual- and quad-gigabit NICs
26
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
27/34
OpenLogic, Inc.
Public Clouds and Big Data
Amazon EC2EBS Storage
100TB * $0.10/GB/month = $120k/year
Double Extra Large instances13 EC2 compute units, 34.2GB RAM
20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year3 year reserved instances
20 * 4k = $80k up front to reserve
(20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/yearto operate
Totals for 20 virtual machines1styear cost: $120k + $80k + $86k = $286k2nd& 3rdyear costs: $120k + $86k = $206k
Average: ($286k + $206k + $206k) / 3 = $232k/year
27
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
28/34
OpenLogic, Inc.
Private Clouds and Big Data
Buy your own20 * Dell servers w/12 CPU cores, 32GB RAM, 5 TB disk = $160k
Over 33 EC2 compute units each
Total: $53k/year (amortized over 3 years)
28
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
29/34
OpenLogic, Inc.
Public Clouds are Expensive for Big Data
Amazon EC2
20 instances * 13 EC2 compute units =260 EC2 compute units
Cost: $232k/year
Buy your own20 machines * 33 EC2 compute units =
660 EC2 compute units
Cost: $53k/year
Does not include hosting & maintenance costs
Dont think system administration goes awayYou still own all the instances monitoring, debugging, support
29
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
30/34
OpenLogic, Inc.
Expect Things to Fail A Lot
Hardware
Power supplies, hard drivesOperating System
Kernel panics, zombie processes,dropped packets
Software ServersHadoop datanodes, HBase regionservers,
Stargate servers, Solr servers
Your Code and DataStray Map/Reduce jobs, strange corner
cases in your data leading to programfailures
30
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
31/34
OpenLogic, Inc.
Not Possible Without Open Source
31
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
32/34
OpenLogic, Inc.
Not Possible Without Open Source
Hadoop, HBase, Solr
Apache, Tomcat, ZooKeeper,HAProxy
Stargate, JRuby, Lucene,
Jetty, HSQLDB, GeronimoApache Commons, JUnit
CentOS
Dozens more
Too expensive to build or buy everything
32
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
33/34
OpenLogic, Inc.
Final Thoughts
You can host your own big data
Tools are available today that didnt exist a few years agoFast to prototype production
readiness takes time
Expect to invest in training and support
HBase and Solr are fast100+ random queries/sec per instance
Give them memory and stand back
HBase scales, Solr scales (to a point)Dont worry about outgrowing a few machines
Do worry about outgrowing a rack of Solr instancesLook for ways to partition your data other than automatic sharding
33
8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop
34/34
Q & A
Any questions for [email protected]
Slides: http://www.openlogic.com/presentations/rod* Unless otherwise credited, all images in this presentation are either open source project logos or were licensed from BigStockPhoto.com
http://www.openlogic.com/presentations/rodmailto:[email protected]://www.openlogic.com/presentations/rodhttp://www.openlogic.com/presentations/rodmailto:[email protected]:[email protected]