Real-Time Searching of Big Data with Solr and Hadoop
Rod Cope, CTO & Founder, OpenLogic, Inc.
Agenda
- Introduction
- The Problem
- The Solution
- Details
- Final Thoughts
- Q & A
Introduction
Rod Cope
- CTO & Founder of OpenLogic
- 25 years of software development experience
- IBM Global Services, Anthem, General Electric

OpenLogic
- Open Source Support, Governance, and Scanning Solutions
- Certified library w/ SLA support on 500+ Open Source packages
- Over 200 Enterprise customers
The Problem
“Big Data”
- All the world’s Open Source Software: metadata, code, indexes
- Individual tables contain many terabytes
- Relational databases aren’t scale-free
- Growing every day

Need real-time random access to all data
Long-running and complex analysis jobs
The Solution
Hadoop, HBase, and Solr
- Hadoop – distributed file system
- HBase – “NoSQL” column-oriented data store
- Solr – search server based on Lucene
- All are scalable, flexible, fast, well-supported, and used in production environments

And a supporting cast of thousands…
- Stargate, MySQL, Rails, Redis, Resque, Nginx, Unicorn, HAProxy, Memcached, Ruby, JRuby, CentOS, …
Solution Architecture

[Diagram: a web browser, scanner client, and Maven client reach the system over the Internet through Nginx & Unicorn; a Ruby on Rails application, Resque workers, Solr, MySQL, Redis, and a Maven repo sit on the application LAN; HBase is reached via Stargate on the data LAN, with live replication (3x) between tiers. Caching and load balancing not shown.]
Hadoop/HBase Implementation
Private Cloud
- 100+ CPU cores
- 100+ terabytes of disk
- Machines don’t have identity
- Add capacity by plugging in new machines

Why not EC2?
- Great for computational bursts
- Expensive for long-term storage of Big Data
- Not yet consistent enough for mission-critical usage of HBase
Public Clouds and Big Data
Amazon EC2
- EBS storage: 100 TB * $0.10/GB/month = $120k/year
- Double Extra Large instances: 13 EC2 compute units, 34.2 GB RAM
  - 20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year
- 3-year reserved instances
  - 20 * $4k = $80k up front to reserve
  - (20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/year to operate
- Totals for 20 virtual machines
  - 1st year cost: $120k + $80k + $86k = $286k
  - 2nd & 3rd year costs: $120k + $86k = $206k
  - Average: ($286k + $206k + $206k) / 3 = $232k/year
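The totals above can be re-derived mechanically; a quick sketch using only the figures quoted on the slide (the $86k/year operating figure is taken from the slide as-is):

```ruby
# Re-derive the EC2 cost totals from the slide's own figures.
ebs_per_year     = 100_000 * 0.10 * 12   # 100 TB of EBS at $0.10/GB/month
reserve_upfront  = 20 * 4_000            # 20 reserved instances, $4k each, paid once
operate_per_year = 86_000                # slide's operating cost for 20 instances

year1   = ebs_per_year + reserve_upfront + operate_per_year  # first year
year23  = ebs_per_year + operate_per_year                    # each later year
average = (year1 + 2 * year23) / 3.0

puts year1    # 286000.0
puts year23   # 206000.0
puts average  # ~232667, the slide's "$232k/year"
```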
Private Clouds and Big Data
Buy your own
- 20 Dell servers w/ 12 CPU cores, 32 GB RAM, 5 TB disk = $160k
- Over 33 EC2 compute units each
- Total: $53k/year (amortized over 3 years)
Public Clouds are Expensive for Big Data

Amazon EC2
- 20 instances * 13 EC2 compute units = 260 EC2 compute units
- Cost: $232k/year

Buy your own
- 20 machines * 33 EC2 compute units = 660 EC2 compute units
- Cost: $53k/year (does not include hosting & maintenance costs)

Don’t think system administration goes away
- You still “own” all the instances – monitoring, debugging, support
Getting Data out of HBase
HBase is NoSQL
- Think hash table, not relational database
- Scanning vs. querying

How do I find my data if the primary key won’t cut it?
- Solr to the rescue
- Very fast, highly scalable search server with built-in sharding and replication – based on Lucene
- Dynamic schema, powerful query language, faceted search
- Accessible via a simple REST-like web API w/ XML, JSON, Ruby, and other data formats
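As a sketch of that web API (the host, core name, and field name here are invented, not OpenLogic's actual schema): a query is just an HTTP GET, and the JSON response wraps hits in a standard envelope:

```ruby
require 'uri'
require 'json'

# Build a Solr select URL asking for a JSON response.
def solr_query_url(host, core, q, rows: 10)
  params = URI.encode_www_form(q: q, wt: 'json', rows: rows)
  "http://#{host}/solr/#{core}/select?#{params}"
end

# Pull hit count and documents out of Solr's standard response envelope.
def extract_hits(json_body)
  response = JSON.parse(json_body)['response']
  [response['numFound'], response['docs']]
end

url = solr_query_url('solr1:8080', 'master', 'filename:pom.xml', rows: 5)
# => "http://solr1:8080/solr/master/select?q=filename%3Apom.xml&wt=json&rows=5"
```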
Solr Sharding
- Query any server – it executes the same query against all other servers in the group
- Returns the aggregated result to the original caller
- Async replication (slaves poll their masters)
- Can use repeaters if replicating across data centers

OpenLogic
- Solr farm: sharded, cross-replicated, fronted with HAProxy
- Writes load balanced across masters; reads across masters and slaves
- Billions of lines of code in HBase, all indexed in Solr for real-time search in multiple ways
- Over 20 Solr fields indexed per source file
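The fan-out query described above uses Solr's `shards` parameter; a minimal sketch (hosts and core names are hypothetical):

```ruby
require 'uri'

# Any core can coordinate a distributed query: list every shard in the
# "shards" parameter and the receiving core fans out and merges results.
def sharded_query_url(entry_host, shard_urls, q)
  params = URI.encode_www_form(q: q, shards: shard_urls.join(','))
  "http://#{entry_host}/solr/select?#{params}"
end

url = sharded_query_url('solr1:8080',
                        ['solr1:8080/solr/coreA', 'solr2:8080/solr/coreB'],
                        'license:GPL')
```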
Solr Implementation – Sharding + Replication

[Diagram: 26 machines, each hosting one master Solr core and one slave core replicating another machine’s master – Machine 1: core A + core Z’, Machine 2: core B + core A’, Machine 3: core C + core B’, …, Machine 26: core Z + core Y’. HAProxy instances front the masters and the slaves.]
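The deck doesn't spell out how a document is assigned to a master core; one common scheme (a sketch only, with hypothetical host names) is hashing the document id:

```ruby
require 'zlib'

# 26 hypothetical master cores, one per machine, as in the diagram.
MASTERS = ('A'..'Z').map.with_index(1) { |core, i| "machine#{i}/solr/core#{core}" }

# Hash the document id so each doc always lands on the same master core;
# that core's paired slave picks the doc up via replication polling.
def master_for(doc_id, masters = MASTERS)
  masters[Zlib.crc32(doc_id) % masters.size]
end
```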
Write Example

[Same sharding diagram, annotated for the write path: HAProxy load balances the write across the master cores; the paired slave picks it up via replication.]
Read Example

[Same sharding diagram, annotated for the read path: HAProxy load balances the read across masters and slaves.]
Delete Example

[Same sharding diagram, annotated for the delete path.]
Write Example – Failover

[Same sharding diagram, annotated for a write when a master core’s machine is down.]
Read Example – Failover

[Same sharding diagram, annotated for a read when a core is down.]
Configuration is Key
Many moving parts
- It’s easy to let typos slip through
- Consider automated configuration via Chef, Puppet, or similar

Pay attention to the details
- Operating system – max open files, sockets, and other limits
- Hadoop and HBase configuration
  - http://wiki.apache.org/hadoop/Hbase/Troubleshooting
- Solr merge factor and norms

Don’t starve HBase or Solr for memory
- Swapping will cripple your system
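For the open-file limits mentioned above, a typical setting looks like this (the 32768 value mirrors common HBase troubleshooting guidance; the `hadoop` user name is an assumption):

```
# /etc/security/limits.conf
hadoop  soft  nofile  32768
hadoop  hard  nofile  32768
```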
Commodity Hardware
“Commodity hardware” != 3-year-old desktop
- Dual quad-core, 32 GB RAM, 4+ disks
- Don’t bother with RAID on Hadoop data disks
- Be wary of non-enterprise drives
- Expect ugly hardware issues at some point
OpenLogic’s Hadoop and Solr Deployment
Dual quad-core and dual hex-core Dell boxes
- 32–64 GB RAM (ECC, highly recommended by Google)
- 6 x 2 TB enterprise hard drives
  - RAID 1 on two of the drives: OS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc., plus key “source” data backups
  - Hadoop datanode gets the remaining drives
- Redundant enterprise switches
- Dual- and quad-gigabit NICs
Expect Things to Fail – A Lot
Hardware
- Power supplies, hard drives

Operating System
- Kernel panics, zombie processes, dropped packets

Software Servers
- Hadoop datanodes, HBase regionservers, Stargate servers, Solr servers

Your Code and Data
- Stray Map/Reduce jobs, strange corner cases in your data leading to program failures
Cutting Edge
Hadoop
- SPOF around Namenode, append functionality

HBase
- Backup, replication, and indexing solutions in flux

Solr
- Several competing solutions around cloud-like scalability and fault tolerance, including ZooKeeper and Hadoop integration
- No clear winner, none quite ready for production
Loading Big Data
Experiment with different Solr merge factors
- During huge loads, it can help to use a higher factor for load performance
  - Start with something like 25
  - Minimize index manipulation gymnastics
- When you’re done with the massive initial load/import, turn it back down for search performance
  - Minimize the number of segments – start with something like 5
  - Example: curl 'http://solr1:8080/solr/master/update?optimize=true&maxSegments=5'
  - This can take a few minutes, so you might need to adjust various timeouts
- Note that a small merge factor will hurt indexing performance if you need to do massive loads on a frequent basis or continuous indexing
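The merge factor itself lives in solrconfig.xml; a fragment, assuming the Solr 1.4-era defaults section this deck's vintage used:

```xml
<!-- solrconfig.xml: higher mergeFactor for bulk loads, lower for search -->
<indexDefaults>
  <mergeFactor>25</mergeFactor>
</indexDefaults>
```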
Loading Big Data (cont.)
Test your write-focused load balancing
- Look for large skews in Solr index size
- Note: you may have to commit, optimize, write again, and commit before you can really tell

Make sure your replication slaves are keeping up
- Using identical hardware helps
- If index directories don’t look the same, something is wrong
Loading Big Data (cont.)
Don’t commit to Solr too frequently
- It’s easy to auto-commit or commit after every record
- Doing this 100s of times per second will take Solr down, especially if you have serious warm-up queries configured
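One way to avoid per-record commits is to buffer documents and commit per batch; a sketch (the `client` object and its `add`/`commit` methods are hypothetical stand-ins for whatever Solr client you use):

```ruby
# Buffer documents and commit once per batch instead of once per record.
class BatchedIndexer
  attr_reader :commits

  def initialize(client, batch_size: 1000)
    @client = client
    @batch_size = batch_size
    @buffer = []
    @commits = 0
  end

  def add(doc)
    @buffer << doc
    flush if @buffer.size >= @batch_size
  end

  def flush
    return if @buffer.empty?
    @buffer.each { |doc| @client.add(doc) }  # hypothetical client API
    @client.commit                           # one commit for the whole batch
    @buffer.clear
    @commits += 1
  end
end
```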
Avoid putting large values in HBase (> 5 MB)
- Works, but may cause instability and/or performance issues
- Rows and columns are cheap, so use more of them instead
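Following the "more rows and columns" advice, a large blob can be split across column qualifiers; a sketch (the 5 MB limit comes from the slide, the qualifier names are invented):

```ruby
MAX_CELL = 5 * 1024 * 1024  # the slide's ~5 MB guidance

# Split one large value into per-column chunks ("content:chunk0",
# "content:chunk1", ...) so no single HBase cell gets too big.
def chunk_columns(data, chunk_size = MAX_CELL)
  (0...data.bytesize).step(chunk_size).each_with_index.map do |offset, i|
    ["content:chunk#{i}", data.byteslice(offset, chunk_size)]
  end.to_h
end
```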
Loading Big Data (cont.)
Don’t use a single machine to load the cluster
- You might not live long enough to see it finish
- At OpenLogic, we spread raw source data across many machines and hard drives via NFS
- Be very careful with NFS configuration – it can hang machines

Load data into HBase via Hadoop map/reduce jobs
- Turn off the WAL for much better performance: put.setWriteToWAL(false)
- Index in Solr as you go
- Good way to test your load balancing write schemes and replication setup
- This will find your weak spots!
Scripting Languages Can Help
Writing data loading jobs can be tedious
- Scripting is faster and easier than writing Java
- Great for system administration tasks and testing
- The standard HBase shell is based on JRuby
- Very easy Map/Reduce jobs with JRuby and Wukong
- Used heavily at OpenLogic

JRuby
- Productivity of Ruby, power of the Java Virtual Machine
- Ruby on Rails, Hadoop integration, GUI clients
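A flavor of why such jobs stay short: a Hadoop-Streaming-style word count in plain Ruby (a generic sketch of the map/reduce contract, not the actual Wukong API):

```ruby
# Map: one line in, (key, 1) pairs out.
def map_line(line)
  line.split.map { |word| [word.downcase, 1] }
end

# Reduce: group pairs by key and sum the counts.
def reduce_pairs(pairs)
  pairs.group_by(&:first)
       .map { |word, kv| [word, kv.map(&:last).sum] }
       .to_h
end

counts = reduce_pairs(map_line('Solr and HBase and Hadoop'))
# counts['and'] == 2
```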
Java (27 lines)

    public class Filter {
        public static void main(String[] args) {
            List list = new ArrayList();
            list.add("Rod");
            list.add("Neeta");
            list.add("Eric");
            list.add("Missy");

            Filter filter = new Filter();
            List shorts = filter.filterLongerThan(list, 4);
            System.out.println(shorts.size());

            Iterator iter = shorts.iterator();
            while (iter.hasNext()) {
                System.out.println(iter.next());
            }
        }

        public List filterLongerThan(List list, int length) {
            List result = new ArrayList();
            Iterator iter = list.iterator();
            while (iter.hasNext()) {
                String item = (String) iter.next();
                if (item.length() <= length) {
                    result.add(item);
                }
            }
            return result;
        }
    }
Scripting languages (4 lines)

Groovy

    list = ["Rod", "Neeta", "Eric", "Missy"]
    shorts = list.findAll { name -> name.size() <= 4 }
    println shorts.size()
    shorts.each { name -> println name }

    -> 2
    -> Rod, Eric

JRuby

    list = ["Rod", "Neeta", "Eric", "Missy"]
    shorts = list.find_all { |name| name.size <= 4 }
    puts shorts.size
    shorts.each { |name| puts name }

    -> 2
    -> Rod, Eric
Not Possible Without Open Source
- Hadoop, HBase, Solr
- Apache, Tomcat, ZooKeeper, HAProxy
- Stargate, JRuby, Lucene, Jetty, HSQLDB, Geronimo
- Apache Commons, JUnit
- CentOS
- Dozens more

Too expensive to build or buy everything
Final Thoughts

You can host big data in your own private cloud
- Tools are available today that didn’t exist a few years ago
- Fast to prototype – production readiness takes time
- Expect to invest in training and support

HBase and Solr are fast
- 100+ random queries/sec per instance
- Give them memory and stand back

HBase scales; Solr scales (to a point)
- Don’t worry about outgrowing a few machines
- Do worry about outgrowing a rack of Solr instances
- Look for ways to partition your data other than “automatic” sharding
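One partitioning scheme of the kind hinted at above: route documents to independent shard groups by a natural attribute such as source language (the group names and hosts here are hypothetical):

```ruby
# Each shard group is its own small Solr farm that can grow independently,
# instead of one giant automatically-sharded index.
SHARD_GROUPS = {
  'java' => %w[solr1:8080 solr2:8080],
  'c'    => %w[solr3:8080 solr4:8080],
}.freeze
CATCH_ALL = %w[solr5:8080].freeze

def shard_group_for(language)
  SHARD_GROUPS.fetch(language, CATCH_ALL)
end
```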
Q & A
Any questions for [email protected]
* Unless otherwise credited, all images in this presentation are either open source project logos or were licensed from BigStockPhoto.com