Which Freaking Database Should I Use?
Andrew C. Oliver Open Software Integrators
www.osintegrators.com @osintegrators
Andrew C. Oliver
• Programming since I was about 8
• Java since ~1997
• Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
  o Former member Jakarta PMC
  o Emeritus member of Apache Software Foundation
• Joined JBoss ~2002
• Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)
• Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
  o I make fanboys cry.
Open Software Integrators
• Founded Nov 2007 by Andrew C. Oliver (me)
  o in Durham, NC
• Revenue and staff have at least doubled every year since 2009.
• New office (2012) in Chicago, IL
  o we're hiring mid to senior level as well as UI Developers (JQuery, Javascript, HTML, CSS)
  o up to 25% travel, salary + bonus, 401k, health, etc.
  o preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, JQuery
  o nice to have: Hadoop, Neo4j, Couchbase, Ruby, at least one Cloud platform
Overview
• Why not just use the RDBMS for everything?
• Operational vs Analytical
• Key Value
• Column Family
• Document
• Graph
• Hadoop?
• Convergence of "clustered filesystems" and "databases"
• Conclusions
Before we begin...
• Let's handle the Elephant or rather the teddy bears in the room: http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html/
RDBMS CAP characteristics
• Great at consistency
• Okay at availability
• Not so great at partition tolerance...
Single process model
• Lots of servers with many connections to few servers.
Many process model
• Lots of servers with many connections to as many servers as we need.
Historical Scalability
• 10 MB disks were "big"
• Scalability meant more disks, controllers and possibly CPUs
• CPUs went from 4.77 MHz to 3.4 GHz
• Disks went from 64kps@70ms to 6 Gb/s
• Network speeds went from under 4 Mb to gigabit to bonded gigabit and beyond.
• Disk speeds for a long time didn't keep up with CPU...
The Mathematical model
• RDBMS is based on "Relational Algebra", which is just an extension of basic "set theory"
• Not every problem is a set problem: "direct path" or "which thing contains this other thing which has this other thing" (foaf)
• Sometimes relationships are as important as the data
• Sometimes data is even simpler than the relational model but needs higher levels of availability, etc.
• One size never really did fit all
Datarrhea
• Yes I've already registered that ;-)
• The cheapness of storing data has yielded more demand
  o economics predicted this
• Moore's law ended while you slept
  o Intel says next year (but when did CPU speeds last double?)
• Massive parallelization is the most feasible way to get at it (counter trended with an explosion in disk speeds)
...but
• If
  o your data is tabular;
  o fits cleanly in a relational model;
  o you aren't having scalability issues;
  o you don't have a large dataset; or
  o a dataset/problem that lends itself to massive parallelization...
• you can probably stick with your RDBMS for now
  o ...and probably aren't at this conference anyhow.
JPA/RDBMS Tables Example
PersonID | Firstname | Lastname | CompanyID
2 | Andy | Oliver | 3

CompanyID | Name | City | State
3 | Open Software Integrators | Durham | NC

PhoneNumber | Type | PersonID
919.627.1236 | google | 2
919.321.0119 | work | 2
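A rough sketch of those three tables using Python's built-in sqlite3 module. The table names (person, company, phone) are assumptions, since the slide only shows column names; the rows are the ones on the slide. It is meant to show how many joins it takes to reassemble one person, not how any particular JPA mapping would do it.

```python
import sqlite3

# In-memory database; table names are hypothetical, columns follow the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (CompanyID INTEGER PRIMARY KEY, Name TEXT, City TEXT, State TEXT);
    CREATE TABLE person  (PersonID INTEGER PRIMARY KEY, Firstname TEXT, Lastname TEXT,
                          CompanyID INTEGER REFERENCES company(CompanyID));
    CREATE TABLE phone   (PhoneNumber TEXT, Type TEXT,
                          PersonID INTEGER REFERENCES person(PersonID));
""")
conn.execute("INSERT INTO company VALUES (3, 'Open Software Integrators', 'Durham', 'NC')")
conn.execute("INSERT INTO person VALUES (2, 'Andy', 'Oliver', 3)")
conn.executemany(
    "INSERT INTO phone VALUES (?, ?, ?)",
    [("919.627.1236", "google", 2), ("919.321.0119", "work", 2)],
)

# Reassembling one person's record takes two joins across three tables.
rows = conn.execute("""
    SELECT p.Firstname, p.Lastname, c.Name, ph.PhoneNumber, ph.Type
    FROM person p
    JOIN company c ON c.CompanyID = p.CompanyID
    JOIN phone ph ON ph.PersonID = p.PersonID
    ORDER BY ph.Type
""").fetchall()

for row in rows:
    print(row)
```

One row comes back per phone number, with the person and company columns repeated on each.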
Operational vs Analytical
• One DB type is unlikely to be well suited for all of your problems.
• The system doing "short and sweet" "lightweight" transactions is your operational system.
• The system doing long running reports and generating charts and graphs and statistics is your analytical system.
• There is also search. There are recommendation engines, etc.
Key-Value Stores
• Examples: Couchbase 1.8, Cassandra
  o also: Gemfire, Infinispan (distributed caches)
• Constant Time O(1) - Lookup by key
• Good enough for "right now" stock quotes
• Usually combined with an index for search, but the structure isn't inherently indexed.
• Generally works well with Map Reduce.
• Extremely scalable, easy to partition
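The lookup-by-key and easy-to-partition points can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Couchbase or Cassandra is implemented: the class name, the partition count, and the ticker symbols are all made up for illustration.

```python
import hashlib


class TinyKeyValueStore:
    """Hypothetical in-memory key-value store split across N partitions."""

    def __init__(self, partitions=4):
        self.partitions = [dict() for _ in range(partitions)]

    def _partition_for(self, key):
        # A stable hash of the key picks the partition, so any client can
        # route a request without consulting a lookup table.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.partitions[int(digest, 16) % len(self.partitions)]

    def put(self, key, value):
        self._partition_for(key)[key] = value

    def get(self, key, default=None):
        # Average-case O(1): one hash to find the partition, one dict lookup.
        return self._partition_for(key).get(key, default)


store = TinyKeyValueStore()
store.put("quote:ACME", b'{"price": 12.5}')    # the store never parses the value
store.put("quote:INIT", b'{"price": 101.25}')

print(store.get("quote:ACME"))
print(store.get("quote:MISSING"))
```

Because the value is opaque bytes, there is no way to ask "which quotes are above 100" without scanning everything, which is why these stores are usually paired with a separate index for search.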
Column Family / Big Table
• Many Key-Value stores support "column families"
  o Cassandra
• Some were designed this way
  o HBase
• Keys and values become composite
• essentially a hashmap with a multi-dimensional array
  o each column is a row of data
• map-reduce friendly
• Stock quote with time ranges
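A minimal sketch of the "hashmap with a multi-dimensional array" idea, using the stock-quote-with-time-ranges case. It assumes one row per ticker with timestamps as sorted column names; the class, ticker, dates, and prices are invented for illustration and this is not the HBase or Cassandra API.

```python
from bisect import bisect_left, bisect_right


class TinyColumnFamily:
    """Hypothetical column-family table: row key -> sorted columns -> value."""

    def __init__(self):
        # row key -> {"names": sorted column names, "values": {name: value}}
        self.rows = {}

    def put(self, row_key, column, value):
        row = self.rows.setdefault(row_key, {"names": [], "values": {}})
        if column not in row["values"]:
            # Keep column names sorted so a range of columns can be sliced out.
            row["names"].insert(bisect_left(row["names"], column), column)
        row["values"][column] = value

    def get_range(self, row_key, start, end):
        """Return (column, value) pairs with start <= column <= end."""
        row = self.rows.get(row_key)
        if row is None:
            return []
        lo = bisect_left(row["names"], start)
        hi = bisect_right(row["names"], end)
        return [(name, row["values"][name]) for name in row["names"][lo:hi]]


quotes = TinyColumnFamily()
quotes.put("ACME", "2012-07-17T09:30", 12.50)
quotes.put("ACME", "2012-07-17T09:32", 12.58)
quotes.put("ACME", "2012-07-17T09:31", 12.55)   # arrives out of order
quotes.put("ACME", "2012-07-17T16:00", 12.10)

morning = quotes.get_range("ACME", "2012-07-17T09:30", "2012-07-17T09:59")
print(morning)
```

The row key is still a plain hash lookup; the sorted columns inside the row are what make a time-range read cheap.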
HBase Example
Row key | First name | Last name | Company | City | State | Phone number | Phone type
5bfbd4a0-d02a-11e1-9b23-0800200c9a66 | Andy | Oliver | Open Software Integrators | Durham | NC | 919-627-1236 | google
7b2435c0-d02a-11e1-9b23-0800200c9a66 | Andy | Oliver | Open Software Integrators | Durham | NC | 919-321-0119 | work
Document databases
• Many developers think these are the "holy grail" since they fit nicely with object-oriented programming.
• Couchbase 2.0, CouchDB, MongoDB
• JSON documents
• One way to think of this is a Key-Value store that understands the values.
• Not as map-reduce friendly; larger datasets require indexes.
• A clear fit for REST services and as an operational store
• JSON document:
{
  "firstname" : "Andy",
  "lastname" : "Oliver",
  "company" : "Open Software Integrators",
  "location" : { "city" : "Durham", "state" : "NC" },
  "phone" : [
    { "number" : "123 456 7890", "type" : "mobile" },
    { "number" : "123 654 1234", "type" : "work" }
  ]
}
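To make "a Key-Value store that understands the values" concrete, here is a small Python sketch. The document keys, the second document, and the find helper are hypothetical; real document databases expose their own query languages and use indexes rather than the full scan shown here.

```python
# Documents keyed by ID, as in a key-value store. The keys and the second
# document are made up; the first document is the one from the slide.
documents = {
    "person::andy": {
        "firstname": "Andy",
        "lastname": "Oliver",
        "company": "Open Software Integrators",
        "location": {"city": "Durham", "state": "NC"},
        "phone": [
            {"number": "123 456 7890", "type": "mobile"},
            {"number": "123 654 1234", "type": "work"},
        ],
    },
    "person::pat": {
        "firstname": "Pat",
        "lastname": "Example",
        "location": {"city": "Chicago", "state": "IL"},
        "phone": [],
    },
}


def find(docs, path, expected):
    """Return keys of documents whose value at a dotted path equals `expected`."""
    matches = []
    for key, doc in docs.items():
        node = doc
        for part in path.split("."):
            # Walk into nested objects; anything else ends the walk with None.
            node = node.get(part) if isinstance(node, dict) else None
        if node == expected:
            matches.append(key)
    return matches


in_nc = find(documents, "location.state", "NC")
work_numbers = [
    p["number"] for p in documents["person::andy"]["phone"] if p["type"] == "work"
]

print(in_nc)
print(work_numbers)
```

The whole person, including phones and location, comes back in one lookup by key, with no joins. The cost is that querying by a nested field touches every document unless an index exists.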
Graph Databases
• Based on Graph Theory
• Less about volume of the data and more about complexity
• Many are transactional
  o often the transactions are "more correct" than those offered by a relational database.
• FOAF, direct path operations are easy
  o very complicated/inefficient in RDBMS
• Usually paired with an index for search
Neo4j Graph Example
[Graph diagram. Nodes: a person (Firstname: Andrew, Lastname: Oliver); Company: Open Software Integrators; Phone Number: 919.627.1236 (Type: googlevoice); Phone Number: 919.321.0119 (Type: work); City: Durham, State: NC; City: Chicago, State: IL. Relationship labels: HAS, WORKS FOR, FOUNDED, LOCATED, RESIDES.]
Note the extra relationships and details here - graph databases are just fun and easy to understand.
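The FOAF and direct path operations mentioned for graph databases can be sketched over a plain adjacency map in Python. The KNOWS graph and every name in it other than "andrew" are invented; a real graph database such as Neo4j would run this kind of traversal through its own query language or API rather than code like this.

```python
from collections import deque

# Hypothetical undirected KNOWS relationships; the names are made up.
knows = {
    "andrew": {"beth", "carlos"},
    "beth": {"andrew", "dana"},
    "carlos": {"andrew", "dana"},
    "dana": {"beth", "carlos", "eli"},
    "eli": {"dana"},
}


def friends_of_friends(graph, person):
    """People exactly two hops away who are not already direct friends."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorted only to make the chosen path deterministic.
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(friends_of_friends(knows, "andrew"))
print(shortest_path(knows, "andrew", "eli"))
```

Each hop is a direct lookup of a node's neighbors. In a relational schema the same question becomes one self-join per hop, which is where the "very complicated/inefficient in RDBMS" point comes from.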
Where does Hadoop fit?
• NoSQL
• Software Framework (lots of pieces/lots of choices):
  o Pig - scripting language used to quickly write MapReduce code to handle unstructured sources
  o Hive - facilitates structure for the data
  o HCatalog - provides inter-operability between these internal systems
  o HBase - Bigtable-type database
  o HDFS - Hadoop file system
• Excellent choice for data processing and data analysis
• MapReduce
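For readers new to MapReduce, the shape of a job can be sketched in single-process Python. This is not Hadoop's API: the function names are made up, the records reuse phone entries from earlier slides, and a real job would spread the map and reduce phases across many machines.

```python
from collections import defaultdict

# Records shaped like the phone entries used earlier in the deck.
records = [
    {"number": "919.627.1236", "type": "google"},
    {"number": "919.321.0119", "type": "work"},
    {"number": "123 654 1234", "type": "work"},
]


def map_phase(record):
    # Emit one (key, value) pair per record; each record is handled independently.
    yield record["type"], 1


def shuffle(pairs):
    # Group all values by key, as the framework would do between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine every value seen for one key into a single result.
    return key, sum(values)


mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Because map_phase looks at one record at a time and reduce_phase at one key at a time, both phases can be split across as many workers as there are records or keys.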
Convergence of Filesystems and Databases
• Hadoop HDFS is...a distributed filesystem
• So are Gluster, Ceph, GFS, etc.
• Hadoop can use Ceph or Gluster in place of HDFS
Other Derivatives
• Triplestores
  o Apache Jena
• OODBMS / ORDBMS
  o Caché
Things you may consider
• Persistence
  o Asynch / Synch
• Replication
• Availability
• Transactions / Consistency
• "Locality"
• Language
• Resources
  o http://en.wikipedia.org/wiki/Comparison_of_structured_storage_software
  o http://sevenweeks.org/
Conclusions
• RDBMS may not scale to your needs
• Your data may not map efficiently to tables
• Key Value Store - data by key, fast, scalable, can't handle complex data
• Column Family/Big Table - fast, scalable, denormalized, map reduce, good for series, not efficient for complex data
• Document - a good operational system, not your analytical, moderately scalable, matches OO
• Graph - great for complex data, transactional, less scalable
• Filesystems and "databases" are converging