Which Freaking Database Should I Use?

Great Wide Open 2014 - Day 1 Andrew Oliver - OS Integrators 10:15 AM - Operations 1 (Databases)

Page 1: Which Freaking Database Should I Use?

Which Freaking Database Should I Use?

Andrew C. Oliver

@acoliver

{Great Wide Open | Atlanta}

{Open Software Integrators} { www.osintegrators.com} {@osintegrators}

Page 2: Which Freaking Database Should I Use?

Andrew C. Oliver

• Programming since I was about 8

• Java since ~1997

• Founded POI project (currently hosted at Apache) with Marc Johnson ~2000

o Former member Jakarta PMC

o Emeritus member of Apache Software Foundation

• Joined JBoss ~2002

• Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)

• Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver

o I make fanboys cry.

Page 3: Which Freaking Database Should I Use?

Open Software Integrators

• Founded Nov 2007 by Andrew C. Oliver (me)

o in Durham, NC

Revenue and staff have at least doubled every year since 2009.

• New office (2012) in Chicago, IL

o we're hiring mid to senior level as well as UI Developers (jQuery, JavaScript, HTML, CSS)

o up to 25% travel, salary + bonus, 401k, health, etc.

o preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery

o nice to have: Hadoop, Neo4j, Couchbase, Ruby, at least one Cloud platform

Page 4: Which Freaking Database Should I Use?

Overview

• Why not just use the RDBMS for everything?

• Operational vs Analytical

• Key Value

• Column Family

• Document

• Graph

• Hadoop?

• Convergence of "clustered filesystems" and "databases"

• Conclusions

Page 5: Which Freaking Database Should I Use?

{2014 Great Wide Open | Atlanta}

Why Not "Just Use" RDBMS for Everything?

Page 6: Which Freaking Database Should I Use?

Before we begin...

• Let's handle the elephant, or rather the teddy bears, in the room:

http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html/

Page 7: Which Freaking Database Should I Use?

The CAP theorem

Page 8: Which Freaking Database Should I Use?

RDBMS CAP characteristics

• Great at consistency

• Okay at availability

• Not so great at partition tolerance...

Page 9: Which Freaking Database Should I Use?

Single process model

• Lots of servers with many connections to few servers.

Page 10: Which Freaking Database Should I Use?

Multiprocess Model

[Diagram: four nodes, each running a Data Manager and a Cluster Manager]

Page 11: Which Freaking Database Should I Use?

Historical Scalability

• 10 MB disks were "big"

• Scalability meant more disks, controllers and possibly CPUs

• CPUs went from 4.77 MHz to 3.4 GHz

• Disks went from 64 kps @ 70 ms to 6 Gb/s

• Network speeds went from under 4 Mb/s to gigabit to bonded gigabit and beyond.

• Disk speeds for a long time didn't keep up with CPU...

Page 12: Which Freaking Database Should I Use?

The Mathematical Model

• RDBMS is based on "Relational Algebra", which is just an extension of basic "set theory"

• Not every problem is a set problem: "direct path" or "which thing contains this other thing which has this other thing" (FOAF, friend of a friend)

• Sometimes relationships are as important as the data

• Sometimes data is even simpler than the relational model but needs higher levels of availability, etc.

• One size never really did fit all

Page 13: Which Freaking Database Should I Use?

Data Complexity

Page 14: Which Freaking Database Should I Use?

Datarrhea

• Yes I've already registered that ;-)

• The cheapness of storing data has yielded more demand

o economics predicted this

• Moore's law ended while you slept

o Intel says next year (but when did CPU speeds last double?)

• Massive parallelization is the most feasible way to get at it (counter trended with an explosion in disk speeds)

Page 15: Which Freaking Database Should I Use?

...but

• If

o your data is tabular;

o fits cleanly in a relational model;

o you aren't having scalability issues;

o you don't have a large dataset; or

o a dataset/problem that lends itself to massive parallelization...

• you can probably stick with your RDBMS for now

o ...and probably aren't at this conference anyhow.

Page 16: Which Freaking Database Should I Use?

JPA/RDBMS Tables Example

PersonID  Firstname  Lastname  CompanyID
2         Andy       Oliver    3

CompanyID  Name                       City    State
3          Open Software Integrators  Durham  NC

PhoneNumber   Type    PersonID
919.627.1236  google  2
919.321.0119  work    2
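To make the three-table layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (person, company, phone) and the snake_case column names are assumptions for illustration; the slide only gives the column headers and the data. The point is that reading back one "contact" takes two joins.

```python
import sqlite3

# In-memory database. Table names are hypothetical; the data is from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT);
    CREATE TABLE person  (person_id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
                          company_id INTEGER REFERENCES company(company_id));
    CREATE TABLE phone   (phone_number TEXT, type TEXT,
                          person_id INTEGER REFERENCES person(person_id));
    INSERT INTO company VALUES (3, 'Open Software Integrators', 'Durham', 'NC');
    INSERT INTO person  VALUES (2, 'Andy', 'Oliver', 3);
    INSERT INTO phone   VALUES ('919.627.1236', 'google', 2);
    INSERT INTO phone   VALUES ('919.321.0119', 'work', 2);
""")

# Reassembling one person with company and phones joins all three tables.
rows = conn.execute("""
    SELECT p.firstname, p.lastname, c.name, c.city, c.state, ph.phone_number, ph.type
    FROM person p
    JOIN company c ON c.company_id = p.company_id
    JOIN phone ph  ON ph.person_id = p.person_id
    ORDER BY ph.type
""").fetchall()

for row in rows:
    print(row)
```

The query returns one row per phone number, with the person and company columns repeated on each row.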

Page 17: Which Freaking Database Should I Use?

Operational vs Analytical

• One DB type is unlikely to be well suited for all of your problems.

• The system doing "short and sweet" "lightweight" transactions is your operational system.

• The system doing long running reports and generating charts and graphs and statistics is your analytical system.

• There is also search. There are recommendation engines, etc.

Page 18: Which Freaking Database Should I Use?

Other Types of Databases

Page 19: Which Freaking Database Should I Use?

Key-Value Stores

• Examples: Couchbase 1.8, Cassandra

o also: Gemfire, Infinispan (distributed caches)

• Constant Time O(1) - Lookup by key

• Good enough for "right now" stock quotes

• Usually combined with an index for search, but the structure isn't inherently indexed.

• Generally works well with Map Reduce.

• Extremely scalable, easy to partition
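A rough sketch of why lookup by key is O(1) and why partitioning is easy: hash the key, pick a partition, do one dictionary lookup. The class name, the partition count and the ticker symbols below are made up for illustration, and real stores add replication, persistence and rebalancing on top of this idea.

```python
import hashlib

class PartitionedKV:
    """Toy key-value store: each key hashes to exactly one partition (a dict)."""

    def __init__(self, num_partitions=4):
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Stable hash, so the same key always lands on the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition_for(key)][key] = value

    def get(self, key, default=None):
        # One hash plus one dict lookup: no scan, no join.
        return self.partitions[self._partition_for(key)].get(key, default)

store = PartitionedKV()
store.put("quote:ACME", 12.34)  # hypothetical ticker and price
store.put("quote:INIT", 56.78)
print(store.get("quote:ACME"))
```

Note that nothing here can answer "which quotes are above 50?" without visiting every partition, which is why these stores are usually paired with a separate index for search.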

Page 20: Which Freaking Database Should I Use?

Column Family / Big Table

• Many key-value stores support "column families"

o Cassandra

• Some were designed this way

o HBase

• Keys and values become composite

• Essentially a hashmap with a multi-dimensional array

o each column is a row of data

• Map-reduce friendly

• Example use: stock quotes with time ranges
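One way to picture "a hashmap with a multi-dimensional array" is a dictionary of rows where each row keeps its columns sorted by name. With timestamps as column names, a time-range query becomes a slice. This is only a sketch of the idea, not how Cassandra or HBase are implemented; the class, the ticker and the timestamps are invented for the example.

```python
from bisect import bisect_left, bisect_right

class ColumnFamily:
    """Toy column family: row key -> columns kept sorted by column name."""

    def __init__(self):
        # row_key -> (sorted list of column names, dict of name -> value)
        self.rows = {}

    def put(self, row_key, column, value):
        names, values = self.rows.setdefault(row_key, ([], {}))
        if column not in values:
            names.insert(bisect_left(names, column), column)
        values[column] = value

    def slice(self, row_key, start, end):
        """Return (column, value) pairs with start <= column <= end."""
        names, values = self.rows.get(row_key, ([], {}))
        lo, hi = bisect_left(names, start), bisect_right(names, end)
        return [(name, values[name]) for name in names[lo:hi]]

quotes = ColumnFamily()
# Hypothetical quotes, written out of order on purpose.
quotes.put("ACME", "2014-01-01T10:15", 12.40)
quotes.put("ACME", "2014-01-01T10:00", 12.34)
quotes.put("ACME", "2014-01-01T10:30", 12.10)
print(quotes.slice("ACME", "2014-01-01T10:00", "2014-01-01T10:20"))
```

The row key finds the row in one hash lookup, and the sorted column names make the range scan cheap.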

Page 21: Which Freaking Database Should I Use?

HBase Example

Row key                               First name  Last name  Company                    City    State  Phone number  Phone type
5bfbd4a0-d02a-11e1-9b23-0800200c9a66  Andy        Oliver     Open Software Integrators  Durham  NC     919-627-1236  google
7b2435c0-d02a-11e1-9b23-0800200c9a66  Andy        Oliver     Open Software Integrators  Durham  NC     919-321-0119  work

Page 22: Which Freaking Database Should I Use?

Document databases

• Many developers think these are the "holy grail" since they fit nicely with object-oriented programming.

• Couchbase 2.0, CouchDB, MongoDB

• JSON documents

• One way to think of this is a Key-Value store that understands the values.

• Not as map-reduce friendly; larger datasets require indexes.

• Clear fit for REST services and operational stores

Page 23: Which Freaking Database Should I Use?

Document databases

• JSON document:

{
  "firstname" : "Andy",
  "lastname" : "Oliver",
  "company" : "Open Software Integrators",
  "location" : { "city" : "Durham", "state" : "NC" },
  "phone" : [
    { "number" : "123 456 7890", "type" : "mobile" },
    { "number" : "123 654 1234", "type" : "work" }
  ]
}
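"A key-value store that understands the values" can be sketched in a few lines of Python: the store is still a dictionary keyed by ID, but queries can look inside the documents. The key "person:andy" and the helper function are hypothetical; the document is the one from the slide.

```python
import json

doc = json.loads("""
{
  "firstname": "Andy",
  "lastname": "Oliver",
  "company": "Open Software Integrators",
  "location": {"city": "Durham", "state": "NC"},
  "phone": [
    {"number": "123 456 7890", "type": "mobile"},
    {"number": "123 654 1234", "type": "work"}
  ]
}
""")

# Still a key-value store at heart; the key is an assumption for this sketch.
store = {"person:andy": doc}

def find_by_phone_type(store, phone_type):
    """Query on a field nested inside the values, not just on the key."""
    return [
        (key, phone["number"])
        for key, d in store.items()
        for phone in d.get("phone", [])
        if phone["type"] == phone_type
    ]

print(doc["location"]["city"])
print(find_by_phone_type(store, "work"))
```

The whole person, company and phone list comes back in one lookup with no joins, but the query above scans every document, which is why larger datasets need indexes.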

Page 24: Which Freaking Database Should I Use?

Graph Databases

• Based on Graph Theory

• Less about volume of the data and more about complexity

• Many are transactional

o often the transactions are "more correct" than those offered by a relational database.

• FOAF, direct path operations are easy

o very complicated/inefficient in RDBMS

• Usually paired with an index for search
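To show why FOAF and direct-path questions are natural on a graph, here is a small sketch over an in-memory adjacency map. The people and the "knows" relationships are invented, and a real graph database would run this kind of traversal natively rather than in application code.

```python
from collections import deque

# Hypothetical "knows" relationships, stored as adjacency sets.
knows = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}

def friends_of_friends(graph, person):
    """People two hops away who are not already direct friends."""
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    return result - direct - {person}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(friends_of_friends(knows, "alice"))
print(shortest_path(knows, "alice", "erin"))
```

In an RDBMS each extra hop is typically another self-join on a relationship table, which is where the "very complicated/inefficient" part comes from.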

Page 25: Which Freaking Database Should I Use?

Design: RDBMS vs Graph

Page 26: Which Freaking Database Should I Use?

Neo4j Graph Example

[Graph diagram]

Nodes:

o Firstname: Andrew, Lastname: Oliver

o Company: Open Software Integrators

o City: Durham, State: NC

o City: Chicago, State: IL

o Phone Number: 919.627.1236, Type: googlevoice

o Phone Number: 919.321.0119, Type: work

Relationships: HAS, WORKS FOR, FOUNDED, LOCATED, RESIDES

Note the extra relationships and details here - graph databases are just fun and easy to understand.

Page 27: Which Freaking Database Should I Use?

Where does Hadoop fit?

• NoSQL

• Software Framework (lots of pieces/lots of choices):

o Pig - scripting language used to quickly write MapReduce code to handle unstructured sources

o Hive - facilitates structure for the data

o HCatalog - provides inter-operability between these internal systems

o HBase - Bigtable-type database

o HDFS - Hadoop file system

• Excellent choice for data processing and data analysis

• MapReduce
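For anyone who hasn't seen MapReduce, the shape of it fits in a few lines of plain Python: map each record to key/value pairs, group the pairs by key, then reduce each group. This is a single-process sketch of the pattern only. It does not use Hadoop's API, and the function names and input lines are made up.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single result."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

lines = ["graph document key value", "key value column family", "document"]
counts = map_reduce(lines)
print(counts)
```

Because each map call sees one record and each reduce call sees one key, both phases can be spread across many machines, which is the part Hadoop handles for you.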

Page 28: Which Freaking Database Should I Use?

Convergence of Filesystems and Databases

• Hadoop HDFS is... a distributed filesystem

• So are Gluster, Ceph, GFS, etc.

• Hadoop can use Ceph or Gluster in place of HDFS

Page 29: Which Freaking Database Should I Use?

Other Derivatives

• Triplestores

o Apache Jena

• OODBMS / ORDBMS

o Caché

Page 30: Which Freaking Database Should I Use?

Things you may consider

• Persistence

o Asynch / Synch

• Replication

• Availability

• Transactions / Consistency

• "Locality"

• Language

• Resources

o http://en.wikipedia.org/wiki/Comparison_of_structured_storage_software

o http://sevenweeks.org/

Page 31: Which Freaking Database Should I Use?

Conclusions

• RDBMS may not scale to your needs

• Your data may not map efficiently to tables

• Key Value Store - data by key, fast, scalable, can't handle complex data

• Column Family/Big Table - fast, scalable, denormalized, map reduce, good for series, not efficient for complex data

• Document - a good operational system, not your analytical, moderately scalable, matches OO

• Graph - great for complex data, transactional, less scalable

• Filesystems and "databases" are converging

Page 32: Which Freaking Database Should I Use?

Thank you for attending!