This presentation was given at Silicon Valley Code Camp in October 2014 by Jim Driscoll, Sr. Sales Engineer at MarkLogic. Jim has been programming in Java since 1996, and has worked on such projects as the first version of servlets, and the first release of J2EE. About this presentation: Before they were called cars, they were "horseless carriages". That's what NoSQL is: something better than the past but still defined in terms of the past. For decades we had one way of storing data, the relational model. Now databases come with new data models, new index schemes, and new scaling architectures. They're also much more cloud-focused than previous systems. It's an exciting time again to be a database administrator, or a programmer, and this presentation gets you oriented. The presentation covers how we got here, where we're at, and where we're going. The discussion includes data models (key-value, document, graph), licensing (open source, commercial), and transaction models (ACID vs BASE), including examples of working code.
1. COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
An Introduction to NoSQL Jim Driscoll, MarkLogic
2. Agenda
   - History of NoSQL
   - NoSQL Terminology
   - Taxonomy
3. HISTORY OF NOSQL
4. A Short History of Data
   - Application Specific Databases: size is paramount
   - Relational Databases: size matters, but break from the application silo and provide data integrity
   - NoSQL Databases: agility, scalability, speed
5. RELATIONAL DOESN'T MEAN WHAT YOU THINK IT DOES
6. What's wrong with Relational?
   Nothing, it's perfect for square data
   - ...where you know the relationships in advance
   - ...where the schema doesn't change often
   - ...where all the data can fit on one machine
   - ...where a separate disk seek for every join isn't an issue
7. SEEK AND YOU WILL FIND IN ABOUT 10MS
8. The Rise of NoSQL
   - 1998: Carlo Strozzi coins term
   - 2001: MarkLogic founded
   - 2003: LiveJournal's Memcached
   - 2003: Google FileSystem paper
   - 2004: Google MapReduce paper
   - 2006: Google BigTable paper
   - 2006: Hadoop released to Apache
   - 2007: Amazon Dynamo paper
   - 2009: Eric Evans popularizes term
   - 2010: MongoDB (first production release)
9. MEMCACHED
10. Memcached
   - Developed at LiveJournal as a frontend cache for websites
   - First released in 2003
   - Keep disk access at a minimum, pool memory on many machines
   - So useful it found wide popularity, still under active development
11. Memcached: "High Performance, Distributed Memory Object Caching System"
   - Distributed: runs across many computers
   - Memory: runs without touching disk
   - Object cache: designed to hold small lumps of data
   - High performance: because it never touches disk and the objects are small, it's optimized for speed
12. Memcached
   - Client server system
   - Servers are unaware of each other
   - Clients determine server to use via hashing
   - Servers keep content as an LRU cache
     - So all data is transitory
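A minimal sketch of the two ideas on this slide: the client picks a server by hashing the key, and each server is an independent LRU cache. The class names `LRUCache` and `CacheClient` are hypothetical, and simple modulo hashing is an assumption made for brevity; this is not the memcached protocol or any real client library.

```python
import hashlib
from collections import OrderedDict


class LRUCache:
    """Stand-in for one cache server: fixed capacity, evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used entry


class CacheClient:
    """The client alone decides which server owns a key; servers never talk to each other."""

    def __init__(self, servers):
        self.servers = servers

    def _pick(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.servers)
        return self.servers[index]

    def get(self, key):
        return self._pick(key).get(key)

    def set(self, key, value):
        self._pick(key).set(key, value)
```

Because eviction can happen at any time, callers would treat a `None` result as a normal cache miss and fall back to the real data source.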
13. SHARDING
14. Sharding to Scale Out
15. BIGTABLE
16. Bigtable
   - Created by Google in 2004 to store massive amounts of data
   - Made public in famous 2006 paper
   - Used throughout Google: Gmail, Google Maps, YouTube, Web Indexing, etc.
     - Reportedly over 100 internal projects
   - Never shipped externally as a product, but available for public use as part of the App Engine hosting API
17. Bigtable
   - Rows are composed of columns, which in turn belong to column families
     - Column families are essentially typing, validation and expiration info
   - Every cell is versioned via timestamp
   - System is robust and crash resistant
     - Can survive the crash of any machine, including the master
   - Scale out architecture
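The data model described above (rows, columns grouped into column families, timestamped cell versions) can be sketched as a toy in-memory structure. `VersionedTable` and the "family:qualifier" column naming are illustrative assumptions, not Google's API.

```python
from collections import defaultdict


class VersionedTable:
    """Toy table: row key -> "family:qualifier" column -> list of (timestamp, value)."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cells = self.rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest version first

    def get(self, row, column, at=None):
        """Return the newest value, or the newest value written at or before `at`."""
        for timestamp, value in self.rows[row][column]:
            if at is None or timestamp <= at:
                return value
        return None
```

Keeping every version lets a reader ask for the state of a cell as of an earlier timestamp, which is what "versioned via timestamp" buys you.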
18. MAP / REDUCE
19. Map / Reduce
   - Massively Distributed Processes
   - Map: sort, filter, transform data
   - Reduce: summarize data
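As a small single-process illustration of the map and reduce steps, here is the classic word count. The function names are hypothetical, and a real framework would run the map and reduce calls in parallel across many machines; only the shape of the computation is shown.

```python
from collections import defaultdict


def map_phase(document):
    """Map: transform one input record into (key, value) pairs."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values collected for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Each `map_phase` call depends only on its own document and each `reduce_phase` call only on its own key, which is what makes the work easy to distribute.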
20. HADOOP
21. Hadoop
   - First envisioned as Nutch at the Internet Archive in 2002
     - There were 100s of millions of webpages to index
   - Early versions heavily influenced by Google File System, MapReduce papers
   - Goal: perform work on large datasets using commodity machines
   - Development moved to Yahoo in 2006
   - Open sourced to Apache, as Hadoop
     - A file system (HDFS)
     - A task runner (MapReduce)
     - A task manager (YARN)
   - Note: not a database
22. Hadoop
   - Really good at batch processing on incredibly large data sets
   - Not so good parts: latency, updates, usability
23. DYNAMO
24. Amazon Dynamo
   - Created to power Amazon's Web store
   - Writing with low latency more important than consistency
   - Techniques first made public in 2007 paper
   - Never externally shipped, but huge influence on market
   - Used for a variety of critical portions of Amazon's site: shopping cart, user session
   - Succeeded by DynamoDB
     - Similar name, but whole new architecture (with better consistency)
25. Amazon Dynamo
   - Distributed Key Value store
   - "Always writable"
   - Low latency reads and writes, at the expense of consistency
   - Asynchronous replication on put() operations means that get() may return a stale value
   - Updates during a network partition can result in conflicts, and the application must handle them
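The stale-read behavior can be seen in a toy model where put() acknowledges after writing a single replica and the other replicas catch up later. `AsyncReplicatedStore` and its methods are hypothetical names for this sketch; it deliberately leaves out conflict detection, which the Dynamo paper handles with versioning.

```python
class AsyncReplicatedStore:
    """put() returns after one replica is written; the rest are updated by replicate()."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # queued (replica_index, key, value) updates

    def put(self, key, value, coordinator=0):
        self.replicas[coordinator][key] = value  # acknowledged immediately
        for index in range(len(self.replicas)):
            if index != coordinator:
                self.pending.append((index, key, value))

    def get(self, key, replica=0):
        return self.replicas[replica].get(key)

    def replicate(self):
        """Apply queued updates, standing in for background replication."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()
```

A read served by a replica that has not yet received the update returns the old value (here `None`), even though the write was already acknowledged to the client.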
26. TERMINOLOGY
27. What is(n't) NoSQL
   - No SQL
   - Schema-less
   - Open Source
   - BASE (Eventually Consistent)
28. ACID
   - Atomicity: everything in a transaction either succeeds or fails together
   - Consistency: nothing is saved unless it passes consistency rules
   - Isolation: no two processes can interfere with each other
   - Durability: once saved, data cannot be lost due to system failure
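Atomicity and consistency rules are easy to see with Python's built-in `sqlite3` module. In this sketch the `accounts` table, its CHECK constraint, and the `transfer` helper are made up for illustration: a transfer that would overdraw an account violates the constraint, and the whole transaction, including the update that had already succeeded, is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # the consistency rule
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

After a failed `transfer(conn, "alice", "bob", 150)`, the credit to bob is undone along with the rejected debit, so both balances are unchanged.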
29. BASE
   - Basically Available
   - Soft state
   - Eventually consistent
30. What happens without consistency?
   - Absolute fastest performance at lowest hardware cost
   - Highest global data availability at lowest hardware cost
   - Working with one document or row at a time
   - Writing advanced code to create your own consistency model
   - Eventually consistent data
   - Some inconsistent data that can't be reconciled
   - Some missing data that can't be recovered
   - Some inconsistent query results
31. What is NoSQL?
   - Database
   - Non-relational
   - Schema on read
   - Scale out architecture
   - Cluster friendly / Cloud Ready
32. SCALE UP VS SCALE OUT
33. CAP Theorem
   - Consistent: always get correct answers, when you get answers
   - Available: always get answers
   - Partition Tolerant: system can be divided in two
   - Pick Two
34. CA Database
35. CP Database
36. AP Database
37. SOFTWARE TAXONOMY
38. Taxonomy
   - Key Value stores
   - Document Databases
   - Column Databases
   - Graph Databases
39. Key Value Taxonomy
   - KV Cache
   - KV Store
   - Eventually Consistent Store
   - Ordered Store
   - Data Structure Server
40. KEY VALUE STORES
41. MemcacheDB
   - Very early KV implementation (2008)
   - KV Store based on Memcached source, with Berkeley DB persistent store
   - Speaks the memcached protocol
   - Development stopped (2009), but still quite popular
   - For when you like Memcached, but want persistence
42. REDIS
43. Redis
   - First released in 2009
   - Sponsored by VMware, then Pivotal
   - Name means Remote Dictionary Server
   - Fully in memory key value store
     - Whole db must reside in memory of one machine
     - Limits scalability, to the benefit of performance
   - Often used as a front end cache for other NoSQL databases
44. Redis
   Not just strings as values:
   - Lists of strings
   - Sets of strings (collections of non-repeating unsorted elements)
   - Sorted sets of strings (collections of non-repeating elements ordered by a floating-point number called score)
   - Hashes where keys and values are strings
45. Redis
   - Master / slave replication
     - slave may be master to another slave, allowing tree replication
     - also publish/subscribe API
     - slaves may be updated separately from master, allows inconsistencies (!)
   - Persistent store
     - Append only journal
     - Flushed every 2 seconds by default
46. RIAK
47. Riak
   - Basho was developing sales force automation software, came up with Riak, decided to make that their business instead
   - First release 2011
   - Key Value store with CAP tunability, predictable latency
     - but even Strong Consistency mode is not ACID
   - Write conflicts handled at read time
   - Handles JSON natively (via Search)
   - Provides Map/Reduce engine in JavaScript
   - Pluggable backend data store, can be an in memory implementation
   - Search via Solr (though not realtime)
   - Rental cloud service (Riak CS)
48. DOCUMENT STORES
49. Document vs Key Value Stores
   - Extension of Key Value: the value is a document
   - but also
     - Structurally aware
     - Indexed searches
   - Document formats
     - CouchDB: JSON
     - MongoDB: BSON
     - MarkLogic: JSON, XML
50. MARKLOGIC
51. MarkLogic: The Only Enterprise NoSQL Database
   - Founded in 2001 by search engine experts
   - Multi-model database with built-in search and application services
   - Flexible data model: XML, JSON, RDF, Geospatial, text and binaries
   - Scalable and elastic: clusters on commodity hardware at massive scale
   - Trusted for mission-critical apps: ACID, HA/DR, Security
52. MarkLogic Features
   - Built-in Search
   - Scalability and Elasticity
   - ACID Transactions
   - Government-grade Security
   - HA/DR
   - Cloud Deployment
   - Hadoop-ready
53. MarkLogic Architecture
   - Universal Index
     - Index words, elements, the relationships of words and elements
   - Many indexes (automatically) used at once
   - In memory range indexes work like column indexes
   - A native triple store that supports SPARQL
   - Search on ranges, free text, field values, more
   - Shared-nothing architecture, automatic partitioning and balancing
     - Can also use a range index for tiered partitioning
   - MarkLogic Connector for Hadoop
54. XML vs JSON
   - XML is more expressive, granular
     - Namespaces
     - Intermixed elements and text
     - Standard query languages (XPath, XQuery)
   - JSON is more compact, simpler (for better and worse)
     - Programmatically accessible via standard JavaScript libraries
55. MONGODB
56. MongoDB
   - Development began in 2007 by 10gen
   - Name from "humongous"
   - Originally wanted to create a Google App Engine system
   - 1.4 considered first production ready release, 2010
   - Horizontally scaling
57. MongoDB
   - Stores data in proprietary format BSON, similar to JSON with more data types
   - Search on field, on range, or on regex
   - Single index per query (secondary index optional)
   - Replication of databases as master/slave, with (tunable) eventual consistency
   - Sharding handled via a shard key, splitting by range
     - Be sure the key is evenly distributed
   - Client APIs in many (40+!) languages
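To see why an evenly distributed shard key matters under range splitting, here is a small routing sketch. `RangeShardRouter`, `distribution`, and the split points are hypothetical and stand in for the general technique; this is not MongoDB's implementation or API.

```python
import bisect


class RangeShardRouter:
    """Route a shard-key value to a shard by range.

    With split points [100, 200]: keys below 100 go to shard 0,
    keys from 100 up to (not including) 200 go to shard 1, the rest to shard 2.
    """

    def __init__(self, split_points):
        self.split_points = sorted(split_points)

    def shard_for(self, key):
        return bisect.bisect_right(self.split_points, key)


def distribution(router, keys):
    """Count how many keys land on each shard."""
    counts = [0] * (len(router.split_points) + 1)
    for key in keys:
        counts[router.shard_for(key)] += 1
    return counts
```

Keys spread across the whole range fill the shards evenly, while a monotonically increasing key (a timestamp or counter, for example) sends every new write to the last shard.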
58. WIDE COLUMN STORES
59. Column Stores
   - Descended from Bigtable approach
   - Excellent for sparse data
   - Column families need to be specified up front
     - But still stored sparsely
     - No way to list all the columns in the database
   - Append only
     - Updates via timestamp
     - Deletes via tombstone marker
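The append-only behavior can be sketched with a log that is never modified in place: an update is a newer entry, and a delete is a special tombstone entry. `AppendOnlyColumnStore` is a hypothetical name, and the linear scan in `get` is a simplification; real stores index and compact their logs.

```python
class AppendOnlyColumnStore:
    TOMBSTONE = object()  # marker meaning "deleted as of this timestamp"

    def __init__(self):
        self.log = []  # append-only list of (timestamp, row, column, value)

    def put(self, row, column, value, timestamp):
        self.log.append((timestamp, row, column, value))

    def delete(self, row, column, timestamp):
        self.log.append((timestamp, row, column, self.TOMBSTONE))

    def get(self, row, column):
        """Newest timestamp wins; a tombstone hides all older values."""
        newest = None
        for timestamp, r, c, value in self.log:
            if (r, c) == (row, column) and (newest is None or timestamp >= newest[0]):
                newest = (timestamp, value)
        if newest is None or newest[1] is self.TOMBSTONE:
            return None
        return newest[1]
```

Nothing is ever removed by `delete`; the log only grows, and a late-arriving write with an older timestamp cannot resurrect a value that a newer tombstone has hidden.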
60. CASSANDRA
61. Cassandra
   - Developed at Facebook, 2008, donated to Apache
   - Descended from Bigtable and Dynamo
     - One of the primary Dynamo developers helped create Cassandra
   - Focused on maximum throughput
     - Write lots of data, fast
     - But at the expense of consistency (tunable)
   - Used by Twitter, Reddit, Netflix, but not Facebook
62. Cassandra
   - Shared-nothing architecture
   - Partitioned via hash (multiple strategies)
     - Be careful choosing your Row Key!
   - Async masterless replication
   - CAP Tunable, from "writes never fail" to "wait until persisted on all slaves"
   - Query with range queries, column family, CQL
   - Hadoop support (replaces HDFS)
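One common way to reason about this kind of tunability is the quorum overlap rule: with N replicas, W write acknowledgements, and R replicas consulted per read, every read is guaranteed to touch at least one replica holding the latest write when R + W > N. The brute-force check below is only an illustration of that rule; the function names are hypothetical and nothing here is Cassandra's API.

```python
from itertools import combinations


def overlap_guaranteed(n, w, r):
    """True if every possible write set of size w intersects every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def quorum_rule(n, w, r):
    """The shortcut form of the same condition."""
    return r + w > n
```

With three replicas, writing to one and reading from one is fast but can miss the latest write, while writing to two and reading from two always overlaps.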
63. GRAPH DATABASES
64. Nodes and Edges
65. NEO4J
66. Neo4j
   - Released in 2010
   - Written in Java, APIs are Java centric
   - Most popular Graph Database
   - Powers the recommendation engines of Glassdoor, Walmart
67. Neo4j
   - Whole graph in memory, scales to millions of relationships
     - But does persist to disk
   - Transactional
   - Replicated for performance and robustness, master/slave
   - Proprietary graph query language (Cypher)
   - Enterprise version adds clustering, sharding
68. SEMANTIC WEB
69. Semantics: A New Way to Organize Data
   - Data is stored in Triples, expressed as Subject : Predicate : Object
     - John Smith : livesIn : London
     - London : isIn : England
   - Query with SPARQL, gives us simple lookup ... and more!
     - Find people who live in (a place that's in) England
   (Diagram: "John Smith" livesIn "London"; "London" isIn "England"; a second livesIn edge is also drawn)
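The triple model and the "people who live in a place that's in England" query can be sketched in plain Python. The first two triples come from the slide; the Jane Doe, Paris, and France triples and the helper names `match` and `people_living_in` are made up for illustration. A real triple store would answer this with a SPARQL query rather than Python loops.

```python
TRIPLES = {
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("Jane Doe", "livesIn", "Paris"),   # hypothetical extra data
    ("Paris", "isIn", "France"),        # hypothetical extra data
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def people_living_in(triples, region):
    """People who live in the region itself, or in a place that isIn the region."""
    places = {s for s, _, _ in match(triples, predicate="isIn", obj=region)}
    places.add(region)
    return sorted(s for s, _, o in match(triples, predicate="livesIn") if o in places)
```

The query joins two patterns on a shared place, which is the same shape a SPARQL query with two triple patterns would have.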
70. Context from the World at Large
   (Image: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/)
   - Linked Open Data
     - Facts that are freely available
     - In a form that's easily consumed
   - DBpedia (Wikipedia as structured information)
     - Einstein was born in Germany
     - Ireland's currency is the Euro
   - GeoNames
     - Doha is the capital of Qatar
     - Doha has these lat/long coordinates
71. IN CONCLUSION
72. The Future
   - Feature Convergence
   - Growth of Public / Private Cloud use
   - New uses, increased adoption
73. Don't Design Your System Like It's 1979
74. Further http://www.nosqlfordummies.com
75. ANY QUESTIONS?