Upload
dataversity
View
2.155
Download
0
Tags:
Embed Size (px)
Citation preview
M
D
The CIO's Guide to NoSQL
Dan McCrearyJuly 2011Version 5
M
DCopyright Kelly-McCreary & Associates, LLC
2
Agenda
• Historical Context• The Business Case for NoSQL• Terminology• How NoSQL is Different• Key NoSQL Products• Call to Action: The NoSQL Pilot Project• The Future of NoSQL
M
DCopyright Kelly-McCreary & Associates, LLC
3
Background for Dan McCreary
• Bell Labs• NeXT Computer (Steve Jobs)• Owner of Custom Object-
Oriented Software Consultancy• Federal data integration
(National Information Exchange Model)
• Native XML/XQuery – 2006• Advocate of NoSQL/XRX
systems
M
DCopyright Kelly-McCreary & Associates, LLC
4
NoSQL Training Areas
The CIO'sGuide toNoSQL
Project Manager'sGuide to NoSQL
Transitioningto NoSQL
ArchitecturalTradeoff Modeling
XQueryMapReduce HadoopFunctionalProgramming
Managers
Architects/Project
Managers
Developer
Track CourseYou Are
Here
M
DCopyright Kelly-McCreary & Associates, LLC
5
Sample of NoSQL Jargon
Document orientation
Schema free
MapReduce
Horizontal scaling
Sharding and auto-sharding
Brewer's CAP Theorem
Consistency
Reliability
Partition tolerance
Single-point-of-failure
Object-Relational mapping
Key-value stores
Column stores
Document-stores
Memcached
Indexing
B-Tree
Configurable durability
Documents for archives
Functional programming
Document Transformation
Document Indexing and Search
Alternate Query Languages
Aggregates
OLAP
XQuery
MDX
RDF
SPARQL
Architecture Tradeoff Modeling
ATAM
Note that within the context of NoSQL many of these terms have different meanings!
M
DCopyright Kelly-McCreary & Associates, LLC
6
Selecting a Database…
"Selecting the right data storage solution is no longer a trivial task."
Does it look like
document?
Use MicrosoftOffice
Use theRDBMS
Start
Stop
No
Yes
M
DCopyright Kelly-McCreary & Associates, LLC
7
Pressures on SQL Only Systems
SQLOLAP/BI/DataWarehouse
Social Networks
Scalability
AgileSchema
FreeDocument-D
ata
Large Data Sets Reliability
Linked Data
M
DCopyright Kelly-McCreary & Associates, LLC
8
Simplicity is a Virtue
• Many systems derive their strength by dramatically limiting the features in their system
• Simplicity allows database designers to focus on the primary business driver
• Examples:– Touch screen interfaces– Key/Value data stores
M
DCopyright Kelly-McCreary & Associates, LLC
9
Historical Context
Mainframe Era
• 1 CPU• COBOL and FORTRAN• Punchcards and flat files• $10,000 per CPU hour
Commodity Processors
• 10,000 CPUs• Functional programming• MapReduce "farms"• Pennies per CPU hour
M
D
Two Approaches to Computation
10Copyright 2010 Dan McCreary & Associates
Alonzo ChurchJohn Von Neumann
Manage state with a program counter. Make computations act like math functions.
Which is simpler? Which is cheaper? Which will scale to 10,000 CPUs?
1930s and 40s
M
DCopyright Kelly-McCreary & Associates, LLC
11
Standard vs. MapReduce Prices
http://aws.amazon.com/elasticmapreduce/#pricing
John's Way Alonzo's Way
M
DCopyright Kelly-McCreary & Associates, LLC
12
MapReduce CPUs Cost Less!
0
5
10
15
20
25
30
35
40Cost Per CPU Hour (Cents)
http://aws.amazon.com/elasticmapreduce/#pricing
Cuts cost from 32 to 6 cents per CPU hour!Perhaps Alanzo was right!
82% CostReduction!
Why? (hint: how "shareable" is this process)
M
DKelly-McCreary & Associates, LLC
13
Perspectives
Native XML
OLAPMDX
ObjectStores
GraphStores
NoSQL for Web 2.0
and BigData
Perspective depends on your context
M
DKelly-McCreary & Associates, LLC
14
Architectural Tradeoffs
"I want a fast car with good mileage."
"I want a scaleable database with low cost that runs well on the 1,000 CPUs in our data center."
M
DKelly-McCreary & Associates, LLC
15
Recent History
• The term NoSQL became re-popularized around 2009
• Used for conferences of advocates of non-relational databases
• Became a contagious idea "meme"• First of many "NoSQL meetups" in San
Francisco organized by Jon Oskarsson• Conversion from "No SQL" to "Not Only
SQL" in recent year
M
DKelly-McCreary & Associates, LLC
16
NoSQL on Google Trends
M
DKelly-McCreary & Associates, LLC
17
NoSQL and Web 2.0 Startups
• Many web 2.0 startups did not use Oracle or MySQL
• They built their own data stores influenced by Amazon’s Dynamo and Google’s BigTable in order to store and process huge amounts of data
• In the social community or cloud computing applications, most of these data stores became OpenSource software
M
DCopyright Kelly-McCreary & Associates, LLC
18
Google MapReduce
• 2004 paper that had huge impact of functional programming in the entire community
• Copied by many organizations, including Yahoo
M
DCopyright Kelly-McCreary & Associates, LLC
19
Google Bigtable Paper
• 2006 paper that gave focus to scaleable databases
• designed to reliably scale to petabytes of
data and thousands of machines
M
DCopyright Kelly-McCreary & Associates, LLC
20
Amazon's Dynamo Paper
• Werner Vogels• CTO - Amazon.com• October 2, 2007• Used to power Amazon's
S3 service• One of the most
influential papers in the NoSQL movement
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.
M
DKelly-McCreary & Associates, LLC
21
NoSQL "Meetups"
“NoSQLers came to share how they had overthrown the tyranny of slow, expensive relational databases in favor of more efficient and cheaper ways of managing data.”
Computerworld magazine, July 1st, 2009
M
DKelly-McCreary & Associates, LLC
22
Key Motivators
• Licensing RDBMS on multiple CPUs• The Thee "V"s
– Velocity – lots of data arriving fast– Volume – web-scale BigData– Variability – many exceptions
• Desire to escape rigid schema design• Avoidance of complex Object-Relational
Mapping (the "Vietnam" of computer science)
M
DCopyright 2008 Dan McCreary & Associates
23
Many Processes Today Are Driven By…
The constraints of yesterday…
Challenge:Ask ourselves the question…Do our current method of solving problems with tabular data…Reflect the storage of the 1950s…Or our actual business requirements?What structures best solve the actual business problem?
M
DCopyright 2008 Dan McCreary & Associates
24
No-Shredding!
• Relational databases take a single hierarchical document and shred it into many pieces so it will fit in tabular structures
• Document stores prevent this shredding
MyData
M
DCopyright 2008 Dan McCreary & Associates
25
Is Shredding Really Necessary?
• Every time you take hierarchical data and put it into a traditional database you have to put repeating groups in separate tables and use SQL “joins” to reassemble the data
M
DKelly-McCreary & Associates, LLC
26
Object Relational Mapping
• T1 – HTML into Objects• T2 –Objects into SQL Tables• T3 – Tables into Objects• T4 – Objects into HTML
T1
T3
T2
T4
Object MiddleTier
RelationalDatabaseWeb Browser
M
DCopyright Kelly-McCreary & Associates, LLC
27
"The Vietnam of Applications"
• Object-relational mapping has become one of the most complex components of building applications today
• A "Quagmire" where many projects get lost• Many "heroic efforts" have been made to
solve the problem:– Hibernate– Ruby on Rails
• But sometimes the way to avoid complexity is to keep your architecture very simple
M
D
Document Stores Need No Translation
• Documents in the database• Documents in the application• No object middle tier• No "shredding"• No reassembly• Simple!
28Copyright 2010 Dan McCreary & Associates
Application Layer Database
Document Document
M
D
Zero Translation (XML)
• XML lives in the web browser (XForms)• REST interfaces• XML in the database (Native XML, XQuery)• XRX Web Application Architecture• No translation!
29Copyright 2010 Dan McCreary & Associates
Web Browser XML database
XForms REST-Interfaces
M
D
"Schema Free"
• Systems that automatically determine how to index data as the data is loaded into the database
• No a priori knowledge of data structure• No need for up-front logical data modeling
– …but some modeling is still critical
• Adding new data elements or changing data elements is not disruptive
• Searching millions of records still has sub-second response time
30Copyright 2010 Dan McCreary & Associates
M
D
Monoculture and Mono-architecture
31
Copyright 2010 Dan McCreary & Associates
Image Source: Wikipedia
M
DKelly-McCreary & Associates, LLC
32
Eric Evans
“The whole point of seeking alternatives [to RDBMS systems] is that you need to solve a problem that relational databases are a bad fit for.”
Eric EvansRackspace
M
DCopyright Kelly-McCreary & Associates, LLC
33
Evolution of Ideas in OpenSource
• How quickly can new ideas be recombined into new database products?• OpenSource software has proved to be the most efficient way to quickly
recombine new ideas into new products
Product A
Product B
Product B
OpenSource
Proprietary SoftwareNew Database Ideas
Schema-free
MapReduceAuto-sharding
New Products
Cloud Computing
M
D 34Copyright 2010 Dan McCreary & Associates
Storage Architectural Patterns
Tables Trees
TriplesStars
M
D
Finding the Right Match
35
Copyright 2010 Dan McCreary & Associates
Schema-Free
Mature Query Language
Standards Compliant
Use CMU's Architectural Tradeoff and Modeling (ATAM) Process
M
DKelly-McCreary & Associates, LLC
36
Brewer's CAP Theorem
Consistency
Availability Partition Tolerance
You can not have all three so pick two!
M
DKelly-McCreary & Associates, LLC
37
Avoidance of Unneeded Complexity
• Relational databases provide a variety of features to ALWAYS support strict data consistency
• Rich feature set and the ACID properties implemented by RDBMSs might be more than necessary for particular applications and use cases
M
DKelly-McCreary & Associates, LLC
38
High Throughput
• Some NoSQL databases provide a significantly higher data throughput than traditional RDBMS
• Hypertable which pursues Google’s Bigtable approach allows the local search engine Zvent to store one billion data cells per day
• Google is able to process 20 petabytes a day stored in BigTable via it’s MapReduce approach
M
DKelly-McCreary & Associates, LLC
39
Complexity and Cost of Settingup Database Clusters
NoSQL databases are designedin a way that “PC clusters can be easily and cheaply expanded without the complexity and cost of ’sharding,’ which involves cutting up databases into multiple tables to run on large clusters or grids”.
Nati Shalom, CTO and founder of GigaSpaces
M
DKelly-McCreary & Associates, LLC
40
Compromising Reliability for Better Performance
• Shalom argues that there are “different scenarios where applications would be willing to compromise reliability for better performance.”
• Performance over reliability• Example: HTTP session data example
– “needs to be shared between various web servers but since the data is transient in nature (it goes away when the user logs off) there is no need to store it in persistent storage.”
M
DKelly-McCreary & Associates, LLC
41
"Once Size Fits…"
"One Size Does Not Fit All"James Hamilton Nov. 3rd, 2009
http://perspectives.mvdirona.com/CommentView,guid,afe46691-a293-4f9a-8900-5688a597726a.aspx
M
DKelly-McCreary & Associates, LLC
42
Different Thinking
Sequential Processing
• The output of any step can be used in the next step
• State must be carefully managed
Parallel Processing
• Each loop of XQuery FLOWR statements are independent thread (no side-effects)
M
DKelly-McCreary & Associates, LLC
43
Cloud Computing
• High scalability– Especially in the horizontal direction (multi
CPUs)
• Low administration overhead– Simple web page administration
M
DKelly-McCreary & Associates, LLC
44
Databases work well in the cloud
• Data warehousing specific databases for batch data processing and map/reduce operations
• Simple, scalable and fast key/value-stores• Databases containing a richer feature set
than key/value-stores fitting the gap with traditional
• RDBMS while offering good performance and scalability properties (such as document databases).
M
DCopyright Kelly-McCreary & Associates, LLC
45
Auto-Sharding
• When one database gets almost full it tells a "coordinator" system and the data automatically gets migrated to other systems
90% full
45% full
45% full
Before
After
M
DCopyright Kelly-McCreary & Associates, LLC
46
Scale Up vs. Scale Out
Scale Up• Make a single CPU as fast as
possible• Increase clock speed• Add RAM• Make disk I/O go faster
Scale Out• Make Many CPUs work
together• Learn how to divide your
problems into independent threads
M
DCopyright Kelly-McCreary & Associates, LLC
47
Functional Programming
• What does it mean to your IT staff?• What experience do they have in
functional programming?• Can they "unlearn" the habits of the
procedural world?
M
D
The NO-SQL Universe
48
Copyright 2010 Dan McCreary & Associates
Document StoresKey-Value Stores
Graph Stores
XML
Object Stores Column Stores
M
DCopyright Kelly-McCreary & Associates, LLC
49
Key Value Stores
• A table with two columns and a simple interface– Add a key-value– For this key, give me the
value– Delete a key
• Blazingly fast and easy to scale
Key Value
M
DCopyright Kelly-McCreary & Associates, LLC
50
Types of Key-Value Stores
• Eventually‐consistent Key‐Value store• Hierarchical Key-Value Stores• Key-Value Stores In RAM• Key Value Stores on Disk• Ordered Key-Value Stores
M
DCopyright Kelly-McCreary & Associates, LLC
51
Cassendra
• Apache open source project• Originally developed by Facebook• Designed for highly distributed high-
reliable systems• No single point of failure• Column-family data model
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
M
DCopyright Kelly-McCreary & Associates, LLC
52
Voldomort
• A distributed key-value system• Used at LinkedIn• 10K-20K node operations/CPU• Auto-sharding• Graceful server failure handling
M
DCopyright Kelly-McCreary & Associates, LLC
53
MongoDB
• Open Source License• Document/Collection centric• Sharding built-in, automatic• Stores data in JSON format• Query language is JSON• Can be 10x faster than MySQL• Many languages (C++, JavaScript, Java,
Perl, Python etc.)
M
DCopyright Kelly-McCreary & Associates, LLC
54
Hadoop/Hbase
• Open source implementation of MapReduce algorithm written in Java
• Initially created by Yahoo– 300 person-years development
• Column-oriented data store• Java interface• Hbase designed specifically to work with
Hadoop
M
DCopyright Kelly-McCreary & Associates, LLC
55
CouchDB
• Apache Document Store• Written in ERLANG• RESTful JSON API• Distributed, featuring robust, incremental
replication with bi-directional conflict detection and management
M
DCopyright Kelly-McCreary & Associates, LLC
56
Memcached
• Free & open source in-memory caching system• Designed to speeding up dynamic web applications by alleviating
database load• RAM resident key-value store for small chunks of arbitrary data
(strings, objects) from results of database calls, API calls, or page rendering
• Simple interface• Designed for quick deployment, ease of development• APIs in many languages
M
DCopyright Kelly-McCreary & Associates, LLC
57
MarkLogic
• Native XML database designed to used by Petabyte data stores
• ACID compliant• Heavy use by federal agencies, document
publishers and "high-variability" data• Arguably the most successful NoSQL
company
M
DCopyright Kelly-McCreary & Associates, LLC
58
eXist
• OpenSource native XML database• Strong support for XQuery and XQuery
extensions• Heavily used by the Text Encoding
Initiative (TEI) community and XRX/XForms communities
• Ideal for metadata management• Integrated Lucene search and structured
search
M
DCopyright Kelly-McCreary & Associates, LLC
59
Riak
• Community and Commercial licenses• A "Dynamo-inspired" database• Written in ERLANG• Query JSON or ERLANG
M
DCopyright Kelly-McCreary & Associates, LLC
60
Hypertable
• Open Source• Closely modeled after Google's Bigtable
project• High performance distributed data storage
system• Designed to support applications requiring
maximum performance, scalability, and reliability
• Hypertable Query Language (HQL) that is syntactically similar to SQL
M
D
Selecting a NoSQL Pilot Project
• The "Goldilocks Pilot Project Strategy"
• Not to big, not to small, just the right size
• Duration• Sponsorship• Importance• Skills• Mentorship
61Copyright 2010 Dan McCreary & Associates
M
DCopyright Kelly-McCreary & Associates, LLC
62
The Future of the NoSQL Movement
• Will data sets continue to grow at exponential rates?• Will new system options become more diverse?• Will new markets have different demands?• Will some ideas be "absorbed" into existing RDBMS vendors products?• Will the NoSQL community continue to be the place where new
database ideas and products are incubated?• Will the job of doing high-quality architectural tradeoffs analysis become
easier?
Growth Diversity
M
D
Start Finish
Using the Wrong Architecture
Credit: Isaac Homelund – MN Office of the Revisor
M
D
Using the Right Architecture
Start Finish
Find ways to remove barriers to empoweringthe non programmers on your team.
M
DKelly-McCreary & Associates, LLC
65
Questions
Dan McCreary
President, Kelly-McCreary & Associates