NoSQL: An Analysis

April 10-12 | Chicago, IL

Andrew J. Brust, Founder and CEO, Blue Badge Insights

Please silence cell phones

Meet Andrew

CEO and Founder, Blue Badge Insights

Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
• http://www.msbinyc.com
Co-moderator, NYC .NET Developers Group
• http://www.nycdotnetdev.com
“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
brustblog.com, Twitter: @andrewbrust

Andrew’s New Blog (bit.ly/bigondata)

Read all about it!

Agenda

Why NoSQL?
Concepts
NoSQL Categories
Provisioning, market, applicability
Take-aways

NoSQL Data Fodder

Addresses
Preferences
Notes
Friends, Followers
Documents

“Web Scale”

This is the term used to justify NoSQL
The scenario is simple needs, but “made up for in volume”
• Millions of concurrent users
Think of sites like Amazon or Google
Think of non-transactional tasks like loading catalog data to display a product page, or environment preferences

NoSQL Common Traits

Non-relational
Non-schematized/schema-free
Open source
Distributed
Eventual consistency
“Web scale”
Developed at big Internet companies

CONCEPTS

Consistency

CAP Theorem

• Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance

NoSQL does not offer “ACID” guarantees

• Atomicity, consistency, isolation and durability

Instead offers “eventual consistency”

Similar to DNS propagation

Things like inventory, account balances should be consistent

• Imagine updating a server in Seattle that stock was depleted

• Imagine not updating the server in NY

• Customer in NY goes to order 50 pieces of the item

• Order processed even though no stock

Things like catalog information don’t have to be, at least not immediately

• If a new item is entered into the catalog, it’s OK for some customers to see it even before the other customers’ server knows about it

But catalog info must come up quickly

• Therefore don’t lock data in one location while waiting to update the other

Therefore, OK to sacrifice consistency for speed, in some cases
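
The Seattle/New York inventory scenario above can be sketched as a small simulation. This is only an illustration of eventual consistency, not how any particular product replicates data; the `Replica` class, the `write` and `propagate` helpers and the "widget" item are all hypothetical names.

```python
class Replica:
    """One regional copy of the inventory (hypothetical, in-memory)."""

    def __init__(self, name, stock):
        self.name = name
        self.stock = dict(stock)
        self.pending = []  # updates received but not yet applied

    def accept_order(self, item, quantity):
        """Accept the order if the local copy believes stock is sufficient."""
        return self.stock.get(item, 0) >= quantity


def write(origin, others, item, new_level):
    """Apply a write locally and queue it for the other replicas (no locking)."""
    origin.stock[item] = new_level
    for replica in others:
        replica.pending.append((item, new_level))


def propagate(replica):
    """Apply queued updates; afterwards the replica has caught up."""
    for item, level in replica.pending:
        replica.stock[item] = level
    replica.pending.clear()


seattle = Replica("Seattle", {"widget": 50})
new_york = Replica("New York", {"widget": 50})

# Stock is depleted and the Seattle server is updated first.
write(seattle, [new_york], "widget", 0)

# Before propagation, New York still sees the stale value and accepts the order.
stale_decision = new_york.accept_order("widget", 50)

propagate(new_york)

# After propagation, both replicas agree and the same order is rejected.
fresh_decision = new_york.accept_order("widget", 50)
print(stale_decision, fresh_decision)
```

The write returns without waiting for New York, which is what keeps reads fast; the cost is the window in which the stale order is accepted.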

CAP Theorem

[Figure: CAP Theorem diagram relating Consistency, Availability and Partition Tolerance, with Relational and NoSQL databases marked on it]

Indexing

Most NoSQL databases are indexed by key
Some allow so-called “secondary” indexes
Often the primary key indexes are clustered
HBase uses HDFS (the Hadoop Distributed File System), which is append-only
• Writes are logged
• Logged writes are batched
• File is re-created and sorted
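
The log, batch and re-create cycle above can be sketched in a few lines. This is a deliberately simplified toy, not HBase's actual implementation; `AppendOnlyTable`, `batch_size` and the row keys are hypothetical.

```python
class AppendOnlyTable:
    """Toy key-indexed table over storage that is only appended to or rewritten."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size  # hypothetical flush threshold
        self.log = []                 # logged writes, in arrival order
        self.sorted_file = []         # (key, value) pairs sorted by key

    def put(self, key, value):
        self.log.append((key, value))          # 1. the write is logged
        if len(self.log) >= self.batch_size:   # 2. logged writes are batched
            self.flush()

    def flush(self):
        merged = dict(self.sorted_file)
        merged.update(self.log)                    # later writes win
        self.sorted_file = sorted(merged.items())  # 3. file re-created, sorted
        self.log = []

    def get(self, key):
        # Check unflushed writes first (newest first), then the sorted file.
        for logged_key, value in reversed(self.log):
            if logged_key == key:
                return value
        return dict(self.sorted_file).get(key)


table = AppendOnlyTable(batch_size=3)
table.put("row3", "c")
table.put("row1", "a")
table.put("row2", "b")   # third write triggers a flush
table.put("row1", "A")   # still only in the log
print(table.sorted_file, table.get("row1"))
```

Because the file is rewritten in key order, it behaves like a clustered primary key index even though individual writes never modify it in place.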

Queries

Typically no query language
Instead, create procedural program
Sometimes SQL is supported
Sometimes MapReduce code is used…

MapReduce

This is not Hadoop’s MapReduce, but it’s conceptually related
Map step: pre-processes data
Reduce step: summarizes/aggregates data
Will show a MapReduce code sample for Mongo soon
Will demo map code on CouchDB
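
As a rough, product-neutral sketch of the two steps (this is plain Python, not the Mongo or CouchDB sample from the session), the map step emits key/value pairs per document and the reduce step aggregates the values for each key. The `customer` field and order 1503 are assumptions added for illustration.

```python
from collections import defaultdict

orders = [
    {"id": 1501, "customer": 101, "price": 300},
    {"id": 1502, "customer": 202, "price": 2500},
    {"id": 1503, "customer": 101, "price": 120},  # hypothetical extra order
]


def map_order(doc):
    """Map step: pre-process one document into (key, value) pairs."""
    yield doc["customer"], doc["price"]


def reduce_totals(key, values):
    """Reduce step: summarize all values emitted for one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


totals = map_reduce(orders, map_order, reduce_totals)
print(totals)
```

The query is an ordinary program rather than a declarative statement, which is the trade-off the Queries slide describes.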

Sharding

A partitioning pattern where separate servers store partitions
Fan-out queries supported
Partitions may be duplicated, so replication also provided
• Good for disaster recovery
Since “shards” can be geographically distributed, sharding can act like a CDN
Good for keeping data close to processing
• Reduces network traffic when MapReduce splitting takes place
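
A minimal sketch of sharding with a fan-out query follows. The slides do not say how rows are assigned to shards; hash-based placement is assumed here, and `ShardedStore` with its in-memory dictionaries stands in for separate servers.

```python
import hashlib


class ShardedStore:
    """Toy store that spreads rows across several partitions ("shards")."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate server holding one partition.
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # Assumed placement rule: stable hash of the key, modulo shard count.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        # A lookup by key touches exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        # A query not keyed on the shard key must visit every shard.
        results = []
        for shard in self.shards:
            results.extend(row for row in shard.values() if predicate(row))
        return results


store = ShardedStore(shard_count=3)
store.put(101, {"Last_Name": "Brust", "City": "New York"})
store.put(202, {"Last_Name": "Doe", "City": "New York"})
store.put(303, {"Last_Name": "Smith", "City": "Seattle"})  # hypothetical row

in_new_york = store.fan_out(lambda row: row["City"] == "New York")
print(len(in_new_york))
```

Key lookups stay cheap because they hit one shard; the fan-out query is where the network cost appears.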

NOSQL CATEGORIES

Key-Value Stores

The most common; not necessarily the most popular
Has rows, each with something like a big dictionary/associative array
• Schema may differ from row to row
Common on cloud platforms
• e.g. Amazon SimpleDB, Azure Table Storage

MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak

Key-Value Stores

Database

Table: Customers
Row ID: 101
• First_Name: Andrew
• Last_Name: Brust
• Address: 123 Main Street
• Last_Order: 1501
Row ID: 202
• First_Name: Jane
• Last_Name: Doe
• Address: 321 Elm Street
• Last_Order: 1502

Table: Orders
Row ID: 1501
• Price: 300 USD
• Item1: 52134
• Item2: 24457
Row ID: 1502
• Price: 2500 GBP
• Item1: 98456
• Item2: 59428
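
The example database above maps naturally onto nested dictionaries, which is roughly what "a big dictionary/associative array" per row means. This is an in-memory sketch only; the `put` and `get` helpers and the `Email` key are hypothetical and do not correspond to any product's API.

```python
database = {
    "Customers": {
        101: {"First_Name": "Andrew", "Last_Name": "Brust",
              "Address": "123 Main Street", "Last_Order": 1501},
        202: {"First_Name": "Jane", "Last_Name": "Doe",
              "Address": "321 Elm Street", "Last_Order": 1502},
    },
    "Orders": {
        1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
        1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
    },
}


def put(table, row_id, **pairs):
    """Add or update key-value pairs on one row, creating it if needed."""
    database.setdefault(table, {}).setdefault(row_id, {}).update(pairs)


def get(table, row_id):
    """Fetch a whole row by its key; there is no other access path."""
    return database[table][row_id]


# Schema may differ from row to row: only row 202 gets an Email key.
put("Customers", 202, Email="jane@example.com")

# "Joining" is done by the application: read one row, then look up the next.
last_order = get("Orders", get("Customers", 101)["Last_Order"])
print(last_order["Price"])
```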

Wide Column Stores

Has tables with declared column families

• Each column family has “columns” which are KV pairs that can vary from row to row

These are the most foundational for large sites

• BigTable (Google)

• HBase (Originally part of Yahoo-dominated Hadoop project)

• Cassandra (Facebook)

• Calls column families “super columns” and tables “super column families”

They are the most “Big Data”-ready

• Especially HBase + Hadoop

Wide Column Stores

Table: Customers
Row ID: 101
• Super Column: Name
  • Column: First_Name: Andrew
  • Column: Last_Name: Brust
• Super Column: Address
  • Column: Number: 123
  • Column: Street: Main Street
• Super Column: Orders
  • Column: Last_Order: 1501
Row ID: 202
• Super Column: Name
  • Column: First_Name: Jane
  • Column: Last_Name: Doe
• Super Column: Address
  • Column: Number: 321
  • Column: Street: Elm Street
• Super Column: Orders
  • Column: Last_Order: 1502

Table: Orders
Row ID: 1501
• Super Column: Pricing
  • Column: Price: 300 USD
• Super Column: Items
  • Column: Item1: 52134
  • Column: Item2: 24457
Row ID: 1502
• Super Column: Pricing
  • Column: Price: 2500 GBP
• Super Column: Items
  • Column: Item1: 98456
  • Column: Item2: 59428
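
One way to picture the structure above is table -> row -> column family -> columns, with the families declared up front and the columns free to vary per row. The sketch below uses plain dictionaries; `DECLARED_FAMILIES`, `put_column`, `read_family` and the `Apartment` column are hypothetical and not taken from HBase or Cassandra.

```python
# Column families are declared per table; columns inside them are not.
DECLARED_FAMILIES = {"Name", "Address", "Orders"}

customers = {
    101: {
        "Name": {"First_Name": "Andrew", "Last_Name": "Brust"},
        "Address": {"Number": 123, "Street": "Main Street"},
        "Orders": {"Last_Order": 1501},
    },
    202: {
        "Name": {"First_Name": "Jane", "Last_Name": "Doe"},
        "Address": {"Number": 321, "Street": "Elm Street"},
        "Orders": {"Last_Order": 1502},
    },
}


def put_column(table, row_id, family, column, value):
    """Write one column; the family must be declared, the column need not be."""
    if family not in DECLARED_FAMILIES:
        raise KeyError(f"undeclared column family: {family}")
    table.setdefault(row_id, {}).setdefault(family, {})[column] = value


def read_family(table, row_id, family):
    """Read all columns of one family for one row."""
    return table[row_id].get(family, {})


# Columns are key-value pairs that can vary from row to row.
put_column(customers, 202, "Address", "Apartment", "4B")
print(read_family(customers, 202, "Address"))
```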

Demo: Wide Column Stores

Document Stores

Have “databases,” which are akin to tables
Have “documents,” akin to rows

• Documents are typically JSON objects

• Each document has properties and values

• Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained JSON objects, which allow for hierarchical storage)

• Can have attachments as well

Old versions are retained

• So Doc Stores work well for content management

Some view doc stores as specialized KV stores
Most popular with developers, startups, VCs
The biggies:

• CouchDB

• Derivatives

• MongoDB

Document Store Application Orientation

Documents can each be addressed by URIs
CouchDB supports full REST interface
Very geared towards JavaScript and JSON

• Documents are JSON objects

• CouchDB/MongoDB use JavaScript as native language

In CouchDB, “view functions” also have unique URIs and they return HTML

• So you can build entire applications in the database

Document Stores

Database: Customers
Document ID: 101
• First_Name: Andrew
• Last_Name: Brust
• Address:
  • Number: 123
  • Street: Main Street
• Orders:
  • Most_recent: 1501
Document ID: 202
• First_Name: Jane
• Last_Name: Doe
• Address:
  • Number: 321
  • Street: Elm Street
• Orders:
  • Most_recent: 1502

Database: Orders
Document ID: 1501
• Price: 300 USD
• Item1: 52134
• Item2: 24457
Document ID: 1502
• Price: 2500 GBP
• Item1: 98456
• Item2: 59428
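
The customer documents above are JSON objects with sub-documents (Address, Orders) and a link to a document in another database (Most_recent). A minimal sketch using only Python's standard `json` module follows; it does not use the CouchDB or MongoDB APIs, and the `most_recent_order` helper is hypothetical.

```python
import json

customers = {
    "101": {"First_Name": "Andrew", "Last_Name": "Brust",
            "Address": {"Number": 123, "Street": "Main Street"},
            "Orders": {"Most_recent": 1501}},
    "202": {"First_Name": "Jane", "Last_Name": "Doe",
            "Address": {"Number": 321, "Street": "Elm Street"},
            "Orders": {"Most_recent": 1502}},
}

orders = {
    "1501": {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
    "1502": {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
}


def most_recent_order(customer_id):
    """Follow the link from a customer document to a document in Orders."""
    doc = customers[customer_id]
    return orders[str(doc["Orders"]["Most_recent"])]


# A document is stored and exchanged as JSON text; nesting survives intact.
as_json = json.dumps(customers["101"], indent=2)
round_tripped = json.loads(as_json)

print(as_json)
print(most_recent_order("202")["Price"])
```

Unlike the key-value layout, the address is a contained object rather than a flat string, so the hierarchy is part of the stored value.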

Demo: Document Stores

Graph Databases

Great for social network applications and others where relationships are important
Nodes and edges
• Edge like a join
• Nodes like rows in a table
Nodes can also have properties and values
Neo4j is a popular graph db

Graph Databases

[Figure: example graph database. Person nodes: Joe Smith, Jane Doe, Andrew Brust, George Washington. Other nodes: an address (Street: 123 Main Street, City: New York, State: NY, Zip: 10014), an order (ID: 252, Total Price: 300 USD) and two items (ID: 52134, Type: Dress, Color: Blue; ID: 24457, Type: Shirt, Color: Red). Edge labels: Sent invitation to, Commented on photo by, Friend of, Address, Placed order, Item1, Item2.]
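
Nodes with properties and labeled edges can be sketched with a dictionary and a list of triples. The node properties and edge labels come from the example diagram, but which nodes each edge connects is an assumption made here for illustration, and the helpers are plain Python rather than Neo4j's API.

```python
from collections import deque

nodes = {
    "andrew": {"name": "Andrew Brust"},
    "jane": {"name": "Jane Doe"},
    "joe": {"name": "Joe Smith"},
    "george": {"name": "George Washington"},
    "address": {"Street": "123 Main Street", "City": "New York",
                "State": "NY", "Zip": "10014"},
    "order252": {"ID": 252, "Total Price": "300 USD"},
    "item52134": {"ID": 52134, "Type": "Dress", "Color": "Blue"},
    "item24457": {"ID": 24457, "Type": "Shirt", "Color": "Red"},
}

# (source, label, destination); the endpoints are assumed, the labels are not.
edges = [
    ("andrew", "Friend of", "jane"),
    ("joe", "Sent invitation to", "andrew"),
    ("george", "Commented on photo by", "jane"),
    ("andrew", "Address", "address"),
    ("andrew", "Placed order", "order252"),
    ("order252", "Item1", "item52134"),
    ("order252", "Item2", "item24457"),
]


def neighbors(node_id, label=None):
    """Follow outgoing edges from one node, optionally filtered by label."""
    return [dst for src, lbl, dst in edges
            if src == node_id and (label is None or lbl == label)]


def reachable(start):
    """Breadth-first traversal over outgoing edges, excluding the start node."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbors(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


def items_ordered_by(person_id):
    """Two hops: person -> order -> items, the graph analogue of two joins."""
    items = []
    for order_id in neighbors(person_id, "Placed order"):
        for src, lbl, dst in edges:
            if src == order_id and lbl.startswith("Item"):
                items.append(nodes[dst]["Type"])
    return items


print(items_ordered_by("andrew"))
```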

PROVISIONING, MARKET, APPLICABILITY

NoSQL + BI

NoSQL databases are bad for ad hoc query and data warehousing
BI applications involve models; models rely on schema
Extract, transform and load (ETL) may be your friend
Wide-column stores, however, are good for “Big Data”

• See next slide

Wide-column stores and column-oriented databases are similar technologically

NoSQL + Big Data

Big Data and NoSQL are interrelated
Typically, Wide-Column stores used in Big Data scenarios
Prime example:
• HBase and Hadoop

Why?

• Lack of indexing not a problem

• Consistency not an issue

• Fast reads very important

• Distributed file systems important too

• Commodity hardware and disk assumptions also important

• Not Web scale but massive scale-out, so similar concerns

Going “NoSQL-Like” on the MS Cloud

Azure Table Storage (a key-value store)
SQL Azure XML columns (supports variable schema, hierarchy)
SQL Azure Federation (a sharding implementation)
OData (HTTP/JSON data APIs)
Running NoSQL database products using Azure VMs…

NoSQL on Windows Azure

Platform as a Service
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/

MongoDB, DIY:
• On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial and http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/

And With MS On-Premise Technologies

SQL Server 2008/2008R2/2012 “Beyond Relational” Features
• Sparse columns (like Wide Column Stores)
• Geospatial (geometry, geography data types)
• FILESTREAM, FileTable (like Document Store attachments)
• Full Text Search, Semantic Similarity Search
• HierarchyID (can simulate Graph Database functionality)
SQL Server Parallel Data Warehouse Edition (PDW)
• Distributed architecture (like MapReduce/Hadoop)
• PolyBase in PDW v2 (interfaces PDW and HDFS)

TAKE-AWAYS

Compromises

Eventual consistency
Write buffering
Often, only primary keys can be indexed
Queries must often be written as programs
Tooling
• Productivity (= money)

Summing Up

• Line of Business -> Relational
• Large, public (consumer)-facing sites -> NoSQL

• Complex data structures -> Relational
• Big Data -> NoSQL

• Transactional -> Relational
• Content Management -> NoSQL

• Enterprise -> Relational
• Consumer Web -> NoSQL

Thank you

• andrew.brust@bluebadgeinsights.com
• @andrewbrust on Twitter
• Want to get on Blue Badge Insights’ list? Text “bluebadge” to 22828

Win a Microsoft Surface Pro!

Complete an online SESSION EVALUATION to be entered into the draw.

Draw closes April 12, 11:59pm CT
Winners will be announced on the PASS BA Conference website and on Twitter.

Go to passbaconference.com/evals or follow the QR code link displayed on session signage throughout the conference venue.

Your feedback is important and valuable. All feedback will be used to improve and select sessions for future events.
