43
April 10-12 | Chicago, IL NoSQL: An Analysis Andrew J. Brust, Founder and CEO, Blue Badge Insights

NoSQL: An Analysis

Embed Size (px)

Citation preview

Page 1: NoSQL: An Analysis

April 10-12 | Chicago, IL

NoSQL: An Analysis

Andrew J. Brust, Founder and CEO, Blue Badge Insights

Page 2: NoSQL: An Analysis

April 10-12 | Chicago, IL

Please silence cell phones

Page 3: NoSQL: An Analysis

3

Meet Andrew

CEO and Founder, Blue Badge Insights

Big Data blogger for ZDNetMicrosoft Regional Director, MVPCo-chair VSLive! and 17 years as a speakerFounder, Microsoft BI User Group of NYC• http://www.msbinyc.comCo-moderator, NYC .NET Developers Group• http://www.nycdotnetdev.com“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer Newsbrustblog.com, Twitter: @andrewbrust

Page 4: NoSQL: An Analysis

Andrew’s New Blog (bit.ly/bigondata)

Page 5: NoSQL: An Analysis

Read all about it!

Page 6: NoSQL: An Analysis

Agenda

Why NoSQL?ConceptsNoSQL CategoriesProvisioning, market, applicabilityTake-aways

Page 8: NoSQL: An Analysis

NoSQL Data Fodder

AddressesPreference

s

NotesFriends,

Followers

Documents

Page 9: NoSQL: An Analysis

“Web Scale”This the term used to justify NoSQLScenario is simple needs but “made up for in volume”• Millions of concurrent users

Think of sites like Amazon or GoogleThink of non-transactional tasks like loading catalog data to display product page, or environment preferences

Page 10: NoSQL: An Analysis

NoSQL Common Traits

Non-relationalNon-schematized/schema-freeOpen sourceDistributedEventual consistency“Web scale”Developed at big Internet companies

Page 11: NoSQL: An Analysis

CONCEPTS

Page 12: NoSQL: An Analysis

Consistency

CAP Theorem

• Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance

NoSQL does not offer “ACID” guarantees

• Atomicity, consistency, isolation and durability

Instead offers “eventual consistency”

Similar to DNS propagation

Page 13: NoSQL: An Analysis

Things like inventory, account balances should be consistent

• Imagine updating a server in Seattle that stock was depleted

• Imagine not updating the server in NY

• Customer in NY goes to order 50 pieces of the item

• Order processed even though no stock

Things like catalog information don’t have to be, at least not immediately

• If a new item is entered into the catalog, it’s OK for some customers to see it even before the other customers’ server knows about it

But catalog info must come up quickly

• Therefore don’t lock data in one location while waiting to update the other

Therefore, OK to sacrifice consistency for speed, in some cases

Consistency

Page 14: NoSQL: An Analysis

CAP Theorem

Consistency

Availability

Partition Tolerance

Relational

NoSQL

Page 15: NoSQL: An Analysis

Indexing

Most NoSQL databases are indexed by keySome allow so-called “secondary” indexesOften the primary key indexes are clusteredHBase uses HDFS (the Hadoop Distributed File System), which is append-only• Writes are logged

• Logged writes are batched

• File is re-created and sorted

Page 16: NoSQL: An Analysis

Queries

Typically no query languageInstead, create procedural programSometimes SQL is supportedSometimes MapReduce code is used…

Page 17: NoSQL: An Analysis

MapReduce

This is not Hadoop’s MapReduce, but it’s conceptually relatedMap step: pre-processes dataReduce step: summarizes/aggregates dataWill show a MapReduce code sample for Mongo soonWill demo map code on CouchDB

Page 18: NoSQL: An Analysis

Sharding

A partitioning pattern where separate servers store partitionsFan-out queries supportedPartitions may be duplicated, so replication also provided• Good for disaster recovery

Since “shards” can be geographically distributed, sharding can act like a CDNGood for keeping data close to processing• Reduces network traffic when MapReduce splitting takes place

Page 19: NoSQL: An Analysis

NOSQL CATEGORIES

Page 20: NoSQL: An Analysis

20

Key-Value Stores

The most common; not necessarily the most popularHas rows, each with something like a big dictionary/associative array• Schema may differ from row to row

Common on cloud platforms• e.g. Amazon SimpleDB, Azure Table Storage

MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak

Page 21: NoSQL: An Analysis

Key-Value Stores

Table: CustomersRow ID: 101

First_Name: AndrewLast_Name: BrustAddress: 123 Main StreetLast_Order: 1501

Row ID: 202First_Name: JaneLast_Name: DoeAddress: 321 Elm StreetLast_Order: 1502

Table: Orders

Row ID: 1501Price: 300 USDItem1: 52134Item2: 24457

Row ID: 1502Price: 2500 GBPItem1: 98456Item2: 59428

Database

Page 22: NoSQL: An Analysis

Wide Column Stores

Has tables with declared column families

• Each column family has “columns” which are KV pairs that can vary from row to row

These are the most foundational for large sites

• BigTable (Google)

• HBase (Originally part of Yahoo-dominated Hadoop project)

• Cassandra (Facebook)

• Calls column families “super columns” and tables “super column families”

They are the most “Big Data”-ready

• Especially HBase + Hadoop

Page 23: NoSQL: An Analysis

Table: CustomersRow ID: 101

Super Column: Name Column: First_Name: Andrew Column: Last_Name: BrustSuper Column: Address Column: Number: 123 Column: Street: Main StreetSuper Column: Orders Column: Last_Order: 1501

Table: Orders

Row ID: 1501Super Column: Pricing Column: Price: 300 USDSuper Column: Items Column: Item1: 52134 Column: Item2: 24457Row ID: 1502Super Column: Pricing Column: Price: 2500 GBPSuper Column: Items Column: Item1: 98456 Column: Item2: 59428

Row ID: 202Super Column: Name Column: First_Name: Jane Column: Last_Name: DoeSuper Column: Address Column: Number: 321 Column: Street: Elm StreetSuper Column: Orders Column: Last_Order: 1502

Wide Column Stores

Page 24: NoSQL: An Analysis

April 10-12 | Chicago, IL

DemoWide Column Stores

Page 25: NoSQL: An Analysis

Document Stores

Have “databases,” which are akin to tablesHave “documents,” akin to rows

• Documents are typically JSON objects

• Each document has properties and values

• Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained JSON objects - Allows for hierarchical storage)

• Can have attachments as well

Old versions are retained

• So Doc Stores work well for content management

Some view doc stores as specialized KV storesMost popular with developers, startups, VCsThe biggies:

• CouchDB

• Derivatives

• MongoDB

Page 26: NoSQL: An Analysis

Document Store Application Orientation

Documents can each be addressed by URIsCouchDB supports full REST interfaceVery geared towards JavaScript and JSON

• Documents are JSON objects

• CouchDB/MongoDB use JavaScript as native language

In CouchDB, “view functions” also have unique URIs and they return HTML

• So you can build entire applications in the database

Page 27: NoSQL: An Analysis

Database: CustomersDocument ID: 101

First_Name: AndrewLast_Name: BrustAddress:

Orders:

Database: Orders

Document ID: 1501Price: 300 USDItem1: 52134Item2: 24457

Document ID: 1502Price: 2500 GBPItem1: 98456Item2: 59428

Number: 123Street: Main Street

Most_recent: 1501

Document ID: 202First_Name: JaneLast_Name: DoeAddress:

Orders:

Number: 321Street: Elm Street

Most_recent: 1502

Document Stores

Page 28: NoSQL: An Analysis

April 10-12 | Chicago, IL

DemoDocument Stores

Page 29: NoSQL: An Analysis

Graph Databases

Great for social network applications and others where relationships are importantNodes and edges• Edge like a join

• Nodes like rows in a table

Nodes can also have properties and valuesNeo4j is a popular graph db

Page 30: NoSQL: An Analysis

Database

Sent invitation to

Commented on photo by

Friend of

Address

Placed order

Item2

Item1

Joe Smith Jane Doe

Andrew Brust

Street: 123 Main StreetCity: New YorkState: NYZip: 10014

ID: 52134Type: DressColor: Blue

ID: 24457Type: ShirtColor: Red

ID: 252Total Price: 300 USD

George Washington

Graph Databases

Page 31: NoSQL: An Analysis

PROVISIONING, MARKET, APPLICABILITY

Page 32: NoSQL: An Analysis

NoSQL + BI

NoSQL databases are bad for ad hoc query and data warehousingBI applications involve models; models rely on schemaExtract, transform and load (ETL) may be your friendWide-column stores, however are good for “Big Data”

• See next slide

Wide-column stores and column-oriented databases are similar technologically

Page 33: NoSQL: An Analysis

NoSQL + Big DataBig Data and NoSQL are interrelatedTypically, Wide-Column stores used in Big Data scenariosPrime example:• HBase and Hadoop

Why?• Lack of indexing not a problem

• Consistency not an issue

• Fast reads very important

• Distributed file systems important too

• Commodity hardware and disk assumptions also important

• Not Web scale but massive scale-out, so similar concerns

Page 34: NoSQL: An Analysis

34

Going “NoSQL-Like” on the MS CloudAzure Table Storage (a key-value store)SQL Azure XML columns (supports variable schema, hierarchy)SQL Azure Federation (a sharding implementation)OData (HTTP/JSON data APIs)Running NoSQL database products using Azure VMs…

Page 35: NoSQL: An Analysis

NoSQL on Windows Azure

Platform as a Service• Cloudant: https://cloudant.com/azure/

• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/

MongoDB, DIY: • On an Azure Worker Role:

http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles

• On a Windows VM:http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer

• On a Linux VM:http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorialhttp://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/

Page 37: NoSQL: An Analysis

37

And With MS On-Premise Technologies

SQL Server 2008/2008R2/2012 “Beyond Relational” Features• Sparse columns (like Wide Column Stores)• Geospatial (geometry, geography data types)• FILESTREAM, FileTable (like Document Store attachments)• Full Text Search, Semantic Similarity Search• HierarchyID (can simulate Graph Database functionality)SQL Server Parallel Data Warehouse Edition (PDW)• Distributed architecture (like MapReduce/Hadoop)• PolyBase in PDW v2 (interfaces PDW and HDFS)

Page 38: NoSQL: An Analysis

TAKE-AWAYS

Page 39: NoSQL: An Analysis

Compromises

Eventual consistencyWrite bufferingOnly primary keys can be indexedQueries must be written as programsTooling• Productivity (= money)

Page 40: NoSQL: An Analysis

Summing Up

• Line of Business -> Relational• Large, public (consumer)-facing sites -> NoSQL

• Complex data structures -> Relational• Big Data -> NoSQL

• Transactional -> Relational• Content Management -> NoSQL

• Enterprise->Relational • Consumer Web -> NoSQL

Page 41: NoSQL: An Analysis

Thank you

[email protected]• @andrewbrust on twitter• Want to get on Blue Badge Insights’ list?”Text “bluebadge” to 22828

Page 42: NoSQL: An Analysis

Win a Microsoft Surface Pro!

Complete an online SESSION EVALUATION to be entered into the draw.

Draw closes April 12, 11:59pm CTWinners will be announced on the PASS BA Conference website and on Twitter.

Go to passbaconference.com/evals or follow the QR code link displayed on session signage throughout the conference venue.

Your feedback is important and valuable. All feedback will be used to improve and select sessions for future events.

Page 43: NoSQL: An Analysis

April 10-12, Chicago, IL

Thank you!Diamond Sponsor Platinum Sponsor