Choosing which big data, nosql or database technology to use

One Size Doesn’t Fit All: Choosing which big data, NoSQL or database technology to use

March 14, 2012

Mark R. Madsen, http://ThirdNature.net

Presenter

Presentation Notes

One size never fits all. We just pretend it does because it’s a useful fiction. The goal of this talk is three-fold:

▪ To explain the underlying dynamics of the market that supplies the technologies and architectures we use in our work
▪ To provide an explanation that helps frame the problems we have to solve so we can think better about them
▪ To put some of the “big data” impact into context

Image: public domain

The problem of “big” is three problems of volume

Number of users!

Computations!

Amount of data!

Presenter

Presentation Notes

Data volume, and extrapolation from data volume, is not the right way to approach the problem. The problem is bigger than that, or we could solve it by simply throwing more hardware at it. We need to look at it with a proper framework.

Three sources of analytic performance problems. It’s a size problem, a volume of:

▪ Data
▪ Users
▪ Computations

Complexity (computation scale) is solved, except that Moore's law is dead. Size comes in two flavors:

▪ Data scale: what's the real problem with size? It can be storage costs, or the slowdown as the system grows bigger weighed against the value you get out of the growing system.
▪ User scale: high concurrency may or may not be a problem, depending on whether the data amounts are large or small, whether the queries are simple or complex, and how much of the work is computational.

How big is big? Is it by row count in tables? By transactions at the sources? By data size? All of the above? Processing tens of GB per night is big to some people. It’s a minor data movement job to others. You can move a terabyte an hour easily through a standard database connection, so is that really a good metric?

The number of tables and relationships can be as important as the amount of data stored, since these drive query complexity, which can result in poor optimizations and lots of data movement.
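To put the "terabyte an hour" remark in perspective, the arithmetic can be sketched as below. This is only a back-of-the-envelope illustration: the decimal terabyte (10^12 bytes), the 50 GB nightly load, and the function names are assumptions, not figures from the talk.

```python
# Back-of-the-envelope conversion of "a terabyte an hour" into a sustained rate.
# Assumption: decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes).

TERABYTE_BYTES = 1e12
SECONDS_PER_HOUR = 3600


def sustained_rate_mb_per_s(bytes_moved: float, seconds: float) -> float:
    """Average transfer rate in decimal megabytes per second."""
    return bytes_moved / seconds / 1e6


def hours_to_move(gigabytes: float, rate_mb_per_s: float) -> float:
    """Hours needed to move `gigabytes` at a sustained rate in MB/s."""
    seconds = (gigabytes * 1e9) / (rate_mb_per_s * 1e6)
    return seconds / SECONDS_PER_HOUR


rate = sustained_rate_mb_per_s(TERABYTE_BYTES, SECONDS_PER_HOUR)
print(f"1 TB/hour is roughly {rate:.0f} MB/s sustained")

# Hypothetical nightly load of 50 GB ("tens of GB per night") at that rate.
print(f"50 GB would take about {hours_to_move(50, rate) * 60:.0f} minutes")
```

At that rate a nightly load in the tens of gigabytes is a matter of minutes, which is why raw data volume alone is a weak measure of "big".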

Unstructured data isn’t really unstructured.

The problem is that this data is unmodeled.

The real challenge is complexity.

Big data?

Presenter

Presentation Notes

This data today will be the structured data of tomorrow, at which point we will probably have new data that is apparently unstructured. People like to talk about data volumes and variety. The real issue is the ability to discover or derive structure from data. As we learn how to address some of this new data, it becomes structured. There’s a migration from disorder to order, but a human-mediated migration ain’t it.

The holy grail of databases under current market hype

A key problem is that we’re talking mostly about computation over data when we talk about “big data” and analytics, a potential mismatch for both relational and nosql.

Solving the Problem Depends on the Diagnosis

Presenter

Presentation Notes

The trick to choosing the right technology is to start with a proper diagnosis of what’s really causing the problem rather than focusing only on data volume. Assuming you did all the simple things, the easiest next step is to look at whether there are simple additions to augment the system without radical changes. Rip and replace of your data infrastructure isn’t usually practical. Beyond that, the diagnosis of where the bottleneck is should guide some of the approaches outlined earlier.

You must understand your workload ‐ throughput and response time requirements aren’t enough.

▪ 100 simple queries accessing month‐to‐date data

▪ 90 simple queries accessing month‐to‐date data plus 10 complex queries using two years of history

▪ Hazard calculation for the entire customer master

▪ Performance problems are rarely due to a single factor.
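The first two workloads above have the same query count, yet very different amounts of work behind them. A minimal sketch of that difference follows, assuming (purely for illustration) that the work per query is proportional to the number of days of data it touches, that month-to-date means about 30 days, and that two years means 730 days. The `QueryClass` type and `total_days_scanned` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryClass:
    name: str
    count: int          # how many queries of this kind are submitted
    days_scanned: int   # assumed proxy for the work done by one query


def total_days_scanned(mix: list) -> int:
    """Sum of count * days_scanned across every query class in the mix."""
    return sum(q.count * q.days_scanned for q in mix)


MONTH_TO_DATE_DAYS = 30   # assumption: a full month of data
TWO_YEARS_DAYS = 730      # assumption: two years of history

mix_a = [QueryClass("simple", 100, MONTH_TO_DATE_DAYS)]
mix_b = [
    QueryClass("simple", 90, MONTH_TO_DATE_DAYS),
    QueryClass("complex", 10, TWO_YEARS_DAYS),
]

for label, mix in (("100 simple", mix_a), ("90 simple + 10 complex", mix_b)):
    queries = sum(q.count for q in mix)
    print(f"{label}: {queries} queries, {total_days_scanned(mix)} day-scans")
```

Both mixes submit 100 queries, but under these assumptions the second one touches more than three times as much data, so a throughput figure alone would not distinguish them.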

Presenter

Presentation Notes

These factors will affect your choice of platform:

▪ Nature of the problem, e.g., computation over data, and the type of problem (score, recommend, etc.)
▪ Size in terms of row counts, size of data, and how much of it you use or need
▪ Users: small or large queries

Performance problems are usually due to a combination of factors: computations over a large data volume, many concurrent queries at high volume, and so on. It’s rare to have a concurrency, size, or computational problem by itself.

Workload: One big query or many small queries?

Retrieval: small return set or large?

Selectivity: large volume of data scanned or small?
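The retrieval and selectivity questions can be made concrete by looking at how many rows a query scans and how many it returns. The sketch below is one simple way to label a query along those two axes. The `classify` function and both thresholds (10% of the table scanned, 10,000 rows returned) are arbitrary assumptions for illustration, not values from the talk.

```python
def classify(rows_total: int, rows_scanned: int, rows_returned: int,
             scan_threshold: float = 0.10,
             return_threshold: int = 10_000) -> tuple:
    """Label a query by how much it scans and how much it returns.

    scan_threshold:   fraction of the table scanned at which a scan counts as large
    return_threshold: row count at which a result set counts as large
    Both thresholds are illustrative assumptions.
    """
    if rows_total <= 0:
        raise ValueError("rows_total must be positive")
    scan = "large scan" if rows_scanned / rows_total >= scan_threshold else "small scan"
    ret = "large return set" if rows_returned >= return_threshold else "small return set"
    return scan, ret


TABLE_ROWS = 1_000_000_000  # hypothetical fact table

# Aggregate over the whole table that returns twelve monthly totals.
print(classify(TABLE_ROWS, TABLE_ROWS, 12))
# Indexed lookup of a handful of rows.
print(classify(TABLE_ROWS, 10, 10))
# Bulk extract of half the table.
print(classify(TABLE_ROWS, 500_000_000, 500_000_000))
```

The three example queries land in three different combinations, which is the point of asking the questions separately.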

Presenter

Presentation Notes

Image: rock-fall-roadblock.jpg - http://www.flickr.com/photos/wsdot/4679360979/ roadblock-sheep.jpg - http://www.flickr.com/photos/brizo_the_scot/4013939756/

Important workload parameters to know

• Read‐intensive vs. write‐intensive



• Mutable vs. immutable data




• Immediate vs. eventual consistency





• Short vs. long access latency

• Predictable vs. unpredictable data access patterns

Types of workloads

Write‐biased:

▪ OLTP
▪ OLTP, batch
▪ OLTP, lite
▪ Object persistence
▪ Data ingest, batch
▪ Data ingest, real‐time

Read‐biased:

▪ Query
▪ Query, simple retrieval
▪ Query, complex
▪ Query‐hierarchical / object / network
▪ Analytic

Mixed?

▪ Inline analytic execution, operational BI

Presenter

Presentation Notes

▪ OLTP = the usual stuff like order entry, bank accounts, etc. ACID required
▪ Batch = what you’d expect: close the books, batch order picking, order redistribution
▪ Lite = things like create a web site account, comment, rate something; midpoint of the ACID spectrum
▪ Persistence = things like lite, but mainly for the management of shared state and at the far end of the ACID spectrum
▪ Ingest = in big or small chunks
▪ Query = usual, couple tables, simple semantics, etc.
▪ Simple = fetch
▪ Complex = ROLAP, nested, CSQs, many tables, many conditions, many join criteria, aggregations, sorts
▪ Analytic = computation

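Matching to parameters, at assumption of data scale

[Chart: technologies matched against workload parameters; the cell values are not recoverable from this extraction]

Workload parameters: Write‐biased, Read‐biased, Updateable data, Eventual consistency ok, Unpredictable query path, Compute intensive

Technologies: Standard RDBMS, Parallel RDBMS, NoSQL (kv, dht, obj), Hadoop*, Streaming database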

You see the problem: it’s an intersection of multiple parameters, and this chart only includes the first tier of parameters. Plus, workload factors can completely invert these general rules of thumb.

Presenter

Presentation Notes

What about: response time? Throughput? Selectivity? Retrieval? Query complexity? Computational complexity? Query latency?

▪ Determine scale needs: data size (volume)
▪ Determine concurrency: # users (concurrency)
▪ Determine response time and throughput needs

Then match these up to the things that work best.

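Matching to parameters, at assumption of data scale

[Chart: technologies matched against workload parameters; the cell values are not recoverable from this extraction]

Workload parameters: Complex queries, Selective queries, Low latency queries, High concurrency, High ingest rate

Technologies: Standard RDBMS, Parallel RDBMS, NoSQL (kv, dht, obj), Hadoop, Streaming database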

You have to look at the combination of workload factors: data scale, concurrency, latency & response time, then chart the parameters.
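One way to chart the parameters is to record, for each candidate platform, which workload parameters it handles well in your own testing, and then intersect that with what the workload needs. The sketch below assumes exactly that. The platform names and their capability sets are placeholders invented for illustration and say nothing about any real product or about the ratings in the original chart; `shortlist` is a hypothetical helper.

```python
from __future__ import annotations

# Parameter names taken from the chart's column headings.
PARAMETERS = frozenset({
    "complex_queries",
    "selective_queries",
    "low_latency_queries",
    "high_concurrency",
    "high_ingest_rate",
})


def shortlist(capabilities: dict[str, set[str]], needed: set[str]) -> list[str]:
    """Return the platforms whose capability set covers every needed parameter."""
    unknown = needed - PARAMETERS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return sorted(name for name, caps in capabilities.items() if needed <= caps)


# Placeholder ratings: fill these in from your own proof of concept.
capabilities = {
    "platform_a": {"complex_queries", "selective_queries"},
    "platform_b": {"selective_queries", "low_latency_queries",
                   "high_concurrency", "high_ingest_rate"},
    "platform_c": {"complex_queries", "high_ingest_rate"},
}

print(shortlist(capabilities, {"selective_queries", "high_concurrency"}))
print(shortlist(capabilities, {"complex_queries", "low_latency_queries"}))
```

With these placeholder ratings the second request returns an empty list: no single candidate covers the combination, so the choice comes down to trade-offs or to combining systems.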


Always build a proof of concept!


Image Attributions

Thanks to the people who supplied the images used in this presentation:

Holy Grail – © Monty Python Ltd.

Cupcakes – <lost attribution on Flickr>

rock‐fall‐roadblock.jpg ‐ http://www.flickr.com/photos/wsdot/4679360979/

roadblock‐sheep.jpg ‐ http://www.flickr.com/photos/brizo_the_scot/4013939756/


About the Presenter

Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, analytics and information management. Mark is an award-winning author, architect and former CTO whose work has been featured in numerous industry publications. During his career Mark received awards from the American Productivity & Quality Center, TDWI, Computerworld and the Smithsonian Institute. He is an international speaker, contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.

http://thirdnature.net/

About Third Nature

Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, and performance management. If your question is related to data, analytics, information strategy and technology infrastructure, then you’re in the right place.

Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.
