Guide to SQL to NoSQL migration

DESCRIPTION

Is your legacy database infrastructure struggling to meet the demands of customer Service Level Agreements? If you, like many companies, are discovering that your infrastructure is not robust enough to deal with the speed and scale required of today's Internet-scale applications, it may be time to consider a switch to NoSQL storage. Changing storage systems can be a daunting process and, with all the buzz surrounding NoSQL, it can be difficult to know where to start.

As a Solutions Architect at Thumbtack Technology, Anton Yazovskiy has helped many companies through the selection and deployment of NoSQL technologies. In this webinar, Anton will explain the main advantages of NoSQL and the common use cases in which a migration to NoSQL makes sense. You will learn the key questions to ask before migrating, as well as important differences in data modeling and architectural approaches. Finally, you will look at a typical application based on a Relational Database Management System (RDBMS) and migrate it to NoSQL step by step.

Key topics that will be covered:

• Why you would want to migrate to NoSQL

• Conceptual differences between RDBMS and NoSQL

• Data modeling and architectural best practices

• "I got it. But what exactly do I need to do?" Practical migration steps

ABOUT THE PRESENTER

Anton Yazovskiy is a Software Engineer at Thumbtack Technology, where he focuses on high-performance enterprise architecture. He has presented at a variety of IT conferences and “DevDays” on topics such as NoSQL and MarkLogic.

GUIDE TO SQL - NOSQL MIGRATION

Anton Yazovskiy Solution Architect, Thumbtack Technology

AGENDA

• Why would you want to migrate to NoSQL

• Conceptual difference between RDBMS and NoSQL

• Data modeling and architectural best practices

• Practical migration steps / questions you have to ask

WHY?

• scalability

• performance

• developer productivity

CONCEPTUAL DIFFERENCE BETWEEN RDBMS AND NOSQL

• a relational schema allows you to query data in many different ways in different contexts

• accessible for many types of applications and separate dev teams

• schema helps to control rules common for everybody

• always remember that in most cases you run queries across the cluster

• NoSQL is about focusing on particular need and goal

• model your data for specific use case

• define what you are willing to sacrifice to achieve better results

DATA MODELING AND ARCHITECTURAL BEST PRACTICES

POLYGLOT PERSISTENCE

• different solutions are designed to solve different problems

• session & fast transactions

• cache

• aggregations

• analytical ad-hoc queries

• graph traversal

• the requirements for OLTP and OLAP storages are very different

NOSQL DATA STRUCTURES

• Key-Value: Riak, Redis, MemcacheDB, Aerospike, and Amazon DynamoDB (cloud)

• Key-Document: MongoDB and Couchbase

• Column-Family: Cassandra and HBase

• Graph: Neo4j and OrientDB

PRACTICAL MIGRATION STEPS

• what would you like to achieve

• learn your traffic

• learn your data set

• what are you willing to sacrifice

• apply polyglot persistence

• model your data

• synchronization

WHAT WOULD YOU LIKE TO ACHIEVE

• better performance

• scale current solution

• process more and/or different data

• speed up development

• I heard of it

LEARN YOUR TRAFFIC

• what the workload looks like:

• OLTP (simple lookups, short transactions)

• OLAP (aggregations, analytical queries, ad-hoc scans, etc.)

• read-heavy, write-heavy

• what kinds of queries do you run to answer the application's questions:

• simple lookups, uncertain search, inner requests, traversal, BI/Analysis
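A first look at the read/write mix can come from a sample of the statements the application already sends to the RDBMS. The sketch below is only an illustration, assuming the sample is available as a list of SQL strings; the function name `summarize` and the sample statements are hypothetical.

```python
from collections import Counter

READ_VERBS = {"SELECT"}
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}


def summarize(statements):
    """Count reads, writes, and everything else by the leading SQL verb."""
    counts = Counter()
    for sql in statements:
        if not sql.strip():
            continue  # ignore blank lines in the sample
        verb = sql.split(None, 1)[0].upper()
        if verb in READ_VERBS:
            counts["read"] += 1
        elif verb in WRITE_VERBS:
            counts["write"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical sample taken from a query log.
sample = [
    "SELECT * FROM users WHERE id = 1",
    "  select name FROM users WHERE id = 2",
    "UPDATE users SET name = 'a' WHERE id = 1",
    "BEGIN",
]
summary = summarize(sample)
```

A mix dominated by reads or by writes points toward different storage choices, so it is worth measuring rather than guessing.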

LEARN YOUR DATA SET

• what kinds of data do you operate with

• simple key-value

• structured, semi-structured

• nested/hierarchical

• graph-oriented

• how much data of each type do you have

WHAT ARE YOU WILLING TO SACRIFICE

• what data doesn't require strong consistency

• where transactional guarantees aren't required

• what data are you willing to lose in case of hardware failure

• where are you willing to sacrifice joins

APPLY POLYGLOT PERSISTENCE

• Based on the answers you discovered, define the most obvious types of storage that you may need

• fast & simple storage for lookups, non-critical data and short transactions

• RDBMS for data that fits on a single server

• document-oriented storage for nested/hierarchical data and aggregate-oriented reads & writes

• graph-oriented storage for traversal queries, social relations, etc.

• highly-scalable storage for BigData background processing

DEFINE A DATA MODEL

DATA MODELING: BEFORE YOU START

• from “what data do I have” to “what questions do I have”

• denormalization & duplication are your best friends

• hierarchical and embedded structures make your life easier, but they can also be your worst enemy

REFERENCES

• in-application joins

• nothing to be ashamed about

• apply carefully

{ user_name: ayazovskiy, contact: {..}, access: { level: 523, group: dev } }

{ access_level: 523, rules: [...] }
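An in-application join over the two documents above might look like the following minimal sketch. It assumes plain Python dicts stand in for the two collections; the names `users`, `access_rules`, and `load_user_with_rules`, as well as the rule values, are hypothetical and only illustrate the idea.

```python
# Hypothetical in-memory stand-ins for two collections in a document store.
users = {
    "ayazovskiy": {
        "user_name": "ayazovskiy",
        "access": {"level": 523, "group": "dev"},
    },
}
access_rules = {
    523: {"access_level": 523, "rules": ["read", "write"]},  # illustrative values
}


def load_user_with_rules(user_name):
    """Resolve the reference in application code: two key lookups, no JOIN."""
    user = users[user_name]                              # first lookup
    level = user["access"]["level"]                      # reference kept in the user
    rules_doc = access_rules.get(level, {"rules": []})   # second lookup
    return {**user, "rules": rules_doc["rules"]}


profile = load_user_with_rules("ayazovskiy")
```

Each resolved reference costs an extra round trip to the store, which is one reason to apply references carefully.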

DUPLICATION

• Duplication is a technique of copying pieces of data between structures in order to either optimize query processing time or convert data into a particular business model.

• The main advantages of denormalization are the ability to:

1. reduce the number of I/O operations and query time

2. reduce complexity of query processing in distributed systems
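The trade-off can be sketched in a few lines. The example below is an illustration under assumptions, not a prescribed model: the `customers` and `orders` dicts and both function names are hypothetical, and the customer name is copied into each order so that an order can be rendered with a single read.

```python
# Hypothetical data: the customer's name is duplicated into every order.
customers = {"c1": {"name": "Alice"}}
orders = {
    "o1": {"customer_id": "c1", "customer_name": "Alice", "total": 30},
    "o2": {"customer_id": "c1", "customer_name": "Alice", "total": 45},
}


def render_order(order_id):
    """One read: everything needed is inside the order document."""
    order = orders[order_id]
    return f"{order['customer_name']}: {order['total']}"


def rename_customer(customer_id, new_name):
    """The price of duplication: every copy has to be rewritten."""
    customers[customer_id]["name"] = new_name
    touched = 0
    for order in orders.values():
        if order["customer_id"] == customer_id:
            order["customer_name"] = new_name
            touched += 1
    return touched
```

Reads stay cheap because no second lookup is needed, while a rename has to touch every order that carries the copy.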

AGGREGATES

• simplify data processing logic

• optimize read/write time

• ability to distribute the data across the cluster

• reduce # of requests across the cluster

• perform atomic updates

{ user_name: ayazovskiy, contact: { phone: 123, email: @thumbtack.net }, access: { level: 5, group: dev } }

AGGREGATES

• updates of duplicated data are heavy and complex

• querying across aggregates is heavy and complex

COUNTERS

• NoSQL auto-increment analog

• distributed consistent auto-increment is tricky

• counters aren't always reliable *
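One common workaround, which is not taken from the slides, is to let each application node reserve a block of ids from a single atomic counter and hand them out locally. The sketch below assumes the store offers some atomic increment; `CentralCounter` simulates it with a lock, and both class names are hypothetical.

```python
import threading


class CentralCounter:
    """Stand-in for a store-side atomic increment (hypothetical)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def reserve(self, block_size):
        """Atomically reserve a block of ids and return the first one."""
        with self._lock:
            start = self._value + 1
            self._value += block_size
            return start


class BlockIdGenerator:
    """Hands out ids from a locally reserved block, one per application node."""

    def __init__(self, counter, block_size=100):
        self._counter = counter
        self._block_size = block_size
        self._next = 0
        self._end = -1  # empty block forces a reservation on first use

    def next_id(self):
        if self._next > self._end:
            self._next = self._counter.reserve(self._block_size)
            self._end = self._next + self._block_size - 1
        value = self._next
        self._next += 1
        return value
```

Ids produced this way are unique but not ordered across nodes, and a node that crashes leaves a gap, so the scheme trades strict sequencing for fewer round trips to the counter.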

COMPOSITE KEYS

{
  "ID": "chat#user_1#user_2#december_12_2014",
  "messages": [
    { "user_1": "hey" },
    { "user_1": "how is going?" },
    { "user_2": "thanks, pretty well!" }
  ]
}
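Building and parsing such a key is straightforward. In the sketch below, the helper names `chat_key` and `parse_chat_key` are hypothetical, and sorting the two participants is an added assumption so that both users arrive at the same key regardless of who starts the conversation.

```python
SEPARATOR = "#"


def chat_key(user_a, user_b, day):
    """Build a key such as chat#user_1#user_2#december_12_2014."""
    first, second = sorted([user_a, user_b])  # assumption: order-independent key
    return SEPARATOR.join(["chat", first, second, day])


def parse_chat_key(key):
    """Split a chat key back into its parts."""
    kind, first, second, day = key.split(SEPARATOR)
    return {"kind": kind, "users": (first, second), "day": day}


key = chat_key("user_2", "user_1", "december_12_2014")
```

Because the key encodes everything needed to find the document, a day's conversation is fetched with a single lookup by key and no secondary index. The separator must not appear inside any component.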

APPEND

{
  ID: account#User_A,
  account_total: $100,
  account_total_calculation_time: ..,
  changes_since_last_calculation: [
    1399493200: +$10,
    1399892139: -$25
  ]
}
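The append pattern in the document above can be sketched as follows. This is a minimal illustration: a Python dict stands in for the stored document, amounts are plain integers, and the function names `append_change`, `current_total`, and `compact` are hypothetical.

```python
# Dict stand-in for the account document; amounts are integers for simplicity.
account = {
    "ID": "account#User_A",
    "account_total": 100,
    "changes_since_last_calculation": [
        (1399493200, +10),
        (1399892139, -25),
    ],
}


def append_change(doc, timestamp, amount):
    """Writes only append a change; the stored total is left untouched."""
    doc["changes_since_last_calculation"].append((timestamp, amount))


def current_total(doc):
    """Reads fold the pending changes into the last calculated total."""
    pending = sum(amount for _, amount in doc["changes_since_last_calculation"])
    return doc["account_total"] + pending


def compact(doc, now):
    """Periodically fold pending changes into the total and clear the list."""
    doc["account_total"] = current_total(doc)
    doc["account_total_calculation_time"] = now
    doc["changes_since_last_calculation"] = []
```

Writers never overwrite the total, so concurrent appends do not fight over a single value, and the total can be recalculated in the background.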

THINK OF DATA SYNCHRONIZATION

• application-level synchronization:

• e.g. update a user profile in document-oriented storage, its social network in graph storage, and its session in a key-value cache

• regular synchronization:

• this may be an hourly/daily/weekly process that takes updated data and propagates it across the system

• incremental background synchronization

• solutions like Tungsten Replicator allow you to track changes in an RDBMS via its transaction log and apply these changes immediately to NoSQL storage

• e.g. user profiles in MySQL synchronized with Aerospike via a properly configured Tungsten Replicator
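The idea behind incremental synchronization can be sketched without any particular tool. The example below is not Tungsten Replicator's API. It assumes a hypothetical change log with sequence numbers, as a replicator might derive from a transaction log, and applies only the entries that the target has not seen yet.

```python
# Hypothetical change log; a real replicator would read it from the RDBMS.
change_log = [
    {"seq": 1, "op": "upsert", "key": "user#1", "value": {"name": "Ann"}},
    {"seq": 2, "op": "upsert", "key": "user#2", "value": {"name": "Bob"}},
    {"seq": 3, "op": "delete", "key": "user#1", "value": None},
]


def apply_changes(log, target, last_applied_seq):
    """Apply entries newer than last_applied_seq and return the new position."""
    for entry in log:
        if entry["seq"] <= last_applied_seq:
            continue  # already applied, so re-running is safe
        if entry["op"] == "upsert":
            target[entry["key"]] = entry["value"]
        elif entry["op"] == "delete":
            target.pop(entry["key"], None)
        last_applied_seq = entry["seq"]
    return last_applied_seq
```

Storing the last applied position alongside the target data lets the process resume after a restart without replaying the whole log.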

“always remember that in most cases you run queries across the cluster”

–Anton Yazovskiy

Any questions?

Thank you

@yazovsky ayazovksiy@thumbtack.net www.thumbtack.net

THANKS / REFERENCES

• NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, by Pramod J. Sadalage and Martin Fowler

• NoSQL Data Modeling Techniques (http://highlyscalable.wordpress.com)

• MongoDB documentation (http://docs.mongodb.org)

• Couchbase documentation (http://docs.couchbase.com)

• FoundationDB Blog (http://blog.foundationdb.com)
