WWW.OSTUSA.COM

DATABASES FOR BIG DATA

EVOLUTION OF NoSQL DATABASES and CONCEPTS

Bhaskar Gunda, Open Systems Technologies

Agenda
• About Me
• Introduction to DBMS – History and Evolution
• RDBMS concepts
• Overview of Big Data
• Boundaries of RDBMS: Need for DBMS beyond RDBMS
• Paradigm Shift in DBMS
• NoSQL Databases – Definition, Advantages and breaking boundaries
• Types of NoSQL Databases and their Usage
• Future of RDBMS

About Me

• Bhaskar Gunda, Principal Consultant at Open Systems Technologies
• 28 years of IT experience
• Electrical engineer with an MBA
• Started working with computers while in college, building microprocessor-based systems such as logic controllers on Intel 8085 and Z-80 systems using assembly language
• Started my career with databases
– The first databases I ever worked with were dBase III and dBase IV.
– The first commercial database I worked with was Sybase.
– But I transitioned almost immediately to Oracle: I was trained on 4.0, but started using it from 5.0 onwards.
• Still working with Oracle and many other databases: SQL Server, Informix, PostgreSQL, MySQL
• Started working with NoSQL databases a couple of years ago
• I specialize in building HA and DR systems, end-to-end infrastructure design, implementations and migrations.

About Today’s Presentation

• NoSQL databases are gaining momentum.
• But there is some confusion over their concepts and the different types of NoSQL databases.
• Originally I planned to focus only on NoSQL concepts in this presentation.
• But keeping a broader audience in mind, I have also included some Database 101 concepts.
• I have tried my best to put everything together in a format that flows logically.
• As this is not an interactive presentation, I welcome your feedback and any questions through email.
• I will do my best to answer your questions through email.
• My contact info is provided at the end of the presentation.

Data and Information

• Data can be defined as discrete elements describing a person, thing or activity.
• Information is putting this data together to form a meaningful inference:
– Querying what is there: a simple way of displaying the data, perhaps in a spreadsheet or tabular format
– Visualizing data in a format that can be understood easily: dashboards, graphs, charts, etc.
– Making some meaningful analysis: historical analysis, incident analysis, post-mortem analysis, predictive analysis
• Data and information are often used interchangeably, which is not correct.
– Data is a discrete element; information is a simple or complex compound of these elements.
– Data is generated, sourced, gathered or acquired on its own; information is generated from data.
• Database Management System (DBMS)
– A database is a location where data is stored in a certain format.
– A DBMS is a collection of programs that allows users to specify the structure of a database, to create, query and modify the data in it, and to control access to it.

Data and Information

• A simple and easy way to understand this is to use a Lego analogy.

– Data is like Lego blocks.

– Information is putting these Lego Blocks together to form a thing.

– And a person who puts everything together is a Data Scientist

POWER OF DATA

• Old saying: the PEN is MIGHTIER than the SWORD.

• Modern saying: DATA is MIGHTIER than the PEN and the SWORD.

• Companies like Yahoo, Google, Facebook, Twitter, LinkedIn and many others are built on using data in a meaningful way, doing business with data and information. They have completely changed the relationships among people, and how people communicate and interact with each other. A term has been coined for this: social networks.

• Companies like Amazon and Alibaba (the largest e-commerce portals) are successful because they mine data to understand consumer behavior.

History of DBMS and Evolution

• Databases have a long history and have evolved through different models from the early 1960s until now.
– Minimal or no-format databases (no frills), pre-1960s: these were like writing a transaction on paper, except that it was stored in a computer.
– Hierarchical database model, early 1960s: data is stored in units with hierarchical relationships.
– Network database model, late 1960s: multiple relationships can be created among records.
– Relational Database Management Systems (RDBMS), early 1970s: organize data into entities and relationships, based on E. F. Codd's relational model and his 12 rules.
– NoSQL databases, 2009: deviate from the relational model and introduce new methods of storing data.

Paper/Shared

HIERARCHICAL DATABASE MODEL

NETWORK DATABASE MODEL

Relational Database Management System (RDBMS)

• The most popular database system
• Developed by E. F. Codd in the early 1970s
• The database is based on 12 principles (rules) developed by E. F. Codd.
• It is based on entities and relationships.
• Data is arranged in databases consisting of tables, in row and column format.
• Data storage is optimized through normalization.
• Data in tables is bound by rules called constraints, which enforce the integrity of data across the database.
• Tables are arranged in schemas with access controls.
• An RDBMS is ACID compliant.
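The role of constraints can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the `department` and `employee` tables and the `try_insert` helper are hypothetical names, not part of the presentation, and SQLite stands in for any RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: every employee row must reference an existing department.
conn.execute(
    "CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
conn.execute(
    """CREATE TABLE employee (
           emp_id    INTEGER PRIMARY KEY,
           last_name TEXT NOT NULL,
           dept_id   INTEGER NOT NULL REFERENCES department(dept_id))"""
)
conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (1, 'Doe', 1)")


def try_insert(sql):
    """Run one INSERT and report whether the constraints accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


orphan = try_insert("INSERT INTO employee VALUES (2, 'Smith', 99)")    # no department 99
duplicate = try_insert("INSERT INTO employee VALUES (1, 'Brown', 1)")  # emp_id 1 already used
valid = try_insert("INSERT INTO employee VALUES (3, 'Taylor', 1)")
print(orphan, duplicate, valid)
```

The database itself, not the application, refuses the orphan row and the duplicate key, which is what "integrity of data across the database" means in practice.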

ACID - Defined

• ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably.

• Atomicity -- Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen.

• Consistency -- Consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors cannot result in the violation of any defined rules.

• Isolation -- Isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e., one after the other. Providing isolation is the main goal of concurrency control. Depending on the concurrency control method (i.e., whether it uses strict, as opposed to relaxed, serializability), the effects of an incomplete transaction might not even be visible to another transaction.

• Durability -- Durability property ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in a non-volatile memory.
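Atomicity and consistency can be illustrated with a small sketch, again using Python's built-in sqlite3 module as a stand-in RDBMS. The `account` table and the `transfer` helper are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the "defined rule" that every valid state must satisfy.
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # debit breaks the CHECK, credit is undone
final = balances(conn)
print(ok, failed, final)
```

In the failed transfer the credit to `bob` has already run when the debit violates the rule, yet the rollback leaves no trace of it: the transaction is all or nothing.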

Structured Query Language (SQL)

• A special-purpose programming language designed for managing data in an RDBMS
• Developed by IBM in the 1970s
• SQL is a fourth-generation language (4GL).
• SQL is based on relational algebra and tuple relational calculus.
• It consists of DML (Data Manipulation Language), DCL (Data Control Language) and DDL (Data Definition Language).
• RDBMS and SQL are closely tied to each other.

DBMS ARCHITECTURE

PHYSICAL LAYER (represents how data is stored on the storage devices)

LOGICAL LAYER (represents how data is accessed by the users: schema, tables)

VIEW | VIEW | VIEW (represents how data is portrayed, using interface languages such as SQL)

RDBMS Concept

Row Format Storage

001,1,Doe,John,3000;002,2,Smith,Jane,3500;003,3,Taylor,John,2800;004,4,Smith,Mike,2500;005,5,Doak,Richard,4000;006,6,Brown,Dan,3500

ROWID  ID  Last    First    Bonus
001    1   Doe     John     3000
002    2   Smith   Jane     3500
003    3   Taylor  John     2800
004    4   Smith   Mike     2500
005    5   Doak    Richard  4000
006    6   Brown   Dan      3500

ROWID: unique. ID: unique values. Last, First, Bonus: possible duplicate contents.
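As a rough sketch of what row format storage means, the snippet below serializes the slide's six sample rows one complete row after another and checks which columns repeat values. The function names and the comma/semicolon separators are taken from or assumed for this slide only; real engines use binary page formats.

```python
# Sample rows from the slide: (ROWID, ID, Last, First, Bonus).
COLUMNS = ("ROWID", "ID", "Last", "First", "Bonus")
rows = [
    ("001", 1, "Doe", "John", 3000),
    ("002", 2, "Smith", "Jane", 3500),
    ("003", 3, "Taylor", "John", 2800),
    ("004", 4, "Smith", "Mike", 2500),
    ("005", 5, "Doak", "Richard", 4000),
    ("006", 6, "Brown", "Dan", 3500),
]


def to_row_format(rows):
    """Store each row contiguously: fields joined by ',', rows joined by ';'."""
    return ";".join(",".join(str(value) for value in row) for row in rows)


def columns_with_duplicates(rows):
    """Return the names of columns in which at least one value repeats."""
    result = []
    for index, name in enumerate(COLUMNS):
        values = [row[index] for row in rows]
        if len(set(values)) < len(values):
            result.append(name)
    return result


stored = to_row_format(rows)
repeating = columns_with_duplicates(rows)
print(stored)
print(repeating)
```

Reading one employee means reading one contiguous record, which suits transactional lookups; ROWID and ID stay unique while the other columns may repeat.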

RDBMS Advantages

• Very popular: almost all ERPs and many mainstream applications run on an RDBMS.
• Integrity and consistency of data, and a simple representation of the data layout: tables and constraints at the schema level
• Physical independence: users do not need to worry about the physical layer; they interact only with the logical layer.
• Logical independence: makes the database portable across physical layers, so applications and users are unaffected most of the time.
• Support for SQL
• Better backup and restore capabilities

RDBMS Disadvantages

• Expensive and complex software
• Expensive hardware
• Highly skilled resources are required for setup and management.
• Difficult to recover data if lost
• Horizontal scalability is limited
• Only vertically scalable
• Very difficult to use many complex data types
• Does not completely represent real-world conditions
• Data processing becomes slow as data size increases, and sometimes even at smaller data sizes, because of changing data-handling algorithms.
• Very limited support for 3GLs, so procedural handling of data is not easy.

EXPLOSION OF DATA

• With the advent of social networks, increased utilization of computers and widespread use of the Internet, the data in the world is growing at a tremendous pace.
• Oracle carried out a study to estimate data growth and the current volume of data in the world from all sources, and found the following:
– Data is growing at a very fast pace, at an annually compounded rate of 40%.
– It is almost doubling every year, or may grow even faster in the next few years.
– At the current rate of growth it will reach about 45 zettabytes (ZB) by 2020 (1 zettabyte = 10^21 bytes, or 1 trillion GB).
– The amount of data that exists today is 2 times what it was 2 years ago.
• Because of the increase in data sources such as social networks, the Internet of Things (IoT) and healthcare, different data types are being generated.
• All of the above factors have started to limit the use of RDBMS.
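The growth figures above can be checked with a few lines of arithmetic. This is only a worked calculation from the 40% rate quoted on the slide; the function names are chosen for this example.

```python
import math

ANNUAL_GROWTH = 0.40  # annually compounded rate quoted on the slide


def growth_factor(years, rate=ANNUAL_GROWTH):
    """How many times larger the data is after the given number of years."""
    return (1 + rate) ** years


def doubling_time(rate=ANNUAL_GROWTH):
    """Years needed for the data volume to double at a compounded rate."""
    return math.log(2) / math.log(1 + rate)


two_year_factor = growth_factor(2)  # 1.4 * 1.4
years_to_double = doubling_time()

# Unit check for the zettabyte definition: 10^21 bytes expressed in GB (10^9 bytes).
gb_per_zb = 10**21 // 10**9

print(round(two_year_factor, 2), round(years_to_double, 2), gb_per_zb)
```

At 40% compounded, the volume grows by a factor of about 1.96 in two years, so it doubles roughly every two years. That agrees with the bullet saying today's data is 2 times what it was 2 years ago; doubling every single year would need a rate near 100%.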

BIG Data Challenges and RDBMS Limitations

BIG DATA CHALLENGE and the corresponding RDBMS LIMITATION

• High Velocity: data is generated at a very high speed and must be ingested.
– RDBMS limitation: it is not easy to configure an RDBMS for a high rate of data ingestion. It requires many resources and hence high-cost software and hardware.

• High Variety: the data generated is of different data types; no particular format or data type can be defined for certain data sources such as social networks. Data may be structured, semi-structured or unstructured.
– RDBMS limitation: an RDBMS has only certain data types. Others have to be defined, but defining and maintaining them to meet current requirements is very expensive, and they still do not blend in properly.

• High Volume: data is often generated in high volume.
– RDBMS limitation: an RDBMS is limited in ingesting large amounts of data. Enabling it takes more resources, more licenses and more cost.

• High Veracity: uncertain and uncleansed data.
– RDBMS limitation: an RDBMS has to be designed to handle peak loads even if they do not always occur, and prior cleansing is required, which makes the data difficult to handle and the cost prohibitive.

• Continuous Data and Availability
– RDBMS limitation: an RDBMS requires a huge investment to achieve very high HA and DR capabilities, and RTO and RPO targets are still not fully met.

• Location Independence: the ability to read and write to a database regardless of where that I/O operation physically occurs, and to have any write propagated out from that location so that it is available to users and machines at other sites.
– RDBMS limitation: an RDBMS hits the limit of this functionality. We cannot have multiple nodes writing to multiple places and still have data concurrency. Oracle RAC provides distributed computing, but not distributed copies of the database at the same time.

• Flexible Data Models: not tied to any principles or schema.
– RDBMS limitation: an RDBMS hits the wall if any of its principles are deviated from; it cannot create a schemaless, dependency-free model.

• Faster Analytics and Business Intelligence
– RDBMS limitation: an RDBMS again hits the limit on performance and scalability when it comes to real-time analytics and business intelligence.

Paradigm Shift in Database Management

• Organizations are increasingly conceding that exploiting their big data will be a major factor in competitiveness over the next decade.
• We are trying to solve today's problems with yesterday's solutions.
• RDBMS is not the solution for anything and everything.
• Big Data analytics does not need RDBMS methodology. To a certain extent ACID can either be compromised or be taken care of at the source, and hence need not additionally be enforced in the database.
• The option should be a highly scalable, low-cost solution, which rules out RDBMS: commercial RDBMS products are typically proprietary systems with huge software costs.
• SQL is not always the method to extract data, yet RDBMS and SQL are inseparable.
• Most organizations have started to cross the chasm from RDBMS to NoSQL databases.

NoSQL Databases

• "NoSQL database" is a buzzword in the modern database technology world.
• The term NoSQL was coined by Carlo Strozzi in 1998 to name his lightweight, open-source relational database, Strozzi NoSQL, which did not expose the standard SQL interface but was still relational.
• The term has since changed its original meaning, or rather added to Carlo Strozzi's original concept of interacting with a database without the standard SQL interface.
• Today's concept is that decoupling SQL from the RDBMS means changing the RDBMS methodology itself.
• Hence NoSQL database means "Not Only SQL" database, in other words a concept that goes beyond RDBMS.
• NoSQL databases are sometimes called "non-relational" or "non-SQL", but in my opinion that is not completely true. The term means going beyond the use of SQL only: a shift in the way data is stored and managed, and a new breed of DBMS.

Birth of NoSQL

• Johan Oskarsson of Last.fm reintroduced the term NoSQL in early 2009 when he organized an event to discuss "open source distributed, non relational databases".
• The concepts of Hadoop and open source opened the doors to a world of innovation in database management systems, looking beyond RDBMS.
• One of the early NoSQL database entries was Google Bigtable.
• The keys to developing the concept of the NoSQL database were distributed processing, horizontal scalability, use of cheap commodity hardware, and speed of analytics using 3GLs and other languages, not just the 4GL SQL.

Benefits of NoSQL Database

• NoSQL databases have different models and are purpose-built.
• Compared to RDBMS, NoSQL databases are more scalable and provide superior performance.
• Large volumes of rapidly changing, semi-structured and unstructured data can be handled easily.
• Helps with agile sprints, quick schema iteration and frequent code pushes
• Object-oriented programming that is easy to use and flexible
• Geographically distributed scale-out architecture
• All the challenges described for Big Data are addressed by NoSQL databases.

NoSQL Database Concepts

• Open source
• Schemaless
• Scalability through scale-out on commodity-class hardware
• Distribution and sharding: parallel query with engines such as MapReduce and Spark, distributed caches
• Data ingestion and extraction using multiple methods
• Eventual consistency
• High availability

NoSQL Concepts – Open Source

• Most NoSQL databases are open source, for example HBase and CouchDB.
• Many vendors today offer commercial databases with support, for example MongoDB, Vertica and Couchbase Server.
• Some vendors have built their offering on top of open source, for example Splice Machine.
• Almost all of these databases are integrated with many open-source tools.
• They layer on top of some of the Big Data environments, or utilize the tools and concepts already in place in the Big Data ecosystem.
• They do not require a SQL engine; however, many vendors have developed SQL-like products that translate queries into the built-in distribution processes.

NoSQL Concepts – Schemaless

• This is very hard to conceptualize when coming from the RDBMS world.
• NoSQL solutions do not require, or accept, a pre-planned data model in which every record has the same fields and each field of a table has to be accounted for in each record.
• They support a flexible data model. Though there can be strong similarities from record to record, there is no "carry-over" from one record to the next.
• Each field is encoded with JavaScript Object Notation (JSON) or Extensible Markup Language (XML), according to the solution's architecture.
• The result is that developers have the agility they need to meet evolving business requirements.
• Because of this model, data can be loaded without transformation. Transformation occurs while extracting the data: ELT (extract, load, transform), versus ETL (extract, transform, load) in RDBMS. This is very useful in building data warehouse systems.
• The schema is built on query.
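The idea of "schema built on query" can be sketched in a few lines of plain Python. The records, field names and the `query` helper below are made up for illustration; a real document store would add indexing, persistence and a query language.

```python
import json

# Hypothetical raw records: no two documents share the same set of fields.
raw_records = [
    '{"id": 1, "last": "Doe", "first": "John", "bonus": 3000}',
    '{"id": 2, "last": "Smith", "email": "jane@example.com"}',
    '{"id": 3, "last": "Taylor", "bonus": 2800, "skills": ["sql", "python"]}',
]

# Load step of ELT: store every document as-is, with no up-front transformation.
store = [json.loads(record) for record in raw_records]


def query(store, fields, default=None):
    """Transform step: the caller picks the fields, so the schema appears at read time."""
    return [{field: doc.get(field, default) for field in fields} for doc in store]


bonus_view = query(store, ["id", "bonus"], default=0)
total_bonus = sum(row["bonus"] for row in bonus_view)
all_fields = sorted({key for doc in store for key in doc})
print(bonus_view, total_bonus, all_fields)
```

A document with no `bonus` field is still accepted on load; the missing value is only dealt with when a query asks for it.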

NoSQL Database Architecture

PHYSICAL LAYER (represents how data is stored on the storage devices)

LOGICAL LAYER (represents how data is accessed by the users: schema)

VIEW | VIEW | VIEW (represents how data is portrayed, using interface languages such as SQL or Python, or tools like Tableau or Qlik)

The view and logical layers are merged. The logical layer becomes part of data visualization; in other words, a schema is built upon query.

NoSQL Concepts – Scalability with Scale Out

• NoSQL databases are scalable with the scale-out model.
• NoSQL solutions support a scale-out model for growth by dividing the processing of a single data set across many machines.
• While relational databases are engineered to scale up by adding resources to the server, NoSQL databases are engineered to scale by adding servers or nodes: a distributed processing model.
• This concept is taken from Hadoop, but NoSQL databases do not necessarily require Hadoop infrastructure in the background.
• Like Hadoop, NoSQL databases can run on commodity-class hardware and do not require the high-end infrastructure that an RDBMS does.
• There is no fixed limit to the number of servers that NoSQL databases can run on.

NoSQL Concepts – Distribution with Sharding

• These databases are engineered to run across multiple servers.
• NoSQL solutions use a partitioning pattern known as SHARDING, which places each partition on a potentially separate server, possibly in a physically separate location.
• The result is that each server is responsible for operating on its own data instead of all of the data.
• This helps scalability with scale-out, as discussed.
• This model helps in running parallel query operations using Big Data engines such as MapReduce or Spark.
• Sharding is implemented using a distributed cache model.
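One common way to place records on shards is to hash the key. The sketch below assumes hash-based placement and uses in-memory dictionaries to stand in for servers; `shard_for` and `ShardedStore` are hypothetical names, and actual products each have their own partitioning schemes.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Each dict stands in for one server that holds only its own partition."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # Only the shard that owns the key is consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def sizes(self):
        return [len(shard) for shard in self.shards]


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"user:{i}", {"id": i})

found = store.get("user:42")
sizes = store.sizes()
print(found, sizes)
```

Every lookup touches a single shard, so each server operates only on its own share of the data. Note that simple modulo placement moves most keys when the shard count changes, which is why production systems often use consistent hashing or range partitioning instead.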

Distributed Processing between RDBMS & NoSQL

Distributed Processing in RDBMS
1. Single copy of the database.
2. Possible block-level contention.
3. If the same block is accessed, then the entire record or page will be locked.

Distributed Processing in NoSQL DB
1. Multiple copies of the database.
2. Blocks are distributed across machines and hence will not lock each other.
3. Only the block level is locked, so the entire record is not locked.
4. An added benefit is higher availability.

NoSQL Concepts – Data Ingestion and Extraction

• Most NoSQL databases support many data ingestion tools in the Big Data ecosystem, such as Flume, Sqoop and Spark Streaming.
• Data is extracted using many methods, not necessarily SQL. Some mainstream vendors have built their own implementations of SQL to jump-start the process, but the actual power lies in using programming languages such as Java, Python, Scala and R.
• If the SQL method is used, then in the background the SQL jobs are split into multiple processes spread across different nodes, much like MapReduce or Spark. Some databases are built on top of MapReduce or Spark, and their queries are submitted as MapReduce or Spark jobs.
• Data visualization tools such as Tableau or Qlik support most NoSQL databases.
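How a SQL-style aggregate might be split across nodes can be sketched as a map step per partition followed by a reduce step. This is a toy model under assumptions: the partitions reuse the sample names and bonuses from the RDBMS slide, threads stand in for separate nodes, and the function names are invented here; it does not show how any particular engine plans a query.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of (last_name, bonus) rows, one list per node.
partitions = [
    [("Doe", 3000), ("Smith", 3500)],
    [("Taylor", 2800), ("Smith", 2500)],
    [("Doak", 4000), ("Brown", 3500)],
]


def map_partition(rows):
    """Map step: each node totals the bonuses in its own partition only."""
    partial = Counter()
    for last_name, bonus in rows:
        partial[last_name] += bonus
    return partial


def reduce_partials(partials):
    """Reduce step: merge the per-node partial totals into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Roughly: SELECT last_name, SUM(bonus) FROM employee GROUP BY last_name
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = list(pool.map(map_partition, partitions))

result = reduce_partials(partials)
print(result)
```

The two Smith rows live on different nodes, so neither node alone knows the full total; only the reduce step combines them.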

NoSQL Database Concepts – Eventual Consistency

• This is another concept that is very hard to visualize.
• In the RDBMS world we are used to data consistency based on ACID.
• But some NoSQL solutions still do not have strong consistency the way a single-machine system does.
• Each record will be consistent, but transactions are usually guaranteed only to be "eventually consistent", which means changes to data may be staggered across copies for a short period of time, in exchange for lower latency in the write operation.
• Sometimes CONSISTENCY can be compromised, depending on the application using the database: for example predictive analytics or running what-if scenarios.
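Eventual consistency can be pictured with a toy model in which a write is acknowledged by one replica and copied to the others later. The class below is an assumption-laden sketch (a single writer, an in-memory queue for replication, no conflicts), not the protocol of any specific database.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: replica 0 acknowledges writes, the others catch up later."""

    def __init__(self, num_replicas):
        self.replicas = [dict() for _ in range(num_replicas)]
        self.pending = deque()  # queued (replica_index, key, value) updates

    def write(self, key, value):
        # Low write latency: return as soon as one replica has the value.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def propagate(self):
        """Background replication: apply every queued update."""
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value


store = EventuallyConsistentStore(num_replicas=3)
store.write("bonus:1", 3000)

fresh = store.read("bonus:1", replica=0)  # the replica that took the write
stale = store.read("bonus:1", replica=2)  # a replica that has not caught up yet
store.propagate()
converged = [store.read("bonus:1", replica=r) for r in range(3)]
print(fresh, stale, converged)
```

Between the write and the propagation, a reader may see the old state on some replicas. For what-if analysis that window is often acceptable; for an account balance it usually is not.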

NoSQL Database Concepts – High Availability

• By virtue of the design, high availability is built into NoSQL databases.
• No extra effort or software is required for this purpose.
• Data is distributed across multiple nodes with multiple copies, much like Hadoop infrastructure.
• Failure of any single node in the cluster does not cause data loss or processing failure.
• Once the failed hardware is replaced or brought back online, the data on that node is automatically synchronized from the changed blocks on the other nodes.
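The failure and resynchronization behaviour described above can be sketched with a small in-memory cluster. Everything here is hypothetical: the `Cluster` class, the two-copy placement rule and the whole-key resync are simplifications chosen for the example, whereas real systems track changed blocks and handle conflicting versions.

```python
class Cluster:
    """Toy cluster: every key is stored on `copies` consecutive nodes."""

    def __init__(self, num_nodes, copies=2):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.up = [True] * num_nodes
        self.copies = copies

    def owners(self, key):
        first = sum(key.encode("utf-8")) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.copies)]

    def put(self, key, value):
        live = [n for n in self.owners(key) if self.up[n]]
        if not live:
            raise RuntimeError("no live replica for key")
        for n in live:
            self.nodes[n][key] = value

    def get(self, key):
        for n in self.owners(key):
            if self.up[n] and key in self.nodes[n]:
                return self.nodes[n][key]
        raise KeyError(key)

    def fail(self, n):
        self.up[n] = False
        self.nodes[n].clear()  # simulate replacing the failed hardware

    def recover(self, n):
        """Bring node n back and copy its keys from the surviving replicas."""
        self.up[n] = True
        for other, data in enumerate(self.nodes):
            if other != n and self.up[other]:
                for key, value in data.items():
                    if n in self.owners(key):
                        self.nodes[n][key] = value


cluster = Cluster(num_nodes=3, copies=2)
keys = [f"k{i}" for i in range(10)]
for i, key in enumerate(keys):
    cluster.put(key, i)

cluster.fail(1)
reads_during_outage = [cluster.get(key) for key in keys]  # served by other copies
cluster.put("k10", 10)                                    # write while node 1 is down
cluster.recover(1)
restored = set(cluster.nodes[1])
print(reads_during_outage, sorted(restored))
```

With two copies per key, losing one node loses no data, and the recovered node also picks up the write it missed while it was offline.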

NoSQL DBMS Applications

• With the questions about ACID compliance, schemaless design, SQL support and so on, the question arises of where exactly a NoSQL database can be used.
• What types of applications are supported on a NoSQL database?
• NoSQL databases are mostly deployed for ad-hoc query purposes, not for OLTP. (Some vendors are coming out with ACID compliance and OLTP support, but largely these databases are not used for OLTP.)
• Primary applications: data warehouse, BI, predictive analytics, Big Data applications.
• Data warehouse and BI applications benefit most from NoSQL databases, as they reduce hardware and software cost and increase processing throughput; best of all, they use ELT rather than ETL.


NoSQL Database Types

• Not all NoSQL databases are designed alike.

• There are different types of NoSQL databases, based on how they store data.

• Types of NoSQL databases are –

– Columnar database stores

– Key-Value database stores

– Document database stores

– Graph database stores

– Multi-model database stores


COLUMNAR DATABASE Store

• One of the most popular models is the columnar database model, as it is the closest to RDBMS.

• It is a DBMS that stores data tables as sections of columns of data rather than as rows of data (unlike RDBMS, where data is stored in rows). This is explained in the next slide.

• Data is compressed by eliminating duplicate values in the columns. On top of that, popular compression schemes such as the LZW (Lempel-Ziv-Welch) algorithm or run-length encoding are applied.

• Compression is further enhanced by sorting the data in the columns.

• Some of the most popular databases of this model are –

– HP Vertica, HBase, Cassandra, Accumulo, BigTable, Splice Machine

• SAP HANA is one of the popular columnar database stores, but it is designed to support only SAP applications and is very expensive. SAP announced at the beginning of last year (2015) that its entire ERP suite (OLTP and batch processing), SAP S/4HANA, would be supported on HANA.

• The most common uses of this model are – Clinical Data processing, Data Warehouse & BI, library card catalogs, and ad-hoc queries that aggregate a small set of columns over a large number of rows.
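Why sorting helps compression can be seen with run-length encoding on a single column. The sketch below is illustrative only: it uses the Bonus values from the sample table in this deck, the helper `run_length_encode` is a hypothetical name, and LZW is not shown.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Bonus column from the sample table used in this deck.
bonus = [3000, 3500, 2800, 2500, 4000, 3500]

unsorted_runs = run_length_encode(bonus)        # no adjacent duplicates: 6 runs
sorted_runs = run_length_encode(sorted(bonus))  # the two 3500s merge: 5 runs
```

On six values the saving is small, but a real column with few distinct values and millions of rows collapses into a handful of runs once it is sorted.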


Column Format Storage

ROWID   ID   Last     First     Bonus
001     1    Doe      John      3000
002     2    Smith    Jane      3500
003     3    Taylor   John      2800
004     4    Smith    Mike      2500
005     5    Doak     Richard   4000
006     6    Brown    Dan       3500

ROWID – unique. ID – unique values. Last, First, Bonus – possible duplicate contents.

Values stored column by column:

1,2,3,4,5,6; Doe,Smith,Taylor,Smith,Doak,Brown; John,Jane,John,Mike,Richard,Dan; 3000,3500,2800,2500,4000,3500

With duplicates eliminated, each distinct value is followed by the ROWIDs that hold it:

1:001;2:002;3:003;4:004;5:005;6:006; Doe:001;Smith:002,004;Taylor:003;Doak:005;Brown:006; John:001,003;Jane:002;Mike:004;Richard:005;Dan:006; 3000:001;3500:002,006;2800:003;2500:004;4000:005;

RDBMS Vs Columnar stores

Source table:

ROWID   ID   Last     First     Bonus
001     1    Doe      John      3000
002     2    Smith    Jane      3500
003     3    Taylor   John      2800
004     4    Smith    Mike      2500
005     5    Doak     Richard   4000
006     6    Brown    Dan       3500

RDBMS or ROW format storage:

• 001,1,Doe,John,3000;
• 002,2,Smith,Jane,3500;
• 003,3,Taylor,John,2800;
• 004,4,Smith,Mike,2500;
• 005,5,Doak,Richard,4000;
• 006,6,Brown,Dan,3500

Columnar format storage:

1:001;2:002;3:003;4:004;5:005;6:006; Doe:001;Smith:002,004;Taylor:003;Doak:005;Brown:006; John:001,003;Jane:002;Mike:004;Richard:005;Dan:006; 3000:001;3500:002,006;2800:003;2500:004;4000:005;
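The conversion from row format to columnar format can be sketched as follows. This is an illustration of the layout on the slide, not how any particular product stores data on disk; `to_columnar` and `column_names` are hypothetical names.

```python
from collections import defaultdict

# Row format: one tuple per record (ROWID, ID, Last, First, Bonus).
rows = [
    ("001", 1, "Doe", "John", 3000),
    ("002", 2, "Smith", "Jane", 3500),
    ("003", 3, "Taylor", "John", 2800),
    ("004", 4, "Smith", "Mike", 2500),
    ("005", 5, "Doak", "Richard", 4000),
    ("006", 6, "Brown", "Dan", 3500),
]
column_names = ["ID", "Last", "First", "Bonus"]


def to_columnar(rows):
    """Build, per column, a mapping of each distinct value to its ROWIDs."""
    columns = {name: defaultdict(list) for name in column_names}
    for rowid, *values in rows:
        for name, value in zip(column_names, values):
            columns[name][value].append(rowid)
    return columns


columns = to_columnar(rows)

# An aggregate over one column touches only that column's data.
total_bonus = sum(value * len(rowids) for value, rowids in columns["Bonus"].items())
```

Here `columns["Last"]["Smith"]` holds the ROWIDs 002 and 004, matching the `Smith:002,004` entry in the columnar string, and the bonus total is computed without reading any name column.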


Pros and Cons of Columnar Database

• Pros –

– Very useful and efficient when an aggregate needs to be computed over many rows but only for a small subset of the columns.

– Efficient when new values of a column are supplied for all rows at once.

– High compression reduces storage requirements and disk reads.

• Cons –

– If many columns of a single row or of multiple rows have to be queried or fetched, this model may be less efficient, although it can still outperform an RDBMS for analytic workloads.

– If an entire row has to be updated or replaced, the operation takes longer, because every column of that row must be touched.


Key-Value Database Store

• This is a method for storing, retrieving and managing associative arrays of data, where metadata (a key) is defined for each value in the array.

• The store consists of a collection of objects or records of a similar type that may have different fields.

• Each record may differ from the others.

• This differs from RDBMS, where every record follows a pre-defined model (schema) of keys and values.

• Document-based and graph-based models are derived from this model.

• This follows modern concepts like Object Oriented Programming (OOP) more closely.

• Most popular databases in this format are –

– Redis, Oracle NoSQL DB, Berkeley DB, DynamoDB


Key-Value Database Store -- Storage

• A record in XML (or JSON) format such as the following represents how data is stored in a Key-Value store:

<contact>
  <firstname>Bhaskar</firstname>
  <lastname>Gunda</lastname>
  <street1>605 Seward Ave. NW</street1>
  <city>Grand Rapids</city>
  <state>MI</state>
  <zip>49504</zip>
  <country>USA</country>
</contact>

– This record is of type Contact/Address.

– Each field has metadata (a key) that defines the value.
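The same contact can be held in a minimal in-memory key-value store. The sketch below is an assumption-laden toy: `KeyValueStore` and the keys `contact:1` and `contact:2` are hypothetical and do not reflect the API of Redis, DynamoDB or any other product.

```python
class KeyValueStore:
    """Toy in-memory key-value store (hypothetical); values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("contact:1", {
    "firstname": "Bhaskar", "lastname": "Gunda", "street1": "605 Seward Ave. NW",
    "city": "Grand Rapids", "state": "MI", "zip": "49504", "country": "USA",
})
# A second record of the same type may carry a different set of fields.
store.put("contact:2", {"firstname": "XYZ", "city": "Grand Rapids", "state": "MI"})

first = store.get("contact:1")
second = store.get("contact:2")
```

The store never inspects the value, so no schema has to be declared before the second, shorter record is written.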


Document Database Store

• This is another popular method of storing data. In fact, adoption of NoSQL has increased because of this model.

• It is designed for storing, retrieving and managing document-oriented data, that is, semi-structured data.

• This model is a subclass of the Key-Value store, but differs from it in that the keys are not pre-defined.

• Metadata is generated for each document separately.

• The data is stored in free form.

• This differs from RDBMS, where a fixed record structure is created for acquiring and storing the data.

• Programmers build the intelligence for parsing the data.

• Each document is a record of its own, and every record may differ from the others. All records are of the same type but do not necessarily have the same number of fields.


Document Data Store – Contd.

• Each document is retrieved using a unique key, usually a URI.

• The database keeps an index on the keys to speed up retrieval.

• This makes this type of database popular in Web applications.

• Free-form data storage and automatic data suggestions are the primary applications of this data store.

• For retrieval purposes, the admin adds hints to the database to look for certain types of information.

• Any document format that carries metadata, such as JSON or XML, can be used to store data in this store.

• Most popular databases are –

– Couchbase Server

– CouchDB

– MongoDB

– Elasticsearch

Document Data Store – Storage

Document 1:
Bhaskar Gunda
OST,
605 Seward Ave NW,
Grand Rapids, MI 49504

Document 2 (does not contain the Company Name that the first document has):
XYZ
605 Seward Ave, Grand Rapids, MI 49504

Document 3 (contains an additional PO Box field compared with the first document):
ABC
OST, P.O. Box 456
605 Seward Ave NW,
Grand Rapids, MI 49504

• Each of the above blocks represents one document.

• All three are of the same type – Address type document.

• But they differ in content and in number of fields.

• Each of these documents is stored with a unique key, and the metadata is generated for each document.

• The programmer writes hints such as “find all my <contact>s with a <zip code>”.
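The three address documents and the “find all my contacts with a zip code” hint can be sketched with plain dictionaries. This is a toy illustration; the field names (`company`, `po_box`, and so on) and the helpers `insert` and `find` are assumptions, not the query API of MongoDB, CouchDB or any other product.

```python
import uuid

documents = {}  # unique key -> document (a dict with its own set of fields)


def insert(document):
    """Store a document under a generated unique key and return that key."""
    key = str(uuid.uuid4())
    documents[key] = document
    return key


def find(**criteria):
    """Return documents matching all criteria; a missing field never matches."""
    return [
        doc for doc in documents.values()
        if all(field in doc and doc[field] == wanted
               for field, wanted in criteria.items())
    ]


insert({"name": "Bhaskar Gunda", "company": "OST", "street": "605 Seward Ave NW",
        "city": "Grand Rapids", "state": "MI", "zip": "49504"})
insert({"name": "XYZ", "street": "605 Seward Ave",
        "city": "Grand Rapids", "state": "MI", "zip": "49504"})
insert({"name": "ABC", "company": "OST", "po_box": "456",
        "street": "605 Seward Ave NW",
        "city": "Grand Rapids", "state": "MI", "zip": "49504"})

same_zip = find(zip="49504")       # all three documents
at_ost = find(company="OST")       # the document without a company is skipped
with_po_box = [doc for doc in documents.values() if "po_box" in doc]
```

No schema change is needed to add the `po_box` field to one document; the query code simply has to tolerate documents that lack a field.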


Document Database Store – Applications

• This type of data store is more popular in Web applications.

• Largely used for semi-structured data.

• Implementations offer a variety of ways of organizing documents, including notions of:

– Collections

– Tags

– Non-visible Metadata

– Directory hierarchies

– Buckets


GRAPH DATABASE STORE

• This model uses a graph compute model consisting of Nodes and Relationships.

– Each Node is an Entity – a person, place, thing or an activity.

– Each Relationship describes how two Nodes are connected to each other.

• A graph database is a DBMS that stores, retrieves and manipulates data using a graph data model.

• Relationships take first priority in this model – applications do not have to infer data connections using foreign keys. This is the difference between RDBMS and this model.

• This is simpler and more expressive than other models.

• This model is most useful for traversing relationships, as in social networks.

• Graph databases can be OLTP databases, and some are fully ACID compliant.

• Some graph databases implement a Key-Value store internally for building the relationships (pointers) between records.

• Most popular databases are – Neo4j, Giraph
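Nodes, relationships and a social-network style traversal can be sketched as below. The `Graph` class, the `FRIEND` relationship type and the sample people are hypothetical; a real graph database such as Neo4j exposes its own query language rather than these methods.

```python
from collections import defaultdict


class Graph:
    """Toy graph (hypothetical): nodes are entities, relationships are typed edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(list)  # node id -> [(relationship, other id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, a, relationship, b):
        # Store the relationship in both directions so it can be
        # traversed from either end without any foreign-key lookup.
        self.edges[a].append((relationship, b))
        self.edges[b].append((relationship, a))

    def neighbours(self, node_id, relationship):
        return {other for rel, other in self.edges[node_id] if rel == relationship}

    def friends_of_friends(self, node_id):
        direct = self.neighbours(node_id, "FRIEND")
        second = set()
        for friend in direct:
            second |= self.neighbours(friend, "FRIEND")
        return second - direct - {node_id}


g = Graph()
for person in ("John", "Jane", "Mike", "Dan"):
    g.add_node(person, kind="person")
g.relate("John", "FRIEND", "Jane")
g.relate("Jane", "FRIEND", "Mike")
g.relate("Mike", "FRIEND", "Dan")

suggestions_for_john = g.friends_of_friends("John")
suggestions_for_jane = g.friends_of_friends("Jane")
```

Because each node keeps direct pointers to its relationships, the traversal follows edges instead of joining tables, which is the point the slide makes about not inferring connections through foreign keys.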


Graph Database Store – Storage


Multi-model Database Store

• Each of the database types so far (columnar, key-value, document, graph) is organized around a single data model that determines how data is stored, retrieved and manipulated.

• If an organization has two different applications, each best served by a different data model, it has to implement two different database models, one per application (called Polyglot Persistence) – which defeats the purpose of using a NoSQL database.

• A multi-model database resolves this by combining two or more models in one database.

• This offers the advantages of polyglot persistence within a single database.

• This model is also ACID compliant.

• One of the first and most widely used databases of this type is OrientDB (supporting Graph, Document, Key-Value and Object models).

• Another popular database is Couchbase Server.

Selecting a NoSQL Database

• Which model of database is suitable largely depends upon the intended business use of the data.

• Key factors to be considered are –

• Model of the database store as required by the business need

• Scalability

• ACID Compliance required

• Sharding Capability

• Ability to utilize In-Memory transactions or Not

• Data Ingestion, extraction and Visualization support

• Support for Hadoop Eco system

• Cost to support

NoSQL Database Challenges

• NoSQL databases are mostly used for ad-hoc queries and predictive analytics, and recently their use in DW and BI applications has been increasing. They are not intended for OLTP or for supporting mainstream applications such as ERPs.

• Security is one of the concerns with these models. However, vendor-provided NoSQL databases are, to a certain extent, implementing stricter security models.

• Selecting the right model to suit the business need requires in-depth analysis and understanding of each of the models – this requires a highly skilled resource (usually an outside resource) to identify the right type.

• The risk in selection can be mitigated by conducting a POC on the short-listed models. The cloud can usually be used for this purpose.


Future of RDBMS

• With all this discussion we may feel that RDBMS is going to die.

• Is RDBMS really going to die?

– Not in reality. RDBMS provides ACID compliance, a general-purpose model and a mature state of data storage, all of which are required by mainstream applications.

– Many applications, such as ERPs and transactional systems, are designed for RDBMS.

– For all OLTP, RDBMS remains the database of choice.

• In reality RDBMS and NoSQL databases will co-exist for many years to come in any organization. But some NoSQL databases are also closing the gap between the two by adding RDBMS capabilities.

• It would be a very expensive proposition for any organization to replace the RDBMS behind its business operations.

• However, it becomes easier, cheaper and more beneficial to replace RDBMS with NoSQL databases for applications like Data Warehouse, BI or any new Analytics platform.


My Info

Bhaskar Gunda

Principal Consultant,

OST

Phone: 616-574-3504

Email: [email protected]
