Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
CHAPTER 24
NOSQL Databasesand Big Data Storage Systems
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Introduction
n NOSQLn Not only SQL
n Most NOSQL systems are distributed databases or distributed storage systemsn Focus on semi-structured data storage, high
performance, availability, data replication, and scalability
Slide 24- 3
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Introduction (cont’d.)
n NOSQL systems focus on storage of “big data”n Typical applications that use NOSQL
n Social median Web linksn User profilesn Marketing and salesn Posts and tweetsn Road maps and spatial datan Email
Slide 24- 4
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
24.1 Introduction to NOSQL Systems
n BigTablen Google’s proprietary NOSQL systemn Column-based or wide column store
n DynamoDB (Amazon)n Key-value data store
n Cassandra (Facebook)n Uses concepts from both key-value store and
column-based systems
Slide 24- 5
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Introduction to NOSQL Systems (cont’d.)
n MongoDB and CouchDBn Document stores
n Neo4J and GraphBasen Graph-based NOSQL systems
n OrientDBn Combines several concepts
n Database systems classified on the object modeln Or native XML model
Slide 24- 6
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Introduction to NOSQL Systems (cont’d.)
n NOSQL characteristics related to distributed databases and distributed systemsn Scalabilityn Availability, replication, and eventual consistencyn Replication models
n Master-slaven Master-master
n Sharding of filesn High performance data access
Slide 24- 7
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Introduction to NOSQL Systems (cont’d.)
n NOSQL characteristics related to data models and query languagesn Schema not requiredn Less powerful query languagesn Versioning
Slide 24- 8
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Introduction to NOSQL Systems (cont’d.)
n Categories of NOSQL systemsn Document-based NOSQL systemsn NOSQL key-value storesn Column-based or wide column NOSQL systemsn Graph-based NOSQL systemsn Hybrid NOSQL systemsn Object databasesn XML databases
Slide 24- 9
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
24.2 The CAP Theorem
n Various levels of consistency among replicated data itemsn Enforcing serializabilty the strongest form of
consistencyn High overhead – can reduce read/write operation
performancen CAP theorem
n Consistency, availability, and partition tolerancen Not possible to guarantee all three simultaneously
n In distributed system with data replication
Slide 24- 10
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
The CAP Theorem (cont’d.)
n Designer can choose two of three to guaranteen Weaker consistency level is often acceptable in
NOSQL distributed data storen Guaranteeing availability and partition tolerance
more importantn Eventual consistency often adopted
Slide 24- 11
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
24.3 Document-Based NOSQL Systems and MongoDB
n Document storesn Collections of similar documents
n Individual documents resemble complex objects or XML documentsn Documents are self-describingn Can have different data elements
n Documents can be specified in various formatsn XMLn JSON
Slide 24- 12
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
MongoDB Data Model
n Documents stored in binary JSON (BSON) formatn Individual documents stored in a collectionn Example command
n First parameter specifies name of the collectionn Collection options include limits on size and
number of documents
n Each document in collection has unique ObjectID field called _id
Slide 24- 13
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
MongoDB Data Model (cont’d.)
n A collection does not have a scheman Structure of the data fields in documents chosen
based on how documents will be accessedn User can choose normalized or denormalized
designn Document creation using insert operation
n Document deletion using remove operation
Slide 24- 14
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe Slide 24- 15
Figure 24.1 (continues)Example of simple documents in MongoDB (a) Denormalized document design with embedded subdocuments (b) Embedded array of document references
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe Slide 24- 16
Figure 24.1 (cont’d.) Example of simple documents in MongoDB (c) Normalized documents (d) Inserting the documents in Figure 24.1(c) into their collections
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
MongoDB Distributed Systems Characteristics
n Two-phase commit methodn Used to ensure atomicity and consistency of
multidocument transactionsn Replication in MongoDB
n Concept of replica set to create multiple copies on different nodes
n Variation of master-slave approachn Primary copy, secondary copy, and arbiter
n Arbiter participates in elections to select new primary if needed
Slide 24- 17
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
MongoDB Distributed Systems Characteristics (cont’d.)
n Replication in MongoDB (cont’d.)n All write operations applied to the primary copy
and propagated to the secondariesn User can choose read preference
n Read requests can be processed at any replican Sharding in MongoDB
n Horizontal partitioning divides the documents into disjoint partitions (shards)
n Allows adding more nodes as needed n Shards stored on different nodes to achieve load
balancingSlide 24- 18
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
MongoDB Distributed Systems Characteristics (cont’d.)
n Sharding in MongoDB (cont’d.)n Partitioning field (shard key) must exist in every
document in the collectionn Must have an index
n Range partitioningn Creates chunks by specifying a range of key valuesn Works best with range queries
n Hash partitioningn Partitioning based on the hash values of each shard
key
Slide 24- 19
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
24.4 NOSQL Key-Value Stores
n Key-value stores focus on high performance, availability, and scalabilityn Can store structured, unstructured, or semi-
structured datan Key: unique identifier associated with a data item
n Used for fast retrievaln Value: the data item itself
n Can be string or array of bytesn Application interprets the structure
n No query languageSlide 24- 20
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
DynamoDB Overview
n DynamoDB part of Amazon’s Web Services/SDK platformsn Proprietary
n Table holds a collection of self-describing itemsn Item consists of attribute-value pairs
n Attribute values can be single or multi-valuedn Primary key used to locate items within a table
n Can be single attribute or pair of attributes
Slide 24- 21
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Voldemort Key-Value Distributed Data Store
n Voldemort: open source key-value system similar to DynamoDB
n Voldemort featuresn Simple basic operations (get, put, and delete)n High-level formatted data valuesn Consistent hashing for distributing (key, value)
pairsn Consistency and versioning
n Concurrent writes allowedn Each write associated with a vector clock
Slide 24- 22
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe Slide 24- 23
Figure 24.2 Example of consistent hashing (a) Ring having three nodes A, B, and C, with C having greater capacity. The h(K) values that map to the circle points in range 1 have their (k, v) items stored in node A, range 2 in node B, range 3 in node C (b) Adding a node D to the ring. Items in range 4 are moved to the node D from node B (range 2 is reduced) and node C (range 3 is reduced)
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Examples of Other Key-Value Stores
n Oracle key-value storen Oracle NOSQL Database
n Redis key-value cache and storen Caches data in main memory to improve
performancen Offers master-slave replication and high availabilityn Offers persistence by backing up cache to disk
n Apache Cassandran Offers features from several NOSQL categoriesn Used by Facebook and others
Slide 24- 24
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
24.5 Column-Based or Wide ColumnNOSQL Systems
n BigTable: Google’s distributed storage system for big datan Used in Gmailn Uses Google File System for data storage and
distributionn Apache Hbase a similar, open source system
n Uses Hadoop Distributed File System (HDFS) for data storage
n Can also use Amazon’s Simple Storage System (S3)
Slide 24- 25
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Hbase Data Model and Versioning
n Data organization conceptsn Namespacesn Tablesn Column familiesn Column qualifiersn Columnsn Rowsn Data cells
n Data is self-describing
Slide 24- 26
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Hbase Data Model and Versioning (cont’d.)
n HBase stores multiple versions of data itemsn Timestamp associated with each version
n Each row in a table has a unique row keyn Table associated with one or more column
familiesn Column qualifiers can be dynamically specified
as new table rows are created and insertedn Namespace is collection of tablesn Cell holds a basic data item
Slide 24- 27
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe Slide 24- 28
Figure 24.3 Examples in Hbase (a) Creating a table called EMPLOYEE with three column families: Name, Address, and Details (b) Inserting some in the EMPLOYEE table; different rows can have different self-describing column qualifiers (Fname, Lname, Nickname, Mname, Minit, Suffix, … for column family Name; Job, Review, Supervisor, Salary for column family Details). (c) Some CRUD operations of Hbase
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Hbase Crud Operations
n Provides only low-level CRUD (create, read, update, delete) operations
n Application programs implement more complex operations
n Createn Creates a new table and specifies one or more
column families associated with the tablen Put
n Inserts new data or new versions of existing data items
Slide 24- 29
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Hbase Crud Operations (cont’d.)
n Getn Retrieves data associated with a single row
n Scann Retrieves all the rows
Slide 24- 30
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Hbase Storage and Distributed System Concepts
n Each Hbase table divided into several regionsn Each region holds a range of the row keys in the
tablen Row keys must be lexicographically orderedn Each region has several stores
n Column families are assigned to storesn Regions assigned to region servers for storage
n Master server responsible for monitoring the region servers
n Hbase uses Apache Zookeeper and HDFS
Slide 24- 31
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
24.6 NOSQL Graph Databases and Neo4j
n Graph databasesn Data represented as a graphn Collection of vertices (nodes) and edgesn Possible to store data associated with both
individual nodes and individual edgesn Neo4j
n Open source systemn Uses concepts of nodes and relationships
Slide 24- 32
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Neo4j (cont’d.)
n Nodes can have labelsn Zero, one, or several
n Both nodes and relationships can have propertiesn Each relationship has a start node, end node,
and a relationship typen Properties specified using a map patternn Somewhat similar to ER/EER concepts
Slide 24- 33
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Neo4j (cont’d.)
n Creating nodes in Neo4jn CREATE commandn Part of high-level declarative query language
Cyphern Node label can be specified when node is createdn Properties are enclosed in curly brackets
Slide 24- 34
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Neo4j (cont’d.)
Slide 24- 35
Figure 24.4 Examples in Neo4j using the Cypher language (a) Creating some nodes
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Neo4j (cont’d.)
Slide 24- 36
Figure 24.4 (cont’d.) Examples in Neo4j using the Cypher language (b) Creating some relationships
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Neo4j (cont’d.)
n Pathn Traversal of part of the graphn Typically used as part of a query to specify a
patternn Schema optional in Neo4jn Indexing and node identifiers
n Users can create for the collection of nodes that have a particular label
n One or more properties can be indexed
Slide 24- 37
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
The Cypher Query Language of Neo4j
n Cypher query made up of clausesn Result from one clause can be the input to the
next clause in the query
Slide 24- 38
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
The Cypher Query Language of Neo4j (cont’d.)
Slide 24- 39
Figure 24.4 (cont’d.) Examples in Neo4j using the Cypher language (c) Basic syntax of Cypher queries
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
The Cypher Query Language of Neo4j (cont’d.)
Slide 24- 40
Figure 24.4 (cont’d.) Examples in Neo4j using the Cypher language (d) Examples of Cypher queries
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
Neo4j Interfaces and Distributed System Characteristics
n Enterprise edition versus community editionn Enterprise edition supports caching, clustering of
data, and lockingn Graph visualization interface
n Subset of nodes and edges in a database graph can be displayed as a graph
n Used to visualize query resultsn Master-slave replicationn Cachingn Logical logs
Slide 24- 41
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
24.7 Summary
n NOSQL systems focus on storage of “big data”n General categories
n Document-basedn Key-value storesn Column-basedn Graph-basedn Some systems use techniques spanning two or
more categoriesn Consistency paradigmsn CAP theorem
Slide 24- 42