QUERYING (BIG) DATA ON NOSQL STORES - Vargas-Solarvargas-solar.com/bigdata-management/wp-content/... · NOSQL STORES: DATA MANAGEMENT PROPERTIES ! Indexing ! Distributed hashing like

QUERYING (BIG) DATA ON NOSQL STORES

GENOVEVA VARGAS SOLAR FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE

[email protected]

http://www.vargas-solar.com/teaching

http://www.vargas-solar.com

DATA PROCESSING

¡  Process the data to produce other data: analysis tool, business intelligence tool, ...

¡  This means

¡  • Handle large volumes of data

¡  • Manage thousands of processors

¡  • Parallelize and distribute treatments

¡  Scheduling I/O

¡  Managing Fault Tolerance

¡  Monitor /Control processes

Map-Reduce provides all this easy!

2

NOSQL QUERY EXECUTION

3

� �

� �

� �

� �

� �

� �

Master'

Slaves'

all'updates'made'to'the'master'

changes'propagate''To'slaves'

reads'can'be'done'from'master'or'slaves'

Map

Data organization -  DISTRIBUTED FILE SYSTEM -  INDEXING

Reduce

Query

NoSQL

“Optimizing query processing” à DATA COMPRESSION, LOCATION, R/W STRATEGIES, CACHE/MEMORY ORGANIZATION

Query programming À  MAP-REDUCE PROGRAMMING PATTERNS

Query execution -  EXECUTION MODEL -  DATA PROCESSING PROPERTIES

1

2

3

ORGANISATION DE DONNÉES: THE NOSQL CASE

4

NOSQL STORES: DATA MANAGEMENT PROPERTIES

¡  Indexing

¡  Distributed hashing like Memcached open source cache

¡  In-memory indexes are scalable when distributing and replicating objects over multiple nodes

¡  Partitioned tables

¡  High availability and scalability: eventual consistency

¡  Data fetched are not guaranteed to be up-to-date

¡  Updates are guaranteed to be propagated to all nodes eventually

¡  Shared nothing horizontal scaling

¡  Replicating and partitioning data over many servers

¡  Support large number of simple read/write operations per second (OLTP)

¡  No ACID guarantees

¡  Updates eventually propagated but limited guarantees on reads consistency

¡  BASE: basically available; soft state, eventually consistent

¡  Multi-version concurrency control

5

PROBLEM STATEMENT: HOW MUCH TO GIVE UP?

¡  CAP theorem1: a system can have two of the three properties

¡  NoSQL systems sacrifice consistency

6

Consistency Availability

Fault-‐tolerant partitioning

1 Eric Brewer, "Towards robust distributed systems." PODC. 2000 http://www.cs.berkeley.edu/~brewer/cs262b-‐2004/PODC-‐keynote.pdf

NOSQL STORES: AVAILABILITY AND PERFORMANCE

¡  Replication

¡  Copy data across multiple servers (each bit of data can be found in multiple servers)

¡  Increase data availability

¡  Faster query evaluation

¡  Sharding

¡  Distribute different data across multiple servers

¡  Each server acts as the single source of a data subset

¡  Orthogonal techniques

7

REPLICATION: PROS & CONS

¡  Data is more available

¡  Failure of a site containing E does not result in unavailability of E if replicas exist

¡  Performance

¡  Parallelism: queries processed in parallel on several nodes

¡  Reduce data transfer for local data

¡  Increased updates cost

¡  Synchronisation: each replica must be updated

¡  Increased complexity of concurrency control

¡  Concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented

8

SHARDING: WHY IS IT USEFUL?

¡  Scaling applications by reducing data sets in any single databases

¡  Segregating data

¡  Sharing application data

¡  Securing sensitive data by isolating it

¡  Improve read and write performance ¡  Smaller amount of data in each user group implies faster querying

¡  Isolating data into smaller shards accessed data is more likely to stay on cache

¡  More write bandwidth: writing can be done in parallel

¡  Smaller data sets are easier to backup, restore and manage

¡  Massively work done ¡  Parallel work: scale out across more nodes

¡  Parallel backend: handling higher user loads

¡  Share nothing: very few bottlenecks

¡  Decrease resilience improve availability ¡  If a box goes down others still operate

¡  But: Part of the data missing

9

Load%balancer%

Cache%1%

Cache%2%

Cache%3%

MySQL%Master%

MySQL%Master%

Web%1%

Web%2%

Web%3%

Site%database%

Resume%database%

SHARDING AND REPLICATION

¡  Sharding with no replication: unique copy, distributed data sets

¡  (+) Better concurrency levels (shards are accessed independently)

¡  (-) Cost of checking constraints, rebuilding aggregates

¡  Ensure that queries and updates are distributed across shards

¡  Replication of shards

¡  (+) Query performance (availability)

¡  (-) Cost of updating, of checking constraints, complexity of concurrency control

¡  Partial replication (most of the times)

¡  Only some shards are duplicated

10

QUERY PROGRAMMING DATA PROCESSING USING MAP-REDUCE

11

MAP-REDUCE

¡  Programming model for expressing distributed computations on massive amounts of data

¡  Execution framework for large-scale data processing on clusters of commodity servers

¡  Market: any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing content must tackle large-data problems

¡  data- intensive processing is beyond the capability of any individual machine and requires clusters

¡  large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines

12

« Data represent the rising tide that lifts all boats—more data lead to better algorithms and systems for solving real-world problems »

COUNTING WORDS

13

(URI, document) à (term, count)

see bob throw see spot run

bob <1> run <1> see <1,1> spot <1> throw <1>

see 1 bob 1 throw 1 see 1 spot 1 run 1

bob 1 run 1 see 2 spot 1 throw 1

Map Shuffle/Sort Reduce

MAP – REDUCE DESIGN PATTERNS

14

SUMMARIZATION Numerical

Inverted index

Counting with counters

FILTERING Filtering

Bloom

Top ten

Distinct

DATA ORGANIZATION Structured to hierarchical

Partitioning

Binning

Total order sorting

Shuffling

JOIN Reduce side join

Reduce side join with bloom filter

Replicated join

Composite join

Cartesian product

•  Minimum, maximum, count, average, median-standard deviation

•  Wikipedia inverted index

•  Count number of records, a small number of unique instances, summations

•  Number of users per state

•  Remove most of nonwatched values, prefiltering data for a set membership check

•  Hot list, Hbase query

•  Closer view of data, tracking event threads, distributed grep, data cleansing, simple random sampling, remove low scoring data

•  Outlier analysis, select interesting data, catchy dashbords

•  Top ten users by reputation

•  Deduplicate data, getting distinct values, protecting from inner join explosion

•  Distinct user ids

•  Prejoining data, preparing data for Hbase or MongoDB

•  Post/comment building for StackOverflow, Question/Answer building

•  Partitioning users by last access date

•  Binning by Hadoop-related tags

•  Sort users by last visit

•  Anonymizing StackOverflow comments

•  Multiple large data sets joined by foreign key

•  User – comment join

•  Reputable user – comment join

•  Replicated user – comment join

•  Composite user – comment join

•  Comment comparison

ELEMENTS TO THINK ABOUT EFFICIENT EXECUTION

15

HADOOP INFRASTRUCTURE

16

HADOOP FRAMEWORK

¡  Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

¡  Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters

¡  HBase: A scalable, distributed database that supports structured data storage for large tables

¡  Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying

¡  Chukwa: A data collection system for managing large distributed systems

¡  Pig: A high-level data-flow language and execution framework for parallel computation

¡  ZooKeeper: A high-performance coordination service for distributed applications

17

DISTRIBUTED FILE SYSTEM

¡  Abandons the separation of computation and storage as distinct components in a cluster

¡  Google File System (GFS) supports Google’s proprietary implementation of MapReduce;

¡  In the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop

¡  The main idea is to divide user data into blocks and replicate those blocks across the local disks of nodes in the cluster

¡  Adopts a master–slave architecture

¡  Master (namenode HDFS) maintains the file namespace (metadata, directory structure, file to block mapping, location of blocks, and access permissions)

¡  Slaves (datanode HDFS) manage the actual data blocks

18

HFDS GENERAL ARCHITECTURE

¡  An application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored

¡  The namenode returns the relevant block id and the location where the block is held (i.e., which datanode)

¡  The client then contacts the datanode to retrieve the data.

¡  HDFS lies on top of the standard OS stack (e.g., Linux): blocks are stored on standard single-machine file systems 19

HDFS PROPERTIES

¡  HDFS stores three separate copies of each data block to ensure both reliability, availability, and performance

¡  In large clusters, the three replicas are spread across different physical racks, ¡  HDFS is resilient towards two common failure scenarios individual datanode crashes and failures in networking equipment that bring an entire rack

offline.

¡  Replicating blocks across physical machines also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs, since multiple copies yield more opportunities to exploit locality

¡  To create a new file and write data to HDFS ¡  The application client first contacts the namenode

¡  The namenode

¡  updates the file namespace after checking permissions and making sure the file doesn’t already exist

¡  allocates a new block on a suitable datanode

¡  The application is directed to stream data directly to it

¡  From the initial datanode, data is further propagated to additional replicas

20

HADOOP CLUSTER ARCHITECTURE

¡  The HDFS namenode runs the namenode daemon

¡  The job submission node runs the jobtracker, which is the single point of contact for a client wishing to execute a MapReduce job

¡  The jobtracker

¡  Monitors the progress of running MapReduce jobs

¡  Is responsible for coordinating the execution of the mappers and reducers

¡  Tries to take advantage of data locality in scheduling map tasks 21

MAP-REDUCE PHASES

¡  Initialisation

¡  Map: record reader, mapper, combiner, and partitioner

¡  Reduce: shuffle, sort, reducer, and output format

¡  Partition input (key, value) pairs into chunks run map() tasks in parallel

¡  After all map()’s have been completed consolidate the values for each unique emitted key

¡  Partition space of output map keys, and run reduce() in parallel

22

MAP SUB-PHASES

¡  Record reader translates an input split generated by input format into records ¡  parse the data into records, but not parse the record itself

¡  It passes the data to the mapper in the form of a key/value pair. Usually the key in this context is positional information and the value is the chunk of data that composes a record

¡  Map user-provided code is executed on each key/value pair from the record reader to produce zero or more new key/value pairs, called the intermediate pairs ¡  The key is what the data will be grouped on and the value is the information pertinent to the analysis in the reducer

¡  Combiner, an optional localized reducer

¡  Can group data in the map phase

¡  It takes the intermediate keys from the mapper and applies a user-provided method to aggregate values in the small scope of that one mapper

¡  Partitioner takes the intermediate key/value pairs from the mapper (or combiner) and splits them up into shards, one shard per reducer

23

REDUCE SUB PHASES

¡  Shuffle and sort takes the output files written by all of the partitioners and downloads them to the local machine in which the reducer is running.

¡  These individual data pieces are then sorted by key into one larger data list

¡  The purpose of this sort is to group equivalent keys together so that their values can be iterated over easily in the reduce task

¡  Reduce takes the grouped data as input and runs a reduce function once per key grouping

¡  The function is passed the key and an iterator over all of the values associated with that key

¡  Once the reduce function is done, it sends zero or more key/value pair to the final step, the output format

¡  Output format translates the final key/value pair from the reduce function and writes it out to a file by a record writer

24

CASE STUDY

25

EXTENSIBLE RECORD STORES

¡  Basic data model is rows and columns

¡  Basic scalability model is splitting rows and columns over multiple nodes ¡  Rows split across nodes through sharding on the primary key

¡  Split by range rather than hash function

¡  Rows analogous to documents: variable number of attributes, attribute names must be unique

¡  Grouped into collections (tables)

¡  Queries on ranges of values do not go to every node

¡  Columns are distributed over multiple nodes using “column groups” ¡  Which columns are best stored together

¡  Column groups must be pre-defined with the extensible record stores

26

SYSTEM ADDRESS

HBase hbase.apache.com

HyperTable hypertable.org

Cassandra incubator.apache.org/cassandra

EXTENSIBLE RECORD DATA MODEL (HBASE EXAMPLE)

¡  Most basic unit: column ¡  Each column may have multiple versions

¡  Each distinct value contained in a separate cell

¡  One or more columns form a row addressed uniquely by a row key

¡  A number of rows form a table

27

C1 1

2

C2 14

25 R-‐1

Version 1

Version 2

Cell Column

Raw

C2 44

75

R-‐n Version 1

Version 2

Cell Column Raw

T1

Table

C3 95 F-‐2

F-‐1 Family

Family

DATA ORGANIZATION

28

REFINEMENTS: LOCALITY GROUPS

¡  Can group multiple column families into a locality group ¡  Separate SSTable is created for each locality group in each tablet.

¡  Segregating columns families that are not typically accessed together enables more efficient reads. ¡  In WebTable, page metadata can be in one group and contents of the page in another group.

REFINEMENTS: COMPRESSION

¡  Many opportunities for compression ¡  Similar values in the same row/column at different timestamps

¡  Similar values in different columns

¡  Similar values across adjacent rows

¡  Two-pass custom compressions scheme ¡  First pass: compress long common strings across a large window

¡  Second pass: look for repetitions in small window

¡  Speed emphasized, but good space reduction (10-to-1)

FILTER

¡  Given a collection of tuples, filtering simply evaluates each record separately and decides, based on some condition, whether it should stay or go

¡  Scan through a file line-by-line and only output lines that match a specific pattern

¡  Simple random sampling: grab a subset of our larger data set in which each record has an equal probability of being selected (decrease the dataset size)

¡  Instead of some filter criteria function that bears some relationship to the content of the record, a random number generator will produce a value, and if the value is below a threshold, keep the record. Otherwise, toss it out

¡  Bloom: keep records that are member of some predefined set of values (hot values)

¡  For each record, extract a feature of that record. If that feature is a member of a set of values represented by a Bloom filter, keep it; otherwise toss it out (or the reverse).

¡  For example: keep or throw away this record if the value in the user field is a member of a predetermined list of users.

31

BLOOM FILTER

¡  Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set

¡  The base data structure of a Bloom filter is a Bit Vector. Here's a small one we'll use to demonstrate

¡  Each empty cell in that table represents a bit, and the number below it its index. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the index of those hashes to 1

32

1 2 3 4 5 6 7 8 9 10 11 12 13 14

http://billmill.org/bloomfilter-‐tutorial/

A FORM OF OPTIMIZATION FOR ACCESSING HBASE SEMI-HANDS ON (SEE EXERCISE 4)

33

REFINEMENTS: BLOOM FILTERS

¡  Read operation has to read from disk when desired SSTable isn’t in memory

¡  Reduce number of accesses by specifying a Bloom filter.

¡  Allows us ask if an SSTable might contain data for a specified row/column pair.

¡  Small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations

¡  Use implies that most lookups for non-existent rows or columns do not need to touch disk

NOSQL DATA PROCESSING PROPERTIES

35

only an afterthought and could cause problems once you need to scale the system. And if it does offer scalability, does it imply specific steps to do so? The easiest solution would be to add one machine at a time, while sharded setups (especially those not supporting virtual shards) sometimes require for each shard to be increased simultaneously because each partition needs to be equally powerful.

Lars George; Hbase the definitive guide, O’Reilly

NOSQL DATA PROCESSING PROPERTIES

36

37

SOME BOOKS

¡  Hadoop The Definitive Guide – O’Reily 2011 – Tom White

¡  Data Intensive Text Processing with MapReduce – Morgan & Claypool 2010 –Jimmy Lin, Chris Dyer – pages 37-65

¡  Cloud Computing and Software Services Theory and Techniques– CRC Press 2011- Syed Ahson, Mohammad Ilyas – pages 93-137

¡  Writing and Querying MapReduce Views in CouchDB – O’Reily 2011 –Brandley Holt – pages 5-29

¡  NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage, Martin Fowler

38

NOSQL STORES: AVAILABILITY AND PERFORMANCE

39

REPLICATION MASTER - SLAVE

¡  Makes one node the authoritative copy/replica that handles writes while replica synchronize with the master and may handle reeds

¡  All replicas have the same weight ¡  Replicas can all accept writes

¡  The lose of one of them does not prevent access to the data store

¡  Helps with read scalability but does not help with write scalability

¡  Read resilience: should the master fail, slaves can still handle read requests

¡  Master failure eliminates the ability to handle writes until either the master is restored or a new master is appointed

¡  Biggest complication is consistency ¡  Possible write – write conflict

¡  Attempt to update the same record at the same time from to different places

¡  Master is a bottle-neck and a point of failure

40

� �

� �

� �

� �

� �

� �

Master'

Slaves'

all'updates'made'to'the'master'

changes'propagate''To'slaves'

reads'can'be'done'from'master'or'slaves'

MASTER-SLAVE REPLICATION MANAGEMENT

¡  Masters can be appointed ¡  Manually when configuring the nodes cluster

¡  Automatically: when configuring a nodes cluster one of them elected as master. The master can appoint a new master when the master fails reducing downtime

¡  Read resilience

¡  Read and write paths have to be managed separately to handle failure in the write path and still reads can occur

¡  Reads and writes are put in different database connections if the database library accepts it

¡  Replication comes inevitably with a dark side: inconsistency ¡  Different clients reading different slaves will see different values if changes have not been propagated to all slaves

¡  In the worst case a client cannot read a write it just made

¡  Even if master-slave is used for hot backups, if the master fails any updates on to the backup are lost

41

REPLICATION: PEER-TO-PEER

¡  Allows writes to any node; the nodes coordinate to synchronize their copies

¡  The replicas have equal weight

¡  Deals with inconsistencies

¡  Replicas coordinate to avoid conflict

¡  Network traffic cost for coordinating writes

¡  Unnecessary to make all replicas agree to write, only the majority

¡  Survival to the loss of the minority of replicas nodes

¡  Policy to merge inconsistent writes

¡  Full performance on writing to any replica

42

� �

� �

� �

� �

� �

� �

Master'

nodes'communicate'their'writes'

all'nodes'read'and'write'all'data'

REPLICATION: ASPECTS TO CONSIDER

¡  Conditioning

¡  Important elements to consider

¡  Data to duplicate

¡  Copies location

¡  Duplication model (master – slave / P2P)

¡  Consistency model (global – copies)

43

Fault tolerance

Availability Transparency levels

Performance

à Find a compromise !

SHARDING

¡  Ability to distribute both data and load of simple operations over many servers, with no RAM or disk shared among servers

¡  A way to horizontally scale writes ¡  Improve read performance

¡  Application/data store support

44

¡  Puts different data on separate nodes

¡  Each user only talks to one servicer so she gets rapid responses

¡  The load should be balanced out nicely between servers

¡  Ensure that

¡  data that is accessed together is clumped together on the same node

¡  that clumps are arranged on the nodes to provide best data access

� �

� �

Each%shard%reads%and%writes%its%own%data%

�

�

� �

SHARDING

Database laws

¡  Small databases are fast

¡  Big databases are slow

¡  Keep databases small

Principle

¡  Start with a big monolithic database

¡  Break into smaller databases

¡  Across many clusters

¡  Using a key value

45

Instead of having one million customers information on a single big machine ….

100 000 customers on smaller and different machines

SHARDING CRITERIA

¡  Partitioning

¡  Relational: handled by the DBMS (homogeneous DBMS)

¡  NoSQL: based on ranging of the k-value

¡  Federation

¡  Relational

¡  Combine tables stored in different physical databases

¡  Easier with denormalized data

¡  NoSQL:

¡  Store together data that are accessed together

¡  Aggregates unit of distribution

46

SHARDING

Architecture

¡  Each application server (AS) is running DBS/client

¡  Each shard server is running

¡  a database server

¡  replication agents and query agents for supporting parallel query functionality

Process

¡  Pick a dimension that helps sharding easily (customers, countries, addresses)

¡  Pick strategies that will last a long time as repartition/re-sharding of data is operationally difficult

¡  This is done according to two different principles ¡  Partitioning: a partition is a structure that divides a space

into tow parts

¡  Federation: a set of things that together compose a centralized unit but each individually maintains some aspect of autonomy

47

Customers data is partitioned by ID in shards using an algorithm d to determine which shard a customer ID belongs to

48

PARTITIONING A PARTITION IS A STRUCTURE THAT DIVIDES A SPACE INTO TOW PARTS

49

BACKGROUND: DISTRIBUTED RELATIONAL DATABASES

¡  External schemas (views) are often subsets of relations (contacts in Europe and America)

¡  Access defined on subsets of relations: 80% of the queries issued in a region have to do with contacts of that region

¡  Relations partition ¡  Better concurrency level

¡  Fragments accessed independently

¡  Implications ¡  Check integrity constraints

¡  Rebuild relations

50

FRAGMENTATION

¡  Horizontal

¡  Groups of tuples of the same relation

¡  Budget < 300 000 or >= 150 000

¡  Not disjoint are more difficult to manage

¡  Vertical

¡  Groups attributes of the same relation

¡  Separate budget from loc and pname of the relation project

¡  Hybrid

51

FRAGMENTATION: RULES

Vertical

¡  Clustering

¡  Grouping elementary fragments

¡  Budget and location information in two relations

¡  Splitting

¡  Decomposing a relation according to affinity relationships among attributes

Horizontal

¡  Tuples of the same fragment must be statistically homogeneous ¡  If t1 and t2 are tuples of the same fragment then t1 and t2 have

the same probability of being selected by a query

¡  Keep important conditions ¡  Complete

¡  Every tuple (attribute) belongs to a fragment (without information loss)

¡  If tuples where budget >= 150 000 are more likely to be selected then it is a good candidate

¡  Minimum

¡  If no application distinguishes between budget >= 150 000 and budget < 150 000 then these conditions are unnecessary

52

SHARDING: HORIZONTAL PARTITIONING

¡  The entities of a database are split into two or more sets (by row)

¡  In relational: same schema several physical bases/servers ¡  Partition contacts in Europe and America shards where

they zip code indicates where the will be found

¡  Efficient if there exists some robust and implicit way to identify in which partition to find a particular entity

¡  Last resort shard ¡  Needs to find a sharding function: modulo, round robin,

hash – partition, range - partition

53

Load%balancer%

Cache%1%

Cache%2%

Cache%3%

MySQL%Master%

MySQL%Master%

Web%1%

Web%2%

Web%3%

Even%IDs%

Odd%IDs%MySQL%Slave%1% MySQL%

Slave%2%

MySQL%Slave%n%

MySQL%Slave%1% MySQL%

Slave%2%

MySQL%Slave%n%

FEDERATION A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY

54

FEDERATION: VERTICAL SHARDING

¡  Principle

¡  Partition data according to their logical affiliation

¡  Put together data that are commonly accessed

¡  The search load for the large partitioned entity can be split across multiple servers (logical and physical) and not only according to multiple indexes in the same logical server

¡  Different schemas, systems, and physical bases/servers

¡  Shards the components of a site and not only data

55

Load%balancer%

Cache%1%

Cache%2%

Cache%3%

MySQL%Master%

MySQL%Master%

Web%1%

Web%2%

Web%3%

Site%database%

Resume%database%MySQL%Slave%1% MySQL%

Slave%2%

MySQL%Slave%n%

MySQL%Slave%1%

Internal%user%

NOSQL STORES: PERSISTENCY MANAGEMENT

56

«MEMCACHED»

¡  «memcached» is a memory management protocol based on a cache: ¡  Uses the key-value notion ¡  Information is completly stored in RAM

¡  «memcached» protocol for: ¡  Creating, retrieving, updating, and deleting information from the database ¡  Applications with their own «memcached» manager (Google, Facebook,

YouTube, FarmVille, Twitter, Wikipedia)

57

STORAGE ON DISC (1)

¡  For efficiency reasons, information is stored using the RAM:

¡  Work information is in RAM in order to answer to low latency requests

¡  Yet, this is not always possible and desirable

Ø The process of moving data from RAM to disc is called "eviction”; this process is configured automatically for every bucket

58

STORAGE ON DISC (2)

¡ NoSQL servers support the storage of key-value pairs on disc:

¡  Persistency–can be executed by loading data, closing and reinitializing it without having to load data from another source

¡  Hot backups– loaded data are sotred on disc so that it can be reinitialized in case of failures

¡  Storage on disc– the disc is used when the quantity of data is higher thant the physical size of the RAM, frequently used information is maintained in RAM and the rest es stored on disc

59

STORAGE ON DISC (3)

¡  Strategies for ensuring:

¡  Each node maintains in RAM information on the key-value pairs it stores. Keys:

¡  may not be found, or

¡  they can be stored in memory or on disc

¡  The process of moving information from RAM to disc is asynchronous:

¡  The server can continue processing new requests

¡  A queue manages requests to disc

Ø  In periods with a lot of writing requests, clients can be notified that the server is termporaly out of memory until information is evicted

60

NOSQL STORES: CONCURRENCY CONTROL

61

MULTI VERSION CONCURRENCY CONTROL (MVCC)

¡  Objective: Provide concurrent access to the database and in programming languages to implement transactional memory

¡  Problem: If someone is reading from a database at the same time as someone else is writing to it, the reader could see a half-written or inconsistent piece of data.

¡  Lock: readers wait until the writer is done

¡  MVCC: ¡  Each user connected to the database sees a snapshot of the database at a particular instant in time

¡  Any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been committed

¡  When an MVCC database needs to update an item of data it marks the old data as obsolete and adds the newer version elsewhere à multiple versions stored, but only one is the latest

¡  Writes can be isolated by virtue of the old versions being maintained

¡  Requires (generally) the system to periodically sweep through and delete the old, obsolete data objects

62

Documents

QUERYING (BIG) DATA ON NOSQL STORES - Vargas-Solarvargas-solar.com/bigdata-management/wp-content/... · NOSQL STORES: DATA MANAGEMENT PROPERTIES ! Indexing ! Distributed hashing like