An example of a CP system: Google’s BigTable
Leonardo Aniello [email protected]
Recap of the CAP Theorem
• the theorem in a nutshell
o among Consistency, Availability and tolerance to Partitions, only two can be achieved at the same time
• CP systems give up availability to get strong consistency
o in case of partitions, requests are not served
Scenario - BigTable
• BigTable is a distributed storage system for structured data, used at Google to provide data persistence to several services
o web indexing
o Google Earth
o Google Finance
o Google Analytics
Data Model
• “sparse, distributed, persistent multi-dimensional sorted map” [1]
o sparse: NULL values are not stored, saving disk space
o distributed: the data set is managed by distinct physical nodes
o persistent: data is saved on stable storage
o multi-dimensional: data is indexed by row, column and timestamp
o sorted: data is stored on disk according to a pre-defined order
Data Model – Column vs Row Oriented (1/2)
• consider a tabular view of the data [2]
• RDBMSs are row oriented: all the values of a row are serialized together
1, Smith, Joe, 40000;
2, Jones, Mary, 50000;
3, Johnson, Cathy, 44000;
o to get the values of a few columns, entire rows have to be read anyway
→ many disk seeks
EmpID | Lastname | Firstname | Salary
1     | Smith    | Joe       | 40000
2     | Jones    | Mary      | 50000
3     | Johnson  | Cathy     | 44000
Data Model – Column vs Row Oriented (2/2)
• consider a tabular view of the data [2]
• column oriented data model: all the values of a column are serialized together
1, 2, 3;
Smith, Jones, Johnson;
Joe, Mary, Cathy;
40000, 50000, 44000;
o to get the values of a few columns, only the columns of interest have to be read
→ disk seeks are heavily reduced
• BigTable takes inspiration from the column oriented data model
EmpID | Lastname | Firstname | Salary
1     | Smith    | Joe       | 40000
2     | Jones    | Mary      | 50000
3     | Johnson  | Cathy     | 44000
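The difference between the two layouts can be sketched in Python. The data is the employee table above; the flat lists are only an illustration of serialization order, not an on-disk format:

```python
# Employee table from the slides, as a list of row tuples.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

# Row-oriented layout: all the values of a row are serialized together.
row_oriented = [value for row in rows for value in row]

# Column-oriented layout: all the values of a column are serialized together.
column_oriented = [list(col) for col in zip(*rows)]

# Fetching all salaries from the row-oriented layout means touching every
# fourth value scattered across the whole data set (many disk seeks).
salaries_row = row_oriented[3::4]

# In the column-oriented layout they are contiguous: one sequential read.
salaries_col = column_oriented[3]

assert salaries_row == salaries_col == [40000, 50000, 44000]
```

Both layouts hold the same values; only their adjacency on disk changes, which is exactly what makes few-column queries cheap in the column-oriented model.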
Data Model – Example (1/5)
• scenario [3]
o users create blog entries
o blog entries are categorized by subject area
o users can choose to subscribe to the blogs of other users
• required queries
o which users subscribe to my blog?
o show all the blog entries about fashion
Data Model – Example (2/5)
user table
user_id | username | state
1       | jbellis  | TX
2       | dhutch   | CA
3       | egilmore | NULL

blog table
blog_id | user_id | entry       | category_id
101     | 1       | Today I ... | 3
102     | 2       | I am ...    | 2
103     | 1       | This is ... | 3

subscriber table
subscriber | blogger | row_id
1          | 2       | 1
2          | 1       | 2
1          | 3       | 3

category table
category | category_id
sports   | 1
fashion  | 2
tech     | 3
Relational Data Model
• relational data model, queries implementation
o which users subscribe to my blog?
select user.username
from user join subscriber on user.user_id = subscriber.subscriber
where subscriber.blogger = MY_USER_ID
o show all the blog entries about fashion
select blog.entry
from blog join category on blog.category_id = category.category_id
where category.category = ‘fashion’
→ joins are required to correlate information from different tables
Data Model – Example (4/5)
BigTable Data Model

user
jbellis  : name = jonathan, state = TX
dhutch   : name = daria, state = CA
egilmore : name = eric

blog
92dbeb5 : body = Today I ..., user = jbellis, category = tech
d418a66 : body = I am ..., user = dhutch, category = fashion
6a0b483 : body = This is ..., user = egilmore, category = sports

subscribes_to
jbellis  : dhutch, egilmore
dhutch   : jbellis
egilmore : jbellis, dhutch

subscribes_of
jbellis  : dhutch, egilmore
dhutch   : jbellis, egilmore
egilmore : jbellis

→ NULL values are not stored (e.g. egilmore has no state column)
Data Model – Example (5/5)
• BigTable data model, queries implementation
o which users subscribe to my blog?
select <column names>
from subscribes_to
where key = MY_USERNAME
o show me all of the blog entries about fashion
select blog.body
from blog
where blog.category = ‘fashion’
→ joins are not required, data is denormalized to fit the queries
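As a rough illustration, the denormalized tables of this example can be modeled as nested maps, which is the shape of BigTable’s data model (timestamps omitted for brevity; the dictionaries mirror the example data and are not BigTable’s actual API):

```python
# subscribes_to: one row per user, one column per subscription.
subscribes_to = {
    "jbellis": {"dhutch": "", "egilmore": ""},
    "dhutch": {"jbellis": ""},
    "egilmore": {"jbellis": "", "dhutch": ""},
}

# blog: one row per entry, columns body/user/category.
blog = {
    "92dbeb5": {"body": "Today I ...", "user": "jbellis", "category": "tech"},
    "d418a66": {"body": "I am ...", "user": "dhutch", "category": "fashion"},
    "6a0b483": {"body": "This is ...", "user": "egilmore", "category": "sports"},
}

# who jbellis subscribes to: a single-row lookup, no join needed
my_subscriptions = list(subscribes_to["jbellis"])

# all the blog entries about fashion: a scan filtering one column, no join
fashion_entries = [r["body"] for r in blog.values() if r["category"] == "fashion"]

assert my_subscriptions == ["dhutch", "egilmore"]
assert fashion_entries == ["I am ..."]
```

Each query touches exactly one table because the data was denormalized to fit it.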
Data Model – Differences with RDBMS
• no referential integrity
o to be managed at application level
o no concept of joins
• NULL values are not stored: saves disk space
• denormalization: meets practical requirements
o performance: joins are very demanding
o retention: referenced data can change over time, need for snapshots in history
Data Model – Rows
• row keys are arbitrary strings
• reads/writes to a single row are atomic
→ easier to manage concurrent requests
• data is sorted by row key in lexicographic order
Data Model – Column Families
• columns grouped in column families
• access control and disk/memory management are performed at column family level
• design choices
o one table for each application
o small number of column families, unbounded number of columns
o use a column family for each required query
• column key syntax: family:qualifier
Data Model – Timestamps
• a cell in a table is indexed by row key and column key, and can include several versions of the data
o these versions are indexed by timestamp
o timestamps are assigned either by
BigTable: they represent real time in microseconds
the client: they are chosen on the basis of application-specific needs
o versions are sorted by decreasing timestamp: the most recent version is read first
• garbage-collection mechanisms
o keep only the last N versions
o keep only the versions written in the last N days
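A minimal sketch of a versioned cell with a keep-last-N garbage-collection policy (the Cell class and its methods are illustrative assumptions, not BigTable code):

```python
import time

class Cell:
    """A single cell: a list of (timestamp, value) versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # [(timestamp_us, value)], sorted newest first

    def write(self, value, timestamp_us=None):
        # BigTable can assign real time in microseconds, or the client
        # supplies its own application-specific timestamp.
        if timestamp_us is None:
            timestamp_us = int(time.time() * 1_000_000)
        self.versions.append((timestamp_us, value))
        self.versions.sort(reverse=True)        # decreasing timestamp order
        del self.versions[self.max_versions:]   # GC: keep only the last N

    def read(self):
        # versions sorted by decreasing timestamp: newest is read first
        return self.versions[0][1]

cell = Cell(max_versions=2)
cell.write("v1", timestamp_us=1)
cell.write("v2", timestamp_us=2)
cell.write("v3", timestamp_us=3)
assert cell.read() == "v3"
assert len(cell.versions) == 2   # "v1" was garbage-collected
```

The keep-last-N-days policy would work the same way, filtering on the timestamp instead of the list position.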
Data Model – Locality
• dynamic partitioning of the rows of a table
• tablet: a row range, unit of distribution and load balancing
• rows with contiguous keys are stored in the same tablet
• all the rows of a tablet are managed by a single node
• clients can leverage this property to get good locality for data accesses
Data Model – Real Example (1/3)
• WebTable: a large collection of web pages and related information
o rows: URLs of the pages
o columns: properties of the pages (content, links, language, …)
o timestamp indicating when the information has been retrieved
Data Model – Real Example (2/3)
• data locality in WebTable
o need to group together pages in the same domain
o solution: instead of using the exact URL of the page, the hostname is reversed
o example
original URL: maps.google.com/index.html
row key: com.google.maps/index.html
→ all the pages of the google.com domain are stored contiguously (otherwise, pages of maps.google.com wouldn’t be contiguous to pages of drive.google.com)
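The row-key trick can be sketched as a small helper (webtable_row_key is a hypothetical name; a URL scheme is required here only so that Python’s urlsplit can find the hostname):

```python
from urllib.parse import urlsplit

def webtable_row_key(url):
    """Reverse the hostname components so that pages of one domain sort together."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path

assert webtable_row_key("http://maps.google.com/index.html") == "com.google.maps/index.html"
assert webtable_row_key("http://drive.google.com/index.html") == "com.google.drive/index.html"

# lexicographic sorting now clusters the whole google.com domain
keys = sorted([webtable_row_key("http://maps.google.com/index.html"),
               webtable_row_key("http://www.example.org/"),
               webtable_row_key("http://drive.google.com/index.html")])
```

After sorting, both com.google.* keys are adjacent, so they end up in the same (or neighboring) tablets.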
Data Model – Real Example (3/3)
• column families in WebTable
o the language column family has a single column where the information about the language of the page is stored
o the anchor column family is used to store the references to a page
one column for each reference
the qualifier is the name of the referring site
the value is the text of the link
Data Model – API
• modify data: RowMutation(table_name, row_ID)
o Set(column_ID, value)
o Delete(column_ID)
o applied atomically to a row
• retrieve data: Scanner(table_name)
o FetchColumnFamily(column_family_name)
o Lookup(row_ID)
o ScanStream: iterate through the results
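An in-memory sketch that mimics the call shapes above (the real client library is C++ [1]; the Table class, apply and scan methods are illustrative assumptions):

```python
class RowMutation:
    """Buffers Set/Delete operations, applied atomically to one row."""

    def __init__(self, table, row_id):
        self.table, self.row_id, self.ops = table, row_id, []

    def set(self, column_id, value):
        self.ops.append(("set", column_id, value))

    def delete(self, column_id):
        self.ops.append(("delete", column_id))

    def apply(self):
        # all buffered operations are applied atomically to the row
        row = self.table.rows.setdefault(self.row_id, {})
        for op, col, *val in self.ops:
            if op == "set":
                row[col] = val[0]
            else:
                row.pop(col, None)

class Table:
    def __init__(self):
        self.rows = {}  # row_id -> {"family:qualifier": value}

    def scan(self, column_family):
        # rough stand-in for Scanner + FetchColumnFamily + ScanStream:
        # iterate rows in key order, yielding only the requested family
        for row_id, row in sorted(self.rows.items()):
            cols = {c: v for c, v in row.items()
                    if c.startswith(column_family + ":")}
            if cols:
                yield row_id, cols

webtable = Table()
m = RowMutation(webtable, "com.cnn.www")
m.set("anchor:www.cnn.com", "CNN")   # column key syntax: family:qualifier
m.delete("anchor:old.site")
m.apply()
```

A scan such as `list(webtable.scan("anchor"))` then returns the anchor columns of every row, in lexicographic row-key order.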
Architecture
• physical resources are shared among several Google services
• data is persistently stored to Google File System (GFS) [4] [6]
• coordination activities are supported by Chubby service [5] [6]
Architecture – Google File System (1/6)
• GFS assumptions/requirements
o store a modest number of large files
o typical workloads
two kinds of reads: large streaming reads and small random reads
many large, sequential writes that append data to files
small writes at arbitrary positions are supported but do not have to be efficient
o high throughput is more important than low latency
• GFS interface: supported file operations
o create/delete
o open/close
o read/write
Architecture – Google File System (2/6)
• files are divided into fixed-size chunks (default 64 MB)
• each chunk is replicated on multiple chunkservers (default 3)
• the master maintains all file system metadata
o namespace
o access control information
o mapping from files to chunks
o mapping from chunks to chunkservers
• clients never read or write file data through the master
o a client asks the master which chunkservers it should contact
o it caches this information for a limited time
o it interacts with the chunkservers directly for many subsequent operations
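The client-side read path can be sketched as follows (the Master and Client classes and the placement map are simulation stand-ins, not GFS code):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # default chunk size: 64 MB

def chunk_index(byte_offset):
    """A byte offset in a file falls into a fixed-size chunk."""
    return byte_offset // CHUNK_SIZE

class Master:
    def __init__(self, placement):
        self.placement = placement  # (filename, chunk index) -> replicas
        self.lookups = 0            # counts round trips to the master

    def lookup(self, filename, idx):
        self.lookups += 1
        return self.placement[(filename, idx)]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}             # cached chunk locations (limited time)

    def locate(self, filename, byte_offset):
        key = (filename, chunk_index(byte_offset))
        if key not in self.cache:   # only cache misses reach the master
            self.cache[key] = self.master.lookup(*key)
        return self.cache[key]      # file data flows via these chunkservers

master = Master({("/logs/a", 0): ["cs1", "cs2", "cs3"]})
client = Client(master)
client.locate("/logs/a", 0)
client.locate("/logs/a", 10_000)   # same chunk: served from the cache
assert master.lookups == 1
```

Because locations are cached, many subsequent operations on the same chunk never touch the master, which keeps it off the data path.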
Architecture – Google File System (3/6)
• GFS architecture
Architecture – Google File System (4/6)
• GFS mutations
o a mutation affects a specific part of a file, called a file region
o a mutation can be concurrent with other mutations
o if a mutation would cause a chunk to exceed the maximum size
padding is used to fill the chunk
the client is notified to continue the mutation on another chunk
o write: data is written atomically at an application-specific file offset
o record append: data (the record) is appended atomically at least once
the offset is chosen by GFS and returned to the client
it marks the beginning of a defined region that contains the record
in case of failure at any replica, the append fails and is retried by the client
→ some replicas can have duplicates!!
Architecture – Google File System (5/6)
• GFS consistency guarantees
o after a data mutation, a file region can be
consistent: any client sees the same data, regardless of the chosen replicas
defined: it is consistent and clients see the mutation in its entirety
inconsistent: it is not consistent
o inconsistency for record appends is due to possible duplicates and padding
→ relaxed consistency model!!

                     | Write                      | Record Append
Serial success       | defined                    | defined interspersed with inconsistent
Concurrent successes | consistent but not defined | defined interspersed with inconsistent
Failure              | inconsistent               | inconsistent
Architecture – Google File System (6/6)
• how to cope with the GFS relaxed consistency model
o managed by the client
o padding can be discarded by employing checksum techniques
the writer includes checksums in the records
o duplicates can be filtered out by using unique identifiers
the writer provides unique identifiers to the records
o these mechanisms are implemented
in a library linked by the client
in lower layers of the client itself
→ higher layers of the client can rely on strong consistency guarantees for append mutations
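The checksum and unique-identifier mechanisms can be sketched like this (the record format and function names are illustrative assumptions; in GFS they live in a library linked by the client):

```python
import hashlib

def make_record(uid, payload):
    """Writer side: wrap each record with a unique id and a checksum."""
    digest = hashlib.sha256(payload).hexdigest()
    return {"uid": uid, "payload": payload, "checksum": digest}

def read_records(raw_records):
    """Reader side: drop padding/corruption (bad checksum) and duplicates."""
    seen = set()
    for rec in raw_records:
        if rec is None:  # stands in for padding at a chunk boundary
            continue
        if hashlib.sha256(rec["payload"]).hexdigest() != rec["checksum"]:
            continue     # checksum mismatch: padding or garbage, discard
        if rec["uid"] in seen:
            continue     # duplicate left behind by a retried record append
        seen.add(rec["uid"])
        yield rec["payload"]

r1 = make_record("id-1", b"event A")
r2 = make_record("id-2", b"event B")
log = [r1, None, r2, r2]   # padding plus a duplicated append
assert list(read_records(log)) == [b"event A", b"event B"]
```

Filtering at read time is what lets higher layers treat the append-only log as strongly consistent.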
Architecture – Chubby
• a highly-available and persistent distributed lock service
o five active replicas, one elected as the master that serves requests
o the service is alive when a majority of the replicas are running
o the Paxos algorithm keeps replicas consistent despite failures
• Chubby provides a namespace that consists of directories and small files
• directories and files can be used as locks; reads/writes to a file are atomic
• used for several tasks
o ensure there is at most one active master at any time
o store the bootstrap location of BigTable data
o discover tablet servers and finalize tablet server deaths
o store BigTable schema information
o store access control lists
Architecture – BigTable Components (1/2)
• one master server
o assignment of tablets to tablet servers
o detection of addition/removal of tablet servers
o load balancing of tablet servers
o garbage-collection of files in GFS
o handling of schema changes
• many tablet servers
o can be dynamically added/removed
o each server manages a set of tablets
o handling of read/write requests to data of managed tablets
o splitting of tablets that have grown too large
• a library for the clients
o client data does not move through the master: direct communication with tablet servers
→ the master is lightly loaded in practice
Architecture – BigTable Components (2/2)
• components connection [6]
Architecture – Tablet Location (1/2)
• three-level hierarchy
o first level: a file stored in Chubby contains the location of the root tablet
o second level: the root tablet stores the locations of the tablets of a METADATA table
o third level: each METADATA tablet contains the locations of a set of user tablets
Architecture – Tablet Location (2/2)
• the METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table ID and its end row
• the client library caches tablet locations
• the client library prefetches tablet locations
o it reads the metadata for more tablets whenever it reads the METADATA table
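A toy walk of the hierarchy (server names and the METADATA row encoding below are made up for illustration):

```python
# Simulated tablet servers: each maps a METADATA-style row key,
# (table ID, end row), to the location of the next tablet.
servers = {
    "ts-root": {("METADATA", "m-end-1"): "ts-7"},   # the root tablet
    "ts-7": {("user_table", "row-z"): "ts-42"},     # a METADATA tablet
}
chubby_root_location = "ts-root"   # level 1: file stored in Chubby
cache = {}                         # client-side cache of tablet locations

def locate_user_tablet(table_id, end_row):
    key = (table_id, end_row)      # encoding of table ID and end row
    if key in cache:
        return cache[key]          # cached: no round trips at all
    root = servers[chubby_root_location]            # level 1 -> root tablet
    meta_server = root[("METADATA", "m-end-1")]     # level 2 -> METADATA tablet
    location = servers[meta_server][key]            # level 3 -> user tablet
    cache[key] = location
    return location

assert locate_user_tablet("user_table", "row-z") == "ts-42"
```

With caching (and prefetching of neighboring METADATA rows), most lookups resolve without touching Chubby or the root tablet at all.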
Architecture – Tablet Assignment (1/2)
• the master keeps track of live tablet servers and the current assignment of tablets to tablet servers
o unassigned tablets are included
o an unassigned tablet is assigned to a tablet server having sufficient space
• when a tablet server starts, it creates and acquires a lock on a uniquely-named Chubby file in a servers directory
• the master monitors such directory to discover tablet servers
• a tablet server stops serving its tablets if it loses its lock
o it attempts to reacquire the lock on its file as long as the file exists
o if the file no longer exists, the tablet server kills itself
o whenever a tablet server terminates, it attempts to release its lock so that the master will reassign its tablets more quickly
Architecture – Tablet Assignment (2/2)
• the master has to detect when a tablet server isn’t working
o it periodically asks each tablet server for the status of its lock
o in case of problems (lock lost, unreachable server), the master attempts to acquire the lock itself
o if the lock is acquired, then Chubby is alive and the tablet server has problems
→ the master deletes the file to ensure that such a server can never serve again
o the master then moves all the tablets that were assigned to that server into the set of unassigned tablets
• the master kills itself if its Chubby session expires
o the cluster is not vulnerable to issues between the master and Chubby
o master failures do not change the assignment of tablets to tablet servers
Architecture - SSTable
• file format used internally to store BigTable data
• a persistent, ordered, immutable map from keys to values
• operations
o look up the value associated with a specified key
o iterate over all key/value pairs in a specified key range
• each SSTable contains a sequence of blocks
• a block index, loaded into memory, is used to locate blocks
o the block index is stored at the end of the SSTable
• a lookup can be performed with a single disk seek
o first, find the block with a binary search in the in-memory index
o then, read the appropriate block from disk
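A minimal in-memory sketch of this lookup path (block size and class layout are illustrative; a real SSTable keeps its blocks on disk and only the index in memory):

```python
import bisect

class SSTable:
    """A sorted, immutable key/value map split into fixed-size blocks."""

    def __init__(self, pairs, block_size=2):
        pairs = sorted(pairs)                       # ordered immutable map
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        # in-memory block index: first key of each block
        # (on disk it would be stored at the end of the SSTable)
        self.index = [block[0][0] for block in self.blocks]

    def lookup(self, key):
        # binary search the in-memory index, then one "disk" read
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:                 # read a single block
            if k == key:
                return v
        return None

t = SSTable([("b", 2), ("a", 1), ("d", 4), ("c", 3)])
assert t.lookup("c") == 3
assert t.lookup("x") is None
```

The binary search happens entirely in memory, so each lookup costs at most one block read from disk.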
Architecture – Tablet Serving (1/2)
• the persistent state of a tablet is stored in GFS
• updates are committed to a commit log that stores redo records
• recently committed updates are stored in memory in a sorted buffer (the memtable)
• older updates are stored in a sequence of SSTables
Architecture – Tablet Serving (2/2)
• to recover a tablet, a tablet server reads its metadata from the METADATA table, which contains
o the list of SSTables that comprise the tablet
o a set of redo points, which are pointers into any commit logs that may contain data for the tablet
• write operation
o authorization check: read the list of permitted writers from a Chubby file
o a valid mutation is written to the commit log
o its contents are inserted into the memtable
• read operation
o authorization check
o executed on a merged view of the sequence of SSTables and the memtable
o since the SSTables and the memtable are lexicographically sorted, the merged view can be formed efficiently
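The merged view can be sketched with a k-way merge over the sorted sources (the data values are made up; listing sources newest-first is what decides which value wins for a duplicated key):

```python
import heapq
from itertools import groupby

# Sorted sources of updates for one tablet, newest first.
memtable = [("row2", "new-v2"), ("row4", "v4")]    # recently committed
sstable_1 = [("row1", "v1"), ("row2", "old-v2")]   # older updates
sstable_2 = [("row3", "v3")]

# heapq.merge exploits that every source is lexicographically sorted;
# tagging each pair with its source index makes newer sources sort
# first within the same key, so the first value per key wins.
sources = [memtable, sstable_1, sstable_2]
merged = heapq.merge(*[[(k, i, v) for k, v in src]
                       for i, src in enumerate(sources)])
view = {k: next(g)[2] for k, g in groupby(merged, key=lambda t: t[0])}

assert view == {"row1": "v1", "row2": "new-v2", "row3": "v3", "row4": "v4"}
```

The merge is linear in the data read: no source ever needs to be re-sorted, which is why keeping everything lexicographically sorted makes reads efficient.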
Architecture – Compactions (1/2)
• as write operations execute, the size of the memtable increases
• when the memtable size reaches a threshold
o the memtable is frozen
o a new memtable is created
o the frozen memtable is converted to an SSTable and written to GFS
• this minor compaction process has two goals
1. it shrinks the memory usage of the tablet server
2. it reduces the amount of data that has to be read during recovery
• if this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables!!
Architecture – Compactions (2/2)
• bound the number of GFS files by periodic merging compactions
o read a few SSTables and the memtable, and write out a new SSTable
o when the compaction completes, the input SSTables and memtable are discarded
• a merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction
o SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live
o a major compaction produces an SSTable that contains no deletion information or deleted data
o major compactions allow reclaiming resources used by deleted data, and ensure that deleted data disappears from the system in a timely fashion
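A toy model of the two kinds of compaction (the DELETED sentinel stands in for BigTable’s deletion entries, and plain dictionaries stand in for SSTables):

```python
# Special deletion entry written by non-major compactions.
DELETED = object()

def compact(sstables, major=False):
    """Merge SSTables given newest-first; optionally drop deletions."""
    merged = {}
    for table in sstables:               # newest first: first write wins
        for key, value in table.items():
            merged.setdefault(key, value)
    if major:
        # a major compaction rewrites everything, so its output can
        # contain no deletion information or deleted data
        merged = {k: v for k, v in merged.items() if v is not DELETED}
    return merged

newest = {"row1": DELETED, "row3": "v3b"}
older = {"row1": "v1", "row2": "v2", "row3": "v3a"}

# a non-major compaction must keep the deletion entry: "row1" may still
# be live in older SSTables outside this compaction's inputs
assert compact([newest, older]) == {"row1": DELETED, "row2": "v2", "row3": "v3b"}

# a major compaction sees all SSTables, so deletions can be dropped
assert compact([newest, older], major=True) == {"row2": "v2", "row3": "v3b"}
```

This is why only major compactions actually reclaim the space used by deleted data.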
Why is BigTable CP?
• if the master dies, the services it provided are no longer functioning until a new master is started
→ availability is given up
• if a tablet server dies, client requests to the tablets it managed cannot be served until such tablets are assigned by the master to another tablet server
→ availability is given up
• if Chubby fails (a majority of its replicas die), BigTable cannot execute any synchronization or serve any client request
→ availability is given up
• in case of GFS failure, SSTables and commit logs are not available until recovery
→ availability is given up
Which Consistency Guarantees?
• each row is managed by a single tablet server
→ requests for a row are serialized
• data (SSTables and commit logs) for a specific tablet is written to GFS by a single tablet server
→ reads and writes for a tablet are serialized
• writes to GFS files are always appends
→ no overwrites
• SSTables are written once and then only read
→ writes are not interleaved with reads
• commit logs are only read upon compactions or recovery, and these operations are not interleaved with writes
→ writes are not interleaved with reads
→ strong consistency
References
[1] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data”, OSDI 2006, http://research.google.com/archive/bigtable-osdi06.pdf
[2] Column-oriented DBMS, http://en.wikipedia.org/wiki/Column-oriented_DBMS
[3] DataStax, Cassandra 1.1 Documentation, Section “Understanding the Cassandra Data Model”, http://www.datastax.com/doc-source/pdf/cassandra11.pdf
[4] S. Ghemawat, H. Gobioff, S. Leung, “The Google File System”, SOSP 2003, http://research.google.com/archive/gfs-sosp2003.pdf
[5] M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, OSDI 2006, http://research.google.com/archive/chubby-osdi06.pdf
[6] G. Coulouris, J. Dollimore, T. Kindberg, G. Blair, “Distributed Systems, Concepts and Design”, Fifth Edition, Addison-Wesley 2012