An example of a CP system: Google’s BigTable
Leonardo Aniello [email protected]
Recap of the CAP Theorem
• the theorem in a nutshell
o among Consistency, Availability and tolerance to Partitions, only two can be achieved at the same time
• CP systems give up availability to get strong consistency
o in case of partitions, requests are not served
Scenario - BigTable
• BigTable is a distributed storage system for structured data, used at Google to provide data persistence to several services
o web indexing
o Google Earth
o Google Finance
o Google Analytics
Data Model
• “sparse, distributed, persistent multi-dimensional sorted map” [1]
o sparse: NULL values are not stored, saving disk space
o distributed: the data set is managed by distinct physical nodes
o persistent: data is saved on stable storage
o multi-dimensional: data is indexed by row, column and timestamp
o sorted: data is stored on disk according to a pre-defined order
Data Model – Column vs Row Oriented (1/2)
• consider a tabular view of the data [2]
• RDBMSs are row oriented: all the values of a row are serialized together
1, Smith, Joe, 40000;
2, Jones, Mary, 50000;
3, Johnson, Cathy, 44000;
o to get the values of a few columns, entire rows have to be read anyway
→ many disk seeks
EmpID | Lastname | Firstname | Salary
1     | Smith    | Joe       | 40000
2     | Jones    | Mary      | 50000
3     | Johnson  | Cathy     | 44000
Data Model – Column vs Row Oriented (2/2)
• consider a tabular view of the data [2]
• column oriented data model: all the values of a column are serialized together
1, 2, 3;
Smith, Jones, Johnson;
Joe, Mary, Cathy;
40000, 50000, 44000;
o to get the values of a few columns, only the columns of interest have to be read
→ disk seeks are heavily reduced
• BigTable takes inspiration from the column oriented data model
EmpID | Lastname | Firstname | Salary
1     | Smith    | Joe       | 40000
2     | Jones    | Mary      | 50000
3     | Johnson  | Cathy     | 44000
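The difference between the two layouts can be sketched in Python. The data is the employee table above; the flat lists are only an illustration of serialization order, not an on-disk format:

```python
# Employee table from the slides, as a list of row tuples.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

# Row-oriented layout: all the values of a row are serialized together.
row_oriented = [value for row in rows for value in row]

# Column-oriented layout: all the values of a column are serialized together.
column_oriented = [list(col) for col in zip(*rows)]

# Fetching all salaries from the row-oriented layout means touching every
# fourth value scattered across the whole data set (many disk seeks).
salaries_row = row_oriented[3::4]

# In the column-oriented layout they are contiguous: one sequential read.
salaries_col = column_oriented[3]

assert salaries_row == salaries_col == [40000, 50000, 44000]
```

Both layouts hold the same values; only their adjacency on disk changes, which is exactly what makes few-column queries cheap in the column-oriented model.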
Data Model – Example (1/5)
• scenario [3]
o users create blog entries
o blog entries are categorized by subject area
o users can choose to subscribe to the blogs of other users
• required queries
o which users subscribe to my blog?
o show all the blog entries about fashion
Data Model – Example (2/5)
user table
user_id | username | state
1       | jbellis  | TX
2       | dhutch   | CA
3       | egilmore | NULL

blog table
blog_id | user_id | entry       | category_id
101     | 1       | Today I ... | 3
102     | 2       | I am ...    | 2
103     | 1       | This is ... | 3

subscriber table
subscriber | blogger | row_id
1          | 2       | 1
2          | 1       | 2
1          | 3       | 3

category table
category | category_id
sports   | 1
fashion  | 2
tech     | 3
Relational Data Model
• relational data model, queries implementation
o which users subscribe to my blog?
select user.username
from user join subscriber on user.user_id = subscriber.subscriber
where subscriber.blogger = MY_USER_ID
o show all the blog entries about fashion
select blog.entry
from blog join category on blog.category_id = category.category_id
where category.category = ‘fashion’
→ joins are required to correlate information from different tables
Data Model – Example (4/5)
BigTable Data Model

user
jbellis  : name = jonathan, state = TX
dhutch   : name = daria, state = CA
egilmore : name = eric

blog
92dbeb5 : body = Today I ..., user = jbellis, category = tech
d418a66 : body = I am ..., user = dhutch, category = fashion
6a0b483 : body = This is ..., user = egilmore, category = sports

subscribes_to
jbellis  : dhutch, egilmore
dhutch   : jbellis
egilmore : jbellis, dhutch

subscribes_of
jbellis  : dhutch, egilmore
dhutch   : jbellis, egilmore
egilmore : jbellis

→ NULL values are not stored (e.g. egilmore has no state column)
Data Model – Example (5/5)
• BigTable data model, queries implementation
o which users subscribe to my blog?
select <column names>
from subscribes_to
where key = MY_USERNAME
o show me all of the blog entries about fashion
select blog.body
from blog
where blog.category = ‘fashion’
→ joins are not required, data is denormalized to fit the queries
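As a rough illustration, the denormalized tables of this example can be modeled as nested maps, which is the shape of BigTable’s data model (timestamps omitted for brevity; the dictionaries mirror the example data and are not BigTable’s actual API):

```python
# subscribes_to: one row per user, one column per subscription.
subscribes_to = {
    "jbellis": {"dhutch": "", "egilmore": ""},
    "dhutch": {"jbellis": ""},
    "egilmore": {"jbellis": "", "dhutch": ""},
}

# blog: one row per entry, columns body/user/category.
blog = {
    "92dbeb5": {"body": "Today I ...", "user": "jbellis", "category": "tech"},
    "d418a66": {"body": "I am ...", "user": "dhutch", "category": "fashion"},
    "6a0b483": {"body": "This is ...", "user": "egilmore", "category": "sports"},
}

# who jbellis subscribes to: a single-row lookup, no join needed
my_subscriptions = list(subscribes_to["jbellis"])

# all the blog entries about fashion: a scan filtering one column, no join
fashion_entries = [r["body"] for r in blog.values() if r["category"] == "fashion"]

assert my_subscriptions == ["dhutch", "egilmore"]
assert fashion_entries == ["I am ..."]
```

Each query touches exactly one table because the data was denormalized to fit it.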
Data Model – Differences with RDBMS
• no referential integrity
o to be managed at application level
o no concept of joins
• NULL values are not stored: saves disk space
• denormalization: meets practical requirements
o performance: joins are very demanding
o retention: referenced data can change over time, need for snapshots in history
Data Model – Rows
• row keys are arbitrary strings
• reads/writes to a single row are atomic
→ easier to manage concurrent requests
• data is sorted by row key in lexicographic order
Data Model – Column Families
• columns grouped in column families
• access control and disk/memory management are performed at column family level
• design choices
o one table for each application
o small number of column families, unbounded number of columns
o use a column family for each required query
• column key syntax: family:qualifier
Data Model – Timestamps
• a cell in a table is indexed by row key and column key, and can include several versions of the data
o these versions are indexed by timestamp
o timestamps are assigned either by
BigTable: they represent real time in microseconds
the client: they are chosen on the basis of application-specific needs
o versions are sorted by decreasing timestamp: the most recent version is read first
• garbage-collection mechanisms
o keep only the last N versions
o keep only the versions written in the last N days
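A minimal sketch of a versioned cell with a keep-last-N garbage-collection policy (the Cell class and its methods are illustrative assumptions, not BigTable code):

```python
import time

class Cell:
    """A single cell: a list of (timestamp, value) versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # [(timestamp_us, value)], sorted newest first

    def write(self, value, timestamp_us=None):
        # BigTable can assign real time in microseconds, or the client
        # supplies its own application-specific timestamp.
        if timestamp_us is None:
            timestamp_us = int(time.time() * 1_000_000)
        self.versions.append((timestamp_us, value))
        self.versions.sort(reverse=True)        # decreasing timestamp order
        del self.versions[self.max_versions:]   # GC: keep only the last N

    def read(self):
        # versions sorted by decreasing timestamp: newest is read first
        return self.versions[0][1]

cell = Cell(max_versions=2)
cell.write("v1", timestamp_us=1)
cell.write("v2", timestamp_us=2)
cell.write("v3", timestamp_us=3)
assert cell.read() == "v3"
assert len(cell.versions) == 2   # "v1" was garbage-collected
```

The keep-last-N-days policy would work the same way, filtering on the timestamp instead of the list position.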
Data Model – Locality
• dynamic partitioning of the rows of a table
• tablet: a row range, unit of distribution and load balancing
• rows with contiguous keys are stored in the same tablet
• all the rows of a tablet are managed by a single node
• clients can leverage this property to get good locality for data accesses
Data Model – Real Example (1/3)
• WebTable: a large collection of web pages and related information
o rows: URLs of the pages
o columns: properties of the pages (content, links, language, …)
o timestamp indicating when the information has been retrieved
Data Model – Real Example (2/3)
• data locality in WebTable
o need to group together pages in the same domain
o solution: instead of using the exact URL of the page, the hostname is reversed
o example
original URL: maps.google.com/index.html
row key: com.google.maps/index.html
→ all the pages of the google.com domain are stored contiguously (otherwise, pages of maps.google.com wouldn’t be contiguous to pages of drive.google.com)
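The row-key trick can be sketched as a small helper (webtable_row_key is a hypothetical name; a URL scheme is required here only so that Python’s urlsplit can find the hostname):

```python
from urllib.parse import urlsplit

def webtable_row_key(url):
    """Reverse the hostname components so that pages of one domain sort together."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path

assert webtable_row_key("http://maps.google.com/index.html") == "com.google.maps/index.html"
assert webtable_row_key("http://drive.google.com/index.html") == "com.google.drive/index.html"

# lexicographic sorting now clusters the whole google.com domain
keys = sorted([webtable_row_key("http://maps.google.com/index.html"),
               webtable_row_key("http://www.example.org/"),
               webtable_row_key("http://drive.google.com/index.html")])
```

After sorting, both com.google.* keys are adjacent, so they end up in the same (or neighboring) tablets.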
Data Model – Real Example (3/3)
• column families in WebTable
o the language column family has a single column where the information about the language of the page is stored
o the anchor column family is used to store the references to a page
one column for each reference
the qualifier is the name of the referring site
the value is the text of the link
Data Model – API
• modify data: RowMutation(table_name, row_ID)
o Set(column_ID, value)
o Delete(column_ID)
o applied atomically to a row
• retrieve data: Scanner(table_name)
o FetchColumnFamily(column_family_name)
o Lookup(row_ID)
o ScanStream: iterate through the results
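An in-memory sketch that mimics the call shapes above (the real client library is C++ [1]; the Table class, apply and scan methods are illustrative assumptions):

```python
class RowMutation:
    """Buffers Set/Delete operations, applied atomically to one row."""

    def __init__(self, table, row_id):
        self.table, self.row_id, self.ops = table, row_id, []

    def set(self, column_id, value):
        self.ops.append(("set", column_id, value))

    def delete(self, column_id):
        self.ops.append(("delete", column_id))

    def apply(self):
        # all buffered operations are applied atomically to the row
        row = self.table.rows.setdefault(self.row_id, {})
        for op, col, *val in self.ops:
            if op == "set":
                row[col] = val[0]
            else:
                row.pop(col, None)

class Table:
    def __init__(self):
        self.rows = {}  # row_id -> {"family:qualifier": value}

    def scan(self, column_family):
        # rough stand-in for Scanner + FetchColumnFamily + ScanStream:
        # iterate rows in key order, yielding only the requested family
        for row_id, row in sorted(self.rows.items()):
            cols = {c: v for c, v in row.items()
                    if c.startswith(column_family + ":")}
            if cols:
                yield row_id, cols

webtable = Table()
m = RowMutation(webtable, "com.cnn.www")
m.set("anchor:www.cnn.com", "CNN")   # column key syntax: family:qualifier
m.delete("anchor:old.site")
m.apply()
```

A scan such as `list(webtable.scan("anchor"))` then returns the anchor columns of every row, in lexicographic row-key order.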
Architecture
• physical resources are shared among several Google services
• data is persistently stored to Google File System (GFS) [4] [6]
• coordination activities are supported by Chubby service [5] [6]
Architecture – Google File System (1/6)
• GFS assumptions/requirements
o store a modest number of large files
o typical workloads
two kinds of reads: large streaming reads and small random reads
many large, sequential writes that append data to files
small writes at arbitrary positions are supported but do not have to be efficient
o high throughput is more important than low latency
• GFS interface: supported file operations
o create/delete
o open/close
o read/write
Architecture – Google File System (2/6)
• files are divided into fixed-size chunks (default 64 MB)
• each chunk is replicated on multiple chunkservers (default 3)
• the master maintains all file system metadata
o namespace
o access control information
o mapping from files to chunks
o mapping from chunks to chunkservers
• clients never read or write file data through the master
o a client asks the master which chunkservers it should contact
o it caches this information for a limited time
o it interacts with the chunkservers directly for many subsequent operations
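The client-side read path can be sketched as follows (the Master and Client classes and the placement map are simulation stand-ins, not GFS code):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # default chunk size: 64 MB

def chunk_index(byte_offset):
    """A byte offset in a file falls into a fixed-size chunk."""
    return byte_offset // CHUNK_SIZE

class Master:
    def __init__(self, placement):
        self.placement = placement  # (filename, chunk index) -> replicas
        self.lookups = 0            # counts round trips to the master

    def lookup(self, filename, idx):
        self.lookups += 1
        return self.placement[(filename, idx)]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}             # cached chunk locations (limited time)

    def locate(self, filename, byte_offset):
        key = (filename, chunk_index(byte_offset))
        if key not in self.cache:   # only cache misses reach the master
            self.cache[key] = self.master.lookup(*key)
        return self.cache[key]      # file data flows via these chunkservers

master = Master({("/logs/a", 0): ["cs1", "cs2", "cs3"]})
client = Client(master)
client.locate("/logs/a", 0)
client.locate("/logs/a", 10_000)   # same chunk: served from the cache
assert master.lookups == 1
```

Because locations are cached, many subsequent operations on the same chunk never touch the master, which keeps it off the data path.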
Architecture – Google File System (3/6)
• GFS architecture
Architecture – Google File System (4/6)
• GFS mutations
o a mutation affects a specific part of a file, called a file region
o a mutation can be concurrent with other mutations
o if a mutation would cause a chunk to exceed the maximum size
padding is used to fill the chunk
the client is notified to continue the mutation on another chunk
o write: data is written atomically at an application-specific file offset
o record append: data (the record) is appended atomically at least once
the offset is chosen by GFS and returned to the client
it marks the beginning of a defined region that contains the record
in case of failure at any replica, the append fails and is retried by the client
→ some replicas can have duplicates!!
Architecture – Google File System (5/6)
• GFS consistency guarantees
o after a data mutation, a file region can be
consistent: any client sees the same data, regardless of the chosen replicas
defined: it is consistent and clients see the mutation in its entirety
inconsistent: it is not consistent
o inconsistency for record appends is due to possible duplicates and padding
→ relaxed consistency model!!

                     | Write                      | Record Append
Serial success       | defined                    | defined interspersed with inconsistent
Concurrent successes | consistent but not defined | defined interspersed with inconsistent
Failure              | inconsistent               | inconsistent
Architecture – Google File System (6/6)
• how to cope with the GFS relaxed consistency model
o managed by the client
o padding can be discarded by employing checksum techniques
the writer includes checksums in the records
o duplicates can be filtered out by using unique identifiers
the writer provides unique identifiers to the records
o these mechanisms are implemented
in a library linked by the client
in lower layers of the client itself
→ higher layers of the client can rely on strong consistency guarantees for append mutations
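The checksum and unique-identifier mechanisms can be sketched like this (the record format and function names are illustrative assumptions; in GFS they live in a library linked by the client):

```python
import hashlib

def make_record(uid, payload):
    """Writer side: wrap each record with a unique id and a checksum."""
    digest = hashlib.sha256(payload).hexdigest()
    return {"uid": uid, "payload": payload, "checksum": digest}

def read_records(raw_records):
    """Reader side: drop padding/corruption (bad checksum) and duplicates."""
    seen = set()
    for rec in raw_records:
        if rec is None:  # stands in for padding at a chunk boundary
            continue
        if hashlib.sha256(rec["payload"]).hexdigest() != rec["checksum"]:
            continue     # checksum mismatch: padding or garbage, discard
        if rec["uid"] in seen:
            continue     # duplicate left behind by a retried record append
        seen.add(rec["uid"])
        yield rec["payload"]

r1 = make_record("id-1", b"event A")
r2 = make_record("id-2", b"event B")
log = [r1, None, r2, r2]   # padding plus a duplicated append
assert list(read_records(log)) == [b"event A", b"event B"]
```

Filtering at read time is what lets higher layers treat the append-only log as strongly consistent.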
Architecture – Chubby
• a highly-available and persistent distributed lock service
o five active replicas, one elected as the master that serves requests
o the service is alive when a majority of the replicas are running
o the Paxos algorithm keeps replicas consistent despite failures
• Chubby provides a namespace that consists of directories and small files
• directories and files can be used as locks; reads/writes to a file are atomic
• used for several tasks
o ensure there is at most one active master at any time
o store the bootstrap location of BigTable data
o discover tablet servers and finalize tablet server deaths
o store BigTable schema information
o store access control lists
Architecture – BigTable Components (1/2)
• one master server
o assignment of tablets to tablet servers
o detection of addition/removal of tablet servers
o load balancing of tablet servers
o garbage-collection of files in GFS
o handling of schema changes
• many tablet servers
o can be dynamically added/removed
o each server manages a set of tablets
o handling of read/write requests to data of managed tablets
o splitting of tablets that have grown too large
• a library for the clients
o client data does not move through the master: direct communication with tablet servers
→ the master is lightly loaded in practice
Architecture – BigTable Components (2/2)
• components connection [6]
Architecture – Tablet Location (1/2)
• three-level hierarchy
o first level: a file stored in Chubby contains the location of the root tablet
o second level: the root tablet stores the locations of the tablets of a METADATA table
o third level: each METADATA tablet contains the locations of a set of user tablets
Architecture – Tablet Location (2/2)
• the METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table ID and its end row
• the client library caches tablet locations
• the client library prefetches tablet locations
o it reads the metadata for more tablets whenever it reads the METADATA table
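A toy walk of the hierarchy (server names and the METADATA row encoding below are made up for illustration):

```python
# Simulated tablet servers: each maps a METADATA-style row key,
# (table ID, end row), to the location of the next tablet.
servers = {
    "ts-root": {("METADATA", "m-end-1"): "ts-7"},   # the root tablet
    "ts-7": {("user_table", "row-z"): "ts-42"},     # a METADATA tablet
}
chubby_root_location = "ts-root"   # level 1: file stored in Chubby
cache = {}                         # client-side cache of tablet locations

def locate_user_tablet(table_id, end_row):
    key = (table_id, end_row)      # encoding of table ID and end row
    if key in cache:
        return cache[key]          # cached: no round trips at all
    root = servers[chubby_root_location]            # level 1 -> root tablet
    meta_server = root[("METADATA", "m-end-1")]     # level 2 -> METADATA tablet
    location = servers[meta_server][key]            # level 3 -> user tablet
    cache[key] = location
    return location

assert locate_user_tablet("user_table", "row-z") == "ts-42"
```

With caching (and prefetching of neighboring METADATA rows), most lookups resolve without touching Chubby or the root tablet at all.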
Architecture – Tablet Assignment (1/2)
• the master keeps track of live tablet servers and the current assignment of tablets to tablet servers
o unassigned tablets are included
o an unassigned tablet is assigned to a tablet server having sufficient space
• when a tablet server starts, it creates and acquires a lock on a uniquely-named Chubby file in a servers directory
• the master monitors such directory to discover tablet servers
• a tablet server stops serving its tablets if it loses its lock
o it attempts to reacquire the lock on its file as long as the file exists
o if the file no longer exists, the tablet server kills itself
o whenever a tablet server terminates, it attempts to release its lock so that the master will reassign its tablets more quickly
Architecture – Tablet Assignment (2/2)
• the master has to detect when a tablet server isn’t working
o it periodically asks each tablet server for the status of its lock
o in case of problems (lock lost, unreachable server), the master attempts to acquire the lock itself
o if the lock is acquired, then Chubby is alive and the tablet server has problems
→ the master deletes the file to ensure that such a server can never serve again
o the master then moves all the tablets that were assigned to that server into the set of unassigned tablets
• the master kills itself if its Chubby session expires
o the cluster is not vulnerable to issues between the master and Chubby
o master failures do not change the assignment of tablets to tablet servers
Architecture - SSTable
• file format used internally to store BigTable data
• a persistent, ordered, immutable map from keys to values
• operations
o look up the value associated with a specified key
o iterate over all key/value pairs in a specified key range
• each SSTable contains a sequence of blocks
• a block index, loaded into memory, is used to locate blocks
o the block index is stored at the end of the SSTable
• a lookup can be performed with a single disk seek
o first, find the block with a binary search in the in-memory index
o then, read the appropriate block from disk
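A minimal in-memory sketch of this lookup path (block size and class layout are illustrative; a real SSTable keeps its blocks on disk and only the index in memory):

```python
import bisect

class SSTable:
    """A sorted, immutable key/value map split into fixed-size blocks."""

    def __init__(self, pairs, block_size=2):
        pairs = sorted(pairs)                       # ordered immutable map
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        # in-memory block index: first key of each block
        # (on disk it would be stored at the end of the SSTable)
        self.index = [block[0][0] for block in self.blocks]

    def lookup(self, key):
        # binary search the in-memory index, then one "disk" read
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:                 # read a single block
            if k == key:
                return v
        return None

t = SSTable([("b", 2), ("a", 1), ("d", 4), ("c", 3)])
assert t.lookup("c") == 3
assert t.lookup("x") is None
```

The binary search happens entirely in memory, so each lookup costs at most one block read from disk.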
Architecture – Tablet Serving (1/2)
• the persistent state of a tablet is stored in GFS
• updates are committed to a commit log that stores redo records
• recently committed updates are stored in memory in a sorted buffer (the memtable)
• older updates are stored in a sequence of SSTables
Architecture – Tablet Serving (2/2)
• to recover a tablet, a tablet server reads its metadata from the METADATA table, which contains
o the list of SSTables that comprise the tablet
o a set of redo points, which are pointers into any commit logs that may contain data for the tablet
• write operation
o authorization check: read the list of permitted writers from a Chubby file
o a valid mutation is written to the commit log
o its contents are inserted into the memtable
• read operation
o authorization check
o executed on a merged view of the sequence of SSTables and the memtable
o since the SSTables and the memtable are lexicographically sorted, the merged view can be formed efficiently
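The merged view can be sketched with a k-way merge over the sorted sources (the data values are made up; listing sources newest-first is what decides which value wins for a duplicated key):

```python
import heapq
from itertools import groupby

# Sorted sources of updates for one tablet, newest first.
memtable = [("row2", "new-v2"), ("row4", "v4")]    # recently committed
sstable_1 = [("row1", "v1"), ("row2", "old-v2")]   # older updates
sstable_2 = [("row3", "v3")]

# heapq.merge exploits that every source is lexicographically sorted;
# tagging each pair with its source index makes newer sources sort
# first within the same key, so the first value per key wins.
sources = [memtable, sstable_1, sstable_2]
merged = heapq.merge(*[[(k, i, v) for k, v in src]
                       for i, src in enumerate(sources)])
view = {k: next(g)[2] for k, g in groupby(merged, key=lambda t: t[0])}

assert view == {"row1": "v1", "row2": "new-v2", "row3": "v3", "row4": "v4"}
```

The merge is linear in the data read: no source ever needs to be re-sorted, which is why keeping everything lexicographically sorted makes reads efficient.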
Architecture – Compactions (1/2)
• as write operations execute, the size of the memtable increases
• when the memtable size reaches a threshold
o the memtable is frozen
o a new memtable is created
o the frozen memtable is converted to an SSTable and written to GFS
• this minor compaction process has two goals
1. it shrinks the memory usage of the tablet server
2. it reduces the amount of data that has to be read during recovery
• if this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables!!
Architecture – Compactions (2/2)
• bound the number of GFS files by periodic merging compactions
o read a few SSTables and the memtable, and write out a new SSTable
o when the compaction completes, the input SSTables and memtable are discarded
• a merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction
o SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live
o a major compaction produces an SSTable that contains no deletion information or deleted data
o major compactions allow reclaiming resources used by deleted data, and ensure that deleted data disappears from the system in a timely fashion
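A toy model of the two kinds of compaction (the DELETED sentinel stands in for BigTable’s deletion entries, and plain dictionaries stand in for SSTables):

```python
# Special deletion entry written by non-major compactions.
DELETED = object()

def compact(sstables, major=False):
    """Merge SSTables given newest-first; optionally drop deletions."""
    merged = {}
    for table in sstables:               # newest first: first write wins
        for key, value in table.items():
            merged.setdefault(key, value)
    if major:
        # a major compaction rewrites everything, so its output can
        # contain no deletion information or deleted data
        merged = {k: v for k, v in merged.items() if v is not DELETED}
    return merged

newest = {"row1": DELETED, "row3": "v3b"}
older = {"row1": "v1", "row2": "v2", "row3": "v3a"}

# a non-major compaction must keep the deletion entry: "row1" may still
# be live in older SSTables outside this compaction's inputs
assert compact([newest, older]) == {"row1": DELETED, "row2": "v2", "row3": "v3b"}

# a major compaction sees all SSTables, so deletions can be dropped
assert compact([newest, older], major=True) == {"row2": "v2", "row3": "v3b"}
```

This is why only major compactions actually reclaim the space used by deleted data.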
Why is BigTable CP?
• if the master dies, the services it provided are no longer functioning until a new master is started
→ availability is given up
• if a tablet server dies, client requests to the tablets it managed cannot be served until such tablets are assigned by the master to another tablet server
→ availability is given up
• if Chubby fails (a majority of its replicas die), BigTable cannot execute any synchronization or serve any client request
→ availability is given up
• in case of GFS failure, SSTables and commit logs are not available until recovery
→ availability is given up
Which Consistency Guarantees?
• each row is managed by a single tablet server
→ requests for a row are serialized
• data (SSTables and commit logs) for a specific tablet is written to GFS by a single tablet server
→ reads and writes for a tablet are serialized
• writes to GFS files are always appends
→ no overwrites
• SSTables are written once and then only read
→ writes are not interleaved with reads
• commit logs are only read upon compactions or recovery, and these operations are not interleaved with writes
→ writes are not interleaved with reads
→ strong consistency
References
[1] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data”, OSDI 2006, http://research.google.com/archive/bigtable-osdi06.pdf
[2] Column-oriented DBMS, http://en.wikipedia.org/wiki/Column-oriented_DBMS
[3] DataStax, Cassandra 1.1 Documentation, Section “Understanding the Cassandra Data Model”, http://www.datastax.com/doc-source/pdf/cassandra11.pdf
[4] S. Ghemawat, H. Gobioff, S. Leung, “The Google File System”, SOSP 2003, http://research.google.com/archive/gfs-sosp2003.pdf
[5] M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, OSDI 2006, http://research.google.com/archive/chubby-osdi06.pdf
[6] G. Coulouris, J. Dollimore, T. Kindberg, G. Blair, “Distributed Systems, Concepts and Design”, Fifth Edition, Addison-Wesley 2012