Fall 2012: CSE 704 Web-scale Data Management
Bigtable: A Distributed Storage System for Structured Data
Fay Chang et al. (Google, Inc.)
Presenter: Kyungho [email protected]
10/22/2012
Motivation and Design Goal
• Distributed Storage System for Structured Data
  – Scalability
    • Petabytes of data on thousands of (commodity) machines
  – Wide Applicability
    • Throughput-oriented and latency-sensitive
  – High Performance
  – High Availability
Data Model
• Not a full relational data model
• Provides a simple data model
  – Supports dynamic control over data layout
  – Allows clients to reason about the locality properties
Data Model – A Big Table
• A Table in Bigtable is a:
  – Sparse
  – Distributed
  – Persistent
  – Multidimensional
  – Sorted map
Data Model
• Data is indexed using row and column names
• Data is treated as uninterpreted strings
  – (row:string, column:string, time:int64) → string
• Data locality can be controlled through careful choices of the schema
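The mapping above can be illustrated with a small in-memory model. This is a minimal sketch, not Bigtable's implementation: the class name `SparseSortedTable`, its methods, and the sample row and column keys are assumptions made for illustration. It keeps keys sorted by (row, column, newest timestamp first) so that a row-range scan touches adjacent entries.

```python
from bisect import bisect_left, insort


class SparseSortedTable:
    """Toy model of the (row, column, timestamp) -> value sorted map."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key -> uninterpreted bytes

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newer versions sort first
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the most recent value of a cell, or None if it is absent."""
        i = bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


# Illustrative data only; the keys below are assumed examples.
table = SparseSortedTable()
table.put("com.cnn.www", "contents:", 3, b"<html>v3")
table.put("com.cnn.www", "contents:", 5, b"<html>v5")
table.put("org.example", "contents:", 1, b"<html>")
latest = table.get("com.cnn.www", "contents:")
com_rows = list(table.scan_rows("com.", "com/"))
```

Because the map is sorted by row key, rows that share a key prefix sit next to each other, which is what lets a schema choice control locality.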
• Rows
  – Data maintained in lexicographic order by row key
  – Tablet: rows with consecutive keys
    • Units of distribution and load balancing
• Columns
  – Column families
    • Family:qualifier
• Cells
• Timestamps
Data Model – WebTable Example
A large collection of web pages and related information
• Row Key
  – Bigtable maintains data in lexicographic order by row key
  – Tablet: group of rows with consecutive keys; the unit of distribution
• Column Family
  – Column family is the unit of access control
• Column
  – Column key is specified by “Column family:qualifier”
  – A column can be added to a column family once that column family has been created
• Cell
  – The storage referenced by a particular row key, column key, and timestamp
  – Each cell can contain multiple versions of the same data, indexed by timestamp
API
• Write or delete values in Bigtable
• Look up values from individual rows
• Iterate over a subset of the data in a table
API – Update a Row
• Open the table
• Declare a mutation on the row
• Store a new item under the column key “anchor:www.c-span.org”
• Delete the item under the column key “anchor:www.abc.com”
• Apply the mutation atomically
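The update steps on this slide can be sketched in Python. This is an illustrative stand-in, not the Bigtable client API: the `Table` and `RowMutation` classes, their method names, the table name, the row key "com.cnn.www", and the stored values are all assumptions. Only the two column keys come from the slide.

```python
import threading


class Table:
    """Hypothetical in-memory stand-in for an opened table."""

    def __init__(self, name):
        self.name = name
        self.rows = {}                 # row key -> {column key -> value}
        self._lock = threading.Lock()  # stands in for per-row atomicity

    def apply(self, mutation):
        """Apply all queued operations of one row mutation as a unit."""
        with self._lock:
            row = dict(self.rows.get(mutation.row_key, {}))  # work on a copy
            for op, column, value in mutation.ops:
                if op == "set":
                    row[column] = value
                else:  # "delete"
                    row.pop(column, None)
            self.rows[mutation.row_key] = row  # publish the new row in one step


class RowMutation:
    """Collects set/delete operations that target a single row."""

    def __init__(self, row_key):
        self.row_key = row_key
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))


table = Table("webtable")                                  # open the table
table.rows["com.cnn.www"] = {"anchor:www.abc.com": "ABC"}  # assumed prior state

r1 = RowMutation("com.cnn.www")         # declare a mutation on the row
r1.set("anchor:www.c-span.org", "CNN")  # store a new item
r1.delete("anchor:www.abc.com")         # delete an item
table.apply(r1)                         # both changes become visible together
```

Batching the operations and publishing them in one step mirrors the point of the slide: readers of the row never observe only half of the mutation.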
API – Iterate over a Table
• Create a Scanner instance
• Access the “anchor” column family
• Specify “return all versions”
• Specify a row key
• Iterate over the results
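The scan steps can be sketched the same way. This is a hypothetical `Scanner` written for illustration, not the real client library: the method names, the nested-dict storage layout, and the sample cells are assumptions.

```python
class Scanner:
    """Hypothetical scanner over {row: {column: [(timestamp, value), ...]}}."""

    def __init__(self, cells):
        self._cells = cells
        self._family = None
        self._all_versions = False

    def fetch_column_family(self, family):
        self._family = family
        return self

    def set_return_all_versions(self):
        self._all_versions = True
        return self

    def lookup(self, row_key):
        """Yield (column, timestamp, value) for the selected family in one row."""
        row = self._cells.get(row_key, {})
        for column in sorted(row):
            if self._family is not None and not column.startswith(self._family + ":"):
                continue
            versions = sorted(row[column], reverse=True)  # newest first
            if not self._all_versions:
                versions = versions[:1]
            for timestamp, value in versions:
                yield column, timestamp, value


# Assumed sample data for illustration.
cells = {
    "com.cnn.www": {
        "anchor:cnnsi.com": [(9, "CNN")],
        "anchor:my.look.ca": [(8, "CNN.com"), (4, "CNN")],
        "contents:": [(6, "<html>")],
    }
}
scanner = Scanner(cells).fetch_column_family("anchor").set_return_all_versions()
results = list(scanner.lookup("com.cnn.www"))
```

Restricting the scan to one column family keeps the "contents:" column out of the result, and requesting all versions returns both timestamps of the second anchor.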
API – Other Features
• Single-row transactions
• Client-supplied scripts executed in the address space of the server
• Input source/output target for MapReduce jobs
A Typical Google Machine
A Google Cluster
Building Blocks
• Chubby
  – Highly-available and persistent distributed lock service
• GFS
  – Stores logs and data files
  – SSTable
    • Google’s immutable file format
    • A persistent, ordered, immutable map from keys to values
    • http://code.google.com/p/leveldb/
Chubby
• Highly-available and persistent distributed lock service
  – 5 replicas, one of which is elected as the master
  – Paxos
  – Provides a namespace that consists of directories and small files
Implementation
• Client Library
• Master
  – One and only one!
• Tablet Servers
  – Many
Implementation - Master
• Responsible for assigning tablets to tablet servers
  – Addition/removal of tablet servers
  – Tablet-server load balancing
  – Garbage collecting files in GFS
• Handles schema changes
• Single-master system (as in GFS)
Tablet Server
• Manages a set of tablets
• Handles read and write requests to the tablets
• Splits tablets that have grown too large
How Does a Client Find a Tablet?
Tablet Assignment
• Each tablet is assigned to at most one tablet server at a time
• When a tablet is unassigned, and a tablet server is available, the master assigns the tablet by sending a tablet load request
• Bigtable uses Chubby to keep track of tablet servers
• Detecting a tablet server that is no longer serving its tablets
  – The master periodically asks each tablet server for the status of its lock
  – If a tablet server reports that it has lost its lock, or if the master cannot reach a tablet server, the master attempts to acquire an exclusive lock on the server’s file
  – If the lock acquisition succeeds, Chubby is alive, so the tablet server must have a problem
  – The master deletes the server’s file in Chubby to ensure that the tablet server can never serve again
  – Then, the master moves all the tablets that were previously assigned to that server into the set of unassigned tablets
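The detection steps above can be sketched as one health-check round. This is a simplified illustration under stated assumptions: `FakeChubby`, `check_server`, and the server and tablet names are hypothetical, and the lock service is reduced to a dictionary of file name to lock holder.

```python
class FakeChubby:
    """Hypothetical lock service: one file per tablet server, each with a lock holder."""

    def __init__(self):
        self.files = {}  # server file name -> current lock holder, or None

    def try_acquire(self, name, holder):
        """Take the exclusive lock on a file if nobody holds it."""
        if name in self.files and self.files[name] is None:
            self.files[name] = holder
            return True
        return False

    def delete(self, name):
        self.files.pop(name, None)


def check_server(chubby, server, reachable, holds_lock, assignments, unassigned):
    """One round of the master's check for a single tablet server."""
    if reachable and holds_lock:
        return  # the server is healthy
    # The server lost its lock or cannot be reached: try to take its lock.
    if chubby.try_acquire(server, holder="master"):
        # Chubby answered, so the problem lies with the tablet server.
        chubby.delete(server)  # the server can never serve again
        unassigned.update(assignments.pop(server, set()))


chubby = FakeChubby()
chubby.files = {"ts1": "ts1", "ts2": None}  # ts2 has lost its lock
assignments = {"ts1": {"tabletA"}, "ts2": {"tabletB", "tabletC"}}
unassigned = set()
check_server(chubby, "ts1", True, True, assignments, unassigned)
check_server(chubby, "ts2", True, False, assignments, unassigned)
```

After the second call, the tablets of "ts2" are back in the unassigned set and its Chubby file is gone, while "ts1" is untouched.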
• When a master is started, the master…
  – Grabs a unique master lock in Chubby
  – Scans the servers directory in Chubby to find the live servers
  – Communicates with every live tablet server to discover the current tablet assignment
  – Scans the METADATA table and adds unassigned tablets to the set of unassigned tablets
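The four startup steps can be written as one function. This is a sketch with in-memory stand-ins: the function name, its arguments, and the lock name "master-lock" are assumptions, not part of Bigtable.

```python
def master_startup(chubby_locks, servers_dir, tablets_on_server, metadata_tablets):
    """Return (assignments, unassigned) by following the four startup steps.

    All arguments are hypothetical stand-ins:
      chubby_locks      - dict acting as Chubby's lock table
      servers_dir       - iterable of live server names (the servers directory)
      tablets_on_server - function: server name -> set of tablets it reports
      metadata_tablets  - iterable of every tablet listed in METADATA
    """
    # 1. Grab the unique master lock.
    if chubby_locks.get("master-lock") is not None:
        raise RuntimeError("another master already holds the lock")
    chubby_locks["master-lock"] = "this-master"
    # 2. Scan the servers directory to find the live servers.
    live = sorted(servers_dir)
    # 3. Ask each live server which tablets it is already serving.
    assignments = {server: set(tablets_on_server(server)) for server in live}
    assigned = set().union(*assignments.values())
    # 4. Scan METADATA; any tablet not already assigned is unassigned.
    unassigned = set(metadata_tablets) - assigned
    return assignments, unassigned


reported = {"ts1": {"t1", "t2"}, "ts2": {"t3"}}
locks = {}
assignments, unassigned = master_startup(
    locks, reported.keys(), reported.get, ["t1", "t2", "t3", "t4"]
)
```

Here tablet "t4" appears in METADATA but no live server reports it, so it ends up in the unassigned set.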
Tablet Serving
• Memtable
  – A sorted buffer
  – Maintains the updates on a row-by-row basis
  – Each row is copy-on-write to maintain row-level consistency
  – Older updates are stored in a sequence of SSTables
Tablet Serving - Write
• Write operation
  – The server checks if the operation is valid
  – A valid mutation is written to the commit log
  – After the write has been committed, its contents are inserted into the memtable
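The ordering of the write path (validate, log, then memtable) can be shown in a few lines. This is a toy sketch: the `TabletWriter` class is hypothetical, and the validity rule used here (non-empty row key, known column family) is an assumption standing in for whatever checks the real server performs.

```python
class TabletWriter:
    """Toy write path: validate, append to the commit log, then update the memtable."""

    def __init__(self, allowed_families):
        self.allowed_families = set(allowed_families)
        self.commit_log = []  # stands in for the log file in GFS
        self.memtable = {}    # (row, column) -> value

    def write(self, row, column, value):
        family = column.split(":", 1)[0]
        # Assumed validity rule: non-empty row key and a known column family.
        if not row or family not in self.allowed_families:
            raise ValueError(f"invalid mutation for column {column!r}")
        seq = len(self.commit_log)
        self.commit_log.append((seq, row, column, value))  # commit first
        self.memtable[(row, column)] = value                # then make it visible
        return seq


writer = TabletWriter(allowed_families=["anchor", "contents"])
first_seq = writer.write("com.cnn.www", "anchor:www.c-span.org", "CNN")
try:
    writer.write("com.cnn.www", "unknown:x", "oops")
    rejected = False
except ValueError:
    rejected = True
```

Logging before touching the memtable means that a crash after the append can be repaired by replaying the log, while a rejected mutation leaves no trace in either structure.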
Tablet Serving - Read
• Read operation
  – Check if the operation is valid
  – A valid operation is executed on a merged view of the sequence of SSTables and the memtable
  – The merged view can be formed efficiently since SSTables and the memtable are lexicographically sorted data structures
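Because every source is already sorted, the merged view is a k-way merge. The sketch below uses `heapq.merge` from the Python standard library. The function name `merged_view` and the data layout are assumptions, and deletion markers are left out to keep the example short.

```python
import heapq


def merged_view(memtable, sstables):
    """Yield (key, value) in key order; the newest source wins on duplicate keys.

    memtable: dict of key -> value (the newest data)
    sstables: list of sorted [(key, value), ...] lists, newest first
    """
    sources = [sorted(memtable.items())] + list(sstables)
    # Tag each entry with its source age so that ties on key sort newest first.
    tagged = [
        [(key, age, value) for key, value in source]
        for age, source in enumerate(sources)
    ]
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:  # the first hit for a key is from the newest source
            yield key, value
            last_key = key


memtable = {"b": "b-new", "d": "d1"}
sstables = [[("a", "a2"), ("b", "b-old")], [("a", "a1"), ("c", "c1")]]
view = list(merged_view(memtable, sstables))
```

The merge reads each source once in order, so no source has to be re-sorted or fully loaded to answer a scan.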
Tablet Serving - Recover
• Recover a tablet
  – A tablet server reads the tablet’s metadata from the METADATA table
  – The metadata contains the list of SSTables that comprise a tablet and a set of redo points
  – The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points
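The replay step can be sketched as a filter over the commit log. This is a simplification: the slide speaks of a set of redo points, while the sketch assumes a single sequence number, and the function name and log layout are hypothetical.

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild a memtable from the log entries at or after the redo point.

    commit_log: list of (seq, row, column, value) in sequence order
    redo_point: first sequence number not yet captured in an SSTable (assumed form)
    """
    memtable = {}
    for seq, row, column, value in commit_log:
        if seq >= redo_point:
            memtable[(row, column)] = value  # later entries overwrite earlier ones
    return memtable


log = [
    (0, "r1", "contents:", "old"),  # already persisted in an SSTable
    (1, "r1", "contents:", "new"),
    (2, "r2", "anchor:a", "x"),
]
recovered = recover_memtable(log, redo_point=1)
```

Entries before the redo point are skipped because their effects are already stored in the SSTables listed in the metadata.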
Compaction
• Minor compaction
  – When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable
• Major compaction
  – Rewrite multiple SSTables into one SSTable
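Both compactions can be sketched on a toy tablet. The `Tablet` class is hypothetical, and the threshold is counted in entries here as an assumption, since the slide does not say how memtable size is measured.

```python
class Tablet:
    """Toy tablet with a size-triggered minor compaction and an explicit major one."""

    def __init__(self, threshold):
        self.threshold = threshold  # max memtable entries (assumed unit)
        self.memtable = {}
        self.sstables = []          # newest first; each is a sorted list of (key, value)

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.minor_compaction()

    def minor_compaction(self):
        frozen, self.memtable = self.memtable, {}        # freeze, start a new memtable
        self.sstables.insert(0, sorted(frozen.items()))  # frozen memtable -> SSTable

    def major_compaction(self):
        merged = {}
        for sstable in reversed(self.sstables):  # oldest first, so newer values win
            merged.update(sstable)
        self.sstables = [sorted(merged.items())]


tablet = Tablet(threshold=2)
for key, value in [("a", 1), ("b", 1), ("a", 2), ("c", 1)]:
    tablet.write(key, value)
before_major = len(tablet.sstables)
tablet.major_compaction()
```

In this run two minor compactions produce two SSTables, and the major compaction rewrites them into one in which the newer value of "a" survives.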
[Figure sequence: Write Op goes to the Commit Log (GFS) and the memtable (Memory); Threshold reached: the memtable is written to GFS as a new SSTable; A new memtable; Major compaction]
Schema Management
• Bigtable schemas are stored in Chubby
• The master updates the schema by rewriting the corresponding schema file in Chubby
Optimization
• Locality Group
  – Client defined
  – An abstraction that enables clients to control their data’s storage layout
  – A separate SSTable is generated for each locality group in each tablet during compaction
  – A locality group can be declared to be in-memory
• Compression
  – Clients can control whether the SSTables for a locality group are compressed
• Two-level Caching for Read Performance
  – Scan cache
    • Higher level
    • Caches the key-value pairs returned by the SSTable interface to the tablet server code
  – Block cache
    • Lower level
    • Caches SSTable blocks
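The two cache levels can be sketched as a lookup chain. Everything here is illustrative: the class names are hypothetical, the LRU eviction policy and the capacities are assumptions, and a plain dictionary stands in for SSTable blocks stored in GFS.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache (the eviction policy is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


class CachedReader:
    """Checks the scan cache, then the block cache, then the backing store."""

    def __init__(self, blocks, scan_capacity=2, block_capacity=2):
        self.blocks = blocks  # block id -> {key: value}; stands in for blocks in GFS
        self.scan_cache = LRUCache(scan_capacity)
        self.block_cache = LRUCache(block_capacity)
        self.disk_reads = 0

    def read(self, block_id, key):
        value = self.scan_cache.get((block_id, key))
        if value is not None:
            return value  # repeated read of the same key
        block = self.block_cache.get(block_id)
        if block is None:
            block = self.blocks[block_id]  # simulated read from GFS
            self.disk_reads += 1
            self.block_cache.put(block_id, block)
        value = block[key]
        self.scan_cache.put((block_id, key), value)
        return value


reader = CachedReader({0: {"a": 1, "b": 2}})
values = [reader.read(0, "a"), reader.read(0, "b"), reader.read(0, "a")]
```

The second read hits the block cache because "b" lives in the same block as "a", and the third read hits the scan cache, so only one simulated disk read occurs.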
• Commit-Log Implementation
  – Using one log per tablet server
  – Recovery?
    • A tablet server hosting 100 tablets fails
    • 100 other machines are each assigned a single tablet
    • 100 reads of the full log?
    • Sort the commit log by <table, row name, log seq #>
  – Writing commit logs
    • Two log-writer threads
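The sorting step can be sketched with the standard library. The function name and the tuple layout are assumptions, and the grouping is done per table for brevity, whereas a real tablet covers a row range within a table.

```python
from itertools import groupby


def group_log_by_table(commit_log):
    """Sort a shared log by (table, row, seq) and group the entries per table.

    commit_log: list of (table, row, seq, mutation) tuples in arrival order.
    Returns {table: [(row, seq, mutation), ...]}, each list contiguous and sorted.
    """
    ordered = sorted(commit_log, key=lambda e: (e[0], e[1], e[2]))
    return {
        table: [(row, seq, mutation) for _table, row, seq, mutation in entries]
        for table, entries in groupby(ordered, key=lambda e: e[0])
    }


shared_log = [
    ("t2", "r1", 0, "m0"),
    ("t1", "r2", 1, "m1"),
    ("t1", "r1", 2, "m2"),
    ("t1", "r1", 3, "m3"),
]
per_table = group_log_by_table(shared_log)
```

After sorting, the mutations for one table (and one row within it) are contiguous, so a recovering server can read its portion with one sequential pass instead of scanning the whole interleaved log.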
Performance Evaluation
• Sequential writes/reads
  – Row keys with names 0 to R-1, partitioned into 10N equal-sized ranges (N = number of tablet servers)
  – Wrote a single string under each row key
  – 1GB / tablet server
• Scan
  – Uses the Bigtable Scan API
• Random writes/reads
  – Similar to sequential write/read, but the row key was hashed
• Random reads (Mem)
  – 100MB / tablet server; the locality group is marked as in-memory
Single Tablet Server Performance
Aggregate Throughput
Real Applications
Lessons Learned
• Failures!
• Delay new features until it is clear how the new features will be used
• Monitoring
• Simple Design!
Acknowledgement
• Jeff Dean, “Handling Large Datasets at Google: Current Systems and Future Directions”