Lecture 16 - BigTable and NoSQL

BigTable and NoSQL

CS 739, Fall 2019

Notes from reviews

Outline

1. What is the basic abstraction for BigTable, and what is unique about it?

2. What is the organization of BigTable, what is unique about it?

3. How well does BigTable perform? What are its bottlenecks?

Data Storage for Scalable Applications

• Options:
  – File system: GFS, etc.
    • Good for sequential reads/writes
    • Need index structure for random access
    • Limited concurrency control
  – RDBMS: Oracle
    • Lots of functionality
    • Can scale fairly large ($$$$$)

Database Storage

• Store data in tables without replication
  – Use joins to combine tables
  – Example:
    • Table 1: web page + contents
    • Table 2: web page + link to web page
    • Join the links in table 2 against table 1 to find what the links go to

Why not a database?

• Terminology:
  – not about SQL but about transactions & joins
• RDBMS limitations:
  – Scaling:
    • Low-end systems need expensive hardware to scale up
      – Expect a SAN & one SMP, not a cluster with local disks
    • Joins are expensive
    • Data is not logically partitioned, so partitioning is difficult to do automatically within the RDBMS
  – Complexity:
    • Lots of tuning options, must convert data, hard to store unstructured or flexible data; need a DBA per application
    • Lots of (unneeded) features
  – SQL language
    • Difficult/cumbersome to fit into some languages
  – Inflexible
    • Changing schema requires rewriting data
    • Craigslist: adding a column takes 2 months

NoSQL

• General data model
  – Sets of tables
  – Lots of columns (denormalized data)
• What gets lost
  – No general transactions
    • May not even offer strong consistency with failures
    • Transactions only when convenient (local data, single machine)
  – No rigid schema
    • Rows are sets of key/value pairs, easy to extend
  – No joins
    • Scan, get, put only (select, project, update)

NoSQL benefits

• Price
  – Open source
• Speed
  – No support for complex, slow operations (so can only do fast things)
• Scalability
  – Programmer hints on sharding data across machines
    • Range of rows, subset of columns
• Move some complexity from DB & hardware to developers

NoSQL types

• Key/value store
  – Dynamo: single keys/values
  – DynamoDB/SimpleDB: tables of items, items are a set of key/value pairs
    • Some consistency between keys
    • Some locality between data
• Document store
  – Key + object (no schema at all)
• Alternatives:
  – MySQL + memcached: only for read-intensive workloads

BigTable

• Structured storage:
  – Sparse, distributed, multi-dimensional sorted map (keys in multiple dimensions map to values)
  – Map indexed by row key, column key, and time (3 dimensions)
  – Values are arrays of bytes (no interpretation by BigTable for operations)
• Why?
  – Can optimize random access as compared to raw GFS
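The map described above can be sketched in a few lines. This is a minimal single-machine illustration, not BigTable's implementation: the class name `SortedCellMap` and its methods are hypothetical, and the timestamp is negated in the key only so that newer versions sort first.

```python
import bisect


class SortedCellMap:
    """Toy (row, column, timestamp) -> bytes map kept in sorted key order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key tuple -> uninterpreted bytes

    def put(self, row, column, timestamp, value: bytes):
        key = (row, column, -timestamp)  # negate so newer versions sort first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value of a cell, or None if the cell is empty."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1
```

Because keys are kept sorted, a range scan is one binary search followed by a sequential walk, which is the access pattern the sorted-map design is meant to make cheap. The map is sparse in the sense that only cells that were written occupy any space.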

BigTable abstraction

• Row keys: arbitrary strings
  – Writes to a single row are atomic no matter how many columns are accessed
  – Range of rows managed together = tablet
    • Replicated, distributed together
    • All columns stored with tablet
• QUESTION: How to export locality for efficient access?
  – Rows stored sorted by row key
    • Allows for locality of scans
  – QUESTION: Why?
    • Clients can choose row keys for good scan order with locality
    • e.g. reverse URL:
      – www.cs.wisc.edu/~swift becomes
      – edu.wisc.cs/~swift (all cs sites located together)
  – Can create “column groups” stored separately
    • Allows more efficient search/retrieval of just the related fields in a column group

http://www.cs.wisc.edu/~swift
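The reverse-URL row key above can be sketched as a small helper. The function name `row_key` and the `drop_www` flag are hypothetical; dropping the leading "www" label is an assumption made only to reproduce the slide's example.

```python
def row_key(url: str, drop_www: bool = True) -> str:
    """Reverse the hostname labels of a URL so pages from one domain sort together."""
    url = url.split("://", 1)[-1]          # strip a scheme such as http:// if present
    host, slash, path = url.partition("/")
    labels = host.split(".")
    if drop_www and labels[0] == "www":    # assumption: mirror the slide, which drops "www"
        labels = labels[1:]
    return ".".join(reversed(labels)) + slash + path


urls = ["www.cs.wisc.edu/~swift", "www.cnn.com/", "pages.cs.wisc.edu/~x"]
keys = sorted(row_key(u) for u in urls)
```

After sorting, the two `edu.wisc.cs` keys are adjacent, so a scan over that key prefix touches one contiguous range of rows.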

BigTable abstraction (2)

• Columns grouped into families by name
  – family:qualifier. E.g
  – Access control done on families, not columns
  – Data for a column family compressed together (helps to have similar data types)
  – Families must be created; they have metadata
    • Columns within a family do not need to be created; just use any name
    • No metadata about columns
• QUESTION: Why?
  – More efficient (larger grouping)
  – Often have related data in a row; a real database would normally split it into separate tables keyed off family + row identifier and recombine them using joins
  – e.g. anchors for a web page

BigTable abstraction (3)

• Time
  – Every cell (not row, column) can have multiple versions indexed by time
    • Time either assigned by BigTable at write
    • or set by application
  – Can set GC policy:
    • Keep last n versions
    • Keep versions for last n units of time
• Why?
  – Google uses time
  – Hard to do otherwise
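The two GC policies above can be illustrated with a single versioned cell. This is a sketch under assumptions: `VersionedCell`, `max_versions`, and `max_age` are hypothetical names, and timestamps are plain numbers rather than whatever clock BigTable uses.

```python
import time


class VersionedCell:
    """One cell holding several timestamped versions, newest first."""

    def __init__(self, max_versions=None, max_age=None):
        self.max_versions = max_versions  # policy 1: keep last n versions
        self.max_age = max_age            # policy 2: keep versions from the last n time units
        self.versions = []                # list of (timestamp, value), newest first

    def write(self, value, timestamp=None):
        # Timestamp is either assigned at write time or supplied by the application.
        ts = time.time() if timestamp is None else timestamp
        self.versions.append((ts, value))
        self.versions.sort(key=lambda v: v[0], reverse=True)

    def read(self, at=None):
        """Return the newest version, or the newest one with timestamp <= at."""
        for ts, value in self.versions:
            if at is None or ts <= at:
                return value
        return None

    def gc(self, now=None):
        """Drop versions that neither policy asks to keep."""
        now = time.time() if now is None else now
        if self.max_age is not None:
            self.versions = [v for v in self.versions if v[0] >= now - self.max_age]
        if self.max_versions is not None:
            self.versions = self.versions[: self.max_versions]
```

For example, a cell created with `max_versions=2` that receives writes at times 1, 2, and 3 retains only the versions from times 3 and 2 after `gc()`.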

BigTable use for crawling

• Every web page (URL) is a row
• Row key is the URL (rearranged as shown)
• Column keys are, e.g., text of the link from other pages pointing to this page (e.g. espn.com might reference cnn, so anchor text on espn.com might be “CNN news”)
• All anchors (text on referring pages) grouped into anchor column family
• Each scan of web produces new values at new time

BigTable API

• Write API is operation based
  – Create an operation + add multiple operations to row
    • Specify row, columns to change, or read
    • Issue operation atomically
• Read API: lookup or streaming
  – Streaming: iterate over a set
    • e.g. specify column family, row
    • Iterate over elements of family
  – Lookup: return columns from a particular row
• QUESTION: Random reads are slow. Can anything be done?
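The operation-based write API and the two read styles above might look roughly like the following. Everything here is a hypothetical in-memory sketch: `RowMutation`, `Table`, `apply`, `lookup`, and `scan_family` are illustrative names, not BigTable's client API, and a single lock stands in for per-row atomicity.

```python
import threading


class RowMutation:
    """Hypothetical batch of operations against one row."""

    def __init__(self, row_key):
        self.row_key = row_key
        self.ops = []  # list of (op, column, value)

    def set(self, column, value):
        self.ops.append(("set", column, value))
        return self

    def delete(self, column):
        self.ops.append(("delete", column, None))
        return self


class Table:
    """Toy in-memory table: row key -> {column -> value}."""

    def __init__(self):
        self.rows = {}
        self._lock = threading.Lock()  # one lock for simplicity

    def apply(self, mutation):
        """Apply every operation in the mutation to its row as one atomic step."""
        with self._lock:
            row = dict(self.rows.get(mutation.row_key, {}))  # work on a copy
            for op, column, value in mutation.ops:
                if op == "set":
                    row[column] = value
                else:
                    row.pop(column, None)
            self.rows[mutation.row_key] = row  # publish all changes together

    def lookup(self, row_key, columns):
        """Lookup read: return the requested columns of one row."""
        row = self.rows.get(row_key, {})
        return {c: row[c] for c in columns if c in row}

    def scan_family(self, row_key, family):
        """Streaming read: iterate over the columns of one family in a row."""
        prefix = family + ":"
        row = self.rows.get(row_key, {})
        for column in sorted(row):
            if column.startswith(prefix):
                yield column, row[column]
```

A caller builds one mutation per row, for example `RowMutation("com.cnn").set("anchor:espn.com", b"CNN news").delete("contents:")`, and hands it to `apply`, so readers see either none or all of its changes.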

Key design

• How to build a new service on top of existing services/abstractions
  – GFS for bulk storage: big objects, slowly changing
  – Chubby for coordination
• Service layer is compute only
  – All data stored in Chubby/GFS
  – Can use any node to be a tablet server or master

Master

• GFS/HDFS/MapReduce master role
  – Knowledge of where every file is stored
  – Handles metadata operations (create/delete files)
  – Scheduling/dispatching each task (Map/Reduce)
• With replicated/available services
  – Knowledge can be stored in GFS/Chubby directly; no need to go to master
• What is left?
  – Coordination/placement decisions
    • Adding a replica/removing a replica
    • Partitioning: adding/removing a partition

Major design components

• Table storage:
  – QUESTION: What is the goal?
    • Efficiently store NoSQL data to allow desired operations
    • Work well with underlying storage services
      – Large, sequential operations
• Coordination/metadata management
  – Deciding who serves data for a tablet
    • Creation/deletion
  – Handling failover
  – Handling server creation/deletion

BigTable Data Storage Implementation

• Tables broken down into tablets
  – Contiguous range of primary keys
  – Tablets can be split into smaller tablets if they get too large
• Tablets stored as SSTables
  – Key:value lists
  – Stored in GFS
  – Immutable
    • Reads/writes: handled later
  – NOTE: storing a DB on a distributed file system is kind of new…
• Cache data in memory
  – Lookup table (memtable) contains sorted, up-to-date values

Tablet Servers

• Each tablet assigned to a server
  – Only one server per tablet: no consistency issues
  – Load balancing by moving tablets around
• QUESTION: Why do this and not use GFS load balancing?
  – Answer: need a service to do serialization on a single copy
• QUESTION: Why not run tablet servers on GFS nodes?
  – Perhaps could … but tablet servers use a lot of CPU + I/O, could conflict

Making BigTable big

• Support multiple levels of row lookup
  – Data distributed as tablets; master assigns tablets to servers (10,000 tablets per server)
    • Think sharding/partitioning
  – Root tablet points to METADATA table
    • METADATA contains location of all user tables
• METADATA stores location of all rows in all tables
  – Encoded by table ID (a number) and last row of tablet
• 3 level lookup:
  – Root table finds metadata table, metadata table finds user table
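The lookup above can be sketched with a sorted index keyed by (table ID, last row of tablet). This is an illustration only: `TabletIndex`, `find_server`, the `END` sentinel, and the server names are hypothetical, and each index is an in-memory list rather than a tablet stored in GFS.

```python
import bisect

END = "\U0010ffff"  # hypothetical sentinel that sorts after any real row key


class TabletIndex:
    """Sorted index of tablets: (table_id, last_row) -> location."""

    def __init__(self, entries):
        self._entries = sorted(entries, key=lambda e: e[0])
        self._keys = [key for key, _ in self._entries]

    def locate(self, table_id, row):
        # The tablet holding `row` is the first one whose last row is >= row.
        i = bisect.bisect_left(self._keys, (table_id, row))
        if i == len(self._keys) or self._keys[i][0] != table_id:
            raise KeyError((table_id, row))
        return self._entries[i][1]


# METADATA tablets: each maps user tablets to the server hosting them.
meta_a = TabletIndex([((1, "m"), "server-1"), ((1, END), "server-2")])
meta_b = TabletIndex([((2, END), "server-3")])
# Root tablet: maps the last key covered by each METADATA tablet to that tablet.
root = TabletIndex([((1, END), meta_a), ((2, END), meta_b)])


def find_server(root_index, table_id, row):
    """Root tablet -> METADATA tablet -> server for the user tablet."""
    metadata_tablet = root_index.locate(table_id, row)
    return metadata_tablet.locate(table_id, row)
```

Storing the last row of each tablet means one binary search per level finds the covering tablet, so a client needs only a couple of index reads before contacting the tablet server directly.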

BigTable layout

General update idea

• Updates handled as a checkpoint + log
  – Writes go to a log
  – Periodically write out checkpoints & truncate log
  – Random + sequential writes have same speed
  – Well optimized for update-intensive workloads
• Reads:
  – Maintain memory representation of log for fast reads: memtable
  – Logically find newest value from memtable + all partial checkpoints
• Reclaiming space
  – Merge checkpoints (SSTables) by discarding overwritten/deleted data
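The checkpoint + log idea above can be sketched as a toy tablet. This is a minimal in-memory model under stated assumptions: `Tablet`, `memtable_limit`, and `TOMBSTONE` are hypothetical names, Python lists and dicts stand in for files in GFS, and deletes are modeled as tombstone writes.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []        # append-only commit log
        self.memtable = {}   # in-memory view of the un-checkpointed log
        self.sstables = []   # immutable checkpoints, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.log.append((key, value))  # always an append, whatever the key
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.checkpoint()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def checkpoint(self):
        """Write the memtable out as a sorted checkpoint and truncate the log."""
        frozen = dict(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.sstables.insert(0, frozen)
        self.memtable = {}
        self.log.clear()

    def get(self, key):
        """Find the newest value: memtable first, then checkpoints newest to oldest."""
        for table in [self.memtable] + self.sstables:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None
```

Writes never modify a checkpoint in place, which is why random and sequential writes cost the same; the price is paid on reads, which may have to consult several checkpoints.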

SSTable data access

Bloom filters limit reads

Read path

SSTable merges

[Figure: minor compaction; major (merging) compaction]
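The two kinds of compaction can be sketched as functions over sorted key/value tuples. This is an illustration, not BigTable's code: the function names and the `TOMBSTONE` marker are hypothetical, and an SSTable is modeled as a tuple of (key, value) pairs sorted by key.

```python
TOMBSTONE = object()  # marker recorded for a deleted key


def minor_compaction(memtable):
    """Freeze the memtable into a new immutable, key-sorted SSTable."""
    return tuple(sorted(memtable.items(), key=lambda kv: kv[0]))


def merging_compaction(sstables, major=False):
    """Merge SSTables (given newest first) into one; the newest value of a key wins.

    Only a major compaction, which covers every SSTable of the tablet,
    can safely drop tombstones, because no older copy of the key remains.
    """
    merged = {}
    for table in sstables:                 # newest first
        for key, value in table:
            merged.setdefault(key, value)  # first seen is the newest
    if major:
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return tuple(sorted(merged.items(), key=lambda kv: kv[0]))
```

A non-major merge must keep tombstones because an SSTable outside the merge may still hold an older value for the same key, and dropping the tombstone would make that value visible again.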

Making reads fast

• Logically, a read consults all stored SSTables to find newest value
  – Could be slow
• Optimization: Bloom filters
  – Store approximate representation of SSTable contents
  – QUESTION: What is needed from a filter?
    • No false negatives (never say not present when it is)

Bloom Filters

Start with an m-bit array B, filled with 0s.

B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

To check if y is in S, check B at Hi(y). All k values must be 1.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

Possible to have a false positive: all k values are 1, but y is not in S.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
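The procedure above can be written directly as a small class. This is a sketch: the class and method names are hypothetical, and deriving the k hash functions by salting SHA-256 with the index i is an assumption chosen for simplicity, not something the slides specify.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m             # number of bits in the array B
        self.k = k             # number of hash functions
        self.bits = [0] * m    # B, initially all 0s

    def _positions(self, item: str):
        # Assumption: H_i(x) is SHA-256 of "i:x", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for a in self._positions(item):
            self.bits[a] = 1

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means present or a false positive.
        return all(self.bits[a] for a in self._positions(item))
```

In the read path, a tablet server would keep one such filter per SSTable and skip reading any SSTable whose filter answers False for the requested key.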

Bloom Filter

[Figure: an item x is hashed by h1(x), h2(x), h3(x), …, hk(x); each hash selects one bit of the m-bit vector V0 … Vm-1, shown as 01000 10100 00010.]

Bloom Errors

[Figure: bit vector V0 … Vm-1 (01000 10100 00010) with hashes h1(x), h2(x), h3(x), …, hk(x) and elements a, b, c, d.]

x didn’t appear, yet its bits are already set

Computational Factors

• Size m/n: bits per item.
  – |U| = n: number of elements to encode.
  – hi: U → [1..m]: maintain a bit vector V of size m
• Time k: number of hash functions.
  – Use k hash functions (h1..hk)
• Error f: false positive probability.

Bloom Filter Tradeoffs

• Three factors: m, k and n.
• Normally, n and m are given, and we select k.
  – More hash functions yield more chances to find a 0 bit for elements not in S
  – Fewer hash functions increase the fraction of the bits that are 0.
• Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is 0.5.
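The tradeoff above can be checked numerically. The formulas below are the standard approximation for a Bloom filter with independent, uniform hash functions (they are not stated on the slides): the fraction of bits set is about 1 - e^(-kn/m), the false positive probability is f ≈ (1 - e^(-kn/m))^k, and f is minimized at k = (m/n) ln 2. Function names are hypothetical.

```python
import math


def fraction_of_ones(m: int, n: int, k: float) -> float:
    """Expected fraction of bits set after inserting n items with k hashes each."""
    return 1 - math.exp(-k * n / m)


def false_positive_rate(m: int, n: int, k: float) -> float:
    """Probability that all k probed bits are 1 for an item not in the set."""
    return fraction_of_ones(m, n, k) ** k


def optimal_k(m: int, n: int) -> float:
    """Number of hash functions that minimizes the false positive rate."""
    return (m / n) * math.log(2)


m, n = 10_000, 1_000           # 10 bits per item
k_best = optimal_k(m, n)       # about 6.93, so 7 in practice
rates = {k: false_positive_rate(m, n, k) for k in (3, 7, 12)}
```

With 10 bits per item, both too few (k = 3) and too many (k = 12) hash functions give a higher false positive rate than k = 7, and at the optimum exactly half of the bits are expected to be set, matching the 0.5 "hit ratio" above.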

Commit logs

• Logically need to write to log for every update
  – High I/O cost
• How to structure logs?
  – Per tablet:
    • Each server has 10,000 tablets, so lots of logs -> lots of seeks (common)
    • Easy recovery: read single log for tablet (rare)
  – Per server:
    • Can write sequential logs for all tablets (common)
    • Must partition during recovery (rare)
• Group commit:
  – Batch a bunch of updates and write them all at once
  – Hurts write latency, greatly improves throughput
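A per-server log with group commit can be sketched as follows. This is a toy model under assumptions: `ServerLog`, `batch_size`, and the in-memory `flushed` list (standing in for a sequential log file) are hypothetical, and a real system would also flush on a timer so that a partial batch does not wait indefinitely.

```python
class ServerLog:
    """One shared commit log per tablet server, written in batches."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.pending = []   # updates buffered in memory, not yet durable
        self.flushed = []   # stand-in for the sequential log file
        self.flushes = 0    # number of (expensive) log writes issued

    def append(self, tablet_id, key, value):
        self.pending.append((tablet_id, key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write the whole batch with a single sequential log write."""
        if self.pending:
            self.flushed.extend(self.pending)
            self.pending = []
            self.flushes += 1

    def recover(self):
        """Partition the shared log by tablet, as recovery must."""
        per_tablet = {}
        for tablet_id, key, value in self.flushed:
            per_tablet.setdefault(tablet_id, []).append((key, value))
        return per_tablet
```

Six updates with `batch_size=3` cost two log writes instead of six. An update is acknowledged only after its batch is flushed, which is where the added write latency comes from.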

Keeping tablets small

• Tablet server can split a tablet that gets too big
  – Too much load for one server
  – Index does not fit in memory
• How:
  – Write out new SSTables
  – Write new tablet to METADATA table (using BigTable)
  – Notify master so it can assign server to host new tablet

Consistency

• BigTable relies on Chubby for consistency
  – To ensure there is only one master (locking/consistency)
  – To store location of data (bootstrapping)
    • Stores location (name) of BigTable master
  – To discover tablet servers and kill them
    • Tablet servers create a file in Chubby with a unique name for the server
    • Chubby notifies master if a server dies and loses its lock
    • Master acquires server's lock to kill it (prevent it from running)
  – To store access control lists over column families
• Only one tablet server serves a tablet
  – Consistency for tablet writes, GFS provides durability
• QUESTION: Why doesn’t GFS use Chubby?
  – Older code base…

BigTable master

• Master is used for consistent updates to shared state, such as adding/removing servers, GC, and assigning tablets to tablet servers
• Master doesn’t handle requests
  – Clients look in Chubby to see who serves the root/metadata tablets instead

Evaluation

• How to evaluate such a system?
  – What are the considerations?
    • Latency
    • Throughput
    • Scalability with increasing # of servers
    • Performance during failure
  – How to know if it is good or bad? What is the right performance?
    • A: look at what a disk can do: 50-100 MB/sec
  – How does BigTable do?
    • 8 MB/sec sequential write, 4 MB/sec sequential read with one server
    • 1 MB/sec random read
      – Must read 64 KB chunks from GFS to serve 1 KB of data (read amplification)
  – Why?
    • Separation from GFS leads to conflicting network access
    • 1 Gbps networks have a hard time keeping up with disk speeds

BigTable notes

• Simple table abstraction without joins
  – Allows for random lookup, sequential scan
• Relies on GFS for storage
  – Doesn’t handle data reliability, but worries about consistency of BigTable servers
  – General idea of separating out storage layer from processing layer is becoming common: easier to repartition data if you do not have to move it, but adds latency
• Semantics match use of Google data processing
  – Non-traditional APIs
  – Limited transaction support
• Performance worse than a DB
  – Per-node I/O scales negatively with cluster size (network limited, node imbalance)
  – Read rate same as write rate despite lack of replication

Bentley-McIlroy data compression

• Ref: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&type=pdf
• Fingerprint idea:
  – Compute fingerprint of fixed-length substrings
    • Every b bytes: finds overlaps of at least b bytes
  – Compare fingerprint against every byte offset
    • If matches, try to extend forwards and backwards to make longer
  – Replace overlap with starting byte of first appearance of substring, length of overlap
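The fingerprint idea above can be sketched in simplified form. The assumptions are significant: block contents are used directly as dictionary keys in place of a rolling fingerprint, matches are extended forwards only (not backwards), and the token format (literal `bytes` or a `(start, length)` tuple) is hypothetical rather than the encoding from the referenced paper.

```python
def compress(data: bytes, b: int = 4):
    """Return tokens: literal bytes objects or (start, length) back-references."""
    table = {}          # block contents -> offset of its first appearance
    tokens = []
    literal_start = 0   # start of the pending run of literal bytes
    next_block = 0      # next offset (a multiple of b) whose block gets indexed
    i = 0
    while i < len(data):
        # Index every b-byte block that lies entirely before position i.
        while next_block + b <= i:
            table.setdefault(data[next_block:next_block + b], next_block)
            next_block += b
        # Compare the b bytes at this offset against the indexed blocks.
        start = table.get(data[i:i + b]) if i + b <= len(data) else None
        if start is None:
            i += 1
            continue
        length = b
        while i + length < len(data) and data[start + length] == data[i + length]:
            length += 1  # extend the match forwards
        if literal_start < i:
            tokens.append(data[literal_start:i])
        tokens.append((start, length))
        i += length
        literal_start = i
    if literal_start < len(data):
        tokens.append(data[literal_start:])
    return tokens


def decompress(tokens) -> bytes:
    out = bytearray()
    for token in tokens:
        if isinstance(token, bytes):
            out += token
        else:
            start, length = token
            for j in range(length):  # byte by byte: a copy may overlap its own output
                out.append(out[start + j])
    return bytes(out)
```

For input such as `b"the quick brown fox " * 5 + b"jumps"`, the first 20 bytes are emitted as a literal and the four repeats collapse into a single `(0, 80)` reference. Indexing only every b-th block keeps the table small, at the cost of possibly missing repeats shorter than about 2b bytes.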
