Distributed Database Architecture
Outline
• Data distribution
• Data replication
@source IBM
Distributed data: summary
(figure: six configurations of applications (Appl), DBMSs and databases)
• Basic (single DB): an application connects to a single DBMS and its database
• Distributed Access: an application connects directly to several DBMSs and databases
• Federation: a federation server gives the application a single access point to several DBMSs
• Replication: a replication server copies data from a source DBMS to a target DBMS
• Event Publishing: an event publishing server captures changes and publishes them to applications
• Extract, Transform & Load: an ETL server loads data from source DBMSs into a data warehouse (DW)
Distributed Access and Federation do not move data (DATA MOVE: NO); Replication, Event Publishing and ETL move data (DATA MOVE: YES)
Data distribution
• Shared everything
• Shared disk
• Shared nothing
Type of architecture
Shared everything
(figure: three shared-everything deployments)
• Mainframe: presentation logic and business logic run on the mainframe together with the database; users access it through dumb terminals
• Client/server: a database server hosts the database; presentation and business logic run on the client machines
• Web: presentation logic (JavaScript) in the web browser, business logic on application servers, data on the database server
Shared disk
Shared nothing
• The solution adopted by NoSQL database architectures to support scale-out
• http://www.mullinsconsulting.com/db2arch-sd-sn.html
Evaluation
• What is high availability? It is a mix of
• architecture design
• people!
• process
• technology
• What is NOT high availability
– A pure technology solution
– A synonym for scalability or manageability
Scalability or availability?
How many 9s?
Availability   Downtime (in one year)
100%           never
99.999%        < 5.26 minutes
99.99%         5.26 – 52 minutes
99.9%          52 minutes – 8 hours 45 minutes
99%            8 hours 45 minutes – 87 hours 36 minutes
90%            788 hours 24 minutes – 875 hours 54 minutes
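The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic (illustrative Python, not part of the original slides):

```python
# Yearly downtime implied by an availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year allowed by the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR

for a in (99.999, 99.99, 99.9, 99.0, 90.0):
    print(f"{a}% -> {downtime_minutes(a):.1f} minutes/year")
# 99.999% -> 5.3   99.99% -> 52.6   99.9% -> 525.6 (~8 h 46 m)
# 99%     -> 5256 (~87 h 36 m)      90%   -> 52560 (~876 h)
```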
Replication
• A log is a sequential file stored in stable memory (that is, a “conceptual” storage that never fails)
• It stores all the activities performed by all transactions, in chronological order
• Two types of records are stored:
– Transaction log records: operations on tables
– System event records: checkpoint, dump
System log
– The record content depends on the specific relational operation
– Legend: O = object, AS = after state, BS = before state
Possible operations in a transaction
– begin: B(T)
– insert: I(T, O, AS)
– delete: D(T, O, BS)
– update: U(T, O, BS, AS)
– commit: C(T), or abort: A(T)
Transaction log
A log example
(figure: a log with interleaved records of T1, T2 and T3, e.g. B(T1), B(T2), U(T1,…), U(T2,…), C(T2), B(T3), U(T3,…), U(T1,…))
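To make the record format above concrete, here is a minimal sketch of a sequential, append-only transaction log (illustrative Python; class and field names are assumptions, not from the slides):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    kind: str                      # B, I, D, U, C, A (or CK / DUMP system records)
    tx: Optional[str] = None       # transaction id, e.g. "T1"
    obj: Optional[str] = None      # O  = object (table / tuple) touched
    before: Optional[dict] = None  # BS = before state (for D and U)
    after: Optional[dict] = None   # AS = after state (for I and U)

class TransactionLog:
    """Appends records in chronological order, as a stable-storage log would."""
    def __init__(self, path: str):
        self.path = path

    def append(self, rec: LogRecord) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(rec.__dict__) + "\n")

log = TransactionLog("db.log")
log.append(LogRecord("B", "T1"))                                             # B(T1)
log.append(LogRecord("U", "T1", obj="S1", before={"x": 1}, after={"x": 2}))  # U(T1, S1, BS, AS)
log.append(LogRecord("C", "T1"))                                             # C(T1)
```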
Checkpoint
• A checkpoint records the set of transactions T1, …, Tn that are running at a given point in time
Example of transaction log and checkpoint
(figure: a log with a checkpoint record CK; at the checkpoint T2 is already committed, T1 is still uncommitted, and T3 has not started yet)
• A dump is a full copy of the entire state of a DB, stored in stable memory
• It is executed offline
• It generates a backup
• After the backup is completed, a dump record is written to the log
Dump
Log example
(figure: a log containing a dump record, a checkpoint CK, and interleaved records of T1, T2 and T3: B(T1), B(T2), U(T1,…), U(T2,…), C(T2), B(T3), U(T3,…), U(T1,…))
@source IBM
(figure: replication topologies)
• Data Distribution (1:many): one source replicated to many targets
• Data Consolidation (many:1): many sources replicated into one target
• Multi-Tier Staging: source → staging → targets
• Peer-to-Peer
• Bi-directional: primary and secondary replicate to each other, with conflict detection/resolution
Replica architecture
How to create a replica
1. Detach – 2. Copy – 3. Attach (the original) – 4. Attach (the copy)
How to create a replica
1. Backup (2. Copy) 3. Restore
How to create a replica
Full backup + transaction log
How to create a replica
(figure: log capture for replication, on the source side)
• The Capture program reads the DB log, e.g. TX1: INSERT S1, TX2: INSERT S2, TX3: DELETE S1, TX3: ROLLBACK, TX1: UPDATE S1, TX1: COMMIT
• Captured changes are kept as in-memory transactions while they are still “in-flight”; nothing is inserted into the queues yet (TX2)
• An aborted transaction (TX3) is “zapped” at abort and never makes it to the send queue
• When the commit record of a transaction (TX1) is found, its changes are put (MQ Put) on the send queue
• Q subscriptions and publications (Q-SUBS, Q-PUBS) define the captured sources (SOURCE1, SOURCE2); a restart queue keeps the capture restart point
Capture
• From a conceptual viewpoint, it is replication without the apply step
(figure: captured changes are published as events through an event broker, e.g. WBI Event Broker, to SOA/user applications and targets)
Event Publishing
Replica execution
• Initialization: full backup on the primary, copy, full restore on the secondary
• Synchronization: log backup on the primary, copy, log restore on the secondary
• Monitoring
Another architecture
Publisher → Distributor → Subscribers
Distribution in NoSQL
MongoDB's Approach to Sharding
Partitioning
• User defines shard key
• Shard key defines range of data
• Key space is like points on a line
• Range is a segment of that line
Initially 1 chunk
Default max chunk size: 64 MB
MongoDB automatically splits & migrates chunks when max reached
Data Distribution
Queries routed to specific shards
MongoDB balances cluster
MongoDB migrates data to new nodes
Routing and Balancing
MongoDB Auto-Sharding
• Minimal effort required
– Same interface as single mongod
• Two steps
– Enable Sharding for a database
– Shard collection within database
Architecture
What is a Shard?
• Shard is a node of the cluster
• Shard can be a single mongod or a replica set
Meta Data Storage
• Config Server
– Stores cluster chunk ranges and locations
– Can have only 1 or 3 (production must have 3)
– Not a replica set
Routing and Managing Data
• Mongos
– Acts as a router / balancer
– No local data (persists to config database)
– Can have 1 or many
Sharding infrastructure
Configuration
Example Cluster
mongod --configsvr
Starts a configuration server on the default port (27019)
Starting the Configuration Server
mongos --configdb <hostname>:27019
For 3 configuration servers:
mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>
This is always how to start a new mongos, even if the cluster is already running
Start the mongos Router
mongod --shardsvr
Starts a mongod with the default shard port (27018)
Shard is not yet connected to the rest of the cluster
Shard may have already been running in production
Start the shard database
On mongos:
– sh.addShard('<host>:27018')
Adding a replica set:
– sh.addShard('<rsname>/<seedlist>')
Add the Shard
db.runCommand({ listshards:1 })
{ "shards" :
[{"_id”: "shard0000”,"host”: ”<hostname>:27018” } ],
"ok" : 1
}
Verify that the shard was added
Enabling Sharding
• Enable sharding on a database
sh.enableSharding("<dbname>")
• Shard a collection with the given key
sh.shardCollection("<dbname>.people", { "country": 1 })
• Use a compound shard key to prevent duplicates
sh.shardCollection("<dbname>.cars", { "year": 1, "uniqueid": 1 })
Tag Aware Sharding
• Tag aware sharding allows you to control the distribution of your data
• Tag a range of shard keys
– sh.addTagRange(<collection>,<min>,<max>,<tag>)
• Tag a shard
– sh.addShardTag(<shard>,<tag>)
Mechanics
Partitioning
• Remember it's based on ranges
Chunk is a section of the entire range
A chunk is split once it exceeds the maximum size
There is no split point if all documents have the same shard key
Chunk split is a logical operation (no data is moved)
Chunk splitting
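A rough sketch of the logical split described above: when a chunk grows past the maximum size, a split point inside its shard-key range is chosen and only the chunk metadata changes, no documents move (illustrative Python, not MongoDB's actual code):

```python
from dataclasses import dataclass

MAX_CHUNK_BYTES = 64 * 1024 * 1024  # default max chunk size (64 MB)

@dataclass
class Chunk:
    min_key: object    # inclusive lower bound of the shard-key range
    max_key: object    # exclusive upper bound
    size_bytes: int

def maybe_split(chunk: Chunk, shard_keys_in_chunk: list) -> list:
    """Split a chunk that exceeded the maximum size (metadata-only operation)."""
    if chunk.size_bytes <= MAX_CHUNK_BYTES:
        return [chunk]
    distinct = sorted(set(shard_keys_in_chunk))
    if len(distinct) < 2:       # all documents share one shard key:
        return [chunk]          # there is no valid split point
    split_key = distinct[len(distinct) // 2]   # e.g. the median key
    return [Chunk(chunk.min_key, split_key, chunk.size_bytes // 2),
            Chunk(split_key, chunk.max_key, chunk.size_bytes // 2)]
```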
Balancer is running on mongos
Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts
Balancing
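To illustrate when a balancing round would start, here is a hedged sketch of the check: chunk counts per shard compared against a migration threshold (the threshold value is an assumption, for illustration only):

```python
MIGRATION_THRESHOLD = 2  # assumed value, for illustration

def needs_balancing(chunks_per_shard: dict) -> bool:
    """A balancing round starts when the most and least dense shards
    differ by more than the migration threshold."""
    most, least = max(chunks_per_shard.values()), min(chunks_per_shard.values())
    return most - least > MIGRATION_THRESHOLD

print(needs_balancing({"shard0000": 12, "shard0001": 9}))  # True  (difference 3)
print(needs_balancing({"shard0000": 10, "shard0001": 9}))  # False (difference 1)
```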
The balancer on mongos takes out a “balancer lock”
To see the status of these locks:
use config
db.locks.find({ _id: "balancer" })
Acquiring the Balancer Lock
The mongos sends a moveChunk command to source shard
The source shard then notifies destination shard
Destination shard starts pulling documents from source shard
Moving the chunk
When complete, the destination shard updates the config server
– Provides new locations of the chunks
Committing Migration
Source shard deletes moved data
– Must wait for open cursors to either close or time out
– NoTimeout cursors may prevent the release of the lock
The mongos releases the balancer lock after old chunks are deleted
Cleanup
Routing Requests
Cluster Request Routing
• Targeted Queries
• Scatter Gather Queries
• Scatter Gather Queries with Sort
Cluster Request Routing: Targeted
Query
Routable request received
Request routed to appropriate shard
Shard returns results
Mongos returns results to client
Cluster Request Routing: Non-Targeted
Query
Non-Targeted Request Received
Request sent to all shards
Shards return results to mongos
Mongos returns results to client
Cluster Request Routing: Non-Targeted
Query with Sort
Non-Targeted request with sort received
Request sent to all shards
Query and sort performed locally
Shards return results to mongos
Mongos merges sorted results
Mongos returns results to client
Shard Key
Shard Key
• Shard key is immutable
• Shard key values are immutable
• Shard key must be indexed
• Shard key limited to 512 bytes in size
• Shard key used to route queries
– Choose a field commonly used in queries
• Only shard key can be unique across shards
– `_id` field is only unique within individual shard
Shard Key Considerations
• Cardinality
• Write Distribution
• Query Isolation
• Reliability
• Index Locality
HBase Architecture
Three Major Components
• The HBaseMaster
– One master
• The HRegionServer
– Many region servers
• The HBase client
HBase Components
• Region
– A subset of a table’s rows, like horizontal range partitioning
– Automatically done
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions
Big Picture
HBase architecture
ZooKeeper
• HBase depends on ZooKeeper
• By default HBase manages the ZooKeeper instance
– E.g., starts and stops ZooKeeper
• HMaster and HRegionServers register themselves with ZooKeeper
Cassandra Architecture
Cassandra Architecture Overview
○ Cassandra was designed with the understanding that system/hardware failures can and do occur
○ Peer-to-peer, distributed system
○ All nodes are the same
○ Data partitioned among all nodes in the cluster
○ Custom data replication to ensure fault tolerance
○ Read/Write-anywhere design
○ Google BigTable - data model
○ Column Families
○ Memtables
○ SSTables
○ Amazon Dynamo - distributed systems technologies
○ Consistent hashing
○ Partitioning
○ Replication
○ One-hop routing
Transparent Elasticity
Nodes can be added and removed from Cassandra online, with no downtime being experienced.
(figure: a ring of 6 nodes grows to 12 nodes online)
Transparent Scalability
Adding Cassandra nodes increases performance linearly and the ability to manage terabytes to petabytes of data.
(figure: doubling the cluster from 6 to 12 nodes doubles performance throughput: N → N x 2)
High Availability
Cassandra, with its peer-to-peer architecture, has no single point of failure.
Multi-Geography/Zone Aware
Cassandra allows a single logical database to span 1-N datacenters that are geographically dispersed. Also supports a hybrid on-premise/Cloud implementation.
Data Redundancy
Cassandra allows for customizable data redundancy so that data is completely protected. Also supports rack awareness (data can be replicated between different racks to guard against machine/rack failures).
• Cassandra uses ZooKeeper to elect a leader, which tells nodes the ranges for which they are replicas
• Nodes are logically structured in a ring topology
• The hashed value of the key associated with a data partition is used to assign it to a node in the ring
• Hashing wraps around after a certain value to support the ring structure
• Lightly loaded nodes move position on the ring to alleviate heavily loaded nodes
Partitioning
(figure: consistent hashing ring over [0, 1) with nodes A–F; keys are hashed onto the ring, h(key1) and h(key2), and each is stored on the next N = 3 nodes)
Partitioning & Replication
• Used to discover location and state information about the other nodes participating in a Cassandra cluster
• Network communication protocol inspired by real-life rumor spreading
• Periodic, pairwise, inter-node communication
• Low frequency communication ensures low cost
• Random selection of peers
• Example – Node A wishes to search for a pattern in data
– Round 1 – Node A searches locally and then gossips with node B
– Round 2 – Nodes A and B gossip with C and D
– Round 3 – Nodes A, B, C and D gossip with 4 other nodes …
• Round-by-round doubling makes the protocol very robust (see the sketch below)
Gossip Protocols
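The round-by-round doubling in the example above can be made concrete with a tiny simulation (illustrative Python, assuming each informed node gossips with one uniformly random peer per round):

```python
import random

def gossip_rounds(cluster_size: int) -> int:
    """Rounds until every node has heard the rumor."""
    nodes = [f"N{i}" for i in range(cluster_size)]
    informed = {nodes[0]}          # the rumor starts at one node
    rounds = 0
    while len(informed) < cluster_size:
        for _ in list(informed):   # every informed node picks one random peer
            informed.add(random.choice(nodes))
        rounds += 1
    return rounds

print(gossip_rounds(64))  # a handful of rounds, growing roughly with log(cluster size)
```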
• The gossip process tracks heartbeats from other nodes, both directly and indirectly
• The node fail state is given by a variable Φ
– It tells how likely a node is to have failed (a suspicion level) instead of a simple binary value (up/down)
• This type of system is known as an accrual failure detector
• It takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate
• A threshold on Φ is used to decide whether a node is dead
• If the node is correct, Φ stays around a constant value set by the application; generally Φ(t) = 0
Failure Detection
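A highly simplified sketch of the accrual idea above: Φ grows with the time elapsed since the last heartbeat, relative to the heartbeat intervals observed so far (illustrative Python; the detector actually used by Cassandra keeps a sliding window of intervals and uses different constants):

```python
import math

class AccrualFailureDetector:
    """Suspicion level phi instead of a binary up/down decision."""
    def __init__(self, phi_threshold: float = 8.0):
        self.phi_threshold = phi_threshold
        self.intervals = []        # observed gaps between heartbeats (seconds)
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # assume exponentially distributed gaps: P(no heartbeat yet) = exp(-elapsed/mean)
        return -math.log10(math.exp(-elapsed / mean))

    def is_dead(self, now: float) -> bool:
        return self.phi(now) > self.phi_threshold

d = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):       # regular 1-second heartbeats
    d.heartbeat(t)
print(d.phi(3.5), d.is_dead(3.5))    # small phi  -> node considered alive
print(d.phi(30.0), d.is_dead(30.0))  # large phi  -> node suspected dead
```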
Write Operation Stages
• Logging data in the commit log
• Writing data to the memtable
• Flushing data from the memtable
• Storing data on disk in SSTables
• Commit Log
– First place a write is recorded
– Crash recovery mechanism
– Write not successful until recorded in commit log
– Once recorded in commit log, data is written to Memtable
• Memtable
– Data structure in memory
– Once memtable size reaches a threshold, it is flushed (appended) to SSTable
– Several may exist at once (1 current, any others waiting to be flushed)
– First place read operations look for data
• SSTable
– Kept on disk
– Immutable once written
– Periodically compacted for performance
Write Operations
Write Operations
Consistency
• Read Consistency
– Number of nodes that must agree before a read request returns
– ONE to ALL
• Write Consistency
– Number of nodes that must be updated before a write is considered successful
– ANY to ALL
– At ANY, a hinted handoff is all that is needed to return
• QUORUM
– Commonly used middle-ground consistency level
– Defined as (replication_factor / 2) + 1
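The QUORUM definition above is just a formula over the replication factor; a one-line sketch:

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at consistency level QUORUM."""
    return replication_factor // 2 + 1

print(quorum(3))  # 2
print(quorum(5))  # 3
```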
Write Consistency (ONE)
(figure: ring of 6 nodes, replication_factor = 3, replicas R1–R3, and a client issuing the statement below)
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE
Write Consistency (QUORUM)
(figure: ring of 6 nodes, replication_factor = 3, replicas R1–R3, and a client issuing the statement below)
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
• Write intended for a node that’s offline
• An online node, processing the request, makes a note to carry out the write once the node comes back online.
Write Operations: Hinted Handoff
Hinted Handoff
(figure: ring of 6 nodes, replication_factor = 3 and hinted_handoff_enabled = true; replicas R1–R3 and a client issuing the statement below)
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY
Write locally: system.hints
Note: does not count toward the consistency level (except ANY)
• Tombstones
– On delete request, records are marked for deletion.
– Similar to “Recycle Bin.”
– Data is actually deleted on major compaction or configurable timer
Delete Operations
• Compaction runs periodically to merge multiple SSTables
– Reclaims space
– Creates a new index
– Merges keys
– Combines columns
– Discards tombstones
– Improves performance by minimizing disk seeks
• Two types
– Major
– Read-only
Compaction
• Ensures synchronization of data across nodes
• Compares data checksums against neighboring nodes
• Uses Merkle trees (hash trees)
• Snapshot of data sent to neighboring nodes
• Created and broadcasted on every major compaction
• If two nodes take snapshots within TREE_STORE_TIMEOUT of each other, snapshots are compared and data is synced.
Anti-Entropy
Merkle Tree
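A toy Merkle (hash) tree to illustrate the anti-entropy comparison: two nodes exchange the roots of their trees, and only when the roots differ do they need to drill down and repair (illustrative Python, assuming SHA-256 over row contents):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves) -> bytes:
    """Hash the leaves pairwise, level by level, up to a single root."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])            # duplicate the last node if odd
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

rows_node_a = [b"row1=v1", b"row2=v2", b"row3=v3", b"row4=v4"]
rows_node_b = [b"row1=v1", b"row2=STALE", b"row3=v3", b"row4=v4"]
# equal roots -> the range is in sync; different roots -> drill down and repair
print(merkle_root(rows_node_a) == merkle_root(rows_node_b))  # False
```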
• Read Repair
– On read, nodes are queried until the number of nodes which respond with the most recent value meet a specified consistency level from ONE to ALL.
– If the consistency level is not met, nodes are updated with the most recent value which is then returned.
– If the consistency level is met, the value is returned and any nodes that reported old values are then updated.
Read Operations
Read Repair
(figure: ring of 6 nodes with replicas R1–R3 and a client issuing the query below)
SELECT * FROM table USING CONSISTENCY ONE
replication_factor = 3
• Bloom filters provide a fast way of checking if a value is not in a set.
Read Operations: Bloom Filters
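A minimal Bloom filter sketch to make the “definitely absent / possibly present” behaviour concrete (illustrative Python with k bit positions derived from one SHA-256 digest; not Cassandra's implementation):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.k):
            # derive k positions from non-overlapping 4-byte slices of the digest
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        """False = definitely not in the set; True = possibly in the set."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))   # True
print(bf.might_contain("row-999"))  # almost certainly False
```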
(figure: Cassandra read path across memory and disk — a read checks the Bloom filter and the key cache in memory; a cache hit goes through the compression offsets directly to the data on disk, while a cache miss goes through the partition summary and partition index first; several of these structures are kept off-heap; recommended settings: key_cache_size_in_mb > 0, index_interval = 128 (default))