Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
1
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.0
CS435 Introduction to Big Data
PART 2. LARGE SCALE DATA STORAGE SYSTEMSDISTRIBUTED FILE SYSTEMS
Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs435
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.1
FAQs• Term project presentations• 12/9 (team 1- 6), 12/11 (team 6-12), 12/13 (team 13-16)• Please attend at least 2 presentation sessions and ask questions or provide comments
• Participation score (attendance + question)
• 12 minutes (including transition time)/team• Submit your slides (No PDF!) on canvas
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
2
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.2
Today’s topics
• Google File System
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.3
Inconsistent Regions
Data 3 Data 3
Data 1 Data 1 Data 1
Data 2 Data 2 Data 2
User will re-try to store Data 3
Data 3 Data 3
Data 1
Data 2
Data 3
Data 1
Data 2
Data 3
Data 1
Data 2
Data 3
Data 3
Failed
Empty
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
3
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.4
What if record append fails at one of the replicas
• Client must retry the operation
• Replicas of same chunk may contain• Different data• Duplicates of the same record
• In whole or in part
• Replicas of chunks are not bit-wise identical!• In most systems, replicas are identical
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.5
GFS only guarantees that the data will be written at least once as an atomic unit
• For an operation to return success• Data must be written at the same offset on all the replicas
• After the write, all replicas are as long as the end of the record
• Any future record will be assigned a higher offset or a different chunk
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
4
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.6
GFS client code implements the file system API
• Communications with master and chunk servers done transparently• On behalf of apps that read or write data
• Interact with master for metadata
• Data-bearing communications directly to chunk servers
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.7
Handling failed “write” in Hadoop HDFS [1/2]
• Different from GFS
1. Pipeline is closed• Any packets in the ack queue are added to the front of the data queue
• Datanodes that are downstream from the failed node will not miss any packets
2. The current block on the good datanodes is given a new identity• Reports to the namenode
• To detect and delete partial block on the failed datanode later on
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
5
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.8
Handling failed “write” in Hadoop HDFS [2/2]
3. Remainder of the block’s data is written to the other good datanodes in the pipeline
4. Namenode notices the block is under-replicated • It arranges for a further replica to be created on another node• Write quorum
• dfs.replication.min (default to 1)
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.9
Part 2. Large scale data storage system
Distributed File SystemGoogle File System (GFS): Creating Snapshot
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
6
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.10
Snapshots
• Copying file or directory tree almost instantaneously
• Minimizing any interruptions of ongoing mutations
• Providing checkpoint mechanism• Users can commit later
• Rollback
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.11
Snapshots allow you to make a copy of a file very fast
1. Master revokes outstanding leases for any chunks of the file (source) to be snapshot
2. Log the operation to disk
3. Update in-memory state1. Duplicate metadata of the source file
4. Newly created file points to the “same chunks” as the source
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
7
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.12
When a client wants to write to a chunk C after the snapshot operation• Master sees “the reference count to C” > 1• Pick new chunk-handle C’
• Ask chunk-server with current replica of C• Create new chunk C’• Data is copied locally, not over the network
• From this point chunk handling of C’ is no different
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.13
GFS does not have a per-directory structure that lists files in the directory
• Name spaces represented as a lookup table• Maps full pathnames to metadata
• Each node has an associated read/write lock
• File creation does not require a lock on the directory structure• No inode needs to be protected from modification
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
8
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.14
Each master operation acquires a set of locks before it runs
• Read lock prevents a directory/file from being deleted, renamed, or snapshotted• Others can still read it
• Others CANNOT mutate (write/append/delete) this file
• Write lock on directory/file names serialize attempts to create a file with the same twice• Others CANNOT mutate (write/append/delete) this file
• Others CANNOT read this file
• If operation involves /d1/d2/…/dn/leaf• Acquire read locks on directory names
• /d1, /d1/d2, …, /d1/d2/…/dn• Read or write lock on full pathname
• /d1/d2/…/dn/leaf
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.15
Each master operation acquires a set of locks before it runs: Example
• Person A is reading a file mybooks/les_miserable
• Person B is trying to read a file mybooks/les_miserable
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
9
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.16
Each master operation acquires a set of locks before it runs: Example• Person A is reading a file
mybooks/les_miserablemybooks/ :read lockmybooks/les_miserable: read lock
• Person B is trying to read a file mybooks/les_miserablemybooks/ :read lockmybooks/les_miserable: read lock
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.17
Each master operation acquires a set of locks before it runs: Example
• Person A is reading a file mybooks/les_miserable
• Person B is trying to create a file mybooks/pride_and_prejudice
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
10
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.18
Each master operation acquires a set of locks before it runs: Example• Person A is reading a file
mybooks/les_miserablemybooks/ :read lockmybooks/les_miserable: read lock
• Person B is trying to create a file mybooks/pride_and_prejudicemybooks/ :read lockmybooks/pride_and_prejudice :write lock
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.19
Each master operation acquires a set of locks before it runs: Example
• Person A is trying to create a file mybooks/les_miserable
• Person B is trying to create a file mybooks/les_miserable
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
11
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.20
Each master operation acquires a set of locks before it runs: Example• Person A is trying to create a file
mybooks/les_miserablemybooks/ :read lockmybooks/les_miserable :write lock
• Person B is trying to create a file mybooks/les_miserablemybooks/ :read lockmybooks/les_miserable :write lock
Write locks will serialize the actions
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.21
Each master operation acquires a set of locks before it runs: Example
• Person A is trying to create a file mybooks/les_miserable
• Person B is trying to delete a directory mybooks/
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
12
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.22
Each master operation acquires a set of locks before it runs: Example• Person A is trying to create a file
mybooks/les_miserablemybooks/ :read lockmybooks/les_miserable :write lock
• Person B is trying to delete a file mybooks/mybooks/ :write lock
Write locks will serialize the actions
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.23
Locks are used to prevent operations during snapshots
• /home/user is being snapshotted to /save/user • Read locks on /home and /save• Write lock on /home/user and /save/user
• To create file, /home/user/foo
• Read lock on /home and /home/user• Write lock on /home/user/foo
• The two operations will be serialized
• because they try to obtain /home/user• File creation does not require write lock on parent directory … there is no “directory”
• Read locks on /home and /home/user• Write lock on /home/user/foo
Q: How do we prevent creating a file /home/user/foowhile a directory /home/user/ is being snapshotted to /save/user/ ?
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
13
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.24
Part 2. Large scale data storage system
Distributed File SystemGoogle File System (GFS): Deletion of files and garbage
collection
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.25
Garbage collection in GFS
• After a file is deleted, GFS does not reclaim space immediately
• Done lazily during garbage collection at• File and chunk levels
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
14
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.26
Master logs a file’s deletion immediately
• File is renamed to a hidden name• Includes deletion timestamp
• Master scans the file system namespace• Delete if hidden file existed for more than 3 days
• When file removed from namespace• In memory metadata is also removed• Severs links to all its chunks!
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.27
Garbage collection:When Master scans its chunk namespace
• Identifies orphaned chunks• Not reachable from any file
• Erase metadata for these chunks
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
15
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.28
The role of heart-beats in garbage collection
• Chunk server reports subset of chunks it currently has
• Master replies with identity of chunks no longer present• Chunk server free to delete its replica of such chunks
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.29
Stale chunks and issues • If a chunk server fails • AND misses mutations to the chunk• The chunk replica becomes stale
• Working with a stale replica causes problems with: • Correctness• Consistency
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
16
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.30
Aiding the detection of stale chunks
• Master maintains a chunk version number for each chunk• Distinguish between stale and up-to-date chunks
• When master grants a new lease on chunk• Increase version number• Inform replicas• Record new version persistently
Occurs BEFORE any client can write tochunk
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.31
If a replica is unavailable its version number will not be advanced
• When a chunk server restarts, it reports to the Master with the following:• Set of Chunks• Corresponding version numbers
• Used to detect stale replicas
• Remove stale replicas in regular garbage collection
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
17
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.32
Additional safeguards against stale replicas
• Include chunk version number• When client requests chunk information
• Client/Chunk server verify version to make sure things are up-to-date
• During cloning operations• Clone the most up-to-date chunk
• Clients and chunk servers expected to verify versioning information
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.33
Part 2. Large scale data storage system
Distributed File SystemGoogle File System (GFS): Data Integrity
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
18
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.34
Data Integrity• Impractical to detect chunk corruptions across replicas• Not bytewise identical in any case!
• Detection of corruption should be self-contained
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.35
Data Integrity
• Break chunks into 64 KB data blocks
• Compute 32-bit checksum for block• Keep in chunk server memory• Store persistently, separate from the data
• Verify checksums of data blocks that overlap read range
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
19
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.36
Part 2. Large scale data storage system
Distributed File SystemGoogle File System (GFS): Inefficiencies
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.37
The master server is a single point of failure
• Master server restart takes several seconds
• Shadow servers exist• Can handle reads of files
• In place of the master • But not writes
• Requires a massive main memory
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
20
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.38
The system is optimized for large files
• But not for a very large number of very small files
• Primary operation on files• Long, sequential reads/writes• Large number of random overwrites will clog things up quite a bit
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.39
Consistency Issues: GFS expects clients to resolve inconsistencies
• File chunks may have gaps or duplicates of some records• The client has to be able to deal with this
• Imagine doing this for a scientific application• Portions of a massive array are corrupted
• Clients would have to detect this
• HDFS does NOT have this problem
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
21
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.40
Security model
• None• Operation is expected to be in a trusted environment
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.41
Part 2. Large scale data storage system
Distributed File SystemGoogle File System II: Colossus
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
22
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.42
Storage Software: Colossus (GFS2)
• Next-generation cluster-level file system
• Automatically sharded metadata layer• Distributed Masters (64MB block size à 1MB)• Data typically written using Reed-Solomon (1.5x) • Client-driven replication, encoding and replication • Metadata space has enabled availability
• Why Reed-Solomon?• Cost
• Especially with cross cluster replication
• More flexible cost vs. availability choices
• Google File System II: Dawn of the Multiplying Master Nodes, http://www.theregister.co.uk/2009/08/12/google_file_system_part_deux/?page=1
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.43
Reed-Solomon Codes• Block-based error correcting codes• Digital communication and storage
• Storage devices (including tape, CD, DVD, barcodes, etc)
• Wireless or mobile communications
• Satellite communications
• Digital TV
• High-speed modems
SOURCE: https://en.wikiversity.org/wiki/Reed–Solomon_codes_for_coders
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
23
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.44
What does the R-S code do?
• Takes a block of digital data
• Adds extra “redundant” bits
• If an error happens, the R-S decoder processes each block and recovers original data
Reed-Solomon Encoder
Reed-Solomon Decoder
Communication channel or storage
devices
Noise, Errors
Data source
Data Sink
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.45
A Quick Example of the R-S encoding
• 4+2 coding
• Original files are broken into 4 pieces
• 2 parity pieces are added
• First piece of data “ABCD”, second piece of data “EFGH”…
A B C DE F G HI J K LM N O P
Original Data
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
24
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.46
A Quick Example of the R-S encoding
• Applying coding matrix
A B C DE F G HI J K LM N O P
01 00 00 0000 01 00 0000 00 01 0000 00 00 011b 1c 12 141c 1b 14 12
A B C DE F G HI J K LM N O P51 52 53 4955 56 57 25
x =
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.47
A Quick Example of the R-S encoding
• Data loss• 2 of 6 rows are lost
A B C DE F G HI J K LM N O P
01 00 00 0000 01 00 0000 00 01 0000 00 00 011b 1c 12 141c 1b 14 12
A B C DE F G HI J K LM N O P51 52 53 4955 56 57 25
x =
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
25
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.48
A Quick Example of the R-S encoding
• Without 2 rows
A B C DE F G HI J K LM N O P
01 00 00 0000 01 00 001b 1c 12 141c 1b 14 12
A B C DE F G H51 52 53 4955 56 57 25
x =
A B C DE F G HI J K LM N O P
01 00 00 0000 01 00 0000 00 01 0000 00 00 011b 1c 12 141c 1b 14 12
A B C DE F G HI J K LM N O P51 52 53 4955 56 57 25
x =
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.49
A Quick Example of the R-S encoding
• Multiplying each side with the inverted matrix
A B C DE F G HI J K LM N O P
01 00 00 0000 01 00 001b 1c 12 141c 1b 14 12
A B C DE F G H51 52 53 4955 56 57 25
x
=
01 00 00 0000 01 00 008d f6 7b 01f6 8d 01 7b
x
01 00 00 0000 01 00 008d f6 7b 01f6 8d 01 7b
x
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
26
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.50
A Quick Example of the R-S encoding
• The Inverse Matrix and the Coding Matrix Cancel Out
A B C DE F G HI J K LM N O P
01 00 00 0000 01 00 001b 1c 12 141c 1b 14 12
A B C DE F G H51 52 53 4955 56 57 25
x
=
01 00 00 0000 01 00 008d f6 7b 01f6 8d 01 7b
x
01 00 00 0000 01 00 008d f6 7b 01f6 8d 01 7b
x
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.51
A Quick Example of the R-S encoding
• Reconstructing the Original Data
A B C DE F G HI J K LM N O P
A B C DE F G H51 52 53 4955 56 57 25
=01 00 00 0000 01 00 008d f6 7b 01f6 8d 01 7b
x
CS435 Introduction to Big DataFall 2019 Colorado State University
11/18/2019 Week 13-ASangmi Lee Pallickara
27
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.52
Properties of Reed-Solomon codes
• RS(n,k) with s-bit symbols• Encoder takes k data symbols (blocks) of s bits each • Adds parity symbols to make n symbol code word
• There are n-k parity symbols of s bits each
• A Reed-Solomon decoder can correct up to t symbols that contain errors in a code word, where 2t = n-k.• t= (n-k)/2 for n-k even• t = (n-k-1)/2 for n-k odd
data Parity
2tkn
11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.53
Example
• RS(255,223) with 8 bit symbols• Each code word contains 255 code word bytes• 223 bytes are data and 32 bytes are parity• n=255, k=223, s=8, 2t = 32, t=16
• The decoder can correct any 16 symbol errors in the code word• Errors in up to 16 bytes anywhere in the codeword can be automatically corrected.
data Parity
2tkn