SCALABLE DECENTRALIZED DE-DUPLICATION STORE
Prakash Chandrasekaran, Anand Gupta, Gautham Narayanasamy, Vijayaraghavan Subbaiah
Motivation
Importance of storage space
Finding enough space to meet customer demand has been a major challenge for cloud providers.
Saving significant resources during web crawling, indexing, and search.
Backup strategies
Backing up data and replicating it across many geographical locations.
Need for ingenious techniques that use storage space more efficiently.
Deduplication
Removing duplicate copies of files and storing only pointers to the original copy.
Block-level deduplication
Allows finer granularity and hence offers a greater reduction in storage space.
Requires more processing power than file-level deduplication.
Use case
Storage of snapshots of virtual machine (VM) images in a virtualized cloud environment.
Detecting exact duplicates and near duplicates in web pages.
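The block-level idea above can be sketched in a few lines of Python. This is a minimal illustration and not the system presented in these slides: the fixed 4096-byte chunk size, the SHA-256 digest, and the names `BlockStore`, `chunk_bytes`, and `stored_bytes` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; the slides do not specify one


def chunk_bytes(data, size=CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Hypothetical in-memory block-level deduplicating store."""

    def __init__(self):
        self.chunks = {}   # chunk digest -> chunk bytes, stored once
        self.recipes = {}  # file name -> ordered list of chunk digests

    def put(self, name, data):
        recipe = []
        for chunk in chunk_bytes(data):
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this digest has not been seen before.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.recipes[name] = recipe

    def get(self, name):
        # Rebuild the file by following its recipe of chunk pointers.
        return b"".join(self.chunks[d] for d in self.recipes[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = BlockStore()
snapshot_a = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
snapshot_b = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE  # shares its first block
store.put("snapshot_a", snapshot_a)
store.put("snapshot_b", snapshot_b)
```

Here the two snapshots total four blocks, but only three distinct blocks are kept, which is the saving that file-level deduplication would miss because the files are not identical.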
Architecture
Cassandra Schema
create keyspace minhash;
create column family minhash_chunks with column_type=Super;
create column family minhash_filerecipe with column_type=Super;
create column family minhash_fullhash;
create keyspace files;
create column family files_minhash;
Data Distribution
[Figure: Client / Application, Cassandra Cluster, Load Balancing, Cassandra Nodes]
Data Flow in Cassandra
[Figure: data flow between the Client and the Cassandra Cluster]
Input: OS snapshot file / web page, passed as file input to the Client
File name match: check whether the file already exists
Start chunking process to produce chunks
Compute minhash and fullhash
Check whether the full hash already exists
Insert <fileid, minhash>
Insert <minhash, filerecipe>
Insert <minhash, fullhash>
Insert <minhash, chunkData>
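The put path in this flow can be approximated with plain Python dictionaries standing in for the Cassandra column families. This is only one plausible reading of the diagram: the slides do not define how minhash or fullhash are computed, so the sketch assumes that minhash is the smallest chunk digest of a file and that fullhash is the digest of the whole file. The class `DedupStore`, the SHA-1 digest, the chunk size, and the return values are all hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed; not given in the slides


def sha(data):
    return hashlib.sha1(data).hexdigest()


class DedupStore:
    """Dicts named after the column families; not a real Cassandra client."""

    def __init__(self):
        self.files_minhash = {}       # file id -> minhash
        self.minhash_fullhash = {}    # minhash -> set of full-file hashes
        self.minhash_filerecipe = {}  # minhash -> {file id -> chunk digests}
        self.minhash_chunks = {}      # minhash -> {chunk digest -> bytes}

    def put(self, file_id, data):
        # File name match: stop if the file is already stored.
        if file_id in self.files_minhash:
            return "exists"
        # Chunking, then minhash and fullhash (both definitions are assumed).
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
        digests = [sha(c) for c in chunks]
        minhash = min(digests, default=sha(b""))
        fullhash = sha(data)
        known = self.minhash_fullhash.setdefault(minhash, set())
        bin_chunks = self.minhash_chunks.setdefault(minhash, {})
        if fullhash in known:
            status = "duplicate"  # identical content: no chunk data written
        else:
            known.add(fullhash)
            for digest, chunk in zip(digests, chunks):
                bin_chunks.setdefault(digest, chunk)
            status = "stored"
        self.files_minhash[file_id] = minhash
        self.minhash_filerecipe.setdefault(minhash, {})[file_id] = digests
        return status

    def get(self, file_id):
        minhash = self.files_minhash[file_id]
        recipe = self.minhash_filerecipe[minhash][file_id]
        bin_chunks = self.minhash_chunks[minhash]
        return b"".join(bin_chunks[d] for d in recipe)


store = DedupStore()
page = b"x" * CHUNK_SIZE + b"y" * CHUNK_SIZE
first = store.put("page-1", page)
again = store.put("page-1", page)
copy = store.put("page-2", page)
```

Under these assumptions, `first` is `"stored"`, `again` is `"exists"` (file name match), and `copy` is `"duplicate"` (full hash match), so the second copy adds a recipe but no chunk data.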
System Implementation
Sequence - put
Sequence - get
System Efficiency
Calculating the total amount of space saved.
Demonstrating the extent of similarity in various snapshots and web pages.
Measuring the overhead associated with file storage and retrieval in our system.
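As a small worked example of the first metric, space saved can be expressed as the difference between the logical size of all files and the bytes physically stored after deduplication. The function name `space_saved` and the sample figures are assumptions for illustration, not measurements from the system.

```python
def space_saved(logical_bytes, stored_bytes):
    """Return (bytes saved, fraction of logical size saved)."""
    saved = logical_bytes - stored_bytes
    # Guard against division by zero when nothing has been stored yet.
    ratio = saved / logical_bytes if logical_bytes else 0.0
    return saved, ratio


# Hypothetical figures: four 4096-byte blocks written, three kept.
saved, ratio = space_saved(logical_bytes=16384, stored_bytes=12288)
```

With these sample numbers, one block of 4096 bytes is saved, a quarter of the logical size.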
Questions?