
VIAF: Verification-based Integrity Assurance Framework for MapReduce



Yongzhi Wang, Jinpeng Wei

This paper addresses the problem of detecting both non-collusive and collusive malicious workers in a MapReduce system, thereby improving computation integrity. We combine a duplication approach with a verification approach, and both the theoretical and the experimental results are satisfactory.

MapReduce in Brief
- Satisfies the demand for large-scale data processing
- A parallel programming model introduced by Google in 2004 that has become popular in recent years
- Data is stored in a DFS (Distributed File System)
- Computation entities consist of one Master and many Workers (Mappers and Reducers)
- Computation is divided into two phases: Map and Reduce
- Typical applications: inverted-index construction in information retrieval systems, PageRank calculation in search engines, data-mining algorithms

Word Count Instance

[Diagram: word count example, showing counts such as (Hello, 2) and (Hello, 4)]
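The word-count instance above can be sketched in a few lines. This is a generic illustration of the Map and Reduce phases, not the authors' Hadoop code; the function names are ours:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle + Reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

splits = ["hello world", "hello mapreduce world"]
pairs = chain.from_iterable(map_phase(s) for s in splits)
print(reduce_phase(pairs))  # {'hello': 2, 'world': 2, 'mapreduce': 1}
```

Each input split is mapped independently (this is what makes the phase easy to duplicate across workers), and the reducer only sees the grouped intermediate pairs.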

Integrity Vulnerability
Motivation of malicious workers:
- Return an incorrect result to tamper with the computation result
- Return a result without actually performing the computation, to gain profit
How can we detect and eliminate malicious mappers to guarantee high computation integrity?
Straightforward approach: duplication

Non-collusive worker: its wrong result disagrees with its replica's, so duplication exposes it.
Integrity vulnerability: but how do we deal with collusive malicious workers?
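Why duplication alone catches non-collusive cheats but not collusive ones can be seen in a small simulation. The cheating model below (independent wrong values for non-collusive workers, an agreed-upon wrong value for a colluding pair) is an illustrative assumption, not the paper's analysis:

```python
import random

def run_task(collusive, cheat_prob, rng):
    """Run one duplicated task; return whether a bad result is accepted."""
    if collusive:
        # Colluding pair: both workers agree on the SAME wrong value, so
        # the duplicated results match and the cheat slips past comparison.
        return rng.random() < cheat_prob
    # Non-collusive cheats are independent, so two wrong results almost
    # surely differ (and a wrong result differs from a correct one):
    # the mismatch is detected and the task is rescheduled.
    r1 = "bad1" if rng.random() < cheat_prob else "good"
    r2 = "bad2" if rng.random() < cheat_prob else "good"
    return r1 != "good" and r1 == r2  # never true here

rng = random.Random(0)
trials = 10_000
noncoll_missed = sum(run_task(False, 0.3, rng) for _ in range(trials))
coll_missed = sum(run_task(True, 0.3, rng) for _ in range(trials))
print(noncoll_missed, coll_missed)  # 0 misses vs. roughly 3000 of 10000
```

Duplication flags every non-collusive cheat, while a colluding pair's consistent wrong answer is never flagged; this is what motivates adding a trusted verifier.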

VIAF Solution
- Duplication
- Introduce a trusted entity, the Verifier, to perform random verification

Outline
- Motivation
- System Design
- Analysis and Evaluation
- Related Work
- Conclusion

Architecture

Assumptions
- Trusted entities: Master, DFS, Reducers, Verifiers
- Untrusted entities: Mappers
- Information is not tampered with in network communication
- A mapper's result reported to the master is consistent with its local storage (enforced by a commitment-based protocol, as in SecureMR)

Solution
Idea: Duplication + Verification
- Deterministically duplicate each task to TWO mappers, to discern non-collusive mappers
- Non-deterministically verify consistent results, to discern collusive mappers
- Each mapper accumulates credit by passing verification
- A mapper becomes trusted only when its credit reaches the Quiz Threshold
- 100% accuracy in detecting non-collusive malicious mappers
- The more verifications applied to a mapper, the more accurately we can determine whether it is collusive malicious

Data Flow Control

[Diagram: data flow between the Master, two workers (w1, w2), the Verifier, each worker's History Cache and credit, and the Result Buffer]
1. w1 and w2 are any two workers executing a mapping task.
2. The master maintains a Task Queue, a History Cache and a credit counter for each worker, and the Result Buffer.
3. The History Cache records the hash code of each mapping task's result. A result is first cached in the History Cache and is passed to the Result Buffer when the corresponding worker becomes trusted by the master.
4. Each worker's credit is accumulated through verification. When it reaches the Quiz Threshold, the worker's cached results are passed to the Result Buffer and its credit is reset to 0.
5. Since each task is duplicated, a result is passed to the reducer only when both result hashes have been passed to the Result Buffer. This guarantees that the reducer receives task results from two workers that are both trusted by our verification scheme.

Outline
- Motivation
- System Design
- Analysis and Evaluation
- Related Work
- Conclusion
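The steps above can be sketched as a master-side bookkeeping loop. The class and variable names and the exact release policy are our reading of the slide, not the authors' implementation:

```python
import hashlib

QUIZ_THRESHOLD = 3  # k: verifications a worker must pass to become trusted
VERIFY_PROB = 0.2   # v: probability that a consistent result is verified

class WorkerState:
    def __init__(self):
        self.credit = 0
        self.history_cache = {}  # task_id -> result hash, held until trusted

def result_hash(result):
    return hashlib.sha256(result.encode()).hexdigest()

def handle_pair(task_id, res1, res2, w1, w2, verifier, result_buffer, rng):
    """Process the duplicated results of one mapping task (steps 1-5 above)."""
    h1, h2 = result_hash(res1), result_hash(res2)
    if h1 != h2:
        return "mismatch: non-collusive cheat detected, reschedule task"
    # Consistent results: cache the hashes, then verify with probability v.
    w1.history_cache[task_id] = h1
    w2.history_cache[task_id] = h2
    if rng.random() < VERIFY_PROB:
        if result_hash(verifier(task_id)) != h1:
            return "verification failed: collusive cheat detected"
        for w in (w1, w2):
            w.credit += 1
            if w.credit >= QUIZ_THRESHOLD:
                # Trusted: flush the cached hashes to the Result Buffer,
                # then reset the credit to 0 (step 4 above).
                result_buffer.update(w.history_cache)
                w.history_cache.clear()
                w.credit = 0
    return "cached"
```

A result reaches the reducer (step 5) only once both workers' hashes for that task are in `result_buffer`; `verifier` here stands in for the trusted entity that re-executes the task.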

Theoretical Analysis
Measurement metrics (for each task):
- Accuracy: the probability that the reducer receives a good result from the mappers
- Mapping Overhead: the average number of executions launched by the mappers
- Verification Overhead: the average number of executions launched by the verifier

Accuracy vs. Quiz Threshold (collusive mappers only, different values of p)
Parameters:
- m: malicious worker ratio
- c: collusive worker ratio
- p: probability that two assigned collusive workers are in one collusion group and can commit a cheat
- q: probability that two collusive workers commit a cheat
- r: probability that a non-collusive worker commits a cheat
- v: verification probability
- k: Quiz Threshold
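As a back-of-envelope consistency check (our simplification, not the paper's closed-form analysis): with no verification (k = 0), a bad result reaches the reducer only when both replicas of a task land on collusive workers, the pair is in one collusion group, and it cheats, which happens with probability roughly (m·c)²·p·q:

```python
def baseline_accuracy(m, c, p, q):
    """Approximate per-task accuracy with duplication only (k = 0, v = 0).

    Both replicas must land on collusive workers ((m*c)**2), the pair must
    be able to collude (p), and it must actually cheat (q). This is our
    simplified estimate, not the paper's formula.
    """
    return 1.0 - (m * c) ** 2 * p * q

# Roughly matches the slide's ~85% accuracy for m = 0.4, c = p = q = 1.0:
print(round(baseline_accuracy(0.40, 1.0, 1.0, 1.0), 2))  # 0.84
```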

[Graph: accuracy vs. Quiz Threshold k, for three values of p]
This graph shows how accuracy changes as the Quiz Threshold increases, with 40% malicious nodes. Since duplication guarantees 100% accuracy in detecting non-collusive workers, this chart only analyzes an environment containing collusive workers. The three lines represent different values of p, the probability that two collusive workers can collude with each other. The red line represents p = 1.0, the most severe situation for given values of the other factors. When p is 1.0, accuracy is only 85% without verification; setting k to 1 increases accuracy to 93%, and with 5 quizzes accuracy is very close to 100%.

Overhead vs. Quiz Threshold
- Collusive workers only: Mapping Overhead per task stays between 2.0 and 2.4
- Collusive workers only: Verification Overhead per task stays between 0.2 and 0.207 when v is 0.2

[Graphs: Mapping Overhead and Verification Overhead vs. Quiz Threshold k]
The two graphs show the mappers' and the verifier's overhead, respectively, for different values of the Quiz Threshold k. Again, the environment contains 40% malicious nodes, all of them collusive, and the three lines in each graph represent different values of p. Across the values of p, the mappers' overhead increases from 2 to 2.4 as k increases, and it remains well bounded as k goes to infinity. Similarly, the verifier's overhead increases from 0.2 to 0.207 as k increases, with a verification probability of 20%.

Experimental Evaluation
- Implementation based on Hadoop 0.21.0
- 11 virtual machines (512 MB of RAM and 40 GB of disk each, Debian 5.0.6) deployed on a 2.93 GHz, 8-core Intel Xeon CPU with 16 GB of RAM
- Word count application: 400 mapping tasks and 1 reduce task
- Of the 11 virtual hosts:
  - 1 is both the Master and a benign worker
  - 1 is the Verifier
  - 4 are collusive workers (malicious ratio is 40%)
  - 5 are benign workers

Evaluation Results

Quiz Threshold | Accuracy | Mapping Overhead | Verification Overhead
0              | 87.20%   | 2.000            | 0
1              | 99.42%   | 2.045            | 22.00%
2              | 99.83%   | 2.074            | 23.58%
3              | 100%     | 2.053            | 23.00%
4              | 100%     | 2.162            | 23.58%
5              | 100%     | 2.046            | 21.75%
6              | 100%     | 2.111            | 22.58%
7              | 100%     | 2.027            | 19.83%

where c is 1.0, p is 1.0, q is 1.0, and m is 0.40

Without any quiz, accuracy is only 87%, but with 1 quiz it increases to 99%; with a Quiz Threshold of 3, accuracy reaches 100%. The evaluation results are better than the theoretical analysis because, in the real environment, m decreases from 40% toward 0 as malicious workers are detected and removed.

Outline
- Motivation
- System Design
- Analysis and Evaluation
- Related Work
- Conclusion

Related Work
Replication-, sampling-, and checkpoint-based solutions have been proposed in several distributed computing contexts:
- Duplication-based, for the cloud: SecureMR (Wei et al., ACSAC '09); Fault Tolerance Middleware for Cloud Computing (Wenbing et al., IEEE CLOUD '10)
- Quiz-based, for P2P: Result verification and trust-based scheduling in peer-to-peer grids (S. Zhao et al., P2P '05)
- Sampling-based, for the grid: Uncheatable Grid Computing (Wenliang et al., ICDCS '04)
Accountability in cloud computing research:
- Hardware-based attestation: Seeding Clouds with Trust Anchors (Joshua et al., CCSW '10)
- Logging and auditing: Accountability as a Service for the Cloud (Jinhui et al., ICSC '10)

Outline
- Motivation
- System Design
- Analysis and Evaluation
- Related Work
- Conclusion

Conclusion
Contributions:
- Proposed a framework (VIAF) to defeat both collusive and non-collusive malicious mappers in MapReduce computation.
- Implemented the system and showed, through theoretical analysis and experimental evaluation, that VIAF guarantees high accuracy while incurring acceptable overhead.
Future work:
- Further exploration without the assumption that the reducer is trusted.
- Reuse cached results to better utilize verifier resources.

Thanks!
Q&A