Distributed Systems - Brown University, cs.brown.edu/courses/csci1380/s19/lectures/Day16_2019.pdf


Distributed Systems Day 16: Distributed File Systems

Mar 19, 2019

Outline

• Recap of BGP and BFT
  • BGP: Byzantine Generals Problem
  • BFT: Byzantine Fault Tolerance
• Distributed File Systems
  • Opportunistic Locking
  • AFS v. NFS
  • GFS v. Azure File System

Fault Recovery: Byzantine v. Consensus

• Consensus
  • Survive f failures
  • N > 2f, so you need at least N = 2f + 1
  • How do you get 'f'? Analyze failure distributions: 'what's the max number of servers failed at any time?'
• Byzantine generals
  • Survive f failures
  • N > 3f, so you need at least N = 3f + 1
  • How do you get 'f'? Analyze system security: 'what's the max number of servers compromised at any time?'
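The two bounds above can be turned into a small calculator. This is a minimal sketch; the function names are illustrative and do not come from the slides.

```python
def min_nodes_crash(f: int) -> int:
    """Smallest N with N > 2f (consensus with f crash failures)."""
    return 2 * f + 1


def min_nodes_byzantine(f: int) -> int:
    """Smallest N with N > 3f (f Byzantine, i.e. lying, nodes)."""
    return 3 * f + 1


def max_faults_tolerated(n: int, byzantine: bool) -> int:
    """Largest f that a cluster of n nodes can survive."""
    divisor = 3 if byzantine else 2
    return (n - 1) // divisor


for f in (1, 2, 3):
    print(f"f={f}: crash N={min_nodes_crash(f)}, Byzantine N={min_nodes_byzantine(f)}")
```

For example, three nodes survive one crash failure but no Byzantine failure, which is the case the next slides walk through.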

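[Figure: two three-node scenarios (one general, two lieutenants).
Bad LT: the general sends "Attack!" to both lieutenants; one lieutenant relays "She said Attack!" and the other relays "She said Retreat!".
Bad General: the general sends "Attack!" to one lieutenant and "Retreat!" to the other; the lieutenants relay "She said Attack!" and "She said Retreat!".]

When f = 1 (liars) and N = 3 (total # of nodes), the correct processes can't differentiate between a good or bad general.

The good LT can't differentiate between the two scenarios. 3 options: always pick leader, always pick other LT, or pick randomly.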

Consensus Properties

• C1: All loyal lieutenant generals obey the same order
• C2: If the commanding general is loyal, then every loyal lieutenant general obeys the order she sends
• Note: if the commanding general is lying, then Byzantine Generals doesn't help reach consensus

Good LT:

• Always pick leader: violates C1
• Always pick other LT: violates C2

With 3 nodes and one liar: cannot satisfy both C1 and C2.
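The three-node argument can be checked mechanically. The sketch below is illustrative only: it encodes what each loyal lieutenant sees as a `(direct order, relayed order)` pair, and the names `pick_leader`, `pick_other_lt`, and `check` are assumptions made for this example.

```python
ATTACK, RETREAT = "Attack", "Retreat"


def pick_leader(view):
    direct, relayed = view
    return direct


def pick_other_lt(view):
    direct, relayed = view
    return relayed


def check(strategy):
    # Scenario "Bad LT": the loyal general orders Attack; LT2 lies and
    # relays Retreat. LT1 is the only loyal lieutenant.
    lt1_view_bad_lt = (ATTACK, RETREAT)
    c2_holds = strategy(lt1_view_bad_lt) == ATTACK

    # Scenario "Bad General": the general tells LT1 Attack and LT2 Retreat;
    # both lieutenants are loyal and relay honestly.
    lt1_view_bad_general = (ATTACK, RETREAT)
    lt2_view_bad_general = (RETREAT, ATTACK)
    # LT1 sees exactly the same messages in both scenarios.
    assert lt1_view_bad_lt == lt1_view_bad_general
    c1_holds = strategy(lt1_view_bad_general) == strategy(lt2_view_bad_general)

    return {"C1": c1_holds, "C2": c2_holds}


print("always pick leader:", check(pick_leader))
print("always pick other LT:", check(pick_other_lt))
```

Under this encoding, "always pick leader" keeps C2 but breaks C1, and "always pick other LT" breaks C2, so neither fixed rule satisfies both properties.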

[4,1] Byzantine Generals Problem

[Figure: two four-node scenarios (one general, three lieutenants).
Bad General: the general sends "Attack!" to two lieutenants and "Retreat!" to the third; each lieutenant relays what it was told ("She said Attack!" or "She said Retreat!").
Bad LT: the general sends "Attack!" to all three lieutenants; the bad lieutenant relays "She said Retreat!" while the good lieutenants relay "She said Attack!".]

Good LTs can't distinguish a bad general from a bad LT.

But good LTs will still do the same thing. [Satisfy C1]
• By looking at the majority of messages, good LTs will do the same thing.

When the general is truthful, "same thing" --> "leader's order". [Satisfy C2]
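The majority rule for the four-node case can be sketched as follows. This is a simplified model of the two scenarios in the figure, not a full Byzantine agreement implementation; `decide`, `run`, and the lieutenant names are hypothetical.

```python
from collections import Counter

ATTACK, RETREAT = "Attack", "Retreat"


def decide(direct_order, relayed_orders):
    """A loyal lieutenant follows the majority of the order it received
    directly and the orders relayed by the other two lieutenants."""
    votes = Counter([direct_order, *relayed_orders])
    return votes.most_common(1)[0][0]


def run(orders, liar=None, lie=RETREAT):
    """orders: lieutenant -> order received from the general.
    liar: optional lieutenant that relays `lie` instead of its real order.
    Returns the decision of every loyal lieutenant."""
    decisions = {}
    for lt in orders:
        if lt == liar:
            continue
        relayed = [lie if other == liar else orders[other]
                   for other in orders if other != lt]
        decisions[lt] = decide(orders[lt], relayed)
    return decisions


# Bad general: inconsistent orders, all lieutenants loyal.
bad_general = run({"LT1": ATTACK, "LT2": ATTACK, "LT3": RETREAT})
# Bad lieutenant: consistent orders, LT3 relays a lie.
bad_lt = run({"LT1": ATTACK, "LT2": ATTACK, "LT3": ATTACK}, liar="LT3")
print(bad_general)
print(bad_lt)
```

With three votes and two possible orders there is never a tie, so every loyal lieutenant reaches the same decision in both scenarios.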


Local File System: Background and Terminology

What is a File?

• A blob of binary?
• A set of blobs?
  • Think a book: Table of Contents + chapters
  • Index -> inode (maps ranges to data blocks)
  • Chapters -> Data blocks
• How about directories?
• How about file permissions?

[Figure: File 1 shown as a single blob of bits, and as File 1 -> Data 1, Data 2, where each data block is its own blob of bits.]
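The "index maps ranges to data blocks" idea can be illustrated with a toy inode. This is a sketch under stated assumptions: `BLOCK_SIZE`, `Disk`, `Inode`, `write_file`, and `read_range` are invented for the example, and real file systems use far larger blocks and indirect block pointers.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes per data block; tiny on purpose (assumption)


@dataclass
class Inode:
    size: int = 0
    block_ids: list = field(default_factory=list)  # i-th entry covers bytes [i*BLOCK_SIZE, (i+1)*BLOCK_SIZE)


class Disk:
    def __init__(self):
        self.blocks = {}
        self.next_id = 0

    def alloc(self, data: bytes) -> int:
        block_id = self.next_id
        self.blocks[block_id] = data
        self.next_id += 1
        return block_id


def write_file(disk: Disk, data: bytes) -> Inode:
    """Split data into blocks and record their IDs in a new inode."""
    inode = Inode(size=len(data))
    for start in range(0, len(data), BLOCK_SIZE):
        inode.block_ids.append(disk.alloc(data[start:start + BLOCK_SIZE]))
    return inode


def read_range(disk: Disk, inode: Inode, offset: int, length: int) -> bytes:
    """Use the inode to find which blocks hold bytes [offset, offset+length)."""
    end = min(offset + length, inode.size)
    out = bytearray()
    pos = offset
    while pos < end:
        index, within = divmod(pos, BLOCK_SIZE)
        block = disk.blocks[inode.block_ids[index]]
        take = min(BLOCK_SIZE - within, end - pos)
        out += block[within:within + take]
        pos += take
    return bytes(out)


disk = Disk()
inode = write_file(disk, b"hello world")
print(inode.block_ids, read_range(disk, inode, 3, 5))
```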

What is a Directory?

• Directory -> maps names to File IDs
• A directory can also contain directories
• In Linux, a directory ---> also a file

[Figure:
Root Directory
• File 1 -> ID X
• File 2 -> ID Y
• Dir 1 -> ID Z
Dir 1
• File 3 -> ID M
• File 4 -> ID C
File 1: Data 1, Data 2
File 5: Data 8, Data 9]
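Name resolution over such a directory tree can be sketched in a few lines. The IDs follow the figure; the `objects` table, the `ROOT` ID, and the `resolve` function are assumptions for illustration, and files whose contents the figure does not show are left empty.

```python
# FileID -> object. A directory is stored like a file whose content is a
# name -> FileID map.
objects = {
    "ROOT": {"type": "dir", "entries": {"File1": "X", "File2": "Y", "Dir1": "Z"}},
    "X": {"type": "file", "data": ["Data1", "Data2"]},
    "Y": {"type": "file", "data": []},  # contents not shown in the figure
    "Z": {"type": "dir", "entries": {"File3": "M", "File4": "C"}},
    "M": {"type": "file", "data": []},  # contents not shown in the figure
    "C": {"type": "file", "data": []},  # contents not shown in the figure
}


def resolve(path: str, root_id: str = "ROOT") -> str:
    """Walk the path one component at a time and return the final FileID."""
    current = root_id
    for name in path.strip("/").split("/"):
        if not name:
            continue  # handles "/" and repeated slashes
        node = objects[current]
        if node["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in node["entries"]:
            raise FileNotFoundError(path)
        current = node["entries"][name]
    return current


print(resolve("/File1"), resolve("/Dir1/File3"))
```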

What is a File System?

• File system -> system that manages files
• Provides
  • API for applications to interact w/ files
  • Algorithms for securing files (access control)
  • Maintains metadata about a file

[Figure: Applications -> File System API -> File 1 ... File N]

File Metadata
• File length (size)
• Timestamp
• Location
• Reference count
• Type
• Access control
• Owner

[Figure labels: some metadata fields are marked "Modifiable by APP", the others "Modifiable by File System".]

Traditional Distributed File Systems

Local FS -> Distributed FS

[Figure: Local FS: Applications -> File System API -> File 1 ... File N. Distributed FS: Applications -> Client API -> RPC calls -> File System (Directory Service, Flat file service) -> Data Blocks.]

What are RPC semantics?

Semantics
• At-least-once (1 or more calls)
• At-most-once (0 or 1 calls)

File System Design Challenges

• Maintain Transparency
  • Access -> same API for remote/local files
  • Location -> same "name" for remote/local file
  • Mobility -> client should be unaware of files moving
  • Performance -> as workload grows: performance is OK
  • Scalability -> # of files grows: performance is OK


General DFS Functionality

• Directory Service
  • Maps names -> File ID
  • Maps File ID to data blocks and block location
• Flat File Service
  • Unaware of directory structure
  • Unaware of names
  • Maintains data blocks
• Client module: provides transparency
  • Provides file system API to applications
  • Hides complexity: includes glue logic around RPC calls
  • Knows location of directory service
  • Deals with mobility of data
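The division of labor between the three components can be sketched as below. This is a simplified, single-process illustration: the class and method names are assumptions, the RPC calls are replaced by direct method calls, and the directory service here only maps names to File IDs (block location is left out).

```python
import uuid


class FlatFileService:
    """Knows only File IDs and data; unaware of names or directories."""

    def __init__(self):
        self._files = {}

    def create(self) -> str:
        file_id = uuid.uuid4().hex
        self._files[file_id] = bytearray()
        return file_id

    def read(self, file_id: str, offset: int, length: int) -> bytes:
        return bytes(self._files[file_id][offset:offset + length])

    def write(self, file_id: str, offset: int, data: bytes) -> None:
        buf = self._files[file_id]
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))  # zero-fill any gap
        buf[offset:offset + len(data)] = data


class DirectoryService:
    """Maps names to File IDs."""

    def __init__(self):
        self._names = {}

    def add_name(self, name: str, file_id: str) -> None:
        self._names[name] = file_id

    def lookup(self, name: str) -> str:
        return self._names[name]


class ClientModule:
    """Gives applications a name-based API and hides the two services."""

    def __init__(self, directory: DirectoryService, files: FlatFileService):
        self._directory = directory
        self._files = files

    def create(self, name: str) -> None:
        self._directory.add_name(name, self._files.create())

    def write(self, name: str, offset: int, data: bytes) -> None:
        self._files.write(self._directory.lookup(name), offset, data)

    def read(self, name: str, offset: int, length: int) -> bytes:
        return self._files.read(self._directory.lookup(name), offset, length)


client = ClientModule(DirectoryService(), FlatFileService())
client.create("/notes.txt")
client.write("/notes.txt", 0, b"hello")
client.write("/notes.txt", 5, b" world")
print(client.read("/notes.txt", 0, 11))
```

The application only ever sees names; swapping the direct calls for remote ones would not change its code, which is the access transparency the slide asks for.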


Flat File System Operations

• Provides at-least-once semantics
• Calls designed to be idempotent
• Client repeats calls that it receives no response to
• Server is stateless: state is embedded in each call
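Why idempotent calls make at-least-once delivery safe can be shown with a small retry loop. This is a sketch under assumptions: `LossyChannel` stands in for a network that drops replies, and all class and function names are hypothetical.

```python
class StatelessFlatFileServer:
    """Keeps no per-client state such as a file cursor: every write names
    the file and the absolute offset it applies to."""

    def __init__(self):
        self.files = {"f1": bytearray(b"..........")}
        self.calls_executed = 0

    def write(self, file_id: str, offset: int, data: bytes) -> str:
        self.calls_executed += 1
        self.files[file_id][offset:offset + len(data)] = data
        return "OK"


class LossyChannel:
    """Delivers every request but loses the first few replies."""

    def __init__(self, server, replies_to_drop: int):
        self.server = server
        self.replies_to_drop = replies_to_drop

    def call(self, method: str, *args):
        reply = getattr(self.server, method)(*args)
        if self.replies_to_drop > 0:
            self.replies_to_drop -= 1
            return None  # the reply was lost; the client sees a timeout
        return reply


def call_at_least_once(channel, method: str, *args, max_attempts: int = 5):
    """Repeat the call until some reply arrives."""
    for _ in range(max_attempts):
        reply = channel.call(method, *args)
        if reply is not None:
            return reply
    raise TimeoutError("no response from server")


server = StatelessFlatFileServer()
channel = LossyChannel(server, replies_to_drop=2)
result = call_at_least_once(channel, "write", "f1", 2, b"abc")
print(result, server.calls_executed, bytes(server.files["f1"]))
```

The write ran three times on the server, yet the file ends up the same as if it had run once. An operation like "append" would not have this property, since each repeat would add the data again.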


Reliability Provided by Redundancy

• Server failure -> loss of data
• Overcome data loss through redundancy

[Figure: the client (Application, Client API, Cache) issues RPC calls to a File System (Directory Service, Flat file service, Data Blocks) that is duplicated on a second server.]

Performance Provided by Caching

• Client: caches files locally
• Eliminates need to use network to transfer data!

[Figure: NO CACHING v. CACHING. With caching, a Cache sits in the client (Application, Client API) in front of the RPC calls to the File System.]


Performance is good ... if all operations take place on the client ... everything is cached on the client

Strict consistency is easy ... if all operations take place on the server ... no client caching

Design change created by caching:
• Directory service must maintain state
• Directory service needs complex logic to deal with state
  • Especially during failures

Traditional design:
• Directory service is stateless
• Only keeps track of Name -> ID mapping

Opportunistic Locks: strict consistency

• Common Internet File System
  • Microsoft's distributed file system
  • Alternate locking approach
• Features
  • Strictly consistent

https://learnteachevangelizeit.wordpress.com/2014/07/31/windows-server-2012-r2-file-and-storage-services/

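Overview of Opportunistic Locks

[Figure: message sequence between Client 1, Server, and Client 2. Client 1: "Open A"; Server: "OK, OpLock". Client 2: "Open A"; Server to Client 1: "Revoke OpLock"; Client 1: "OK, changes"; Server to Client 2: "OK".]

Step 0: find location of data blocks
Step 1: request locks
Step 2: server grants locks
Step 3: download file to local cache
Step 4: modify local cache
…..
Step N: server revokes lock
Step N+1: client flushes cache to disk
Step N+2: release lock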
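The grant / revoke / flush cycle above can be sketched as a toy in-process model. It is not the CIFS/SMB protocol: the `Server` and `Client` classes are hypothetical, messages are direct method calls, and handing the lock to the second opener after the revoke is a simplifying assumption.

```python
class Server:
    def __init__(self):
        self.disk = {"A": "v0"}
        self.oplock_holder = {}  # file name -> client currently holding the oplock

    def open(self, client, name):
        holder = self.oplock_holder.get(name)
        if holder is not None and holder is not client:
            # Steps N..N+2: revoke the lock; the holder flushes its changes
            # and releases.
            changes = holder.revoke(name)
            if changes is not None:
                self.disk[name] = changes
        # Step 2: grant the lock (assumption: the new opener receives it).
        self.oplock_holder[name] = client
        return self.disk[name]


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.dirty = set()

    def open(self, name):
        if name not in self.cache:
            # Steps 1-3: request the lock and download into the local cache.
            self.cache[name] = self.server.open(self, name)

    def write(self, name, data):
        # Step 4: modify the local cache only; no network traffic.
        self.cache[name] = data
        self.dirty.add(name)

    def read(self, name):
        return self.cache[name]

    def revoke(self, name):
        """Called by the server: drop the cached copy and hand back any
        unflushed changes."""
        data = self.cache.pop(name)
        if name in self.dirty:
            self.dirty.discard(name)
            return data
        return None


server = Server()
client1, client2 = Client(server), Client(server)
client1.open("A")
client1.write("A", "v1")
before_revoke = server.disk["A"]   # server copy is still the old version
client2.open("A")                  # triggers the revoke and flush
print(before_revoke, server.disk["A"], client2.read("A"))
```

Client 2 only ever sees data that Client 1 has flushed, which is how the scheme keeps strict consistency while still letting the lock holder work from its cache.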

Managing Caching and Consistency: Opportunistic Locking

• To modify blocks, client requests locks
• Once a client has locks
  • Client can read/write blocks using local cache without using the network
  • Leads to significant performance improvements
• Other clients must:
  • Wait for server to revoke locks
  • Potential wait times

https://docs.microsoft.com/en-us/windows/desktop/fileio/opportunistic-locks

Opportunistic because the server only grants the locks if/when convenient


Step 1: find location of data blocks
Step 2: request locks
Step 3: download file to local cache
Step 4: modify local cache
…..
Step N: server revokes lock
Step N+1: client flushes cache to disk
Step N+2: release lock

Locks: Opportunistic, Optimistic, Pessimistic

Types of Locks Discussed

Distributed Transactions:
• Pessimistic locks:
  • Get all locks before a transaction starts and release after
  • Low performance but guarantees strict serializability
• Optimistic locks:
  • Provides high performance but many transactions abort
  • Get snapshots of data before the transaction starts, and on commit compare the snapshot with the data and only commit if the data is unchanged

Distributed File Systems:
• Opportunistic locks:
  • Clients get locks before data manipulation
  • Enables performance improvements
  • Clients with a lock can manipulate data locally without using the network
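The optimistic scheme described above (snapshot at start, compare on commit) can be sketched as follows. This is an illustrative, single-threaded model: `Store` and `OptimisticTxn` are hypothetical names, the comparison is by value rather than by version number, and a real system would need the validate-and-write step to be atomic.

```python
class Store:
    def __init__(self, data=None):
        self.data = dict(data or {})


class OptimisticTxn:
    def __init__(self, store: Store, keys):
        self.store = store
        # Snapshot the data before the transaction starts.
        self.snapshot = {k: store.data.get(k) for k in keys}
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value  # buffered locally until commit

    def commit(self) -> bool:
        # Compare the snapshot with the current data; abort on any change.
        for key, seen in self.snapshot.items():
            if self.store.data.get(key) != seen:
                return False
        self.store.data.update(self.writes)
        return True


store = Store({"x": 1})
t1 = OptimisticTxn(store, ["x"])
t2 = OptimisticTxn(store, ["x"])
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 10)
first = t1.commit()
second = t2.commit()
print(first, second, store.data)
```

Neither transaction waits for the other, but the second one aborts because the data changed under it, which is the performance-versus-aborts trade-off the slide lists.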

GFS: Google File System

Data Center FS @ 2003

• Google needed a good distributed file system
  • Redundant storage of massive amounts of data on commodity computers (cheap and unreliable)
• Why not use an existing file system?
  • Google's problems are different from others in terms of workload and design priorities
  • Google file system is designed for Google applications
  • Google applications are designed for GFS

Assumptions -> Design Decisions

Assumptions:
• High component failure rates
  • Inexpensive commodity components often fail
• Modest number of huge files
  • A few million 100MB or larger files
• Files are write-once, mostly appended to
• Large streaming reads
• High sustained bandwidth is favored over low latency

Design decisions:
• Files stored as chunks
  • Fixed size (64MB)
• Reliability through replication
  • Each chunk is replicated across 3+ chunkservers
• Single master to coordinate access and keep metadata
  • Simple centralized management
• No data caching
  • Little benefit due to large data sets, streaming reads
• Familiar interface but customize the API
  • Snapshot and record append
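With fixed-size chunks, translating a byte offset into a chunk is plain arithmetic. The sketch below assumes the 64MB chunk size from the slide; the function names are illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, as on the slide


def chunk_index(offset: int) -> int:
    """Which chunk holds the byte at this file offset."""
    return offset // CHUNK_SIZE


def chunks_needed(file_size: int) -> int:
    """Number of chunks required to store a file (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def chunks_for_range(offset: int, length: int):
    """Yield (chunk index, offset within chunk, bytes to read) for a read
    of `length` bytes starting at `offset`."""
    end = offset + length
    while offset < end:
        index, within = divmod(offset, CHUNK_SIZE)
        take = min(CHUNK_SIZE - within, end - offset)
        yield index, within, take
        offset += take


print(chunks_needed(100 * 1024 * 1024))
print(list(chunks_for_range(CHUNK_SIZE - 10, 30)))
```

A 100MB file needs only two chunks, which is one reason large chunks keep the amount of metadata small.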

GFS Architecture

One Master!!!!
• Stores metadata
• Stores map of name -> File ID

Many Chunk Servers
• Store data blocks
• Data blocks are called chunks

Single Master: Challenges

• Single point of failure
• Scalability bottleneck

GFS solutions
• Shadow master (think Leader, followers)
• Reduce master involvement
  • Master is only used for metadata (small data, few API calls)
  • Large chunk size -> reduce # of API calls
  • Clients never use the master for data
  • On write: master gives a chunk a lease (delegates authority)
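The "master for metadata only" read path can be sketched as below. This is a simplified model, not the GFS API: the class names are hypothetical, chunks are keyed by (file name, chunk index) rather than by chunk handle, and the client-side cache of chunk locations goes beyond what the slide states (the slide only rules out caching file data).

```python
class Master:
    """Holds metadata only: where each chunk lives."""

    def __init__(self):
        self.locations = {}  # (file name, chunk index) -> list of chunkserver names
        self.metadata_requests = 0

    def locate(self, name, index):
        self.metadata_requests += 1
        return self.locations[(name, index)]


class ChunkServer:
    """Holds the actual chunk data."""

    def __init__(self):
        self.chunks = {}  # (file name, chunk index) -> bytes

    def read(self, name, index, offset, length):
        return self.chunks[(name, index)][offset:offset + length]


class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # chunkserver name -> ChunkServer
        self.location_cache = {}          # caches metadata, never file data

    def read(self, name, index, offset, length):
        key = (name, index)
        if key not in self.location_cache:
            self.location_cache[key] = self.master.locate(name, index)
        replica = self.location_cache[key][0]
        # Data flows directly between client and chunkserver.
        return self.chunkservers[replica].read(name, index, offset, length)


master = Master()
servers = {"cs1": ChunkServer(), "cs2": ChunkServer(), "cs3": ChunkServer()}
for cs in servers.values():
    cs.chunks[("/logs/a", 0)] = b"first chunk of the file"
master.locations[("/logs/a", 0)] = ["cs1", "cs2", "cs3"]

client = Client(master, servers)
part1 = client.read("/logs/a", 0, 0, 5)
part2 = client.read("/logs/a", 0, 6, 5)
print(part1, part2, master.metadata_requests)
```

Two reads of the same chunk cost a single metadata request, and no file data passes through the master.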
