Transcript

1

Append/Hflush/ReadDesignHairongKuang,KonstantinShvachko,NicholasSze,SanjayRadia,RobertChansler

Yahoo!HDFSteam08/06/2009

1. DesignchallengesWithhflush,HDFSneedstomakethelastblockofanunclosedfilevisibletoreaders.Thispresentstwochallenges:

1. Readconsistency.Atagiventimedifferentreplicasofthelastblockmayhavedifferentnumberofbytes.WhatreadconsistencyshouldHDFSprovideandhowtoguaranteetheconsistencyevenincaseoffailures.

2. Datadurability.Whenanyerroroccurs,therecoverycannotsimplythrowthelastblockaway.Insteadtherecoveryneedstopreserveatleastthehflushedbyteswhilemaintainingthereadconsistency.

2. Replica/BlockStatesThisdocumentwillcallablockataDataNodeareplicatodifferentiateitfromablockattheNameNode.

2.1. NeedfornewstatesPre‐append/hflushareplicaataDataNodeiseitherfinalizedortemporary.Whenareplicaisfirstcreated,itisinthetemporarystateonDataNode.Atemporaryreplicabecomesfinalizeduponthecloserequestofaclientwhennomorebytewillbewrittentothisreplica.OnaDataNoderestart,temporaryreplicasareremoved.Thisisacceptablepre‐append/hflushbecauseHDFSprovidesbesteffortdurabilityforunder‐constructionblocks.Thisisnotacceptableafterappend/hflusharesupported.HDFSneedstosupportstrongdurabilityforunder‐constructionblocksthatcontainpre‐appenddataandbesteffortdurabilityforhflusheddata.SosometemporaryreplicasneedtobepreservedacrossDataNoderestarts.

2.2. Replicastates(DataNode)AtaDataNode,thisdesignintroducesareplicabeingwritten(rbw)stateandotherstatesforhandlingerrors.InaDataNode’smemory,areplicacouldbeinanyofthefollowingstate:Finalized:Afinalizedreplicahasfinalizeditsbytes.Nonewbytewillbewrittentothisreplicaunlessitisreopenedforappend.Itsdataandmetadatamatch.Theotherreplicasofthesameblockidhavethesamebytesasthisreplica.Butthegenerationstamp(GS)ofafinalizedreplicadoesnotremainconstant.Itmaybebumpedupasaresultoferrorrecovery.

2

Rbw(ReplicaBeingWrittento):Onceareplicaiscreatedorappended,itisintherbwstate.Bytesarebeingwrittentothisreplica.Itisalwaysareplicaofthelastblockofanunclosedfile.Itslengthisnotfinalizedyet.Itson‐diskdataandmetadatamaynotmatch.Otherreplicasofthesameblockidmayhavemoreorlessbytesthanthisone.Bytes(maynotall)inanrbwreplicaarevisibletoreaders.Incaseofanyfailure,bytesinanrbwreplicawilltrytobepreserved.Rwr(ReplicaWaitingtobeRecovered):IfaDataNodediesandrestarts,allitsrbwreplicaschangetobeintherwrstate.Rwrreplicaswillnotbeinanypipelineandthereforewillnotreceiveanynewbytes.Theywilleitherbecomeoutdatedorwillparticipateinaleaserecoveryiftheclientalsodies.rur(ReplicaUnderRecovery):Areplicachangestobeintherurstatewhenareplicarecoverystartsasaresultofleaseexpiration.Moredetailswillbediscussedintheleaserecoverysection.Temporary:atemporaryreplicaisalsoareplicaunderconstructionbutiscreatedforthepurposeofblockreplicationorclusterbalancing.Itsharesmanyofthepropertiesasanrbwreplica,butitsdataareinvisibletoanyreader.IfthereplicaconstructionfailsoritsDataNoderestarts,atemporaryreplicawillbedeleted.OnaDataNode’sdisk,eachdatadirectorywillhavethreesubdirectories:currentcontainsfinalizedreplicas,tmpcontainstemporaryreplicas,rbwcontainsrbw,rwr,andrurreplicas.WhenareplicaisfirstcreatedbyarequestfromaDFSclient,itisputintherbwdirectory.Whenareplicaisfirstcreatedforthepurposeofreplicationorclusterbalancing,itisputinthetmpdirectory.Onceareplicaisfinalized,itismovedtothecurrentdirectory.WhenaDataNoderestarts,thereplicasinthetmpdirectoryareremoved,thereplicasintherbwdirectoryareloadedasrwrreplicas,andthereplicasinthecurrentdirectoryareloadedasfinalizedreplicas.DuringDataNodeupgrade,allreplicasindirectoriescurrentandrbwneedtobekeptinasnapshot.

2.3. Blockstates(NameNode)NameNodealsointroducesmanynewstatesforablock.Ablockcouldbeinanyofthefollowingstate:UnderConstruction:Onceablockiscreatedorappended,itisintheUnderConstructionstate.Bytesarebeingwrittentothisblock.Itisthelastblockofanunclosedfile.ItslengthandGShasnotfinalizedyet.Data(maynotall)inablockunderconstructionarevisibletoreaders.Ablockunderconstructionkeepstrackofitswritepipeline(i.e.,locationsofvalidrbwreplicas)andthelocationsofitsrwrreplicasiftheclientdies.UnderRecovery:Whenafile’sleaseexpires,ifthelastblockisUnderConstruction,itischangedtobeUnderRecoverystateonceblockrecoverystarts.

3

Committed:Acommittedblockhasfinalizeditsbytesandgenerationstamp(GS),buthasnotseenatleastoneGS/lengthmatchedfinalizedreplicafromDataNodesyet.NonewbytewillbewrittentothisblockanditsGSwillnotbebumpedunlessitisreopenedforappend.Inordertoservereadrequests,acommittedblockstillneedstokeepthelocationsofrbwreplicas.ItalsoneedstotracktheGSandlengthofitsfinalizedreplicas.AnunderconstructionblockofanunclosedfileiscommittedwhenNNisaskedbytheclienttoaddanewblocktothefileorclosethefile.Ifthelastblockisinthecommittedstate,thefilecannotbeclosedandtheclienthastoretry.AddBlockandclosewillbeextendedtoincludethelastblock’sGSandlength.Complete:AcompleteblockisablockwhoselengthandGSarefinalizedandNameNodehasseenaGS/lenmatchedfinalizedreplicaoftheblock.Acompleteblockkeepsonlyfinalizedreplicas’locations.Onlywhenallblocksofafilebecomecomplete,afilecouldbeclosed.Differentfromreplica’sstates,ablock’sstatedoesnotpersistonanydisk.WhenNameNoderestarts,thelastblockofanunclosedfileisloadedasUnderConstruction.AlltherestoftheblocksareloadedasComplete.

Moredetailsaboutabovereplica/blockstateswillbediscussedintherestofthedocument.Areplicastatetransitiondiagramandablockstatetransitiondiagramwillbesummarizedinthelastsection.

3. Write/hflush

3.1. BlockConstructionPipelineAnHDFSfileconsistsofmultipleblocks.Eachblockisconstructedthroughawritepipeline.Bytesarepushedtothepipelinepacketbypacket.Ifnoerroroccurs,ablockconstructiongoesthroughthreestagesasshowninthefollowingpictureillustratedbyapipelineof3DataNodes(DN)andablockof5packets.Inthepicture,boldlinesrepresentdatapackets,dottedlinesrepresentackmessages,andregularlinesrepresentcontrolmessages(setup/close).Fromt0tot1isthepipelinesetupstage.T1tot2isthedatastreamingstage,wheret1isthetimewhenthefirstdatapacketgetssentandt2isthetimethattheacktothelastpacketgetsreceived.Notepacket2isanhflushedpacket.T2tot3isthepipelineclosestage.

client DN0 DN1 DN2

pipeline setup

Data streaming

close

packet 0

packet 1

packet 2

packet 3

packet 4

set up

close

t0

t1

t2

t3

4

Stage1. Setupapipeline

AWrite_Blockrequestissentbyaclientdownstreamalongthepipeline.AfterthelastDataNodereceivestherequest,anackissentbytheDataNodeupstreamalongthepipelinebacktotheclient.Asaresultofthis,networkconnectionsalongthepipelinearesetupandeachDataNodehascreatedoropenedareplicaforwriting.

Stage2. DatastreamingUserdatafirstbufferattheclientside.Afterapacketisfilledup,thedatathengetpushedtothepipeline.Nextpacketcanbepushedtothepipelinebeforereceivingtheackforthepreviouspacket.Thenumberofoutstandingpacketsislimitedbytheoutstandingpacketswindowsizeattheclientside.Iftheuserapplicationexplicitlycallshflush,apacketispushedtothepipelinebeforeitisfilledup.Hflushisasynchronousoperationandnodatacanbewrittenbeforeanacknowledgementfortheflushedpacketcomesback.

Stage3. Close(finalizeablockandshutdownpipeline)Theclientsendsacloserequestonlyafterallpacketshavebeenacknowledgedattheclientside.Thisensuresthatifdatastreamingfails,therecoverydoesnotneedtohandlethecasethatsomereplicashavebeenfinalizedandsomedonothaveallthedata.

3.2. PackethandlingataDataNode

Foreachpacket,aDataNodeinthepipelinehastodo3things.

1. Streamdataa. ReceivedatafromtheupstreamDataNodeortheclientb. PushthedatatothedownstreamDataNodeifthereisany

2. Writethedata/crctoitsblockfile/metafile.3. Streamack

a. ReceiveanackfromthedownstreamDataNodeifthereisanyb. SendanacktotheupstreamDataNodeortheclient

Notethatthenumbersabovedonotindicatetheorderthatthethreethingsmustbeexecutedin.Streamingack(3)isdoneafterstreamingdata(1)bythedefinitionofapipeline.Butintheorywritingdatatodisk(2)couldbedoneanytimeafter1.a.Thisalgorithmchoosestodoitrightafter1.bandbeforereceivingthenextpacket.EachDataNodehastwothreadsperpipeline.Thedatathreadisresponsiblefordatastreaminganddiskwriting.Foreachpacket,itdoes1.a,1.b,and2insequence.Onceapacketisflushedtothedisk,itcanberemovedfromthein‐memorybuffer.Theackthreadisresponsibleforackstreaming.Foreachpacket,itdoes3.aand3.binsequence.Sincethedatathreadandtheackthreadrunconcurrently,thereisno

5

guaranteeontheorderof(2)and(3).Theackofapacketmightbesentbeforethepacketisflushedtothedisk.Thisalgorithmprovidesatradeoffonthewriteperformance,datadurability,andalgorithmsimplicity.It

1. Improvesdatadurabilityagainstfailuresbywritingdatatodisksoonerthanlaterwhentheackisreceived;

2. Parallelizesdata/ackstreamingindownstreampipelineandon‐diskwriting;3. Simplifiesbuffermanagementsincethereisatmostonepacketin‐memory

perpipeline.

3.3. Consistencysupport• Whenaclientreadsbytesfromanrbwreplica,theDataNodethatitreads

frommaynotmakeallthebytesthatitreceivedvisibletotheclient.• Eachrbwreplicamaintainstwocounters:

1. BA:numberofbytesthathavebeenacknowledgedbythedownstreamDataNodes.ThosearethebytesthattheDataNodemakesvisibletoanyreader.Intherestofthedocument,wemayinterchangeablycallitthereplica’svisiblelength.

2. BR:numberofbytesthathavebeenreceivedforthisblock,includingthebyteswrittentoitsblockfileandin‐DataNode‐bufferbytes.

• AssumeinitiallyallDataNodesinthepipelinehave(BA,BR)=(a,a).Thenaclientpushesapacketofbbytestothepipelineandnootherpacketsarepushedtothepipelinebeforetheclientreceivesanackforthepacket.

1. ADataNodechangesits(BA,BR)tobe(a,a+b)rightafterstep1.a.2. ADataNodechangesits(BA,BR)tobe(a+b,a+b)rightafterstep

2.a.3. Whenasuccessackissentbacktotheclient,allDataNodesinthe

pipelinehave(BA,BR)=(a+b,a+b).• ApipelineofNDataNodesDN0,…,DNi,…,DNN‐1,whereDN0isthefirstin

thepipeline,i.e.,theclosesttothewriter,hasthefollowingproperty:atanygiventimet,

where isBAoftheblockatDNiattimetand isBRoftheblockatDNiattimet.

NotethatthispropertyguaranteesthatonceabytebecomesvisibleallDataNodesinthepipelinehasthebyte.• Assume isthenumberofbytesthataclienthassenttothepipeline

attimetand isthenumberofbytesthattheclienthasreceivedackfor.Wehave

6

4. ReadWhenanunclosedfileisopenedforread,thechallengeishowtoprovidetheconsistencyguaranteeifthelastblockisunderconstruction.ThealgorithmneedstomakesurethatabytereadatDataNodeDNicanalsobereadatanotherDataNodeDNj,evenifBAi>BAj.Algorithm1:• Whenareaderreadsanunderconstructionblock,itfirstasksoneofthe

replicasforitsBAbysendingarequesttotheDataNode.• IfanapplicationtriestoreadabytebeyondBAoftheblock,thedfsclient

throwsanEOFException.• Onlyareadrequestthatreadfromapositionlessthanthevisiblelength

ofthelastblockwillbeforwardedtoaDataNode.WhenaDataNodegetsareadrequestfromarangeofbytesthatarelessthanitsBR,returnthebytes.

• Assumethatareadrequestisatriple(blck,off,len),whereblckcontainsablockidanditsgenerationstamp,offisthestartingoffsetintheblockfromwhichtoreadtheblock,andlenisthenumberofbytestoread.

• ADataNodecanservetherequestiftheDataNodehasareplicawithanequalornewerGS.

• ThesummationofoffandlenmustbeequalorlessthanBAj,assumingDNjistheDataNodewherethedfsclientfetchedtheblocklength.

• AssumethatthereadrequestissenttoDataNodeDNiandthereplica’sstateis(BAi,BRi).

1. Ifoff+len<=BAi,DNicansafelysendlenbytesbacktothedfsclientstartingatoff.

2. Ifoff+len>BAi,becauseoff+len<=BAj,BAj>=BAi.DNimustbeintheupstreaminthepipelinetoDNj,i.e.,isclosertothewriterthanDNjis.SoBRi>=BRj>=BAj.ThusBRi>BAj,andthereforeBRi>off+len.ThismeansDNimusthavethebytesthatthedfsclientwantstoread.DNideliversthebytestotheclient.

3. Off+lenshouldneverbegreaterthanBRi.Ifthiseverhappens,theDataNodelogtheerrorandrejectstherequest.

• IfDNigoesdownwhileservingtherequest,thedfsclientcanswitchtoreadfromanyotherDataNodecontainingareplicaoftheblock.

• Thisalgorithmissimplebutitrequiresareopenofafiletogetnewdatabecausethelengthofthelastblockisfetchedbeforereadandadfsclientcannotreadbeyondthelengthofthelastblock.

Algorithm2:• Thisalgorithmletsadfsclient,i.e.,areader,performtheconsistency

control,andDataNodesdeliverbytes.• Areadrequestisatriple(blck,off,len),whereblckcontainsablockid

anditsgenerationstamp,offisthestartingoffsetinablockfromwhichtoreadtheblock,andlenisthenumberofbytestoread.

7

• ADataNodecanservetherequestiftheDataNodehasareplicawithanequalornewerGS.

• Assumethattheblockhasastate(BAi,BRi),DNisendsbytes[off,MIN(off+len,BRi))totheclientalongwithitsBAi.

• Theclientreceivesandbuffersthedata.ItalsokeepstrackofthemaximumBAthatithasseenandonlydeliversbytestotheapplicationuptothemaximumBA.

• IfthereadfromDNifails,thedfsclientcanswitchtoreadfromanyotherDataNodecontainingareplicaoftheblock.

• Howreadconsistencyisguaranteed?AssumewehaveapipelineofNDataNodesDN0,…,DNi,…,DNN‐1,whereDN0isthefirstinthepipeline.Assumethatthenumberofbytesthataclientdeliverstoanapplicationattimetis .Wehave

SonomatterwhichDataNodeitreadsfrom,theDataNodeshouldhavethebyteitreadbefore.

• Thisalgorithmrequiresachangeofthereadprotocolandadfsclientismorecomplicatedsinceitneedstocontrolreadconsistency.Butthealgorithmdoesnotrequireareopeninordertoreadthenewdata.

ForHADOOP0.21,wearestilldiscussingwhethertoimplementalgorithm1oralgorithm2.

5. Append5.1. AppendAPIsupport

1. ClientsendsanappendrequesttoNN.2. NNchecksthefileandmakessurethatitisclosed.ThenNNchecksthe

file’slastblock.Ifitisnotfullandhasnoreplica,failappend.Otherwise,changethefiletobeunderconstruction.Ifthelastblockisfull,NNallocatesanewlastblock.Ifthelastblockisnotfull,NNchangesthisblocktobeanunderconstructionblock,withitsfinalizedreplicasasitsinitialpipeline.Itreturnstheblockid,generationstamp,length,anditslocations.Ifthelastblockisnotfull,italsoneedstoreturnanewgenerationstamp.

3. Setupapipelineforappendiflastblockisnotfull.Otherwisesetupapipelineforcreate.Checkpipelinesetupsectionformoredetails.

4. Ifthelastblockdoesnotendatachecksumchunkboundary,readthelastpartialcrcchunk.Thisisforthepurposeofcalculatingchecksums.

5. Therestisthesameasaregularwrite.

5.2. Durabilitysupport• NNmakessurethatthenumberofreplicasfortheCompleteblocksthat

containpre‐appenddatameetsthefile’sreplicationfactor.

8

• ThedurabilityofanUnderConstructionblockthatcontainspre‐appenddataisomittedinthisdesignfornow.

6. ErrorHandling6.1. PipelineRecovery

Whenablockisunderconstruction,errormayoccuratStage1whenapipelineisbeingsetup,atStage2whendataarestreamingtothepipeline,orStage3whenthepipelineisbeingclosed.ThepipelinerecoveryhandlesthecasewhenoneofDataNodesinthepipelinehasanerror.

6.1.1. RecoverfrompipelinesetupfailureIfaDataNodedetectsafailurewhenapipelineisbeingsetup,aDataNodeclosestheblockfileandclosesallitstcp/ipconnectionsafterafailureacknowledgementissenttotheupstreamDataNode.Oncetheclientdetectsthefailure,ithandlesthefailuredifferentlydependingonthepurposeofsettingupthepipeline.

• Ifthepipelinewasbuiltforcreatinganewblock,theclientsimplyabandonstheblockandasksNameNodeforanewblock.Itthenstartstobuildapipelineforthenewblock.

• Ifthepipelinewasbuiltforappendingtoablock,itrebuildsapipelinewiththeremainingDataNodesandbumpstheblock’sgenerationstamp.Seesection7(pipelinesetup)formoredetails.

Onespecialcaseofpipelinesetupfailuresisaccesstokenerror:oneoftheDataNodecomplainsthattheaccesstokenisinvalidwhenusinganaccesstokentosetupapipeline.Ifapipelinesetupfailureiscausedbyanexpiredaccesstoken,thedfsclientshouldrebuildthepipelinewithalltheDataNodesinthepreviouspipeline.Currenttrunk(0.21)avoidsthisspecialhandlingbyalwaysfetchinganewaccesstokenfromNameNoderightbeforesettingupapipeline.Thisdocumentwillkeepthesamedesign.

6.1.2. Recoverfromdatastreamingfailure• AtaDataNodeanerrormayoccurateither1.a,1.b,2,3.a,or3.basexplained

inSection3.2.Wheneveranerroroccurs,aDataNodetakesitselfoutofthewritepipeline:itclosesallthetcp/ipconnections,writesallbufferedbytesontodiskiftheerrordoesnotoccurat3,andclosestheon‐diskfiles.

• Whenthedfsclientdetectsafailure,stopssendingdatatothepipeline.• ThedfsclientreconstructsawritepipelineusingtheremainingDataNodes.

Seesection7(pipelinesetup)formoredetails.Asaresultofthis,allreplicasoftheblockarebumpedtoanewgenerationstamp.

• ThedfsclientresumessendingdatawiththenewgenerationstampstartingfromBAc.NoteanoptimizationcouldbethattheclientresumessendingbytesstartingatMIN(BRi,forallDataNodesDNiinthenewpipeline).

• WhenaDataNodereceivesapacket,ifitalreadyhasthepacket,thedatastreamsimplypushesthedatadownstreamwithoutwritingittothedisk.

9

Thisrecoveryalgorithmhasaniceproperty:anybytesthatwerevisibletoanyclient,evenfromadownDataNodewiththelargestBAoftheoldpipeline,continuetobevisibletoanyreaderduringandafterapipelinerecovery.ThisisbecausethepipelinerecoverydoesnotdecreaseanyDataNode’sBAandBR.Furthermoreanytimeduringthepipelinerecoverythenewpiplelinemaintainsthepropertydescribedinsection3.3(consistencysupport).

6.1.3. RecoverfromaclosefailureOncetheclientdetectsthefailure,itrebuildsapipelinewiththeremainingDataNodes.EachDataNodebumpstheblock’sgenerationstampandfinalizesthereplicaifitisnotfinalizedyet.Thenetworkconnectionistorndownafteranackissent.Seesection7(pipelinesetup)formoredetails.

6.2. DataNodeRestart• WhenaDataNoderestarts,itreadseachreplicaunderdirectoryrbwand

loadsthereplicainmemoryasWaitingToBeRecovered.Itslengthissettobethemaximumnumberofbytesthatmatchitscrc.

• AnyWaitingToBeRecoveredreplicadoesnotserveanyreadanddoesnotparticipateinapipelinerecovery.

• AWaitingToBeRecoveredreplicawilleitherbecomeoutdatedandbedeletedbyNNiftheclientisstillaliveorbechangedtobefinalizedasaresultofleaserecoveryiftheclientdies.

6.3. NameNodeRestart• Noneofblockstatesarepersistedondisk.SowhenNameNoderestarts,it

needstorestoreeachblock’sstate.ThelastblockofanunclosedfilebecomesUnderConstructionnomatterwhatitspre‐lifestatewas.OtherblocksbecomeComplete.

• AskeachDataNodetoregisterandsenditsblockreportincludingfinalized,rbw,rwr,andrurreplicas.

• NameNodedoesexitsafemodeunlessthenumberofcompleteandunderconstructionblocksthathavereceivedatleastonereplicareachesthepre‐definedthreshold.

6.4. LeaseRecoveryWhenafile’sleaseisexpired,NNneedstoclosethefileforthesakeoftheclient.Therearetwoissues:(1)Concurrencycontrol:whatifaleaserecoveryisperformedwhiletheclientisstillaliveeitherintheprocessofsettinguppipeline,writing,close,orrecovery.Whatiftherearemultipleconcurrentleaserecoveries?(2)Consistencyguarantee:Ifthelastblockisunderconstruction,allitsreplicasneedtorollbacktoaconsistentstate:allreplicashavethesameon‐disklengthandthesamenewgenerationstamp.

1. NNrenewslease,changesthefile’sleaseholdertobedfsandpersiststhechangetoitseditlog.Soiftheclientisstillalive,anyofthewrite‐relatedrequestslikeaskingforanewgenerationstamp,gettinganewblock,orclosingthefile,willberejectedbecausetheclientisnottheleaseholderany

10

more.ThispreventstheclientfromconcurrentlychanginganunclosedfileifitevercontactstheNameNode.

2. NNchecksthestateofthelasttwoblocksofitsfile.Otherblocksshouldbeinthecompletestate.Thefollowingtableshowsallthepossiblecombinationsandtheactiontotakeforeachcombination.

Penultimateblock

Lastblock Actions

Complete Complete ClosethefileComplete CommittedCommitted CompleteCommitted Committed

Retryclosingthefilewhenleaseexpiresnexttime;Forcetoclosethefileafteracertainnumberofretries

Complete UnderConstructionCommitted UnderConstruction

Startsblockrecoveryforthelastblock

Complete UnderRecoveryCommitted UnderRecovery

Startsanewblockrecoveryforthelastblock;stoprecoveryafteracertainnumberofretries

6.5. BlockRecovery1. NNchoosesaprimaryDataNode(PD)toworkastheproxyofNameNodeto

performblockrecovery.PDcouldbeaDataNodewhereoneofitsreplicasresides.Ifnoneofitsreplicasareknown,blockrecoveryaborts.

2. NNgetsanewGS,whichmarksthegenerationthattheblockisgoingtobebumpedtowhentherecoverysuccessfullyfinishes.Itthenchangesthelastblock,ifitisUnderConstruction,tobeUnderRecovery.TheUnderRecoveryblockisstampedwithauniquerecoveryid,whichisnewGSthattheblockisgoingtobebumpedto.AnycommunicationsfromaPDtoNNneedstomatchthisrecoveryid.Thisishowconcurrentblockrecoveriesarehandled.Thebasicruleisthatthelatestrecoveryalwayspreemptspreviousrecoveries.

3. NNthenasksPDtorecovertheblock.NNsendsPDthenewGS,blockidanditsgenerationstamp,andallitsreplicalocationsincludingfinalizedreplicas,replicasbeingwrittento,andreplicaswaitingtoberecovered.

4. PDperformsblockrecovery:a. PDaskseachDataNode,whereonereplicaislocated,toperform

replicarecovery.i. PDsendseachDataNodetherecoveryid,blockidandgenerationstamp;

ii. EachDataNodechecksitsreplicastate:1. Checkexistence:IftheDataNodedoesnothavethe

replicaorthereplicaisolderthantheblock’sGSintherequest,ornewerthantherecoveryid(thisisnotsupposedtohappen),throwsaReplicaNotExistsException.

2. Stopwriter:Ifitisareplicabeingwrittentoandthereisanongoingwriterthread,interruptsthewriterand

11

waitsforthewritertoexit.Whenawriterthreadisinterrupted,ifthethreadisinthemiddleofreceivingapacket,stopsandthrowsawaythepartialpacket.Beforethethreadexits,itmakessurethaton‐diskbytesarethesameasBRandthenclosestheblockandcrcfiles.ThishandlesconcurrentclientwritesandblockrecoveryatDataNodes.Blockrecoverypreemptsclientwrites,resultinginpipelinefailure.SubsequentpipelinerecoverywillfailbecausethedfsclientcannotgetanewgenerationstampfromNNforablockunderrecovery.

3. Stoppreviousblockrecovery:Ifthereplicaisalreadyintherurstate,throwsaRecoveryInProgressExceptionifitsrecoveryidisgreaterthanorequaltothenewrecoveryid.IfthenewGSisgreater,stamptherurreplica’srecoveryidtobethenewone.

4. Statechange:Otherwise,changethereplicatoberur.Setitsrecoveryidtobethenewrecoveryidandareferencetoitsoldstate.AnycommunicationsfromaPDtoitselfneedstomatchthisrecoveryid.Note3and4handleconcurrentblockrecoveriesatDataNodes.Thelatestrecoveryalwayspreemptspreviousrecoveryandnotworecoveriescanbeinterleaved.

5. Crccheck:ThenperformaCRCcheckfortheblockfile.Ifthereisamismatch,throwsCorruptedReplicaExceptionifthereplicaisrbworfinalized.Ifreplicaisrwr,truncatetheblockfiletothelastmatchedbyte.

iii. Ifnoexceptionisthrown,eachDataNodereturnsPDitsreplicastatus<replicaid,replicaGS,replicaon‐disklen,pre‐recoverystate>.

b. AfterreceivingareplyfromeachDataNode,PDdecidestheblocklengththatallreplicasshouldagreeon.

i. IfoneDataNodethrowsRecoveryInProgressException,PDabortsblockrecovery.

ii. IfallDataNodesthrowanexception,abortsblockrecovery.iii. Ifmax(LeniforallreportedDNi)==0,asksNNtoremovethis

block.iv. Otherwise,checkreturnedstateofthereplicaswithnon‐zero

length.Thefollowingtableshowsallthepossiblecombinationofstatesinanexampleoftworeplicasandthelengthtoagreeonforeachcombination.

Cases Replica1state

Replica2state

Lengthtobeagreedon

1 Finalized Finalized Tworeplicasshouldhavethesamelength;Ifnotthesame,

12

thereisanerror,logsitandabortsblockrecovery

2 Finalized rbw Tworeplicasshouldhavethesamelengthbecausetheclientmusthavediedwhenpipelineisbeingsetuportorndown.Iftheyarenotthesame,excludetherbwreplica.

3 Finalized rwr Setnewlengthtobethelengthofthefinalizedreplicaandexcludetherwrreplica.

4 rbw rbw SetnewlengthtobeMIN(len1,len2)whereleniisthelengthofreplicai.

5 rbw rwr Excluderwrreplica.Thisbecomesthesameascase4.

6 rwr rwr SetthenewlengthtobeMIN(len1,len2).Inthiscase,lenimaynotequaltoBRi,sonoguaranteeofvisiblebytessincebothDataNodesdied.

c. Recoverreplicasthatparticipatedinlengthagreementinstepb.iv.

i. PDaskseachDataNodetorecoverthereplica.PDsendstheblockid,newGS,newlength.

ii. IftheDataNodedoesnothavethereplicaintherurstateoritsrecoveryiddoesnotmatchthenewGS,failthereplicarecoveryattheDataNode.

iii. Otherwise,theDataNodechangesthereplica’sGStobethenewGSbothondiskandinmemory.Itthenupdatesinmemoryreplicalengthtobethenewlengthandtruncatestheblockfilesizetothenewlengthandchangecrcfileaccordingly(maycausetruncationand/ormodificationoflast4crcbytes).Itfinalizesthereplicaifthereplicahasnotfinalizedyet.Thereplicarecoverysucceeds.

d. PDcheckstheresultofc.IfnoDataNodesucceeds,blockrecoveryfails.Ifsomesucceedandsomefail,PDgetsanewgenerationstampfromNNandrepeatsblockrecoverywiththesuccessfulDataNodes.IfallDataNodessucceed,PDnotifiesNNthenewGSandlength.NNfinalizestheblockandclosesthefileifallblocksofthefilechangetoCompletestate.NNforcesthefiletocloseafteralimitednumberofcloseretries.

ThisleaserecoveryalgorithmalsoguaranteesthatanybytesthatwerevisibletoaclientdoesnotgetremovedasaresultoftherecoveryifatleastoneDataNodeinthepipelineisstillaliveanditsdataarenotcorrupted.Thisisbecause

13

1. Incases1,2,and3,thereisafinalizedreplica.Theclientmusthavediedawayduringblockconstructionstages1and3.Thealgorithmdoesnotremoveanybyte.

2. Incases4and5,allreplicastoberecoveredareinrbwstate.Theclientmusthavediedduringblockconstructionstage2.Assumethepre‐recoverypipelinehasNDataNodes:DN0,DN1,…,DNN‐1.ThelengthreturnedbyDNistep4.a.iimustbeequaltoBRi.AssumethatasubsetoftheDataNodesSinthepipelineparticipatesthelengthagreement,thenewlengthisMIN(BRi,forallDataNodesinS)>=BRN‐1>=BAN‐1>=..>=BA0.Thisguaranteestheleaserecoverydoesnotremovedatathathavedeliveredtoanyreader.

3. Incase6,thealgorithmdoesnotprovideanyguaranteesinceallDataNodesinthepre‐recoverypipelinehadbeenrestarted.

7. PipelineSetUp7.1. Causesofpipelinesetup

Therearefivecasesthatapipelineneedstobesetup:1. Create:Whenanewblockiscreated,apipelineneedstobeconstructed

beforeanybytesarestreamedtoanyDataNode.2. Append:Whenafileistobeappendedandthelastblockofthefileisnotfull.

ApipelineofallDataNodesthathaveareplicaofthelastblockneedstobesetupbeforeanynewbytesarestreamedtoanyDataNode.

3. Appendrecovery:Whencase2fails,apipelinecontainingtheremainingDataNodesneedstobesetup.

4. Datastreamingrecovery:Ifdatastreamingfails,apipelineoftheremainingDataNodesneedstobesetupbeforethedatastreamingresumes.

5. Closerecovery:Ifpipelineclosefails,apipelineoftheremainingDataNodesneedstobesetupinordertofinalizetheblock.

7.2. Pipelinesetupsteps1. Cases2,3,4,and5buildapipelineonanexistingblock,sotheblock’s

generationstampneedstobebumpedalongwithpipelineconstruction.ThedfsclientasksNNforanewgenerationstamp.

2. ThedfsclientsendsawriteblockrequesttotheDataNodesinthenewpipelinewiththeparameters(notinclusive):blockidwitholdgenerationstamp,blocklength(numberofbytesareplicamusthaveoratleasthave),maxreplicalength,flags,and/oranewgenerationstamp.

CasesBlockid/generationstamp

Blocklen flagsNewgenerationstamp

Maxblocklen

1(create) yes 0 Noflagisset no no

2(Append) yes Pre‐appendblocklen

Appendflagisset yes

no

14

3(AppendRecovery) yes Thesameas2

Appendandrecoveryflagsareset

yes

no

4(Datastreamingrecovery)

yes BAc Recoveryflagisset yes BSc

5(Closerecovery) yes BAc=BSc Close

flagisset yesno

3. ThefollowingtableshowsthebehaviorwhenaDataNodereceivesa

pipelinesetuprequest.Notethatrwrreplicasdonotparticipateinthepipelinerecovery.Wecanrelaxthisrestrictionwithsomespecialhandlingoftherwrreplicas.Butsincethisisaveryrarecase,wechoosenottodoitinthisroundofdesign.

Cases SanityCheck Replicastatechange GSchange

1(Create)Areplicawiththesameblockidshouldnotexist

Createarbwreplicawith(BA,BR)=(0,0) no

2(Append)

Finalizedreplica;itson‐disklenshouldmatchthepre‐appendlen

Openthereplicaforwrite;setthewritestreamoffsetattheendoffile;thereplicabecomesrbw:(BA,BR)=(preAppendLen,preAppendLen)

SettonewGS

3(AppendRecovery)

Finalizedorrbw;replicalength(on‐diskorBR)shouldmatchpre‐appendlen;rbwreplicaGScouldbethesameornewer

Iffinalized,dothesameasabove;Ifrbw,waitforwritertoexit;openblockforwrite;setwritestreamoffsetattheendoffile.

SettonewGS

4(DataStreamingRecovery)

rbwreplica;thesameornewerGS;BAc<=BAi<=BRi<=BSc

Waitforwritertoexit;Openblockforwrite;setwritestreamoffsetattheendoffile.

SettonewGS

5(CloseRecovery)

rbworfinalized;thesameornewerGS;replicalengthshouldbethesameasBAc

Ifrbw,waitforwritertoexitandthenfinalizethereplica;Closepipelinewhenackissentback.

SettonewGS

4. Incases2,3,and4,onasuccessfulpipelinesetup,thedfsclientnotifies

NNofthenewGS,minlength,andtheDataNodesinthenewpipeline.NN

15

thenupdatestheunderconstructionblock’sgenerationstamp,lenandlocations.

5. Ifthepipelinesetupfails,ifatleastoneDataNoderemains,gotostep1witharecoveryflagset.Otherwise,sincethepipelinehasnomoreDataNode,markthispipelineasfailed.Ifauserapplicationisblockedinhflush/write,itwillbeunblockedandgetanEmptyPipelineException.Otherwisethenextwrite/hflush/closewillgetanEmptyPipelineExceptionimmediately.

8. ReportReplica/BlockState/MetaInformationChangetoNN

8.1. ClientReportsAclientinformsNNofanunderconstructionblock’smetainformationchangeorstatechange.Asdiscussedinthepipelinesetupsection,incases2(append),3(appendrecovery),and4(datastreamingrecovery),afteranewpipelineissetup,aclientreportsNNoftheblock’snewGSandtheDataNodesinthepipeline.NNthenupdatestheunderconstructionblock’sGS,length,andlocations.Notethatinthisdesignafterapipelineforcreatingablockissetup,aclientdoesnotreportNNofthenewlycreatedblockanditslocations.InsteadwhentheclientissuesaddBlock/appendtoaskforanewblock,NNputsthenewblockanditslocationsintotheblocksMapbeforeNNreturnstheblockandlocationsbacktoclient.Thisdesignhasaminorflaw.IfanewreaderhappenstoreadthelastblockbetweenthetimetheblockisaddedtoblocksMapatNameNodeandthetimeareplicaoftheblockiscreatedonaDataNode,thereadermaygeta“blockdoesnotexit”error.Butsincethechanceofhavingareaderduringthisshorttimeframeisveryslim,weconsciouslymakethisdesigndecisiontotradeforperformance.AclientdoesnotneedtosendanotificationtoNNafterapipelineissetupforeveryblockcreate.WhenaclientissuesaddBlockorclose(afile),NNwillfinalizethelastblock’sGSandlength.ThelastblockmaymovetotheCompletestateifthelastblockalreadyhasaGS/lenmatchedfinalizedreplica;otherwisethelastblockismovedtotheCommittedstate.Inaddition,ifthenumberofthereplicasofthelastblockislessthanitsreplicationfactor,NNexplicitlyreplicatestheblocktoreachitsreplicationfactor.

8.2. DataNodeReportsADataNodereportsareplica’smetainformationorstatechangebyperiodicallysendingNNablockreportorsendingNNablockReceivedmessagewhenarbwreplicaisfinalized.

16

8.3. BlockReports• Eachblockreportcontainstwolists:oneforfinalizedreplicas,onefor

rbw.Thefinalizedreplicaslistincludefinalizedreplicasandunderrecoveryreplicaswhoseoldstatearefinalized.Therbwreplicalistincludesrbwreplicas,rwrreplicas,andrurreplicaswhoseoldstatearenotfinalized.Anrbwreplica’slengthisitsbytesreceived(BR).Thelengthofanrwrisanegativenumber.

• Noweachreportedreplica’sstateisaquadruple<DataNode,blck_id,blck_GS,blck_len,isRbw>.

• AfterNNreceivesablockreport,compareitwithwhat’sinthememoryandgenerate3lists:(doweneedaupdateStateList?)

o deleteListifblck_idisnotvalid,i.e.noentryinblocksMaporbelongstonofile.

o addStoredBlockListifNNdoesnothavethereplica<DataNode,blck_id>buttheblockreporthasit.

o rmStoredBlockListifNNhasthereplica<DataNode,blck_id>buttheblockreportdoesnothaveit.

o updateStateListifthereplica’sstateinNNisrbwbuttheblockreportsaysitisfinalized.

ToavoidraceconditionbetweenclientreportsandDataNodereports,rbwreplicasareaddedtoonlydeleteListoraddStoredBlockList.

• Addanewreplica1. BlockinNNisComplete

• Ifthereportedreplicaisfinalized,o IfitsGSandlengtharedifferentfromNNrecorded

value,addittotheblocksMapbutmarkitascorrupt.o Otherwise,addthereplica.

• Ifthereportedreplicaisrbw,o Ifthefileisclosed,ifthereplica’sGS/lenisdifferent

fromNN–recordedvalueortheblockhasreacheditsreplicationfactor,instructitsDataNodetodeleteit.

o Otherwise,donothing.2. BlockinNNisCommitted

Thehandlingisverysimilartotheabovecaseexceptthatifthereportedreplicaisfinalizedandmatchestheblock’sGSandlength,NNchangestheblocktothecompletestate.

3. BlockinNNisUnderConstructionorUnderRecovery• Ifthereportedreplicaisfinalizedandthereplica’sGSisequal

toornewerthanNNrecordedGS,addthereplica.AlsomarkthereplicaasfinalizedandkeepstrackofitslengthandGS.

• Ifthereportedreplicaisrbwandthereplicaisvalid(GSnotolderandlengthnotshorter),addthereplica.Ifthereportedreplicaisrbw,markitasrbw;otherwise,markitasrwr.

• Otherwiseignoreit.

17

• Updateareplica’sstateWhenablockreportshowsareplicaischangedfromrbwtofinalized,iftheblockisunderconstruction,NNmarkstheNNstoredreplicaasfinalizedandkeepstrackofthefinalizedreplica’sGSandlength.IftheblockisCommitted,ifthefinalizedreplicamatchestheblock’sGSandlength,NNchangestheblocktotheCompletestate;otherwiseremovethisreplicafromNN.

8.4. blockReceivedDataNodessendblockReceivedtoNNtonotifythatareplicaisfinalized.WhenNNreceivesablockReceivednotification,iteitheraddanewreplicaif(DataNode,block_id)doesnotexistinNN,updatethereplica’sstateifitisrecordedatrbw,oraskaDataNodetodeletethereplicaiftheblockisinvalid.

9. Replica/BlockStateTransition9.1. ReplicaStateTransition

ThefollowingdiagramsummarizesallpossiblereplicastatetransitionsataDataNode.

• Anewreplicaiscreated

18

o Eitherbyaclient.Thenewreplicastartwithareplicabeingwritten(rbw)state.

o OruponaninstructionfromNNtoreplicateorcopyareplicaforthepurposeofbalancing.Thenewreplicaisintemporarystate.

• Anrbwreplicachangestobeareplicawaitingtoberecovered(rwr)whenitsDataNoderestarts.

• Areplicachangestobeaunderrecoveryreplicawhenareplicarecoverystartsinresponsetoleaseexpiration.

• Areplicaisfinalizedwhenaclientissuesaclose,replicarecoverysucceeds,orreplication/copysucceeds.

• Errorrecoveryalwayscausesareplica’sGStobebumped.

9.2. BlockStateTransitionThefollowingdiagramsummarizesallpossibleblockstatetransitionsattheNameNode.

• Ablockiscreated

o EitherwhenaclientissuesaddBlocktoaddanewblocktoafile.o Orwhenaclientissuesappendandthelastblockofthefileisfull.

Thenewlycreatedblockisablockunderconstruction.

Complete Block

Block Under Construction

Block Under Recovery

Committed Block

addBlock

NN restarts if last block of an unclosed

file

lease expires & block recovery

starts

NN restarts

receives a GS/Len matched finalized

replica

Init

addBlock

pipeline recovery succeeds (GS++)

NN restarts if not last block of an unclosed file

block recovery

fails

lease expires &

block recovery

starts(Recovery#

++)

lease expires

block recovery succeeds but no GS/len matched finalized replica

(GS++)

removed

zero length block

append pipeline is setup (GS++)

closelease expires and

force file close

lease expires

NN restarts if not last block of an unclosed file

append if last block is full

addBlock or close

Append or NN restarts

recovery succeeds

19

• AppendmayalsocauseaCompleteblocktobechangedtoablockUnderConstructionifthelastblockispartial.

• WhenaddBlockorcloseisissued,o thelastblockbecomeseitherCompleteiftheblockalreadyhasa

GS/lenmatchedfinalizedreplicaorCommittedotherwise.o addBlockwaitsuntilthepenultimateblocktobecomeComplete.o Afilewon’tbecloseduntilthelasttwoblocksofthefileareComplete.

• Whenleaseexpires,aleaserecoverychangesablockunderconstructiontobeablockunderrecovery.

o Ablockrecoverymaychangeablockunderrecoverytobe Removedifallitsreplicasareoflength0; Committediftherecoverysucceedsandtheblockhasno

GS/lenmatchedfinalizedreplica; CompleteiftherecoverysucceedsandtheblockhasaGS/len

matchedfinalizedreplica.o AleaserecoverymayforceaCommittedblocktobeComplete.

• Blockstatesdonotpersistondisk.WhenaNameNoderestarts,thelastblockofanunclosedfilebecomesunderconstructionandtherestbecomeComplete.

o NoteaCompleteorCommittedblockmaychangetobeanUnderConstructionblockafterNNrestartsifitisthelastblockofafile.Iftheclientisstillalive,theclientwillfinalizeitagain.Otherwisewhenleaseexpires,ablockrecoverywillfinalizeitagain.

• NotethatonceablockbecomesCommittedorComplete,allitsreplicasshouldhavethesameGSandarefinalized.WhenablockisUnderConstruction,itmayhavemultiplegenerationsoftheblockcoexistinginthecluster.


Recommended