19
1 Append/Hflush/Read Design Hairong Kuang, Konstantin Shvachko, Nicholas Sze, Sanjay Radia, Robert Chansler Yahoo! HDFS team 08/06/2009 1. Design challenges With hflush, HDFS needs to make the last block of an unclosed file visible to readers. This presents two challenges: 1. Read consistency. At a given time different replicas of the last block may have different number of bytes. What read consistency should HDFS provide and how to guarantee the consistency even in case of failures. 2. Data durability. When any error occurs, the recovery cannot simply throw the last block away. Instead the recovery needs to preserve at least the hflushed bytes while maintaining the read consistency. 2. Replica/Block States This document will call a block at a DataNode a replica to differentiate it from a block at the NameNode. 2.1. Need for new states Pre‐append/hflush a replica at a DataNode is either finalized or temporary. When a replica is first created, it is in the temporary state on DataNode. A temporary replica becomes finalized upon the close request of a client when no more byte will be written to this replica. On a DataNode restart, temporary replicas are removed. This is acceptable pre‐append/hflush because HDFS provides best effort durability for under‐construction blocks. This is not acceptable after append/hflush are supported. HDFS needs to support strong durability for under‐construction blocks that contain pre‐append data and best effort durability for hflushed data. So some temporary replicas need to be preserved across DataNode restarts. 2.2. Replica states (DataNode) At a DataNode, this design introduces a replica being written (rbw) state and other states for handling errors. In a DataNode’s memory, a replica could be in any of the following state: Finalized: A finalized replica has finalized its bytes. No new byte will be written to this replica unless it is reopened for append. Its data and meta data match. The other replicas of the same block id have the same bytes as this replica. But the generation stamp (GS) of a finalized replica does not remain constant. It may be bumped up as a result of error recovery.

Append/Hflush/Read Design - cnblogs.com

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

1

Append/Hflush/ReadDesignHairongKuang,KonstantinShvachko,NicholasSze,SanjayRadia,RobertChansler

Yahoo!HDFSteam08/06/2009

1. DesignchallengesWithhflush,HDFSneedstomakethelastblockofanunclosedfilevisibletoreaders.Thispresentstwochallenges:

1. Readconsistency.Atagiventimedifferentreplicasofthelastblockmayhavedifferentnumberofbytes.WhatreadconsistencyshouldHDFSprovideandhowtoguaranteetheconsistencyevenincaseoffailures.

2. Datadurability.Whenanyerroroccurs,therecoverycannotsimplythrowthelastblockaway.Insteadtherecoveryneedstopreserveatleastthehflushedbyteswhilemaintainingthereadconsistency.

2. Replica/BlockStatesThisdocumentwillcallablockataDataNodeareplicatodifferentiateitfromablockattheNameNode.

2.1. NeedfornewstatesPre‐append/hflushareplicaataDataNodeiseitherfinalizedortemporary.Whenareplicaisfirstcreated,itisinthetemporarystateonDataNode.Atemporaryreplicabecomesfinalizeduponthecloserequestofaclientwhennomorebytewillbewrittentothisreplica.OnaDataNoderestart,temporaryreplicasareremoved.Thisisacceptablepre‐append/hflushbecauseHDFSprovidesbesteffortdurabilityforunder‐constructionblocks.Thisisnotacceptableafterappend/hflusharesupported.HDFSneedstosupportstrongdurabilityforunder‐constructionblocksthatcontainpre‐appenddataandbesteffortdurabilityforhflusheddata.SosometemporaryreplicasneedtobepreservedacrossDataNoderestarts.

2.2. Replicastates(DataNode)AtaDataNode,thisdesignintroducesareplicabeingwritten(rbw)stateandotherstatesforhandlingerrors.InaDataNode’smemory,areplicacouldbeinanyofthefollowingstate:Finalized:Afinalizedreplicahasfinalizeditsbytes.Nonewbytewillbewrittentothisreplicaunlessitisreopenedforappend.Itsdataandmetadatamatch.Theotherreplicasofthesameblockidhavethesamebytesasthisreplica.Butthegenerationstamp(GS)ofafinalizedreplicadoesnotremainconstant.Itmaybebumpedupasaresultoferrorrecovery.

2

Rbw(ReplicaBeingWrittento):Onceareplicaiscreatedorappended,itisintherbwstate.Bytesarebeingwrittentothisreplica.Itisalwaysareplicaofthelastblockofanunclosedfile.Itslengthisnotfinalizedyet.Itson‐diskdataandmetadatamaynotmatch.Otherreplicasofthesameblockidmayhavemoreorlessbytesthanthisone.Bytes(maynotall)inanrbwreplicaarevisibletoreaders.Incaseofanyfailure,bytesinanrbwreplicawilltrytobepreserved.Rwr(ReplicaWaitingtobeRecovered):IfaDataNodediesandrestarts,allitsrbwreplicaschangetobeintherwrstate.Rwrreplicaswillnotbeinanypipelineandthereforewillnotreceiveanynewbytes.Theywilleitherbecomeoutdatedorwillparticipateinaleaserecoveryiftheclientalsodies.rur(ReplicaUnderRecovery):Areplicachangestobeintherurstatewhenareplicarecoverystartsasaresultofleaseexpiration.Moredetailswillbediscussedintheleaserecoverysection.Temporary:atemporaryreplicaisalsoareplicaunderconstructionbutiscreatedforthepurposeofblockreplicationorclusterbalancing.Itsharesmanyofthepropertiesasanrbwreplica,butitsdataareinvisibletoanyreader.IfthereplicaconstructionfailsoritsDataNoderestarts,atemporaryreplicawillbedeleted.OnaDataNode’sdisk,eachdatadirectorywillhavethreesubdirectories:currentcontainsfinalizedreplicas,tmpcontainstemporaryreplicas,rbwcontainsrbw,rwr,andrurreplicas.WhenareplicaisfirstcreatedbyarequestfromaDFSclient,itisputintherbwdirectory.Whenareplicaisfirstcreatedforthepurposeofreplicationorclusterbalancing,itisputinthetmpdirectory.Onceareplicaisfinalized,itismovedtothecurrentdirectory.WhenaDataNoderestarts,thereplicasinthetmpdirectoryareremoved,thereplicasintherbwdirectoryareloadedasrwrreplicas,andthereplicasinthecurrentdirectoryareloadedasfinalizedreplicas.DuringDataNodeupgrade,allreplicasindirectoriescurrentandrbwneedtobekeptinasnapshot.

2.3. Blockstates(NameNode)NameNodealsointroducesmanynewstatesforablock.Ablockcouldbeinanyofthefollowingstate:UnderConstruction:Onceablockiscreatedorappended,itisintheUnderConstructionstate.Bytesarebeingwrittentothisblock.Itisthelastblockofanunclosedfile.ItslengthandGShasnotfinalizedyet.Data(maynotall)inablockunderconstructionarevisibletoreaders.Ablockunderconstructionkeepstrackofitswritepipeline(i.e.,locationsofvalidrbwreplicas)andthelocationsofitsrwrreplicasiftheclientdies.UnderRecovery:Whenafile’sleaseexpires,ifthelastblockisUnderConstruction,itischangedtobeUnderRecoverystateonceblockrecoverystarts.

3

Committed:Acommittedblockhasfinalizeditsbytesandgenerationstamp(GS),buthasnotseenatleastoneGS/lengthmatchedfinalizedreplicafromDataNodesyet.NonewbytewillbewrittentothisblockanditsGSwillnotbebumpedunlessitisreopenedforappend.Inordertoservereadrequests,acommittedblockstillneedstokeepthelocationsofrbwreplicas.ItalsoneedstotracktheGSandlengthofitsfinalizedreplicas.AnunderconstructionblockofanunclosedfileiscommittedwhenNNisaskedbytheclienttoaddanewblocktothefileorclosethefile.Ifthelastblockisinthecommittedstate,thefilecannotbeclosedandtheclienthastoretry.AddBlockandclosewillbeextendedtoincludethelastblock’sGSandlength.Complete:AcompleteblockisablockwhoselengthandGSarefinalizedandNameNodehasseenaGS/lenmatchedfinalizedreplicaoftheblock.Acompleteblockkeepsonlyfinalizedreplicas’locations.Onlywhenallblocksofafilebecomecomplete,afilecouldbeclosed.Differentfromreplica’sstates,ablock’sstatedoesnotpersistonanydisk.WhenNameNoderestarts,thelastblockofanunclosedfileisloadedasUnderConstruction.AlltherestoftheblocksareloadedasComplete.

Moredetailsaboutabovereplica/blockstateswillbediscussedintherestofthedocument.Areplicastatetransitiondiagramandablockstatetransitiondiagramwillbesummarizedinthelastsection.

3. Write/hflush

3.1. BlockConstructionPipelineAnHDFSfileconsistsofmultipleblocks.Eachblockisconstructedthroughawritepipeline.Bytesarepushedtothepipelinepacketbypacket.Ifnoerroroccurs,ablockconstructiongoesthroughthreestagesasshowninthefollowingpictureillustratedbyapipelineof3DataNodes(DN)andablockof5packets.Inthepicture,boldlinesrepresentdatapackets,dottedlinesrepresentackmessages,andregularlinesrepresentcontrolmessages(setup/close).Fromt0tot1isthepipelinesetupstage.T1tot2isthedatastreamingstage,wheret1isthetimewhenthefirstdatapacketgetssentandt2isthetimethattheacktothelastpacketgetsreceived.Notepacket2isanhflushedpacket.T2tot3isthepipelineclosestage.

client DN0 DN1 DN2

pipeline setup

Data streaming

close

packet 0

packet 1

packet 2

packet 3

packet 4

set up

close

t0

t1

t2

t3

4

Stage1. Setupapipeline

AWrite_Blockrequestissentbyaclientdownstreamalongthepipeline.AfterthelastDataNodereceivestherequest,anackissentbytheDataNodeupstreamalongthepipelinebacktotheclient.Asaresultofthis,networkconnectionsalongthepipelinearesetupandeachDataNodehascreatedoropenedareplicaforwriting.

Stage2. DatastreamingUserdatafirstbufferattheclientside.Afterapacketisfilledup,thedatathengetpushedtothepipeline.Nextpacketcanbepushedtothepipelinebeforereceivingtheackforthepreviouspacket.Thenumberofoutstandingpacketsislimitedbytheoutstandingpacketswindowsizeattheclientside.Iftheuserapplicationexplicitlycallshflush,apacketispushedtothepipelinebeforeitisfilledup.Hflushisasynchronousoperationandnodatacanbewrittenbeforeanacknowledgementfortheflushedpacketcomesback.

Stage3. Close(finalizeablockandshutdownpipeline)Theclientsendsacloserequestonlyafterallpacketshavebeenacknowledgedattheclientside.Thisensuresthatifdatastreamingfails,therecoverydoesnotneedtohandlethecasethatsomereplicashavebeenfinalizedandsomedonothaveallthedata.

3.2. PackethandlingataDataNode

Foreachpacket,aDataNodeinthepipelinehastodo3things.

1. Streamdataa. ReceivedatafromtheupstreamDataNodeortheclientb. PushthedatatothedownstreamDataNodeifthereisany

2. Writethedata/crctoitsblockfile/metafile.3. Streamack

a. ReceiveanackfromthedownstreamDataNodeifthereisanyb. SendanacktotheupstreamDataNodeortheclient

Notethatthenumbersabovedonotindicatetheorderthatthethreethingsmustbeexecutedin.Streamingack(3)isdoneafterstreamingdata(1)bythedefinitionofapipeline.Butintheorywritingdatatodisk(2)couldbedoneanytimeafter1.a.Thisalgorithmchoosestodoitrightafter1.bandbeforereceivingthenextpacket.EachDataNodehastwothreadsperpipeline.Thedatathreadisresponsiblefordatastreaminganddiskwriting.Foreachpacket,itdoes1.a,1.b,and2insequence.Onceapacketisflushedtothedisk,itcanberemovedfromthein‐memorybuffer.Theackthreadisresponsibleforackstreaming.Foreachpacket,itdoes3.aand3.binsequence.Sincethedatathreadandtheackthreadrunconcurrently,thereisno

5

guaranteeontheorderof(2)and(3).Theackofapacketmightbesentbeforethepacketisflushedtothedisk.Thisalgorithmprovidesatradeoffonthewriteperformance,datadurability,andalgorithmsimplicity.It

1. Improvesdatadurabilityagainstfailuresbywritingdatatodisksoonerthanlaterwhentheackisreceived;

2. Parallelizesdata/ackstreamingindownstreampipelineandon‐diskwriting;3. Simplifiesbuffermanagementsincethereisatmostonepacketin‐memory

perpipeline.

3.3. Consistencysupport• Whenaclientreadsbytesfromanrbwreplica,theDataNodethatitreads

frommaynotmakeallthebytesthatitreceivedvisibletotheclient.• Eachrbwreplicamaintainstwocounters:

1. BA:numberofbytesthathavebeenacknowledgedbythedownstreamDataNodes.ThosearethebytesthattheDataNodemakesvisibletoanyreader.Intherestofthedocument,wemayinterchangeablycallitthereplica’svisiblelength.

2. BR:numberofbytesthathavebeenreceivedforthisblock,includingthebyteswrittentoitsblockfileandin‐DataNode‐bufferbytes.

• AssumeinitiallyallDataNodesinthepipelinehave(BA,BR)=(a,a).Thenaclientpushesapacketofbbytestothepipelineandnootherpacketsarepushedtothepipelinebeforetheclientreceivesanackforthepacket.

1. ADataNodechangesits(BA,BR)tobe(a,a+b)rightafterstep1.a.2. ADataNodechangesits(BA,BR)tobe(a+b,a+b)rightafterstep

2.a.3. Whenasuccessackissentbacktotheclient,allDataNodesinthe

pipelinehave(BA,BR)=(a+b,a+b).• ApipelineofNDataNodesDN0,…,DNi,…,DNN‐1,whereDN0isthefirstin

thepipeline,i.e.,theclosesttothewriter,hasthefollowingproperty:atanygiventimet,

where isBAoftheblockatDNiattimetand isBRoftheblockatDNiattimet.

NotethatthispropertyguaranteesthatonceabytebecomesvisibleallDataNodesinthepipelinehasthebyte.• Assume isthenumberofbytesthataclienthassenttothepipeline

attimetand isthenumberofbytesthattheclienthasreceivedackfor.Wehave

6

4. ReadWhenanunclosedfileisopenedforread,thechallengeishowtoprovidetheconsistencyguaranteeifthelastblockisunderconstruction.ThealgorithmneedstomakesurethatabytereadatDataNodeDNicanalsobereadatanotherDataNodeDNj,evenifBAi>BAj.Algorithm1:• Whenareaderreadsanunderconstructionblock,itfirstasksoneofthe

replicasforitsBAbysendingarequesttotheDataNode.• IfanapplicationtriestoreadabytebeyondBAoftheblock,thedfsclient

throwsanEOFException.• Onlyareadrequestthatreadfromapositionlessthanthevisiblelength

ofthelastblockwillbeforwardedtoaDataNode.WhenaDataNodegetsareadrequestfromarangeofbytesthatarelessthanitsBR,returnthebytes.

• Assumethatareadrequestisatriple(blck,off,len),whereblckcontainsablockidanditsgenerationstamp,offisthestartingoffsetintheblockfromwhichtoreadtheblock,andlenisthenumberofbytestoread.

• ADataNodecanservetherequestiftheDataNodehasareplicawithanequalornewerGS.

• ThesummationofoffandlenmustbeequalorlessthanBAj,assumingDNjistheDataNodewherethedfsclientfetchedtheblocklength.

• AssumethatthereadrequestissenttoDataNodeDNiandthereplica’sstateis(BAi,BRi).

1. Ifoff+len<=BAi,DNicansafelysendlenbytesbacktothedfsclientstartingatoff.

2. Ifoff+len>BAi,becauseoff+len<=BAj,BAj>=BAi.DNimustbeintheupstreaminthepipelinetoDNj,i.e.,isclosertothewriterthanDNjis.SoBRi>=BRj>=BAj.ThusBRi>BAj,andthereforeBRi>off+len.ThismeansDNimusthavethebytesthatthedfsclientwantstoread.DNideliversthebytestotheclient.

3. Off+lenshouldneverbegreaterthanBRi.Ifthiseverhappens,theDataNodelogtheerrorandrejectstherequest.

• IfDNigoesdownwhileservingtherequest,thedfsclientcanswitchtoreadfromanyotherDataNodecontainingareplicaoftheblock.

• Thisalgorithmissimplebutitrequiresareopenofafiletogetnewdatabecausethelengthofthelastblockisfetchedbeforereadandadfsclientcannotreadbeyondthelengthofthelastblock.

Algorithm2:• Thisalgorithmletsadfsclient,i.e.,areader,performtheconsistency

control,andDataNodesdeliverbytes.• Areadrequestisatriple(blck,off,len),whereblckcontainsablockid

anditsgenerationstamp,offisthestartingoffsetinablockfromwhichtoreadtheblock,andlenisthenumberofbytestoread.

7

• ADataNodecanservetherequestiftheDataNodehasareplicawithanequalornewerGS.

• Assumethattheblockhasastate(BAi,BRi),DNisendsbytes[off,MIN(off+len,BRi))totheclientalongwithitsBAi.

• Theclientreceivesandbuffersthedata.ItalsokeepstrackofthemaximumBAthatithasseenandonlydeliversbytestotheapplicationuptothemaximumBA.

• IfthereadfromDNifails,thedfsclientcanswitchtoreadfromanyotherDataNodecontainingareplicaoftheblock.

• Howreadconsistencyisguaranteed?AssumewehaveapipelineofNDataNodesDN0,…,DNi,…,DNN‐1,whereDN0isthefirstinthepipeline.Assumethatthenumberofbytesthataclientdeliverstoanapplicationattimetis .Wehave

SonomatterwhichDataNodeitreadsfrom,theDataNodeshouldhavethebyteitreadbefore.

• Thisalgorithmrequiresachangeofthereadprotocolandadfsclientismorecomplicatedsinceitneedstocontrolreadconsistency.Butthealgorithmdoesnotrequireareopeninordertoreadthenewdata.

ForHADOOP0.21,wearestilldiscussingwhethertoimplementalgorithm1oralgorithm2.

5. Append5.1. AppendAPIsupport

1. ClientsendsanappendrequesttoNN.2. NNchecksthefileandmakessurethatitisclosed.ThenNNchecksthe

file’slastblock.Ifitisnotfullandhasnoreplica,failappend.Otherwise,changethefiletobeunderconstruction.Ifthelastblockisfull,NNallocatesanewlastblock.Ifthelastblockisnotfull,NNchangesthisblocktobeanunderconstructionblock,withitsfinalizedreplicasasitsinitialpipeline.Itreturnstheblockid,generationstamp,length,anditslocations.Ifthelastblockisnotfull,italsoneedstoreturnanewgenerationstamp.

3. Setupapipelineforappendiflastblockisnotfull.Otherwisesetupapipelineforcreate.Checkpipelinesetupsectionformoredetails.

4. Ifthelastblockdoesnotendatachecksumchunkboundary,readthelastpartialcrcchunk.Thisisforthepurposeofcalculatingchecksums.

5. Therestisthesameasaregularwrite.

5.2. Durabilitysupport• NNmakessurethatthenumberofreplicasfortheCompleteblocksthat

containpre‐appenddatameetsthefile’sreplicationfactor.

8

• ThedurabilityofanUnderConstructionblockthatcontainspre‐appenddataisomittedinthisdesignfornow.

6. ErrorHandling6.1. PipelineRecovery

Whenablockisunderconstruction,errormayoccuratStage1whenapipelineisbeingsetup,atStage2whendataarestreamingtothepipeline,orStage3whenthepipelineisbeingclosed.ThepipelinerecoveryhandlesthecasewhenoneofDataNodesinthepipelinehasanerror.

6.1.1. RecoverfrompipelinesetupfailureIfaDataNodedetectsafailurewhenapipelineisbeingsetup,aDataNodeclosestheblockfileandclosesallitstcp/ipconnectionsafterafailureacknowledgementissenttotheupstreamDataNode.Oncetheclientdetectsthefailure,ithandlesthefailuredifferentlydependingonthepurposeofsettingupthepipeline.

• Ifthepipelinewasbuiltforcreatinganewblock,theclientsimplyabandonstheblockandasksNameNodeforanewblock.Itthenstartstobuildapipelineforthenewblock.

• Ifthepipelinewasbuiltforappendingtoablock,itrebuildsapipelinewiththeremainingDataNodesandbumpstheblock’sgenerationstamp.Seesection7(pipelinesetup)formoredetails.

Onespecialcaseofpipelinesetupfailuresisaccesstokenerror:oneoftheDataNodecomplainsthattheaccesstokenisinvalidwhenusinganaccesstokentosetupapipeline.Ifapipelinesetupfailureiscausedbyanexpiredaccesstoken,thedfsclientshouldrebuildthepipelinewithalltheDataNodesinthepreviouspipeline.Currenttrunk(0.21)avoidsthisspecialhandlingbyalwaysfetchinganewaccesstokenfromNameNoderightbeforesettingupapipeline.Thisdocumentwillkeepthesamedesign.

6.1.2. Recoverfromdatastreamingfailure• AtaDataNodeanerrormayoccurateither1.a,1.b,2,3.a,or3.basexplained

inSection3.2.Wheneveranerroroccurs,aDataNodetakesitselfoutofthewritepipeline:itclosesallthetcp/ipconnections,writesallbufferedbytesontodiskiftheerrordoesnotoccurat3,andclosestheon‐diskfiles.

• Whenthedfsclientdetectsafailure,stopssendingdatatothepipeline.• ThedfsclientreconstructsawritepipelineusingtheremainingDataNodes.

Seesection7(pipelinesetup)formoredetails.Asaresultofthis,allreplicasoftheblockarebumpedtoanewgenerationstamp.

• ThedfsclientresumessendingdatawiththenewgenerationstampstartingfromBAc.NoteanoptimizationcouldbethattheclientresumessendingbytesstartingatMIN(BRi,forallDataNodesDNiinthenewpipeline).

• WhenaDataNodereceivesapacket,ifitalreadyhasthepacket,thedatastreamsimplypushesthedatadownstreamwithoutwritingittothedisk.

9

Thisrecoveryalgorithmhasaniceproperty:anybytesthatwerevisibletoanyclient,evenfromadownDataNodewiththelargestBAoftheoldpipeline,continuetobevisibletoanyreaderduringandafterapipelinerecovery.ThisisbecausethepipelinerecoverydoesnotdecreaseanyDataNode’sBAandBR.Furthermoreanytimeduringthepipelinerecoverythenewpiplelinemaintainsthepropertydescribedinsection3.3(consistencysupport).

6.1.3. RecoverfromaclosefailureOncetheclientdetectsthefailure,itrebuildsapipelinewiththeremainingDataNodes.EachDataNodebumpstheblock’sgenerationstampandfinalizesthereplicaifitisnotfinalizedyet.Thenetworkconnectionistorndownafteranackissent.Seesection7(pipelinesetup)formoredetails.

6.2. DataNodeRestart• WhenaDataNoderestarts,itreadseachreplicaunderdirectoryrbwand

loadsthereplicainmemoryasWaitingToBeRecovered.Itslengthissettobethemaximumnumberofbytesthatmatchitscrc.

• AnyWaitingToBeRecoveredreplicadoesnotserveanyreadanddoesnotparticipateinapipelinerecovery.

• AWaitingToBeRecoveredreplicawilleitherbecomeoutdatedandbedeletedbyNNiftheclientisstillaliveorbechangedtobefinalizedasaresultofleaserecoveryiftheclientdies.

6.3. NameNodeRestart• Noneofblockstatesarepersistedondisk.SowhenNameNoderestarts,it

needstorestoreeachblock’sstate.ThelastblockofanunclosedfilebecomesUnderConstructionnomatterwhatitspre‐lifestatewas.OtherblocksbecomeComplete.

• AskeachDataNodetoregisterandsenditsblockreportincludingfinalized,rbw,rwr,andrurreplicas.

• NameNodedoesexitsafemodeunlessthenumberofcompleteandunderconstructionblocksthathavereceivedatleastonereplicareachesthepre‐definedthreshold.

6.4. LeaseRecoveryWhenafile’sleaseisexpired,NNneedstoclosethefileforthesakeoftheclient.Therearetwoissues:(1)Concurrencycontrol:whatifaleaserecoveryisperformedwhiletheclientisstillaliveeitherintheprocessofsettinguppipeline,writing,close,orrecovery.Whatiftherearemultipleconcurrentleaserecoveries?(2)Consistencyguarantee:Ifthelastblockisunderconstruction,allitsreplicasneedtorollbacktoaconsistentstate:allreplicashavethesameon‐disklengthandthesamenewgenerationstamp.

1. NNrenewslease,changesthefile’sleaseholdertobedfsandpersiststhechangetoitseditlog.Soiftheclientisstillalive,anyofthewrite‐relatedrequestslikeaskingforanewgenerationstamp,gettinganewblock,orclosingthefile,willberejectedbecausetheclientisnottheleaseholderany

10

more.ThispreventstheclientfromconcurrentlychanginganunclosedfileifitevercontactstheNameNode.

2. NNchecksthestateofthelasttwoblocksofitsfile.Otherblocksshouldbeinthecompletestate.Thefollowingtableshowsallthepossiblecombinationsandtheactiontotakeforeachcombination.

Penultimateblock

Lastblock Actions

Complete Complete ClosethefileComplete CommittedCommitted CompleteCommitted Committed

Retryclosingthefilewhenleaseexpiresnexttime;Forcetoclosethefileafteracertainnumberofretries

Complete UnderConstructionCommitted UnderConstruction

Startsblockrecoveryforthelastblock

Complete UnderRecoveryCommitted UnderRecovery

Startsanewblockrecoveryforthelastblock;stoprecoveryafteracertainnumberofretries

6.5. BlockRecovery1. NNchoosesaprimaryDataNode(PD)toworkastheproxyofNameNodeto

performblockrecovery.PDcouldbeaDataNodewhereoneofitsreplicasresides.Ifnoneofitsreplicasareknown,blockrecoveryaborts.

2. NNgetsanewGS,whichmarksthegenerationthattheblockisgoingtobebumpedtowhentherecoverysuccessfullyfinishes.Itthenchangesthelastblock,ifitisUnderConstruction,tobeUnderRecovery.TheUnderRecoveryblockisstampedwithauniquerecoveryid,whichisnewGSthattheblockisgoingtobebumpedto.AnycommunicationsfromaPDtoNNneedstomatchthisrecoveryid.Thisishowconcurrentblockrecoveriesarehandled.Thebasicruleisthatthelatestrecoveryalwayspreemptspreviousrecoveries.

3. NNthenasksPDtorecovertheblock.NNsendsPDthenewGS,blockidanditsgenerationstamp,andallitsreplicalocationsincludingfinalizedreplicas,replicasbeingwrittento,andreplicaswaitingtoberecovered.

4. PDperformsblockrecovery:a. PDaskseachDataNode,whereonereplicaislocated,toperform

replicarecovery.i. PDsendseachDataNodetherecoveryid,blockidandgenerationstamp;

ii. EachDataNodechecksitsreplicastate:1. Checkexistence:IftheDataNodedoesnothavethe

replicaorthereplicaisolderthantheblock’sGSintherequest,ornewerthantherecoveryid(thisisnotsupposedtohappen),throwsaReplicaNotExistsException.

2. Stopwriter:Ifitisareplicabeingwrittentoandthereisanongoingwriterthread,interruptsthewriterand

11

waitsforthewritertoexit.Whenawriterthreadisinterrupted,ifthethreadisinthemiddleofreceivingapacket,stopsandthrowsawaythepartialpacket.Beforethethreadexits,itmakessurethaton‐diskbytesarethesameasBRandthenclosestheblockandcrcfiles.ThishandlesconcurrentclientwritesandblockrecoveryatDataNodes.Blockrecoverypreemptsclientwrites,resultinginpipelinefailure.SubsequentpipelinerecoverywillfailbecausethedfsclientcannotgetanewgenerationstampfromNNforablockunderrecovery.

3. Stoppreviousblockrecovery:Ifthereplicaisalreadyintherurstate,throwsaRecoveryInProgressExceptionifitsrecoveryidisgreaterthanorequaltothenewrecoveryid.IfthenewGSisgreater,stamptherurreplica’srecoveryidtobethenewone.

4. Statechange:Otherwise,changethereplicatoberur.Setitsrecoveryidtobethenewrecoveryidandareferencetoitsoldstate.AnycommunicationsfromaPDtoitselfneedstomatchthisrecoveryid.Note3and4handleconcurrentblockrecoveriesatDataNodes.Thelatestrecoveryalwayspreemptspreviousrecoveryandnotworecoveriescanbeinterleaved.

5. Crccheck:ThenperformaCRCcheckfortheblockfile.Ifthereisamismatch,throwsCorruptedReplicaExceptionifthereplicaisrbworfinalized.Ifreplicaisrwr,truncatetheblockfiletothelastmatchedbyte.

iii. Ifnoexceptionisthrown,eachDataNodereturnsPDitsreplicastatus<replicaid,replicaGS,replicaon‐disklen,pre‐recoverystate>.

b. AfterreceivingareplyfromeachDataNode,PDdecidestheblocklengththatallreplicasshouldagreeon.

i. IfoneDataNodethrowsRecoveryInProgressException,PDabortsblockrecovery.

ii. IfallDataNodesthrowanexception,abortsblockrecovery.iii. Ifmax(LeniforallreportedDNi)==0,asksNNtoremovethis

block.iv. Otherwise,checkreturnedstateofthereplicaswithnon‐zero

length.Thefollowingtableshowsallthepossiblecombinationofstatesinanexampleoftworeplicasandthelengthtoagreeonforeachcombination.

Cases Replica1state

Replica2state

Lengthtobeagreedon

1 Finalized Finalized Tworeplicasshouldhavethesamelength;Ifnotthesame,

12

thereisanerror,logsitandabortsblockrecovery

2 Finalized rbw Tworeplicasshouldhavethesamelengthbecausetheclientmusthavediedwhenpipelineisbeingsetuportorndown.Iftheyarenotthesame,excludetherbwreplica.

3 Finalized rwr Setnewlengthtobethelengthofthefinalizedreplicaandexcludetherwrreplica.

4 rbw rbw SetnewlengthtobeMIN(len1,len2)whereleniisthelengthofreplicai.

5 rbw rwr Excluderwrreplica.Thisbecomesthesameascase4.

6 rwr rwr SetthenewlengthtobeMIN(len1,len2).Inthiscase,lenimaynotequaltoBRi,sonoguaranteeofvisiblebytessincebothDataNodesdied.

c. Recoverreplicasthatparticipatedinlengthagreementinstepb.iv.

i. PDaskseachDataNodetorecoverthereplica.PDsendstheblockid,newGS,newlength.

ii. IftheDataNodedoesnothavethereplicaintherurstateoritsrecoveryiddoesnotmatchthenewGS,failthereplicarecoveryattheDataNode.

iii. Otherwise,theDataNodechangesthereplica’sGStobethenewGSbothondiskandinmemory.Itthenupdatesinmemoryreplicalengthtobethenewlengthandtruncatestheblockfilesizetothenewlengthandchangecrcfileaccordingly(maycausetruncationand/ormodificationoflast4crcbytes).Itfinalizesthereplicaifthereplicahasnotfinalizedyet.Thereplicarecoverysucceeds.

d. PDcheckstheresultofc.IfnoDataNodesucceeds,blockrecoveryfails.Ifsomesucceedandsomefail,PDgetsanewgenerationstampfromNNandrepeatsblockrecoverywiththesuccessfulDataNodes.IfallDataNodessucceed,PDnotifiesNNthenewGSandlength.NNfinalizestheblockandclosesthefileifallblocksofthefilechangetoCompletestate.NNforcesthefiletocloseafteralimitednumberofcloseretries.

ThisleaserecoveryalgorithmalsoguaranteesthatanybytesthatwerevisibletoaclientdoesnotgetremovedasaresultoftherecoveryifatleastoneDataNodeinthepipelineisstillaliveanditsdataarenotcorrupted.Thisisbecause

13

1. Incases1,2,and3,thereisafinalizedreplica.Theclientmusthavediedawayduringblockconstructionstages1and3.Thealgorithmdoesnotremoveanybyte.

2. Incases4and5,allreplicastoberecoveredareinrbwstate.Theclientmusthavediedduringblockconstructionstage2.Assumethepre‐recoverypipelinehasNDataNodes:DN0,DN1,…,DNN‐1.ThelengthreturnedbyDNistep4.a.iimustbeequaltoBRi.AssumethatasubsetoftheDataNodesSinthepipelineparticipatesthelengthagreement,thenewlengthisMIN(BRi,forallDataNodesinS)>=BRN‐1>=BAN‐1>=..>=BA0.Thisguaranteestheleaserecoverydoesnotremovedatathathavedeliveredtoanyreader.

3. Incase6,thealgorithmdoesnotprovideanyguaranteesinceallDataNodesinthepre‐recoverypipelinehadbeenrestarted.

7. PipelineSetUp7.1. Causesofpipelinesetup

Therearefivecasesthatapipelineneedstobesetup:1. Create:Whenanewblockiscreated,apipelineneedstobeconstructed

beforeanybytesarestreamedtoanyDataNode.2. Append:Whenafileistobeappendedandthelastblockofthefileisnotfull.

ApipelineofallDataNodesthathaveareplicaofthelastblockneedstobesetupbeforeanynewbytesarestreamedtoanyDataNode.

3. Appendrecovery:Whencase2fails,apipelinecontainingtheremainingDataNodesneedstobesetup.

4. Datastreamingrecovery:Ifdatastreamingfails,apipelineoftheremainingDataNodesneedstobesetupbeforethedatastreamingresumes.

5. Closerecovery:Ifpipelineclosefails,apipelineoftheremainingDataNodesneedstobesetupinordertofinalizetheblock.

7.2. Pipelinesetupsteps1. Cases2,3,4,and5buildapipelineonanexistingblock,sotheblock’s

generationstampneedstobebumpedalongwithpipelineconstruction.ThedfsclientasksNNforanewgenerationstamp.

2. ThedfsclientsendsawriteblockrequesttotheDataNodesinthenewpipelinewiththeparameters(notinclusive):blockidwitholdgenerationstamp,blocklength(numberofbytesareplicamusthaveoratleasthave),maxreplicalength,flags,and/oranewgenerationstamp.

CasesBlockid/generationstamp

Blocklen flagsNewgenerationstamp

Maxblocklen

1(create) yes 0 Noflagisset no no

2(Append) yes Pre‐appendblocklen

Appendflagisset yes

no

14

3(AppendRecovery) yes Thesameas2

Appendandrecoveryflagsareset

yes

no

4(Datastreamingrecovery)

yes BAc Recoveryflagisset yes BSc

5(Closerecovery) yes BAc=BSc Close

flagisset yesno

3. ThefollowingtableshowsthebehaviorwhenaDataNodereceivesa

pipelinesetuprequest.Notethatrwrreplicasdonotparticipateinthepipelinerecovery.Wecanrelaxthisrestrictionwithsomespecialhandlingoftherwrreplicas.Butsincethisisaveryrarecase,wechoosenottodoitinthisroundofdesign.

Cases SanityCheck Replicastatechange GSchange

1(Create)Areplicawiththesameblockidshouldnotexist

Createarbwreplicawith(BA,BR)=(0,0) no

2(Append)

Finalizedreplica;itson‐disklenshouldmatchthepre‐appendlen

Openthereplicaforwrite;setthewritestreamoffsetattheendoffile;thereplicabecomesrbw:(BA,BR)=(preAppendLen,preAppendLen)

SettonewGS

3(AppendRecovery)

Finalizedorrbw;replicalength(on‐diskorBR)shouldmatchpre‐appendlen;rbwreplicaGScouldbethesameornewer

Iffinalized,dothesameasabove;Ifrbw,waitforwritertoexit;openblockforwrite;setwritestreamoffsetattheendoffile.

SettonewGS

4(DataStreamingRecovery)

rbwreplica;thesameornewerGS;BAc<=BAi<=BRi<=BSc

Waitforwritertoexit;Openblockforwrite;setwritestreamoffsetattheendoffile.

SettonewGS

5(CloseRecovery)

rbworfinalized;thesameornewerGS;replicalengthshouldbethesameasBAc

Ifrbw,waitforwritertoexitandthenfinalizethereplica;Closepipelinewhenackissentback.

SettonewGS

4. Incases2,3,and4,onasuccessfulpipelinesetup,thedfsclientnotifies

NNofthenewGS,minlength,andtheDataNodesinthenewpipeline.NN

15

thenupdatestheunderconstructionblock’sgenerationstamp,lenandlocations.

5. Ifthepipelinesetupfails,ifatleastoneDataNoderemains,gotostep1witharecoveryflagset.Otherwise,sincethepipelinehasnomoreDataNode,markthispipelineasfailed.Ifauserapplicationisblockedinhflush/write,itwillbeunblockedandgetanEmptyPipelineException.Otherwisethenextwrite/hflush/closewillgetanEmptyPipelineExceptionimmediately.

8. ReportReplica/BlockState/MetaInformationChangetoNN

8.1. ClientReportsAclientinformsNNofanunderconstructionblock’smetainformationchangeorstatechange.Asdiscussedinthepipelinesetupsection,incases2(append),3(appendrecovery),and4(datastreamingrecovery),afteranewpipelineissetup,aclientreportsNNoftheblock’snewGSandtheDataNodesinthepipeline.NNthenupdatestheunderconstructionblock’sGS,length,andlocations.Notethatinthisdesignafterapipelineforcreatingablockissetup,aclientdoesnotreportNNofthenewlycreatedblockanditslocations.InsteadwhentheclientissuesaddBlock/appendtoaskforanewblock,NNputsthenewblockanditslocationsintotheblocksMapbeforeNNreturnstheblockandlocationsbacktoclient.Thisdesignhasaminorflaw.IfanewreaderhappenstoreadthelastblockbetweenthetimetheblockisaddedtoblocksMapatNameNodeandthetimeareplicaoftheblockiscreatedonaDataNode,thereadermaygeta“blockdoesnotexit”error.Butsincethechanceofhavingareaderduringthisshorttimeframeisveryslim,weconsciouslymakethisdesigndecisiontotradeforperformance.AclientdoesnotneedtosendanotificationtoNNafterapipelineissetupforeveryblockcreate.WhenaclientissuesaddBlockorclose(afile),NNwillfinalizethelastblock’sGSandlength.ThelastblockmaymovetotheCompletestateifthelastblockalreadyhasaGS/lenmatchedfinalizedreplica;otherwisethelastblockismovedtotheCommittedstate.Inaddition,ifthenumberofthereplicasofthelastblockislessthanitsreplicationfactor,NNexplicitlyreplicatestheblocktoreachitsreplicationfactor.

8.2. DataNodeReportsADataNodereportsareplica’smetainformationorstatechangebyperiodicallysendingNNablockreportorsendingNNablockReceivedmessagewhenarbwreplicaisfinalized.

16

8.3. BlockReports• Eachblockreportcontainstwolists:oneforfinalizedreplicas,onefor

rbw.Thefinalizedreplicaslistincludefinalizedreplicasandunderrecoveryreplicaswhoseoldstatearefinalized.Therbwreplicalistincludesrbwreplicas,rwrreplicas,andrurreplicaswhoseoldstatearenotfinalized.Anrbwreplica’slengthisitsbytesreceived(BR).Thelengthofanrwrisanegativenumber.

• Noweachreportedreplica’sstateisaquadruple<DataNode,blck_id,blck_GS,blck_len,isRbw>.

• AfterNNreceivesablockreport,compareitwithwhat’sinthememoryandgenerate3lists:(doweneedaupdateStateList?)

o deleteListifblck_idisnotvalid,i.e.noentryinblocksMaporbelongstonofile.

o addStoredBlockListifNNdoesnothavethereplica<DataNode,blck_id>buttheblockreporthasit.

o rmStoredBlockListifNNhasthereplica<DataNode,blck_id>buttheblockreportdoesnothaveit.

o updateStateListifthereplica’sstateinNNisrbwbuttheblockreportsaysitisfinalized.

ToavoidraceconditionbetweenclientreportsandDataNodereports,rbwreplicasareaddedtoonlydeleteListoraddStoredBlockList.

• Addanewreplica1. BlockinNNisComplete

• Ifthereportedreplicaisfinalized,o IfitsGSandlengtharedifferentfromNNrecorded

value,addittotheblocksMapbutmarkitascorrupt.o Otherwise,addthereplica.

• Ifthereportedreplicaisrbw,o Ifthefileisclosed,ifthereplica’sGS/lenisdifferent

fromNN–recordedvalueortheblockhasreacheditsreplicationfactor,instructitsDataNodetodeleteit.

o Otherwise,donothing.2. BlockinNNisCommitted

Thehandlingisverysimilartotheabovecaseexceptthatifthereportedreplicaisfinalizedandmatchestheblock’sGSandlength,NNchangestheblocktothecompletestate.

3. BlockinNNisUnderConstructionorUnderRecovery• Ifthereportedreplicaisfinalizedandthereplica’sGSisequal

toornewerthanNNrecordedGS,addthereplica.AlsomarkthereplicaasfinalizedandkeepstrackofitslengthandGS.

• Ifthereportedreplicaisrbwandthereplicaisvalid(GSnotolderandlengthnotshorter),addthereplica.Ifthereportedreplicaisrbw,markitasrbw;otherwise,markitasrwr.

• Otherwiseignoreit.

17

• Updateareplica’sstateWhenablockreportshowsareplicaischangedfromrbwtofinalized,iftheblockisunderconstruction,NNmarkstheNNstoredreplicaasfinalizedandkeepstrackofthefinalizedreplica’sGSandlength.IftheblockisCommitted,ifthefinalizedreplicamatchestheblock’sGSandlength,NNchangestheblocktotheCompletestate;otherwiseremovethisreplicafromNN.

8.4. blockReceivedDataNodessendblockReceivedtoNNtonotifythatareplicaisfinalized.WhenNNreceivesablockReceivednotification,iteitheraddanewreplicaif(DataNode,block_id)doesnotexistinNN,updatethereplica’sstateifitisrecordedatrbw,oraskaDataNodetodeletethereplicaiftheblockisinvalid.

9. Replica/BlockStateTransition9.1. ReplicaStateTransition

ThefollowingdiagramsummarizesallpossiblereplicastatetransitionsataDataNode.

• Anewreplicaiscreated

18

o Eitherbyaclient.Thenewreplicastartwithareplicabeingwritten(rbw)state.

o OruponaninstructionfromNNtoreplicateorcopyareplicaforthepurposeofbalancing.Thenewreplicaisintemporarystate.

• Anrbwreplicachangestobeareplicawaitingtoberecovered(rwr)whenitsDataNoderestarts.

• Areplicachangestobeaunderrecoveryreplicawhenareplicarecoverystartsinresponsetoleaseexpiration.

• Areplicaisfinalizedwhenaclientissuesaclose,replicarecoverysucceeds,orreplication/copysucceeds.

• Errorrecoveryalwayscausesareplica’sGStobebumped.

9.2. BlockStateTransitionThefollowingdiagramsummarizesallpossibleblockstatetransitionsattheNameNode.

• Ablockiscreated

o EitherwhenaclientissuesaddBlocktoaddanewblocktoafile.o Orwhenaclientissuesappendandthelastblockofthefileisfull.

Thenewlycreatedblockisablockunderconstruction.

Complete Block

Block Under Construction

Block Under Recovery

Committed Block

addBlock

NN restarts if last block of an unclosed

file

lease expires & block recovery

starts

NN restarts

receives a GS/Len matched finalized

replica

Init

addBlock

pipeline recovery succeeds (GS++)

NN restarts if not last block of an unclosed file

block recovery

fails

lease expires &

block recovery

starts(Recovery#

++)

lease expires

block recovery succeeds but no GS/len matched finalized replica

(GS++)

removed

zero length block

append pipeline is setup (GS++)

closelease expires and

force file close

lease expires

NN restarts if not last block of an unclosed file

append if last block is full

addBlock or close

Append or NN restarts

recovery succeeds

19

• AppendmayalsocauseaCompleteblocktobechangedtoablockUnderConstructionifthelastblockispartial.

• WhenaddBlockorcloseisissued,o thelastblockbecomeseitherCompleteiftheblockalreadyhasa

GS/lenmatchedfinalizedreplicaorCommittedotherwise.o addBlockwaitsuntilthepenultimateblocktobecomeComplete.o Afilewon’tbecloseduntilthelasttwoblocksofthefileareComplete.

• Whenleaseexpires,aleaserecoverychangesablockunderconstructiontobeablockunderrecovery.

o Ablockrecoverymaychangeablockunderrecoverytobe Removedifallitsreplicasareoflength0; Committediftherecoverysucceedsandtheblockhasno

GS/lenmatchedfinalizedreplica; CompleteiftherecoverysucceedsandtheblockhasaGS/len

matchedfinalizedreplica.o AleaserecoverymayforceaCommittedblocktobeComplete.

• Blockstatesdonotpersistondisk.WhenaNameNoderestarts,thelastblockofanunclosedfilebecomesunderconstructionandtherestbecomeComplete.

o NoteaCompleteorCommittedblockmaychangetobeanUnderConstructionblockafterNNrestartsifitisthelastblockofafile.Iftheclientisstillalive,theclientwillfinalizeitagain.Otherwisewhenleaseexpires,ablockrecoverywillfinalizeitagain.

• NotethatonceablockbecomesCommittedorComplete,allitsreplicasshouldhavethesameGSandarefinalized.WhenablockisUnderConstruction,itmayhavemultiplegenerationsoftheblockcoexistinginthecluster.