© Cloudera, Inc. All rights reserved.
Open problems in distributed storage and resource management
Karthik Kambatla |[email protected]|[email protected]
Next-Generation Apache Hadoop
???
Cloudera perspective
• Hadoop software stack is relatively mature
  • Seen broad uptake in many industries
  • Wider variety of workloads
  • Larger and larger amounts of data
• New datacenter hardware trends on the horizon
• Good time to revisit original design assumptions
  • Collaborate with academics on these problems
Scalability
[Diagram: one Master and four Workers]
• Master: global view of cluster, coordinates workers
• Workers: simple, do work, scale out
[Diagram: one NameNode (NN) and four DataNodes (DN)]
• NN: file metadata, file -> block mapping (bottleneck)
• DN: store blocks, serve data reads/writes
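The split of responsibilities above can be sketched as a toy model. This is only an illustration under stated assumptions, not HDFS code: the class and function names, the 4-byte block size, and the round-robin placement are all hypothetical, and replication is left out.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes; tiny on purpose (an assumption for the demo)

@dataclass
class DataNode:
    blocks: dict = field(default_factory=dict)  # block id -> bytes

@dataclass
class NameNode:
    # file path -> ordered list of (block id, DataNode index)
    block_map: dict = field(default_factory=dict)
    next_block_id: int = 0

def write_file(nn, datanodes, path, data):
    """Split data into blocks, store them round-robin, record the mapping."""
    locations = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = nn.next_block_id
        nn.next_block_id += 1
        dn_index = block_id % len(datanodes)
        datanodes[dn_index].blocks[block_id] = data[offset:offset + BLOCK_SIZE]
        locations.append((block_id, dn_index))
    nn.block_map[path] = locations  # every file creation touches the one master

def read_file(nn, datanodes, path):
    """Look up block locations on the master, then fetch bytes from workers."""
    return b"".join(datanodes[dn].blocks[bid] for bid, dn in nn.block_map[path])
```

The data path scales with the number of DataNodes, while every metadata operation goes through the single NameNode object, which is why the slide marks it as the bottleneck.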
Vertically scaling HDFS
Project                                        Improvement       Cost (months)
Multiple volumes per NN                        Operational       6
Split namespace and block management locking   2x RPC            12
Fine-grained locking of namespace              2x RPC            6
Pageable namespace                             2x object count   6
Persistent block space                         Operational       6
Block management as a service                  2x object count   12+
Volume migration                               Operational       12
Annotations on successive builds of the table above: "Scary changes", "Incremental", "Years of work".
Hardware trends on the horizon
                       2006   2016   2021
HDD capacity (TB)      0.2    2      20
HDD speed (MB/s)       90     110    140
Network speed (Gb/s)   0.1    10     40
• Fewer IOPS/GB
• HDD locality irrelevant
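A short worked calculation makes the two callouts concrete. It uses only the figures in the table; the function names are hypothetical, and sequential bandwidth per unit of capacity is used here as a rough proxy for IOPS/GB, which is an assumption.

```python
# Figures copied from the table: year -> (HDD TB, HDD MB/s, network Gb/s)
TRENDS = {2006: (0.2, 90, 0.1), 2016: (2, 110, 10), 2021: (20, 140, 40)}

def full_scan_hours(capacity_tb, speed_mb_s):
    """Hours to read one whole disk sequentially (1 TB = 1e6 MB)."""
    return capacity_tb * 1e6 / speed_mb_s / 3600

def network_to_disk_ratio(network_gb_s, speed_mb_s):
    """Network bandwidth divided by one disk's bandwidth (1 Gb/s = 125 MB/s)."""
    return network_gb_s * 125 / speed_mb_s

for year, (tb, mb_s, gb_s) in TRENDS.items():
    print(year,
          round(full_scan_hours(tb, mb_s), 1),
          round(network_to_disk_ratio(gb_s, mb_s), 1))
```

Reading a full disk grows from well under an hour to roughly 40 hours, so each stored gigabyte gets far less disk bandwidth. Over the same period the network goes from a fraction of one disk's bandwidth to dozens of times it, which is the sense in which reading from a remote disk costs little more than reading from a local one.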
A fresh look
• Designed for analytic workloads
• Scales horizontally (exabyte scale)
• Operationally robust
• Designed for future hardware trends
Blob store
• Users think in datasets, not directories and files
• Spectrum of blob store vs. filesystem functionality
• What is the equivalent of the POSIX API for a scalable storage system?
• What set of operations is required?
• What are their semantics?
• What can and cannot be supported scalably?
Other considerations
• Erasure coding
  • Required to be cost competitive
• Multi-datacenter replication
  • Important for business-critical analytics
• 3D XPoint
  • New addition to storage hierarchy
  • Could change how we write software and think about persistence
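The cost argument for erasure coding comes down to raw bytes stored per logical byte. The sketch below compares 3-way replication with a Reed-Solomon layout of 6 data and 3 parity units; both parameter choices and all function names are assumptions chosen for illustration, not figures from the slides.

```python
def replication_overhead(replicas):
    """Raw bytes stored per logical byte with n-way replication."""
    return float(replicas)

def erasure_coding_overhead(data_units, parity_units):
    """Raw bytes stored per logical byte with a (data, parity) striped code."""
    return (data_units + parity_units) / data_units

def raw_petabytes(logical_pb, overhead):
    """Raw capacity needed to hold logical_pb of user data."""
    return logical_pb * overhead

# One exabyte (1000 PB) of user data under each scheme.
replicated = raw_petabytes(1000, replication_overhead(3))
encoded = raw_petabytes(1000, erasure_coding_overhead(6, 3))
```

Under these assumptions the coded layout needs half the raw capacity of replication, while tolerating the loss of any three units in a stripe compared with two lost copies under 3-way replication.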
[Diagram: one ResourceManager (RM) and four NodeManagers (NM)]
• RM: queue of tasks, available resources
• NM: run tasks
One cluster to rule them all
• Exabyte-scale storage means exabyte-scale processing
• Current: 10,000-node YARN clusters
• Goal: 1,000,000 nodes
• One cluster for all compute at an internet-scale company
• Think Microsoft or Twitter
YARN Federation
[Diagram: Client, Admin, Router, and Policy (<Tenant, Sub-cluster>) alongside two sub-clusters, each with one ResourceManager and four NodeManagers]
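One way to read the federation diagram is that a router consults a policy of <Tenant, Sub-cluster> pairs to decide where a submission goes. The sketch below is a toy under that reading: the class, method, and sub-cluster names are hypothetical and are not the YARN API, and the roles given to Admin and Client are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Router:
    """Toy stand-in for the Router box in the diagram (not the YARN API)."""
    policy: dict = field(default_factory=dict)   # tenant -> sub-cluster
    default_subcluster: str = "subcluster-1"     # hypothetical name

    def set_policy(self, tenant, subcluster):
        """Assumed Admin action: record a <Tenant, Sub-cluster> pair."""
        self.policy[tenant] = subcluster

    def route(self, tenant):
        """Assumed Client path: pick the sub-cluster for this tenant."""
        return self.policy.get(tenant, self.default_subcluster)

router = Router()
router.set_policy("andrew", "subcluster-2")
```

Because routing is a single dictionary lookup, the router itself stays cheap; the hard part is choosing the policy, which is where fairness across sub-clusters comes in.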
Fair-Sharing and Federation
[Diagram: two sub-clusters, each with one ResourceManager and four NodeManagers; tenants Andrew and Karthik; values 50, 50, 50, 50]
Fair-Sharing and Federation
[Diagram: two sub-clusters, each with one ResourceManager and four NodeManagers; tenants Andrew and Karthik; values 99, 1, 1, 99]
• Feedback loop
• Per-cluster weights
Fair-Sharing and Federation
[Diagram: two sub-clusters, each with one ResourceManager and four NodeManagers; values 99, 1, 1, 99]
Scheduling
Variety of workloads
Workload                Duration       Scheduling latency   “Tasks”     Tenant scale     Placement quality
Batch processing        Mins - hours   Seconds              < 400,000   Jobs (10Ks)      Low
Interactive SQL         Seconds        Milliseconds         100s        Users (100s)     Medium
Stream processing       Months         Minutes              10s         Jobs (10s)       High
Long-running services   Months         Minutes              # Nodes     Services (10s)   High
Scheduling latency
Low latency scheduling for distributed systems
• State of the art
  • Low-latency scheduling: Sparrow
  • Second-level scheduler that needs pre-allocated resources
• Operational
  • Static partitioning: set aside resources
  • Semi-static: maintain a per-user cache of resources
  • Downside: low utilization
Can we design scalable algorithms for low-latency scheduling?
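Sparrow's published design builds on the "power of two choices" idea: probe a few random workers and queue the task on the least loaded one, instead of consulting a global view. The sketch below shows only that basic probing step; it omits Sparrow's batch sampling and late binding, and the function and parameter names are hypothetical.

```python
import random

def place_task(queue_lengths, probes=2, rng=random):
    """Probe a few random workers and enqueue on the shortest queue.

    Returns the index of the chosen worker and updates queue_lengths in place.
    """
    candidates = rng.sample(range(len(queue_lengths)), probes)
    chosen = min(candidates, key=lambda w: queue_lengths[w])
    queue_lengths[chosen] += 1
    return chosen

# Example: 100 tasks spread over 10 initially idle workers.
queues = [0] * 10
for _ in range(100):
    place_task(queues)
```

Each decision costs a constant number of probes regardless of cluster size, which is what makes the approach attractive for millisecond-scale scheduling. It places tasks on workers directly, though, so it still presumes the resources on those workers have already been set aside.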
Scheduling latency
Workload                Duration       Scheduling latency   “Tasks”     Tenant scale     Placement quality
Batch processing        Mins - hours   Seconds              < 400,000   Jobs (10Ks)      Low
Interactive SQL         Seconds        Minutes              100s        Users (100s)     Medium
Stream processing       Months         Minutes              10s         Jobs (10s)       High
Long-running services   Months         Minutes              # Nodes     Services (10s)   High
Jobs vs Services
Workload   Duration       Scheduling latency   “Tasks”     Tenant scale   Placement quality
Jobs       Mins - hours   Seconds              < 400,000   ~ 10,000       Low
Services   Seconds        Minutes              # Nodes     < 100          High
Scalability - Tenants
Scalability – Tenants vs Nodes
• Scheduling is allocating resources for tenants on cluster nodes
• Matching/join between two sets
• Scheduling latency = |Tenants| x |Nodes|
Can we lower the bound on scheduling latency?
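The |Tenants| x |Nodes| cost can be seen in a naive scheduler that scans every node for every pending request. This is a minimal sketch, assuming one request per tenant and a single resource dimension; the best-fit rule and all names are hypothetical.

```python
def naive_schedule(tenant_demands, node_free):
    """Place one request per tenant by scanning all nodes.

    tenant_demands: tenant -> requested resource units.
    node_free: node -> free resource units (mutated as requests are placed).
    Returns (placements, number of tenant/node pairs examined).
    """
    placements, examined = {}, 0
    for tenant, demand in tenant_demands.items():
        best = None
        for node, free in node_free.items():
            examined += 1
            # Best fit: the feasible node with the least free capacity.
            if free >= demand and (best is None or free < node_free[best]):
                best = node
        if best is not None:
            node_free[best] -= demand
            placements[tenant] = best
    return placements, examined

placements, examined = naive_schedule(
    {"a": 2, "b": 4, "c": 1},
    {"n1": 4, "n2": 2, "n3": 8, "n4": 1},
)
```

With 3 tenants and 4 nodes the loop examines 12 pairs. At roughly 10,000 tenants and 1,000,000 nodes the same scan would examine on the order of 10^10 pairs per scheduling pass, which is why the slide asks whether the bound can be lowered.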
Quality of placement
Placement requirements
[Diagram: requirements arranged on a spectrum from SOFT to HARD]
• Data locality
• Software, e.g. database
• Hardware, e.g. GPU
• Intra-job affinity
• Inter-job affinity
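A common way to handle both kinds of requirement is to filter nodes on the hard ones and rank the survivors by how many soft ones they satisfy. The sketch below assumes requirements can be expressed as attribute labels on nodes; the function name, the labels, and the choice to treat GPU as hard and data locality as soft are illustrative assumptions.

```python
def pick_node(nodes, hard, soft):
    """Filter on hard requirements, then rank by soft preferences met.

    nodes: node name -> set of attribute labels.
    hard: labels a node must have; soft: labels that are nice to have.
    Returns the best node name, or None if no node meets the hard set.
    """
    feasible = [n for n, labels in nodes.items() if hard <= labels]
    if not feasible:
        return None
    return max(feasible, key=lambda n: len(soft & nodes[n]))

nodes = {
    "n1": {"gpu"},
    "n2": {"gpu", "has-input-block"},
    "n3": {"has-input-block"},
}
choice = pick_node(nodes, hard={"gpu"}, soft={"has-input-block"})
```

Affinity constraints between tasks or jobs do not fit this per-node labelling directly, since they depend on where other tasks have already been placed.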
Multi-tenancy and scalability
[Chart: axes "Tenants, scheduling throughput" and "Nodes, placement quality"; labeled points: Service Scheduler, Job Scheduler, Unified Scheduler]
Utilization
Production clusters
                      CPU utilization %   Memory utilization %
MapReduce v1          < 20 [1]            < 20 [1]
YARN / MapReduce v2   50 [1]              30 [2]
[1] Apache YARN at SOCC '13
[2] Anecdotal from the community
Potential for improvement
• A task's resource usage varies over time
• Resource usage varies across tasks of the same job
[Chart: Mean, Peak, and Allotted resources for Terasort and Wordcount; axis from 0 to 1200]
Over-subscribing nodes
• Allocate unused resources to pending tasks
• Challenges
  • Handle sudden spikes in resource usage gracefully
  • Performance of tasks cannot deteriorate
  • Contention on non-isolated resources
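The idea of lending out allocated-but-unused resources can be sketched as a per-node calculation. This is a minimal sketch under stated assumptions: a single resource dimension, a hypothetical function name, and an arbitrary 20% headroom reserved to absorb usage spikes.

```python
def opportunistic_capacity(capacity, allocated, used, headroom=0.2):
    """Resources a node could lend to pending tasks beyond its allocations.

    capacity: total units on the node; allocated: units promised to tasks;
    used: units those tasks actually consume right now.
    headroom: fraction of capacity kept free to absorb usage spikes.
    """
    usable = capacity * (1 - headroom)
    already_free = capacity - allocated  # can be scheduled the normal way
    return max(0.0, usable - used - already_free)

# Fully allocated node whose tasks currently use 40 of 100 units.
extra = opportunistic_capacity(capacity=100, allocated=100, used=40)
```

The headroom only addresses the first challenge in the list. Protecting the performance of the original tasks also needs a way to throttle or preempt the opportunistic ones, and nothing here accounts for contention on resources that are not isolated.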
Conclusion
Apache Hadoop is mature and very widely deployed.
The underlying assumptions are 10 years old and need revisiting.
Lots of interesting and hard research problems in the space.
Thank you
Open Problems
• Storage scalability
• Blob store API for analytic workloads
• Global fairness in a federated YARN cluster
• Low-latency scheduling
• Jobs and services on the same cluster
• Scheduler scalability in tenants and nodes
• Improving quality of placement with a latency upper bound
• Cluster utilization improvements
• I/O scheduling for predictability and QoS
Greedy placement is not optimal
[Diagram: four NodeManagers with GPU and CPU resource labels; tenants Andrew and Karthik]
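A small example shows how greedy, arrival-order placement can strand a request. The scenario is an assumption, since the diagram's details did not survive extraction: one GPU node and two CPU-only nodes with one slot each, Andrew's task needing only CPU, and Karthik's needing a GPU. All names are hypothetical.

```python
def greedy_place(tasks, nodes):
    """Place tasks in arrival order on the first node that satisfies them.

    tasks: list of (owner, needs_gpu); nodes: list of dicts with
    'gpu' (bool) and 'free' (task slots left). Returns owner -> node index.
    """
    placed = {}
    for owner, needs_gpu in tasks:
        for i, node in enumerate(nodes):
            if node["free"] > 0 and (node["gpu"] or not needs_gpu):
                node["free"] -= 1
                placed[owner] = i
                break
    return placed

def constrained_first(tasks):
    """Reorder so tasks with hard requirements are placed first."""
    return sorted(tasks, key=lambda t: not t[1])

def fresh_nodes():
    return [{"gpu": True, "free": 1},
            {"gpu": False, "free": 1},
            {"gpu": False, "free": 1}]

tasks = [("andrew", False), ("karthik", True)]
greedy = greedy_place(tasks, fresh_nodes())
reordered = greedy_place(constrained_first(tasks), fresh_nodes())
```

In arrival order, Andrew's task takes the only GPU slot and Karthik's cannot be placed, although a placement satisfying both exists. Placing constrained tasks first fixes this instance, but it is itself only a heuristic and is not optimal in general.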
Multi-tenancy
[Diagram: MapReduce, Spark, Impala, and HBase on top of HDFS]