Cluster Computing Overview
CS241 Winter 01
© 1999-2001 Armando Fox
[email protected]
© 2001
Stanford
Today’s Outline
- Clustering: the Holy Grail
  - The Case For NOW
- Clustering and Internet Services
- Meeting the Cluster Challenges
- Cluster case studies
  - GLUnix
  - SNS/TACC
  - DDS?
Cluster Prehistory: Tandem NonStop
- Early (1974) foray into transparent fault tolerance through redundancy
  - Mirror everything (CPU, storage, power supplies…); can tolerate any single fault (later: processor duplexing)
  - “Hot standby” process-pair approach
- What’s the difference between high availability and fault tolerance?
- Noteworthy
  - “Shared nothing”--why?
  - Performance and efficiency costs?
- Later evolved into Tandem Himalaya, which used clustering for both higher performance and higher availability
Pre-NOW Clustering in the 90’s
- IBM Parallel Sysplex and DEC OpenVMS
  - Targeted at conservative (read: mainframe) customers
  - Shared disks allowed under both (why?)
  - All devices have cluster-wide names (shared everything?)
  - 1,500 installations of Sysplex; 25,000 of OpenVMS Cluster
- Programming the clusters
  - All System/390 and/or VAX VMS subsystems were rewritten to be cluster-aware
  - OpenVMS: cluster support exists even in the single-node OS!
  - An advantage of locking into a proprietary interface
Networks of Workstations: Holy Grail
- Use clusters of workstations instead of a supercomputer.
- The case for NOW:
  - Difficult for custom designs to track technology trends (e.g., uniprocessor performance increases at 50%/yr, but design cycles are 2-4 yrs)
  - No economy of scale in 100s => +$
  - Software incompatibility (OS & apps) => +$$$$
  - “Scale makes availability affordable” (Pfister)
  - “Systems of systems” can aggressively use off-the-shelf hardware and OS software
- New challenges (“the case against NOW”):
  - Performance and bug-tracking vs. a dedicated system
  - The underlying system is changing underneath you
  - The underlying system is poorly documented
Clusters: “Enhanced Standard Litany”
Benefits:
- Hardware redundancy
- Aggregate capacity
- Incremental scalability
- Absolute scalability
- Price/performance sweet spot
Challenges:
- Software engineering
- Partial failure management
- Incremental scalability
- System administration
- Heterogeneity
Clustering and Internet Services
- Aggregate capacity
  - TB of disk storage, THz of compute power (if we can harness it in parallel!)
- Redundancy
  - Partial failure behavior: only small fractional degradation from the loss of one node
  - Availability: industry average across “large” sites during the 1998 holiday season was 97.2% (source: CyberAtlas)
  - Compare: mission-critical systems have “four nines” (99.99%)
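To make those percentages concrete, here is a quick back-of-the-envelope conversion from availability to annual downtime (the function name is ours, not from the slide):

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(availability: float) -> float:
    """Annual downtime implied by a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR

# 97.2% availability (the 1998 industry average above)
print(round(downtime_hours(0.972), 1))        # ~245.3 hours/year down
# "four nines" (99.99%)
print(round(downtime_hours(0.9999) * 60, 1))  # ~52.6 minutes/year down
```

Roughly ten days of downtime a year versus under an hour: the gap between the industry average and “mission-critical” is about 280x.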
Clustering and Internet Workloads
- Internet vs. “traditional” workloads
  - e.g., database workloads (TPC benchmarks)
  - e.g., traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)
- Some characteristic differences
  - Read-mostly
  - Quality of service (best-effort vs. guarantees)
  - Task granularity
- “Embarrassingly parallel”… why?
  - HTTP is stateless, with short-lived requests
  - The Web’s architecture has already forced app designers to work around this! (not obvious in 1990)
Meeting the Cluster Challenges
- Software & programming models
- Partial failure and application semantics
- System administration
- Two case studies to contrast programming models
  - GLUnix goal: support “all” traditional Unix apps, providing a single system image
  - SNS/TACC goal: simple programming model for Internet services (caching, transformation, etc.), with good robustness and easy administration
Software ChallengesSoftware Challenges
What is the programming model for clusters?What is the programming model for clusters? Explicit message passing (e.g. Active Messages)Explicit message passing (e.g. Active Messages)
RPC (but remember the problems that make RPC hard)RPC (but remember the problems that make RPC hard)
Shared memory/network RAM (e.g. Yahoo! directory)Shared memory/network RAM (e.g. Yahoo! directory)
Traditional OOP with object migration (“network Traditional OOP with object migration (“network transparency”): not relevant for Internet workload?transparency”): not relevant for Internet workload?
Programming model should support decent failure Programming model should support decent failure semantics and exploit inherent modularity of clusterssemantics and exploit inherent modularity of clusters Traditional uniprocessor programming idioms/models don’t Traditional uniprocessor programming idioms/models don’t
seem to scale up to clustersseem to scale up to clusters
Question: Is there a “natural to use” cluster model that scales Question: Is there a “natural to use” cluster model that scales down to uniprocessors, at least for Internet-like workloads?down to uniprocessors, at least for Internet-like workloads?
Later in the quarter we’ll take a shot at thisLater in the quarter we’ll take a shot at this
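As a toy illustration of why the model matters, here is a hypothetical task-farm sketch; everything in it (run_on_node, submit, the 20% failure rate) is invented for the example. The point is that failure handling is explicit and ad hoc here, exactly the kind of thing a good cluster programming model should supply for free:

```python
import random

random.seed(1)  # seeded only to make the sketch reproducible

def run_on_node(node: int, task: int) -> int:
    """Simulated remote execution: a node may 'fail' mid-task."""
    if random.random() < 0.2:           # hypothetical 20% per-call failure rate
        raise ConnectionError(f"node {node} died")
    return task * task                  # the actual work

def submit(task: int, nodes: list[int]) -> int:
    """Retry the task on successive nodes until one succeeds."""
    for node in nodes:
        try:
            return run_on_node(node, task)
        except ConnectionError:
            continue                    # partial failure: try the next node
    raise RuntimeError("all nodes failed")

results = [submit(t, nodes=[0, 1, 2, 3]) for t in range(10)]
print(results)   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```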
Partial Failure Management
- What does partial failure mean for…
  - a transactional database?
  - a read-only database striped across cluster nodes?
  - a compute-intensive shared service?
- What are appropriate “partial failure abstractions”?
  - Incomplete/imprecise results?
  - Longer latency?
- What current programming idioms make partial failure hard?
  - Hint: remember the original RPC papers?
System Administration on a Cluster
Thanks to Eric Anderson (1998) for some of this material.
- Total cost of ownership (TCO) is very high for clusters due to administration costs
- Previous solutions
  - Pay someone to watch
  - Ignore, or wait for someone to complain
  - “Shell Scripts From Hell” (not general => vast repeated work)
- Need an extensible and scalable way to automate the gathering, analysis, and presentation of data
System Administration, cont’d.
Extensible Scalable Monitoring For Clusters of Computers (Anderson & Patterson, UC Berkeley)
- Relational tables allow the properties & queries of interest to evolve as the cluster evolves
- Extensive visualization support allows humans to make sense of masses of data
- Multiple levels of caching decouple data collection from aggregation
- Data updates can be “pulled” on demand or triggered by push
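The relational-table idea can be shown in a few lines. This is an illustrative stand-in using SQLite, not the actual system’s schema: because samples land in a generic table, new queries of interest can be written without touching the collectors.

```python
import sqlite3

# In-memory stand-in for the monitoring system's relational tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metrics (node TEXT, property TEXT, value REAL, ts INTEGER)")

# Collected samples (normally pushed by triggers or pulled on demand).
samples = [
    ("node1", "load",      0.9,  100), ("node2", "load",      3.7, 100),
    ("node1", "disk_free", 12.0, 100), ("node2", "disk_free", 0.4, 100),
]
db.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", samples)

# A query of interest, added after the fact, with no collector changes:
overloaded = db.execute(
    "SELECT node FROM metrics WHERE property = 'load' AND value > 2.0"
).fetchall()
print(overloaded)   # [('node2',)]
```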
Visualizing Data: Example
- Display aggregates of various interesting machine properties on the NOWs
- Note use of aggregation, color
Case Study: The Berkeley NOW
- History and pictures of an early research cluster
  - NOW-0: four HP-735s
  - NOW-1: 32 headless Sparc-10s and Sparc-20s
  - NOW-2: 100 UltraSparc 1s, Myrinet interconnect
  - inktomi.berkeley.edu: four Sparc-10s
  - www.hotbot.com: 160 Ultras, 200 CPUs total
  - NOW-3: eight 4-way SMPs
- Myrinet interconnection
  - In addition to commodity switched Ethernet
  - Originally Sparc SBus, now available on PCI bus
The Adventures of NOW: Applications
- AlphaSort: 8.41 GB in one minute, 95 UltraSparcs
  - (runner-up: Ordinal Systems nSort on an SGI Origin, 5 GB)
  - pre-1997 record: 1.6 GB on an SGI Challenge
- 40-bit DES key crack in 3.5 hours
  - “NOW+”: headless and some headed machines
- inktomi.berkeley.edu (now inktomi.com)
  - now the fastest search engine, with the largest aggregate capacity
- TranSend proxy & Top Gun Wingman Pilot browser
  - ~15,000 users, 3-10 machines
NOW: GLUnix
- Original goals:
  - High availability through redundancy
  - Load balancing, self-management
  - Binary compatibility
  - Both batch and parallel-job support
  - i.e., a single system image for NOW users
- Cluster abstractions == Unix abstractions
  - This is both good and bad… what’s missing compared to early-90’s proprietary cluster systems?
- For portability and rapid development, build on top of an off-the-shelf OS (Solaris)
GLUnix Architecture
- Master collects load, status, etc. info from the daemons
  - Repository of cluster state, centralized resource allocation
  - Pros/cons of this approach?
- Glib app library talks to the GLUnix master as an app proxy
  - Signal catching, process mgmt, I/O redirection, etc.
  - Death of a daemon is treated as a SIGKILL by the master

[Diagram: one GLUnix master per cluster, exchanging messages with a glud daemon on each NOW node]
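A minimal sketch of the master/daemon liveness protocol implied above, with an invented class name and timeout value (in GLUnix the same exchange also carries the load and status info, and the master reacts to a daemon’s death as if the processes on that node had received SIGKILL):

```python
HEARTBEAT_TIMEOUT = 5.0   # seconds; hypothetical value

class Master:
    """Toy centralized master: each glud daemon reports in periodically,
    and a daemon that goes silent is declared dead."""
    def __init__(self):
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now   # load/status would piggyback here

    def dead_nodes(self, now: float) -> list[str]:
        return [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

m = Master()
m.heartbeat("node-a", now=0.0)
m.heartbeat("node-b", now=0.0)
m.heartbeat("node-a", now=4.0)     # node-b has gone silent
print(m.dead_nodes(now=7.0))       # ['node-b']
```

The pros/cons question above falls out directly: the master has a complete, consistent view of the cluster, but it is also a single point of failure and a scaling bottleneck.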
GLUnix Retrospective
- Trends that changed the assumptions
  - SMPs have replaced MPPs, and are tougher to compete with for MPP workloads
  - Kernels have become extensible
- Final features vs. initial goals
  - Tools: glurun, glumake (2nd most popular use of NOW!), glups/glukill, glustat, glureserve
  - Remote execution--but not total transparency
  - Load balancing/distribution--but not transparent migration/failover
  - Redundancy for high availability--but not for the “GLUnix master” node
- Philosophy: did GLUnix ask the right question (for our purposes)?
TACC/SNS
- Specialized cluster runtime to host Web-like workloads
  - TACC: transformation, aggregation, caching, and customization--the elements of an Internet service
  - Build apps from composable modules, Unix-pipeline-style
- Goal: complete separation of *ility concerns from application logic
  - Legacy-code encapsulation, multiple-language support
  - Insulate programmers from nasty engineering
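The Unix-pipeline composition style can be sketched directly. The three stage functions here are hypothetical stand-ins for TACC’s transformation, aggregation, and customization modules; the point is that application logic is just a chain of data-in/data-out functions, while the runtime (not shown) owns distribution, restart, and load balancing:

```python
from functools import reduce

# Hypothetical TACC-style workers: each is a plain function from data to data.
def transform(doc: str) -> str:
    return doc.lower()                       # e.g. distill/compress content

def aggregate(doc: str) -> str:
    return doc + " [+related links]"         # e.g. merge in other sources

def customize(doc: str) -> str:
    return "<b>" + doc + "</b>"              # per-user presentation

def pipeline(*stages):
    """Compose stages left-to-right, Unix-pipeline-style."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)

service = pipeline(transform, aggregate, customize)
print(service("Hello World"))   # <b>hello world [+related links]</b>
```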
TACC Examples
- Simple search engine
  - Query the crawler’s DB
  - Cache recent searches
  - Customize UI/presentation
- Simple transformation proxy
  - On-the-fly lossy compression of inline images (GIF, JPG, etc.)
  - Cache original & transformed versions
  - User specifies aggressiveness, “refinement” UI, etc.

[Diagram: clients (C), transformation workers (T), caches ($), an aggregator (A), the crawler DB, and html output]
Cluster-Based TACC Server
- Component replication for scaling and availability
- High-bandwidth, low-latency interconnect
- Incremental scaling: commodity PCs

[Diagram: front ends (FE), caches ($), workers (W, T, A), a load balancing & fault tolerance process (LB/FT), a user-profile database, and a GUI administration interface, all attached to the interconnect; clients and a client-side cache (C, C$) connect from outside]
“Starfish” Availability: LB Death
- FE detects via broken pipe/timeout, restarts the LB

[Diagram: FEs, caches ($), and workers (T, A, W) on the interconnect; the dead LB/FT is replaced by a freshly restarted LB/FT]

- The new LB announces itself (multicast); contacted by the workers, it gradually rebuilds its load tables
- FEs operate using cached LB info during the failure
- If the partition heals, extra LBs commit suicide
SNS Availability Mechanisms
- Soft state everywhere
  - Multicast-based announce/listen to refresh the state
  - Idea stolen from multicast routing in the Internet!
- Process peers watch each other
  - Because there is no hard state, “recovery” == “restart”
  - Because of the multicast level of indirection, no location directory for resources is needed
- Load balancing, hot updates, and migration are “easy”
  - Shoot down a worker, and it will recover
  - Upgrade == install new software, shoot down the old
- Mostly graceful degradation
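The announce/listen soft-state idea can be sketched as a registry whose entries expire unless the owner keeps beaconing them; the class name and timeout are invented for illustration:

```python
REFRESH_TIMEOUT = 3.0   # entries expire unless re-announced; hypothetical value

class SoftStateRegistry:
    """Announce/listen registry: state is never deleted explicitly,
    it simply times out unless the owner keeps refreshing it."""
    def __init__(self):
        self.entries: dict[str, float] = {}   # worker -> last announce time

    def announce(self, worker: str, now: float) -> None:
        self.entries[worker] = now            # refresh == (re)create

    def alive(self, now: float) -> list[str]:
        # No tombstones, no location directory: recovery after a crash
        # is just "start announcing again".
        return [w for w, t in self.entries.items()
                if now - t <= REFRESH_TIMEOUT]

r = SoftStateRegistry()
r.announce("worker-1", now=0.0)
r.announce("worker-2", now=0.0)
r.announce("worker-1", now=2.0)    # worker-2 crashed and stopped beaconing
print(r.alive(now=4.0))            # ['worker-1']
```

Note how crash, shoot-down, and upgrade all collapse into the same case: an entry that is no longer refreshed.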
SNS Availability Mechanisms, cont’d.
- Orthogonal mechanisms
  - Composition without interfaces
  - Example: Scalable Reliable Multicast (SRM) group-state management with SNS
  - Eliminates the O(n²) complexity of composing modules
  - The state space of the failure mechanisms is easy to reason about
  - What’s the cost?
- More on orthogonal mechanisms later
Administering SNS
- Multicast means the monitor can run anywhere on the cluster
- Extensible via self-describing data structures and mobile code in Tcl
Clusters Summary
- Many approaches to clustering, software transparency, failure semantics
  - An end-to-end problem that is often application-specific
  - We’ll see this again at the application level in the harvest vs. yield discussion
- Internet workloads are a particularly good match for clusters
  - What software support is needed to mate these two things?
  - What new abstractions do we want for writing failure-tolerant applications in light of these techniques?