Scalable Cluster Management: Frameworks, Tools, and Systems
David A. Evensky, Ann C. Gentile, Pete Wyckoff, Robert C. Armstrong, Robert L. Clay, Ron Brightwell
Sandia National Laboratories

Page 1: Scalable Cluster Management: Frameworks, Tools, and Systems

Scalable Cluster Management: Frameworks, Tools, and Systems

David A. Evensky, Ann C. Gentile, Pete Wyckoff, Robert C. Armstrong, Robert L. Clay, Ron Brightwell

Sandia National Laboratories

Page 2: Scalable Cluster Management: Frameworks, Tools, and Systems

Lilith: a tool framework for very large clusters

• Most current tools for clusters are designed as monolithic programs, to do one task well.

• If you need a new task, you need a new tool.

• The Lilith framework allows users to easily construct new tools using a component framework.

Page 3: Scalable Cluster Management: Frameworks, Tools, and Systems

Control of large distributed systems

• System administration

• Auditing & job control by users

• Interrogation of processes

• Simple applications

1 sec program on 1000 nodes: 16 min 10 sec

Page 4: Scalable Cluster Management: Frameworks, Tools, and Systems

Lilith: Scalable component framework

[Diagram labels: Client, Data Distribution, Execution, Result Collection]

• Lilith spans a tree of machines executing user-defined code.

• User code (Lilim/Lilly) provides component functionality on a single node

• Provides scalable distribution, result collection

Page 5: Scalable Cluster Management: Frameworks, Tools, and Systems

Component Methods

• MO[] distributeOnTree(MO, int[])
  – data distribution down the tree

• MO onTree(MO)
  – component action on the node

• MO collateOnTree(MO[])
  – result collection and condensation
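The three methods can be illustrated with a small in-process sketch. This is not Lilith's implementation: the slides do not define MO or say what the int[] argument holds, so here MO is a stand-in payload carrying one number and int[] is assumed to list child node ids. Node, CountNodes, binaryTree, and countNodes are hypothetical names, and the tree is walked sequentially in one JVM rather than across machines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LilithTreeSketch {
    /** Stand-in for the slide's MO type; here it just carries a number. */
    static final class MO {
        final long value;
        MO(long value) { this.value = value; }
    }

    /** The three component methods, with the signatures shown on the slide. */
    interface Component {
        MO[] distributeOnTree(MO input, int[] childIds); // split work for the children
        MO onTree(MO input);                             // act on this node
        MO collateOnTree(MO[] results);                  // condense the results
    }

    /** Hypothetical in-process node; a real node would be a remote host. */
    static final class Node {
        final int id;
        final List<Node> children = new ArrayList<>();
        Node(int id) { this.id = id; }

        MO run(Component c, MO input) {
            int[] ids = children.stream().mapToInt(n -> n.id).toArray();
            MO[] parts = c.distributeOnTree(input, ids);
            MO[] results = new MO[children.size() + 1];
            results[0] = c.onTree(input);                // local action
            for (int i = 0; i < children.size(); i++) {  // recurse into each subtree
                results[i + 1] = children.get(i).run(c, parts[i]);
            }
            return c.collateOnTree(results);
        }
    }

    /** Example component: every node reports 1 and parents sum the reports. */
    static final class CountNodes implements Component {
        public MO[] distributeOnTree(MO input, int[] childIds) {
            MO[] parts = new MO[childIds.length];
            Arrays.fill(parts, input);                   // same request to every child
            return parts;
        }
        public MO onTree(MO input) { return new MO(1); }
        public MO collateOnTree(MO[] results) {
            long sum = 0;
            for (MO r : results) sum += r.value;
            return new MO(sum);
        }
    }

    /** Builds a complete binary tree with the given number of levels. */
    static Node binaryTree(int levels, int[] nextId) {
        Node n = new Node(nextId[0]++);
        if (levels > 1) {
            n.children.add(binaryTree(levels - 1, nextId));
            n.children.add(binaryTree(levels - 1, nextId));
        }
        return n;
    }

    static long countNodes(int levels) {
        Node root = binaryTree(levels, new int[] {0});
        return root.run(new CountNodes(), new MO(0)).value;
    }

    public static void main(String[] args) {
        System.out.println(countNodes(3)); // prints 7: a 3-level binary tree has 1 + 2 + 4 nodes
    }
}
```

Because each parent condenses its children's results before passing them up, the root only ever handles one result per direct child, which is the property that lets result collection scale with tree depth rather than node count.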

Page 6: Scalable Cluster Management: Frameworks, Tools, and Systems

Security

Uses purely Java 2 mechanisms at this time.

User sends credential with call

LilithHost creates ProtectionDomain from user credential

LilithHost calls checkPermission

[Diagram labels: LilithHost, Policy, Keys, Method invocation]

Sandbox is set up similarly, using the user credential and policy file
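The credential-to-permission flow can be sketched with standard java.security classes. This is an illustration under assumptions, not LilithHost's code: ToolPermission, the POLICY table, the credential strings, and the permission names are all hypothetical, and ProtectionDomain.implies is used here as a stand-in for the checkPermission call named on the slide.

```java
import java.security.BasicPermission;
import java.security.CodeSource;
import java.security.Permissions;
import java.security.ProtectionDomain;
import java.util.Map;

public class CredentialCheckSketch {
    /** Hypothetical permission type naming a tool method a user may invoke. */
    static final class ToolPermission extends BasicPermission {
        ToolPermission(String name) { super(name); }
    }

    /** Hypothetical policy table: credential -> granted permission names. */
    static final Map<String, String[]> POLICY = Map.of(
        "admin-key", new String[] {"tool.*"},
        "user-key",  new String[] {"tool.ps"});

    /** Builds a ProtectionDomain holding only the permissions granted to the credential. */
    static ProtectionDomain domainFor(String credential) {
        Permissions perms = new Permissions();
        for (String name : POLICY.getOrDefault(credential, new String[0])) {
            perms.add(new ToolPermission(name));
        }
        // Two-argument constructor: the domain uses exactly these static permissions.
        return new ProtectionDomain((CodeSource) null, perms);
    }

    /** Stand-in for the host's permission check before a method invocation. */
    static boolean mayInvoke(String credential, String method) {
        return domainFor(credential).implies(new ToolPermission(method));
    }

    public static void main(String[] args) {
        System.out.println(mayInvoke("user-key", "tool.ps"));      // true
        System.out.println(mayInvoke("user-key", "tool.reboot"));  // false
        System.out.println(mayInvoke("admin-key", "tool.reboot")); // true, via the "tool.*" wildcard
    }
}
```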

Page 7: Scalable Cluster Management: Frameworks, Tools, and Systems

Prototypical tools

System monitoring tool to track the state of a cluster of machines

PS-tool to get sortable process information from selected nodes of the cluster.

Page 8: Scalable Cluster Management: Frameworks, Tools, and Systems

Lilith Lights tool

• Snake toy app
  – demo that draws a snake over front panel
  – no global repository for state; all info distributed
  – Snake’s movement was limited to left half of machine
    • program error in declaration of drand48() biased results

Page 9: Scalable Cluster Management: Frameworks, Tools, and Systems

Who serves who?

• Programmers adapt to:
  – The OS that runs on the machine
  – The system configuration chosen by the admins
  – Changing system environments

• economically driven to heterogeneous distributed computing

• Why can’t the user dictate the software environment as a resource request?

Page 10: Scalable Cluster Management: Frameworks, Tools, and Systems

DASE

• Dynamically Adaptive Software Environment

• Provide multi-OS/multi-environment capability

• Manage multiple SW environments

• “save” user environment for reuse later

• Integration with SW component architectures
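One way to picture the software environment as part of a resource request is the sketch below. It is purely illustrative and does not reflect DASE's actual interfaces: ResourceRequest, Partition, match, and the partition names are hypothetical, and the only environment attribute modeled is the OS.

```java
import java.util.List;
import java.util.Optional;

public class EnvRequestSketch {
    /** Hypothetical request: how many nodes, and which OS they should run. */
    static final class ResourceRequest {
        final int nodes;
        final String os;
        ResourceRequest(int nodes, String os) { this.nodes = nodes; this.os = os; }
    }

    /** Hypothetical group of nodes currently set up with one OS. */
    static final class Partition {
        final String name;
        final int freeNodes;
        final String os;
        Partition(String name, int freeNodes, String os) {
            this.name = name; this.freeNodes = freeNodes; this.os = os;
        }
    }

    /** Returns the name of the first partition with the requested OS and enough free nodes. */
    static Optional<String> match(ResourceRequest req, List<Partition> partitions) {
        return partitions.stream()
            .filter(p -> p.os.equals(req.os))
            .filter(p -> p.freeNodes >= req.nodes)
            .map(p -> p.name)
            .findFirst();
    }

    /** Made-up sample data for the example. */
    static List<Partition> samplePartitions() {
        return List.of(
            new Partition("part-a", 16, "Linux"),
            new Partition("part-b", 8, "NT"));
    }

    public static void main(String[] args) {
        System.out.println(match(new ResourceRequest(8, "NT"), samplePartitions()));     // Optional[part-b]
        System.out.println(match(new ResourceRequest(32, "Linux"), samplePartitions())); // Optional.empty
    }
}
```

In this sketch an unmatched request simply returns empty; a system with the multi-environment capability described above could instead treat it as a trigger to set up nodes with the requested environment.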

Page 11: Scalable Cluster Management: Frameworks, Tools, and Systems

DASE Service Object Model

[Diagram labels: Physical system; Logical partitioning; “system” model; App Object (resource spec, data/map objects); Partitioner; Solver; Visualizer; Mesher; Scheduler; Resource Request]

Page 12: Scalable Cluster Management: Frameworks, Tools, and Systems

Flexible Resource Management

[Diagram labels: DASE Client; DASE Session Manager; App Environment Specification; Scheduler/Resource Management; Virtual Machine; Application Environment; Hierarchical Net Booting; RM/VM (three instances); TFlops; PRE; HPVM; Custom; Linux; NT; Components; Frameworks; control; information]

Page 13: Scalable Cluster Management: Frameworks, Tools, and Systems

Scalable Unit

[Diagram: two groups of nodes, each with a 100BaseT hub, a 16-port Myrinet switch, 7 compute nodes, 1 service node, a terminal server, and a power controller. Other labels: sss0; 8 Myrinet LAN cables; to system support network. Legend: power, serial, Ethernet, Myrinet.]

Page 14: Scalable Cluster Management: Frameworks, Tools, and Systems

System Support Hierarchy

[Diagram: sss1 (admin access; master copy of system software) above three scalable units. Each scalable unit has an sss0 holding the in-use copy of system software, and nodes that NFS-mount root from sss0.]

Page 15: Scalable Cluster Management: Frameworks, Tools, and Systems

Hardware Management

• Discovery and Control
  – Perl scripts that
    • control individual devices (power controller, terminal server, machine, switch)
    • build a database of configuration info (MAC and IP addresses, serial numbers, etc.)

• Roles
  – database is augmented with each component’s role in the system (compute, sss0, terminal server, etc.)
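The two-stage idea, discovery first and role assignment second, can be sketched as a tiny in-memory table. The original tooling is Perl scripts whose schema is not shown in the slides; the class, field, and method names below are hypothetical, and the addresses and serial numbers are made-up documentation values.

```java
import java.util.ArrayList;
import java.util.List;

public class HardwareDbSketch {
    /** One discovered device; field names are illustrative only. */
    static final class Device {
        final String name, mac, ip, serial;
        String role = "unassigned"; // filled in after discovery
        Device(String name, String mac, String ip, String serial) {
            this.name = name; this.mac = mac; this.ip = ip; this.serial = serial;
        }
    }

    private final List<Device> devices = new ArrayList<>();

    /** Discovery step: record what was found about a device. */
    void record(String name, String mac, String ip, String serial) {
        devices.add(new Device(name, mac, ip, serial));
    }

    /** Role step: augment an existing entry; returns false if the device is unknown. */
    boolean assignRole(String name, String role) {
        for (Device d : devices) {
            if (d.name.equals(name)) { d.role = role; return true; }
        }
        return false;
    }

    /** Lists device names with a given role, in discovery order. */
    List<String> namesWithRole(String role) {
        List<String> out = new ArrayList<>();
        for (Device d : devices) {
            if (d.role.equals(role)) out.add(d.name);
        }
        return out;
    }

    /** Made-up sample data for the example. */
    static HardwareDbSketch sample() {
        HardwareDbSketch db = new HardwareDbSketch();
        db.record("n0", "00:00:5E:00:53:01", "192.0.2.1", "SN-0001");
        db.record("n1", "00:00:5E:00:53:02", "192.0.2.2", "SN-0002");
        db.record("s0", "00:00:5E:00:53:03", "192.0.2.3", "SN-0003");
        db.assignRole("n0", "compute");
        db.assignRole("n1", "compute");
        db.assignRole("s0", "sss0");
        return db;
    }

    public static void main(String[] args) {
        System.out.println(sample().namesWithRole("compute")); // [n0, n1]
    }
}
```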

Page 16: Scalable Cluster Management: Frameworks, Tools, and Systems

“Virtual Machines”

• Allows arbitrary grouping of scalable units that use the same system software

• Operations to update system software and boot nodes, scalable units, or machines

• Update system software on an SU in 1 min.

• Update system software on 24 SUs in 1.5 min.

• Boot an SU in 5 min. (staged for power drain)

• Boot 24 SUs in 10 min.
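The timings above can be compared with a hypothetical one-SU-at-a-time baseline. The slides do not report such a baseline; the sketch assumes it would cost the single-SU time multiplied by the number of SUs, and the speedup helper is an illustrative name.

```java
public class UpdateTimingSketch {
    /** Ratio of an assumed one-at-a-time total to the measured total. */
    static double speedup(double perUnitMinutes, int units, double measuredMinutes) {
        return perUnitMinutes * units / measuredMinutes;
    }

    public static void main(String[] args) {
        // Update: 24 SUs one at a time would take 24 min; measured 1.5 min.
        System.out.println(speedup(1.0, 24, 1.5));  // 16.0
        // Boot: 24 SUs one at a time would take 120 min; measured 10 min.
        System.out.println(speedup(5.0, 24, 10.0)); // 12.0
    }
}
```

Under that assumption, going from 1 SU to 24 SUs adds only 0.5 min to an update and 5 min to a boot.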

Page 17: Scalable Cluster Management: Frameworks, Tools, and Systems

“Virtual Machines”

[Diagram: sss1 uses rdist to push system software down to the sss0 of each of three scalable units. Each sss0 holds the in-use copy of system software, and nodes NFS-mount root from sss0. Other labels: Linux 2.3; Beta; Alpha; Production; SU configuration database.]

Page 18: Scalable Cluster Management: Frameworks, Tools, and Systems

http://dancer.ca.sandia.gov

http://www.cplant.ca.sandia.gov

http://www.cs.sandia.gov/cplant