Atlas: An Infrastructure for Global Computing

Page 1: Atlas: An Infrastructure for Global Computing

Atlas: An Infrastructure for Global Computing

Page 2: Atlas: An Infrastructure for Global Computing

People

Eric Baldeschwieler (UC Berkeley)

Bobby Blumofe (UT Austin)

Eric Brewer (UC Berkeley)

Page 3: Atlas: An Infrastructure for Global Computing

Outline

Introduction
Programming model
Architecture
Examples
Discussion
Limitations & Conclusion

Page 4: Atlas: An Infrastructure for Global Computing

Introduction

Properties of an Internet computing infrastructure

Scalability: to 10^6 nodes
Heterogeneity: of machines & OSs
Fault tolerance: completion probability comparable to that of a sequential program
Adaptive parallelism: dynamic set of resources

Page 5: Atlas: An Infrastructure for Global Computing

Properties ...

Safety: Hosts must be secure
Anonymity: Secure privacy of client: data & program
Hierarchy: Locality of communication (local bandwidth typically is higher)
Ease of use: Minimize “costs” of participating.
Reasonable performance: Low overhead; benefit from a small set of machines.

Page 6: Atlas: An Infrastructure for Global Computing

Introduction ...

Atlas combines mechanisms from:
– Cilk
– Java
– with new mechanisms.

Java “ensures”:
– heterogeneity
– safety

Page 7: Atlas: An Infrastructure for Global Computing

Introduction ...

Atlas:

extends Cilk’s work-stealing scheduler to a hierarchical Internet setting

uses Cilk-NOW’s mechanisms for:
– adaptive parallelism
– fault tolerance

Page 8: Atlas: An Infrastructure for Global Computing

Programming Model

Applications are written in Java

When a native library is used, heterogeneity is limited to platforms that support it.

Programming model is:
– a Java-based implementation of Cilk: non-blocking, explicit continuation-passing threads
– a Unix-like URL-based file system & local caching with coherence.
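To make "non-blocking, explicit continuation-passing threads" concrete, here is a minimal single-process sketch of Fib in that style. It is not the Atlas API: the class `CpsFib`, the `Sum` successor closure, and the `ready` queue are all hypothetical names chosen for illustration. A thread never waits for its children; it hands them a continuation and returns, and a join counter releases the successor once both results have arrived.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

// Hypothetical names throughout; a sketch of the style, not the Atlas API.
class CpsFib {
    // Ready queue of threads; each one runs to completion without blocking.
    private final Deque<Runnable> ready = new ArrayDeque<>();

    // Successor closure: waits for two integer arguments, then sends
    // their sum to its own continuation.
    private final class Sum {
        private final IntConsumer k;
        private int total = 0;
        private int missing = 2; // join counter

        Sum(IntConsumer k) { this.k = k; }

        void send(int value) {
            total += value;
            if (--missing == 0) {
                ready.push(() -> k.accept(total)); // successor becomes ready
            }
        }
    }

    // fib never waits for its children: it spawns them with a continuation.
    private void fib(int n, IntConsumer k) {
        if (n < 2) {
            k.accept(n);
            return;
        }
        Sum sum = new Sum(k);
        ready.push(() -> fib(n - 1, sum::send));
        ready.push(() -> fib(n - 2, sum::send));
    }

    int run(int n) {
        int[] result = new int[1];
        ready.push(() -> fib(n, v -> result[0] = v));
        while (!ready.isEmpty()) {
            ready.pop().run();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(new CpsFib().run(24)); // prints 46368
    }
}
```

Because every thread is a self-contained closure on a queue, any queued entry could in principle be handed to another machine, which is what makes this style a natural fit for work stealing.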

Page 9: Atlas: An Infrastructure for Global Computing

Architecture

[Figure: Basic architecture. Components shown: Client, Manager, six ComputeServers. Software stack (drawn twice, one copy labeled Compute Server): Application (Java), Runtime library, Java interpreter, Native libraries (C or C++).]

Page 10: Atlas: An Infrastructure for Global Computing

Architecture ...

Client is a Java application
– connects to compute servers on machines other than its manager’s.

Idle servers steal work from busy ones.
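A single-threaded sketch of the stealing rule may help here. The names `StealDemo`, `Server`, and `nextTask` are hypothetical, and two choices are assumptions rather than anything stated on the slides: victims are scanned in list order (Cilk picks victims at random), and the thief takes the oldest task while the owner takes the newest, as in Cilk's scheduler.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical names; a single-threaded sketch, not the Atlas runtime.
class StealDemo {
    static final class Server {
        final String name;
        final Deque<String> tasks = new ArrayDeque<>();

        Server(String name) { this.name = name; }

        // Returns the next task to run, or null if no work was found anywhere.
        String nextTask(List<Server> peers) {
            String own = tasks.pollLast();            // owner works on its newest task
            if (own != null) return own;
            for (Server victim : peers) {             // idle: look for a busy peer
                if (victim == this) continue;
                String stolen = victim.tasks.pollFirst(); // thief takes the oldest task
                if (stolen != null) return stolen;
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Server busy = new Server("busy");
        Server idle = new Server("idle");
        busy.tasks.addLast("t1");
        busy.tasks.addLast("t2");
        busy.tasks.addLast("t3");
        List<Server> all = Arrays.asList(busy, idle);
        System.out.println(idle.nextTask(all)); // prints t1 (stolen)
        System.out.println(busy.nextTask(all)); // prints t3 (local)
    }
}
```

Stealing the oldest task tends to move the largest remaining piece of a tree-structured computation, so steals stay rare relative to local work.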

Page 11: Atlas: An Infrastructure for Global Computing

Architecture

Compute server:
– relinquishes control when there is non-Atlas work (a screensaver?)
– runs as a daemon: working; pings manager & siblings for work to steal

Page 12: Atlas: An Infrastructure for Global Computing

Architecture: Porting Atlas

A Java runtime system

Port:
– natively written URL-based file system
– some support routines.

Page 13: Atlas: An Infrastructure for Global Computing

Hierarchical Work Stealing

[Figure: a hierarchy of Managers (five shown) and ComputeServers (six shown).]

Page 14: Atlas: An Infrastructure for Global Computing

Hierarchical Work Stealing ...

Manager keeps track of when its subtree is idle

If manager’s subtree is idle, manager steals work from its siblings

If a subtree has “too much” work, it “allows” work stealing from above

What is the definition & implementation of “too much”?
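The rules on this slide can be sketched as a small tree model. Everything here is an assumption made for illustration: the names `HierarchyDemo`, `Node`, and `chooseVictim` are hypothetical, and since the slide leaves “too much” undefined, the sketch simply picks a fixed threshold `TOO_MUCH` on queued work units.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a manager tree; not the Atlas implementation.
class HierarchyDemo {
    // ASSUMPTION: "too much" work means more than this many queued units.
    static final int TOO_MUCH = 4;

    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node parent;
        int queued; // work units queued directly at this node

        Node(String name, int queued) { this.name = name; this.queued = queued; }

        Node add(Node child) { child.parent = this; children.add(child); return this; }

        // Total work queued anywhere in this node's subtree.
        int subtreeWork() {
            int work = queued;
            for (Node c : children) work += c.subtreeWork();
            return work;
        }
    }

    // An idle manager picks a sibling whose subtree has "too much" work.
    // Returns null if the manager is not idle or no sibling allows stealing.
    static Node chooseVictim(Node manager) {
        if (manager.parent == null || manager.subtreeWork() > 0) return null;
        for (Node sibling : manager.parent.children) {
            if (sibling != manager && sibling.subtreeWork() > TOO_MUCH) return sibling;
        }
        return null;
    }

    public static void main(String[] args) {
        Node a = new Node("A", 0).add(new Node("a1", 3)).add(new Node("a2", 3));
        Node b = new Node("B", 0).add(new Node("b1", 0));
        Node c = new Node("C", 0).add(new Node("c1", 2));
        new Node("root", 0).add(a).add(b).add(c);
        Node victim = chooseVictim(b);
        System.out.println(victim == null ? "none" : victim.name); // prints A
    }
}
```

A fixed threshold is only one possible answer to the question the slide raises; a real policy might instead compare subtree load against the number of servers in the subtree.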

Page 15: Atlas: An Infrastructure for Global Computing

Hierarchical Work Stealing

The authors claim that proven properties of Cilk hold in this hierarchical setting.

Goals:
– Localize communication
– Sub-trees map to domain hierarchy

Administrators can control thread migration:
– Outflow: Privacy
– Inflow: Host security

Page 16: Atlas: An Infrastructure for Global Computing

Examples

Fib: fine grained threads
POV-Ray: coarse grained threads

           Base     1 Node   3 Nodes    8 Nodes
Fib (24)   1.3      80       40 (2.0)   31 (2.6)
POV-Ray    20700    21000    -          2700 (7.8)

Numbers in ( ) are speedups over 1-node case.
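The parenthesized speedups can be reproduced from the table as T(1 node) / T(P nodes). The sketch below uses the table's values as given; the helper names `speedup`, `efficiency`, and `round1` are hypothetical, and parallel efficiency (speedup divided by P) is a standard measure added here for illustration, not a figure from the slides.

```java
// Hypothetical helper class; the timings are taken from the table above.
class Speedup {
    // Speedup over the 1-node case: T(1) / T(P).
    static double speedup(double t1, double tp) { return t1 / tp; }

    // Parallel efficiency: speedup divided by the number of nodes.
    static double efficiency(double t1, double tp, int p) { return t1 / (tp * p); }

    // Round to one decimal place, as the table does.
    static double round1(double x) { return Math.round(x * 10.0) / 10.0; }

    public static void main(String[] args) {
        System.out.println(round1(speedup(80, 40)));      // Fib, 3 nodes: prints 2.0
        System.out.println(round1(speedup(80, 31)));      // Fib, 8 nodes: prints 2.6
        System.out.println(round1(speedup(21000, 2700))); // POV-Ray, 8 nodes: prints 7.8
        System.out.println(efficiency(21000, 2700, 8));   // POV-Ray efficiency on 8 nodes
    }
}
```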

Page 17: Atlas: An Infrastructure for Global Computing

Examples ...

POV-Ray is not written in Java

Partitioning is done in Java

8 nodes: only 2% overhead.

What about larger P?

Page 18: Atlas: An Infrastructure for Global Computing

Discussion

Scalable: Yes.

Heterogeneity: Incomplete until it divorces itself from all native libraries.

Safety:
– Java: OK.
– Native libraries: ?

Page 19: Atlas: An Infrastructure for Global Computing

Discussion ...

Fault tolerance: A timed-out thread is recomputed from a checkpoint maintained by subtree (manager?)
– What is the effect on performance of checkpointing?

Subtree rooted at a thread is its subcomputation.
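One way to picture "a timed-out thread is recomputed from a checkpoint" is the sketch below. It is an assumption-laden illustration, not the Atlas mechanism: the checkpoint is reduced to the saved argument of the thread, each host is modeled as an `IntUnaryOperator`, and the names `CheckpointRetry`, `recompute`, `stalled`, and `square` are hypothetical. The only library facilities used are `ExecutorService`, `Future.get` with a timeout, and `Future.cancel`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch: the "checkpoint" is just the saved argument of a thread.
class CheckpointRetry {
    // Tries each host in turn; a host that exceeds the timeout is abandoned
    // and the work is recomputed from the checkpointed argument on the next one.
    static int recompute(int checkpointedArg, long timeoutMs, IntUnaryOperator... hosts) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (IntUnaryOperator host : hosts) {
                Callable<Integer> job = () -> host.applyAsInt(checkpointedArg);
                Future<Integer> attempt = pool.submit(job);
                try {
                    return attempt.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    attempt.cancel(true); // give up on this host, try the next
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            throw new IllegalStateException("all hosts timed out");
        } finally {
            pool.shutdownNow();
        }
    }

    // Simulates a host that has stopped responding.
    static int stalled(int x) {
        try {
            Thread.sleep(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return -1;
    }

    // Simulates a healthy host.
    static int square(int x) { return x * x; }

    public static void main(String[] args) {
        int result = recompute(7, 300, CheckpointRetry::stalled, CheckpointRetry::square);
        System.out.println(result); // prints 49
    }
}
```

The sketch sidesteps the slide's open question: here checkpointing costs nothing because the argument is already in memory, whereas a distributed checkpoint would add communication on every spawn or steal.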

Page 20: Atlas: An Infrastructure for Global Computing

Fault Tolerance ...

Subcomputations are transactions:

Authors claim: side effects can be undone

How does this relate to hierarchical work stealing?

Page 21: Atlas: An Infrastructure for Global Computing

Discussion ...

Anonymity: A host executing a stolen subtree cannot determine client.
– Managers are assumed to be trustworthy

Hierarchy: Yes, via manager hierarchy.

Ease of use: Interface incomplete.
– clients submit jobs via a special “shell”

Page 22: Atlas: An Infrastructure for Global Computing

Discussion ...

Adaptive parallelism:
– “Owner” (?) of compute server sets a policy that defines when server is idle.
– How?
– When compute server becomes unavailable for Atlas work, all its subcomputations are moved to another compute server.

Page 23: Atlas: An Infrastructure for Global Computing

Adaptive Parallelism ...

Moving a subcomputation requires updating information linking subcomputation to its:
– parent
– children
– How long does it take to retreat?
– Is subcomputation restarted? From checkpoint?
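The link updates listed above can be sketched with a small data structure. This is a hypothetical model, not the Atlas bookkeeping: `MigrationDemo`, `Sub`, and `moveTo` are made-up names, and the assumption is that each subcomputation caches the host of its parent and of each child, so a move must rewrite one record at the parent and one at every child.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of subcomputation links; not the Atlas data structures.
class MigrationDemo {
    static final class Sub {
        final String id;
        String host;        // compute server currently holding this subcomputation
        Sub parent;
        String parentHost;  // cached location of the parent
        final Map<Sub, String> childHosts = new LinkedHashMap<>(); // child -> its host

        Sub(String id, String host) { this.id = id; this.host = host; }

        void addChild(Sub child) {
            child.parent = this;
            child.parentHost = host;
            childHosts.put(child, child.host);
        }

        // Moves this subcomputation and returns how many link records,
        // held by its parent and children, had to be updated.
        int moveTo(String newHost) {
            host = newHost;
            int updates = 0;
            if (parent != null) {
                parent.childHosts.put(this, newHost);
                updates++;
            }
            for (Sub child : childHosts.keySet()) {
                child.parentHost = newHost;
                updates++;
            }
            return updates;
        }
    }

    public static void main(String[] args) {
        Sub root = new Sub("root", "s1");
        Sub mid = new Sub("mid", "s2");
        Sub leafA = new Sub("leafA", "s3");
        Sub leafB = new Sub("leafB", "s4");
        root.addChild(mid);
        mid.addChild(leafA);
        mid.addChild(leafB);
        System.out.println(mid.moveTo("s5")); // prints 3: one parent link, two child links
    }
}
```

Under this model the cost of retreating grows with the number of children a subcomputation has, which bears on the slide's question of how long a retreat takes.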

Page 24: Atlas: An Infrastructure for Global Computing

Limitations

Atlas inherits tree-structured program limitation from Cilk.
– But this is still a rich set!

Generalizing to non-tree-structured programs seems hard.

No shared variables among threads.

Global file system is read-only.

Page 25: Atlas: An Infrastructure for Global Computing

Conclusion

Jicos design goals = those for Atlas.

Use JXTA to give Jicos a “file system”
– Then, Jicos becomes Atlas’s heir.