Automatic load balancing and transparent process migration

Roberto Innocente
rinnocente@hotmail.com
November 24, 2000

Download PostScript from mosix.ps, or gzipped PostScript from mosix.ps.gz, or PDF from mosix.pdf

automatic load balancing and migration : Mosix


Introduction

The purpose of this presentation is to introduce the most important features of MOSIX:
- automatic load balancing
- transparent process migration
It also gives an overview of related projects and of the different possible implementations of these mechanisms.

Scheme of the presentation

1. Overview of some Distributed Operating Systems
2. Load balancing mechanisms
3. Process migration mechanisms
4. Focus on Mosix

Overview of some distributed OS

Distributed OS

- Sprite
- Charlotte
- Accent
- Amoeba
- V System
- MOSIX
- Condor (this is just a batch system, but it offers job migration)

Sprite (Douglis, Ousterhout 1987)

- Experimental network OS developed at UCB
- Ran on Suns and DECstations
- Its process migration introduced the simplifying concept of the Home Node of a process (the originating node, where the process should still appear to run)
- This concept has since been adopted by many other implementations

(Ousterhout is of Tcl/Tk fame)

Charlotte (Artsy, Finkel 1989)

It is a distributed operating system developed at the UoWM (University of Wisconsin-Madison), based on the message passing paradigm.
It provides a well-developed process migration mechanism that leaves no residual dependencies (the migrated process can survive the death of the originating node) and is therefore also suitable for fault tolerant systems.

Accent (Zayas 1987)

- developed at CMU as the precursor of Mach
- a kernel, not Unix compatible, with IPC based on a "port" abstraction
- process migration was added by Zayas (1987)
- no load balancing

Amoeba (Tanenbaum et al. 1990)

- Tanenbaum, van Renesse, van Staveren, Sharp, Mullender, Jansen, van Rossum (Experiences with the Amoeba Distributed Operating System, 1990)
- Zhu, Steketee (An experimental study of Load Balancing on Amoeba, 1994)
- doesn't support virtual memory
- was based on the processor-pool model, which assumes that the number of processors is greater than the number of processes

(van Rossum is of Python fame)

V system (Theimer 1987)

- developed at Stanford, microkernel based
- runs on a set of diskless workstations
- V supports both remote execution and process migration; the facilities were added with the aim of using workstations during idle periods

MOSIX (Barak 1990, 1995)

- its roots go back to the mid '80s: MOS, Multicomputer OS (Barak, HUJI, 1985)
- set of enhancements to BSD/OS: MOSIX '90
- port to Linux (Mosix 7th edition) around '95; the system-call-based management was transformed into a proc filesystem interface
- enhancements in '98 (memory ushering algorithm)

Condor (Litzkow, Livny 1988)

Condor is not a Distributed OS. It is a batch system whose purpose is to use the idle time of workstations and PCs.
It can migrate jobs from machine to machine, but it imposes restrictions on the jobs that can be migrated, and the programs have to be linked with a special checkpoint library.
(More details at http://hpc.sissa.it/batch)

Load balancing mechanisms

Load balancing methods taxonomy

- job initiation only (aka job placement, aka remote execution)
  - static
  - dynamic (almost all batch systems can do this: NQS and derivatives)
- migration (Mosix, some Distributed OS)
- job initiation + migration

Load balancing algorithms

Two main categories:
- Non pre-emptive: used only when a new job has to be started; they decide where to run the new job (aka job placement; many batch systems)
- Pre-emptive: jobs can be migrated during execution, therefore load balancing can be applied dynamically during job execution (Distributed OS, Condor)

Non pre-emptive load balancing (aka job placement)

- centralized initiation: there is only one load balancer in the system, in charge of assigning new jobs to computers according to the info it collects
- distributed initiation: there is a load balancer on each computer, which sends its load info to the others whenever it detects a load change; the local balancer decides where to start new jobs according to the load info received
- random initiation: when a new job arrives, the local load balancer selects at random the computer to which the job is assigned
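The three placement policies above can be sketched as follows. This is only a minimal illustration, not code from any of the systems mentioned: the function names, the node names and the load tables (run queue lengths) are all hypothetical.

```python
import random

def place_centralized(loads):
    """Single balancer: pick the least loaded computer from its global table."""
    return min(loads, key=loads.get)

def place_distributed(local_view, local_name):
    """Per-node balancer: decide from the load info received from the others.

    local_view may be stale or partial; with no info the job stays local.
    """
    return min(local_view, key=local_view.get) if local_view else local_name

def place_random(computers, rng=random):
    """Random initiation: no load info is used at all."""
    return rng.choice(computers)

# Made-up run queue lengths for three hypothetical nodes.
loads = {"n1": 3, "n2": 0, "n3": 5}
print(place_centralized(loads))
print(place_distributed({"n1": 3, "n3": 5}, "n1"))
print(place_random(sorted(loads), random.Random(0)))
```

The distributed variant differs from the centralized one only in the table it consults: each node works from its own, possibly incomplete, view.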

Pre-emptive load balancing

- random: when a computer becomes overloaded, it selects at random another computer to which to transfer a process
- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance, it selects a computer with a low load to which to migrate a process from an overloaded computer
- neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when the request is accepted, a pair is constituted. If they detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken
- broadcast
- server based
- distributed: each node receives a measure of the load of other nodes; if it discovers a load imbalance, it tries to migrate some of its processes to a less loaded node
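As an illustration of the neighbour method, the sketch below balances one pair of nodes once the pair has been formed. It is an assumption-laden toy, not the Bryant and Finkel algorithm itself: load is taken to be the number of processes, and the name `balance_pair` and the `threshold` parameter are hypothetical.

```python
def balance_pair(a, b, threshold=1):
    """Migrate processes between two paired nodes until their loads differ
    by at most `threshold`. Each node is a list of process names and its
    load is simply the list length."""
    if threshold < 1:
        raise ValueError("threshold must be >= 1, otherwise the loop may never end")
    migrated = []
    while abs(len(a) - len(b)) > threshold:
        src, dst = (a, b) if len(a) > len(b) else (b, a)
        proc = src.pop()          # pick a process on the more loaded node
        dst.append(proc)          # "migrate" it to the other node of the pair
        migrated.append(proc)
    return migrated               # after this the pair would be broken

node_a = ["p1", "p2", "p3", "p4", "p5"]
node_b = ["q1"]
moved = balance_pair(node_a, node_b)
print(moved, len(node_a), len(node_b))
```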

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):
- standard Unix: number of tasks in the run queue
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1 minute load average
- amount of CPU free time

The conclusion of many studies, however, is that the simple run queue length is a better measure than complex load criteria in most situations.
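To show how two of the metrics listed are related, the sketch below derives a 1 minute load average from periodic samples of the run queue length, using an exponentially damped moving average. The 5 second sampling interval and the function name are assumptions made for the example; real kernels use fixed-point arithmetic and their own sampling details.

```python
import math

def update_load_avg(load, run_queue_len, interval=5.0, period=60.0):
    """One update step of an exponentially damped average of the run queue."""
    decay = math.exp(-interval / period)
    return load * decay + run_queue_len * (1.0 - decay)

load = 0.0
for _ in range(120):              # ten minutes of 5-second samples
    load = update_load_avg(load, run_queue_len=2)
print(round(load, 3))
```

With a constant run queue of 2 the average converges towards 2, which is why the load average lags behind sudden changes of the raw run queue length.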

Process migration mechanisms

User/kernel level migration

- Kernel level process migration (Mosix)
  - no special requirements, no need to recompile
  - transparent migration
- User level process migration (Condor)
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on a signal
  - the snapshot is restarted on another machine
  - more details at http://hpc.sissa.it/batch
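The user level approach (snapshot, then restart elsewhere) can be illustrated with a toy job whose whole state is an explicit dictionary. This is not how a checkpoint library works internally, since a real one saves the process image (memory, registers, open files); the names `checkpoint` and `restart` and the JSON encoding are assumptions chosen to keep the example self-contained.

```python
import json

def step(state):
    """One unit of work: accumulate the loop index."""
    state["acc"] += state["done"]
    state["done"] += 1

def checkpoint(state):
    """Serialize the job state; a real system would do this on a signal."""
    return json.dumps(state).encode()

def restart(snapshot):
    """Rebuild the job state from a snapshot, possibly on another machine."""
    return json.loads(snapshot.decode())

state = {"total": 10, "done": 0, "acc": 0}
for _ in range(4):
    step(state)
snapshot = checkpoint(state)      # taken on the source machine
resumed = restart(snapshot)       # would happen on the target machine
while resumed["done"] < resumed["total"]:
    step(resumed)
print(resumed["acc"])
```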

Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process depends on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
- Very easy on message based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task, and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:
- during migration, any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply messages to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation (Bubak, Zbik et al. 1999) requires:
- substitution of the socket library
- relinking of programs with the new library

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" message. The other endpoint should then consult the server for the new real address to contact.
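The virtual address scheme can be sketched in a few lines. This is a simplified model under stated assumptions, not the interface of the proposal cited: the classes `AddressServer` and `Node`, the functions `migrate` and `send`, and the sender-side cache are all hypothetical, and message delivery is a plain method call instead of a network transfer.

```python
class AddressServer:
    """Keeps the correspondence table between virtual and real addresses."""
    def __init__(self):
        self.table = {}

    def register(self, vaddr, real):
        self.table[vaddr] = real

    def lookup(self, vaddr):
        return self.table[vaddr]

class Node:
    def __init__(self, name):
        self.name = name
        self.endpoints = {}       # vaddr -> messages delivered so far
        self.stubs = set()        # vaddrs of endpoints that migrated away

    def deliver(self, vaddr, msg):
        if vaddr in self.endpoints:
            self.endpoints[vaddr].append(msg)
            return "ok"
        if vaddr in self.stubs:
            return "process migrating"
        return "unknown"

def migrate(server, nodes, vaddr, src, dst):
    """Move an endpoint, leave a stub behind and update the server table."""
    nodes[dst].endpoints[vaddr] = nodes[src].endpoints.pop(vaddr)
    nodes[src].stubs.add(vaddr)
    server.register(vaddr, dst)

def send(server, nodes, cache, vaddr, msg):
    """Send to the cached real address; on a stub reply, ask the server again."""
    if vaddr not in cache:
        cache[vaddr] = server.lookup(vaddr)
    reply = nodes[cache[vaddr]].deliver(vaddr, msg)
    if reply == "process migrating":
        cache[vaddr] = server.lookup(vaddr)
        reply = nodes[cache[vaddr]].deliver(vaddr, msg)
    return reply

server = AddressServer()
nodes = {"A": Node("A"), "B": Node("B")}
nodes["A"].endpoints["v1"] = []
server.register("v1", "A")
cache = {}
first = send(server, nodes, cache, "v1", "hello")
migrate(server, nodes, "v1", "A", "B")
second = send(server, nodes, cache, "v1", "again")
print(first, second, cache["v1"])
```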

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds, and it can also transfer memory that will not be used)
- pre-copying (V System, Theimer): allows the process to continue execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation, used by MOSIX, is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
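The pre-copying idea can be made concrete with a small simulation: one full copy while the process keeps running, then a short frozen phase that re-copies only the pages dirtied in the meantime. This is a sketch, not the V System code; memory is modelled as a dictionary from page number to contents, and the function name and the workload are hypothetical.

```python
def pre_copy_migrate(source_memory, writes_during_transfer):
    """Simulate pre-copy migration.

    source_memory: dict page -> contents on the source node.
    writes_during_transfer: dict page -> new contents written by the process
    while the first copy is in flight (made-up workload).
    Returns (target memory, pages copied live, pages copied while frozen).
    """
    # Phase 1: copy everything while the process keeps running.
    target = dict(source_memory)
    copied_live = len(target)
    source_memory.update(writes_during_transfer)   # the process dirties pages
    dirty = set(writes_during_transfer)
    # Phase 2: freeze the process and re-copy only the modified pages.
    for page in dirty:
        target[page] = source_memory[page]
    return target, copied_live, len(dirty)

memory = {0: "a", 1: "b", 2: "c", 3: "d"}
target, copied_live, copied_frozen = pre_copy_migrate(memory, {2: "C"})
print(target == memory, copied_live, copied_frozen)
```

The process is stopped only for the second phase, so the freeze time depends on the number of dirtied pages and not on the size of the whole address space.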


focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node, at regular intervals, sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
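The two rules above can be simulated directly. The sketch is an illustration only: the buffer size of 3, the fanout of 2 and the names `Node` and `gossip_round` are assumptions, not Mosix parameters.

```python
import random
from collections import deque

class Node:
    def __init__(self, name, load, buffer_size=3):
        self.name = name
        self.load = load
        # Small buffer: old entries fall out when newer info arrives.
        self.recent = deque(maxlen=buffer_size)

    def receive(self, sender, load):
        self.recent.append((sender, load))

def gossip_round(nodes, fanout, rng):
    """Each node sends its load info to a random subset of the other nodes."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, fanout):
            peer.receive(node.name, node.load)

rng = random.Random(1)
nodes = [Node(f"n{i}", load=i) for i in range(6)]
for _ in range(4):                # four "regular intervals"
    gossip_round(nodes, fanout=2, rng=rng)
print([len(n.recent) for n in nodes])
```

No node ever holds a complete picture of the cluster, yet the cost per interval stays constant as nodes are added.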

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load average metric. This means that each node will try to find a less loaded node to which to migrate one of its processes.
When a node is low on memory a different algorithm prevails: memory ushering.

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
- Load (based on a unit of a P II 400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw memory used
- Total memory

Look at http://hpc.sissa.it/walrus/mjm for a Java applet showing these values over a Mosix cluster.
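One way to picture how such variables feed a distributed decision is a record per known node plus a rule that looks for a less loaded one. The record layout, the field names, the `min_gain` parameter and the selection rule are assumptions made for illustration; the actual Mosix data structures and heuristics are not described here.

```python
from collections import namedtuple

# Hypothetical record mirroring the variables listed on the slide.
NodeInfo = namedtuple(
    "NodeInfo", "name load ncpus speed mem_used raw_mem_used mem_total"
)

def pick_target(local, known, min_gain=0.0):
    """Return the least loaded known node if it is less loaded than the
    local node by more than min_gain, otherwise None."""
    candidates = [n for n in known if n.load + min_gain < local.load]
    return min(candidates, key=lambda n: n.load, default=None)

local = NodeInfo("n0", 2.5, 1, 400, 100, 120, 256)
known = [
    NodeInfo("n1", 0.5, 2, 400, 50, 60, 512),
    NodeInfo("n2", 3.0, 1, 800, 10, 20, 128),
]
target = pick_target(local, known)
print(target.name if target else None)
```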

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls below a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.
This algorithm tries to migrate the smallest running process to the node with the most free memory.
If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
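The core choice of the algorithm (smallest process, node with the most free memory) fits in a few lines. The function name `usher`, the units and the threshold are hypothetical, and the sketch simply gives up when no remote node has room, whereas the slide says that in this case space is first created by migrations between remote nodes.

```python
def usher(processes, free_memory, local, threshold):
    """Pick a (process, target node) pair when local memory is short.

    processes: dict name -> resident size on the local node.
    free_memory: dict node -> free memory, in the same units.
    Returns None when memory is not under pressure or nothing fits.
    """
    if free_memory[local] >= threshold or not processes:
        return None
    remote = {n: m for n, m in free_memory.items() if n != local}
    if not remote:
        return None
    proc = min(processes, key=processes.get)      # smallest running process
    target = max(remote, key=remote.get)          # node with most free memory
    if remote[target] < processes[proc]:
        return None                               # simplification, see lead-in
    return proc, target

procs = {"a": 40, "b": 10, "c": 25}
choice = usher(procs, {"n0": 5, "n1": 64, "n2": 128}, "n0", threshold=16)
print(choice)
```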

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP.
(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)

MOSIX process migration

- It's a kernel level migration (it's done without the need for any modification to the programs, because Mosix changes the Linux kernel)
- fully transparent: the process involved doesn't need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).
The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).
The migrated process is called the remote.
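The deputy/remote split can be mimicked with two small classes in which every system call of the remote is forwarded to the deputy. This is a conceptual sketch only: the class and method names are hypothetical, the "system calls" are toy ones, and the forwarding is a method call where Mosix would use its link layer over the network.

```python
class Deputy:
    """Stays on the Unique Home Node and executes system calls there."""
    def __init__(self, home):
        self.home = home
        self.files = {}

    def syscall(self, name, *args):
        if name == "gethostname":
            return self.home              # the process still appears to run here
        if name == "write":
            path, data = args
            self.files[path] = self.files.get(path, "") + data
            return len(data)
        raise NotImplementedError(name)

class Remote:
    """Runs the user code on another node and forwards its system calls."""
    def __init__(self, running_node, deputy):
        self.running_node = running_node
        self.deputy = deputy
        self.forwarded = 0

    def syscall(self, name, *args):
        self.forwarded += 1               # in reality: one network round trip
        return self.deputy.syscall(name, *args)

deputy = Deputy(home="home-node")
proc = Remote(running_node="fast-node", deputy=deputy)
host = proc.syscall("gethostname")
written = proc.syscall("write", "/tmp/out", "hello")
print(host, written, proc.forwarded)
```

The counter makes the cost visible: computation runs on the remote node for free, but each system call pays for a trip to the home node.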

MOSIX deputy/remote mechanism

[Diagram: the Deputy and its link layer on the UHN (Unique Home Node), the Remote and its link layer on the running node, connected by the MNP (Mosix Network Protocol).]

Mosix performance (1)

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the bottleneck of a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance (2)

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:
- cache coherent: clients keep their caches synchronized, or only one cache, at the server node, is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node, without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
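The effect of declaring a file system as DFSA can be pictured as a routing decision per I/O operation. The function below is a hypothetical sketch: the name `io_node`, the mount point `/mfs` and the prefix test are assumptions for illustration, not the way Mosix identifies DFSA file systems.

```python
def io_node(path, dfsa_mounts, home_node, running_node):
    """Return the node on which an I/O operation on `path` is executed."""
    for mount in dfsa_mounts:
        prefix = mount.rstrip("/") + "/"
        if path == mount or path.startswith(prefix):
            return running_node       # direct file system access
    return home_node                  # redirected to the Unique Home Node

direct = io_node("/mfs/data/x", ["/mfs"], "home", "remote")
redirected = io_node("/home/u/x", ["/mfs"], "home", "remote")
print(direct, redirected)
```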

MOSIX DFSA (direct file system access)

[Diagram: the Deputy and its link layer on the UHN (Unique Home Node), the Remote and its link layer on the running node, connected by the MNP (Mosix Network Protocol); DFSA access is performed from the running node.]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in standard Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.
A prototype DFSA, which is just a small software layer over the standard Linux file system, is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:
- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems to be stable.

Page 2: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 2

Introduction

Purpose of this presentation is an introductionPurpose of this presentation is an introductionto the most important features of MOSIXto the most important features of MOSIX

-- automatic load balancingautomatic load balancing-- transparent process migrationtransparent process migrationgiving an overview of related projects and of thegiving an overview of related projects and of thedifferent possible implementations of thesedifferent possible implementations of these

mechanismsmechanisms

Nov 242000 rinnocente 3

Scheme of the presentation

11 Overview of some Distributed Operating Overview of some Distributed Operating SystemsSystems

22 Load balancing mechanismsLoad balancing mechanisms33 Process migration mechanismsProcess migration mechanisms44 focus on Mosixfocus on Mosix

Nov 242000 rinnocente 4

Overview of somedistributed OS

Nov 242000 rinnocente 5

Distributed OS

Sprite Sprite Charlotte Charlotte AccentAccent AmoebaAmoeba V System V System MOSIXMOSIX Condor (this is just a batch system but offering Condor (this is just a batch system but offering

job migration)job migration)

Nov 242000 rinnocente 6

Sprite (DouglisOusterhout 1987)

Experimental Network OS developed at UCBExperimental Network OS developed at UCBWas running on Suns and DecstationsWas running on Suns and DecstationsThe process migration in this system introduced theThe process migration in this system introduced thesimplifying concept of Home Node of a process (the simplifying concept of Home Node of a process (the originating node where it should still appear to run)originating node where it should still appear to run)After then this concept has been utilized by many otherAfter then this concept has been utilized by many otherimplementationsimplementations

of TclTk fameof TclTk fame

Nov 242000 rinnocente 7

Charlotte (Artsy Finkel 1989)

Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems

Nov 242000 rinnocente 8

Accent (Zayas 1987)

developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs

based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by

Zayas(1987)Zayas(1987) no load balancing no load balancing

Nov 242000 rinnocente 9

Amoeba (Tanenbaum et al 1990)

Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)

ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)

doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of

processors is more than the of processesprocessors is more than the of processes

of python fame of python fame

V system (Theimer 1987)

- developed at Stanford
- microkernel based
- runs on a set of diskless workstations
- V has support for both remote execution and process migration; the facilities were added with the aim of using workstations during idle periods

MOSIX (Barak 1990, 1995)

- its roots are back in the mid '80s
- MOS, multicomputer OS (Barak, HUJI, 1985)
- set of enhancements to BSD/OS: MOSIX '90
- port to LINUX (Mosix 7th edition) around '95
  - the system call based mgmt has been transformed to a /proc filesystem interface
- enhancements '98 (memory ushering algorithm)

Condor (Litzkow, Livny 1988)

Condor is not a Distributed OS.
It is a batch system whose purpose is the utilization of the idle time of workstations and PCs.
It can migrate jobs from machine to machine, but imposes restrictions on the jobs that can be migrated, and the programs have to be linked with a special checkpoint library.
(More details on http://hpc.sissa.it/batch)

Load balancing mechanisms

Load balancing methods taxonomy

- job initiation only (aka job placement, aka remote execution)
  - static
  - dynamic (almost all batch systems can do this: NQS and derivatives)
- migration (Mosix, some Distributed OS)
- job initiation + migration

Load balancing algorithms

Two main categories:
- Non pre-emptive: they are used only when a new job has to be started; they decide where to run a new job (aka job placement; many batch systems)
- Pre-emptive: jobs can be migrated during execution, therefore load balancing can be dynamically applied during job execution (Distributed OS, Condor)

Non pre-emptive load balancing (aka job placement)

- centralized initiation: there is only 1 load balancer in the system, in charge of assigning new jobs to computers according to the info it collects
- distributed initiation: there is a load balancer on each computer that xfers its load to the others anytime it detects a load change; the local balancer decides, according to the load info received, where to start new jobs
- random initiation: when a new job arrives, the local load balancer selects at random the computer to which the job is assigned
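
The three placement policies above can be compared with a minimal sketch. The function names, the node names and the load numbers are assumptions made up for illustration; they do not come from any real batch system.

```python
import random

def least_loaded(load_table):
    """Return the node with the smallest reported load (ties go to the first name in sorted order)."""
    return min(sorted(load_table), key=lambda node: load_table[node])

def centralized_initiation(global_table):
    # The single balancer sees the load of every node and places the job itself.
    return least_loaded(global_table)

def distributed_initiation(local_table):
    # Each node decides on its own from the load reports it has received so far;
    # its table may be stale or incomplete compared with the real state.
    return least_loaded(local_table)

def random_initiation(nodes, rng=random):
    # No load information is used at all.
    return rng.choice(list(nodes))

loads = {"n1": 3, "n2": 0, "n3": 5}        # hypothetical run-queue lengths
stale_view_on_n3 = {"n1": 3, "n3": 5}      # n3 has not yet heard from n2

print(centralized_initiation(loads))             # n2
print(distributed_initiation(stale_view_on_n3))  # n1
print(random_initiation(loads, random.Random(0)))
```

The second call shows the price of distributed initiation: a node that has not yet received a report from n2 picks a worse target than the central balancer would.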

Pre-emptive load balancing

- random: when a computer becomes overloaded it selects at random another computer to which to transfer a process
- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance it selects a computer with a low load and migrates a process to it from an overloaded computer
- neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when the request is accepted a pair is constituted. If they detect a load imbalance between them they migrate some processes from one to the other; after that the pair is broken
- broadcast
- server based
- distributed: each node receives a measure of the load of other nodes; if it discovers a load imbalance it tries to migrate some of its processes to a less loaded node
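
What happens inside one pair of the neighbour method can be sketched as below. This is a deliberately simplified model: it only equalises the number of processes, while the published algorithm is more elaborate. The function name and the `threshold` parameter are assumptions.

```python
def balance_pair(procs_a, procs_b, threshold=1):
    """Move processes from the longer list to the shorter one until the two
    lengths differ by at most `threshold`. Returns the number of migrations."""
    if threshold < 1:
        # With threshold 0 an odd total would bounce one process back and forth forever.
        raise ValueError("threshold must be at least 1")
    moved = 0
    while abs(len(procs_a) - len(procs_b)) > threshold:
        if len(procs_a) > len(procs_b):
            src, dst = procs_a, procs_b
        else:
            src, dst = procs_b, procs_a
        dst.append(src.pop())   # "migrate" one process
        moved += 1
    return moved                # after this the pair would be broken

a = ["p1", "p2", "p3", "p4", "p5"]
b = ["q1"]
migrated = balance_pair(a, b)
print(migrated, len(a), len(b))   # 2 3 3
```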

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in only 1 paper (Mehra & Wah 1993):

- std Unix number of tasks in the run queue
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1 minute load avg
- amount of CPU free time

but the conclusion of many studies indicates that the simple run queue length is a better measure than complex load criteria in most situations.
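
On Linux, two of the measures listed above (the load averages and the number of runnable tasks) can be read from /proc/loadavg. The sketch below parses a line in that format; the sample values are made up, and the function name is an assumption.

```python
def parse_loadavg(text):
    """Parse the contents of Linux /proc/loadavg,
    e.g. '0.20 0.18 0.12 1/80 11206'."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load_1min": float(fields[0]),
        "load_5min": float(fields[1]),
        "load_15min": float(fields[2]),
        "runnable": int(runnable),   # scheduling entities currently runnable
        "total": int(total),         # scheduling entities that exist
    }

sample = "0.20 0.18 0.12 1/80 11206"   # made-up sample, not a measurement
info = parse_loadavg(sample)
print(info["load_1min"], info["runnable"])   # 0.2 1
```

On a real Linux node the same function could be fed with `open("/proc/loadavg").read()`.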

Process migration mechanisms

User/kernel level migration

- Kernel level process migration (Mosix)
  - no special requirements, no need to recompile
  - transparent migration
- User level process migration (Condor)
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on signal
  - the snapshot is restarted on another machine
  - more details at http://hpc.sissa.it/batch
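
The snapshot/restart idea behind user level migration can be illustrated with a toy job whose whole state lives in a dictionary. This is only an analogy: a real checkpoint library saves the process image when a signal arrives, while here the snapshot is taken by an explicit call and serialised with JSON. All names are assumptions.

```python
import json

def run(state, max_steps=None):
    """Advance the job (sum of 0..n-1); `state` holds everything needed to resume."""
    steps = 0
    while state["i"] < state["n"] and (max_steps is None or steps < max_steps):
        state["total"] += state["i"]
        state["i"] += 1
        steps += 1
    return state

state = {"n": 10, "i": 0, "total": 0}
run(state, max_steps=4)            # the job runs for a while on the first machine

snapshot = json.dumps(state)       # the snapshot (taken "on signal" in the scheme above)
resumed = json.loads(snapshot)     # this part would happen on another machine
run(resumed)                       # the job continues from where it stopped

print(resumed["total"])            # 45
```

The restriction mentioned for Condor shows up here too: only state that the checkpoint code knows how to save can follow the job.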

Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process is dependent on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:

- Very easy on message based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX), and reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task, and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel.

- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will rexmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply msgs to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this.
Research is ongoing on different solutions.
A proposed user-level implementation requires:
- substitution of the socket library
- relinking of programs with the new library
(Bubak, Zbik et al. 1999)

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint then should consult the server for the new real address to contact.
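
The virtual address scheme just described can be modelled in a few lines. This is a toy model with invented class names, addresses and reply strings, not the API of the cited proposal; no real sockets are involved.

```python
class AddressServer:
    """Keeps the correspondence table between virtual and real addresses."""
    def __init__(self):
        self.table = {}
    def register(self, virtual, real):
        self.table[virtual] = real
    def resolve(self, virtual):
        return self.table[virtual]

class Host:
    def __init__(self, real_addr):
        self.real_addr = real_addr
        self.endpoints = set()   # virtual addresses living on this host
        self.stubs = set()       # virtual addresses that migrated away
    def receive(self, virtual, payload):
        if virtual in self.endpoints:
            return "ok"
        if virtual in self.stubs:
            return "process migrating"
        return "no such endpoint"

def migrate(server, virtual, src, dst):
    src.endpoints.remove(virtual)
    src.stubs.add(virtual)                    # a stub stays on the original node
    dst.endpoints.add(virtual)
    server.register(virtual, dst.real_addr)   # the table now points to the new node

def send(server, hosts, cached_real, virtual, payload):
    status = hosts[cached_real].receive(virtual, payload)
    if status == "process migrating":
        cached_real = server.resolve(virtual)  # ask the server for the new real address
        status = hosts[cached_real].receive(virtual, payload)
    return cached_real, status

server = AddressServer()
h1, h2 = Host("10.0.0.1"), Host("10.0.0.2")   # hypothetical addresses
hosts = {h.real_addr: h for h in (h1, h2)}
h1.endpoints.add("v42")
server.register("v42", h1.real_addr)

migrate(server, "v42", h1, h2)
addr, status = send(server, hosts, "10.0.0.1", "v42", b"hello")
print(addr, status)   # 10.0.0.2 ok
```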

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)
- pre-copying (V System, Theimer): allows the process to continue the execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand page (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
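
The trade-off between the first three strategies is mostly about how many pages cross the network while the process is frozen. The counting model below makes that visible; it is a rough sketch with assumed function names and made-up page sets, and it ignores transfer times and repeated pre-copy rounds.

```python
def total_copy(pages):
    # The process is frozen for the whole transfer.
    return {"copied_while_running": 0, "copied_while_frozen": len(set(pages))}

def pre_copy(pages, dirtied_during_transfer):
    # First pass: everything is sent while the process keeps running.
    first_pass = len(set(pages))
    # Then the process is frozen and only the pages modified meanwhile are re-sent.
    second_pass = len(set(dirtied_during_transfer) & set(pages))
    return {"copied_while_running": first_pass, "copied_while_frozen": second_pass}

def lazy_copy(pages, referenced_later):
    # Nothing is moved up front; a page crosses the network at its first fault.
    faults = len(set(referenced_later) & set(pages))
    return {"copied_while_running": faults, "copied_while_frozen": 0}

pages = range(100)        # a hypothetical 100-page address space
dirtied = {3, 7, 42}      # pages written during the first pass
referenced = range(10)    # pages touched after the move

print(total_copy(pages))            # 100 pages moved while frozen
print(pre_copy(pages, dirtied))     # 100 while running, 3 while frozen
print(lazy_copy(pages, referenced)) # 10 moved on demand, none while frozen
```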

focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node, at regular intervals, sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
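
A small simulation of this dissemination scheme, under stated assumptions: the buffer size (`window`), the size of the random subset (`fanout`), the number of nodes and the load values are all invented for illustration and are not MOSIX parameters.

```python
import random
from collections import deque

class Node:
    def __init__(self, name, load, window=4):
        self.name = name
        self.load = load
        self.recent = deque(maxlen=window)   # small buffer: old reports fall out

    def receive(self, sender, load):
        self.recent.append((sender, load))

def gossip_round(nodes, fanout, rng):
    """Every node sends its load to `fanout` other nodes chosen at random."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for target in rng.sample(others, fanout):
            target.receive(node.name, node.load)

rng = random.Random(1)                       # fixed seed, repeatable run
nodes = [Node(f"n{i}", load=i) for i in range(8)]
for _ in range(5):
    gossip_round(nodes, fanout=2, rng=rng)

for n in nodes:
    print(n.name, list(n.recent))
```

No node ever holds the full picture, yet each one keeps a few fresh samples of the cluster load, which is what the distributed balancing decision needs.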

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.
When a node is low on memory a different algorithm prevails: memory ushering.
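
The decision each node takes can be sketched as follows. This is not the MOSIX kernel code: the function name and the `min_gain` margin (used here to avoid migrating for a negligible benefit) are assumptions.

```python
def pick_target(my_load, known_loads, min_gain=1.0):
    """Return the least loaded known node if it is lighter than this node
    by at least `min_gain`, otherwise None (no migration)."""
    if not known_loads:
        return None
    best = min(sorted(known_loads), key=lambda node: known_loads[node])
    if known_loads[best] + min_gain <= my_load:
        return best
    return None

print(pick_target(5.0, {"b": 4.5, "c": 1.0}))   # c
print(pick_target(1.5, {"b": 1.0}))             # None
```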

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:

- Load (based on a unit of a PII-400MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw Memory Used
- Total Memory

Look at http://hpc.sissa.it/walrus/mjm for a java applet showing these values over a Mosix cluster

Memory ushering algorithm

Normally only the load balancing algorithm is active.
When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.
This algorithm tries to migrate the smallest running process to the node with the most free memory.
If needed, processes are migrated between remote nodes to create sufficient space on a single node.
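
The basic choice made by memory ushering (smallest process, node with the most free memory) can be sketched as below. Names, units and numbers are assumptions; the extra step of moving processes between remote nodes to make room is not modelled.

```python
def usher(processes, free_memory, local_free, threshold):
    """processes: {pid: size_mb} on this node.
    free_memory: {node: free_mb} as reported by the other nodes.
    Returns (pid, target_node), or None when nothing should or can be moved."""
    if local_free >= threshold or not processes or not free_memory:
        return None                       # memory is fine, or nothing to choose from
    pid = min(sorted(processes), key=lambda p: processes[p])            # smallest process
    target = max(sorted(free_memory), key=lambda n: free_memory[n])     # most free memory
    if free_memory[target] < processes[pid]:
        return None                       # here space would first have to be created remotely
    return pid, target

procs = {101: 300, 102: 40, 103: 120}     # hypothetical sizes in MB
others = {"b": 64, "c": 512}              # hypothetical free memory in MB

print(usher(procs, others, local_free=8, threshold=16))    # (102, 'c')
print(usher(procs, others, local_free=100, threshold=16))  # None
```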

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

- UDP to xfer load info
- TCP to migrate processes

These communications are not encrypted/authenticated/protected, therefore it's better to use a private network for the MNP.
(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat)

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the linux kernel)
- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

- Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node)
- The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls shadow)
- The migrated process is called the remote
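
The deputy/remote split can be pictured with two small objects. In this sketch the "link layer" is just a direct function call and the "system calls" are two invented operations; in Mosix the split happens inside the kernel and the two halves talk over the network.

```python
class Deputy:
    """Stays on the home node and executes system calls there."""
    def __init__(self, home_files):
        self.files = home_files

    def syscall(self, name, *args):
        if name == "read":
            (path,) = args
            return self.files[path]
        if name == "gethostname":
            return "home"
        raise NotImplementedError(name)

class Remote:
    """The migrated part: computes locally, forwards system calls."""
    def __init__(self, link_to_deputy):
        self.link = link_to_deputy

    def run(self):
        data = self.link("read", "/etc/motd")   # executed on the home node
        host = self.link("gethostname")         # still answers as the home node
        return f"{host}: {data.upper()}"        # pure computation runs here

deputy = Deputy({"/etc/motd": "welcome"})       # hypothetical file content
remote = Remote(deputy.syscall)
print(remote.run())   # home: WELCOME
```

Every forwarded call costs a round trip to the home node, which is why the CPU-only part (`upper()` here) migrates well and the I/O part does not.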

MOSIX deputy/remote mechanism

[Figure: the Deputy on the UHN (Unique Home Node) and the Remote on the running node, each with its own link layer; the two link layers communicate through MNP (Mosix Network Protocol)]

Mosix performance/1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck: a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance/2

A more general solution to the performance problem of I/O bound processes under migration is anyway envisioned to be the use of a cache coherent global file system:

- cache coherent: clients keep their caches synchronized, or only 1 cache at the server node is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node without resorting to redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access (DFSA).
For such file systems Mosix will perform I/O operations directly on the running node.
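
The effect of declaring a file system as DFSA can be summarised as a routing decision per I/O operation. The sketch below is only a model of that decision: the function name, the mount points and the path-prefix test are assumptions, not the way the Mosix kernel does it.

```python
def where_to_run_io(path, dfsa_mounts):
    """Return 'running node' if `path` lies under a mount point declared DFSA,
    otherwise 'home node' (the call is redirected to the deputy)."""
    for mount in dfsa_mounts:
        base = mount.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return "running node"
    return "home node"

dfsa = ["/mfs"]                                   # hypothetical DFSA mount point
print(where_to_run_io("/mfs/data/x", dfsa))       # running node
print(where_to_run_io("/home/user/x", dfsa))      # home node
```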

MOSIX dfsa (direct file system access)

[Figure: the Deputy on the UHN (Unique Home Node) and the Remote on the running node, each with its own link layer, connected through MNP (Mosix Network Protocol); DFSA access is performed directly from the running node]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level:

- identify/compare/reconstruct

A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSA are:

- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems to be stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990, 1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance/1
  • Mosix performance/2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems
Page 3: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 3

Scheme of the presentation

11 Overview of some Distributed Operating Overview of some Distributed Operating SystemsSystems

22 Load balancing mechanismsLoad balancing mechanisms33 Process migration mechanismsProcess migration mechanisms44 focus on Mosixfocus on Mosix

Nov 242000 rinnocente 4

Overview of somedistributed OS

Nov 242000 rinnocente 5

Distributed OS

Sprite Sprite Charlotte Charlotte AccentAccent AmoebaAmoeba V System V System MOSIXMOSIX Condor (this is just a batch system but offering Condor (this is just a batch system but offering

job migration)job migration)

Nov 242000 rinnocente 6

Sprite (DouglisOusterhout 1987)

Experimental Network OS developed at UCBExperimental Network OS developed at UCBWas running on Suns and DecstationsWas running on Suns and DecstationsThe process migration in this system introduced theThe process migration in this system introduced thesimplifying concept of Home Node of a process (the simplifying concept of Home Node of a process (the originating node where it should still appear to run)originating node where it should still appear to run)After then this concept has been utilized by many otherAfter then this concept has been utilized by many otherimplementationsimplementations

of TclTk fameof TclTk fame

Nov 242000 rinnocente 7

Charlotte (Artsy Finkel 1989)

Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems

Nov 242000 rinnocente 8

Accent (Zayas 1987)

developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs

based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by

Zayas(1987)Zayas(1987) no load balancing no load balancing

Nov 242000 rinnocente 9

Amoeba (Tanenbaum et al 1990)

Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)

ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)

doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of

processors is more than the of processesprocessors is more than the of processes

of python fame of python fame

Nov 242000 rinnocente 10

V system (Theimer 1987)

developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks

V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods

Nov 242000 rinnocente 11

MOSIX (Barak 19901995)

its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95

the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface

enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)

Nov 242000 rinnocente 12

Condor (LitzkowLivny 1988)

Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))

Nov 242000 rinnocente 13

Load balancingmechanisms

Nov 242000 rinnocente 14

Load balancing methods taxonomy

job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this

NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration

Nov 242000 rinnocente 15

Load balancing algorithms

Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new

job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)

PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)

Nov 242000 rinnocente 16

Non pre-emptive load balancing (aka job placement)

centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects

distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs

random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs

Nov 242000 rinnocente 17

Pre-emptive load balancing

- random: when a computer becomes overloaded, it selects at random another computer to which to transfer a process
- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance, it selects a computer with a low load, to which a process is migrated from an overloaded computer
- neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when the request is accepted, a pair is constituted. If they detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken
- broadcast
- server based
- distributed: each node receives a measure of the load of the other nodes; if it discovers a load imbalance, it tries to migrate some of its processes to a less loaded node
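A minimal sketch of the distributed variant, assuming that every node holds a table with the loads it has received and that an imbalance only counts when it exceeds a threshold. The function name, the threshold and the sample loads are hypothetical.

```python
def pick_migration_target(my_node, loads, threshold=1):
    """Return the least loaded other node if it is more than `threshold`
    lighter than my_node, else None (no imbalance worth a migration)."""
    others = {n: l for n, l in loads.items() if n != my_node}
    if not others:
        return None
    best = min(others, key=others.get)
    if loads[my_node] - others[best] > threshold:
        return best
    return None

loads = {"a": 6, "b": 2, "c": 5}
target = pick_migration_target("a", loads)
if target is not None:
    loads["a"] -= 1      # one process leaves the overloaded node
    loads[target] += 1   # and resumes on the lighter one
```

The threshold avoids moving processes back and forth between nodes whose loads differ only marginally.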


Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):

- number of tasks in the run queue (std Unix)
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1 minute load avg
- amount of CPU free time

but the conclusion of many studies is that the simple run queue length is a better measure than complex load criteria in most situations
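As an illustration of how two of these metrics can be read, here is a small sketch that parses a line in the format of the Linux file /proc/loadavg, where the first field is the 1 minute load avg and the fourth field is "runnable/total" scheduling entities. The choice of Linux and the sample line are assumptions, not part of the slides.

```python
def parse_loadavg(text):
    """Parse one line in the Linux /proc/loadavg format."""
    fields = text.split()
    one_min = float(fields[0])                       # 1 minute load avg
    runnable, total = (int(x) for x in fields[3].split("/"))
    return one_min, runnable, total

sample = "0.75 0.40 0.20 3/128 4242"   # made-up sample line
one_min, runnable, total = parse_loadavg(sample)
```

On a live Linux system the same function can be fed with the content of /proc/loadavg.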


Process migration mechanisms


User/kernel level migration

Kernel level process migration (Mosix)
- no special requirements, no need to recompile
- transparent migration

User level process migration (Condor)
- compilation with special libraries (libckpt.a)
- a snapshot of the running process is taken on signal
- the snapshot is restarted on another machine
- more details at http://hpc.sissa.it/batch


Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process depends on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is


Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent


System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:

- Very easy on message based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent


Process migration parts

- execution state migration
- memory space migration
- communication state migration


Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.


Communication state migration

It is a difficult task, and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.

On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel.

- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply msgs to a migrating client
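The reject, retransmit and locate sequence described for Amoeba can be simulated as follows. This is a toy model under stated assumptions: the `Kernel` class, the `locate` function (a stand-in for the FLIP locate step) and the retry limit are all hypothetical, and no real delay or network is involved.

```python
class Kernel:
    def __init__(self, name):
        self.name = name
        self.processes = {}  # pid -> "running" or "migrating"

    def deliver(self, pid, msg):
        state = self.processes.get(pid)
        if state == "running":
            return "delivered"
        if state == "migrating":
            return "process migrating"   # message rejected during migration
        return "unknown process"

def locate(kernels, pid):
    """Stand-in for the locate step: find the kernel now running pid."""
    for k in kernels:
        if k.processes.get(pid) == "running":
            return k
    return None

def send(kernels, dest, pid, msg, max_tries=5):
    for attempt in range(max_tries):
        if dest.deliver(pid, msg) == "delivered":
            return dest.name, attempt + 1
        # rejected: a real sender would wait here, then locate and retransmit
        found = locate(kernels, pid)
        if found is not None:
            dest = found
    return None, max_tries

src, dst = Kernel("src"), Kernel("dst")
src.processes[7] = "migrating"            # pid 7 is suspended on src
first_reply = src.deliver(7, "request")   # rejected while migrating
del src.processes[7]                      # migration completes...
dst.processes[7] = "running"              # ...and pid 7 resumes on dst
where, tries = send([src, dst], src, 7, "request")
```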


Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation requires:
- substitution of the socket library
- relinking of programs with the new library
(Bubak, Zbik et al. 1999)

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint should then consult the server for the new real address to contact.
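The virtual address idea can be sketched with a lookup table and a stub reply. Everything here is an assumption made for illustration (class names, address strings, the dictionary standing in for the network); it is not the interface of the cited proposal.

```python
class AddressServer:
    """Keeps the correspondence table virtual address -> real address."""
    def __init__(self):
        self.table = {}

    def register(self, virtual, real):
        self.table[virtual] = real

    def lookup(self, virtual):
        return self.table[virtual]

class Peer:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # virtual -> last known real address

    def send(self, virtual, network):
        real = self.cache.get(virtual) or self.server.lookup(virtual)
        if network.get(real) == "process migrating":  # the stub answered
            real = self.server.lookup(virtual)        # ask for the new address
        self.cache[virtual] = real
        return real

server = AddressServer()
server.register("v1", "10.0.0.1:5000")           # hypothetical addresses
network = {"10.0.0.1:5000": "endpoint"}          # real address -> who answers
peer = Peer(server)
before = peer.send("v1", network)

# The endpoint migrates: a stub stays behind, the server learns the new address.
network["10.0.0.1:5000"] = "process migrating"
network["10.0.0.2:5000"] = "endpoint"
server.register("v1", "10.0.0.2:5000")
after = peer.send("v1", network)
```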


Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds, and furthermore it can transfer memory that will not be used)
- pre-copying (V System, Theimer): allows the process to continue execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages, and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
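To make the difference between total copy and pre-copying concrete, here is a toy simulation of the two phases of pre-copying. Pages are modelled as a dictionary of integers and the set of pages dirtied during the transfer is given as input; both are assumptions of the sketch.

```python
def precopy_migrate(pages, dirtied_during_transfer):
    """Simulate pre-copying; `pages` is updated in place to model the
    writes done by the still running process.

    Returns (copy on the target, number of pages sent while frozen).
    """
    # Phase 1: copy the whole memory space while the process keeps running.
    target = dict(pages)
    for p in dirtied_during_transfer:
        pages[p] += 1  # the process modifies this page during the transfer
    # Phase 2: freeze the process and re-copy only the modified pages.
    frozen_copy = {p: pages[p] for p in dirtied_during_transfer}
    target.update(frozen_copy)
    return target, len(frozen_copy)

pages = {0: 10, 1: 20, 2: 30, 3: 40}
target, sent_while_frozen = precopy_migrate(pages, {1, 3})
```

With total copy all 4 pages would be sent while the process is frozen; here only the 2 pages modified during the transfer are.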


focus on Mosix


MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications


Probabilistic info dissemination

- each node at regular intervals sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
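The two rules above can be sketched as follows. The subset size, the buffer size and the class and function names are assumptions chosen for the example, not Mosix parameters.

```python
import random
from collections import deque

class Node:
    def __init__(self, name, load, buffer_size=3):
        self.name = name
        self.load = load
        # small buffer: only the most recent reports are kept
        self.recent = deque(maxlen=buffer_size)

    def receive(self, sender, load):
        self.recent.append((sender, load))

def disseminate(nodes, subset_size, rng):
    """One interval: every node reports its load to a random subset."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, subset_size):
            peer.receive(node.name, node.load)

nodes = [Node(f"n{i}", load=i) for i in range(6)]
rng = random.Random(1)
for _ in range(4):           # four dissemination intervals
    disseminate(nodes, 2, rng)
```

No node ever holds the state of the whole cluster, so the cost per node stays constant as the number of nodes grows.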


MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.


Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
- Load (based on a unit of a P II-400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw memory used
- Total memory

Look at http://hpc.sissa.it/walrus/mjm for a java applet showing these values over a Mosix cluster


Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm. This algorithm tries to migrate the smallest running process to the node with the most free memory. If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
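A minimal sketch of the basic ushering decision, assuming a simple table of free memory and process sizes in MB. The data layout, the names and the threshold value are hypothetical, and the further migrations between remote nodes are not modelled.

```python
def usher(node, cluster, threshold):
    """cluster: name -> {"free": MB, "procs": {pid: size in MB}}.
    Returns (pid, destination) for the migration performed, or None."""
    me = cluster[node]
    others = [n for n in cluster if n != node]
    if me["free"] >= threshold or not me["procs"] or not others:
        return None  # only the ordinary load balancing is active
    pid = min(me["procs"], key=me["procs"].get)            # smallest process
    dest = max(others, key=lambda n: cluster[n]["free"])   # most free memory
    size = me["procs"][pid]
    if cluster[dest]["free"] < size:
        return None  # making room on a remote node is not modelled here
    del me["procs"][pid]
    me["free"] += size
    cluster[dest]["procs"][pid] = size
    cluster[dest]["free"] -= size
    return pid, dest

cluster = {
    "a": {"free": 10, "procs": {1: 200, 2: 50}},
    "b": {"free": 300, "procs": {}},
    "c": {"free": 120, "procs": {}},
}
moved = usher("a", cluster, threshold=64)
```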


MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted/authenticated/protected, therefore it's better to use a private network for the MNP. (Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)


MOSIX process migration

It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the linux kernel)

- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)


Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.
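The deputy/remote split can be pictured with two small classes. This is a user-space toy under stated assumptions: the class names follow the slide, but the two "system calls" handled and the rule that every call is forwarded are simplifications of this sketch, not a description of the Mosix kernel code.

```python
class Deputy:
    """Stays on the home node and executes system calls for the remote."""
    def __init__(self, home):
        self.home = home

    def syscall(self, name, *args):
        if name == "gethostname":
            return self.home      # the process still appears to run at home
        if name == "write":
            fd, data = args
            return len(data)      # pretend the home node wrote the bytes
        raise NotImplementedError(name)

class Remote:
    """The migrated part: computes locally, redirects system calls."""
    def __init__(self, deputy, running_on):
        self.deputy = deputy
        self.running_on = running_on
        self.forwarded = 0

    def syscall(self, name, *args):
        self.forwarded += 1       # in this sketch every call crosses the link layer
        return self.deputy.syscall(name, *args)

remote = Remote(Deputy("home-node"), running_on="node-7")
host = remote.syscall("gethostname")
written = remote.syscall("write", 1, b"hello")
```

The counter makes the residual dependency visible: every system call costs a round trip to the home node.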


MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its link layer, connected through MNP (Mosix Network Protocol)]


Mosix performance/1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck: a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).


Mosix performance/2

A more general solution to the performance problem of I/O bound processes under migration is, in any case, envisioned to be the use of a cache coherent global file system:
- cache coherent: clients keep their caches synchronized, or only one cache at the server node is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
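The effect of declaring a file system as DFSA can be sketched as a simple routing decision. The function name, the mount point list and the node names are assumptions made for the example.

```python
def io_location(path, dfsa_mounts, home_node, running_node):
    """Return the node where an I/O operation on `path` is executed."""
    for mount in dfsa_mounts:
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            return running_node   # DFSA: I/O is done where the process runs
    return home_node              # otherwise redirected to the Unique Home Node

dfsa_mounts = ["/mfs"]            # hypothetical mount point declared as DFSA
direct = io_location("/mfs/data/x", dfsa_mounts, "home", "node3")
redirected = io_location("/tmp/x", dfsa_mounts, "home", "node3")
```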


MOSIX dfsa (direct file system access)

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its link layer, connected through MNP (Mosix Network Protocol); the remote also has direct DFSA access]


Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level:
- identify
- compare
- reconstruct

A prototype DFSA, which is just a small software layer over the std Linux file system, is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:
- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990, 1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance/1
  • Mosix performance/2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems

centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects

distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs

random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs

Nov 242000 rinnocente 17

Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random

another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain

load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer

neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken

broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes

if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node

Nov 242000 rinnocente 18

Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)

std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time

but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations

Nov 242000 rinnocente 19

Process migrationmechanisms

Nov 242000 rinnocente 20

Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)

no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration

User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch

Nov 242000 rinnocente 21

Process migration factors

transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration

residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration

performanceperformancehow much the migration affects how much the migration affects performanceperformance

complexitycomplexity how much complex the migration is how much complex the migration is

Nov 242000 rinnocente 22

Transparent

Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent

Nov 242000 rinnocente 23

System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is

just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely

transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes

A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent

Nov 242000 rinnocente 24

Process migration parts

execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration

Nov 242000 rinnocente 25

Execution state migration

Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node

Nov 242000 rinnocente 26

Communication statemigration

It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel

during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned

sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the

migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client

Nov 242000 rinnocente 27

Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)
- pre-copying (V System, Theimer): allows the process to continue its execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
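
The pre-copying strategy can be illustrated with a small simulation. It is a sketch under assumptions, not V System code: `precopy_migrate`, `run_step`, `max_rounds` and `freeze_below` are hypothetical names, and repeating the copy for several rounds before freezing is a choice made for this sketch.

```python
def precopy_migrate(pages, run_step, max_rounds=5, freeze_below=2):
    """Copy pages to the target while the process keeps running.

    pages:    dict page_number -> contents on the source node
    run_step: callable that lets the process run for a while and returns
              the set of page numbers it modified in the meantime
    Returns (target_pages, number_of_pages_copied_while_frozen).
    """
    target = {}
    to_send = set(pages)                 # first round: the whole address space
    for _ in range(max_rounds):
        for p in to_send:
            target[p] = pages[p]         # copied while the process still runs
        to_send = run_step()             # pages dirtied during this transfer
        if len(to_send) < freeze_below:
            break
    # freeze the process and re-copy the pages modified during the transfer
    for p in to_send:
        target[p] = pages[p]
    return target, len(to_send)
```

The freeze time depends only on the pages still dirty at the end, not on the size of the whole address space, which is the advantage over total copy.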

focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node at regular intervals sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
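
A minimal sketch of this dissemination scheme, assuming each node exchanges only a load value. The class `InfoNode`, the `fanout` (size of the random subset) and `buffer_size` parameters are hypothetical; the values used by Mosix are not given here.

```python
import random
from collections import OrderedDict

class InfoNode:
    def __init__(self, node_id, buffer_size=4):
        self.node_id = node_id
        self.load = 0.0
        self.buffer_size = buffer_size
        self.known = OrderedDict()       # sender id -> load, oldest entry first

    def receive(self, sender_id, load):
        self.known.pop(sender_id, None)  # refresh: re-insert as most recent
        self.known[sender_id] = load
        while len(self.known) > self.buffer_size:
            self.known.popitem(last=False)   # drop the oldest information

    def tick(self, nodes, fanout=2, rng=random):
        """Called at regular intervals: send own load to a random subset."""
        others = [n for n in nodes if n is not self]
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(self.node_id, self.load)
```

No node ever holds a complete view of the cluster: each one decides on the few recent entries in its buffer, so the cost per node does not grow with the number of nodes.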

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.
When a node is low on memory a different algorithm prevails: memory ushering.
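
The decision each node takes can be sketched as follows. The function `pick_target` and the `min_gain` threshold are hypothetical; the actual Mosix criteria are more elaborate than a plain comparison of load values.

```python
def pick_target(local_load, known_loads, min_gain=1.0):
    """Return the id of a less loaded node worth migrating to, or None.

    known_loads: dict node_id -> last load value heard from that node
    min_gain:    assumed threshold, avoids migrating for tiny differences
    """
    if not known_loads:
        return None
    best = min(known_loads, key=known_loads.get)   # least loaded known node
    if local_load - known_loads[best] >= min_gain:
        return best
    return None
```

For example `pick_target(3.0, {"b": 2.5, "c": 0.5})` selects `"c"`, while a node that is only marginally more loaded than its peers keeps its processes.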

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
- Load (based on a unit of a P II-400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw Memory Used
- Total Memory
Look at http://hpc.sissa.it/walrus/mjm for a java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.
This algorithm would try to migrate the smallest running process to the node with the most free memory.
If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
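
The selection rule of memory ushering can be sketched as below. The function `usher` and its arguments are hypothetical, and the step in which processes are moved between remote nodes to make room is deliberately left out of the sketch.

```python
def usher(free_mem, threshold, procs, free_on_nodes):
    """Pick (pid, target) when free memory is under the threshold, else None.

    procs:         dict pid -> memory size of each running local process
    free_on_nodes: dict node_id -> free memory reported by the other nodes
    """
    if free_mem >= threshold or not procs or not free_on_nodes:
        return None                      # load balancing stays in charge
    pid = min(procs, key=procs.get)      # smallest running process
    target = max(free_on_nodes, key=free_on_nodes.get)   # most free memory
    if free_on_nodes[target] < procs[pid]:
        return None                      # no room anywhere: not modelled here
    return pid, target
```

With 10 units free against a threshold of 64, `usher(10, 64, {1: 200, 2: 50}, {"n1": 100, "n2": 300})` picks process 2 and node `"n2"`.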

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
- UDP to transfer load info
- TCP to migrate processes
These communications are not encrypted/authenticated/protected, therefore it's better to use a private network for the MNP.
(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)
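
To give an idea of what a load info datagram could carry, here is a sketch that packs the info variables into a fixed-size message. The layout, the `LoadInfo` tuple and the `FORMAT` string are assumptions made for the example and are NOT the real Mosix wire format.

```python
import struct
from collections import namedtuple

# Hypothetical layout, not the real MOSIX protocol: load, number of
# processors, processor speed, memory used, raw memory used, total memory
LoadInfo = namedtuple(
    "LoadInfo", "load ncpus speed mem_used raw_mem_used mem_total")
FORMAT = "!IHHIII"    # network byte order, fixed size, no padding

def encode(info):
    """Serialize a LoadInfo into the bytes of one datagram."""
    return struct.pack(FORMAT, *info)

def decode(datagram):
    """Rebuild a LoadInfo from the bytes of one datagram."""
    return LoadInfo(*struct.unpack(FORMAT, datagram))
```

Such a message could be sent with a standard UDP socket, e.g. `socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(encode(info), (host, port))`. Nothing in it is authenticated, which illustrates why a private network is advisable.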

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).
The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls shadow).
The migrated process is called the remote.
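
The deputy/remote split can be pictured with a toy model in which every system call of the migrated process is answered on the home node. The classes and the two sample calls are hypothetical; in Mosix the redirection happens inside the kernel and travels over the network.

```python
class Deputy:
    """Stays on the Unique Home Node and executes system calls there."""
    def __init__(self, home):
        self.home = home

    def syscall(self, name, *args):
        # toy "kernel": every call is answered in the home node's context
        if name == "gethostname":
            return self.home
        if name == "write":
            return len(args[0])     # pretend the bytes were written at home
        raise NotImplementedError(name)

class Remote:
    """The migrated part: computes locally, forwards its system calls."""
    def __init__(self, deputy, running_on):
        self.deputy = deputy
        self.running_on = running_on

    def syscall(self, name, *args):
        # the "link layer": here a plain method call instead of a network hop
        return self.deputy.syscall(name, *args)
```

A remote running on another node still sees the host name of its home node, which is what makes the migration transparent; it is also why every system call pays for a round trip to the home node.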

MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each above a link layer; the two link layers are connected by the MNP (Mosix Network Protocol)]

Mosix performance 1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck: a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O bound processes under migration is anyway envisioned to be the use of a cache coherent global file system:
- cache coherent: clients keep their caches synchronized, or only 1 cache at the server node is kept
- global: every file can be seen from every node with the same name
If such a file system can be used, then I/O operations can be executed directly from the running node without resorting to redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system supports Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
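
The effect of declaring a file system as DFSA can be sketched as a routing decision. The function `where_to_run_io` and the idea of a list of DFSA mount points are assumptions made for the illustration, not the Mosix interface.

```python
def where_to_run_io(path, dfsa_mounts, home_node, running_node):
    """Return the node on which an I/O operation on `path` is performed."""
    for mount in dfsa_mounts:
        root = mount.rstrip("/")
        if path == root or path.startswith(root + "/"):
            return running_node      # direct file system access
    return home_node                 # redirected to the Unique Home Node
```

With `/mfs` declared as DFSA, an access to `/mfs/data/x` is served on the running node, while `/etc/passwd` is still redirected to the home node.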

MOSIX dfsa (direct file system access)

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each above a link layer, connected by the MNP (Mosix Network Protocol), with an additional path labelled DFSA access]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.
A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSA are:
- GFS (Global File System)
- xFS (Serverless File System)
The integration of Mosix with GFS now seems to be stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990, 1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems
Nov 242000 rinnocente 26

Communication statemigration

It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel

during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned

sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the

migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client

Nov 242000 rinnocente 27

Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact

Nov 242000 rinnocente 28

Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space

is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))

prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time

lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred

shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)

Nov 242000 rinnocente 29

focus on Mosix

Nov 242000 rinnocente 30

MOSIX important points

Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications

Nov 242000 rinnocente 31

Probabilistic info dissemination

each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes

each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received

Nov 242000 rinnocente 32

MOSIX load balancing algorithm

Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that

each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering

Nov 242000 rinnocente 33

Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster

Nov 242000 rinnocente 34

Memory ushering algorithm

Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes

to create sufficient space on a single nodeto create sufficient space on a single node

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

  • UDP to transfer load info
  • TCP to migrate processes

These communications are not encrypted/authenticated/protected, therefore it's better to use a private network for the MNP.

(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)
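To make the UDP side concrete, here is a hedged Python sketch of how a load-info record could be packed into a single datagram. The field list follows the "Mosix info variables" slide, but the wire layout (six unsigned 32-bit integers in network byte order), the names LoadInfo, encode, decode and send_load_info are all assumptions for illustration; the real MNP format is not described in these slides.

```python
import socket
import struct
from typing import NamedTuple


class LoadInfo(NamedTuple):
    load: int
    ncpus: int
    speed: int
    mem_used: int
    raw_mem_used: int
    total_mem: int


# Hypothetical layout: six unsigned 32-bit integers, network byte order.
_WIRE = struct.Struct("!6I")


def encode(info):
    """Pack a LoadInfo record into the bytes of one datagram."""
    return _WIRE.pack(*info)


def decode(datagram):
    """Unpack the bytes of one datagram back into a LoadInfo record."""
    return LoadInfo(*_WIRE.unpack(datagram))


def send_load_info(info, host, port):
    """Fire and forget: a lost datagram is simply replaced by the next one."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(encode(info), (host, port))
```

UDP fits this kind of traffic because the info is refreshed periodically, while a process migration needs the reliable byte stream that TCP provides.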

MOSIX process migration

  • It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
  • fully transparent: the process involved doesn't have any need to know about the migration
  • strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency at the price of some efficiency, interposing to system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.
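The deputy/remote split can be pictured with a toy Python model. It is a sketch under stated assumptions: the classes Deputy, LinkLayer and Remote, the two modelled calls and the in-memory "files" dictionary are hypothetical stand-ins, and the link layer is a plain function call instead of a network connection.

```python
class Deputy:
    """Skeleton left on the Unique Home Node; executes the calls there."""

    def __init__(self, home_node):
        self.home_node = home_node
        self.files = {}  # toy stand-in for state that lives on the home node

    def execute(self, call, *args):
        if call == "gethostname":
            return self.home_node
        if call == "write":
            path, data = args
            self.files[path] = self.files.get(path, "") + data
            return len(data)
        raise NotImplementedError(call)


class LinkLayer:
    """Stands in for the network path; here it is just a direct call."""

    def __init__(self, deputy):
        self.deputy = deputy
        self.forwarded = 0  # each forwarded call would cost a round trip

    def forward(self, call, *args):
        self.forwarded += 1
        return self.deputy.execute(call, *args)


class Remote:
    """The migrated process: computes locally, redirects system calls."""

    def __init__(self, running_node, link):
        self.running_node = running_node
        self.link = link

    def syscall(self, call, *args):
        return self.link.forward(call, *args)
```

The process still sees its home node's name and files, which is the transparency; the forwarded counter is the price, one round trip per system call.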

MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its link layer, connected through the MNP (Mosix Network Protocol).]

Mosix performance 1

  • While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.
  • A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:

  • cache coherent: clients keep their caches synchronized, or only 1 cache at the server node is kept
  • global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system supports Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
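The routing decision described above can be sketched as a simple lookup: I/O on a file system declared as Direct File System Access runs on the running node, everything else is redirected to the home node. The function names and the mount point used in the example are hypothetical; how Mosix actually marks a file system is not shown in these slides.

```python
def on_dfsa(path, dfsa_mounts):
    """True if path is, or lies under, a mount point declared as DFSA."""
    for mount in dfsa_mounts:
        base = mount.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return True
    return False


def io_site(path, dfsa_mounts):
    """Where an I/O call of a migrated process would be executed."""
    return "running node" if on_dfsa(path, dfsa_mounts) else "home node"
```

For example, with a hypothetical DFSA mount list ["/mfs"], io_site("/mfs/node3/data", ["/mfs"]) gives "running node", while a path under /home is still served through the home node.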

MOSIX dfsa (direct file system access)

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its link layer, connected through the MNP (Mosix Network Protocol); the remote also has a direct DFSA access path.]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.

A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSA are:

  • GFS (Global File System)
  • xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.


Sprite (Douglis/Ousterhout 1987)

  • Experimental network OS developed at UCB
  • Was running on Suns and DECstations
  • The process migration in this system introduced the simplifying concept of Home Node of a process (the originating node where it should still appear to run)
  • Since then this concept has been utilized by many other implementations

(Ousterhout: of Tcl/Tk fame)

Charlotte (Artsy, Finkel 1989)

It's a Distributed Operating System developed at the UoWM, based on the message passing paradigm.

It provides a well-developed process migration mechanism that leaves no residual dependencies (the migrated process can survive the death of the originating node) and is therefore suitable also for fault tolerant systems.

Accent (Zayas 1987)

  • developed at CMU as precursor of Mach
  • a kernel not Unix compatible, with IPCs based on a "port" abstraction
  • process migration has been added by Zayas (1987)
  • no load balancing

Amoeba (Tanenbaum et al. 1990)

  • Tanenbaum, van Renesse, van Staveren, Sharp, Mullender, Jansen, van Rossum (Experiences with the Amoeba Distributed Operating System, 1990)
  • Zhu, Steketee (An experimental study of Load Balancing on Amoeba, 1994)
  • doesn't support virtual memory
  • was based on the processor-pool model, which assumes the number of processors is greater than the number of processes

(van Rossum: of Python fame)

V system (Theimer 1987)

  • developed at Stanford
  • microkernel based
  • runs on a set of diskless workstations
  • V has support for both remote execution and process migration; the facilities were added with the aim of utilizing workstations during idle periods

MOSIX (Barak 1990-1995)

  • its roots are back in the mid '80s
  • MOS: multicomputer OS (Barak, HUJI 1985)
  • set of enhancements to BSD/OS: MOSIX '90
  • port to Linux (Mosix 7th edition) around '95
  • the system call based mgmt has been transformed to a /proc filesystem interface
  • enhancements '98 (memory ushering algorithm)

Condor (Litzkow/Livny 1988)

Condor is not a Distributed OS. It is a batch system whose purpose is the utilization of the idle time of workstations and PCs.

It can migrate jobs from machine to machine, but imposes restrictions on the jobs that can be migrated, and the programs have to be linked with a special checkpoint library.

(More details on httphpcsissaitbatch)

Load balancing mechanisms

Load balancing methods taxonomy

  • job initiation only (aka job placement, aka remote execution)
      • static
      • dynamic (almost all batch systems can do this: NQS and derivatives)
  • migration (Mosix, some Distributed OS)
  • job initiation + migration

Load balancing algorithms

Two main categories:

  • Non pre-emptive: they are used only when a new job has to be started; they decide where to run a new job (aka job placement, many batch systems)
  • Pre-emptive: jobs can be migrated during execution, therefore load balancing can be dynamically applied during job execution (Distributed OS, Condor)

Non pre-emptive load balancing (aka job placement)

  • centralized initiation: there is only 1 load balancer in the system, in charge of assigning new jobs to computers according to the info it collects
  • distributed initiation: there is a load balancer on each computer that sends its load info to the others any time it detects a load change; the local balancer decides, according to the load info received, where to start new jobs
  • random initiation: when a new job arrives the local load balancer randomly selects the computer to which the job is assigned
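The three placement policies above can be compared side by side in a short Python sketch. All names (centralized_placement, random_placement, LocalBalancer) are invented for the illustration, and "load" is just a number supplied by the caller.

```python
import random


def centralized_placement(loads):
    """The single balancer picks the least loaded computer it knows about."""
    return min(loads, key=loads.get)


def random_placement(computers, rng=random):
    """Random initiation: no load info is needed at all."""
    return rng.choice(list(computers))


class LocalBalancer:
    """Distributed initiation: every computer keeps the loads announced to it."""

    def __init__(self, name, load=0):
        self.name = name
        self.load = load
        self.known = {name: load}

    def set_load(self, load, peers):
        # A load change is announced to the other balancers right away.
        self.load = load
        self.known[self.name] = load
        for peer in peers:
            peer.known[self.name] = load

    def place(self):
        """Decide locally, from the info received, where a new job starts."""
        return min(self.known, key=self.known.get)
```

The centralized variant needs one collection point, the distributed one pays with an announcement per load change, and the random one trades placement quality for zero bookkeeping.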

Pre-emptive load balancing

  • random: when a computer becomes overloaded it selects at random another computer to which to transfer a process
  • central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance it selects a computer with a low load to which a process from an overloaded computer is migrated
  • neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when the request is accepted a pair is constituted. If they detect a load imbalance between them they migrate some processes from one to the other; after that the pair is broken
  • broadcast: server based
  • distributed: each node receives a measure of the load of other nodes; if it discovers a load imbalance it tries to migrate some of its processes to a less loaded node
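As an illustration of the neighbour scheme, the following Python sketch balances one pair of computers. It assumes, only for the example, that load is measured as the number of processes and that a difference of up to `threshold` is tolerated; neither detail comes from the cited paper.

```python
def balance_pair(a, b, threshold=1):
    """Move processes from the busier list to the other one.

    a, b: lists of process ids on the two paired computers.
    Returns the number of processes migrated; after this the pair is broken.
    """
    src, dst = (a, b) if len(a) >= len(b) else (b, a)
    moved = 0
    while len(src) - len(dst) > threshold:
        dst.append(src.pop())
        moved += 1
    return moved
```

Usage: with a = [1, 2, 3, 4, 5] and b = [6], balance_pair(a, b) migrates processes until the two lists hold three processes each.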

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in only 1 paper (Mehra & Wah 1993):

  • std Unix: number of tasks in the run queue
  • size of the free available memory
  • rate of CPU context switches
  • rate of system calls
  • 1 minute load avg
  • amount of CPU free time

but the conclusion of many studies indicates that the simple run queue length is a better measure than complex load criteria in most situations.
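On Linux two of the measures listed above, the 1 minute load average and the number of runnable tasks, can be read from /proc/loadavg. The Python sketch below parses that file's format; the function names are chosen for the example, and run_queue_length only works on systems that provide /proc/loadavg.

```python
def parse_loadavg(line):
    """Parse one line in the format of Linux /proc/loadavg.

    Fields: 1, 5 and 15 minute load averages, then runnable/total
    scheduling entities, then the most recently created PID.
    """
    fields = line.split()
    load1, load5, load15 = (float(f) for f in fields[:3])
    runnable, total = (int(f) for f in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "runnable": runnable,
        "total": total,
    }


def run_queue_length(path="/proc/loadavg"):
    """Number of currently runnable tasks on this machine (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())["runnable"]
```

For a portable 1/5/15 minute load average, Python's os.getloadavg() returns the same three numbers on Unix systems.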

Process migration mechanisms

User/kernel level migration

  • Kernel level process migration (Mosix)
      • no special requirements, no need to recompile
      • transparent migration
  • User level process migration (Condor)
      • compilation with special libraries (libckpt.a)
      • a snapshot of the running process is taken on signal
      • the snapshot is restarted on another machine
      • more details at httphpcsissaitbatch

Process migration factors

  • transparency: how much a process needs to be aware of the migration
  • residual dependency: how much a process is dependent on the source node after migration
  • performance: how much the migration affects performance
  • complexity: how complex the migration is

Transparent

  • Location transparent
  • Access transparent (or network)
  • Migration transparent
  • Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:

  • Very easy on message based OSs, in which a system call is just a message to the kernel
  • A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
  • A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

  • execution state migration
  • memory space migration
  • communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task and usually it is avoided, forcing the migrated process to redirect networking operations to the Home Node where they are performed.

On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:

  • during migration any message received for a suspended process is rejected and a "process migrating" response is returned
  • the sending machine will retransmit after some delay
  • the sending machine, using a special protocol (FLIP), will locate the migrated process
  • the same technique is used for reply msgs to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation (Bubak, Zbik et al. 1999) requires:

  • substitution of the socket library
  • relinking of programs with the new library

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint then should consult the server for the new real address to contact.
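The virtual address idea can be modelled in Python as follows. This is a toy sketch, not the cited implementation: AddressServer, send, the MIGRATING marker and the dictionary of "hosts" (real address mapped to a handler function) are all hypothetical, and real sockets are replaced by direct function calls.

```python
MIGRATING = "process migrating"  # reply given by the stub left behind


class AddressServer:
    """Keeps the correspondence table: virtual address -> real address."""

    def __init__(self):
        self.table = {}

    def register(self, vaddr, real):
        self.table[vaddr] = real

    def lookup(self, vaddr):
        return self.table[vaddr]


def send(vaddr, msg, cache, server, hosts):
    """Send msg to a virtual address, following the endpoint if it moved.

    cache: the sender's own virtual -> real mapping (may be stale).
    hosts: real address -> callable(msg), standing in for the network.
    """
    real = cache.get(vaddr)
    if real is None:
        real = server.lookup(vaddr)
    reply = hosts[real](msg)
    if reply == MIGRATING:
        # The stub answered: ask the server for the new real address.
        real = server.lookup(vaddr)
        reply = hosts[real](msg)
    cache[vaddr] = real
    return reply
```

The sender only pays the extra lookup on the first message after a migration; afterwards its cache points at the new node again.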

Memory space migration

  • total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds, and it can also transfer memory that will not be used)
  • pre-copying (V System, Theimer): allows the process to continue the execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
  • lazy-copying or demand page (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
  • shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
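Two of the strategies above, pre-copying and lazy copying, can be contrasted with a small Python model in which memory is a dictionary from page number to content. The function and class names are made up for the sketch, and the writes made "during the transfer" are passed in as a parameter because nothing really runs concurrently here.

```python
def precopy_migrate(source, writes_during_copy):
    """Pre-copying: copy while the process runs, then freeze and fix up.

    source: page number -> content on the source node (updated in place).
    writes_during_copy: pages the process modifies during the first pass.
    Returns (pages on the target, pages re-copied while frozen).
    """
    target = dict(source)               # pass 1: process keeps running
    source.update(writes_during_copy)   # ...and dirties some pages meanwhile
    dirty = set(writes_during_copy)
    for page in dirty:                  # pass 2: process frozen, short pause
        target[page] = source[page]
    return target, len(dirty)


class LazyTarget:
    """Lazy copying: a page crosses the network only when first touched."""

    def __init__(self, source):
        self.source = source   # pages left behind on the source node
        self.local = {}
        self.faults = 0

    def read(self, page):
        if page not in self.local:
            self.faults += 1               # page fault: fetch from source
            self.local[page] = self.source[page]
        return self.local[page]
```

Pre-copying keeps the freeze time proportional to the pages dirtied during the transfer, while lazy copying starts the process immediately but leaves it dependent on the source for every untouched page.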


focus on Mosix

MOSIX important points

  • Probabilistic info dissemination
  • PPM (Pre-emptive Process Migration)
  • Dynamic load balancing
  • Memory ushering
  • Decentralized control
  • Efficient kernel communications

Probabilistic info dissemination

  • each node at regular intervals sends info on its resources to a random subset of nodes
  • each node at the same time keeps a small buffer with the most recently received info
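A minimal Python simulation of the two rules above. The class GossipNode, the fanout and buffer_size parameters and the use of a single number as "info" are assumptions made for the example; the slides give no sizes or intervals.

```python
import random
from collections import deque


class GossipNode:
    def __init__(self, name, load, buffer_size=4):
        self.name = name
        self.load = load
        # Small buffer: old entries fall out as newer info arrives.
        self.recent = deque(maxlen=buffer_size)

    def receive(self, sender, load):
        self.recent.append((sender, load))

    def tick(self, others, fanout, rng):
        """At each interval, send own info to a random subset of nodes."""
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(self.name, self.load)


def run_round(nodes, fanout, rng):
    """One interval: every node disseminates its info once."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        node.tick(others, fanout, rng)
```

Each node sends only `fanout` messages per interval and stores only `buffer_size` entries, so the cost per node does not grow with the size of the cluster; the price is that every node works from partial, slightly stale information.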

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.
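The decision each node takes can be sketched as below. It is a simplification under stated assumptions: the function name choose_target and the min_gain margin are invented for the example, loads are plain numbers, and the real algorithm's handling of processor speed and migration cost is not modelled.

```python
def choose_target(own_load, recent_info, min_gain=1):
    """Pick a less loaded node from the info this node currently holds.

    recent_info: iterable of (node name, load) pairs received recently.
    Returns the name of the least loaded known node, or None when no
    known node is less loaded by at least min_gain (hypothetical margin).
    """
    best = min(recent_info, key=lambda entry: entry[1], default=None)
    if best is None or own_load - best[1] < min_gain:
        return None
    return best[0]
```

Usage: choose_target(5, [("a", 4), ("b", 2)]) selects "b", while a node with load 2 that only knows of ("a", 4) keeps its processes.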


Charlotte (Artsy, Finkel 1989)

It's a Distributed Operating System developed at the UoWM, based on the message passing paradigm.
It provides a well-developed process migration mechanism that leaves no residual dependencies (the migrated process can survive the death of the originating node) and is therefore also suitable for fault tolerant systems.

Accent (Zayas 1987)

- developed at CMU as precursor of Mach
- a kernel, not Unix compatible, with IPCs based on a "port" abstraction
- process migration has been added by Zayas (1987)
- no load balancing

Amoeba (Tanenbaum et al. 1990)

- Tanenbaum, van Renesse, van Staveren, Sharp, Mullender, Jansen, van Rossum* ("Experiences with the Amoeba Distributed Operating System", 1990)
- Zhu, Steketee ("An experimental study of Load Balancing on Amoeba", 1994)
- doesn't support virtual memory
- was based on the processor-pool model, which assumes the # of processors is more than the # of processes

* of Python fame

V system (Theimer 1987)

- developed at Stanford
- microkernel based
- runs on a set of diskless workstations
- V has support for both remote execution and process migration; the facilities were added with the aim of utilizing workstations during idle periods

MOSIX (Barak 1990, 1995)

- its roots are back in the mid '80s: MOS, multicomputer OS (Barak, HUJI 1985)
- set of enhancements to BSD/OS: MOSIX '90
- port to LINUX (Mosix 7th edition) around '95
- the system call based management has been transformed to a /proc filesystem interface
- enhancements '98 (memory ushering algorithm)

Condor (Litzkow, Livny 1988)

Condor is not a Distributed OS. It is a batch system whose purpose is the utilization of the idle time of workstations and PCs.
It can migrate jobs from machine to machine, but it imposes restrictions on the jobs that can be migrated, and the programs have to be linked with a special checkpoint library.
(More details on http://hpc.sissa.it/batch)

Load balancing mechanisms

Load balancing methods taxonomy

- job initiation only (aka job placement, aka remote execution)
  - static
  - dynamic (almost all batch systems can do this: NQS and derivatives)
- migration (Mosix, some Distributed OS)
- job initiation + migration

Load balancing algorithms

Two main categories:
- Non pre-emptive: they are used only when a new job has to be started; they decide where to run a new job (aka job placement, many batch systems)
- Pre-emptive: jobs can be migrated during execution, therefore load balancing can be dynamically applied during job execution (Distributed OS, Condor)

Non pre-emptive load balancing (aka job placement)

- centralized initiation: there is only 1 load balancer in the system, in charge of assigning new jobs to computers according to the info it collects
- distributed initiation: there is a load balancer on each computer that transfers its load to the others any time it detects a load change; the local balancer decides, according to the load info received, where to start new jobs
- random initiation: when a new job arrives, the local load balancer selects at random the computer to which the new job is assigned
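The difference between centralized and random initiation can be sketched in a few lines of Python. This is only an illustration: the function names and the `loads` table are assumptions made for the example and do not come from any batch system.

```python
import random

def centralized_placement(loads):
    """Central balancer: start the new job on the node with the lowest known load.

    `loads` is a hypothetical table {node name: load} collected by the balancer.
    """
    return min(loads, key=loads.get)

def random_placement(nodes, rng=random):
    """Random initiation: pick any node, ignoring load information."""
    return rng.choice(list(nodes))

loads = {"node1": 2.5, "node2": 0.4, "node3": 1.1}
target = centralized_placement(loads)
```

Distributed initiation would run the same `centralized_placement` decision on every node, each using the load table it has received from the others.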

Pre-emptive load balancing

- random: when a computer becomes overloaded, it selects at random another computer to which to transfer a process
- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance, it selects a computer with a low load and migrates to it a process from an overloaded computer
- neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when the request is accepted a pair is constituted. If they detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken
- broadcast:
  - server based
  - distributed: each node receives a measure of the load of the other nodes; if it discovers a load imbalance, it tries to migrate some of its processes to a less loaded node

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):
- std Unix: # of tasks in the run queue
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1 minute load avg
- amount of CPU free time

but the conclusion of many studies is that the simple run queue length is a better measure than complex load criteria in most situations.
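On Linux, two of these metrics (the load averages and the number of runnable tasks) are exposed in a single line of /proc/loadavg. The sketch below parses such a line; the function name and the sample line are assumptions made for the example, while the field layout is the one documented for /proc/loadavg.

```python
def parse_loadavg(line):
    """Parse one line in /proc/loadavg format.

    Fields: 1, 5 and 15 minute load averages, runnable/total scheduling
    entities, and the PID of the most recently created process.
    """
    fields = line.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total": int(total),
    }

sample = "0.20 0.18 0.12 1/80 11206"   # made-up sample line
info = parse_loadavg(sample)
```

On a real node the line would be read with `open("/proc/loadavg").read()`.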

Process migration mechanisms

User/kernel level migration

- Kernel level process migration (Mosix)
  - no special requirements, no need to recompile
  - transparent migration
- User level process migration (Condor)
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on signal
  - the snapshot is restarted on another machine
  - more details at http://hpc.sissa.it/batch

Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process is dependent on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
- Very easy on message based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task, and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:
- during migration, any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply msgs to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions.
A proposed user-level implementation (Bubak, Zbik et al. 1999) requires:
- substitution of the socket library
- relinking of programs with the new library

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint should then consult the server for the new real address to contact.
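The virtual address scheme can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the implementation of the cited proposal: the class names, `send`, `migrate` and the sender-side `cache` are all hypothetical, and no real sockets are involved.

```python
class AddressServer:
    """Keeps the correspondence table: virtual address -> real node."""
    def __init__(self):
        self.table = {}

    def register(self, vaddr, real):
        self.table[vaddr] = real

    def lookup(self, vaddr):
        return self.table[vaddr]

class Node:
    def __init__(self):
        self.endpoints = set()   # virtual addresses currently living here
        self.stubs = set()       # virtual addresses that migrated away

    def deliver(self, vaddr, msg):
        if vaddr in self.endpoints:
            return "delivered"
        if vaddr in self.stubs:
            return "process migrating"
        return "unknown"

def migrate(server, nodes, vaddr, src, dst):
    """Move an endpoint, leave a stub behind and update the server table."""
    nodes[src].endpoints.discard(vaddr)
    nodes[src].stubs.add(vaddr)
    nodes[dst].endpoints.add(vaddr)
    server.register(vaddr, dst)

def send(server, nodes, cache, vaddr, msg):
    """Send via the cached real address; on 'process migrating' ask the server."""
    real = cache.get(vaddr) or server.lookup(vaddr)
    reply = nodes[real].deliver(vaddr, msg)
    if reply == "process migrating":
        real = server.lookup(vaddr)
        reply = nodes[real].deliver(vaddr, msg)
    cache[vaddr] = real
    return reply, real
```

A sender that cached the old real address gets the stub's "process migrating" answer once, consults the server, and then reaches the endpoint on its new node.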

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)
- pre-copying (V System, Theimer): allows the process to continue its execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand page (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
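The pre-copying strategy can be sketched as a two pass copy over a page table. The function below is a simplified simulation written for this purpose (memory is a dict of pages, and the writes made while the first pass is running are passed in as a list); it is not taken from the V System.

```python
def pre_copy_migrate(memory, writes_during_copy):
    """Simulate pre-copy migration of a process memory space.

    memory: {page number: content} on the source node.
    writes_during_copy: [(page, content)] writes the still running process
    performs while the first pass is in progress (an assumption of this sketch).
    Returns the memory image on the target and the set of re-copied pages.
    """
    # First pass: copy every page while the process keeps running.
    target = dict(memory)

    # Meanwhile the process dirties some pages on the source.
    dirty = set()
    for page, content in writes_during_copy:
        memory[page] = content
        dirty.add(page)

    # Second pass: the process is frozen and only dirty pages are re-copied.
    for page in dirty:
        target[page] = memory[page]
    return target, dirty
```

The freeze time depends only on the number of dirty pages, which is the reason for preferring pre-copying over a total copy.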

focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node, at regular intervals, sends info on its resources to a random subset of nodes
- each node, at the same time, keeps a small buffer with the most recently received info
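A minimal simulation of this scheme is shown below. The class name, the `fanout` parameter and the buffer size are assumptions chosen for the example; the real MOSIX values and message format are not given in these slides.

```python
import random
from collections import OrderedDict

class InfoNode:
    def __init__(self, name, load, buffer_size=3):
        self.name = name
        self.load = load
        self.buffer_size = buffer_size
        self.known = OrderedDict()   # sender -> load, oldest entry first

    def receive(self, sender, load):
        """Store the newest info and drop the oldest entry when the buffer is full."""
        self.known.pop(sender, None)
        self.known[sender] = load
        while len(self.known) > self.buffer_size:
            self.known.popitem(last=False)

def dissemination_round(nodes, fanout, rng):
    """Every node sends its load to `fanout` randomly chosen other nodes."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(node.name, node.load)
```

No node has a complete view of the cluster: each one decides using only the few recent entries in its buffer, which keeps the cost per node independent of the cluster size.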

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node tries to find a less loaded node to which it can migrate one of its processes.
When a node is low on memory a different algorithm prevails: memory ushering.
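The decision taken on each node can be sketched as follows. The function name and the `threshold` parameter are assumptions introduced for the example (some minimum load difference is needed to avoid migrating back and forth); the actual MOSIX criteria are not described in these slides.

```python
def choose_migration_target(local_load, known_loads, threshold=1.0):
    """Return the least loaded known node, or None if migrating is not worth it.

    known_loads: {node name: load} as received from the other nodes.
    threshold: hypothetical minimum load difference that justifies a migration.
    """
    if not known_loads:
        return None
    best = min(known_loads, key=known_loads.get)
    if local_load - known_loads[best] > threshold:
        return best
    return None
```

For example, a node with load 3.0 that knows about `{"n2": 0.5, "n3": 2.9}` would select `n2`, while a node with load 1.0 would keep its processes.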

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
- Load (based on a unit of a PII-400MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw Memory Used
- Total Memory

Look at http://hpc.sissa.it/walrus/mjm for a java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.
This algorithm tries to migrate the smallest running process to the node with the most free memory.
Eventually processes are migrated between remote nodes to create sufficient space on a single node.

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP.
(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat)

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).
The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls shadow).
The migrated process is called the remote.
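The redirection can be pictured with a toy model in which the remote forwards every system call to its deputy. The classes, the two example calls and the in-memory "file system" are assumptions made for the illustration; in Mosix the forwarding is done inside the kernel, over the network.

```python
class Deputy:
    """Skeleton of the process left on the Unique Home Node."""
    def __init__(self, home_node):
        self.home_node = home_node
        self.files = {}               # stands in for the home node file system

    def syscall(self, name, *args):
        if name == "gethostname":
            return self.home_node     # the process still sees its home node
        if name == "write":
            path, data = args
            self.files[path] = self.files.get(path, "") + data
            return len(data)
        raise NotImplementedError(name)

class Remote:
    """Migrated part of the process, running on another node."""
    def __init__(self, deputy, running_node):
        self.deputy = deputy
        self.running_node = running_node

    def syscall(self, name, *args):
        # Link layer: the call is executed by the deputy, not on the running node.
        return self.deputy.syscall(name, *args)
```

A process migrated from `node1` to `node2` would still get `node1` from `gethostname`, and its writes would land on `node1`, which is why I/O bound processes pay a price for migration.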

MOSIX deputy/remote mechanism

[Figure: the Deputy and its link layer on the UHN (Unique Home Node), the Remote and its link layer on the running node, connected by the MNP (Mosix Network Protocol)]

Mosix performance/1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the bottleneck of a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance/2

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:
- cache coherent: clients keep their caches synchronized, or only 1 cache at the server node is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access (DFSA) one. For such file systems Mosix will perform I/O operations directly on the running node.

MOSIX dfsa (direct file system access)

[Figure: the Deputy and its link layer on the UHN (Unique Home Node), the Remote and its link layer on the running node, connected by the MNP (Mosix Network Protocol); DFSA access takes place directly on the running node]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level:
- identify
- compare
- reconstruct

A prototype DFSA, which is just a small software layer over the std Linux file system, is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:
- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

Page 8: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 8

Accent (Zayas 1987)

developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs

based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by

Zayas(1987)Zayas(1987) no load balancing no load balancing

Nov 242000 rinnocente 9

Amoeba (Tanenbaum et al 1990)

Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)

ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)

doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of

processors is more than the of processesprocessors is more than the of processes

of python fame of python fame

Nov 242000 rinnocente 10

V system (Theimer 1987)

developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks

V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods

Nov 242000 rinnocente 11

MOSIX (Barak 19901995)

its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95

the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface

enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)

Nov 242000 rinnocente 12

Condor (LitzkowLivny 1988)

Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))

Nov 242000 rinnocente 13

Load balancingmechanisms

Nov 242000 rinnocente 14

Load balancing methods taxonomy

job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this

NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration

Nov 242000 rinnocente 15

Load balancing algorithms

Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new

job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)

PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)

Nov 242000 rinnocente 16

Non pre-emptive load balancing (aka job placement)

centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects

distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs

random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs

Pre-emptive load balancing

  • random: when a computer becomes overloaded, it selects at random another computer to which to transfer a process
  • central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance, it selects a computer with a low load to which to migrate a process from an overloaded computer
  • neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when the request is accepted, a pair is formed. If the two computers detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken
  • broadcast
  • server based
  • distributed: each node receives a measure of the load of the other nodes; if it discovers a load imbalance, it tries to migrate some of its processes to a less loaded node
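
As an illustration of the neighbour scheme, the sketch below balances one pair of computers. It assumes that load is simply the number of processes and that the pair stops migrating when the two counts differ by at most one; both choices are assumptions for the example, not the rule of the cited paper.

```python
def balance_pair(a, b):
    """Migrate processes between two paired computers.

    a and b are lists of process names. Processes are moved from the
    more loaded list to the other until the lengths differ by at most
    one. Returns the list of migrated processes.
    """
    src, dst = (a, b) if len(a) > len(b) else (b, a)
    moved = []
    while len(src) - len(dst) > 1:
        proc = src.pop()
        dst.append(proc)
        moved.append(proc)
    return moved


node_a = ["p1", "p2", "p3", "p4", "p5"]
node_b = ["q1"]
migrated = balance_pair(node_a, node_b)  # after this call the pair is broken
```

Here two processes (`p5` and `p4`) move from `node_a` to `node_b`, leaving three processes on each side.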

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in only 1 paper (Mehra & Wah 1993):
  • std Unix # of tasks in the run queue
  • size of the free available memory
  • rate of CPU context switches
  • rate of system calls
  • 1 minute load avg
  • amount of CPU free time
but the conclusion of many studies indicates that the simple run queue length is a better measure than complex load criteria in most situations.
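
On Linux, two of the metrics above (the 1 minute load average and the number of runnable tasks) can be read from `/proc/loadavg`. The sketch below parses one line in that format; the function name and the returned keys are choices made for this example.

```python
def parse_loadavg(text):
    """Parse one line in the Linux /proc/loadavg format.

    The fields are: 1, 5 and 15 minute load averages, then
    runnable/total scheduling entities, then the last PID created.
    """
    one, _five, _fifteen, tasks, _last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load_1min": float(one),
        "runnable": int(runnable),
        "total_tasks": int(total),
    }


sample = parse_loadavg("0.20 0.18 0.12 1/80 11206")
```

On a live Linux system the same function can be fed with `open("/proc/loadavg").read()`; Python also exposes the three load averages on Unix through `os.getloadavg()`.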

Process migration mechanisms

User/kernel level migration

Kernel level process migration (Mosix):
  • no special requirements, no need to recompile
  • transparent migration

User level process migration (Condor):
  • compilation with special libraries (libckpt.a)
  • a snapshot of the running process is taken on signal
  • the snapshot is restarted on another machine
  • more details at http://hpc.sissa.it/batch

Process migration factors

  • transparency: how much a process needs to be aware of the migration
  • residual dependency: how much a process depends on the source node after migration
  • performance: how much the migration affects performance
  • complexity: how complex the migration is

Transparent

  • Location transparent
  • Access transparent (or network)
  • Migration transparent
  • Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
  • Very easy on message based OSs, in which a system call is just a message to the kernel
  • A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
  • A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

  • execution state migration
  • memory space migration
  • communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task, and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel.
  • during migration any message received for a suspended process is rejected and a "process migrating" response is returned
  • the sending machine will retransmit after some delay
  • the sending machine, using a special protocol (FLIP), will locate the migrated process
  • the same technique is used for reply msgs to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation requires:
  • substitution of the socket library
  • relinking of programs with the new library
(Bubak, Zbik et al. 1999)
In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint should then consult the server for the new real address to contact.
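
The virtual address scheme can be mimicked with a toy model in which the "network" is a dictionary from real addresses to handlers. Everything here (`AddressServer`, `send`, the address strings, the single retry) is hypothetical and only illustrates the lookup, stub and re-resolve sequence described above; it is not the interface of the cited proposal.

```python
MIGRATING = "process migrating"


class AddressServer:
    """Keeps the correspondence table between virtual and real addresses."""

    def __init__(self):
        self._table = {}

    def register(self, virtual, real):
        self._table[virtual] = real

    def lookup(self, virtual):
        return self._table[virtual]


def stub(_msg):
    """Left on the original node after the endpoint has migrated."""
    return MIGRATING


def send(network, server, cache, virtual, msg):
    """Send msg to a virtual address, re-resolving once if the peer moved."""
    if virtual not in cache:
        cache[virtual] = server.lookup(virtual)
    reply = network[cache[virtual]](msg)
    if reply == MIGRATING:
        cache[virtual] = server.lookup(virtual)  # ask for the new real address
        reply = network[cache[virtual]](msg)
    return reply


server = AddressServer()
network = {"nodeA:5000": lambda msg: "worker got " + msg}
server.register("v1", "nodeA:5000")
cache = {}
first = send(network, server, cache, "v1", "hello")

# The endpoint migrates from nodeA to nodeB; a stub stays behind on nodeA.
network["nodeB:5000"] = network["nodeA:5000"]
network["nodeA:5000"] = stub
server.register("v1", "nodeB:5000")
second = send(network, server, cache, "v1", "again")
```

The sender keeps using the virtual address `v1` throughout; only its cached real address changes after the stub answers.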

Memory space migration

  • total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds, and furthermore it can transfer memory that will not be used)
  • pre-copying (V System, Theimer): allows the process to continue its execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer
  • lazy-copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation, used by MOSIX, is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
  • shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
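
A minimal simulation of the pre-copying idea follows. Memory is modelled as a dictionary of pages, and the running process is modelled as a list of writes that happen while the first pass is in progress (at most one write per copied page). These modelling choices, and the single pre-copy pass, are assumptions made for the sketch.

```python
def pre_copy(source, writes_during_copy):
    """Copy all pages while the process runs, then freeze and re-copy.

    source: dict page number -> contents, mutated by the running process.
    writes_during_copy: list of (page, value) writes performed by the
    process during the first pass.
    Returns (target memory, set of pages that had to be re-copied).
    """
    target = {}
    dirty = set()
    pending = list(writes_during_copy)

    # First pass: the process keeps running and may dirty pages.
    for page in sorted(source):
        target[page] = source[page]
        if pending:
            written_page, value = pending.pop(0)
            source[written_page] = value
            dirty.add(written_page)

    # The process is now frozen: no more writes. Re-copy the dirty pages.
    for page in dirty:
        target[page] = source[page]
    return target, dirty


memory = {0: "a", 1: "b", 2: "c"}
copied, recopied = pre_copy(memory, [(0, "A"), (2, "C")])
```

The freeze time depends only on the number of pages in `recopied`, not on the total size of the memory space, which is the point of pre-copying compared with total copy.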

focus on Mosix

MOSIX important points

  • Probabilistic info dissemination
  • PPM (Pre-emptive Process Migration)
  • Dynamic load balancing
  • Memory ushering
  • Decentralized control
  • Efficient kernel communications

Probabilistic info dissemination

  • each node, at regular intervals, sends info on its resources to a random subset of nodes
  • at the same time, each node keeps a small buffer with the most recently received info
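
The two rules above can be simulated as follows. The buffer size (`window`), the number of destinations per round (`fanout`) and the class layout are illustrative assumptions, not MOSIX parameters.

```python
import random
from collections import OrderedDict


class Node:
    def __init__(self, name, load, window=3):
        self.name = name
        self.load = load
        self.window = window        # size of the small buffer
        self.known = OrderedDict()  # sender -> load, oldest entry first

    def receive(self, sender, load):
        self.known.pop(sender, None)  # refresh an existing entry
        self.known[sender] = load
        while len(self.known) > self.window:
            self.known.popitem(last=False)  # forget the oldest info


def gossip_round(nodes, fanout, rng):
    """Each node sends its load to a random subset of the other nodes."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, fanout):
            peer.receive(node.name, node.load)


rng = random.Random(0)  # seeded only to make the sketch reproducible
nodes = [Node(f"n{i}", load=i) for i in range(6)]
for _ in range(4):
    gossip_round(nodes, fanout=2, rng=rng)
```

Each node ends up with a partial and slightly stale view of the cluster, at a cost per round that does not grow with the square of the number of nodes.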

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node tries to find a less loaded node to which it can migrate one of its processes.
When a node is low on memory, a different algorithm prevails: memory ushering.

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
  • Load (based on a unit of a P II-400MHz)
  • Number of processors
  • Speed of processor
  • Memory used
  • Raw Memory Used
  • Total Memory
Look at http://hpc.sissa.it/walrus/mjm for a java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.
This algorithm tries to migrate the smallest running process to the node with the most free memory.
If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
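
The selection rule of memory ushering can be written down as a small function. The function name, the units and the final "does it fit" check are assumptions added for the sketch; the slide only states the threshold, the smallest-process rule and the most-free-memory rule.

```python
def usher(local_free_mb, threshold_mb, processes, free_by_node):
    """Pick a (pid, target node) pair to migrate, or None.

    processes: dict pid -> memory size in MB of the local running processes.
    free_by_node: dict node name -> free memory in MB on the other nodes.
    """
    if local_free_mb >= threshold_mb or not processes or not free_by_node:
        return None  # memory is fine: ordinary load balancing stays in charge
    pid = min(processes, key=processes.get)           # smallest running process
    target = max(free_by_node, key=free_by_node.get)  # node with most free memory
    if free_by_node[target] < processes[pid]:
        return None  # assumed check: the process would not fit on the target
    return pid, target


decision = usher(10, 32, {101: 200, 102: 50}, {"n2": 64, "n3": 512})
```

With 10 MB free against a 32 MB threshold, the sketch selects process 102 (50 MB) and sends it to `n3`, the node with the most free memory.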

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
  • UDP to transfer load info
  • TCP to migrate processes
These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP.
(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)
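
To give an idea of what a load info datagram could look like, the sketch below packs a few of the info variables into a fixed-size UDP payload. The wire format, field order and function names are entirely hypothetical: the real MNP format is not described in these slides.

```python
import socket
import struct

# Hypothetical layout, NOT the real MOSIX one:
# node id, number of processors, load, memory used (MB), total memory (MB)
LOAD_FMT = "!HHfII"


def pack_load(node_id, cpus, load, mem_used_mb, mem_total_mb):
    return struct.pack(LOAD_FMT, node_id, cpus, load, mem_used_mb, mem_total_mb)


def unpack_load(datagram):
    return struct.unpack(LOAD_FMT, datagram)


def send_load(datagram, host, port):
    """Send one load datagram over UDP, with no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, (host, port))


payload = pack_load(7, 2, 0.5, 128, 256)
decoded = unpack_load(payload)
```

A lost datagram costs little here, because fresh load info is sent again at the next interval; process migration, where every byte must arrive, is the case that calls for TCP.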

MOSIX process migration

  • It's a kernel level migration (it's done without the need for any modification to the programs, because Mosix changes the Linux kernel)
  • fully transparent: the process involved doesn't need to know about the migration
  • strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

  • Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (the Unique Home Node)
  • The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow)
  • The migrated process is called the remote
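
The deputy/remote split can be pictured with two small classes. This is a user-space toy, with hypothetical class names and a made-up set of two "system calls"; in Mosix the redirection happens inside the kernel and travels over the network.

```python
class Deputy:
    """Stays on the Unique Home Node and executes system calls there."""

    def __init__(self, home_node):
        self.home_node = home_node
        self.files = {}  # stands in for the home node's file system

    def syscall(self, name, *args):
        if name == "gethostname":
            return self.home_node
        if name == "write":
            path, data = args
            self.files[path] = self.files.get(path, "") + data
            return len(data)
        raise NotImplementedError(name)


class Remote:
    """Runs the user-level code on another node."""

    def __init__(self, running_node, deputy):
        self.running_node = running_node
        self._deputy = deputy

    def syscall(self, name, *args):
        # Link layer: in a real system this call would cross the network.
        return self._deputy.syscall(name, *args)


deputy = Deputy("home")
remote = Remote("node7", deputy)
seen_host = remote.syscall("gethostname")
written = remote.syscall("write", "/tmp/out", "abc")
```

Although the code runs on `node7`, it still sees the host name and files of `home`, which is what makes the migration transparent and also what creates the residual dependency on the home node.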

MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its own link layer, connected through the MNP (Mosix Network Protocol)]

Mosix performance/1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problem, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck at a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance/2

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:
  • cache coherent: clients keep their caches synchronized, or only 1 cache, at the server node, is kept
  • global: every file can be seen from every node with the same name
If such a file system can be used, then I/O operations can be executed directly from the running node, without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.

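MOSIX dfsa (direct file system access)

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its own link layer, connected through the MNP (Mosix Network Protocol); the remote also performs DFSA access directly from the running node]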

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level:
  • identify/compare/reconstruct
A prototype DFSA, which is just a small software layer over the std Linux file system, is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:
  • GFS (Global File System)
  • xFS (Serverless File System)
The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990-1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance/1
  • Mosix performance/2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration

Nov 242000 rinnocente 15

Load balancing algorithms

Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new

job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)

PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)

Nov 242000 rinnocente 16

Non pre-emptive load balancing (aka job placement)

centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects

distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs

random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs

Nov 242000 rinnocente 17

Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random

another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain

load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer

neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken

broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes

if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node

Nov 242000 rinnocente 18

Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)

std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time

but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations

Nov 242000 rinnocente 19

Process migrationmechanisms

Nov 242000 rinnocente 20

Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)

no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration

User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch

Nov 242000 rinnocente 21

Process migration factors

transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration

residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration

performanceperformancehow much the migration affects how much the migration affects performanceperformance

complexitycomplexity how much complex the migration is how much complex the migration is

Nov 242000 rinnocente 22

Transparent

Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent

Nov 242000 rinnocente 23

System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is

just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely

transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes

A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent

Nov 242000 rinnocente 24

Process migration parts

execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration

Nov 242000 rinnocente 25

Execution state migration

Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node

Nov 242000 rinnocente 26

Communication statemigration

It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel

during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned

sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the

migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client

Nov 242000 rinnocente 27

Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact

Memory space migration

  • total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)
  • pre-copying (V System, Theimer): allows the process to continue execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
  • lazy-copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
  • shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
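As an illustration of the pre-copying idea, here is a minimal sketch. It is not V's actual code: the function name, the run_one_step callback (which stands in for "the process keeps running and dirties some pages") and the repeated rounds controlled by max_rounds and threshold are all assumptions made for the example.

```python
def precopy_migrate(pages, run_one_step, max_rounds=5, threshold=2):
    """Copy `pages` (dict: page number -> bytes) into a new dict while the
    process keeps running.

    run_one_step() is a hypothetical callback: it lets the process run for
    a while and returns the set of page numbers it modified meanwhile."""
    target = {}
    to_send = set(pages)                 # first round: every page
    for _ in range(max_rounds):
        for n in to_send:
            target[n] = pages[n]         # transfer while the process runs
        to_send = run_one_step()         # pages dirtied during the transfer
        if len(to_send) <= threshold:
            break                        # few enough pages left: time to freeze
    # The process is frozen from here on, so this last copy is consistent.
    for n in to_send:
        target[n] = pages[n]
    return target
```

The freeze time depends only on the pages left in the last round, not on the size of the whole address space, which is the advantage over a total copy.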

focus on Mosix

MOSIX important points

  • Probabilistic info dissemination
  • PPM (Pre-emptive Process Migration)
  • Dynamic load balancing
  • Memory ushering
  • Decentralized control
  • Efficient kernel communications

Probabilistic info dissemination

  • each node at regular intervals sends info on its resources to a random subset of nodes
  • each node at the same time keeps a small buffer with the most recently received info
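The two rules above can be mimicked with a toy model. The sketch below is not the Mosix implementation: the class name GossipNode, the fanout of 2 and the buffer size of 4 are assumptions chosen only to make the behaviour visible.

```python
import random
from collections import deque


class GossipNode:
    def __init__(self, name, window=4):
        self.name = name
        self.load = 0.0
        # Small buffer: only the most recently received entries are kept.
        self.recent = deque(maxlen=window)

    def receive(self, sender, load):
        self.recent.append((sender, load))

    def tick(self, peers, fanout=2, rng=random):
        """Called at regular intervals: send own load to a random subset of peers."""
        others = [p for p in peers if p is not self]
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(self.name, self.load)
```

No node ever holds a complete view of the cluster; each one decides using a small, partly stale sample, which keeps the traffic per node constant as the cluster grows.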

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it can migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.
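A node's decision can be pictured as follows. This is a simplified sketch, not the Mosix algorithm: the function name pick_target and the min_gain margin (a hysteresis value to avoid moving processes back and forth) are assumptions introduced for the example.

```python
def pick_target(my_load, known_loads, min_gain=1.0):
    """Return the name of a less loaded node worth migrating to, or None.

    known_loads maps node name -> last load value heard from that node.
    min_gain is a hypothetical margin: migrate only if the gap is large enough."""
    if not known_loads:
        return None
    name, load = min(known_loads.items(), key=lambda kv: kv[1])
    return name if my_load - load >= min_gain else None
```

For example, a node with load 3.0 that knows of loads {"a": 2.5, "b": 0.5} would pick "b", while a node with load 1.0 that only knows {"a": 0.5} would keep its processes.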

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
  • Load (based on a unit of a P II 400 MHz)
  • Number of processors
  • Speed of processor
  • Memory used
  • Raw memory used
  • Total memory

Look at httphpcsissaitwalrusmjm for a Java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.

This algorithm tries to migrate the smallest running process to the node with the most free memory. If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
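The selection rule can be written down as a small function. This is a sketch under assumptions, not Mosix code: the name usher, its arguments and the extra check that the chosen node can actually hold the process are all hypothetical.

```python
def usher(free_mem, threshold, processes, nodes_free):
    """Return (pid, node) for the migration to attempt, or None.

    processes: pid -> memory size on this node
    nodes_free: node name -> free memory reported by that node"""
    if free_mem >= threshold or not processes or not nodes_free:
        return None                                  # load balancing stays in charge
    pid = min(processes, key=processes.get)          # smallest running process
    node = max(nodes_free, key=nodes_free.get)       # node with the most free memory
    if nodes_free[node] < processes[pid]:
        return None                                  # no node can host it right now
    return pid, node
```

When the function returns None because no node has room, a fuller implementation would have to free space elsewhere first, which corresponds to the migrations between remote nodes mentioned above.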

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
  • UDP to transfer load info
  • TCP to migrate processes

These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP. (Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)

MOSIX process migration

  • It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
  • fully transparent: the process involved doesn't have any need to know about the migration
  • strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency at the price of some efficiency, interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls shadow).

The migrated process is called the remote.
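The division of labour between deputy and remote can be imitated in user space. The sketch below is only an analogy: Mosix does this inside the kernel, whereas here the three class names are hypothetical and the link layer is a plain method call instead of a network connection.

```python
import os


class Deputy:
    """Stays on the home node and executes system calls there."""

    def syscall(self, name, *args):
        return getattr(os, name)(*args)   # runs in the home node's context


class LinkLayer:
    """Stands in for the link between remote and deputy (here: a direct call)."""

    def __init__(self, deputy):
        self.deputy = deputy

    def forward(self, name, *args):
        return self.deputy.syscall(name, *args)


class Remote:
    """The migrated part: computation runs here, system calls are redirected."""

    def __init__(self, link):
        self.link = link

    def getpid(self):
        return self.link.forward("getpid")

    def compute(self, n):
        return sum(i * i for i in range(n))   # no system call, no redirection
```

Every call that goes through forward() costs a round trip to the home node, while compute() costs nothing extra; this is why CPU bound work migrates well and system call heavy work does not.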

MOSIX deputy/remote mechanism

[Figure: the Deputy on the UHN (Unique Home Node) and the Remote on the running node each sit on a link layer; the two link layers are connected through the MNP (Mosix Network Protocol).]

Mosix performance 1

While a CPU bound process obtains clear advantages from Mosix, being penalized very little by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problem, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck at a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:
  • cache coherent: clients keep their caches synchronized, or only one cache at the server node is kept
  • global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system supports Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
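The effect of declaring a file system as DFSA can be summarized as a routing decision per I/O call. The following sketch is an assumption-laden illustration: the function io_site, the list of DFSA mount points and the example paths are hypothetical and do not describe how Mosix actually marks file systems.

```python
def io_site(path, dfsa_mounts, home="home node", running="running node"):
    """Decide where an I/O call on `path` would execute for a migrated process.

    dfsa_mounts: mount points declared as DFSA-capable (hypothetical list)."""
    for mount in dfsa_mounts:
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            return running      # direct access, no redirection
    return home                 # redirected to the deputy on the UHN
```

With dfsa_mounts set to ["/mfs"], a file under /mfs would be handled on the running node, while /etc/passwd would still go through the Unique Home Node.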

MOSIX dfsa (direct file system access)

[Figure: Deputy on the UHN (Unique Home Node) and Remote on the running node, each on a link layer, connected through the MNP (Mosix Network Protocol), with an additional DFSA access path.]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.

A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:
  • GFS (Global File System)
  • xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990, 1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems

V system (Theimer 1987)

  • developed at Stanford, microkernel based, runs on a set of diskless workstations
  • V has support for both remote execution and process migration; the facilities were added with the aim of utilizing workstations during idle periods

MOSIX (Barak 1990, 1995)

  • its roots are back in the mid '80s: MOS, multicomputer OS (Barak, HUJI 1985)
  • set of enhancements to BSD/OS: MOSIX '90
  • port to LINUX (Mosix 7th edition) around '95
  • the system call based management has been transformed to a /proc filesystem interface
  • enhancements '98 (memory ushering algorithm)

Condor (Litzkow, Livny 1988)

Condor is not a Distributed OS. It is a batch system whose purpose is the utilization of the idle time of workstations and PCs. It can migrate jobs from machine to machine, but it imposes restrictions on the jobs that can be migrated, and the programs have to be linked with a special checkpoint library. (More details on httphpcsissaitbatch)

Load balancing mechanisms

Load balancing methods taxonomy

  • job initiation only (aka job placement, aka remote execution)
      • static
      • dynamic (almost all batch systems can do this: NQS and derivatives)
  • migration (Mosix, some Distributed OS)
  • job initiation + migration

Load balancing algorithms

Two main categories:
  • Non pre-emptive: they are used only when a new job has to be started; they decide where to run a new job (aka job placement, many batch systems)
  • Pre-emptive: jobs can be migrated during execution, therefore load balancing can be dynamically applied during job execution (Distributed OS, Condor)

Non pre-emptive load balancing (aka job placement)

  • centralized initiation: there is only one load balancer in the system, in charge of assigning new jobs to computers according to the info it collects
  • distributed initiation: there is a load balancer on each computer that transfers its load info to the others any time it detects a load change; the local balancer decides, according to the load info received, where to start new jobs
  • random initiation: when a new job arrives, the local load balancer randomly selects the computer to which the new job is assigned
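The difference between informed and random placement fits in two short functions. This is an illustrative sketch with hypothetical names (place_centralized, place_random); choosing the least loaded computer is one plausible policy, assumed here for the example.

```python
import random


def place_centralized(loads):
    """Single balancer: assign the new job to the least loaded computer it knows of.

    loads: computer name -> load value collected by the balancer."""
    return min(loads, key=loads.get)


def place_random(computers, rng=random):
    """Random initiation: no load info is needed, any computer may be picked."""
    return rng.choice(computers)
```

Distributed initiation would use the same rule as place_centralized, with the difference that each computer applies it to its own locally received copy of the load table.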

Pre-emptive load balancing

  • random: when a computer becomes overloaded it selects at random another computer to which to transfer a process
  • central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance it selects a computer with a low load to which to migrate a process from an overloaded computer
  • neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when the request is accepted a pair is constituted. If they detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken
  • broadcast
  • server based
  • distributed: each node receives a measure of the load of other nodes; if it discovers a load imbalance it tries to migrate some of its processes to a less loaded node

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):
  • std Unix: # of tasks in the run queue
  • size of the free available memory
  • rate of CPU context switches
  • rate of system calls
  • 1 minute load avg
  • amount of CPU free time

but the conclusion of many studies is that the simple run queue length is a better measure than complex load criteria in most situations.
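To show how a run queue length turns into a load average, here is a minimal sketch of an exponentially damped average. The function name and the default values (a sample every 5 seconds, a 60 second window) are assumptions chosen to resemble a 1 minute load average; they are not taken from the slides.

```python
import math


def update_load_avg(avg, runnable, interval=5.0, period=60.0):
    """One exponentially damped update of a load average.

    avg: previous value; runnable: current run queue length;
    interval/period: sampling interval and averaging window in seconds
    (assumed values, see the text above)."""
    decay = math.exp(-interval / period)
    return avg * decay + runnable * (1.0 - decay)
```

Fed repeatedly with a constant run queue length, the value converges to that length, so the average is just a smoothed version of the simple metric the studies recommend.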

Process migration mechanisms

User/kernel level migration

  • Kernel level process migration (Mosix)
      • no special requirements, no need to recompile
      • transparent migration
  • User level process migration (Condor)
      • compilation with special libraries (libckpt.a)
      • a snapshot of the running process is taken on signal
      • the snapshot is restarted on another machine
      • more details at httphpcsissaitbatch

Process migration factors

  • transparency: how much a process needs to be aware of the migration
  • residual dependency: how much a process is dependent on the source node after migration
  • performance: how much the migration affects performance
  • complexity: how complex the migration is

Transparent

  • Location transparent
  • Access transparent (or network)
  • Migration transparent
  • Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
  • Very easy on message based OSs, in which a system call is just a message to the kernel
  • A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
  • A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

  • execution state migration
  • memory space migration
  • communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.

On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel.
  • during migration any message received for a suspended process is rejected and a "process migrating" response is returned
      • the sending machine will retransmit after some delay
      • the sending machine, using a special protocol (FLIP), will locate the migrated process
  • the same technique is used for reply messages to a migrating client

Nov 242000 rinnocente 27

Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact

Nov 242000 rinnocente 28

Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space

is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))

prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time

lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred

shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)

Nov 242000 rinnocente 29

focus on Mosix

Nov 242000 rinnocente 30

MOSIX important points

Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications

Nov 242000 rinnocente 31

Probabilistic info dissemination

each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes

each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received

Nov 242000 rinnocente 32

MOSIX load balancing algorithm

Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that

each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering

Nov 242000 rinnocente 33

Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster

Nov 242000 rinnocente 34

Memory ushering algorithm

Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes

to create sufficient space on a single nodeto create sufficient space on a single node

Nov 242000 rinnocente 35

MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses

UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes

These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by

standard tools eg netstat)standard tools eg netstat)

Nov 242000 rinnocente 36

MOSIX process migration

Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)

fully transparent fully transparent the process involved doesrsquont have any need to know about the migration

strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)

Nov 242000 rinnocente 37

Mosix migration transparency

Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )

The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)

The migrated process is called the The migrated process is called the remoteremote

Nov 242000 rinnocente 38

MOSIX deputyremotemechanism

Deputy

Link layer

Remote

Link layerMNP(MosixNetworkProtocol)

UHN(UniqueHomeNode)

Runningnode

Nov 242000 rinnocente 39

Mosix performance1 While a CPU bound process obtains clear advantages from Mosix

being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration

A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)

Mosix performance 2

A more general solution to the performance problem of I/O-bound processes under migration is envisioned to be the use of a cache coherent global file system:

- cache coherent: clients keep their caches synchronized, or only one cache, at the server node, is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, I/O operations can be executed directly on the running node, without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system supports Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
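The routing decision described above can be sketched as a small function. The function name, its parameters and the `/mfs` mount point used in the example are assumptions made for illustration; the real decision is taken inside the Mosix kernel.

```python
import posixpath


def io_target(path, dfsa_mounts, home_node, running_node):
    """Return the node on which an I/O call on `path` would be executed.

    dfsa_mounts: mount points declared as supporting DFSA (hypothetical list).
    """
    path = posixpath.normpath(path)
    for mount in dfsa_mounts:
        mount = posixpath.normpath(mount)
        prefix = mount.rstrip("/") + "/"
        if path == mount or path.startswith(prefix):
            return running_node  # direct access, no redirection
    return home_node  # redirected to the deputy on the home node
```

For example, with `dfsa_mounts=["/mfs"]` a file under `/mfs` is served on the running node, while `/etc/passwd` is still handled by the home node.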

MOSIX DFSA (Direct File System Access)

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its own link layer, connected by the MNP (Mosix Network Protocol); a DFSA access path is shown in addition.]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact, some additional functions (not available in standard Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.

A prototype DFSA, which is just a small software layer over the standard Linux file system, is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:

- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems to be stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990, 1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX DFSA (Direct File System Access)
  • Cache coherent global distributed file systems

MOSIX (Barak 1990, 1995)

- its roots go back to the mid '80s: MOS, Multicomputer OS (Barak, HUJI 1985)
- a set of enhancements to BSD/OS: MOSIX '90
- port to Linux (Mosix 7th edition) around '95
- the system call based management has been transformed into a /proc filesystem interface
- enhancements in '98 (memory ushering algorithm)

Condor (Litzkow, Livny 1988)

Condor is not a distributed OS. It is a batch system whose purpose is to exploit the idle time of workstations and PCs. It can migrate jobs from machine to machine, but it imposes restrictions on the jobs that can be migrated, and the programs have to be linked with a special checkpoint library. (More details at http://hpc.sissa.it/batch)

Load balancing mechanisms

Load balancing methods taxonomy

- job initiation only (aka job placement, aka remote execution)
  - static
  - dynamic (almost all batch systems can do this: NQS and derivatives)
- migration (Mosix, some distributed OSs)
- job initiation + migration

Load balancing algorithms

Two main categories:

- Non pre-emptive: they are used only when a new job has to be started; they decide where to run a new job (aka job placement; many batch systems)
- Pre-emptive: jobs can be migrated during execution, therefore load balancing can be dynamically applied during job execution (distributed OSs, Condor)

Non pre-emptive load balancing (aka job placement)

- centralized initiation: there is only one load balancer in the system, which is in charge of assigning new jobs to computers according to the info it collects
- distributed initiation: there is a load balancer on each computer, which sends its load info to the others any time it detects a load change; the local balancer decides, according to the load info received, where to start new jobs
- random initiation: when a new job arrives, the local load balancer randomly selects the computer to which the new job is assigned
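Two of the placement policies listed above can be sketched in a few lines. The function names and the representation of load info as a dictionary are assumptions chosen for the example, not part of any specific batch system.

```python
import random


def centralized_placement(loads):
    """Single balancer with a global view: pick the least loaded computer.

    loads: {computer name: load value} as collected by the balancer.
    """
    return min(loads, key=loads.get)


def random_placement(computers, rng=random):
    """Random initiation: pick any computer, ignoring load info."""
    return rng.choice(list(computers))
```

Distributed initiation would use the same selection rule as `centralized_placement`, but applied by each computer to its own, possibly stale, copy of the load table.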

Pre-emptive load balancing

- random: when a computer becomes overloaded, it selects at random another computer to which to transfer a process
- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance, it selects a computer with a low load and migrates to it a process from an overloaded computer
- neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when the request is accepted a pair is constituted. If they detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken
- broadcast
- server based
- distributed: each node receives a measure of the load of the other nodes; if it discovers a load imbalance, it tries to migrate some of its processes to a less loaded node
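The pairing step of the neighbour method can be sketched as follows. As a simplifying assumption, load is measured here as the number of processes on each computer, and the pair stops when the counts differ by at most one; the function name and this stopping rule are illustrative only.

```python
def balance_pair(a, b):
    """Balance two paired computers, then the pair can be broken.

    a, b: lists of process names, modified in place.
    Returns the number of processes migrated.
    """
    moved = 0
    while abs(len(a) - len(b)) > 1:
        # Migrate one process from the more loaded to the less loaded computer.
        src, dst = (a, b) if len(a) > len(b) else (b, a)
        dst.append(src.pop())
        moved += 1
    return moved
```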

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):

- standard Unix: number of tasks in the run queue
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1 minute load average
- amount of CPU free time

but the conclusion of many studies is that the simple run queue length is a better measure than complex load criteria in most situations.
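On Linux, two of the measures listed above (the 1 minute load average and the number of runnable tasks) are exposed in the text of `/proc/loadavg`. A minimal parser, with a hypothetical function name and result keys, might look like this:

```python
def parse_loadavg(text):
    """Parse a line in /proc/loadavg format.

    Example input: "0.42 0.30 0.25 2/345 12345"
    (1, 5 and 15 minute load averages, runnable/total scheduling
    entities, most recently created PID).
    """
    fields = text.split()
    load1, load5, load15 = (float(x) for x in fields[:3])
    runnable, total = (int(x) for x in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "runnable": runnable,
        "total": total,
    }
```

On a Linux machine the input can be obtained with `open("/proc/loadavg").read()`; Python's `os.getloadavg()` returns the three averages on Unix systems.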

Process migration mechanisms

User/kernel level migration

- Kernel level process migration (Mosix)
  - no special requirements, no need to recompile
  - transparent migration
- User level process migration (Condor)
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on a signal
  - the snapshot is restarted on another machine
  - more details at http://hpc.sissa.it/batch

Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process depends on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparent

- Location transparent
- Access (or network) transparent
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:

- Very easy on message-based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task and it is usually avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.

On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:

- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply messages to a migrating client
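The reject, retransmit and locate sequence described for Amoeba can be modelled with a toy simulation. Everything here is hypothetical: the `Node` class, the `locate` function (a stand-in for what FLIP does) and the retry limit are assumptions for illustration, and no real delay is implemented.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.running = {}       # pid -> list of delivered messages
        self.migrating = set()  # pids suspended for migration

    def deliver(self, pid, msg):
        if pid in self.migrating:
            return "process migrating"  # message rejected during migration
        if pid in self.running:
            self.running[pid].append(msg)
            return "ok"
        return "unknown process"


def locate(nodes, pid):
    """Stand-in for location via a protocol such as FLIP."""
    for node in nodes:
        if pid in node.running:
            return node
    return None


def send(nodes, first_target, pid, msg, max_tries=5):
    """Return the number of attempts needed to deliver msg to pid."""
    target = first_target
    for attempt in range(1, max_tries + 1):
        if target.deliver(pid, msg) == "ok":
            return attempt
        # Rejected: a real sender would wait before retransmitting,
        # then look for the new location of the process.
        target = locate(nodes, pid) or target
    raise RuntimeError("delivery failed")
```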

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation (Bubak, Zbik et al. 1999) requires:

- substitution of the socket library
- relinking of programs with the new library

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" message. The other endpoint should then consult the server for the new real address to contact.

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)
- pre-copying (V System, Theimer): allows the process to continue its execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation, used by MOSIX, is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
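The trade-off behind pre-copying can be illustrated with a small model. It assumes, purely for illustration, that the number of pages dirtied during a transfer round is a fixed fraction of the pages sent in that round; the function name and all parameters are hypothetical. With `max_rounds=1` it corresponds to the scheme described above (one live copy, then freeze and re-copy).

```python
def precopy_rounds(total_pages, dirty_fraction, stop_at, max_rounds=10):
    """Model pre-copy memory migration.

    Returns (pages sent in each live round, pages left for the final
    copy done while the process is frozen).
    """
    to_send = total_pages
    sent = []
    for _ in range(max_rounds):
        if to_send <= stop_at:
            break  # few enough pages left: freeze and copy them
        sent.append(to_send)
        # Pages modified by the still-running process during this round.
        to_send = int(to_send * dirty_fraction)
    return sent, to_send
```

With 1000 pages, a dirty fraction of 0.1 and a stop threshold of 20 pages, the model sends 1000 and then 100 pages while the process runs, and only 10 pages while it is frozen, whereas a total copy would freeze the process for all 1000 pages.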

Focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node, at regular intervals, sends info on its resources to a random subset of nodes
- each node, at the same time, keeps a small buffer with the most recently received info
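The two rules above can be sketched as a small simulation. The class name, the `window` size and the `fanout` parameter are assumptions made for the example; the real values and data layout used by Mosix are not given here.

```python
import random
from collections import OrderedDict


class InfoNode:
    def __init__(self, name, window=3):
        self.name = name
        self.window = window         # size of the small buffer (assumed)
        self.recent = OrderedDict()  # sender -> load, most recent last

    def receive(self, sender, load):
        self.recent.pop(sender, None)  # refresh an existing entry
        self.recent[sender] = load
        while len(self.recent) > self.window:
            self.recent.popitem(last=False)  # drop the oldest info


def disseminate(nodes, loads, fanout, rng):
    """One interval: every node sends its load to `fanout` random peers."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(node.name, loads[node.name])
```

Each node therefore holds only a partial and slightly stale view of the cluster, which keeps the cost per node independent of the cluster size.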

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load average metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.
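The decision taken by each node can be sketched as follows. This is an illustration only: the function name and the `min_gain` threshold (the minimum load difference that justifies a migration) are assumptions, not values from Mosix.

```python
def pick_migration_target(local_name, local_load, known_loads, min_gain=1.0):
    """Return the least loaded known node that is sufficiently less
    loaded than the local one, or None if no migration is worthwhile.

    known_loads: {node name: load} as received from other nodes.
    """
    candidates = {
        name: load
        for name, load in known_loads.items()
        if name != local_name and local_load - load >= min_gain
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```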

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:

- Load (based on a unit of a P II-400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw memory used
- Total memory

Look at http://hpc.sissa.it/walrus/mjm for a Java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls below a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm. This algorithm tries to migrate the smallest running process to the node with the most free memory. If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
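The basic selection rule of memory ushering can be sketched as below. The function name, the units and the check that the target must have room for the process are assumptions; the further step of moving processes between remote nodes to make room is not modelled.

```python
def usher(processes, free_memory, local, threshold):
    """Pick a (pid, target node) pair for memory ushering, or None.

    processes: {pid: size in MB} of the processes running on `local`
    free_memory: {node: free MB} for all known nodes, `local` included
    threshold: free-memory level below which ushering starts
    """
    if free_memory[local] >= threshold or not processes:
        return None  # enough memory: only load balancing is active
    pid = min(processes, key=processes.get)  # smallest running process
    others = {n: f for n, f in free_memory.items() if n != local}
    if not others:
        return None
    target = max(others, key=others.get)     # node with most free memory
    if others[target] < processes[pid]:
        return None  # assumed check: the process must fit on the target
    return pid, target
```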

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP. (Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)
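To give an idea of what exchanging load info over UDP involves, here is a sketch that packs a load record into a datagram and sends it over the loopback interface. The packet layout (`LOAD_FORMAT`), the field choice and the function names are entirely hypothetical; the actual MNP wire format is not described here.

```python
import socket
import struct

# Hypothetical layout: node id (16 bit), number of CPUs (16 bit), load (32 bit),
# all in network byte order.
LOAD_FORMAT = "!HHI"


def encode_load(node_id, ncpus, load):
    return struct.pack(LOAD_FORMAT, node_id, ncpus, load)


def decode_load(data):
    return struct.unpack(LOAD_FORMAT, data)


def demo():
    """Send one load record over loopback UDP and return what was received."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        rx.bind(("127.0.0.1", 0))  # let the OS choose a free port
        rx.settimeout(2.0)
        tx.sendto(encode_load(3, 2, 150), rx.getsockname())
        data, _ = rx.recvfrom(64)
        return decode_load(data)
    finally:
        tx.close()
        rx.close()
```

A datagram like this carries no authentication at all, which illustrates why a private network is recommended above.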

MOSIX process migration

- It's a kernel level migration (it's done without the need for any modification to the programs, because Mosix changes the Linux kernel)
- fully transparent: the process involved doesn't need to know anything about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)


Page 12: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 12

Condor (LitzkowLivny 1988)

Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))

Nov 242000 rinnocente 13

Load balancingmechanisms

Nov 242000 rinnocente 14

Load balancing methods taxonomy

job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this

NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration

Nov 242000 rinnocente 15

Load balancing algorithms

Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new

job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)

PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)

Nov 242000 rinnocente 16

Non pre-emptive load balancing (aka job placement)

centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects

distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs

random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs

Nov 242000 rinnocente 17

Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random

another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain

load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer

neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken

broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes

if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node

Nov 242000 rinnocente 18

Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)

std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time

but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations

Nov 242000 rinnocente 19

Process migrationmechanisms

Nov 242000 rinnocente 20

Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)

no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration

User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch

Process migration factors

- transparency: how much a process needs to be aware of the migration

- residual dependency: how much a process is dependent on the source node after migration

- performance: how much the migration affects performance

- complexity: how complex the migration is

Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:

- Very easy on message-based OSs, in which a system call is just a message to the kernel

- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes

- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent
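
The common idea behind these approaches is that the program keeps calling what looks like a local interface, while a thin layer forwards each call to the place where it is really executed. A minimal sketch follows; the classes `HomeNode` and `Interposer` and the two supported calls are hypothetical and do not correspond to any real kernel or library interface.

```python
class HomeNode:
    """Stands in for the node where the calls are really executed."""

    def __init__(self):
        self.files = {"/etc/motd": "hello"}   # invented file contents
        self.log = []

    def handle(self, name, *args):
        self.log.append(name)
        if name == "read":
            return self.files[args[0]]
        if name == "gethostname":
            return "home"
        raise NotImplementedError(name)


class Interposer:
    """Looks like a local call interface but forwards every call."""

    def __init__(self, home):
        self._home = home

    def __getattr__(self, name):
        # invoked only for names not found on the object: the "system calls"
        def forward(*args):
            return self._home.handle(name, *args)
        return forward
```

With `calls = Interposer(HomeNode())`, a program would write `calls.read("/etc/motd")` without knowing where the read is carried out.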

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.

On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel.

- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply msgs to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation requires:

- substitution of the socket library
- relinking of programs with the new library

(Bubak, Zbik et al., 1999)

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint should then consult the server for the new real address to contact.
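
The virtual address lookup with a stub can be sketched as follows. This is a toy model written for these notes, not the implementation of the cited proposal: `AddressServer`, `Network`, `send_to` and the string addresses are all hypothetical.

```python
MIGRATING = "process migrating"


class AddressServer:
    """Keeps the correspondence table between virtual and real addresses."""

    def __init__(self):
        self.table = {}

    def register(self, virtual, real):
        self.table[virtual] = real

    def lookup(self, virtual):
        return self.table[virtual]


class Network:
    """Toy network: maps a real address to whatever answers there."""

    def __init__(self):
        self.hosts = {}

    def send(self, real, msg):
        return self.hosts[real](msg)


def send_to(virtual, msg, server, net, cache):
    """Send to a virtual address, resolving it again if a stub answers."""
    real = cache.get(virtual) or server.lookup(virtual)
    reply = net.send(real, msg)
    if reply == MIGRATING:
        real = server.lookup(virtual)   # ask the server for the new real address
        reply = net.send(real, msg)
    cache[virtual] = real
    return reply
```

The sender caches the last real address it used, so the server is consulted only on first contact and after a migration.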

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)

- pre-copying (V System, Theimer): allows the process to continue execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time

- lazy-copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced

- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
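
As an illustration of pre-copying, the sketch below copies all pages while the process is still running, then re-copies only the pages dirtied in the meantime. It is a simplified model and not the V System code: the function name, the dictionary of pages and the way concurrent writes are passed in as an argument are assumptions made for this example.

```python
def precopy_migrate(source_pages, writes_during_copy):
    """Return (target_pages, number_of_pages_copied_while_frozen).

    source_pages: dict page number -> contents on the source node.
    writes_during_copy: dict page number -> new contents, standing in for
    the writes made by the still-running process during the first pass.
    """
    target = dict(source_pages)        # pass 1: copy everything, process keeps running
    dirty = set()
    for page, data in writes_during_copy.items():
        source_pages[page] = data      # the process modifies its memory meanwhile
        dirty.add(page)
    # the process is frozen here: only the dirtied pages are copied again
    for page in dirty:
        target[page] = source_pages[page]
    return target, len(dirty)
```

The freeze time depends only on the number of dirtied pages, which is the advantage over the total copy approach.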

focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node at regular intervals sends info on its resources to a random subset of nodes

- each node at the same time keeps a small buffer with the most recently received info
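
A toy simulation of the two rules follows. It is not the MOSIX implementation: the class `Node`, the buffer size (`window=3`), the subset size (`fanout=2`) and the number of rounds are arbitrary values chosen for illustration.

```python
import random
from collections import deque


class Node:
    def __init__(self, name, load, window=3):
        self.name = name
        self.load = load
        self.recent = deque(maxlen=window)   # small buffer, oldest info drops out

    def receive(self, sender, load):
        self.recent.append((sender, load))


def disseminate(nodes, fanout, rng):
    """One interval: every node sends its load to `fanout` random other nodes."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, fanout):
            peer.receive(node.name, node.load)


rng = random.Random(0)                       # fixed seed, repeatable run
cluster = [Node(f"n{i}", load=i) for i in range(6)]
for _ in range(4):
    disseminate(cluster, fanout=2, rng=rng)
```

No node ever holds a complete picture of the cluster, and the cost per interval does not grow with the cluster size, since each node sends a fixed number of messages.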

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:

- Load (based on a unit of a P II-400MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw memory used
- Total memory

Look at http://hpc.sissa.it/walrus/mjm for a java applet showing these values over a Mosix cluster.
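
To make the list concrete, the six variables can be seen as one small record that fits in a single datagram. The sketch below is purely illustrative: the field names, the units and above all the binary layout (`"!dHHIII"`) are invented here and are not the actual format used by Mosix.

```python
import struct
from dataclasses import dataclass, astuple


@dataclass
class NodeInfo:
    load: float            # in units of one P II-400MHz, as on the slide
    num_processors: int
    processor_speed: int   # unit assumed for the example
    memory_used: int
    raw_memory_used: int
    total_memory: int


# hypothetical layout: network byte order, 1 double, 2 shorts, 3 unsigned ints
_FORMAT = "!dHHIII"


def pack(info):
    """Encode a NodeInfo into bytes suitable for one datagram."""
    return struct.pack(_FORMAT, *astuple(info))


def unpack(payload):
    """Decode bytes produced by pack() back into a NodeInfo."""
    return NodeInfo(*struct.unpack(_FORMAT, payload))
```

With this invented layout the whole record takes 24 bytes.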

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.

This algorithm tries to migrate the smallest running process to the node with the most free memory.

If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
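
The selection rule (smallest running process, node with the most free memory) can be written down in a few lines. This is a sketch of the rule as stated on the slide, not MOSIX kernel code; the function `usher`, its arguments and the threshold value are assumptions.

```python
def usher(processes, nodes, local, threshold):
    """Pick (process, target) when `local` is short of memory, else None.

    processes: dict name -> memory size of the processes running on `local`.
    nodes: dict node name -> free memory.
    """
    if nodes[local] >= threshold or not processes:
        return None                                   # ushering is not triggered
    candidates = {n: free for n, free in nodes.items() if n != local}
    if not candidates:
        return None                                   # nowhere to migrate to
    victim = min(processes, key=processes.get)        # smallest running process
    target = max(candidates, key=candidates.get)      # node with the most free memory
    return victim, target
```

For example, `usher({"a": 50, "b": 10}, {"n1": 5, "n2": 100, "n3": 300}, "n1", 20)` selects process `b` and node `n3`.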

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted/authenticated/protected, therefore it's better to use a private network for the MNP.

(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)

- fully transparent: the process involved doesn't have any need to know about the migration

- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.

MOSIX deputy/remote mechanism

[Figure: the deputy and its link layer on the UHN (Unique Home Node); the remote and its link layer on the running node; the two link layers are connected by MNP (Mosix Network Protocol).]

Mosix performance 1

While a CPU-bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O-bound and network-bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck at a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O-bound processes under migration is envisioned to be the use of a cache coherent global file system:

- cache coherent: clients keep their caches synchronized, or only 1 cache at the server node is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node without resorting to redirecting them to the Unique Home Node. Mosix recently has incorporated the possibility to declare that a file system is a Direct File System Access. For such file systems Mosix will perform I/O operations directly on the running node.
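
The decision described here (direct I/O on a DFSA file system, redirection to the home node otherwise) can be sketched as a simple routing function. The function `route_io`, the list of mount points and the path `/mfs` used in the example are hypothetical and only serve to illustrate the rule.

```python
def route_io(path, dfsa_mounts, migrated):
    """Return the node on which an I/O call on `path` would be executed."""
    if not migrated:
        return "home node"             # the process never left its home node
    for mount in dfsa_mounts:
        prefix = mount.rstrip("/") + "/"
        if path == mount or path.startswith(prefix):
            return "running node"      # file system declared as DFSA
    return "home node"                 # redirected through the deputy
```

For a migrated process and `dfsa_mounts=["/mfs"]`, a file under `/mfs` is accessed from the running node, while a file under `/home` still goes back to the home node.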

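MOSIX dfsa (direct file system access)

[Figure: the deputy and its link layer on the UHN (Unique Home Node); the remote and its link layer on the running node; the two link layers are connected by MNP (Mosix Network Protocol); DFSA access is performed on the running node.]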

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level:

- identify/compare/reconstruct

A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:

- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis & Ousterhout, 1987)
  • Charlotte (Artsy & Finkel, 1989)
  • Accent (Zayas, 1987)
  • Amoeba (Tanenbaum et al., 1990)
  • V system (Theimer, 1987)
  • MOSIX (Barak 19901995)
  • Condor (Litzkow & Livny, 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems
Page 13: automatic load balancing and migration : Mosix

Load balancing mechanisms

Load balancing methods taxonomy

- job initiation only (aka job placement, aka remote execution)
  - static
  - dynamic (almost all batch systems can do this: NQS and derivatives)
- migration (Mosix, some Distributed OSs)
- job initiation + migration

Load balancing algorithms

Two main categories:

- Non pre-emptive: they are used only when a new job has to be started; they decide where to run a new job (aka job placement; many batch systems)

- Pre-emptive: jobs can be migrated during execution, therefore load balancing can be dynamically applied during job execution (Distributed OSs, Condor)

Non pre-emptive load balancing (aka job placement)

- centralized initiation: there is only 1 load balancer in the system, which is in charge of assigning new jobs to computers according to the info it collects

- distributed initiation: there is a load balancer on each computer that transfers its load to the others any time it detects a load change; the local balancer decides, according to the load info received, where to start new jobs

- random initiation: when a new job arrives, the local load balancer selects at random the computer to which to assign it
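
The difference between informed and random placement can be shown with two small functions. This is only a sketch: the names `place_least_loaded` and `place_random`, and the choice of "lowest load wins" as the decision rule of the informed balancers, are assumptions made for illustration.

```python
import random


def place_least_loaded(loads):
    """Centralized or distributed initiation: use the collected load info.

    loads: dict computer name -> current load.
    """
    return min(loads, key=loads.get)


def place_random(loads, rng):
    """Random initiation: ignore the load info and pick any computer."""
    return rng.choice(sorted(loads))


loads = {"a": 3, "b": 1, "c": 2}
chosen = place_least_loaded(loads)
```

In both the centralized and the distributed case the decision rule can be the same; what changes is where the load table is kept and who maintains it.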

Pre-emptive load balancing

- random: when a computer becomes overloaded, it selects at random another computer to which to transfer a process

- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance, it selects a computer with a low load and migrates a process to it from an overloaded computer

- neighbour (Bryant & Finkel, 1981): each computer sends a request to its neighbours; when the request is accepted, a pair is constituted. If they detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken

- broadcast
- server based
- distributed: each node receives a measure of the load of the other nodes; if it discovers a load imbalance, it tries to migrate some of its processes to a less loaded node

Nov 242000 rinnocente 18

Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)

std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time

but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations

Nov 242000 rinnocente 19

Process migrationmechanisms

Nov 242000 rinnocente 20

Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)

no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration

User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch

Nov 242000 rinnocente 21

Process migration factors

transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration

residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration

performanceperformancehow much the migration affects how much the migration affects performanceperformance

complexitycomplexity how much complex the migration is how much complex the migration is

Nov 242000 rinnocente 22

Transparent

Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent

Nov 242000 rinnocente 23

System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is

just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely

transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes

A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent

Nov 242000 rinnocente 24

Process migration parts

execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration

Nov 242000 rinnocente 25

Execution state migration

Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node

Nov 242000 rinnocente 26

Communication statemigration

It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel

during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned

sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the

migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client

Nov 242000 rinnocente 27

Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact

Nov 242000 rinnocente 28

Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space

is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))

prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time

lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred

shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)

Nov 242000 rinnocente 29

focus on Mosix

Nov 242000 rinnocente 30

MOSIX important points

Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications

Nov 242000 rinnocente 31

Probabilistic info dissemination

each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes

each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received

Nov 242000 rinnocente 32

MOSIX load balancing algorithm

Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that

each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering

Nov 242000 rinnocente 33

Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster

Nov 242000 rinnocente 34

Memory ushering algorithm

Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes

to create sufficient space on a single nodeto create sufficient space on a single node

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP (Mosix Network Protocol).
(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)
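A load-info datagram could look like the sketch below. The six fields mirror the Mosix info variables listed earlier, but the wire layout (`_FMT`), the field names and the helper functions are purely hypothetical; the real MNP format is not described in these slides.

```python
import socket
import struct
from typing import NamedTuple


class LoadInfo(NamedTuple):
    load: int          # load, in units of a reference CPU
    ncpus: int         # number of processors
    cpu_speed: int     # speed of processor
    mem_used: int      # memory used
    raw_mem_used: int  # raw memory used
    total_mem: int     # total memory


# Hypothetical layout: six unsigned 32-bit integers in network byte order.
_FMT = "!6I"


def encode(info):
    return struct.pack(_FMT, *info)


def decode(data):
    return LoadInfo(*struct.unpack(_FMT, data))


def send_load(info, addr):
    """Send one unacknowledged datagram to addr, a (host, port) pair."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(encode(info), addr)
```

UDP fits this kind of traffic because a lost load report is simply superseded by the next one, while the migration itself needs the reliable stream that TCP provides.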

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (the Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.
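The deputy/remote split can be pictured with a toy model in which the link layer is an ordinary function call instead of a network connection. All class and method names below are assumptions made for illustration; none of this is Mosix code.

```python
import os


class Deputy:
    """Stays on the home node and executes system calls for the remote."""

    def __init__(self):
        self.handlers = {"getpid": os.getpid, "getcwd": os.getcwd}

    def handle(self, request):
        name, args = request
        return self.handlers[name](*args)


class LinkLayer:
    """Stands in for the network link between remote and deputy."""

    def __init__(self, deputy):
        self.deputy = deputy
        self.forwarded = 0

    def call(self, name, *args):
        self.forwarded += 1  # every call costs a round trip to the home node
        return self.deputy.handle((name, args))


class Remote:
    """The migrated process body: its system calls go through the link."""

    def __init__(self, link):
        self.link = link

    def getpid(self):
        return self.link.call("getpid")
```

The `forwarded` counter makes the cost visible: a process that issues many system calls pays one round trip per call, which is why the residual dependency on the home node is called strong.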

MOSIX deputy/remote mechanism

[Figure: the deputy and its link layer on the UHN (Unique Home Node); the remote and its link layer on the running node; the two link layers are connected by the MNP (Mosix Network Protocol).]

Mosix performance 1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the bottleneck of a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:
- cache coherent: clients keep their caches synchronized, or only one cache at the server node is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node, without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a DFSA (Direct File System Access) file system. For such file systems Mosix will perform I/O operations directly on the running node.
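The routing decision described here can be sketched as a path check. The function name, the return strings and the example mount point are assumptions; how Mosix actually marks a file system as DFSA is not shown in these slides.

```python
def io_site(path, dfsa_mounts):
    """Say where an I/O operation on `path` would be executed."""
    for mount in dfsa_mounts:
        base = mount.rstrip("/")
        # Match the mount point itself or anything below it.
        if path == base or path.startswith(base + "/"):
            return "running node"  # DFSA: no redirection needed
    return "home node"             # default: redirect to the deputy
```

With a hypothetical DFSA mount at `/mfs`, a file under `/mfs/data` would be handled on the running node, while a file under `/home` would still go through the Unique Home Node.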

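MOSIX dfsa (direct file system access)

[Figure: the deputy and its link layer on the UHN (Unique Home Node); the remote and its link layer on the running node; the two link layers are connected by the MNP (Mosix Network Protocol); in addition there is a DFSA access path.]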

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in standard Linux super-blocks) must be available at the super-block level: identify/compare/reconstruct.
A prototype DFSA that is just a small software layer over the standard Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:
- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis/Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990/1995)
  • Condor (Litzkow/Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems

Load balancing methods taxonomy

- job initiation only (aka job placement, aka remote execution)
  - static
  - dynamic (almost all batch systems can do this: NQS and derivatives)
- migration (Mosix, some Distributed OS)
- job initiation + migration

Load balancing algorithms

Two main categories:
- Non pre-emptive: they are used only when a new job has to be started; they decide where to run a new job (aka job placement; many batch systems)
- Pre-emptive: jobs can be migrated during execution, therefore load balancing can be dynamically applied during job execution (Distributed OS, Condor)

Non pre-emptive load balancing (aka job placement)

- centralized initiation: there is only one load balancer in the system, which is in charge of assigning new jobs to computers according to the info it collects
- distributed initiation: there is a load balancer on each computer that transfers its load to the others any time it detects a load change; the local balancer decides, according to the load info received, where to start new jobs
- random initiation: when a new job arrives, the local load balancer randomly selects the computer to which to assign it
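The three placement policies above differ only in who holds the load information. A minimal sketch, assuming load is a single number per computer and that the function and class names are made up for this example:

```python
import random


def centralized_place(loads):
    """One balancer with a global view picks the least loaded computer."""
    return min(loads, key=loads.get)


def random_place(computers, rng=random):
    """No load info is used at all: any computer may be chosen."""
    return rng.choice(list(computers))


class LocalBalancer:
    """Distributed initiation: each computer stores the loads sent by peers."""

    def __init__(self, name):
        self.name = name
        self.known = {}  # peer name -> last load received

    def on_load_change(self, sender, load):
        self.known[sender] = load

    def place(self, own_load):
        # Decide locally, using own load plus whatever the peers reported.
        view = dict(self.known, **{self.name: own_load})
        return min(view, key=view.get)
```

The centralized variant needs one collector that every computer reports to, while `LocalBalancer` trades that single point of control for load messages exchanged among all computers.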

Pre-emptive load balancing

- random: when a computer becomes overloaded it selects at random another computer to which to transfer a process
- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance, it selects a computer with a low load to which to migrate a process from an overloaded computer
- neighbour (Bryant/Finkel 1981): each computer sends a request to its neighbours; when the request is accepted a pair is constituted. If they detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken
- broadcast
- server based
- distributed: each node receives a measure of the load of other nodes; if it discovers a load imbalance it tries to migrate some of its processes to a less loaded node
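The neighbour method can be sketched in two steps: form pairs, then even out each pair. The sketch assumes load is simply the number of processes on a computer and uses a greedy pairing; both are simplifications chosen for illustration, not details taken from the Bryant/Finkel paper.

```python
def pair_up(nodes, neighbours):
    """Greedy pairing: each free node pairs with its first free neighbour."""
    paired, pairs = set(), []
    for n in nodes:
        if n in paired:
            continue
        for m in neighbours[n]:
            if m != n and m not in paired:
                pairs.append((n, m))
                paired.update((n, m))
                break
    return pairs


def balance_pair(a, b, threshold=1):
    """Move processes from the longer list to the shorter one until their
    lengths differ by at most `threshold`; return how many were moved."""
    src, dst = (a, b) if len(a) >= len(b) else (b, a)
    moved = 0
    while len(src) - len(dst) > threshold:
        dst.append(src.pop())
        moved += 1
    return moved
```

After `balance_pair` returns, the pair is considered broken and both computers are free to pair again in the next round.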

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):
- standard Unix number of tasks in the run queue
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1 minute load avg
- amount of CPU free time

but the conclusion of many studies indicates that the simple run queue length is a better measure than complex load criteria in most situations.
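Two of the metrics listed above are closely related: a load average is commonly computed as an exponentially damped average of the run queue length. The sketch below shows that relation; the 5 second sampling period, the 60 second window and the function name are assumptions of this example.

```python
import math


def make_load_avg(period_s=5.0, window_s=60.0):
    """Return an update function that folds run queue samples into an
    exponentially damped average."""
    decay = math.exp(-period_s / window_s)
    state = {"avg": 0.0}

    def update(run_queue_len):
        # Old history fades by `decay`; the new sample fills the rest.
        state["avg"] = state["avg"] * decay + run_queue_len * (1.0 - decay)
        return state["avg"]

    return update
```

Fed a constant run queue length, the average converges to that length, so the damped value is a smoothed view of the same simple measure.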

Process migration mechanisms

User/kernel level migration

- Kernel level process migration (Mosix)
  - no special requirements, no need to recompile
  - transparent migration
- User level process migration (Condor)
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on signal
  - the snapshot is restarted on another machine
  - more details at http://hpc.sissa.it/batch

Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process is dependent on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
- Very easy on message based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:
- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply messages to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation (Bubak, Zbik et al. 1999) requires:
- substitution of the socket library
- relinking of programs with the new library

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" message. The other endpoint should then consult the server for the new real address to contact.
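The virtual address scheme can be illustrated with a toy model in which the network is a dictionary and delivery is a list append. Every name below (`AddressServer`, `Network`, `STUB`, `send`, `migrate`) is hypothetical and only mirrors the steps described in the text, not the API of the cited proposal.

```python
STUB = object()  # marker left on the old node after a migration


class AddressServer:
    """Keeps the correspondence table: virtual address -> real address."""

    def __init__(self):
        self.table = {}

    def register(self, vaddr, real):
        self.table[vaddr] = real

    def lookup(self, vaddr):
        return self.table[vaddr]


class Network:
    def __init__(self):
        self.hosts = {}  # real address -> inbox list, or STUB

    def deliver(self, real, msg):
        target = self.hosts[real]
        if target is STUB:
            return "process migrating"
        target.append(msg)
        return "ok"


def migrate(net, server, vaddr, new_real):
    old_real = server.lookup(vaddr)
    net.hosts[new_real] = net.hosts[old_real]  # endpoint moves with its inbox
    net.hosts[old_real] = STUB                 # stub stays behind
    server.register(vaddr, new_real)


def send(net, server, cache, vaddr, msg):
    """Send to a virtual address, re-resolving it if a stub answers."""
    real = cache.get(vaddr) or server.lookup(vaddr)
    if net.deliver(real, msg) == "process migrating":
        real = server.lookup(vaddr)            # ask the server again
        net.deliver(real, msg)
    cache[vaddr] = real
    return real
```

The sender only pays the extra lookup on the first message after a migration; afterwards its cached real address is valid again.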

Memory space migration

- total copy (Charlotte/LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds, and it can also transfer memory that will not be used)
- pre-copying (V System/Theimer): allows the process to continue its execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand page (Accent/Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation, used by MOSIX, is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
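The trade-off of pre-copying (more pages sent in total, but a short freeze) can be shown with a small simulation. The function, its parameters and the idea of repeating the live copy for several rounds are assumptions made for this sketch; the slide itself describes a single live copy followed by the frozen re-copy.

```python
def precopy(pages, dirtied_per_round, max_rounds=5, stop_below=2):
    """Simulate pre-copying.

    pages: page numbers of the process.
    dirtied_per_round: list of sets, the pages written during each round.
    Returns (pages sent while running, pages sent while frozen).
    """
    to_send = set(pages)
    sent_live = 0
    for rnd in range(max_rounds):
        if len(to_send) < stop_below:
            break
        sent_live += len(to_send)  # copied while the process keeps running
        if rnd < len(dirtied_per_round):
            to_send = set(dirtied_per_round[rnd])  # modified meanwhile
        else:
            to_send = set()
    # The process is frozen only for the pages that are still dirty.
    return sent_live, len(to_send)
```

With a total copy all pages would be sent while the process is frozen; here a 10 page process that dirties 3 pages and then 1 page sends 13 pages while running and only 1 while frozen.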

focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node at regular intervals sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
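The two rules above can be sketched as a small class. The buffer size, the fanout and all names are assumptions chosen for the example; the slides do not give the values Mosix uses.

```python
import random
from collections import OrderedDict


class GossipNode:
    def __init__(self, name, buffer_size=4):
        self.name = name
        self.buffer_size = buffer_size
        self.recent = OrderedDict()  # sender -> info, oldest first

    def receive(self, sender, info):
        self.recent.pop(sender, None)  # re-insert so the newest is last
        self.recent[sender] = info
        while len(self.recent) > self.buffer_size:
            self.recent.popitem(last=False)  # forget the oldest entry

    def tick(self, info, peers, fanout=2, rng=random):
        """Called at each interval: send own info to a random subset."""
        targets = rng.sample(peers, min(fanout, len(peers)))
        for peer in targets:
            peer.receive(self.name, info)
        return targets
```

Because each node contacts only a few peers per interval and stores only a few entries, the cost per node does not grow with the size of the cluster.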

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.
When a node is low on memory a different algorithm prevails: memory ushering.
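The decision each node takes can be sketched as below. The function name and the `min_gain` margin (which avoids migrating for a negligible difference) are assumptions of this example, not parameters documented for Mosix.

```python
def pick_target(own_load, known_loads, min_gain=1.0):
    """known_loads maps node name -> last load heard from that node.
    Return the node to migrate a process to, or None."""
    if not known_loads:
        return None
    best = min(known_loads, key=known_loads.get)  # least loaded known node
    if own_load - known_loads[best] >= min_gain:
        return best
    return None
```

A node with load 4.0 that knows of nodes with loads 3.5 and 1.0 would pick the latter, while a node that is already close to the least loaded one stays as it is.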


Page 15: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 15

Load balancing algorithms

Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new

job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)

PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)

Nov 242000 rinnocente 16

Non pre-emptive load balancing (aka job placement)

centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects

distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs

random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs

Nov 242000 rinnocente 17

Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random

another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain

load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer

neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken

broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes

if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):

- std Unix number of tasks in the run queue
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1 minute load avg
- amount of CPU free time

but the conclusion of many studies is that the simple run queue length is a better measure than complex load criteria in most situations
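
Two of the metrics listed above are easy to read on a current Linux system. The sketch below assumes the Linux `/proc/loadavg` line format (three load averages, then `runnable/total` tasks, then the last PID); the function names are made up for this example.

```python
import os

def parse_proc_loadavg(line):
    """Parse one line in Linux /proc/loadavg format.

    Example line: "0.20 0.18 0.12 1/80 11206"
    Returns (one_minute_load_avg, runnable_tasks).
    """
    fields = line.split()
    one_min = float(fields[0])
    runnable = int(fields[3].split("/")[0])
    return one_min, runnable

def one_minute_load():
    """1 minute load average via the standard library (Unix only)."""
    return os.getloadavg()[0]
```

On Linux, `parse_proc_loadavg(open("/proc/loadavg").read())` gives both values in one call.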

Process migration mechanisms

User/kernel level migration

- Kernel level process migration (Mosix)
  - no special requirements
  - no need to recompile
  - transparent migration
- User level process migration (Condor)
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on signal
  - the snapshot is restarted on another machine
  - more details at http://hpc.sissa.it/batch

Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process depends on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:

- Very easy on message based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent
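
The idea of interposition can be shown with a toy Python wrapper. It is only an analogy for the modified-library approach: `LocalKernel` is a hypothetical stand-in for the real system call layer, and `Interposer` is an invented name, not something Condor or MOSIX provides.

```python
class LocalKernel:
    """Stand-in for the real system call layer (hypothetical)."""
    def getpid(self):
        return 1234

    def write(self, fd, data):
        return len(data)

class Interposer:
    """Forwards every call, running some extra code first."""
    def __init__(self, target):
        self._target = target
        self.log = []

    def __getattr__(self, name):
        # Only reached for names not found on the Interposer itself.
        real = getattr(self._target, name)

        def call(*args):
            self.log.append((name, args))   # the interposed code runs here
            return real(*args)
        return call

sys_layer = Interposer(LocalKernel())
n = sys_layer.write(1, b"hello")
```

The calling code is unchanged; only the object it calls through has been swapped, which is what relinking with a special library achieves.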

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task, and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.

On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:

- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply msgs to a migrating client
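
The reject, retransmit and locate sequence can be mimicked with a small simulation. This is a sketch under stated assumptions, not Amoeba code: the `Node` class, the `send` function and the linear search standing in for the FLIP locate step are all invented for illustration.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.processes = {}      # pid -> "running" or "migrating"

    def deliver(self, pid, msg):
        state = self.processes.get(pid)
        if state == "running":
            return "ok"
        if state == "migrating":
            return "process migrating"
        return "unknown process"

def send(nodes, hint, pid, msg, max_tries=3, on_retry=None):
    """Send msg to pid, first trying node `hint`.

    Returns the name of the node that accepted the message, or None.
    """
    node = nodes[hint]
    for _ in range(max_tries):
        if node.deliver(pid, msg) == "ok":
            return node.name
        if on_retry:
            on_retry()           # stand-in for "wait some delay"
        # Stand-in for the locate step: look for the process on all nodes.
        for cand in nodes.values():
            if cand.processes.get(pid) == "running":
                node = cand
                break
    return None
```

In a test run, the `on_retry` hook can complete the migration, so that the second attempt reaches the process on its new node.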

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation requires:

- substitution of the socket library
- relinking of programs with the new library

(Bubak, Zbik et al. 1999)

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint should then consult the server for the new real address to contact.

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)
- pre-copying (V System, Theimer): allows the process to continue execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand page (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
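
To make the pre-copying scheme concrete, here is a toy Python simulation of a single pre-copy pass followed by the frozen re-copy. It is a sketch under simplifying assumptions: memory is a dictionary of page contents, modified pages are found by comparing contents rather than by hardware dirty bits, and the function name is hypothetical.

```python
def precopy_migrate(memory, writes_during_transfer):
    """Simulate one pre-copy pass followed by a frozen re-copy.

    memory: dict page -> content on the source (updated in place).
    writes_during_transfer: dict page -> new content written by the
        still running process while the first pass is in progress.
    Returns (target_memory, pages_copied_while_frozen).
    """
    target = dict(memory)                    # first pass, process still runs
    memory.update(writes_during_transfer)    # source keeps executing meanwhile
    dirty = [p for p in memory if target.get(p) != memory[p]]
    # The process is frozen here; only the modified pages are re-copied.
    for p in dirty:
        target[p] = memory[p]
    return target, len(dirty)
```

With three pages and one write during the transfer, only one page is copied while the process is frozen, against three for a total copy.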

focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node, at regular intervals, sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
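
The two rules above can be simulated in a few lines of Python. This is a minimal sketch, assuming a fixed `fanout` (size of the random subset) and a fixed buffer size `window`; both names, and the `GossipNode` class, are hypothetical and the actual MOSIX values are not given in these slides.

```python
import random
from collections import deque

class GossipNode:
    def __init__(self, name, load, window=4):
        self.name = name
        self.load = load
        # Small buffer that keeps only the most recently received entries.
        self.recent = deque(maxlen=window)

    def receive(self, sender, load):
        self.recent.append((sender, load))

def gossip_round(nodes, fanout, rng):
    """Each node sends its load to `fanout` randomly chosen other nodes."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(node.name, node.load)
```

Each round costs `fanout` messages per node regardless of the cluster size, which is what keeps the scheme scalable; the price is that every node only sees a partial, slightly stale view.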

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:

- Load (based on a unit of a P II-400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw Memory Used
- Total Memory

Look at http://hpc.sissa.it/walrus/mjm for a java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.

This algorithm tries to migrate the smallest running process to the node with the most free memory. If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
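
The basic choice made by memory ushering can be sketched as follows. This is only an illustration under stated assumptions: the function name, the parameters and the memory units are hypothetical, and the step that moves processes between remote nodes to make room is left out.

```python
def usher(processes, free_mem, threshold, peers_free_mem):
    """Pick (pid, target) when free memory drops under `threshold`.

    processes: dict pid -> memory size of each running process on this node.
    peers_free_mem: dict node -> free memory last reported by that node.
    Returns None when memory is not low or no peer can host the process.
    """
    if free_mem >= threshold or not processes or not peers_free_mem:
        return None
    # Smallest running process, ties broken by pid.
    pid = min(sorted(processes), key=lambda p: processes[p])
    # Node with the most free memory, ties broken by node name.
    target = max(sorted(peers_free_mem), key=lambda n: peers_free_mem[n])
    if peers_free_mem[target] < processes[pid]:
        return None
    return pid, target
```

Choosing the smallest process keeps the migration cheap; choosing the node with the most free memory reduces the chance of pushing the target into paging as well.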

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted/authenticated/protected, therefore it's better to use a private network for the MNP.

(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the linux kernel)
- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.
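
The deputy/remote split can be pictured with two small Python classes. This is a toy model, not MOSIX code: the class names follow the terminology of the slide, while the `syscall` method, the two supported calls and the `forwarded` counter are assumptions made for the example.

```python
class Deputy:
    """Stays on the home node and executes system calls there."""
    def __init__(self, home):
        self.home = home

    def syscall(self, name, *args):
        if name == "gethostname":
            return self.home
        if name == "write":
            return len(args[1])     # pretend all the bytes were written
        raise NotImplementedError(name)

class Remote:
    """The migrated part: computes locally, forwards system calls."""
    def __init__(self, deputy, running_node):
        self.deputy = deputy
        self.running_node = running_node
        self.forwarded = 0

    def syscall(self, name, *args):
        self.forwarded += 1         # one round trip over the link layer
        return self.deputy.syscall(name, *args)
```

A process running on `node7` with its deputy on `home1` still sees `home1` as its host name, which is the transparency; the `forwarded` counter grows with every system call, which is the cost.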

MOSIX deputy/remote mechanism

[Diagram: the Deputy on the UHN (Unique Home Node) and the Remote on the running node, each on top of a link layer; the two link layers are connected by MNP (Mosix Network Protocol)]

Mosix performance/1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the bottleneck of a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance/2

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:

- cache coherent: clients keep their caches synchronized, or only 1 cache at the server node is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access. For such file systems Mosix will perform I/O operations directly on the running node.
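
The resulting decision, where an I/O operation is executed, can be sketched as a simple lookup. This is an illustrative sketch only: the function name, the list of DFSA mount points and the `/mfs` path used in the usage note are assumptions, not the actual MOSIX interface.

```python
def where_to_run_io(path, dfsa_mounts):
    """Return "running node" if `path` lies on a file system declared
    as DFSA, else "home node" (the call is redirected to the deputy)."""
    for mount in dfsa_mounts:
        base = mount.rstrip("/")
        # Match the mount point itself or anything below it, but not
        # a sibling directory that merely shares the same prefix.
        if path == base or path.startswith(base + "/"):
            return "running node"
    return "home node"
```

With `dfsa_mounts=["/mfs"]`, a file under `/mfs` is accessed from the running node, while a file under `/home` still goes through the Unique Home Node.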

MOSIX dfsa (direct file system access)

[Diagram: the Deputy on the UHN (Unique Home Node) and the Remote on the running node, each on top of a link layer, connected by MNP (Mosix Network Protocol), plus a DFSA access path]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level:

- identify
- compare
- reconstruct

A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:

- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 19901995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance/1
  • Mosix performance/2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems
Page 16: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 16

Non pre-emptive load balancing (aka job placement)

centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects

distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs

random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs

Nov 242000 rinnocente 17

Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random

another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain

load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer

neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken

broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes

if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node

Nov 242000 rinnocente 18

Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)

std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time

but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations

Nov 242000 rinnocente 19

Process migrationmechanisms

Nov 242000 rinnocente 20

Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)

no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration

User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch

Nov 242000 rinnocente 21

Process migration factors

transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration

residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration

performanceperformancehow much the migration affects how much the migration affects performanceperformance

complexitycomplexity how much complex the migration is how much complex the migration is

Nov 242000 rinnocente 22

Transparent

Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent

Nov 242000 rinnocente 23

System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is

just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely

transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes

A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent

Nov 242000 rinnocente 24

Process migration parts

execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration

Nov 242000 rinnocente 25

Execution state migration

Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node

Nov 242000 rinnocente 26

Communication statemigration

It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel

during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned

sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the

migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client

Nov 242000 rinnocente 27

Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact

Nov 242000 rinnocente 28

Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space

is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))

prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time

lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred

shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)

Nov 242000 rinnocente 29

focus on Mosix

Nov 242000 rinnocente 30

MOSIX important points

Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications

Nov 242000 rinnocente 31

Probabilistic info dissemination

each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes

each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received

Nov 242000 rinnocente 32

MOSIX load balancing algorithm

Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that

each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering

Nov 242000 rinnocente 33

Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster

Nov 242000 rinnocente 34

Memory ushering algorithm

Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes

to create sufficient space on a single nodeto create sufficient space on a single node

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
- UDP to transfer load info
- TCP to migrate processes
These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP.
(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)
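As an illustration of load info travelling in a UDP datagram, here is a minimal sketch. The record layout (six unsigned 32-bit integers mirroring the Mosix info variables) and the function names are assumptions made for this example; the real MNP wire format is not described in these slides.

```python
import socket
import struct

# Hypothetical layout: load, number of processors, processor speed,
# memory used, raw memory used, total memory (network byte order).
INFO_FORMAT = "!6I"


def pack_info(load, ncpus, speed, mem_used, raw_mem_used, mem_total):
    return struct.pack(INFO_FORMAT, load, ncpus, speed,
                       mem_used, raw_mem_used, mem_total)


def unpack_info(datagram):
    return struct.unpack(INFO_FORMAT, datagram)


def loopback_demo():
    """Send one info record over UDP on the loopback interface."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # let the OS pick a free port
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        record = pack_info(150, 2, 400, 96, 120, 256)
        sender.sendto(record, receiver.getsockname())
        data, _ = receiver.recvfrom(64)
        return unpack_info(data)
    finally:
        sender.close()
        receiver.close()
```

Calling `loopback_demo()` sends a single record and returns the decoded tuple. Note that nothing in the datagram is encrypted or authenticated, which is the reason given above for preferring a private network.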

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing at the system call level a link layer that redirects the execution of system calls to the originating node of the process (the Unique Home Node).
The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).
The migrated process is called the remote.

MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its own link layer; the two link layers communicate through the MNP (Mosix Network Protocol).]
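The deputy/remote split can be pictured with a toy model in which the remote computes locally and hands every system call to the deputy through a link layer. All class and method names below are hypothetical, and the link layer here is a plain function call, where the real one would cross the network.

```python
import os


class Deputy:
    """Stays on the home node and runs system calls on behalf of the remote."""

    def execute(self, name, *args):
        # Only a couple of calls are mapped in this sketch.
        table = {"getcwd": os.getcwd, "getpid": os.getpid}
        return table[name](*args)


class LinkLayer:
    """Stand-in for the pair of link layers joining remote and deputy."""

    def __init__(self, deputy):
        self.deputy = deputy
        self.forwarded = []  # names of the calls sent to the home node

    def syscall(self, name, *args):
        self.forwarded.append(name)
        return self.deputy.execute(name, *args)


class Remote:
    """The migrated part: computes locally, forwards system calls."""

    def __init__(self, link):
        self.link = link

    def run(self):
        total = sum(i * i for i in range(1000))  # CPU work stays local
        home_pid = self.link.syscall("getpid")   # executed by the deputy
        return total, home_pid
```

The model also shows where the cost lies: the computation never touches the link, while every system call does.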

Mosix performance 1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problem, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck at a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:
- cache coherent: clients keep their caches synchronized, or only one cache at the server node is kept
- global: every file can be seen from every node with the same name
If such a file system can be used, then I/O operations can be executed directly on the running node, without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system supports Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
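The routing decision described above can be sketched as a lookup on the mount point of the file. The function name, the list of DFSA mount points and the node labels are assumptions for illustration only; how Mosix actually records that a file system is a DFSA is not shown in these slides.

```python
def io_node(path, dfsa_mounts, home_node, running_node):
    """Return the node on which an I/O operation on path would be executed."""
    for mount in dfsa_mounts:
        prefix = mount.rstrip("/") + "/"
        if path == mount or path.startswith(prefix):
            return running_node  # DFSA: no redirection to the home node
    return home_node             # default: redirected through the deputy
```

For example, with `/mfs` declared as a DFSA mount point, a read of `/mfs/data/x` stays on the running node while a read of `/etc/passwd` goes back to the home node.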

MOSIX DFSA (direct file system access)

[Figure: the deputy/remote scheme (deputy and link layer on the UHN, Unique Home Node; remote and link layer on the running node; MNP, Mosix Network Protocol, between the two link layers), with the addition of DFSA access.]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level:
- identify
- compare
- reconstruct
A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:
- GFS (Global File System)
- xFS (Serverless File System)
The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis/Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990/1995)
  • Condor (Litzkow/Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems

Pre-emptive load balancing

- random: when a computer becomes overloaded it selects at random another computer to which to transfer a process
- central: one load balancer regularly polls the other machines to obtain load info. If it detects a load imbalance, it selects a computer with a low load to which to migrate a process from an overloaded computer
- neighbour (Bryant/Finkel 1981): each computer sends a request to its neighbours; when the request is accepted a pair is constituted. If they detect a load imbalance between them, they migrate some processes from one to the other; after that the pair is broken
- broadcast
- server based
- distributed: each node receives a measure of the load of the other nodes; if it discovers a load imbalance it tries to migrate some of its processes to a less loaded node

Processor load measures

The metric used to evaluate the load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):
- std Unix # of tasks in the run queue
- size of the free available memory
- rate of CPU context switches
- rate of system calls
- 1 minute load avg
- amount of CPU free time
but the conclusion of many studies indicates that the simple run queue length is a better measure than complex load criteria in most situations.
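To relate two of the metrics listed above: on Linux, for instance, the 1 minute load avg is an exponentially damped average of run queue samples taken every 5 seconds (Linux also counts tasks in uninterruptible sleep). A minimal floating-point sketch, with hypothetical function names:

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between run queue samples
WINDOW = 60.0          # averaging window of the 1 minute load avg
DECAY = math.exp(-SAMPLE_INTERVAL / WINDOW)


def update_load(load, run_queue_length):
    """One update step: old value decays, the new sample is blended in."""
    return load * DECAY + run_queue_length * (1.0 - DECAY)


def load_after(samples, load=0.0):
    """Apply a sequence of run queue samples to an initial load value."""
    for n in samples:
        load = update_load(load, n)
    return load
```

With a constant run queue of 2 tasks, one minute of samples brings the average from 0 to 2(1 - 1/e), about 1.26, which shows how the load avg lags behind the instantaneous run queue length.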

Process migration mechanisms

User/kernel level migration

- Kernel level process migration (Mosix)
  - no special requirements, no need to recompile
  - transparent migration
- User level process migration (Condor)
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on signal
  - the snapshot is restarted on another machine
  - more details at http://hpc.sissa.it/batch

Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process is dependent on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
- Very easy on message based OSs, in which a system call is just a message to the kernel
- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:
- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply msgs to a migrating client
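The reject, retransmit and locate sequence listed above can be mimicked with a small in-memory model. It is not Amoeba code: the `Kernel` class, the `locate` function (standing in for the FLIP lookup) and the `between_retries` hook (standing in for the retransmission delay) are hypothetical names.

```python
class Kernel:
    def __init__(self, name):
        self.name = name
        self.processes = {}     # pid -> inbox (list of messages)
        self.migrating = set()  # pids currently suspended for migration

    def deliver(self, pid, msg):
        if pid in self.migrating:
            return "process migrating"
        if pid not in self.processes:
            return "not here"
        self.processes[pid].append(msg)
        return "ok"


def locate(kernels, pid):
    """Stand-in for the FLIP lookup: find the kernel now hosting pid."""
    for kernel in kernels:
        if pid in kernel.processes and pid not in kernel.migrating:
            return kernel
    return None


def send(kernels, first_guess, pid, msg,
         between_retries=lambda: None, max_tries=5):
    """Deliver msg to pid; return the number of attempts that were needed."""
    target = first_guess
    for attempt in range(1, max_tries + 1):
        if target is not None and target.deliver(pid, msg) == "ok":
            return attempt
        between_retries()  # stands for the retransmission delay
        target = locate(kernels, pid)
    raise RuntimeError("destination not found")
```

The sender keeps no knowledge of the migration itself: it only reacts to the rejection and asks again where the process is.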

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation requires:
- substitution of the socket library
- relinking of programs with the new library
(Bubak, Zbik et al. 1999)
In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint then should consult the server for the new real address to contact.

Memory space migration

- total copy (Charlotte/LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)
- pre-copying (V System/Theimer): allows the process to continue its execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand paging (Accent/Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
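Of the strategies above, pre-copying is the one with two distinct phases, and a tiny simulation makes them visible. This is a sketch under simplifying assumptions: memory is a dictionary of pages, a single pre-copy pass is made, and the set of pages modified during that pass is given as input; the function name is hypothetical.

```python
def pre_copy(pages, dirtied_while_copying):
    """Simulate pre-copying.

    pages: dict page number -> content on the source node.
    dirtied_while_copying: dict page number -> new content written by the
    still-running process during the first pass.
    Returns (target memory, pages sent while running, pages sent while frozen).
    """
    target = {}
    # Pass 1: the process keeps running while every page is sent.
    for number, content in pages.items():
        target[number] = content
    sent_running = len(pages)
    # Meanwhile the process has modified some pages on the source.
    pages.update(dirtied_while_copying)
    # Pass 2: the process is frozen and only the modified pages are re-sent.
    for number in dirtied_while_copying:
        target[number] = pages[number]
    return target, sent_running, len(dirtied_while_copying)
```

The freeze time depends only on the number of pages modified during the first pass, not on the total size of the memory space, which is the advantage over total copy.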

focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node at regular intervals sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
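The two rules above can be sketched as follows. The class name, the buffer size and the fanout (how many nodes are contacted per interval) are hypothetical parameters chosen for the example; the slides do not give the values used by MOSIX.

```python
import random
from collections import OrderedDict


class InfoNode:
    def __init__(self, name, buffer_size=4):
        self.name = name
        self.load = 0
        self.buffer_size = buffer_size
        self.known = OrderedDict()  # sender name -> last load heard

    def receive(self, sender, load):
        self.known.pop(sender, None)        # refresh an existing entry
        self.known[sender] = load
        while len(self.known) > self.buffer_size:
            self.known.popitem(last=False)  # forget the oldest entry

    def gossip(self, nodes, fanout, rng):
        """Send own load to a random subset of the other nodes."""
        peers = [n for n in nodes if n is not self]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            peer.receive(self.name, self.load)
```

Calling `gossip` on every node once per interval gives each node a partial, slightly stale view of the cluster, at a cost that does not grow with the cluster size.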

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.
When a node is low on memory a different algorithm prevails: memory ushering.
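The decision each node takes can be sketched as a search for the least loaded node among those it has recently heard from. The function name and the `min_gain` margin are assumptions introduced for this example; the actual MOSIX criterion is not detailed in the slides.

```python
def pick_target(local_load, known_loads, min_gain=1):
    """Return the name of the least loaded known node, or None.

    known_loads maps node name -> last load value received from it.
    min_gain is a hypothetical margin that avoids migrations between
    nodes whose loads are nearly equal.
    """
    if not known_loads:
        return None
    name = min(known_loads, key=known_loads.get)
    if known_loads[name] + min_gain < local_load:
        return name
    return None
```

Since every node runs the same check on its own partial information, there is no central balancer to fail or to become a bottleneck.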

Page 18: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 18

Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)

std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time

but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations

Nov 242000 rinnocente 19

Process migrationmechanisms

Nov 242000 rinnocente 20

Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)

no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration

User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch

Nov 242000 rinnocente 21

Process migration factors

transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration

residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration

performanceperformancehow much the migration affects how much the migration affects performanceperformance

complexitycomplexity how much complex the migration is how much complex the migration is

Nov 242000 rinnocente 22

Transparent

Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent

Nov 242000 rinnocente 23

System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is

just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely

transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes

A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent

Nov 242000 rinnocente 24

Process migration parts

execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration

Nov 242000 rinnocente 25

Execution state migration

Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node

Nov 242000 rinnocente 26

Communication statemigration

It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel

during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned

sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the

migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client

Nov 242000 rinnocente 27

Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node. This can require many seconds, and it can transfer memory that will never be used.
- pre-copying (V System, Theimer): allows the process to continue executing while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time.
- lazy-copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation, used by MOSIX, is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced.
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in).
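To compare the strategies above, the following sketch classifies each page by when it would cross the network: while the process is still running, while it is frozen, or later on a page fault. The function names and the dictionary layout are assumptions made for illustration; real systems track pages in the kernel's memory manager.

```python
def total_copy(pages):
    """Everything is moved while the process is frozen."""
    return {"while_running": set(), "while_frozen": set(pages), "on_fault": set()}


def pre_copy(pages, modified_during_transfer):
    """All pages go out while running; modified ones are re-sent when frozen."""
    pages = set(pages)
    return {
        "while_running": pages,
        "while_frozen": pages & set(modified_during_transfer),
        "on_fault": set(),
    }


def mosix_lazy_copy(pages, dirty):
    """Dirty pages move while frozen; the rest are fetched only if referenced."""
    pages = set(pages)
    frozen = pages & set(dirty)
    return {"while_running": set(), "while_frozen": frozen, "on_fault": pages - frozen}
```

The size of `while_frozen` is a rough proxy for the freeze time, which is what pre-copying and lazy-copying try to shorten compared with total copy.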

Focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- Each node, at regular intervals, sends info on its resources to a random subset of nodes.
- At the same time, each node keeps a small buffer with the most recently received info.
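The two rules above can be sketched as a small simulation. The buffer size (`window`), the subset size (`fanout`) and the class and function names are assumptions for illustration; the slides give no concrete values and this is not the MOSIX kernel code.

```python
import random
from collections import OrderedDict


class Node:
    def __init__(self, name, window=4):
        self.name = name
        self.load = 0.0
        self.window = window       # assumed size of the small buffer
        self.info = OrderedDict()  # sender name -> load, oldest entry first

    def receive(self, sender, load):
        """Store the newest report and drop the oldest one if the buffer is full."""
        self.info.pop(sender, None)
        self.info[sender] = load
        while len(self.info) > self.window:
            self.info.popitem(last=False)


def gossip_round(nodes, fanout, rng):
    """Each node sends its load to a random subset of the other nodes."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for target in rng.sample(others, fanout):
            target.receive(node.name, node.load)
```

Because every node knows only a few recent reports, no node needs a global view, which is in line with the decentralized control listed among the MOSIX important points.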

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load average metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.

When a node is low on memory, a different algorithm prevails: memory ushering.
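The decision "find a less loaded node" can be sketched as a single function over the load reports a node currently holds. The `min_gain` margin is an assumption added here to avoid migrating for negligible differences; the slides do not say how MOSIX decides that a node is sufficiently less loaded.

```python
def pick_target(my_load, known_loads, min_gain=1.0):
    """Return the least loaded known node, or None if migrating is not worth it.

    known_loads maps node name -> last reported load.
    min_gain is a hypothetical margin, not a documented MOSIX parameter.
    """
    if not known_loads:
        return None
    name, load = min(known_loads.items(), key=lambda item: item[1])
    return name if my_load - load >= min_gain else None
```

For example, a node with load 3.0 that knows of loads 0.5 and 2.5 would pick the node reporting 0.5, while a node with load 1.0 would stay put.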

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:

- Load (based on a unit of a P II-400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw memory used
- Total memory

Look at httphpcsissaitwalrusmjm for a Java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls below a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.

This algorithm tries to migrate the smallest running process to the node with the most free memory. Eventually, processes are migrated between remote nodes to create sufficient space on a single node.
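A minimal sketch of the ushering choice, assuming memory sizes are plain numbers in the same unit. The function name, the arguments and the check that the target has room for the process are assumptions for illustration, not the MOSIX implementation.

```python
def usher(free_mem, threshold, processes, remote_free):
    """Pick (process, target node) when local free memory is below threshold.

    processes maps process id -> memory size of that running process.
    remote_free maps node name -> free memory reported by that node.
    Returns None when ushering is not triggered or no node has room.
    """
    if free_mem >= threshold:
        return None                                # load balancing stays in charge
    if not processes or not remote_free:
        return None
    pid = min(processes, key=processes.get)        # smallest running process
    node = max(remote_free, key=remote_free.get)   # node with the most free memory
    if remote_free[node] < processes[pid]:
        return None                                # assumed check: target must have room
    return pid, node
```

When this function returns None because no node has room, the slide's last sentence applies: processes would first have to be moved between remote nodes to free space on one of them.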

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted, authenticated, or protected; therefore it's better to use a private network for the MNP. (Unfortunately, the sockets used by Mosix are not reported by standard tools, e.g. netstat.)

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel).
- Fully transparent: the process involved doesn't need to know about the migration.
- Strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking).

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (the Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals, and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.

MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its link layer, connected through the MNP (Mosix Network Protocol).]

Mosix performance (1)

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck at a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance (2)

A more general solution to the performance problem of I/O bound processes under migration is envisioned to be the use of a cache coherent global file system:

- cache coherent: clients keep their caches synchronized, or only one cache, at the server node, is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node, without resorting to redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access (DFSA) one. For such file systems, Mosix will perform I/O operations directly on the running node.
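The routing choice between the deputy and direct access can be sketched as follows. `Deputy`, `Remote`, `dfsa_mounts` and the returned tuples are hypothetical names for illustration; in Mosix the decision is made inside the kernel, and the mount point used below is only an example value.

```python
class Deputy:
    """Stays on the Unique Home Node and executes redirected calls there."""

    def __init__(self, home_node):
        self.home_node = home_node

    def execute(self, call, path):
        return (call, path, self.home_node)


class Remote:
    """The migrated process: I/O goes home unless the path is on a DFSA mount."""

    def __init__(self, deputy, running_node, dfsa_mounts=()):
        self.deputy = deputy
        self.running_node = running_node
        self.dfsa_mounts = tuple(m.rstrip("/") for m in dfsa_mounts)

    def io(self, call, path):
        for mount in self.dfsa_mounts:
            if path == mount or path.startswith(mount + "/"):
                return (call, path, self.running_node)  # performed locally
        return self.deputy.execute(call, path)          # redirected home
```

The third element of each result shows where the call was carried out, which makes the saving visible: calls on a DFSA mount never travel back to the home node.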

MOSIX dfsa (direct file system access)

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its link layer, connected through the MNP (Mosix Network Protocol); the remote also has DFSA access.]

Cache coherent global distributed file systems

To be a MOSIX DFSA, it's not sufficient to be a cache-coherent global file system. In fact, some additional functions (not available in standard Linux super-blocks) must be available at the super-block level:

- identify
- compare
- reconstruct

A prototype DFSA, which is just a small software layer over the standard Linux file system, is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:

- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990-1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance (1)
  • Mosix performance (2)
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems
Process migration mechanisms

User/kernel level migration

- Kernel level process migration (Mosix)
  - no special requirements, no need to recompile
  - transparent migration
- User level process migration (Condor)
  - compilation with special libraries (libckpt.a)
  - a snapshot of the running process is taken on signal
  - the snapshot is restarted on another machine
  - more details at httphpcsissaitbatch

Process migration factors

- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process is dependent on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparent

- Location transparent
- Access transparent (or network)
- Migration transparent
- Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:

- Very easy on message based OSs, in which a system call is just a message to the kernel.
- A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes.
- A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent.
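The library-level variant can be imitated in a few lines by replacing a library function with a wrapper that runs extra code and then calls the original. This is only an analogy in Python, not how Condor's checkpoint library is built; `interpose` and `calls` are hypothetical names.

```python
import functools
import os

calls = []  # record of interposed calls, kept only for illustration


def interpose(module, name):
    """Replace module.name with a wrapper; return the original function."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        calls.append(name)                # the interposed code runs first
        return original(*args, **kwargs)  # then the real call is performed

    setattr(module, name, wrapper)
    return original
```

After `original = interpose(os, "getcwd")`, every `os.getcwd()` in the program goes through the wrapper without the callers being changed; assigning `os.getcwd = original` undoes it. A migration system would redirect or record the call at that point instead of just logging it.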

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Execution state migration

It's not that difficult. For the most part, the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Page 20: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 20

Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)

no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration

User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch

Nov 242000 rinnocente 21

Process migration factors

transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration

residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration

performanceperformancehow much the migration affects how much the migration affects performanceperformance

complexitycomplexity how much complex the migration is how much complex the migration is

Nov 242000 rinnocente 22

Transparent

Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
  • Very easy on message-based OSs, in which a system call is just a message to the kernel
  • A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
  • A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent

Process migration parts

  • execution state migration
  • memory space migration
  • communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task, and it is usually avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.

On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:
  • during migration any message received for a suspended process is rejected and a "process migrating" response is returned
  • the sending machine will retransmit after some delay
  • the sending machine, using a special protocol (FLIP), will locate the migrated process
  • the same technique is used for reply messages to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation (Bubak, Zbik et al. 1999) requires:
  • substitution of the socket library
  • relinking of programs with the new library

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" message. The other endpoint should then consult the server for the new real address to contact.
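The virtual address scheme described on this slide can be illustrated with a small in-memory model. This is only a sketch: the class and function names (`AddressServer`, `Node`, `migrate`, `send`) are hypothetical and do not come from the cited proposal, and no real sockets are involved.

```python
class AddressServer:
    """Hypothetical lookup service: virtual address -> node currently hosting it."""

    def __init__(self):
        self.table = {}

    def register(self, virtual, real):
        self.table[virtual] = real

    def lookup(self, virtual):
        return self.table[virtual]


class Node:
    """A machine that hosts endpoints and keeps stubs for endpoints that left."""

    def __init__(self, name):
        self.name = name
        self.endpoints = set()  # virtual addresses living on this node
        self.stubs = set()      # virtual addresses that migrated away

    def deliver(self, virtual, payload):
        if virtual in self.endpoints:
            return "delivered"
        if virtual in self.stubs:
            return "process migrating"
        return "unknown endpoint"


def migrate(server, nodes, virtual, src, dst):
    """Move an endpoint, leave a stub behind, update the correspondence table."""
    nodes[src].endpoints.discard(virtual)
    nodes[src].stubs.add(virtual)
    nodes[dst].endpoints.add(virtual)
    server.register(virtual, dst)


def send(server, nodes, cache, virtual, payload):
    """Send to the cached real address; on 'process migrating' ask the server once."""
    real = cache.get(virtual) or server.lookup(virtual)
    cache[virtual] = real
    reply = nodes[real].deliver(virtual, payload)
    if reply == "process migrating":
        real = server.lookup(virtual)
        cache[virtual] = real
        reply = nodes[real].deliver(virtual, payload)
    return reply, real


server = AddressServer()
nodes = {"A": Node("A"), "B": Node("B")}
nodes["A"].endpoints.add("vaddr-1")
server.register("vaddr-1", "A")

cache = {}
first = send(server, nodes, cache, "vaddr-1", b"hello")
migrate(server, nodes, "vaddr-1", "A", "B")
second = send(server, nodes, cache, "vaddr-1", b"hello again")
```

The sender keeps using its cached real address until the stub tells it the endpoint has moved, so the address server is consulted only after a migration.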

Memory space migration

  • total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds and, further, it can transfer memory that will not be used)
  • pre-copying (V System, Theimer): allows the process to continue its execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
  • lazy-copying or demand page (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
  • shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
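To compare the first three strategies, the following sketch counts how many pages each one sends and how many of them are sent while the process is frozen. It is a toy model under stated assumptions: the function names are hypothetical, pre-copying is modelled with a single re-copy pass as described on the slide, and page sets are supplied by the caller rather than measured.

```python
def total_copy(num_pages):
    """Total copy: the process is frozen while every page is sent."""
    return {"pages_sent": num_pages, "pages_sent_while_frozen": num_pages}


def pre_copy(num_pages, dirtied_during_transfer):
    """Pre-copying: a first pass runs while the process executes, then the
    process is frozen and only the pages dirtied during that pass are re-sent."""
    dirty = set(dirtied_during_transfer)
    if any(p < 0 or p >= num_pages for p in dirty):
        raise ValueError("page index out of range")
    return {
        "pages_sent": num_pages + len(dirty),
        "pages_sent_while_frozen": len(dirty),
    }


def lazy_copy(num_pages, touched_after_migration):
    """Lazy copying: nothing is sent up front; a page moves only when the
    migrated process faults on it, the rest stays on the source node."""
    touched = set(touched_after_migration)
    if any(p < 0 or p >= num_pages for p in touched):
        raise ValueError("page index out of range")
    return {
        "pages_sent": len(touched),
        "pages_sent_while_frozen": 0,
        "pages_left_on_source": num_pages - len(touched),
    }


total = total_copy(1000)
pre = pre_copy(1000, [1, 2, 2, 3])
lazy = lazy_copy(1000, range(100))
```

In this model pre-copying sends more pages overall than total copy but freezes the process for far fewer of them, while lazy copying sends the fewest pages at the cost of a residual dependency on the source node.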

Focus on Mosix

MOSIX important points

  • Probabilistic info dissemination
  • PPM (Pre-emptive Process Migration)
  • Dynamic load balancing
  • Memory ushering
  • Decentralized control
  • Efficient kernel communications

Probabilistic info dissemination

  • each node, at regular intervals, sends info on its resources to a random subset of nodes
  • each node, at the same time, keeps a small buffer with the most recently received info
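A minimal simulation of this scheme is sketched below. The names `InfoNode` and `disseminate`, the fanout of 2 and the buffer size of 3 are assumptions chosen for illustration; the slides do not give the actual Mosix parameters or message format.

```python
import random
from collections import OrderedDict


class InfoNode:
    """A node that remembers only the most recently received load reports."""

    def __init__(self, node_id, buffer_size):
        self.node_id = node_id
        self.buffer_size = buffer_size
        self.load = 0.0
        self.known = OrderedDict()  # sender id -> load, oldest entry first

    def receive(self, sender_id, load):
        self.known.pop(sender_id, None)  # a fresh report moves to the end
        self.known[sender_id] = load
        while len(self.known) > self.buffer_size:
            self.known.popitem(last=False)  # forget the oldest report


def disseminate(nodes, fanout, rng):
    """One interval: every node sends its load to `fanout` random other nodes."""
    for sender in nodes:
        others = [n for n in nodes if n is not sender]
        for target in rng.sample(others, fanout):
            target.receive(sender.node_id, sender.load)


rng = random.Random(0)
cluster = [InfoNode(i, buffer_size=3) for i in range(8)]
for node in cluster:
    node.load = float(node.node_id)
for _ in range(5):
    disseminate(cluster, fanout=2, rng=rng)
```

Each node ends up with a small, partial and possibly stale view of the cluster, which is the price paid for having no central collector.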

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load average metric. This means that each node will try to find a less loaded node to which it can migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.
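The decision each node takes can be sketched as follows. This is not the Mosix kernel code: the normalization by number and speed of processors, the `margin` parameter and all names are assumptions made for the example.

```python
def normalized_load(run_queue_len, num_cpus, cpu_speed):
    """Load per unit of capacity; cpu_speed is relative to a reference CPU."""
    return run_queue_len / (num_cpus * cpu_speed)


def pick_target(local, known, margin=0.25):
    """Return the least loaded known node if it is clearly less loaded than
    the local node, otherwise None (no migration is attempted)."""
    local_load = normalized_load(**local)
    best_name, best_load = None, local_load
    for name, info in known.items():
        load = normalized_load(**info)
        if load < best_load:
            best_name, best_load = name, load
    if best_name is not None and local_load - best_load > margin:
        return best_name
    return None


busy = dict(run_queue_len=6, num_cpus=1, cpu_speed=1.0)
idle = dict(run_queue_len=1, num_cpus=1, cpu_speed=1.0)
known = {
    "n2": dict(run_queue_len=4, num_cpus=2, cpu_speed=1.0),
    "n3": dict(run_queue_len=3, num_cpus=1, cpu_speed=1.0),
}

target_from_busy = pick_target(busy, known)
target_from_idle = pick_target(idle, known)
```

The margin keeps two nodes with almost equal load from bouncing a process back and forth; its value here is arbitrary.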

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
  • Load (based on a unit of a PII-400 MHz)
  • Number of processors
  • Speed of processor
  • Memory used
  • Raw memory used
  • Total memory

Look at http://hpc.sissa.it/walrus/mjm for a Java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls below a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.

This algorithm tries to migrate the smallest running process to the node with the most free memory. If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
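The selection rule stated on this slide can be written down in a few lines. The function name `usher`, the units (MB) and the threshold value are assumptions for illustration; the step that shuffles processes between remote nodes to make room is only noted in a comment, not implemented.

```python
def usher(processes, free_memory, local_free, threshold):
    """Pick a (pid, node) migration when local free memory is below threshold.

    processes:   {pid: size_mb} of the running processes on this node
    free_memory: {node: free_mb} as last reported by the other nodes
    Returns None when no ushering is needed or possible.
    """
    if local_free >= threshold or not processes or not free_memory:
        return None
    pid = min(processes, key=processes.get)        # smallest running process
    node = max(free_memory, key=free_memory.get)   # node with most free memory
    if free_memory[node] < processes[pid]:
        # Nothing fits anywhere: a fuller model would first migrate
        # processes between remote nodes to create space on one of them.
        return None
    return pid, node


procs = {101: 300, 102: 40, 103: 120}
remote_free = {"n2": 64, "n3": 512}

under_pressure = usher(procs, remote_free, local_free=8, threshold=16)
no_pressure = usher(procs, remote_free, local_free=100, threshold=16)
```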

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
  • UDP to transfer load info
  • TCP to migrate processes

These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP.

(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)

MOSIX process migration

  • It's a kernel level migration (it's done without any modification to the programs, because Mosix changes the Linux kernel)
  • fully transparent: the process involved doesn't need to know about the migration
  • strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.
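The division of work between deputy and remote can be pictured with a small user-level model. Everything here is a stand-in: the real mechanism lives inside the kernel, and the class names, the three supported calls and the message counter are assumptions made for the example.

```python
class Deputy:
    """Stays on the Unique Home Node and executes system calls for the remote."""

    def __init__(self, home_node):
        self.home_node = home_node
        self.files = {}  # stand-in for file state on the home node

    def handle(self, call, *args):
        if call == "write":
            path, data = args
            self.files[path] = self.files.get(path, "") + data
            return len(data)
        if call == "read":
            (path,) = args
            return self.files.get(path, "")
        if call == "gethostname":
            return self.home_node
        raise NotImplementedError(call)


class LinkLayer:
    """Stand-in for the link layer: counts messages instead of using a network."""

    def __init__(self, deputy):
        self.deputy = deputy
        self.messages = 0

    def forward(self, call, *args):
        self.messages += 1
        return self.deputy.handle(call, *args)


class Remote:
    """The migrated part of the process, executing on the running node."""

    def __init__(self, running_node, link):
        self.running_node = running_node
        self.link = link

    def syscall(self, call, *args):
        return self.link.forward(call, *args)  # every system call goes home

    def compute(self, values):
        return sum(v * v for v in values)      # user-level work stays local


deputy = Deputy("home")
link = LinkLayer(deputy)
proc = Remote("node7", link)

written = proc.syscall("write", "/tmp/out", "hello")
hostname = proc.syscall("gethostname")
total = proc.compute(range(4))
```

The process still sees its home node's hostname and files, which is where the transparency comes from, while each system call costs one exchange over the link layer and pure computation costs none.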

MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its link layer, connected through MNP (Mosix Network Protocol)]

Mosix performance 1

While a CPU-bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O-bound and network-bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck at a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O-bound processes under migration is envisioned to be the use of a cache coherent global file system:
  • cache coherent: clients keep their caches synchronized, or only one cache is kept, at the server node
  • global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly on the running node without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system supports Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
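The effect of declaring a file system as DFSA can be sketched as a routing decision on the path of each I/O call. The function name `io_location` and the mount point `/mfs` are hypothetical; how Mosix actually marks and recognizes such file systems is not described in these slides.

```python
def io_location(path, dfsa_mounts, home_node, running_node):
    """Return the node on which an I/O call on `path` would be executed.

    Paths under a mount point declared as DFSA are served on the running
    node; everything else is redirected to the Unique Home Node.
    """
    for mount in dfsa_mounts:
        base = mount.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return running_node
    return home_node


dfsa = ["/mfs"]
on_dfsa = io_location("/mfs/data/x", dfsa, "home", "node7")
lookalike = io_location("/mfsx/file", dfsa, "home", "node7")
elsewhere = io_location("/etc/passwd", dfsa, "home", "node7")
```

Matching on whole path components keeps a directory such as `/mfsx` from being mistaken for the DFSA mount.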

MOSIX dfsa (direct file system access)

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its link layer, connected through MNP (Mosix Network Protocol); the remote additionally has direct DFSA access on the running node]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in standard Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.

A prototype DFSA that is just a small software layer over the standard Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:
  • GFS (Global File System)
  • xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

Transparent

  • Location transparent
  • Access (or network) transparent
  • Migration transparent
  • Failure transparent

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
  • Very easy on message-based OSs, in which a system call is just a message to the kernel
  • A modified kernel: in this case it can be completely transparent to the process (MOSIX); reasonably efficient for some kinds of processes
  • A modified system library: in this case the programs need to be linked with a special library (Condor); reasonably transparent
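The library-level variant can be illustrated with a minimal Python sketch. It is not code from Condor or MOSIX: the names `Interposer` and `home_node_call` and the `migrated` flag are hypothetical, and the "system call" is just an ordinary Python function. The wrapper decides at call time whether to run the call locally or hand it to a redirecting handler.

```python
import os
from typing import Any, Callable


class Interposer:
    """Wraps 'system call' functions so they can be redirected after migration.

    Hypothetical illustration: a real system interposes in the kernel
    or in a relinked system library, not in Python.
    """

    def __init__(self, redirect: Callable[[str, tuple], Any]):
        self.migrated = False      # set to True once the process has moved
        self._redirect = redirect  # stand-in for "ship the call to the home node"

    def wrap(self, name: str, local_call: Callable[..., Any]) -> Callable[..., Any]:
        def interposed(*args: Any) -> Any:
            if self.migrated:
                return self._redirect(name, args)
            return local_call(*args)
        return interposed


# Record redirected calls instead of sending them over a network.
forwarded = []


def home_node_call(name: str, args: tuple) -> str:
    forwarded.append((name, args))
    return "executed on home node"


interposer = Interposer(home_node_call)
getpid = interposer.wrap("getpid", os.getpid)

local_result = getpid()        # runs the real os.getpid()
interposer.migrated = True
remote_result = getpid()       # redirected, never reaches os.getpid()
```

The calling code uses `getpid()` in the same way before and after the flag changes, which is the sense in which the interposition is transparent to the program.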

Process migration parts

  • execution state migration
  • memory space migration
  • communication state migration

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.

Communication state migration

It is a difficult task, and usually it is avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel:
  • during migration, any message received for a suspended process is rejected and a "process migrating" response is returned
  • the sending machine will retransmit after some delay
  • the sending machine, using a special protocol (FLIP), will locate the migrated process
  • the same technique is used for reply messages to a migrating client
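The reject, retransmit and locate steps listed above can be sketched as a toy simulation. This is not Amoeba or FLIP code: `Node`, `locate` and `send` are hypothetical names, the lookup is a plain loop over known nodes, and the "delay" is replaced by a callback so that the example runs instantly.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    processes: dict = field(default_factory=dict)  # pid -> "running" | "suspended"
    inbox: list = field(default_factory=list)

    def deliver(self, pid: int, msg: str) -> str:
        state = self.processes.get(pid)
        if state == "running":
            self.inbox.append((pid, msg))
            return "ok"
        if state == "suspended":
            return "process migrating"   # reject: the sender must retry later
        return "unknown process"         # the process is no longer on this node


def locate(nodes: list, pid: int):
    """Stand-in for a FLIP-style lookup: find the node where pid is running."""
    for node in nodes:
        if node.processes.get(pid) == "running":
            return node
    return None


def send(nodes: list, target: Node, pid: int, msg: str, on_retry, max_tries: int = 5) -> int:
    """Send msg, retransmitting until it is accepted; returns the number of tries."""
    for attempt in range(1, max_tries + 1):
        if target.deliver(pid, msg) == "ok":
            return attempt
        on_retry()                              # stands in for "wait some delay"
        target = locate(nodes, pid) or target   # look for the new location
    raise RuntimeError("message could not be delivered")


a, b = Node("a"), Node("b")
a.processes[7] = "suspended"       # pid 7 is frozen on node a for migration


def finish_migration():
    a.processes.pop(7, None)
    b.processes[7] = "running"


tries = send([a, b], a, 7, "request", on_retry=finish_migration)
```

In this run the first attempt is rejected by node `a`, the migration completes during the simulated delay, and the second attempt is delivered to node `b`.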

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation requires:
  • substitution of the socket library
  • relinking of programs with the new library
(Bubak, Zbik et al. 1999)
In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" message. The other endpoint should then consult the server for the new real address to contact.
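A minimal sketch of the virtual-address idea follows. It is only an illustration of the description above, not the library proposed by Bubak, Zbik et al.: `AddressServer`, `Network` and `send_virtual` are hypothetical names, the addresses are made up, and the network is an in-memory dictionary rather than real sockets.

```python
class AddressServer:
    """Keeps the virtual -> real address correspondence table."""

    def __init__(self):
        self._table = {}

    def register(self, virtual: str, real: tuple) -> None:
        self._table[virtual] = real

    def lookup(self, virtual: str) -> tuple:
        return self._table[virtual]


class Network:
    """Toy network: a real address maps to an endpoint inbox (a list) or to a stub."""

    STUB = "process migrating"

    def __init__(self):
        self.hosts = {}

    def send(self, real: tuple, payload: str) -> str:
        target = self.hosts.get(real)
        if isinstance(target, list):
            target.append(payload)
            return "delivered"
        return self.STUB if target == self.STUB else "unreachable"


def send_virtual(net, server, cache, virtual, payload):
    """Send to a virtual address; if a stub answers, ask the server for the new address."""
    real = cache.get(virtual) or server.lookup(virtual)
    reply = net.send(real, payload)
    if reply == Network.STUB:
        real = server.lookup(virtual)
        reply = net.send(real, payload)
    cache[virtual] = real
    return reply


server, net, cache = AddressServer(), Network(), {}
old, new = ("10.0.0.1", 5000), ("10.0.0.2", 5000)   # made-up addresses
inbox = []
server.register("svc", old)
net.hosts[old] = inbox
first = send_virtual(net, server, cache, "svc", "hello")

# The endpoint migrates: a stub stays behind and the server learns the new address.
net.hosts[old] = Network.STUB
net.hosts[new] = inbox
server.register("svc", new)
second = send_virtual(net, server, cache, "svc", "again")
```

The sender keeps using the virtual name `"svc"` throughout; only the cached real address changes after the stub replies.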

Memory space migration

  • total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds, and furthermore it can transfer memory that will not be used)
  • pre-copying (V System, Theimer): allows the process to continue execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
  • lazy copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
  • shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
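To make the trade-off between the first three strategies concrete, the following sketch counts pages only. It is a simplified model under stated assumptions: the function names and page counts are invented for illustration, a single re-copy round is assumed for pre-copying, and the lazy variant is the pure page-fault scheme (not the MOSIX variation). Each function returns the pair (pages moved while the process is frozen, pages moved in total).

```python
def total_copy(pages: set) -> tuple:
    """Freeze, move everything, restart: every page moves while the process is frozen."""
    return len(pages), len(pages)


def pre_copy(pages: set, dirtied_during_transfer: set) -> tuple:
    """Copy all pages while running, then freeze and re-copy those modified meanwhile."""
    recopied = pages & dirtied_during_transfer
    return len(recopied), len(pages) + len(recopied)


def lazy_copy(pages: set, referenced_later: set) -> tuple:
    """Leave pages on the source; a page moves only when the target faults on it."""
    faulted = pages & referenced_later
    return 0, len(faulted)


pages = set(range(100))          # made-up address space of 100 pages
dirtied = {1, 2, 3}              # pages written during the pre-copy transfer
referenced = set(range(40))      # pages the process touches after migration

results = {
    "total copy": total_copy(pages),
    "pre-copy": pre_copy(pages, dirtied),
    "lazy copy": lazy_copy(pages, referenced),
}
```

With these numbers total copy freezes the process for 100 pages, pre-copying for 3 (at the cost of 103 transfers in total), and lazy copying for none, moving only the 40 pages that are actually referenced.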

Focus on Mosix

MOSIX important points

  • Probabilistic info dissemination
  • PPM (Pre-emptive Process Migration)
  • Dynamic load balancing
  • Memory ushering
  • Decentralized control
  • Efficient kernel communications

Probabilistic info dissemination

  • each node, at regular intervals, sends info on its resources to a random subset of nodes
  • each node at the same time keeps a small buffer with the most recently received info
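The two rules above can be sketched in a few lines. This is an illustration only, not the MOSIX implementation: `InfoNode`, `disseminate`, the fan-out of 2 and the buffer size of 4 are all assumptions chosen for the example, and the "load" is a made-up number.

```python
import random
from collections import deque


class InfoNode:
    def __init__(self, node_id: int, buffer_size: int = 4):
        self.node_id = node_id
        self.load = 0.0
        # Bounded buffer: the oldest entry is dropped when a new one arrives.
        self.recent = deque(maxlen=buffer_size)

    def receive(self, sender_id: int, load: float) -> None:
        self.recent.append((sender_id, load))


def disseminate(nodes: list, fanout: int, rng: random.Random) -> None:
    """One interval: every node sends its load to `fanout` random other nodes."""
    for sender in nodes:
        others = [n for n in nodes if n is not sender]
        for target in rng.sample(others, fanout):
            target.receive(sender.node_id, sender.load)


rng = random.Random(0)                       # fixed seed, for repeatability only
nodes = [InfoNode(i) for i in range(8)]
for n in nodes:
    n.load = float(n.node_id)                # made-up load values

for _ in range(3):                           # three dissemination intervals
    disseminate(nodes, fanout=2, rng=rng)
```

No node ever holds a complete view of the cluster; each one only keeps the few most recent reports that happened to reach it.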

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load average metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.
When a node is low on memory a different algorithm prevails: memory ushering.
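The core decision, "find a less loaded node", can be sketched as follows. This is a simplified stand-in, not the MOSIX algorithm: `pick_migration_target` is a hypothetical name, and the `margin` parameter is an assumed hysteresis term that does not come from the slides.

```python
def pick_migration_target(local_load: float, known_loads: dict, margin: float = 0.0):
    """Return the id of the least loaded known node, or None if none is less loaded.

    known_loads maps node id -> last load value heard from that node.
    margin is a hypothetical hysteresis term (an assumption of this sketch).
    """
    if not known_loads:
        return None
    best = min(known_loads, key=known_loads.get)
    if known_loads[best] + margin < local_load:
        return best
    return None


busy_choice = pick_migration_target(2.0, {"n1": 1.5, "n2": 0.4})
idle_choice = pick_migration_target(0.3, {"n1": 1.5, "n2": 0.4})
```

Here a node with load 2.0 would choose `n2`, while a node with load 0.3 finds no less loaded node and keeps its processes.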

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
  • Load (based on a unit of a PII-400 MHz)
  • Number of processors
  • Speed of processor
  • Memory used
  • Raw memory used
  • Total memory
Look at httphpcsissaitwalrusmjm for a Java applet showing these values over a Mosix cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm. This algorithm tries to migrate the smallest running process to the node with the most free memory. If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
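A minimal sketch of the selection rule described above, under stated assumptions: `usher` is a hypothetical name, sizes are in MB, and the step that migrates processes between remote nodes to make room is not modelled (the function simply gives up when the chosen process does not fit).

```python
def usher(free_mem_mb, threshold_mb, processes, remote_free_mb):
    """Pick (pid, target node) when local free memory is below the threshold.

    processes maps pid -> resident size in MB; remote_free_mb maps node -> free MB.
    Returns None when ushering is not triggered or the process does not fit.
    """
    if free_mem_mb >= threshold_mb or not processes or not remote_free_mb:
        return None
    pid = min(processes, key=processes.get)                # smallest running process
    target = max(remote_free_mb, key=remote_free_mb.get)   # node with most free memory
    if remote_free_mb[target] < processes[pid]:
        return None   # no room anywhere; making space on a node is not modelled here
    return pid, target


decision = usher(10, 32, {101: 64, 102: 20}, {"b": 50, "c": 200})
```

With 10 MB free against a 32 MB threshold, the 20 MB process 102 is selected and sent to node `c`, which reports the most free memory.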

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
  • UDP to transfer load info
  • TCP to migrate processes
These communications are not encrypted, authenticated or protected; therefore it's better to use a private network for the MNP. (Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)

MOSIX process migration

  • It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
  • fully transparent: the process involved doesn't have any need to know about the migration
  • strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing to system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).
The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).
The migrated process is called the remote.
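The deputy/remote split can be sketched with three small classes. This is a toy model, not MOSIX kernel code: the class names mirror the terms used above, the link layer is a direct function call instead of a network connection, and every system call is forwarded, as the text describes.

```python
import os


class Deputy:
    """Stays on the Unique Home Node and executes system calls there."""

    def __init__(self):
        self.calls = []

    def handle(self, request: dict):
        self.calls.append(request["name"])
        func = getattr(os, request["name"])   # run the real call on the home node
        return func(*request["args"])


class LinkLayer:
    """Stand-in for the link between remote and deputy; here a plain method call."""

    def __init__(self, deputy: Deputy):
        self._deputy = deputy

    def forward(self, name: str, *args):
        return self._deputy.handle({"name": name, "args": args})


class Remote:
    """The migrated process: it computes locally but asks the deputy for system calls."""

    def __init__(self, link: LinkLayer):
        self._link = link

    def syscall(self, name: str, *args):
        return self._link.forward(name, *args)


deputy = Deputy()
remote = Remote(LinkLayer(deputy))
cwd = remote.syscall("getcwd")    # answered with the home node's working directory
```

Because the answer comes from the deputy, the remote sees the environment of its home node, which is what makes the migration invisible to the process and also what creates the residual dependency.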

MOSIX deputy/remote mechanism

[Figure labels: Deputy, Link layer, Remote, Link layer, MNP (Mosix Network Protocol), UHN (Unique Home Node), Running node]

Mosix performance 1

While a CPU-bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O-bound and network-bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck: a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O-bound processes under migration is envisioned to be the use of a cache coherent global file system:
  • cache coherent: clients keep their caches synchronized, or only one cache at the server node is kept
  • global: every file can be seen from every node with the same name
If such a file system can be used, then I/O operations can be executed directly from the running node, without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access (DFSA) file system. For such file systems Mosix will perform I/O operations directly on the running node.
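The routing rule described above can be written down as a small decision function. It is a sketch under stated assumptions: `io_site` is a hypothetical name, the list of DFSA mount points is an invented configuration, and the paths are examples only.

```python
def io_site(path: str, dfsa_mounts: list, migrated: bool) -> str:
    """Decide where an I/O operation on `path` is executed.

    dfsa_mounts lists mount points declared as DFSA (hypothetical configuration).
    """
    if not migrated:
        return "home node"             # the process never left its home node
    for mount in dfsa_mounts:
        base = mount.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return "running node"      # direct access, no redirection
    return "home node"                 # redirected to the deputy on the UHN


on_dfsa = io_site("/mfs/data/a.txt", ["/mfs"], migrated=True)
elsewhere = io_site("/home/user/a.txt", ["/mfs"], migrated=True)
```

A migrated process reading from the declared mount point does its I/O where it runs, while any other path still goes through the Unique Home Node.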

Page 23: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 23

System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is

just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely

transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes

A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent

Nov 242000 rinnocente 24

Process migration parts

execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration

Nov 242000 rinnocente 25

Execution state migration

Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node

Nov 242000 rinnocente 26

Communication statemigration

It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel

during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned

sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the

migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client

Nov 242000 rinnocente 27

Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact

Nov 242000 rinnocente 28

Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space

is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))

prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time

lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred

shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)

Nov 242000 rinnocente 29

focus on Mosix

Nov 242000 rinnocente 30

MOSIX important points

Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications

Nov 242000 rinnocente 31

Probabilistic info dissemination

each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes

each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received

Nov 242000 rinnocente 32

MOSIX load balancing algorithm

Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that

each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering

Nov 242000 rinnocente 33

Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster

Nov 242000 rinnocente 34

Memory ushering algorithm

Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes

to create sufficient space on a single nodeto create sufficient space on a single node

Nov 242000 rinnocente 35

MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses

UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes

These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by

standard tools eg netstat)standard tools eg netstat)

Nov 242000 rinnocente 36

MOSIX process migration

Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)

fully transparent fully transparent the process involved doesrsquont have any need to know about the migration

strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)

Nov 242000 rinnocente 37

Mosix migration transparency

Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )

The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)

The migrated process is called the The migrated process is called the remoteremote

Nov 242000 rinnocente 38

MOSIX deputyremotemechanism

Deputy

Link layer

Remote

Link layerMNP(MosixNetworkProtocol)

UHN(UniqueHomeNode)

Runningnode

Nov 242000 rinnocente 39

Mosix performance1 While a CPU bound process obtains clear advantages from Mosix

being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration

A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)

Nov 242000 rinnocente 40

Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1

cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node

Nov 242000 rinnocente 41

MOSIX dfsa (direct file system access)

Deputy

Link layer

Remote

Link layerMNP (MosixNetworkProtocol)

DFSA access

UHN(UniqueHome Node)

Runningnode

Nov 242000 rinnocente 42

Cache coherent global distributed file systems

To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std

Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990-1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems

Nov 242000 rinnocente 24

Process migration parts

- execution state migration
- memory space migration
- communication state migration

Nov 242000 rinnocente 25

Execution state migration

It's not that difficult. For the most part the code can be the same as that used for swapping. In that case the destination is a disk, while for migration the destination is a different node.
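The analogy with swapping can be illustrated with a minimal sketch: the same routine saves the execution state to any byte stream, and only the destination changes. The state layout and the function names are assumptions for illustration; a real kernel does not use `pickle`.

```python
import pickle


def save_state(state, destination):
    """Write the execution state to any writable binary stream.

    The stream may be a swap file on disk or a connection to another node;
    this code does not need to know which.
    """
    pickle.dump(state, destination)


def restore_state(source):
    """Rebuild the execution state from a readable binary stream."""
    return pickle.load(source)
```

Passing an open file as `destination` models swapping, while passing a stream connected to another node models migration.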

Nov 242000 rinnocente 26

Communication state migration

It is a difficult task and is usually avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed.

On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel.
- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply messages to a migrating client
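The reject, retransmit and locate steps listed above can be sketched with a toy model. All class and function names are hypothetical, and `locate` is only a stand-in for the FLIP lookup, not the real protocol.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.processes = {}     # pid -> list of delivered messages
        self.migrating = set()  # pids currently suspended for migration

    def deliver(self, pid, msg):
        """Return 'ok', 'process migrating' or 'unknown process'."""
        if pid in self.migrating:
            return "process migrating"  # the message is rejected, not queued
        if pid in self.processes:
            self.processes[pid].append(msg)
            return "ok"
        return "unknown process"


def locate(nodes, pid):
    """Stand-in for the FLIP lookup: find the node that now hosts pid."""
    for node in nodes:
        if pid in node.processes and pid not in node.migrating:
            return node
    return None


def send(nodes, target, pid, msg, max_retries=3, on_retry=None):
    """Send msg; on rejection wait (modelled by on_retry), locate, retransmit."""
    for _ in range(max_retries + 1):
        if target.deliver(pid, msg) == "ok":
            return target
        if on_retry is not None:
            on_retry()  # models the delay before retransmission
        target = locate(nodes, pid) or target
    raise RuntimeError("process could not be reached")
```

The sender never needs to be told in advance about the migration: it learns of it from the rejection and finds the new location on its own.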

Nov 242000 rinnocente 27

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation requires:
- substitution of the socket library
- relinking of programs with the new library

(Bubak, Zbik et al. 1999)

In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" message. The other endpoint should then consult the server for the new real address to contact.
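A minimal sketch of the virtual address scheme described above follows. It is not the implementation of the cited proposal: the classes, the `STUB` marker and the address strings are assumptions made only to show the lookup, stub reply and re-lookup sequence.

```python
STUB = object()  # marker left on the original node after a migration


class AddressServer:
    """Keeps the correspondence table between virtual and real addresses."""

    def __init__(self):
        self.table = {}

    def register(self, virtual, real):
        self.table[virtual] = real

    def lookup(self, virtual):
        return self.table[virtual]


class Network:
    """Maps each real address to an inbox list, or to STUB after migration."""

    def __init__(self):
        self.hosts = {}

    def send(self, real, data):
        target = self.hosts.get(real)
        if target is None:
            return "unreachable"
        if target is STUB:
            return "process migrating"
        target.append(data)
        return "ok"


def virtual_send(server, net, cache, virtual, data):
    """Send to a virtual address, refreshing the cached real address if needed."""
    real = cache.get(virtual) or server.lookup(virtual)
    if net.send(real, data) == "process migrating":
        real = server.lookup(virtual)  # ask the server for the new real address
        if net.send(real, data) != "ok":
            raise ConnectionError("endpoint not reachable at " + real)
    cache[virtual] = real
    return real
```

The sender keeps using the same virtual address before and after the migration; only the cached real address changes.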

Nov 242000 rinnocente 28

Memory space migration

- total copy (Charlotte, LOCUS): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can require many seconds, and it can also transfer memory that will never be used)
- pre-copying (V System, Theimer): allows the process to continue execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time
- lazy-copying or demand paging (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer the other pages from the backing store when they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system and the process is restarted on the target (it's just a swap-out/swap-in)
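Of the four strategies, pre-copying is the one with the most steps, so a small sketch may help. It follows the two phases described above (copy while running, then freeze and re-copy the modified pages). The function name, the page dictionary and the `run_while_copying` callback are assumptions used to simulate a process that keeps writing during the transfer.

```python
def precopy_migrate(memory, run_while_copying):
    """Copy `memory` (page number -> contents) to a target in two phases.

    `run_while_copying(memory, page)` simulates the process running while
    `page` is in flight; it may modify `memory` and must return the set of
    page numbers it wrote.
    """
    target = {}
    dirtied = set()

    # Phase 1: the process keeps running while every page is copied once.
    for page in sorted(memory):
        target[page] = memory[page]
        dirtied |= run_while_copying(memory, page)

    # Phase 2: the process is frozen; only the pages modified during
    # phase 1 are copied again.
    for page in dirtied:
        target[page] = memory[page]

    return target, dirtied
```

The freeze time depends only on the number of pages in `dirtied`, not on the total size of the memory space, which is the advantage over total copy.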

Nov 242000 rinnocente 29

focus on Mosix

Nov 242000 rinnocente 30

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Nov 242000 rinnocente 31

Probabilistic info dissemination

- each node, at regular intervals, sends info on its resources to a random subset of nodes
- each node at the same time keeps a small buffer with the most recently received info
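The two rules above can be sketched as follows. The subset size (`fanout`), the buffer size (`window`) and all names are assumptions; the slides do not give the actual values used by Mosix.

```python
import random
from collections import deque


class InfoNode:
    def __init__(self, node_id, window=4):
        self.node_id = node_id
        self.load = 0.0
        # Small buffer: the oldest entries are dropped automatically.
        self.recent = deque(maxlen=window)

    def receive(self, sender_id, load):
        self.recent.append((sender_id, load))


def disseminate(nodes, fanout, rng):
    """One interval: every node sends its load to `fanout` random other nodes."""
    for node in nodes:
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(node.node_id, node.load)
```

No node ever holds a complete view of the cluster; each one decides on the basis of the few recent entries in its own buffer, which keeps the traffic per node constant as the cluster grows.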

Nov 242000 rinnocente 32

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load average metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.
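The search for a less loaded node can be sketched as below. This is not the Mosix algorithm itself: the function name and the `min_gain` margin (added here to avoid migrating for a negligible difference) are assumptions.

```python
def pick_migration_target(local_load, known_loads, min_gain=1.0):
    """Return the id of a less loaded node, or None if migrating is not worth it.

    known_loads maps node id -> most recently received load value.
    min_gain is a hypothetical margin, not a documented Mosix parameter.
    """
    if not known_loads:
        return None
    best = min(known_loads, key=known_loads.get)
    if local_load - known_loads[best] >= min_gain:
        return best
    return None
```

With a local load of 3.0 and known loads `{"n1": 2.5, "n2": 0.5}`, the sketch selects `"n2"`.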

Nov 242000 rinnocente 33

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
- Load (based on a unit of a P II-400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw memory used
- Total memory

Look at http://hpc.sissa.it/walrus/mjm for a Java applet showing these values over a Mosix cluster.

Nov 242000 rinnocente 34

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm. This algorithm tries to migrate the smallest running process to the node with the most free memory. If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
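The basic choice made by memory ushering can be sketched as follows. The function name, the units and the final check that the process fits on the target are assumptions; the further step of moving processes between remote nodes to make room is not modelled.

```python
def usher(free_mem, threshold, processes, peers_free_mem):
    """Return (pid, target_node) for one ushering step, or None.

    processes maps pid -> memory size of each running process;
    peers_free_mem maps node id -> free memory reported by that node.
    """
    if free_mem >= threshold or not processes or not peers_free_mem:
        return None  # enough memory: only load balancing is active
    pid = min(processes, key=processes.get)                # smallest process
    target = max(peers_free_mem, key=peers_free_mem.get)   # most free memory
    if peers_free_mem[target] < processes[pid]:
        return None  # assumption: do not move a process that would not fit
    return pid, target
```

For example, `usher(10, 64, {101: 300, 102: 40}, {"b": 500, "c": 900})` returns `(102, "c")`.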

Nov 242000 rinnocente 35

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:
- UDP to transfer load info
- TCP to migrate processes

These communications are not encrypted, authenticated or protected; it is therefore better to use a private network for the MNP. (Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)
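Since the load info is small and sent periodically, it fits in a single UDP datagram. The sketch below packs the info variables listed earlier (load, number of processors, processor speed, memory used, raw memory used, total memory) into a fixed-size payload. The field order, sizes and names are purely hypothetical and are not the real MNP wire format.

```python
import struct
from collections import namedtuple

LoadInfo = namedtuple(
    "LoadInfo", "load ncpus speed mem_used raw_mem_used mem_total"
)

# Hypothetical layout, network byte order:
# load (u32), ncpus (u16), speed (u16), then three u32 memory fields.
_FORMAT = "!IHHIII"


def pack_info(info):
    """Encode a LoadInfo as a 20-byte datagram payload."""
    return struct.pack(_FORMAT, *info)


def unpack_info(datagram):
    """Decode a payload produced by pack_info."""
    return LoadInfo(*struct.unpack(_FORMAT, datagram))
```

The resulting bytes could be handed to `socket.sendto` on a UDP socket. A lost datagram is harmless, because newer info arrives at the next interval, while the migration of a process needs the reliable delivery of TCP.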

Nov 242000 rinnocente 36

MOSIX process migration

- kernel level migration: it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel
- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Nov 242000 rinnocente 37

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.
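The deputy/remote split can be sketched with three small classes. This is a toy model in user space, while Mosix does this inside the kernel; the class names, the two example calls and the round-trip counter are assumptions for illustration.

```python
class Deputy:
    """Stays on the home node and executes system calls for the remote."""

    def __init__(self, hostname, files):
        self.hostname = hostname
        self.files = files  # path -> contents, as seen on the home node

    def execute(self, call, *args):
        if call == "gethostname":
            return self.hostname
        if call == "read":
            return self.files[args[0]]
        raise NotImplementedError(call)


class LinkLayer:
    """Stands in for the link between remote and deputy; counts round trips."""

    def __init__(self, deputy):
        self.deputy = deputy
        self.round_trips = 0

    def forward(self, call, *args):
        self.round_trips += 1
        return self.deputy.execute(call, *args)


class Remote:
    """The migrated process: its system calls go through the link layer."""

    def __init__(self, link):
        self.link = link

    def syscall(self, call, *args):
        return self.link.forward(call, *args)
```

The remote still sees the hostname and the files of its home node, which is the transparency, and every call costs one network round trip, which is the price in efficiency.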

Nov 242000 rinnocente 38

MOSIX deputy/remote mechanism

[Figure: the Deputy and its link layer on the UHN (Unique Home Node), the Remote and its link layer on the running node, the two link layers connected by MNP (Mosix Network Protocol).]

Nov 242000 rinnocente 39

Mosix performance 1

While a CPU bound process obtains clear advantages from Mosix, being penalized very little by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problem, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck at a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Communication state migration

It is a difficult task and usually it is avoided, forcing the migrated process to redirect networking operations to the Home Node, where they are performed.
On Amoeba each thread of a process has its own communication state, which remembers the stage of the ongoing RPCs; these states are kept in the kernel.

- during migration any message received for a suspended process is rejected and a "process migrating" response is returned
- sending machine will rexmit after some delay
- sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply msgs to a migrating client
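The reject, retransmit and locate steps listed above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Node` class, the `locate` helper and the reply strings are hypothetical names invented for the illustration and are not the Amoeba or FLIP interfaces.

```python
# Toy model of rejecting messages for a migrating process and then
# relocating it. All names are hypothetical, not Amoeba/FLIP APIs.

class Node:
    def __init__(self, name):
        self.name = name
        self.processes = {}  # pid -> "running" or "migrating"

    def deliver(self, pid, msg):
        state = self.processes.get(pid)
        if state == "running":
            return "ok"
        if state == "migrating":
            return "process migrating"  # sender should retransmit later
        return "unknown process"        # sender has to locate the process


def locate(nodes, pid):
    """Stand-in for a FLIP-like lookup: find the node where pid runs."""
    for node in nodes:
        if node.processes.get(pid) == "running":
            return node
    return None


def send(nodes, target, pid, msg, max_tries=5):
    """Send msg to pid, retransmitting or relocating when rejected."""
    for _ in range(max_tries):
        if target.deliver(pid, msg) == "ok":
            return target.name
        # A real sender would wait some delay here before retransmitting.
        found = locate(nodes, pid)
        if found is not None:
            target = found
    return None
```

While the process is still marked as migrating, `send` keeps being rejected and gives up after `max_tries`; once the process is running on its new node, the next retransmission reaches it.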

Migratable sockets

Unfortunately TCP/IP has no direct support for this. Research is ongoing on different solutions. A proposed user-level implementation requires:
- substitution of the socket library
- relinking of programs with the new library
(Bubak, Zbik et al. 1999)
In this proposal communication is done between virtual addresses (there are servers that keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" msg. The other endpoint then should consult the server for the new real address to contact.
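The virtual address idea can be sketched in a few lines. The example below is an illustration only: `AddressServer`, `Endpoint`, the address strings and the dictionary standing in for the network are all assumptions made for the sketch, not the interface of the proposal cited above.

```python
# Sketch of communication between virtual addresses with a lookup server.
# Every name here is hypothetical and only illustrates the idea.

class AddressServer:
    """Keeps the correspondence table: virtual address -> real address."""
    def __init__(self):
        self.table = {}

    def register(self, vaddr, real):
        self.table[vaddr] = real

    def lookup(self, vaddr):
        return self.table[vaddr]


class Endpoint:
    """A peer that talks to virtual addresses and caches real ones."""
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def send(self, vaddr, payload, network):
        # network: dict mapping a real address to a handler function
        real = self.cache.get(vaddr) or self.server.lookup(vaddr)
        reply = network[real](payload)
        if reply == "process migrating":
            # The stub answered: ask the server for the new real address.
            real = self.server.lookup(vaddr)
            reply = network[real](payload)
        self.cache[vaddr] = real
        return real, reply
```

The peer only pays for a server lookup on the first contact and after a migration; in between it uses the cached real address.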

Memory space migration

- total copy (Charlotte, LOCUS): process is frozen and the entire memory space is moved; the process is then restarted on remote (this can require many seconds and further it can transfer memory that will not be used)

- pre-copying (V System, Theimer): allows the process to continue the execution while its memory space is being transferred to the target. Then V freezes the process on the source and re-copies the pages modified during the transfer time

- lazy-copying or demand page (Accent, Zayas): pages are left on the source machine and are transferred only at page faults. This mechanism can leave dirty pages on all the nodes the process goes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages and then to transfer from the backing store the other pages when referred

- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed on the network file system and on the target the process is restarted (it's just a swap-out/swap-in)
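Of the four strategies above, pre-copying is the least obvious, so here is a minimal simulation of it. The function name, the page dictionaries and the `writes_during_round` parameter are assumptions made for this sketch; with a single round it follows the description above (copy everything while running, then freeze and re-copy the modified pages), and more rounds are a generalization.

```python
def precopy_migrate(source_pages, writes_during_round):
    """
    Simulate pre-copy memory migration (hypothetical helper).

    source_pages: dict page_no -> content on the source node.
    writes_during_round: list of dicts; element i holds the pages the
        still-running process modifies while round i is being copied.
    Returns (target_pages, pages_sent_per_round, pages_sent_while_frozen).
    """
    target = {}
    to_send = set(source_pages)        # first round: every page
    sent_per_round = []
    for writes in writes_during_round:
        for page in to_send:
            target[page] = source_pages[page]
        sent_per_round.append(len(to_send))
        # The process keeps running and dirties some pages meanwhile.
        source_pages.update(writes)
        to_send = set(writes)
    # Freeze: the process is stopped, so this last dirty set is final.
    for page in to_send:
        target[page] = source_pages[page]
    return target, sent_per_round, len(to_send)
```

The point of the technique shows in the last return value: the process is frozen only while the pages dirtied during the transfer are re-copied, not while the whole address space moves.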


focus on Mosix

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node at regular intervals sends info on its resources to a random subset of nodes

- each node at the same time keeps a small buffer with the most recently received info
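The two rules above amount to a simple gossip scheme, which can be sketched as follows. The class name, the `fanout` and `window` parameters and the use of a single load number per node are assumptions made for the illustration; the actual Mosix interval, subset size and buffer size are not given here.

```python
import random
from collections import OrderedDict


class GossipNode:
    """Toy node: sends its load to random peers, keeps a small buffer."""

    def __init__(self, name, load, window=3):
        self.name = name
        self.load = load
        self.window = window          # size of the buffer (assumed value)
        self.info = OrderedDict()     # sender name -> load, oldest first

    def receive(self, sender, load):
        self.info.pop(sender, None)   # refresh an existing entry
        self.info[sender] = load
        while len(self.info) > self.window:
            self.info.popitem(last=False)  # drop the oldest entry

    def tick(self, peers, fanout, rng):
        """One regular interval: send own load to a random subset."""
        others = [p for p in peers if p is not self]
        for peer in rng.sample(others, min(fanout, len(others))):
            peer.receive(self.name, self.load)
```

Each node ends up with a partial and slightly stale view of the cluster, which is the price paid for not needing any central collector.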

MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.
When a node is low on memory a different algorithm prevails: memory ushering.
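The decision each node takes can be sketched as below. This is a simplified illustration, not the Mosix algorithm itself: `pick_target`, the `margin` parameter and the plain dictionary of known loads are assumptions introduced for the example.

```python
def pick_target(my_load, known_loads, margin=0.0):
    """
    Return the least loaded known node if it is less loaded than this
    node by more than `margin`, otherwise None (hypothetical helper).

    known_loads: dict node name -> last load value received from it.
    """
    if not known_loads:
        return None
    best = min(known_loads, key=known_loads.get)
    if known_loads[best] + margin < my_load:
        return best
    return None
```

A margin greater than zero is one simple way to avoid migrating a process for a negligible gain.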

Mosix info variables

These are the variables exchanged between Mosix nodes to feed the distributed load balancing algorithm:
- Load (based on a unit of a P II-400MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw Memory Used
- Total Memory
Look at httphpcsissaitwalrusmjm for a java applet showing these values over a Mosix cluster.
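To make the list concrete, the six variables can be grouped in a record and serialized for sending. The field names, the units and the wire format (six unsigned 32-bit integers in network byte order) are assumptions chosen for this sketch; the real Mosix message layout is not described here.

```python
import struct
from dataclasses import dataclass, astuple


@dataclass
class NodeInfo:
    """The six variables listed above; names and units are assumed."""
    load: int
    ncpus: int
    speed: int
    mem_used: int
    raw_mem_used: int
    total_mem: int


# Hypothetical wire format: six unsigned 32-bit ints, network byte order.
_FMT = "!6I"


def pack(info):
    """Serialize a NodeInfo into a 24-byte message."""
    return struct.pack(_FMT, *astuple(info))


def unpack(data):
    """Rebuild a NodeInfo from a message produced by pack()."""
    return NodeInfo(*struct.unpack(_FMT, data))
```

A fixed-size record this small fits easily in a single datagram.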

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.
This algorithm tries to migrate the smallest running process to the node with the most free memory.
If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
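The basic choice made by memory ushering can be sketched as follows. The function name, its parameters and the memory units are assumptions made for the illustration, and the sketch deliberately leaves out the last step described above (moving processes between remote nodes to make room).

```python
def usher(free_mem, threshold, processes, known_free):
    """
    Pick (pid, node) for a memory-driven migration, or None.

    free_mem: free memory on this node.
    threshold: below this value the node is considered to be paging.
    processes: dict pid -> memory size of each running process.
    known_free: dict node name -> free memory last reported by it.
    All names are hypothetical; sizes are in arbitrary units.
    """
    if free_mem >= threshold or not processes or not known_free:
        return None
    pid = min(processes, key=processes.get)      # smallest process
    node = max(known_free, key=known_free.get)   # most free memory
    if known_free[node] < processes[pid]:
        return None  # nothing fits; space would have to be created
    return pid, node
```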

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

- UDP to txfer load info
- TCP to migrate processes

These communications are not encrypted, authenticated or protected, therefore it's better to use a private network for the MNP.
(Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat)

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)

- fully transparent: the process involved doesn't need to know about the migration

- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency at the price of some efficiency, interposing to system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls shadow).

The migrated process is called the remote.
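The split between deputy and remote can be pictured with a small model in which computation stays local and every system call is forwarded. The classes, the two example calls and the `forwarded` counter are hypothetical and only illustrate the redirection; they are not the Mosix kernel interfaces.

```python
class Deputy:
    """Stays on the home node and executes system calls for the remote."""

    def __init__(self, home_files):
        self.files = home_files   # path -> content, as seen at home

    def syscall(self, name, *args):
        if name == "read":
            return self.files[args[0]]
        if name == "gethostname":
            return "home"         # the process still appears to run here
        raise NotImplementedError(name)


class Remote:
    """The migrated part: computes locally, forwards system calls."""

    def __init__(self, deputy):
        self.deputy = deputy
        self.forwarded = 0

    def compute(self, n):
        return sum(range(n))      # pure CPU work, no forwarding needed

    def syscall(self, name, *args):
        self.forwarded += 1       # each call costs a network round trip
        return self.deputy.syscall(name, *args)
```

The counter makes the cost model visible: CPU work costs nothing extra after migration, while every system call adds a round trip to the home node.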

MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its own link layer, connected through the MNP (Mosix Network Protocol).]

Mosix performance/1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottle-neck single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball).

Mosix performance/2

A more general solution to the I/O bound processes performance problem under migration is anyway envisioned to be the use of a cache coherent global file system:
- cache coherent: clients keep their caches synchronized, or only 1 cache at the server node is kept
- global: every file can be seen from every node with the same name
If such a file system can be used, then I/O operations can be executed directly from the running node without resorting to redirect them to the Unique Home Node. Mosix recently has incorporated the possibility to declare that a file system is a Direct File System Access. For such file systems Mosix will perform I/O operations directly on the running node.
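The effect of declaring a file system as DFSA can be summarized as a routing decision for each I/O operation. The sketch below is an assumption-laden illustration: `io_site`, the list of mount points and the example paths are hypothetical, and the real decision is taken inside the kernel.

```python
def io_site(path, dfsa_mounts):
    """
    Return where an I/O operation on `path` would be executed.

    dfsa_mounts: list of mount points declared as DFSA (hypothetical).
    """
    for mount in dfsa_mounts:
        base = mount.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return "running node"   # direct access, no redirection
    return "home node"              # redirected to the Unique Home Node
```

For example, with `["/mfs"]` declared as DFSA, a file under `/mfs` is handled on the running node while a file under `/home` is still redirected.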

MOSIX dfsa (direct file system access)

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each with its own link layer, connected through the MNP (Mosix Network Protocol), with an additional DFSA access path.]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level:
- identify, compare, reconstruct
A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSA are:
- GFS (Global File System)
- xFS (Serverless File System)
The integration of Mosix with GFS now seems stable.


Page 28: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 28

Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space

is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))

prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time

lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred

shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)

Nov 242000 rinnocente 29

focus on Mosix

Nov 242000 rinnocente 30

MOSIX important points

Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications

Nov 242000 rinnocente 31

Probabilistic info dissemination

each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes

each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received

Nov 242000 rinnocente 32

MOSIX load balancing algorithm

Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that

each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering

Nov 242000 rinnocente 33

Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (the node starts paging), the memory ushering algorithm is started and prevails over the load balancing algorithm.

This algorithm tries to migrate the smallest running process to the node with the most free memory. If necessary, processes are migrated between remote nodes to create sufficient space on a single node.
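
The selection rule described above can be sketched as follows. The function `usher`, its arguments and the units are hypothetical, and the last step of the slide (moving processes between remote nodes to make room) is deliberately not modeled.

```python
def usher(processes, free_memory, threshold, local_free):
    """Return (pid, target node) for one migration, or None.

    processes:   pid -> memory size of each running local process
    free_memory: node id -> last known free memory of the other nodes
    threshold:   free-memory level under which ushering starts
    local_free:  free memory currently available on this node
    """
    if local_free >= threshold or not processes or not free_memory:
        return None  # enough memory, or nothing to move or nowhere to go
    pid = min(processes, key=processes.get)          # smallest process
    target = max(free_memory, key=free_memory.get)   # most free memory
    if free_memory[target] < processes[pid]:
        return None  # no single node can host it; not handled in this sketch
    return pid, target


if __name__ == "__main__":
    print(usher({10: 50, 11: 20}, {"a": 30, "b": 100},
                threshold=16, local_free=8))
```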

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

  • UDP to transfer load info
  • TCP to migrate processes

These communications are not encrypted, authenticated or protected; therefore it's better to use a private network for the MNP (Mosix Network Protocol). (Unfortunately the sockets used by Mosix are not reported by standard tools, e.g. netstat.)
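
The split between the two transports can be illustrated with the standard `socket` and `struct` modules. The datagram layout `LOAD_FORMAT` and all function names are invented for this sketch; they are not the real MNP wire format, which the slides do not describe.

```python
import socket
import struct

# Hypothetical layout (NOT the real MNP format), network byte order:
# node id (uint16), load (float32), free memory in KB (uint32).
LOAD_FORMAT = "!HfI"


def encode_load(node_id, load, free_kb):
    """Pack one load report into a fixed-size datagram payload."""
    return struct.pack(LOAD_FORMAT, node_id, load, free_kb)


def decode_load(datagram):
    """Unpack a payload produced by encode_load."""
    return struct.unpack(LOAD_FORMAT, datagram)


def send_load(addr, node_id, load, free_kb):
    # Load reports are small and sent periodically, so this sketch uses a
    # single UDP datagram and does not retry (assumption: a lost report
    # is replaced by the next one).
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_load(node_id, load, free_kb), addr)


def send_process_image(addr, image):
    # A process image must arrive complete and in order, hence TCP.
    with socket.create_connection(addr) as conn:
        conn.sendall(image)


if __name__ == "__main__":
    payload = encode_load(3, 0.5, 2048)
    print(len(payload), decode_load(payload))
```

Note that nothing in this sketch authenticates the sender, which mirrors the warning above about keeping such traffic on a private network.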

MOSIX process migration

  • It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
  • Fully transparent: the process involved doesn't have any need to know about the migration
  • Strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (the Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.
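
A toy model of this redirection, written in user space purely for illustration (Mosix does this inside the kernel). The classes `Deputy`, `LinkLayer` and `Remote`, and the two pseudo system calls they handle, are hypothetical.

```python
class Deputy:
    """Skeleton left on the home node; executes calls for the remote."""

    def __init__(self, home_node):
        self.home_node = home_node
        self.output = []  # stands in for I/O performed on the home node

    def execute(self, name, *args):
        if name == "gethostname":
            return self.home_node  # the process still "sees" its home node
        if name == "write":
            self.output.append(args[0])
            return len(args[0])
        raise NotImplementedError(name)


class LinkLayer:
    """Stand-in for the network link between remote and deputy."""

    def __init__(self, deputy):
        self.deputy = deputy
        self.forwarded = 0  # each forwarded call costs a network round trip

    def call(self, name, *args):
        self.forwarded += 1
        return self.deputy.execute(name, *args)


class Remote:
    """The migrated process: computes locally, forwards system calls."""

    def __init__(self, running_node, link):
        self.running_node = running_node
        self.link = link

    def syscall(self, name, *args):
        return self.link.call(name, *args)


if __name__ == "__main__":
    deputy = Deputy("home")
    link = LinkLayer(deputy)
    remote = Remote("node7", link)
    print(remote.syscall("gethostname"), remote.syscall("write", "hello"))
    print(link.forwarded, deputy.output)
```

The counter `forwarded` makes the cost visible: a process that issues many system calls pays one round trip for each, which is the efficiency price mentioned above.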

MOSIX deputy/remote mechanism

[Diagram labels: Deputy, Link layer, Remote, Link layer, MNP (Mosix Network Protocol), UHN (Unique Home Node), Running node]

Mosix performance 1

While a CPU-bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O-bound and network-bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problem, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck: a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O-bound processes under migration is envisioned to be the use of a cache coherent global file system:

  • cache coherent: clients keep their caches synchronized, or only one cache, at the server node, is kept
  • global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node, without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system supports Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
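
The effect of declaring a file system as DFSA can be sketched as a routing decision per file path. The class `IoRouter` and the mount point `/mfs` used in the example are assumptions for illustration; how Mosix actually records the declaration is not covered by the slides.

```python
class IoRouter:
    """Decides where an I/O operation of a migrated process is executed."""

    def __init__(self, dfsa_mounts):
        # Mount points declared as DFSA (hypothetical configuration).
        self.dfsa_mounts = tuple(dfsa_mounts)

    def where(self, path):
        for mount in self.dfsa_mounts:
            prefix = mount.rstrip("/") + "/"
            if path == mount or path.startswith(prefix):
                return "running node"  # executed directly, no redirection
        return "home node"             # redirected through the deputy


if __name__ == "__main__":
    router = IoRouter(["/mfs"])
    for p in ("/mfs/data/input.dat", "/home/user/output.log"):
        print(p, "->", router.where(p))
```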

MOSIX DFSA (direct file system access)

[Diagram labels: Deputy, Link layer, Remote, Link layer, MNP (Mosix Network Protocol), DFSA access, UHN (Unique Home Node), Running node]

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in standard Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.

A prototype DFSA, which is just a small software layer over the standard Linux file system, is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:

  • GFS (Global File System)
  • xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al. 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 19901995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX DFSA (direct file system access)
  • Cache coherent global distributed file systems
Page 31: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 31

Probabilistic info dissemination

each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes

each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received

Nov 242000 rinnocente 32

MOSIX load balancing algorithm

Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that

each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering

Nov 242000 rinnocente 33

Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster

Nov 242000 rinnocente 34

Memory ushering algorithm

Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes

to create sufficient space on a single nodeto create sufficient space on a single node

Nov 242000 rinnocente 35

MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses

UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes

These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by

standard tools eg netstat)standard tools eg netstat)

Nov 242000 rinnocente 36

MOSIX process migration

- It's a kernel level migration (it's done without the necessity of any modification to the programs, because Mosix changes the Linux kernel)
- fully transparent: the process involved doesn't have any need to know about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (and particularly I/O and networking)

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (Unique Home Node).

The skeleton of the process that remains on the originating node to execute the system calls, receive signals and perform I/O is called in Mosix the deputy (what Condor calls the shadow).

The migrated process is called the remote.
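The division of labour between deputy and remote can be pictured with a small user-space model. It is only a sketch of the idea: real Mosix does this inside the kernel, and the class names, the JSON message format and the two example calls are assumptions made for illustration.

```python
import json


class Deputy:
    """Stays on the home node and executes system calls for the remote."""

    def __init__(self, home_node, files):
        self.home_node = home_node
        self.files = files  # path -> contents, as seen on the home node

    def handle(self, request: bytes) -> bytes:
        call = json.loads(request)
        if call["name"] == "gethostname":
            result = self.home_node
        elif call["name"] == "read":
            result = self.files[call["args"][0]]
        else:
            result = None
        return json.dumps(result).encode()


class LinkLayer:
    """Stands in for the two link layers; a real one would cross the network."""

    def __init__(self, deputy):
        self.deputy = deputy
        self.forwarded = 0  # number of calls redirected to the home node

    def syscall(self, name, *args):
        self.forwarded += 1
        request = json.dumps({"name": name, "args": list(args)}).encode()
        return json.loads(self.deputy.handle(request))


class Remote:
    """The migrated process: computes locally, forwards system calls."""

    def __init__(self, running_node, link):
        self.running_node = running_node
        self.link = link

    def gethostname(self):
        return self.link.syscall("gethostname")

    def read(self, path):
        return self.link.syscall("read", path)

    def compute(self, n):
        # Pure computation needs no system call and never leaves this node.
        return sum(i * i for i in range(n))
```

In this model a remote running on another node still reports the home node's name and reads the home node's files, which is the transparency the slide describes; the `forwarded` counter makes the cost visible, since every system call is a round trip.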

MOSIX deputy/remote mechanism

[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each on top of a link layer; the two link layers communicate through the MNP (Mosix Network Protocol).]

Mosix performance 1

While a CPU bound process obtains clear advantages from Mosix, being very little penalized by process migration, I/O bound and network bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problems, has been the development of a daemon to support remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the creation of a bottleneck: a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

Mosix performance 2

A more general solution to the performance problem of I/O bound processes under migration is anyway envisioned to be the use of a cache coherent global file system:

- cache coherent: clients keep their caches synchronized, or only 1 cache at the server node is kept
- global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node without resorting to redirecting them to the Unique Home Node. Mosix recently has incorporated the possibility to declare that a file system is a Direct File System Access (DFSA). For such file systems Mosix will perform I/O operations directly on the running node.
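The resulting routing decision can be sketched as a simple path test: I/O on a file that lives under a file system declared as DFSA stays on the running node, everything else goes back to the home node. The function name, the mount-point list and the example paths are hypothetical; how Mosix actually records the declaration is not described in these slides.

```python
import posixpath


def io_node(path, dfsa_mounts, home_node, running_node):
    """Return the node on which an I/O operation on `path` is executed."""
    path = posixpath.normpath(path)
    for mount in dfsa_mounts:
        mount = posixpath.normpath(mount)
        # Match the mount point itself or anything below it, but not a
        # sibling directory that merely shares the same name prefix.
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            return running_node
    # Not on a DFSA file system: redirect to the Unique Home Node.
    return home_node
```

For example, with `dfsa_mounts=["/mfs"]` a migrated process reading `/mfs/data/a.dat` would do so on its running node, while a read of `/home/u/a` would still be redirected to the home node.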

MOSIX dfsa (direct file system access)

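[Figure: the deputy on the UHN (Unique Home Node) and the remote on the running node, each on top of a link layer, connected through the MNP (Mosix Network Protocol); an additional path is labelled DFSA access.]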

Cache coherent global distributed file systems

To be a MOSIX DFSA it's not sufficient to be a cache-coherent global file system. In fact some additional functions (not available in std Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.

A prototype DFSA that is just a small software layer over the std Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSA are:

- GFS (Global File System)
- xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

  • Automatic load balancing and transparent process migration
  • Introduction
  • Scheme of the presentation
  • Overview of some distributed OS
  • Distributed OS
  • Sprite (Douglis, Ousterhout 1987)
  • Charlotte (Artsy, Finkel 1989)
  • Accent (Zayas 1987)
  • Amoeba (Tanenbaum et al 1990)
  • V system (Theimer 1987)
  • MOSIX (Barak 1990-1995)
  • Condor (Litzkow, Livny 1988)
  • Load balancing mechanisms
  • Load balancing methods taxonomy
  • Load balancing algorithms
  • Non pre-emptive load balancing (aka job placement)
  • Pre-emptive load balancing
  • Processor load measures
  • Process migration mechanisms
  • User/kernel level migration
  • Process migration factors
  • Transparent
  • System call interposing
  • Process migration parts
  • Execution state migration
  • Communication state migration
  • Migratable sockets
  • Memory space migration
  • focus on Mosix
  • MOSIX important points
  • Probabilistic info dissemination
  • MOSIX load balancing algorithm
  • Mosix info variables
  • Memory ushering algorithm
  • MOSIX network protocols
  • MOSIX process migration
  • Mosix migration transparency
  • MOSIX deputy/remote mechanism
  • Mosix performance 1
  • Mosix performance 2
  • MOSIX dfsa (direct file system access)
  • Cache coherent global distributed file systems
MOSIX load balancing algorithm

Mosix uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load avg metric. This means that each node will try to find a less loaded node to which it would try to migrate one of its processes.

When a node is low on memory a different algorithm prevails: memory ushering.
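The per-node decision described on this slide can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names, the dictionary of known loads and the fixed load contribution of one process are hypothetical, and the real algorithm works on partial, probabilistically disseminated information rather than on a complete table.

```python
def pick_target(local_load, known_loads):
    """Return the least loaded known node if it is less loaded than us."""
    if not known_loads:
        return None
    node = min(known_loads, key=known_loads.get)
    return node if known_loads[node] < local_load else None


def balance_step(loads, node, process_load=1.0):
    """Let `node` try to migrate one process; return the target or None.

    `loads` maps node name to load avg and is updated in place.
    `process_load` is the assumed load contribution of one process.
    """
    others = {n: load for n, load in loads.items() if n != node}
    target = pick_target(loads[node], others)
    if target is None:
        return None
    loads[node] -= process_load
    loads[target] += process_load
    return target
```

Because every node runs the same step on its own, no central scheduler is needed. A production algorithm would also need some damping (for instance a minimum load difference before migrating) to avoid processes bouncing between nodes; that is left out here.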

MOSIX network protocols

These protocols are used to exchange load information among the nodes and to negotiate and carry out process migrations between nodes. Mosix uses:

  • UDP to transfer load information
  • TCP to migrate processes

These communications are not encrypted, authenticated, or otherwise protected, so it is better to use a private network for the MNP (Mosix Network Protocol). Unfortunately, the sockets used by Mosix are not reported by standard tools such as netstat.
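
As a rough illustration of the small, periodic load messages that travel over UDP, the sketch below packs and unpacks one load record. The field layout and the names `LoadInfo`, `encode_load`, `decode_load` and `send_load` are hypothetical; the slides do not describe the actual MNP wire format.

```python
import socket
import struct
from dataclasses import dataclass

# Hypothetical wire format, NOT the real MNP layout:
# node id (uint16), load (uint32), free memory in KiB (uint32), network byte order.
_LOAD_FMT = "!HII"


@dataclass(frozen=True)
class LoadInfo:
    node_id: int
    load: int
    free_mem_kib: int


def encode_load(info: LoadInfo) -> bytes:
    """Serialize a load record into a fixed-size datagram payload."""
    return struct.pack(_LOAD_FMT, info.node_id, info.load, info.free_mem_kib)


def decode_load(payload: bytes) -> LoadInfo:
    """Parse a datagram payload back into a load record."""
    return LoadInfo(*struct.unpack(_LOAD_FMT, payload))


def send_load(info: LoadInfo, host: str, port: int) -> None:
    """Fire-and-forget send: UDP gives no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_load(info), (host, port))
```

A datagram suits this kind of traffic because each record is self-contained, while the process image moved during a migration needs the reliable, ordered stream that TCP provides.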

MOSIX process migration

  • Kernel-level migration: it requires no modification to the programs, because Mosix changes the Linux kernel.
  • Fully transparent: the process involved does not need to know about the migration.
  • Strong residual dependency: the migrated process depends completely on the deputy for system calls (particularly I/O and networking).

Mosix migration transparency

Mosix achieves complete transparency, at the price of some efficiency, by interposing on system calls a link layer that redirects their execution to the originating node of the process (the Unique Home Node).

The skeleton of the process that remains on the originating node to execute system calls, receive signals, and perform I/O is called the deputy in Mosix (what Condor calls the shadow).

The migrated process is called the remote.
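
The deputy/remote split can be sketched in a few lines. This is only an in-process model under stated assumptions: the classes `Deputy`, `LinkLayer` and `Remote`, the call table, and the node names are all hypothetical, and the real mechanism lives inside the kernel and crosses the network.

```python
from typing import Any, Callable, Dict


class Deputy:
    """Stays on the home node and executes system calls for the remote."""

    def __init__(self, home_node: str) -> None:
        self.home_node = home_node
        # Hypothetical table of site-dependent calls served by the deputy.
        self._calls: Dict[str, Callable[..., Any]] = {
            "gethostname": lambda: self.home_node,
        }

    def execute(self, name: str, *args: Any) -> Any:
        return self._calls[name](*args)


class LinkLayer:
    """Stands in for the network hop between remote and deputy."""

    def __init__(self, deputy: Deputy) -> None:
        self._deputy = deputy
        self.forwarded = 0  # each redirected call is part of the efficiency cost

    def forward(self, name: str, *args: Any) -> Any:
        self.forwarded += 1
        return self._deputy.execute(name, *args)


class Remote:
    """The migrated process: computes locally, redirects system calls."""

    def __init__(self, running_node: str, link: LinkLayer) -> None:
        self.running_node = running_node
        self._link = link

    def syscall(self, name: str, *args: Any) -> Any:
        return self._link.forward(name, *args)


link = LinkLayer(Deputy(home_node="node1"))
remote = Remote(running_node="node7", link=link)
print(remote.syscall("gethostname"))  # answered by the deputy on the home node
```

Although the process runs on `node7`, the call is answered from `node1`, which is why the process never needs to know it has moved.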

MOSIX deputy/remote mechanism

[Figure: the deputy and its link layer on the UHN (Unique Home Node), the remote and its link layer on the running node; the two link layers communicate through the MNP (Mosix Network Protocol).]

Mosix performance (1)

While a CPU-bound process obtains clear advantages from Mosix, being penalized very little by process migration, I/O-bound and network-bound processes suffer big drawbacks and are usually not good candidates for migration.

A first solution, addressing only part of the problem, has been the development of a daemon to support remote initiation of processes (MExec/MPmake). With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the bottleneck of a single Home Node where all the I/O would be performed. Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the instructions in the MExec tarball).
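
A toy cost model makes the contrast between CPU-bound and I/O-bound processes concrete. It is an illustrative assumption, not the Mosix algorithm: every system call of a migrated process is charged one network round trip to the home node, and all the numbers below are made up.

```python
def migrated_runtime(cpu_seconds: float, syscalls: int,
                     round_trip_seconds: float, migration_seconds: float) -> float:
    """Estimated wall time after moving to an otherwise idle node."""
    return migration_seconds + cpu_seconds + syscalls * round_trip_seconds


def worth_migrating(cpu_seconds: float, syscalls: int, round_trip_seconds: float,
                    migration_seconds: float, home_slowdown: float) -> bool:
    """Migrate only if the estimate beats staying on the loaded home node."""
    stay = cpu_seconds * home_slowdown  # home node is shared with other work
    move = migrated_runtime(cpu_seconds, syscalls,
                            round_trip_seconds, migration_seconds)
    return move < stay


# Hypothetical workloads: a number cruncher and a system-call-heavy job.
cpu_bound = worth_migrating(100.0, 1_000, 0.0002, 1.0, 2.0)
io_bound = worth_migrating(5.0, 1_000_000, 0.0002, 1.0, 2.0)
print(cpu_bound, io_bound)
```

Under these assumed figures the CPU-bound job gains from migration, while the redirected system calls of the I/O-bound job outweigh any gain.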

Mosix performance (2)

A more general solution to the performance problem of I/O-bound processes under migration is envisioned to be the use of a cache-coherent global file system:

  • cache coherent: clients keep their caches synchronized, or only one cache, at the server node, is kept
  • global: every file can be seen from every node with the same name

If such a file system can be used, then I/O operations can be executed directly from the running node, without redirecting them to the Unique Home Node. Mosix has recently incorporated the possibility to declare that a file system is a Direct File System Access (DFSA) file system. For such file systems, Mosix will perform I/O operations directly on the running node.
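
The effect of declaring a file system as DFSA can be sketched as a routing decision. The function `io_node`, the mount point `/mfs` and the node names are hypothetical and chosen only for illustration; the real check is made inside the kernel.

```python
from pathlib import PurePosixPath
from typing import Iterable


def io_node(path: str, dfsa_mounts: Iterable[str],
            home_node: str, running_node: str) -> str:
    """Return the node where an I/O call on `path` would execute."""
    target = PurePosixPath(path)
    for mount in dfsa_mounts:
        mount_point = PurePosixPath(mount)
        if target == mount_point or mount_point in target.parents:
            return running_node  # direct file system access
    return home_node  # redirected to the deputy on the Unique Home Node


print(io_node("/mfs/data/in.dat", ["/mfs"], "node1", "node7"))
print(io_node("/home/user/out.log", ["/mfs"], "node1", "node7"))
```

Only files under a declared DFSA mount avoid the round trip to the home node; everything else still goes through the deputy.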

MOSIX DFSA (direct file system access)

[Figure: the deputy and its link layer on the UHN (Unique Home Node), the remote and its link layer on the running node, connected by the MNP (Mosix Network Protocol); a DFSA access path is also shown.]

Cache coherent global distributed file systems

To be a MOSIX DFSA, it is not sufficient to be a cache-coherent global file system. Some additional functions (not available in standard Linux super-blocks) must be available at the super-block level: identify, compare, reconstruct.

A prototype DFSA that is just a small software layer over the standard Linux file system is MFS (Mosix File System), provided directly by the MOSIX team. Other possible candidate DFSAs are:

  • GFS (Global File System)
  • xFS (Serverless File System)

The integration of Mosix with GFS now seems stable.

Page 40: automatic load balancing and migration : Mosix

Nov 242000 rinnocente 40

Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1

cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node

Nov 242000 rinnocente 41

MOSIX dfsa (direct file system access)

Deputy

Link layer

Remote

Link layerMNP (MosixNetworkProtocol)

DFSA access

UHN(UniqueHome Node)

Runningnode

Nov 242000 rinnocente 42

Cache coherent global distributed file systems

To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std

Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
