TreadMarks. Presented by: Jason Robey




  • TreadMarks
    Presented by: Jason Robey

  • Cool pic from last semester

  • TreadMarks Authors (Rice University)
    - Cristiana Amza
    - Alan Cox (Keleher committee)
    - Eyal de Lara (new)
    - Sandhya Dwarkadas
    - Charlie Hu (new)
    - Pete Keleher (Ph.D. thesis author)
    - Honghui Lu (message-passing counter-examples)
    - Karthick Rajamani
    - Weimin Yu
    - Willy Zwaenepoel (Keleher committee)

  • Overview
    - Consistency Model
    - API
    - Protocols and Implementation
    - Applications and Performance
    - Results Analysis
    - Conclusion

  • What's the Problem?
    - We want to use multiple COTS processors to do our work more quickly.
    - Shared memory is closer to our normal model of programming than message passing.
    - DSM systems usually spend too many resources ensuring that bad programs will work reasonably well.
    - We should give the programmer the ability to specify coherence requirements.

  • Lazy Release Consistency (RC)
    - Eager RC acknowledges that valid parallel programs have synchronization points.
    - Processors acquire a data region, work on it, and then make it available to other processors.
    - Upon completion of work, valid copies are sent to all concerned processors.
    - Lazy RC instead waits until the data is actually accessed.

  • Ordering and Correct Programs
    - Partial ordering: hb1 (paraphrased below).
    - Maintain sequential consistency per processor.
    - Releases and acquires happen in order, so that all releases are visible to a subsequent acquire.
    - The ordering is transitive.
    - Why lazy? Updates are not made until access.
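
    For reference, the hb1 partial order named above can be paraphrased from the lazy
    release consistency work (reference [2] below); this is a restatement from memory,
    not a quotation:

    \begin{align*}
    a_1 \xrightarrow{hb1} a_2 \quad &\text{if } a_1, a_2 \text{ are on the same processor and } a_1 \text{ comes first in program order,} \\
    rel_p \xrightarrow{hb1} acq_q \quad &\text{if the acquire on processor } q \text{ obtains the lock released on processor } p, \\
    a_1 \xrightarrow{hb1} a_3 \quad &\text{if } a_1 \xrightarrow{hb1} a_2 \text{ and } a_2 \xrightarrow{hb1} a_3 \text{ (transitivity).}
    \end{align*}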

  • Ordering and Correct Programs
    - A correct program has no data races; the programmer handles synchronization.
    - Synchronization events can be used to denote releases and acquires.
    - What is required is to give the programmer a model they can make deterministic with synchronization primitives, not to guess how an update will need to transpire.

  • API
    - Setup: fixed number of processors during the run; startup and exit feel similar to MPI.
    - Synchronization: barriers and locks (acquire, release); integer-based, with a fixed number of supported locks and barriers.
    - Memory: Tmk_malloc / Tmk_free, plus Tmk_distribute (new since the paper).

  • Manual Example

    struct shared {
        int sum;
        int turn;
        int *array;
    } *shared;

    main(int argc, char **argv) {
        /* ... */
        if (Tmk_proc_id == 0) {
            shared = (struct shared *) Tmk_malloc(sizeof(*shared));
            if (shared == NULL)
                Tmk_exit(-1);
            /* share the common pointer with all procs */
            Tmk_distribute(&shared, sizeof(shared));

            shared->array = (int *) Tmk_malloc(arrayDim * sizeof(int));
            if (shared->array == NULL)
                Tmk_exit(-1);
            shared->turn = 0;
            shared->sum = 0;
        }
        /* ... */
        if (Tmk_proc_id == 0) {
            Tmk_free(shared->array);
            Tmk_free(shared);
        }
        /* ... */
    }

  • Paper Example
    - Barriers on p. 6, locks on p. 8 of the paper.
    - Excessively simplified, but shows the use of barriers and locks.
    - Barrier = wait until all processors have arrived at the same barrier before continuing.
    - Lock = make sure no other processor accesses a region protected by this lock until I release it.
    - A rough usage sketch follows below.
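
    A minimal usage sketch (not taken from the paper): each process adds a partial result
    into a shared sum under lock 0, then waits at barrier 0 before reading the total.
    The header name "Tmk.h", the helper add_partial(), and the one-field struct are
    assumptions for illustration; only the Tmk_* calls come from the TreadMarks API.

    #include "Tmk.h"                      /* assumed TreadMarks header name */

    extern struct shared { int sum; } *shared;   /* allocated as in the manual example */

    void add_partial(int my_partial)
    {
        Tmk_lock_acquire(0);              /* acquire: pull in other processors' updates */
        shared->sum += my_partial;
        Tmk_lock_release(0);              /* release: make the update available to others */

        Tmk_barrier(0);                   /* wait until every process has added its share */
        /* shared->sum now holds the global total on every process */
    }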

  • Protocols and Implementation
    - Do not assume specialized hardware.
    - Do not assume light-weight processes; use only one process per processor.
    - Register signal handlers for asynchronous messaging and shared-memory access.

  • Protocols and Implementation: Init
    - Create the requested number of processes on remote machines.
    - Set up full-duplex sockets between each pair of processes.
    - Register a SIGIO handler for messaging.
    - Allocate one large block of shared memory at the same virtual address on each machine and mark it non-accessible using mprotect.
    - Choose a processor in round-robin fashion to be the manager for each page of the block and for each lock and barrier.
    - Register a SEGV handler for shared-memory access.
    - A sketch of the memory setup follows below.
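
    The shared-memory part of that setup can be sketched as follows; this is not TreadMarks
    source, and the base address, region size, and function names are placeholders.

    #include <sys/mman.h>
    #include <signal.h>
    #include <string.h>
    #include <stdlib.h>

    #define SHARED_BASE ((void *)0x40000000UL)   /* placeholder fixed virtual address */
    #define SHARED_SIZE (64UL * 1024 * 1024)     /* placeholder region size */

    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        /* The memory protocol (twins, diffs, state changes) would run here;
           see the sketch under the Memory slide.  Elided. */
        (void)sig; (void)si; (void)ctx;
    }

    static void init_shared_region(void)
    {
        /* One large block at the same virtual address on every machine. */
        void *base = mmap(SHARED_BASE, SHARED_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (base == MAP_FAILED)
            abort();

        /* Mark every page non-accessible so the first touch raises SIGSEGV. */
        mprotect(base, SHARED_SIZE, PROT_NONE);

        /* Register the SEGV handler for shared-memory access faults. */
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }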

  • Protocols and Implementation: Memory (p. 20 [2])
    - Each page is in one of 4 states: UNMAPPED, READ-ONLY, READ-WRITE, INVALID.
    - On a fault for page p:

        if (p is READ-ONLY) then
            allocate twin
            change p to READ-WRITE
        else
            if (cold miss) then get copy from manager
            if (write notices) then retrieve diffs
            if (write miss) then
                allocate twin
                change p to READ-WRITE
            else
                change p to READ-ONLY
        end

    - A C-style sketch of this handler follows below.
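
    A C-style restatement of that pseudocode under assumed data structures; the page
    record, the cold-miss and write-notice flags, and the helper functions are all
    placeholders, not TreadMarks internals.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    enum page_state { UNMAPPED, READ_ONLY, READ_WRITE, INVALID };

    struct page {
        enum page_state state;
        void *addr;               /* the page within the shared region       */
        void *twin;               /* copy taken before the first local write */
        int   never_fetched;      /* cold-miss flag (assumed field)          */
        int   has_write_notices;  /* others wrote since our last acquire     */
    };

    /* hypothetical helpers standing in for the real protocol messages */
    extern void fetch_copy_from_manager(struct page *p);
    extern void fetch_and_apply_diffs(struct page *p);
    extern void set_protection(struct page *p, enum page_state s);

    static void make_twin(struct page *p)
    {
        p->twin = malloc(PAGE_SIZE);
        memcpy(p->twin, p->addr, PAGE_SIZE);  /* remember pre-write contents */
    }

    void on_page_fault(struct page *p, int is_write)
    {
        if (p->state == READ_ONLY) {          /* write fault on a valid page */
            make_twin(p);
            set_protection(p, READ_WRITE);
            return;
        }
        if (p->never_fetched)                 /* cold miss: first access ever */
            fetch_copy_from_manager(p);
        if (p->has_write_notices)             /* stale: retrieve the diffs */
            fetch_and_apply_diffs(p);
        if (is_write) {                       /* write miss */
            make_twin(p);
            set_protection(p, READ_WRITE);
        } else {
            set_protection(p, READ_ONLY);
        }
    }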

  • Protocols and Implementation: Locks
    - Lock = acquire, unlock = release.
    - Each lock has local and held flags.
    - On a local lock request, set the held flag if the lock is not already held.
    - Otherwise, request the lock from its manager.
    - The manager keeps the flag status and, if the lock is held, a pointer to the current owner.
    - A sketch of the acquire path follows below.
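
    A sketch of that acquire decision; the structure, fields, and the messaging helper
    are placeholders, not TreadMarks internals.

    struct tmk_lock {
        int local;                /* the lock token currently resides on this processor */
        int held;                 /* the lock is currently held */
        int manager;              /* processor id that manages this lock */
    };

    /* hypothetical messaging helper */
    extern void request_lock_from_manager(int lock_id, int manager_proc);

    void lock_acquire_sketch(struct tmk_lock *lk, int lock_id)
    {
        if (lk->local && !lk->held) {
            lk->held = 1;                     /* fast path: token already local and free */
            return;
        }
        /* Otherwise ask the manager, which keeps the held flag and a pointer to the
           current owner and forwards the request; block until the lock arrives. */
        request_lock_from_manager(lock_id, lk->manager);
        lk->local = 1;
        lk->held  = 1;
    }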

  • Protocols and Implementation: Barriers
    - Arrival = an acquire for the manager, a release for the workers.
    - Exit = a release for the manager, an acquire for the workers.
    - Centralized barrier scheme: the manager listens for processors reaching the barrier and sends the release when all are present.
    - A sketch follows below.
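
    A sketch of that centralized scheme; the message helpers are placeholders, and
    Tmk_nprocs is assumed to be the exported process count from the API.

    extern int  Tmk_nprocs;                    /* number of processes (from the API) */
    extern void send_arrival_to_manager(int barrier_id);
    extern void wait_for_departure(int barrier_id);
    extern void wait_for_one_arrival(int barrier_id);   /* manager side, blocking */
    extern void broadcast_departure(int barrier_id);

    void barrier_worker(int barrier_id)
    {
        send_arrival_to_manager(barrier_id);   /* arrival acts as a release */
        wait_for_departure(barrier_id);        /* departure acts as an acquire */
    }

    void barrier_manager(int barrier_id)
    {
        int arrived = 1;                       /* count the manager itself */
        while (arrived < Tmk_nprocs) {
            wait_for_one_arrival(barrier_id);  /* block until one more worker arrives */
            arrived++;
        }
        broadcast_departure(barrier_id);       /* all present: release everyone */
    }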

  • Protocols and Implementation: Multiple Writers
    - Avoid the ping-pong effect of other VM-page-level DSM systems.
    - Maintain a diff between the current shared version and the processor's version of a page.
    - When needed, send diffs to other processors to update the shared memory region.
    - Multiple writers to the same page are allowed, which avoids false sharing.
    - If the same memory is written by two processors, that is a race condition.
    - A diff sketch follows below.
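
    A sketch of building a diff by comparing a dirty page against its twin, word by word,
    and of applying one; the offset/value encoding is an assumption for illustration.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_WORDS (4096 / sizeof(uint32_t))

    struct diff_entry { uint32_t offset; uint32_t value; };

    /* Write changed words into out[]; return how many entries were produced. */
    size_t make_diff(const uint32_t *page, const uint32_t *twin,
                     struct diff_entry *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_WORDS; i++) {
            if (page[i] != twin[i]) {          /* this word was modified locally */
                out[n].offset = (uint32_t)(i * sizeof(uint32_t));
                out[n].value  = page[i];
                n++;
            }
        }
        return n;
    }

    /* Applying a diff touches only the words the other writer changed, so two
       processes writing different parts of the same page do not conflict. */
    void apply_diff(uint32_t *page, const struct diff_entry *d, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            page[d[i].offset / sizeof(uint32_t)] = d[i].value;
    }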

  • Protocols and Implementation: Lazy Diffs
    - Diffing can be an expensive operation; the worst case is a modification to every other byte.
    - Instead of sending diffs on releases (eager) or acquires (lazy), send only invalidate messages.
    - Upon access, the SEGV handler requests the diffs; the diff is computed at that time.
    - Multiple diffs may then be taken care of with a single delayed diff.
    - Once a diff has been sent, the memory is eligible for garbage collection.
    - Typically, diffs are needed from only one processor in lock situations.
    - A sketch of the invalidation step follows below.
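
    A sketch of the invalidation step: at acquire time only the write notice is recorded
    and the page is made inaccessible; the diff itself is requested, and created by the
    writer, only when the page is actually touched.  All names are placeholders.

    #include <sys/mman.h>

    struct write_notice { int page_id; int from_proc; };

    extern void *page_address(int page_id);
    extern void remember_pending_diff(int page_id, int from_proc);

    void on_acquire_write_notice(struct write_notice wn)
    {
        /* Do not fetch anything yet; just make the next access fault. */
        mprotect(page_address(wn.page_id), 4096, PROT_NONE);
        remember_pending_diff(wn.page_id, wn.from_proc);
        /* The SEGV handler (see the Memory sketch above) later requests the diff;
           the writer computes it only then and can fold several updates into a
           single delayed diff. */
    }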

  • Protocols and Implementation: Comms (over best-effort protocols)
    - Send: trap into the kernel, interrupt the current process, send the message, and wait for the appropriate response or request; on timeout, retransmit; then restart the process.
    - Receive: interrupt the process via the SIGIO handler, perform the requested operation, send the response, and restart the process.
    - A sketch of the send-side retry loop follows below.
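
    A sketch of the send-side retry loop over a best-effort (UDP-style) socket: send,
    wait with a timeout, retransmit on timeout.  Socket setup, message formats, and the
    timeout value are placeholders.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/select.h>
    #include <sys/time.h>

    /* Returns the number of reply bytes; retries until a reply arrives. */
    ssize_t request_reply(int sock, const void *req, size_t req_len,
                          void *reply, size_t reply_len)
    {
        for (;;) {
            send(sock, req, req_len, 0);       /* best-effort send */

            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(sock, &rfds);
            struct timeval tv = { .tv_sec = 0, .tv_usec = 200000 };  /* assumed 200 ms */

            if (select(sock + 1, &rfds, NULL, NULL, &tv) > 0)
                return recv(sock, reply, reply_len, 0);   /* response arrived */
            /* timeout: fall through and retransmit */
        }
    }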

  • Applications and Performance
    - Only two major applications were ever done with this system:
        Mixed Integer Programming (MIP)
        ILINK (genetic tracing through family trees)
    - Tested from 1 to 8 processors.
    - Speedups of 4 to 7 with 8 processors.
    - Around 10 universities have purchased the system.

  • Results Analysis
    - Starting from an efficient serial solution, the amount of modification needed to arrive at efficient parallel code proved to be relatively minor.
    - That is usually only the case for systems bordering on trivially parallel, and the two major applications appear to be in this class.
    - Even on these, speedups fall off noticeably by the time we reach only 8 processors.
    - It seems a stretch to claim scalability to larger problems and clusters.

  • Results Analysis
    - With this system, some things that you typically do by hand in the message-passing paradigm happen automatically.
    - This comes at a cost (diffs and other overhead), and the message-passing version can typically be made more efficient.
    - It sounds similar to the argument about high-level programming vs. assembly programming.
    - Shared memory does seem to make some things nice.

  • Conclusion
    - This work optimized away a lot of the shared-memory problem.
    - Results are worse than one would like for as few as 8 processors.
    - Do not expect good speedup for 16, 32, or more processors.
    - Message passing may be better suited for NOWs (networks of workstations).

  • References
    [1] C. Amza et al., "TreadMarks: Shared Memory Computing on Networks of Workstations," Rice University, 1994.
    [2] P. Keleher, "Distributed Shared Memory Using Lazy Release Consistency," Ph.D. thesis, Rice University, December 1994.
    [3] TreadMarks API documentation, versions 0.9.8 and 0.10.1.
    [4] "The TreadMarks Distributed Shared Memory (DSM) System," http://www.cs.rice.edu/~willy/TreadMarks/overview.html.

  • Questions, please? No? Any general-knowledge questions? (In English, please.)