Transcript
Page 1: Revoke / Incarnation #s / Matching

• Discussion around how to reclaim context IDs (resources that are a part of message matching) after an MPI_Comm_revoke

• Basic problem: revoke is one-sided and can be called by multiple processes in the communicator
  – There is a race between calling revoke and when all correct processes update their local state to revoked
  – Need to ensure that all processes have revoked the communicator before context ID can be reused

• Scenario:
  – Communicator with correct processes A, B, and C is revoked
  – A and B free revoked communicator and create a new communicator using the old context ID
  – C calls revoke on the old communicator -- what happens at A and B?
  – OR -- C sends a message to A/B who has posted an ANY_SOURCE receive -- does it match?

• Several solutions were discussed:
  1. Incarnation number -- An additional number on each context ID that becomes a part of the matching
  2. Group guards -- Check incoming messages to ensure that the sender is in the group of the communicator
  3. Fault tolerant MPI_Comm_free/create -- Enhance create/free algorithms to quiesce context IDs before they are used
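
The first two proposals can be illustrated with a small model of the matching rule. This is only a sketch under stated assumptions, not how any MPI implementation stores its matching state: the Envelope and PostedRecv types, the matches function, the use of global process ids as ranks, and the context ID value 7 are all hypothetical and chosen for illustration. It replays the scenario in which C sends on the old communicator while A has an ANY_SOURCE receive posted on a new communicator that reuses the same context ID.

```python
from dataclasses import dataclass

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE in this model


@dataclass(frozen=True)
class Envelope:
    """Matching fields carried by an incoming message (hypothetical layout)."""
    context_id: int
    incarnation: int  # assumed extra field from proposal 1
    source: int       # global process id of the sender (an assumption)
    tag: int


@dataclass(frozen=True)
class PostedRecv:
    """A receive posted on some communicator (hypothetical layout)."""
    context_id: int
    incarnation: int
    source: int        # a process id, or ANY_SOURCE
    tag: int
    group: frozenset   # process ids that belong to the communicator


def matches(recv, msg, use_incarnation=True, use_group_guard=True):
    """Return True if msg may be delivered to recv under the chosen checks."""
    # Baseline matching: context ID, tag, and source (with wildcard).
    if recv.context_id != msg.context_id or recv.tag != msg.tag:
        return False
    if recv.source != ANY_SOURCE and recv.source != msg.source:
        return False
    # Proposal 1: the incarnation number becomes part of the match.
    if use_incarnation and recv.incarnation != msg.incarnation:
        return False
    # Proposal 2: reject senders outside the communicator's group.
    if use_group_guard and msg.source not in recv.group:
        return False
    return True


A, B, C = 0, 1, 2

# Old communicator {A, B, C}: context ID 7, incarnation 0.
# A and B free it and create {A, B}, reusing context ID 7 as incarnation 1.
recv_on_new = PostedRecv(context_id=7, incarnation=1, source=ANY_SOURCE,
                         tag=0, group=frozenset({A, B}))

# C has not yet seen the revoke and still sends on the old communicator.
stale_from_c = Envelope(context_id=7, incarnation=0, source=C, tag=0)

# A legitimate message from B on the new communicator.
fresh_from_b = Envelope(context_id=7, incarnation=1, source=B, tag=0)

if __name__ == "__main__":
    print("no checks:  ", matches(recv_on_new, stale_from_c, False, False))
    print("incarnation:", matches(recv_on_new, stale_from_c, True, False))
    print("group guard:", matches(recv_on_new, stale_from_c, False, True))
    print("fresh msg:  ", matches(recv_on_new, fresh_from_b))
```

In this model either check alone rejects the stale message from C, while the message from B on the new communicator still matches. The model does not cover the third proposal, which avoids the problem earlier by not handing out the context ID until it has been quiesced.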

Page 2: RMA Semantics

• Pavan raised a concern about the definition of RMA window memory in the context of shared memory windows

• It may be impossible to guarantee that only locations updated in the window are invalid

• Suggested weakening the semantics so that the entire window is undefined

• Requires further discussion

Page 3: Shared Memory

• What happens if a process with shared memory goes down and another process has posted messages using its shared memory?
  – Yes, this is an implementation issue, but is it possible to do anything?