Sutirtha Sanyal (Barcelona Supercomputing Center, Barcelona) Accelerating Hardware Transactional Memory (HTM) with Dynamic Filtering of Privatized Data

Sutirtha Sanyal (Barcelona Supercomputing Center, Barcelona)

Accelerating Hardware Transactional Memory (HTM) with Dynamic Filtering of Privatized Data

• A TM framework should execute transactions as efficiently as possible even when they are defined in a coarse-grained fashion. Programmers will usually define transactions this way, around a large piece of code.

• An analysis shows that many variables accessed inside a transaction are not truly shared across multiple threads. Rather, they are completely local to an individual thread.

ALGORITHM AND IMPLEMENTATIONS

• We assume the programmer has a priori knowledge that some data structures are thread-local, and we require the programmer to allocate such structures with a dual version of malloc, named local_malloc().

• To filter out stack accesses of the transaction (and of any function call made from within a transaction), we use the stack pointer and frame pointer registers.
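The slides do not show how local_malloc() is implemented. As a minimal sketch, assuming it hands out memory from a dedicated region whose pages are tagged as thread-local, the idea could be modeled as below. The `LocalHeap` class, the `is_local` helper, and the 8 KB page size are assumptions made for illustration only.

```python
PAGE_SIZE = 8192  # assumed page size in bytes; not stated in the slides


class LocalHeap:
    """Hypothetical model of local_malloc(): a bump allocator over pages tagged thread-local."""

    def __init__(self, base):
        self.next_free = base      # next unallocated address in the local region
        self.local_pages = set()   # page numbers marked as thread-local

    def local_malloc(self, nbytes):
        """Allocate nbytes and tag every page the allocation touches as local."""
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        addr = self.next_free
        self.next_free += nbytes
        first_page = addr // PAGE_SIZE
        last_page = (self.next_free - 1) // PAGE_SIZE
        self.local_pages.update(range(first_page, last_page + 1))
        return addr

    def is_local(self, addr):
        """Page-granularity lookup: True if addr lies on a page tagged as local."""
        return addr // PAGE_SIZE in self.local_pages
```

Keeping local allocations in their own region means a single per-page flag is enough to classify an access, at the cost of never mixing shared and local data on one page.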

IMPLEMENTATION OF HTM

• The implemented HTM is modeled after TCC (ISCA 2004) in M5 (a full-system simulator from the University of Michigan, Ann Arbor).

• It belongs to the Lazy-Lazy class of TM, where conflict detection and global memory updates occur at commit time.

• Aborts are cheap; commits are expensive.

IMPLEMENTATION OF HTM (cont.)

• Each cache line is modified to track the read set and write set of a transaction.

• Each individual thread is identified by its unique Process Control Block Base register value (this is Alpha-specific).

• The cache coherence protocol is modified to allow multiple updated copies of the same cache line.

• At each store, the address and value are inserted into a queue called the commit queue. However, if the SL bit of that word is set, the store is not included (explained later).

• During commit, each store in the queue is replayed.
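The commit queue described above can be sketched in software as follows. This is only an illustrative model, not the simulator code: the `TxStoreBuffer` class is hypothetical, the `is_local` predicate stands in for the per-word SL bit, and a plain dictionary stands in for non-speculative memory.

```python
class TxStoreBuffer:
    """Hypothetical model of one transaction's speculative stores and commit queue."""

    def __init__(self, memory, is_local):
        self.memory = memory       # dict addr -> value: non-speculative memory
        self.is_local = is_local   # predicate standing in for the SL bit of a word
        self.spec = {}             # speculative values, visible only to this transaction
        self.commit_queue = []     # (addr, value) pairs in program order

    def store(self, addr, value):
        self.spec[addr] = value
        # Stores to thread-local words are kept out of the commit queue.
        if not self.is_local(addr):
            self.commit_queue.append((addr, value))

    def load(self, addr):
        # A transaction sees its own speculative value first.
        return self.spec.get(addr, self.memory.get(addr, 0))

    def commit(self):
        """Replay every queued store into memory; return how many were replayed."""
        for addr, value in self.commit_queue:
            self.memory[addr] = value
        replayed = len(self.commit_queue)
        self.commit_queue.clear()
        return replayed
```

In this model, local values simply stay in `spec`, which mirrors the idea that they never need to travel over the bus.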

IMPLEMENTATION OF HTM (cont.)

• The modifications to the coherence protocol are as follows:

A) Whenever a processor performs a write, it does not invalidate other copies. Hence a processor write does not generate a bus write.

B) Whenever a processor reads a value for the first time, it is forced to go to the bus. Other processors must not reply with their own modified values, so the response to a Bus Read request is deactivated. The request is therefore ultimately satisfied by the non-speculative level of memory.
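The effect of modifications A and B can be illustrated with a toy model of two private caches over one shared memory. The `SpecCache` class and its counters are hypothetical names chosen for this sketch; a real coherence protocol has many more states than shown here.

```python
class SpecCache:
    """Hypothetical private cache following rules A and B above."""

    def __init__(self, memory):
        self.memory = memory   # dict addr -> value: non-speculative memory level
        self.lines = {}        # locally cached (possibly speculative) values
        self.bus_reads = 0     # number of Bus Read requests issued
        self.bus_writes = 0    # stays 0: writes are never broadcast before commit

    def write(self, addr, value):
        # Rule A: update the local copy only; no invalidation, no bus write.
        self.lines[addr] = value

    def read(self, addr):
        # Rule B: a first read goes to the bus and is answered by
        # non-speculative memory, never by another cache's modified copy.
        if addr not in self.lines:
            self.bus_reads += 1
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]
```

With two such caches sharing one memory, a speculative write in one cache stays invisible to the other until it is committed.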

IMPLEMENTATION OF HTM (cont.)

• When a thread wants to commit, it locks the bus (first come, first served) for the entire duration of the commit.

• Other threads can keep executing their transactions; otherwise they have to spin, waiting for commit permission.

• However, as each store passes over the common bus during commit, the other threads snoop the address and invalidate themselves if there is a conflict.

• A transaction is retried immediately if it is doomed.
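The commit and snooping steps above can be sketched as a sequential model. It assumes that a conflict means a committed store address hits another transaction's read set, which the slides do not spell out. The names `SnoopingTxn` and `commit_on_bus` are hypothetical, and the bus lock is implied by the loop running to completion without interleaving.

```python
class SnoopingTxn:
    """Hypothetical transaction state as seen by the commit protocol."""

    def __init__(self, name):
        self.name = name
        self.read_set = set()     # addresses read speculatively
        self.commit_queue = []    # (addr, value) stores awaiting commit
        self.doomed = False

    def snoop(self, addr):
        # Assumed conflict rule: a committed store hits this read set.
        if addr in self.read_set:
            self.doomed = True

    def restart(self):
        # A doomed transaction discards its speculative state and retries.
        self.read_set.clear()
        self.commit_queue.clear()
        self.doomed = False


def commit_on_bus(committer, others, memory):
    """Replay the committer's stores while the other transactions snoop them."""
    for addr, value in committer.commit_queue:
        memory[addr] = value
        for txn in others:
            txn.snoop(addr)
    committer.commit_queue.clear()
    return [txn.name for txn in others if txn.doomed]
```

A caller would invoke `restart()` on every transaction named in the returned list, matching the immediate retry of doomed transactions.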

IMPLEMENTATION OF HTM (cont.)

• Each cache line is augmented with new bits:

R: denotes that the cache line has been read in a transaction.

W: denotes an update to the cache line.

SL (Speculative Local): one bit per word; denotes that the word read or written is local to the thread.
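The per-line metadata can be pictured as a small record. The sketch below assumes eight words per line and assumes that an access to a local word sets only its SL bit, leaving R and W untouched; neither detail is stated in the slides, and the `CacheLine` class is hypothetical.

```python
from dataclasses import dataclass

WORDS_PER_LINE = 8  # assumed line geometry; not given in the slides


@dataclass
class CacheLine:
    """Hypothetical transactional metadata attached to one cache line."""
    r: bool = False   # line was read inside the transaction
    w: bool = False   # line was updated inside the transaction
    sl: int = 0       # bitmask with one Speculative Local bit per word

    def access(self, word, is_write, is_local):
        if not 0 <= word < WORDS_PER_LINE:
            raise IndexError("word index out of range")
        if is_local:
            # Assumption: local accesses are tracked only through the SL bit.
            self.sl |= 1 << word
        elif is_write:
            self.w = True
        else:
            self.r = True

    def is_word_local(self, word):
        return bool((self.sl >> word) & 1)
```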

IMPLEMENTATION OF HTM (cont.)

• The TLB structure is not modified. However, a previously unused bit in the protection field is used to hold the locality information of a page (bit number 21 in the case of Alpha).
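Reusing a spare bit of the protection field amounts to ordinary bit masking. The sketch below treats the field as a plain integer; the bit position 21 comes from the slide, while the helper names are hypothetical.

```python
LOCAL_BIT = 21  # bit position given in the slide for Alpha


def mark_page_local(prot_field):
    """Return the protection field with the locality bit set."""
    return prot_field | (1 << LOCAL_BIT)


def page_is_local(prot_field):
    """Return True if the locality bit is set in the protection field."""
    return bool((prot_field >> LOCAL_BIT) & 1)
```

Because the bit travels with the translation, every access can be classified as local or shared at the time of the TLB lookup, without an extra table.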

IMPLEMENTATION OF HTM (cont.)

• To filter out stack accesses, two new registers are added. They hold the stack bounds for the currently executing transaction.
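The stack filter reduces to a range check against the two bound registers. In the sketch below the register names `stack_low` and `stack_high` and the half-open interval are assumptions; the slides do not say how the bounds are derived from the stack pointer and frame pointer.

```python
def is_stack_access(addr, stack_low, stack_high):
    """True if addr falls inside the assumed half-open range [stack_low, stack_high)."""
    return stack_low <= addr < stack_high


def split_accesses(addresses, stack_low, stack_high):
    """Partition addresses into (filtered stack accesses, accesses still tracked)."""
    filtered, tracked = [], []
    for addr in addresses:
        if is_stack_access(addr, stack_low, stack_high):
            filtered.append(addr)
        else:
            tracked.append(addr)
    return filtered, tracked
```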

Source of speed-up

• During commit, a substantial amount of bus bandwidth is saved which would otherwise be wasted on committing local variables.

• For local variables, commit is done by clearing the SL bits in the corresponding cache line.
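The bandwidth saving can be illustrated with per-line bitmasks. This is a simplified model that assumes one bus transfer per written shared word, whereas the design described earlier replays individual stores from a commit queue. The function name and the `dirty_mask` argument are hypothetical.

```python
def commit_line(sl_mask, dirty_mask):
    """Model the commit of one cache line.

    sl_mask:    one bit per word, set if the word is speculative-local.
    dirty_mask: one bit per word, set if the word was written in the transaction.
    Returns (words sent over the bus, SL mask after commit).
    """
    broadcast = dirty_mask & ~sl_mask        # only shared dirty words use the bus
    words_on_bus = bin(broadcast).count("1")
    return words_on_bus, 0                   # local words commit by clearing SL bits
```

For example, with `sl_mask = 0b00001111` and `dirty_mask = 0b00111100`, four words were written but only the two shared ones generate bus traffic.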

RESULTS

Filtered vs Unfiltered Read/Write set size (in bytes)

SPEED-UP Numbers

An average speed-up of 1.14x across the STAMP benchmarks is observed for the scalable TCC type of HTM.

SPEED-UP Numbers (cont.)

An average speed-up of 1.24x across the STAMP benchmarks is observed for the conventional TCC type of HTM.

Commit Expedition

Reduction in commit cycle time