Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors

Ali Shafiee, Narges Shahidi, Amirali Baniasadi

Sharif University of Technology, University of Victoria


This Work: Improving Snoop Coherency

Goal: Improving energy efficiency in snoop-based CMPs.

Motivation: Broadcasting/processing the entire tag is inefficient.

Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.

Key Results: Performance (2.9%), Tag array power (52%), Bandwidth utilization (78.5%)

Our Solution (PTC) vs. Conventional

[Figure: two CMP organizations, each with per-core D$ connected through an interconnect to an upper level cache.]

Conventional: Fast (+); Power & Bandwidth (−)

Our solution: Fast (++, early miss detection); Power & Bandwidth Efficient (+)

Conventional Snooping

[Figure: CPUs with private D$ and a controller, connected by the Address Bus, Snoop Bus, and Command Bus; numbered steps 1 to 5 mark the snoop sequence.]

Redundant (miss): ~70%

Snoop Filters

Goal: Eliminate redundant snoop requests.

Examples: RegionScout (ISCA’05), CGCT (ISCA’05), SSP (ASPLOS’08)

PTC:
(1) Early miss detection using a subset of tag bits.
(2) Once a miss is detected, snoop is avoided.

How often is that possible?

How often are n bits enough to detect a miss?

95+% of misses can be detected using 8 bits.
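The early miss test described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `partial_tag`, `early_miss`, and `detection_rate` and the sample tag values are assumptions made for this example. It relies only on the fact that two tags whose low n bits differ cannot be equal.

```python
def partial_tag(tag: int, n_bits: int = 8) -> int:
    """Return the n least-significant bits of a tag."""
    return tag & ((1 << n_bits) - 1)


def early_miss(request_tag: int, stored_tags: list, n_bits: int = 8) -> bool:
    """True when the partial tags alone prove that the request misses.

    A partial mismatch against every stored tag guarantees a miss.
    A partial match is only a possible hit and still needs a full comparison.
    """
    wanted = partial_tag(request_tag, n_bits)
    return all(partial_tag(t, n_bits) != wanted for t in stored_tags)


def detection_rate(requests: list, stored_tags: list, n_bits: int) -> float:
    """Fraction of true misses that n_bits are enough to detect."""
    misses = [r for r in requests if r not in stored_tags]
    if not misses:
        return 0.0
    detected = sum(early_miss(r, stored_tags, n_bits) for r in misses)
    return detected / len(misses)


# Hypothetical tags held in one 4-way set.
stored = [0x1A2B, 0x3C4D, 0x5E6F, 0x7081]
print(early_miss(0x9999, stored))  # low byte 0x99 matches no stored tag
print(early_miss(0xFF2B, stored))  # low byte 0x2B aliases with 0x1A2B
```

The second request is a true miss that 8 bits cannot detect, which is the aliasing case that keeps the detection rate below 100%.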

PTC-Filter lookup

[Figure: the tag LSBs on the Address Bus are looked up in the PTC-Filter next to each D$.]

Miss: avoid snoop, access upper level.

Hit: snoop potential targets.

PTC-Filter

[Figure: each core's 4-way D$ is paired with a PTC-Filter. A filter set holds ways 0 to 3 for Core1's LSB, Core2's LSB, and Core3's LSB. Each entry stores V, D, and LSB fields; the LSB field is 8 bits.]
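One way to picture the filter organization is the sketch below. It is illustrative only: the class names `FilterEntry` and `PTCFilter` and the methods `update` and `potential_targets` are hypothetical, and the D field shown on the slide is left out because its meaning is not stated here. The filter keeps, for every remote core, one 8-bit partial tag per set and way, and a lookup returns the cores that may hold the block.

```python
from dataclasses import dataclass

N_BITS = 8  # partial-tag width shown on the slide
WAYS = 4    # 4-way D$
MASK = (1 << N_BITS) - 1


@dataclass
class FilterEntry:
    valid: bool = False  # V field
    lsb: int = 0         # low N_BITS of the remote tag


class PTCFilter:
    """Per-core filter mirroring the partial tags of the other cores' D$."""

    def __init__(self, remote_cores, num_sets):
        self.entries = {
            core: [[FilterEntry() for _ in range(WAYS)] for _ in range(num_sets)]
            for core in remote_cores
        }

    def update(self, core, set_index, way, tag):
        """Record that `core` now holds `tag` in the given set and way."""
        entry = self.entries[core][set_index][way]
        entry.valid = True
        entry.lsb = tag & MASK

    def potential_targets(self, set_index, tag):
        """Cores whose partial tags match; an empty list means early miss."""
        wanted = tag & MASK
        return [
            core
            for core, sets in self.entries.items()
            if any(e.valid and e.lsb == wanted for e in sets[set_index])
        ]


f = PTCFilter(remote_cores=[1, 2, 3], num_sets=256)
f.update(core=2, set_index=5, way=1, tag=0xAB12)
print(f.potential_targets(5, 0xCD12))  # core 2 is a potential target
print(f.potential_targets(5, 0xCD13))  # no match: snoop can be avoided
```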

PTC: Filter Miss

[Figure: CPUs with private D$ and a controller, connected by the Address Bus, Snoop Bus, and Command Bus; steps 1 to 3 are marked.]

PTC: Filter Hit

[Figure: CPUs with private D$ and a controller, connected by the Address Bus, Snoop Bus, and Command Bus; steps 1 to 6 are marked, with ✓ and ✗ marks on the individual filter lookups.]

Filter Maintenance

[Figure: steps 1 to 6 across the CPU, its PTC-Filter (sections for Core 0 through Core i), the Address Bus, the Snoop Controller, and the Command Bus.]

Request = A; the set holds tags B, F, D, E.

Miss on A: place it in the position of tag F.

Pending Request Table (fields: Addr., C, W, D): {Address=A, C=0, W=1, D=1}

Place A: insert in Way 1 of core 0.
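The walk-through above can be traced with a small model. This is a sketch under stated assumptions rather than the authors' design: `PendingRequest`, `record_miss`, and `apply_fill` are hypothetical names, tags are modeled as the letters used on the slide, C and W are read as the requesting core and the target way, and the D field is carried through unchanged because its meaning is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingRequest:
    address: str  # requested block, "A" on the slide
    c: int        # C field, read here as the requesting core
    w: int        # W field, read here as the way that receives the block
    d: int        # D field, copied as-is (meaning not stated on the slide)


def record_miss(core, ways, address, victim, d=1):
    """Build the pending-request entry for a miss that replaces `victim`."""
    return PendingRequest(address=address, c=core, w=ways.index(victim), d=d)


def apply_fill(filter_view, req):
    """Return a new filter view with the requested block placed in its way."""
    updated = {core: list(ways) for core, ways in filter_view.items()}
    updated[req.c][req.w] = req.address
    return updated


# One set as seen by the filter: core 0 holds tags B, F, D, E.
view = {0: ["B", "F", "D", "E"]}
req = record_miss(core=0, ways=view[0], address="A", victim="F")
view = apply_fill(view, req)
print(req)
print(view)
```

Keeping the entry immutable and returning a new view makes each step of the slide's sequence easy to inspect in isolation.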

Methodology

• SESC simulator, 4-way CMP
• SPLASH-2 benchmarks
• CACTI 6.0

System parameters:

• DL1/IL1: 64KB/32KB, 4-way/2-way, 3 cycle latency
• L2: 4 MB, 4-banked, 16-way, 10 cycle latency
• Cache line: 64 B
• Memory access: 500 cycles
• Interconnect: 6 cycle arbitration + 2 cycle core to controller latency, crossbar data network
• Coherence: MESI protocol
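To relate the 8-bit partial tag to the simulated caches, the set count and address split can be derived from the parameters above. The helper `cache_geometry` is a hypothetical name, and the 32-bit address width is an assumption that does not come from the slides.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Derive sets and address-field widths for a set-associative cache.

    addr_bits=32 is an assumed address width, not a value from the slides.
    Sizes are expected to be powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


# DL1 from the methodology slide: 64KB, 4-way, 64 B lines.
dl1 = cache_geometry(64 * 1024, ways=4, line_bytes=64)
print(dl1)
```

Under the assumed 32-bit address, the DL1 tag is 18 bits wide, so an 8-bit partial tag covers fewer than half of the tag bits.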

Performance

Average: 2.9%

Bandwidth

Average: 78.5%

Tag Power

Average: 52%

Discussion

Why do benchmarks show different performance improvements?

• Different cache miss frequency
• Different early miss detection frequency
• Not all cache misses are on the critical path

Filter overhead:

• Timing: 1 cycle
• Power: 78.5% of a single tag array access

Summary

PTC: Using a subset of tag bits to improve bandwidth/power efficiency.

Results: Performance: 2.9%, Tag Power: 52%, Bandwidth: 78.5%


Global vs. Local Miss

[Figure: two copies of a CMP with per-core D$, an interconnect, and an upper level cache. Global Miss: every D$ asked "Have B?" answers NO. Local Miss: some D$ answer NO while another answers YES.]

• Local miss detection: better power/bandwidth profile
• Remote miss detection (source-based approach) vs. (destination-based filter)

Partial tag lookup: global miss

Partial tag lookup: local miss