Time-based Snoop Filtering in Chip Multiprocessors

Iman Faraji, Amirkabir University of Technology, Tehran, Iran

Amirali Baniasadi, University of Victoria, Victoria, Canada

This work: Reducing redundant snoops in chip multiprocessors

Our Goal: Improving the energy efficiency of write-through (WT) based CMPs

Our Motivation: There are long time intervals where snooping fails, wasting energy and bandwidth.

Our Solution: Detect such intervals and avoid snoops during them.

Key Results

Memory Energy: 18%, Snoop Traffic: 93%, Performance: 3.8%

Conventional Snooping

[Figure: conventional snooping among four CPUs, each with a data cache (D$), connected through an interconnect to a controller]

Redundant (miss): ~70%

WB vs. WT

Write-through configuration: high memory traffic, simple coherency mechanism

Write-back configuration: low memory traffic, sophisticated coherency mechanism

[Figure: relative memory energy consumption; annotation: Optimize]

Previous Work: Snoop Filters

A good snoop filter is:

Fast and simple

Accurate and effective

It eliminates redundant snoop requests, both local and global.

Local: one core fails to provide the data.

Global: all cores fail to provide the data.

Examples:

RegionScout: Detects Memory Regions Not Shared (Moshovos)

Selective Snoop Request: Predicts Supplier (Atoofian & Baniasadi)

Serial Snooping: Requests Nodes One by One (Saldanha & Lipasti)

Our Work: Time-based Snoop Filtering

Motivation: There are long intervals where snooping fails consecutively

But how long & how often?

Our Work (Cont.)

[Figure annotation: Less than 3]

Global Read Miss (GRM): occurs whenever the snoop fails in all processors, so no remote cache can supply the data.

Local Read Miss (LRM): a redundant snoop in which a single processor fails to supply the data.
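
The two definitions above can be illustrated with a minimal Python sketch. It assumes a hypothetical input format: a list with one boolean per snooped processor, True if that processor's cache supplied the data. The function name and this representation are illustrative assumptions and do not come from the slides.

```python
def classify_snoop(supplied):
    """Classify one snoop broadcast (illustrative helper, not from the slides).

    `supplied` is a hypothetical per-processor list of booleans:
    True means that processor's cache provided the requested data.
    Returns (local_misses, is_global_miss).
    """
    # Every processor that failed to supply data counts as a local read miss (LRM).
    local_misses = [cpu for cpu, hit in enumerate(supplied) if not hit]
    # A global read miss (GRM) means no snooped processor supplied the data.
    is_global_miss = len(supplied) > 0 and len(local_misses) == len(supplied)
    return local_misses, is_global_miss
```

Under this sketch, a snoop where only some processors miss yields LRMs but no GRM, while a snoop where every processor misses yields both.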

Distribution

[Figure: (a) LRM distribution for different processors; (b) GRM distribution]

Periods of data scarcity are usually long.

Time-based Global Miss predictor (TGM)

TGM Types:

TGM-First: First processor that has failed snooping survives.

TGM-Last: Last processor that has failed snooping survives.

TGM Goals:

Detect GRM intervals.

Shut down snooping in all processors but one (the surviving node).

TGM implementation

[Figure: TGM-enhanced CMP]
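
The surviving-node choice made by TGM-First and TGM-Last can be sketched in a few lines of Python. This is only an illustration under assumptions: the slide text does not say how a GRM interval is detected, so the sketch takes that decision as a flag, and the function names, the `failed_order` list (processors in the order their snoops failed), and the policy strings are all hypothetical.

```python
def surviving_node(failed_order, policy):
    """Pick the node that keeps snooping during a GRM interval.

    `failed_order` lists processor ids in the order their snoops failed
    (a hypothetical representation). `policy` is "first" or "last".
    """
    if not failed_order:
        raise ValueError("no failed snoops recorded")
    if policy == "first":   # TGM-First: first processor that failed survives
        return failed_order[0]
    if policy == "last":    # TGM-Last: last processor that failed survives
        return failed_order[-1]
    raise ValueError("policy must be 'first' or 'last'")


def snoop_targets(all_nodes, failed_order, policy, in_grm_interval):
    """Return the nodes that should still be snooped."""
    if not in_grm_interval:
        # Outside a detected GRM interval, snoop every node as usual.
        return list(all_nodes)
    # Inside a GRM interval, only the surviving node is snooped.
    return [surviving_node(failed_order, policy)]
```

For example, with nodes 0 to 3 and snoops that failed in the order 2, 0, 3, the "first" policy keeps node 2 snooping and the "last" policy keeps node 3.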

TGM

[Figure: (a) coverage; (b) accuracy]

Time-based Local Miss predictor (TLM)

Goal: Detect LRMs

How?

Count consecutive snoop misses in a node

Disable snooping when the count exceeds a threshold

Restart snooping after a number of cycles
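
The three steps above can be sketched as a small per-node predictor. This is a minimal Python illustration, not the authors' hardware design: the class name, method names, and the default threshold and restart values are assumptions chosen for the example.

```python
class LocalMissPredictor:
    """Per-node sketch of the count / disable / restart steps (hypothetical)."""

    def __init__(self, miss_threshold=4, restart_cycles=8):
        self.miss_threshold = miss_threshold  # assumed value, not from the slides
        self.restart_cycles = restart_cycles  # assumed value, not from the slides
        self.consecutive_misses = 0           # consecutive snoop misses in this node
        self.restart_countdown = 0            # cycles left until snooping restarts

    def snooping_enabled(self):
        return self.restart_countdown == 0

    def tick(self):
        """Advance one cycle; counts down while snooping is disabled."""
        if self.restart_countdown > 0:
            self.restart_countdown -= 1

    def record_snoop(self, hit):
        """Record the outcome of a snoop performed in this node."""
        if hit:
            # A useful snoop ends the run of consecutive misses.
            self.consecutive_misses = 0
            return
        self.consecutive_misses += 1
        if self.consecutive_misses > self.miss_threshold:
            # Too many misses in a row: disable snooping for a while.
            self.consecutive_misses = 0
            self.restart_countdown = self.restart_cycles
```

In use, the node would call `snooping_enabled()` before servicing a snoop request, `record_snoop()` after each serviced request, and `tick()` once per cycle.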

TLM implementation

[Figure, labeled "TGM-enhanced CMP" in the source: each processor contains a Processing Unit (PU), a first-level cache, and a predictor with a Redundant SNoop (RSN) counter and a ReStarT (RST) counter; annotation: snoop miss counter]

TLM features

[Figure: (a) coverage; (b) accuracy; annotations: 3-bit restart counter, 4-bit]

Methodology

Simulator: SESC

Benchmarks: SPLASH-2

Energy evaluation: CACTI 6.5

System: quad-core CMP

SPLASH-2 benchmarks and input parameters

Benchmark: Input parameters
Barnes: 16K particles
Cholesky: tk29.O
FFT: 1024k complex data points
Ocean: 258x258 ocean
Volrend: Head
Water-Nsqrd: 512 molecules
Water-spatial: 512 molecules

System parameters

Processor
Frequency: 5 GHz
Technology: 68 nm
Branch predictor: 16K-entry bimodal and gshare
Fetch/Issue/Commit: 4/4/5
Branch penalty: 17 cycles
RAS: 32 entries
BTB: 2K entries, 2-way

Interconnection network
Data interconnect: crossbar
Interconnect width: 64 B

Memory
IL1: 64 KB, 2-way
DL1: 64 KB, 4-way, write-through; access time: 1 cycle; block size: 64
Cache line size: 32
L2: 512 KB, 8-way, write-through; access time: 11 cycles; block size: 64
Memory: 1 GB; access time: 70 cycles
Page size: 4 Kbit

Relative Snoop Traffic Reduction

TGM-F: 58%, TGM-L: 57%, TLM: 77%

Relative Memory Energy

TGM-F: 8%, TGM-L: 8.5%, TLM: 11%

Relative Memory Delay

TGM-F: 1.1%, TGM-L: 2.1%, TLM: 1.7%

Relative Performance

TGM-F: no change, TGM-L: 0.4%, TLM: 0.3%

Summary

We showed:

Long data scarcity periods (DSPs) exist during workload runtime.

During DSPs, redundant snoops happen frequently and consecutively.

Our solutions:

TGM: uses snoop behavior on all processors to detect and filter redundant snoops. It shuts down snooping on as many processors as possible.

TLM: filters redundant snoops in a single node. It counts recent redundant snoops to detect data scarcity periods and filters upcoming redundant snoops.

Simulation results:

Snoop Reduction: TGM-F: 58%, TGM-L: 57%, TLM: 77%

Memory Energy: TGM-F: 8%, TGM-L: 8.5%, TLM: 11%

Memory Delay: TGM-F: 1.1%, TGM-L: 2.1%, TLM: 1.7%

Performance: TGM-F: no change, TGM-L: 0.4%, TLM: 0.3%

Thanks for your attention

Backup Slides

Discussion

How do the characteristics of the benchmarks affect the memory energy and delay reductions achieved by our solution?

1. True detection of redundant snoops

2. Share of redundant snoops

Memory Energy.Delay

Memory Energy = energy consumed to provide the requested data

Memory Delay = time required to provide the requested data

Volrend Benchmark

While running, Volrend rarely sends snoop requests.

This application renders a three-dimensional volume. It renders several frames from changing viewpoints.

Consecutive frames in rotation sequences often vary only slightly in viewpoint.

High Temporal Locality

Volrend distributes load very well

High Spatial Locality