Energy-Efficient, High-Performance Heterogeneous Core Design

Raj Parihar

Core Design Session, MICRO - 2012

Advanced Computer Architecture Lab, UofR, Rochester

April 18, 2013




References

Composite Cores: Pushing Heterogeneity into a Core
A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke
University of Michigan, Ann Arbor

MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP
Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson, and Y. N. Patt
UT Austin, HPS Lab, Intel Labs - Hillsboro


Motivation

Workloads and applications exhibit different phases

Some phases are constrained by a fundamental ILP limit

In an inherently low-ILP phase, a simple in-order core can be used instead of an out-of-order core

The in-order core saves energy without degrading overall performance

Phases also have varying degrees of exploitable ILP and TLP

An out-of-order engine is more efficient in the high-ILP phases

A highly threaded in-order SMT core is more beneficial in TLP phases

The overall idea is to identify the phase behavior and change the architecture on the fly to suit the need


Outline

Motivation

Composite Cores: Heterogeneity

MorphCore: Exploit ILP and TLP

Summary


Composite Cores: Heterogeneity within a Single Core

Heterogeneous multicore systems, capable of achieving either high performance or energy efficiency, are quite prominent

They often migrate applications or phases to the specific core that favors them

Issues with conventional heterogeneous systems:

Slow migration requires large phases (hundreds of millions of instructions)

Switching is often coarse-grained, so fine-grained opportunities are lost

Switching and migration have significant performance overhead

Proposed solution: a single-core microarchitecture that integrates big and little compute µEngines together

An online controller can map 25% of the code to the little µEngine

Achieves 18% energy savings at a performance loss ≤ 5%


Conventional Heterogeneous CMP: ARM’s big.LITTLE

Incorporates two different kinds of cores on the same chip

big: Cortex-A15 (3-way OoO), deeply pipelined (15-25 stages)

LITTLE: Cortex-A7 (2-way in-order), short pipeline (8-10 stages)

How do these fare against each other?

Performance: Cortex-A15 is 2-3x faster than Cortex-A7

Energy: Cortex-A7 is 3-4x more energy-efficient than Cortex-A15

The two kinds of cores are utilized, through migration, when an appropriate phase arrives

Migration happens through coherent L2 caches and costs about 20 µs

Requires large phases to amortize the cost of slow migration

Composite Cores: modify a single core to suit both needs


Fine-Grain Switching Interval

Conventional heterogeneous CMPs require large phases

To amortize the cost of switching, phases are typically a few million instructions long

The migration overhead precludes fine-grained switching in traditional heterogeneous core designs
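A rough back-of-the-envelope sketch of why the migration cost forces coarse-grained switching. The 20 µs migration cost is the big.LITTLE figure quoted earlier in the slides; the overhead budget, clock frequency, and IPC are illustrative assumptions, not numbers from either paper.

```python
def min_phase_instructions(migration_cost_s, overhead_budget, freq_hz, ipc):
    """Smallest phase length (in instructions) for which a single migration
    costs no more than `overhead_budget` of the phase's run time."""
    min_phase_time_s = migration_cost_s / overhead_budget
    return min_phase_time_s * freq_hz * ipc


# 20 us migration cost (from the slides).
# Assumed for illustration: 1% overhead budget, 1 GHz clock, IPC of 1.
insts = min_phase_instructions(20e-6, 0.01, 1e9, 1.0)
print(f"{insts:,.0f} instructions per phase")
```

Under these assumed parameters the minimum phase length comes out at roughly two million instructions, the same order of magnitude as the "few million" quoted above; a tighter overhead budget or a faster core pushes the number higher.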


Composite Cores: Architecture

Each core consists of two tightly coupled compute µEngines

Achieves high performance and energy efficiency by switching between the µEngines in response to changes in application performance

Shared: front-end, branch predictor, data and instruction caches

Extra component: a reactive online controller to perform the switching

Switching requires only a register file transfer and some stalling


Reactive Online Controller

The online controller tries to maximize the energy savings subject to a configurable maximum performance degradation, or slowdown

Estimates the dynamic performance loss using a linear model

Switching happens when the estimated loss exceeds the acceptable threshold

The performance estimator is the most crucial, complex, and tricky component, and it involves many approximations
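A minimal sketch of the threshold decision described above, assuming the estimator supplies a per-quantum IPC estimate for each µEngine. The names `Engine`, `choose_engine`, `est_ipc_big`, and `est_ipc_little` are hypothetical, and the loss definition is a simplification; the controller in the paper is more involved than a single comparison.

```python
from enum import Enum


class Engine(Enum):
    BIG = "big"
    LITTLE = "little"


def choose_engine(est_ipc_big, est_ipc_little, max_slowdown):
    """Pick the µEngine for the next quantum.

    Runs on the little µEngine only while the estimated performance loss
    relative to the big µEngine stays within `max_slowdown` (e.g. 0.05).
    """
    # Assumed loss metric: fraction of big-engine throughput given up.
    est_loss = 1.0 - est_ipc_little / est_ipc_big
    return Engine.LITTLE if est_loss <= max_slowdown else Engine.BIG


# A memory-bound quantum where both engines perform alike stays on LITTLE;
# a high-ILP quantum is sent back to BIG.
print(choose_engine(1.00, 0.97, 0.05))
print(choose_engine(2.00, 1.00, 0.05))
```

Raising `max_slowdown` lets more quanta run on the little µEngine, which is the knob behind the performance/energy trade-off reported later in the deck.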


Performance Estimator

The goal of this module is to provide an estimate of the performance of both µEngines in the previous quantum and overall

Performance estimation of the non-active core is challenging

Uses a linear performance estimation model: $y = a_0 + \sum_i a_i x_i$

Various stats are collected: L2 misses, ILP, L2 hits, MLP, etc.

Ridge regression analysis is used to determine the coefficients
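To make the linear model concrete, here is a small pure-Python sketch that fits $y = a_0 + \sum_i a_i x_i$ by ridge regression and then evaluates it. The function names, the penalty `lam`, and the toy data are assumptions for illustration; the features and coefficients actually used in the paper are not reproduced here.

```python
def ridge_fit(samples, targets, lam):
    """Fit y = a0 + sum(a_i * x_i) by ridge regression.

    Solves (X^T X + lam * I') a = X^T y, where I' leaves the intercept
    unpenalized. Returns [a0, a1, ..., an].
    """
    n = len(samples[0]) + 1                      # +1 for the intercept column
    rows = [[1.0] + list(x) for x in samples]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(n)]
    for i in range(1, n):
        A[i][i] += lam
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for c in range(col, n):
                A[row][c] -= f * A[col][c]
            b[row] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the linear model a0 + sum(a_i * x_i)."""
    return coeffs[0] + sum(a * xi for a, xi in zip(coeffs[1:], x))


# Toy data only: two made-up features per quantum and a made-up target.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1.0, 3.0, -2.0, 0.0, 2.0]
model = ridge_fit(xs, ys, lam=0.1)
print(predict(model, (1, 1)))
```

With `lam=0` this reduces to ordinary least squares; a positive `lam` shrinks the slope coefficients, which helps when the collected statistics are correlated with each other.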


Overall Energy Savings

Implementable regression model saves about 18% energy

Reduction in energy-delay-product is 21%


Switching Impact on Performance

Subject to a 5% slowdown, the acceptable margin in performance

mcf is memory bound; the decrease in branch misprediction latency actually causes a small performance improvement


Little Core Utilization

On average, about 25% of the code can be mapped to the little core

Given oracle knowledge, about 37% of the code can be mapped

Applications like mcf can be mapped completely to the little core


Average Little Core Power

The little µEngine consumes a little more power than a standalone little core because of the over-provisioned shared resources


Performance Energy Sensitivity

Allowing only a 1% slowdown saves up to 4% of the energy

A 20% performance drop can save up to 44% of the energy

A good feature to have where maintaining usability is essential

Example: low battery levels in laptops and cell phones


MorphCore: Motivation

In general, industry builds two types of cores:

Large out-of-order cores: Intel’s Sandy Bridge, IBM’s POWER7

Small cores: Intel’s Larrabee, Sun’s Niagara, ARM’s A15

OoO cores provide high single-thread performance by exploiting ILP but are power-inefficient for multi-threaded programs

MorphCore is built on two key insights:

A highly threaded in-order SMT core can achieve instruction issue throughput similar to an OoO core (Hily, Seznec)

An in-order SMT core can be built using a subset of the OoO hardware

MorphCore: start with a traditional OoO core and make minimal changes to transform it into a highly threaded in-order SMT core


In-order SMT vs Out-of-order Superscalar

Hily & Seznec: a highly threaded in-order core can achieve throughput similar to an OoO core on multi-threaded apps (HPCA’99)

In high-TLP applications, high performance and low energy consumption can be achieved with in-order SMT execution


MorphCore Microarchitecture

Two modes of execution: OutOfOrder and InOrder

Based on a traditional OoO core, and additionally supports:

Additional in-order SMT threads

In-order scheduling, execution, and commit of simultaneously running threads


Details of Microarchitecture

Fetch: using hardware muxes, the front-end can be configured in two ways

InOrder SMT mode: 8 threads; OutOfOrder mode: 2 threads


Real Details: Too Specific

Hardware muxes and reconfigurable logic “transform” the OoO core into an in-order SMT core

Modified rename stage: details are too involved!


Wakeup and Selection Logic

After all these modifications, the authors claim that only 2.5% of extra delay is added to the critical path of the design, i.e., a roughly 2.5% slower clock frequency


MorphCore Mode Switching

No switching overhead for the OS: the hardware switches modes by itself

Not described clearly (most of it is future work!)

The general idea: when the OS schedules more threads, the program is in a parallel region, so enable in-order SMT (threshold: more than 2 threads)

When the number of active threads is ≤ 2, enable the OoO engine

Assumes the thread library uses MONITOR/MWAIT instructions so that the MorphCore hardware can detect a thread becoming inactive

Claims that the penalty is minimal because no migration of instructions or data needs to happen on a mode switch, only:

Pipeline flushing and stalling

Register and mux reconfiguration
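The thread-count policy above can be sketched as follows. This is only an illustration of the stated threshold (more than 2 active threads selects in-order SMT); the function names and the string mode labels are hypothetical, and the real mechanism is implemented in hardware rather than software.

```python
OOO_THREAD_LIMIT = 2  # from the slides: up to 2 active threads run in OutOfOrder mode


def select_mode(active_threads):
    """Return the execution mode for the given number of active threads."""
    return "InOrder" if active_threads > OOO_THREAD_LIMIT else "OutOfOrder"


def mode_switches(thread_counts):
    """Count the mode switches triggered by a sequence of active-thread counts."""
    switches = 0
    current = None
    for count in thread_counts:
        mode = select_mode(count)
        if current is not None and mode != current:
            switches += 1
        current = mode
    return switches


# A serial region, then a parallel region with 8 threads, then serial again.
trace = [1, 2, 8, 8, 2, 1]
print([select_mode(n) for n in trace])
print(mode_switches(trace))
```

Each counted switch corresponds to one pipeline flush plus reconfiguration, so a workload whose active-thread count oscillates around the threshold would pay that penalty repeatedly.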


Performance Results

ST (single-threaded) apps: MorphCore performs very close to the 2-way SMT OoO core

MT (multi-threaded) apps: MorphCore performs close to the 6-thread in-order SMT core (SMALL)


Overall Speedup, Power and Energy

Performance and Energy combined

MorphCore does better than all the other alternatives


Comparison with CoreFusion

Opposite approach: instead of building a larger core from small cores (CoreFusion), MorphCore scales down the OoO design to implement a simple in-order SMT core


Other Metrics compared to CoreFusion

Reduces power by 19%, energy by 29%, and the energy-delay-squared product by 29%


Summary

Both ideas are quite similar to each other

Both proposals bring the notion of heterogeneity within a core

Both designs try to leverage fine-grain phases at runtime

They also try to reuse (share) as much hardware as possible

Both designs also try to minimize the migration overhead

Both designs require significant modifications to the core microarchitecture

The savings/benefits are only a few percent

Complexity is quite high for these new core designs
