Chip MultiProcessor

Evgeny Bolotin, EE Department, Technion, Israel, November 2004. VLSI Architectures Seminar - 048879.

Page 1: Chip MultiProcessor

1 Evgeny Bolotin – VLSI Arch Seminar, Nov 2004

Chip MultiProcessor

Evgeny Bolotin

EE Department, Technion, Israel

November 2004

VLSI Architectures Seminar - 048879

Page 2: Chip MultiProcessor

Outline

Intro

Single Chip Multiprocessor (IEEE Computer, 1997)

IBM Power5 (IEEE Micro 2004)

Energy Efficiency (CMP vs. SMT), ICS 2004

Intel Network Processor-IXP2800

Niagara – Sun, October 2004

Summary

Page 3: Chip MultiProcessor

When are two heads better than one?

Page 4: Chip MultiProcessor

CMP Motivation

• How to utilize available silicon?
  • Speculation (aggressive superscalar)
  • Simultaneous Multithreading (SMT, Hyperthreading)
  • Several processors on a single chip

• What is a CMP (Chip MultiProcessor)?
  • Several processors (several masters)
  • Both shared and distributed memory architectures
  • Both homogeneous and heterogeneous processor types

• Why?
  • Wire delays
  • Diminishing returns of uniprocessors
  • Very long design and verification times for modern processors

Page 5: Chip MultiProcessor

A Reminder: SMT (Simultaneous Multi Threading)

SMT:
• Pool of execution units (wide machine)
• Several logical processors
• Copy of state for each
• Multiple threads running concurrently
• Better utilization and latency tolerance

CMP:
• Simple cores
• Moderate amount of parallelism
• Threads running concurrently on different cores

Page 6: Chip MultiProcessor

A Reminder: SMT (Simultaneous Multi Threading)

SMT vs. CMP

Page 7: Chip MultiProcessor

A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer, 1997

• For the same area (a billion-transistor DRAM area)

Superscalar and SMT: very complex
• Wide
• Advanced branch prediction
• Register renaming
• OOO instruction issue
• Non-blocking data caches

[Figure panels: Superscalar (SS), SMT, CMP]

Page 8: Chip MultiProcessor

SS and SMT vs. CMP

CPU Cores: three main hardware design problems (of SS and SMT):

• Area increases quadratically with core complexity
  • Number of registers: O(instruction window size)
  • Register ports: O(issue width)
  • CMP solves this problem (area is ~linear in issue width)

• Longer cycle times
  • Long wires, many MUXes and crossbars
  • Large buffers, queues and register files
  • Remedies: clustering (decreases ILP) or deep pipelining (branch misprediction penalties)
  • CMP allows a small cycle time (with little effort): cores are small and fast, but rely on software to schedule and have poor ILP

• Complex design and verification
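The area argument above can be illustrated with a toy model. The sketch below assumes, purely for illustration, that core area is dominated by a register file whose size is (number of registers) x (number of ports), and that the instruction window grows in proportion to issue width. The constants `regs_per_slot` and `ports_per_slot` and both function names are hypothetical and are not taken from the Hammond et al. paper.

```python
def wide_core_area(issue_width, regs_per_slot=8, ports_per_slot=3):
    """Toy area (arbitrary units) of one wide superscalar/SMT core."""
    registers = regs_per_slot * issue_width   # O(instruction window size)
    ports = ports_per_slot * issue_width      # O(issue width)
    return registers * ports                  # grows quadratically


def cmp_area(total_issue_width, core_width=2):
    """Toy area of a CMP that reaches the same total width with narrow cores."""
    cores = total_issue_width // core_width
    return cores * wide_core_area(core_width)  # grows linearly


for width in (2, 4, 8, 16):
    print(width, wide_core_area(width), cmp_area(width))
```

Under these assumptions, doubling the issue width of a single core quadruples its area, while doubling the number of narrow cores only doubles the CMP area.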

Page 9: Chip MultiProcessor

SS and SMT vs. CMP

Memory:
• 12-issue SS or SMT requires a multiport data cache (4-6 ports)
  • 2 x 128 Kbyte (2-cycle latency)
• CMP: 16 x 16 Kbyte (single-cycle latency), but the secondary cache is slower (multiport)
• Shared memory: write-through caches

[Figure panels: SMT, CMP]

Page 10: Chip MultiProcessor

Performance comparison

• Compress (integer apps): low ILP and no TLP
• Mpeg-2 (multimedia apps): high ILP and TLP, moderate memory requirement (parallelized by hand)
  + SMT utilizes core resources better
  + But CMP has 16 issue slots instead of 12
• Tomcatv (FP applications): large loop-level parallelism and large memory bandwidth (TLP by compiler)
  + CMP has large memory bandwidth on primary cache
  - SMT fundamental problem: unified and slow cache
• Multiprogram: integer multiprogramming workload, all computation-intensive (low ILP, high PLP)

Page 11: Chip MultiProcessor

A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer, 1997

• TLP and PLP become widespread in future applications, which favours CMP
  • Various multimedia applications
  • Compilers and OS

CMP:
• Better performance with simple hardware
• Higher clock rates, better memory bandwidth
• Shorter pipelines

SMT has better utilization, but CMP has more resources (no wide-issue logic)

Although CMP is bad for workloads with low ILP and no TLP (compress), SMT and SS are not much better

Page 12: Chip MultiProcessor

IBM Power5 chip: a dual-core multithreaded processor

Ron Kalla et al., IEEE Micro, vol. 24, no. 2, Mar-Apr 2004

[Figure panels: Power4 system (2001), Power5 system (2004)]

• 174 million transistors, 2 cores
• L1: 64 KB I$, 128 KB D$, 128 B lines
• Shared L2 (1.5 MB) and L3 (32 MB)
• 5-way each (8 issue / 5 retire)
• 100-instruction window
• 15-stage pipe
• 130 nm

Enhancements:
• Two-way SMT (additional complexity is unjustified: diminishing returns and cache thrashing)
• L3 closer to L2 (less traffic; system scales better, to 64 from 32)
• Shared L2 (1.875 MB) and L3 (36 MB)
• Memory controller on chip
• Fewer system chips, reduced memory latency

Page 13: Chip MultiProcessor

IBM Power5 chip

• 8 metal layers
• 389 mm^2
• L3 directory
• On-chip memory controller (MC)

Page 14: Chip MultiProcessor

IBM Power5 chip

Processor Core:
• SMT and Single Thread (ST) operation modes
• 8-way fetch and translation (shared resource)
• Branch prediction (shared)
  • 3 BHTs (shared)
  • Return stack (separate)
• 120 GPRs and 120 FPRs (dynamically shared in SMT, all used in ST)
• Shared issue queues and XUs

Page 15: Chip MultiProcessor

IBM Power5 chip

Enhanced SMT:
• Dynamic resource balancing (e.g., monitors L2$ misses and throttles the thread)
• Adjustable thread priority, 8 levels: affects decode cycles (set by software: idle loop, real-time apps, etc.)
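To make "priority affects decode cycles" concrete, here is a minimal sketch of one possible mapping from two thread priorities to decode-cycle shares. The rule used (with priority difference d, the lower-priority thread gets 1 of every 2^(d+1) decode cycles) is an assumption chosen for illustration, and the function name `decode_shares` is hypothetical; the slide itself only states that there are 8 levels and that they affect decode cycles.

```python
def decode_shares(prio_a, prio_b):
    """Return (share_a, share_b) of decode cycles for two threads.

    Assumed rule (illustrative only): with priority difference d, the
    lower-priority thread gets 1 of every 2**(d + 1) decode cycles.
    """
    if not (0 <= prio_a <= 7 and 0 <= prio_b <= 7):
        raise ValueError("priorities must be in 0..7 (8 levels)")
    window = 2 ** (abs(prio_a - prio_b) + 1)
    low = 1 / window
    high = 1 - low
    return (high, low) if prio_a >= prio_b else (low, high)


print(decode_shares(4, 4))  # equal priorities share decode evenly
print(decode_shares(5, 4))
```

Software such as an idle loop would lower its own priority so that the sibling thread receives most of the decode bandwidth.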

Page 16: Chip MultiProcessor

IBM Power5 chip

ST operation:
• All physical resources go to the active thread

Dynamic Power Management:
• Extensive dynamic clock gating (on a per-cycle basis)
• Dual-Vt for reduced leakage
• Low-power mode (x32 slower instruction dispatch, see figure)

Page 17: Chip MultiProcessor

Idea: More Flexible CMP (?)

Poor Flexibility

Bad for legacy code and low TLP

Page 18: Chip MultiProcessor

More Flexible CMP (?)

SMT

Programmable SMTs

Page 19: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads, Ruchira Sasanka et al., University of Illinois and Intel Arch. Research Lab, ICS'04, France

For the same performance: compare the energy efficiency of CMP and SMT for multimedia (MM) apps.

• MM applications are becoming important and have high TLP
• What about energy?
  • SMT might appear more energy efficient, since it makes better use of the hardware resources

WRONG!

Compare at many performance points: different core architectures and frequency/voltage settings.

Page 20: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads

Design Space:

• 2-thread and 4-thread systems checked
• Based on an out-of-order superscalar (MIPS R10000)
• Fetch/decode width varies from 2 to 8
  • Other parameters (e.g., window size, number of execution units) change accordingly
• Frequency varies from 600 MHz to 1.6 GHz (voltage is scaled accordingly)
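Why frequency and voltage scaling matters for energy can be sketched with the standard dynamic power relation P ~ C * V^2 * f. The example below is a rough illustration, not the paper's model: the linear voltage/frequency mapping, the 0.8 V to 1.3 V range, and the names `scaled_voltage` and `epi_and_tpi` are all assumptions made here. Only the 600 MHz to 1.6 GHz range comes from the slide.

```python
def scaled_voltage(freq_mhz, f_lo=600.0, f_hi=1600.0, v_lo=0.8, v_hi=1.3):
    """Supply voltage for a frequency, by linear interpolation (assumed)."""
    if not f_lo <= freq_mhz <= f_hi:
        raise ValueError("frequency outside the modeled range")
    t = (freq_mhz - f_lo) / (f_hi - f_lo)
    return v_lo + t * (v_hi - v_lo)


def epi_and_tpi(freq_mhz, cpi=1.0, switched_cap=1.0):
    """Return (EPI, TPI) in arbitrary units for one operating point."""
    volts = scaled_voltage(freq_mhz)
    power = switched_cap * volts ** 2 * freq_mhz  # P ~ C * V^2 * f
    tpi = cpi / freq_mhz                          # time per instruction
    return power * tpi, tpi                       # energy = power * time


for f in (600, 1000, 1600):
    epi, tpi = epi_and_tpi(f)
    print(f, round(epi, 3), round(tpi, 5))
```

In this model the frequency cancels out of the energy per instruction, leaving EPI proportional to V^2: running faster costs more energy per instruction only because the voltage has to rise with the frequency.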

Page 21: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads

Core and Memory parameters:
• In SMT:
  • Threads share most of the resources
  • Separate BPT, return address stack, and 32 integer and 32 FP registers
  • ICOUNT policy to prioritize instruction fetch (prefer the “unstuck” thread)
• Caches are modeled, and the size is chosen to achieve a 99% hit rate for I$ and 98% for D$
  • CMP: 16K for I$ and 8K for D$
  • SMT was given the same amount of cache per thread
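The ICOUNT fetch policy mentioned above gives fetch priority to the thread with the fewest instructions waiting in the front end, which tends to favour threads that are making progress. A minimal sketch follows; the function name `icount_pick`, the dictionary layout, and the tie-break by lowest thread id are assumptions made for this illustration.

```python
def icount_pick(in_flight):
    """Pick the thread to fetch from next under an ICOUNT-style policy.

    in_flight maps thread id -> number of that thread's instructions
    currently in the decode/rename/issue-queue stages. Ties go to the
    lowest thread id (an arbitrary choice for this sketch).
    """
    if not in_flight:
        raise ValueError("no threads to choose from")
    return min(in_flight, key=lambda tid: (in_flight[tid], tid))


# Thread 0 is stalled with a full queue, so thread 1 gets the fetch slot.
print(icount_pick({0: 12, 1: 3}))
```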

Page 22: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads

Workload:
• 8 representative single-threaded multimedia apps (video and speech codecs)
• A given benchmark can be parallelized: each thread processes a different frame
• Consider two- and four-threaded benchmarks
• Run the same number of frames on both systems
• Compare using Energy per Instruction (EPI) and Time per Instruction (TPI)

Simulation Environment:
• Modified RSIM simulator: a cycle-level simulator that models branch and address speculation and contention
• Enhanced Wattch for dynamic power modeling
• Bus power is modeled
• Static power is modeled with the HotLeakage model (2% of dynamic at 0.13 um)
• Assume 90% clock gating
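The two metrics can be computed directly from a simulation run. The sketch below is an illustration under stated assumptions: static energy is added as a flat 2% of dynamic energy (following the figure quoted above), and the function names and the numbers in `runs` are hypothetical, not results from the paper.

```python
def energy_per_instruction(dynamic_energy_j, instructions, static_fraction=0.02):
    """EPI in joules, adding static energy as a fraction of dynamic energy."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return dynamic_energy_j * (1.0 + static_fraction) / instructions


def time_per_instruction(elapsed_s, instructions):
    """TPI in seconds."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return elapsed_s / instructions


# Hypothetical runs over the same number of frames on both systems.
runs = {
    "CMP": {"dynamic_energy_j": 2.0, "elapsed_s": 0.5, "instructions": 10**9},
    "SMT": {"dynamic_energy_j": 2.6, "elapsed_s": 0.5, "instructions": 10**9},
}
for name, r in runs.items():
    epi = energy_per_instruction(r["dynamic_energy_j"], r["instructions"])
    tpi = time_per_instruction(r["elapsed_s"], r["instructions"])
    print(name, epi, tpi)
```

Comparing EPI at equal TPI, as in the two hypothetical rows, is what "for the same performance" means in this study.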

Page 23: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads

Metrics:

Vary freq.

Page 24: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads

Results: two-thread workloads

Page 25: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads

Results: four-thread workloads

Page 26: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads

Summary of simulation results:
• CMP always gives the least EPI (they assumed enough threads …)
• For four-thread workloads CMP is significantly better
• The difference increases with increased performance
• For a fixed system: the overall best architecture can be picked (by minimum deviation from the “best” curve)
• Interesting graph:
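Picking a single architecture by "minimum deviation from the best curve" can be sketched as follows. The interpretation is an assumption: the "best" curve is taken to be the lowest EPI at each performance point across all configurations, and deviation is the mean relative excess over it. The configuration names and EPI values are hypothetical.

```python
def best_overall(epi_by_config):
    """Return the configuration whose EPI curve stays closest to the envelope.

    epi_by_config maps a configuration name to its EPI at each of the
    same ordered performance points.
    """
    curves = list(epi_by_config.values())
    points = len(curves[0])
    if points == 0 or any(len(c) != points for c in curves):
        raise ValueError("all curves need the same, non-zero length")
    # The "best" curve: lowest EPI achieved at each performance point.
    envelope = [min(c[i] for c in curves) for i in range(points)]

    def deviation(curve):
        return sum(c / e - 1.0 for c, e in zip(curve, envelope)) / points

    return min(epi_by_config, key=lambda name: deviation(epi_by_config[name]))


# Hypothetical EPI values at three performance points.
data = {
    "cmp_narrow": [1.0, 1.2, 2.0],
    "smt_wide": [1.1, 1.1, 3.0],
}
print(best_overall(data))
```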

Page 27: Chip MultiProcessor

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads

IDEAS: ???
• Hybrid is much better than SMT and comes very close to CMP
  • Best of all worlds?
• Maybe adaptive architecture and DVS?
• Adaptive fetch width?
• DVS much easier for CMP?
• High and medium performance regions? Heterogeneous (ask Zvika)

Page 28: Chip MultiProcessor

Intel Network Processor (IXP 2800)

• XScale processor (700 MHz)
• 16 microengines (1.4 GHz)
• Power dissipation: ~30 W
• Package area: 38x38 mm^2
• Vdd = 1.3 V

Page 29: Chip MultiProcessor

Intel Network Processor (IXP 2800)

Microengine:

• 6-stage pipeline
• Huge register set
• Multiple threads (8)
• Memories
• Hardware accelerators

Page 30: Chip MultiProcessor

Sun Roadmap

Blades:

Niagara:
• 8 UltraSparc II cores x 4 threads
• They hope for x15 performance

Page 31: Chip MultiProcessor

27 October 2004 (EETimes):

Sun Microsystems has manufactured first silicon of its next-generation Niagara processor, which isn't due to ship until 2006. The advanced chip contains eight 64-bit UltraSparc cores and will power new systems, which Sun plans to position as "throughput computing" engines capable of handling network-intensive tasks. "We are now with a working chip," David Yen, Sun's executive vice president for scalable systems, tells VARBusiness. "The 1.0 design is running in the lab. It's running the Solaris 9 operating system on top of it, with 32 application threads on top of Solaris."

The eight-core Niagara will dissipate only 60 W of power, according to Yen. That's a fraction of the 100 W or so consumed by today's dual-core UltraSparc IV and is also likely beneath the power figure expected from dual-core processors due out of Intel and AMD in 2005. Niagara will be fabricated in an advanced 90-nm process. It also boasts a host of on-chip features, which make its design highly integrated. The initial version will include an on-board Ethernet controller and a built-in memory controller. Subsequent versions, according to Yen, "will have 10-Gigabit Ethernet and even cryptologic [capability] built on the chip." Sun says the year-long interval until the chip comes to market will be used to bring Sun's partners up to speed.

Page 32: Chip MultiProcessor

Conclusions

• CMP reduces hardware/power overhead

• SMT can yield better single-thread performance (at high cost)

• CMP can improve application performance if the compiler can extract thread-level parallelism

• What is the most effective use of on-chip real estate?
  • Depends on the workload
  • Depends on compiler technology

• Hybrid/Reconfigurable?

• Heterogeneous?

Page 33: Chip MultiProcessor

References:

• VLSI Architectures Lecture (Uri Weiser)

• Hyper-Threading Technology Architecture and Microarchitecture, Intel

• A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer, 1997

• IBM Power5 chip: a dual-core multithreaded processor, R. Kalla, B. Sinharoy, J. M. Tendler, IEEE Micro, vol. 24, no. 2, Mar-Apr 2004, pp. 40-47

• The Energy Efficiency of CMP vs. SMT for Multimedia Workloads, Ruchira Sasanka et al. (University of Illinois and Intel Arch. Research Lab), ICS'04

• Network processors, Intel Technology Journal

• "Sun weaves multi-media future", K. Krewel, Microprocessor Report, April 2003

Page 34: Chip MultiProcessor

Energy: CMP vs SMT