Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley · Replace complicated OOE with more processors Each with own L1 cache Use communication crossbar for a shared L2 cache Communication still fast, same


Page 1:

Kaisen Lin and Michael Conley

Page 2:

  Simultaneous Multithreading
◦  Instructions from multiple threads run simultaneously on a superscalar processor
◦ More instruction fetching and register state
◦ Commercialized!
  DEC Alpha 21464 [Dean Tullsen et al.]
  Intel Hyperthreading (Xeon, P4, Atom, Core i7)

Page 3:

 Web applications
◦ Web, DB, file server
◦ Throughput over latency
◦  Service many users
◦  Slight delay acceptable   a 404 is not

  Idea?
◦ More functional units and caches
◦ Not just storage state

Page 4:

Page 5:

  Instruction fetch
◦  Branch prediction and alignment
  Similar problems to the trace cache
◦  Large programs: inefficient I$ use

  Issue and retirement
◦ Complicated OOE logic not scalable
  Need more wires, hardware, ports

 Execution
◦ Register file and forwarding logic

 Power-hungry

Page 6:

Page 7:

 Replace complicated OOE with more processors
◦  Each with its own L1 cache

 Use a communication crossbar for a shared L2 cache
◦ Communication still fast, same chip

  Size details in paper
◦  6-SS about the same as 4x2-MP
◦  Simpler CPU overall!

Page 8:

Page 9:

  SimOS: simulated hardware environment
◦ Can run commercial operating systems on multiple CPUs
◦  IRIX 5.3 tuned for multi-CPU

 Applications
◦  4 integer, 4 floating-point, 1 multiprogrammed

 PARALLELIZED!

Page 10:

 Which one is better?
◦ Misses per completed instruction
◦  In general, hard to tell what happens

Page 11:

 Which one is better?
◦  6-SS isn’t taking advantage!

Page 12:

 Actual speedup metrics
◦ MP beats the pants off SS sometimes
◦  Doesn’t perform much worse other times

Page 13:

 6-SS better than 4x2-MP
◦ Non-parallelizable applications
◦  Fine-grained parallel applications

 6-SS worse than 4x2-MP
◦ Coarse-grained parallel applications

  Implications
◦ Can trade off latency for throughput
◦ Need to parallelize software
◦ Hardware can’t automagically do it anymore
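The latency/throughput tradeoff above follows directly from Amdahl's law; a minimal sketch (the parallel fractions and core counts here are illustrative, not numbers from the paper):

```python
# Amdahl's law: speedup of a workload on p cores when a fraction f of
# its work is parallelizable. Illustrative numbers, not paper data.

def amdahl_speedup(f, p):
    """Speedup over one core: 1 / ((1 - f) + f / p)."""
    return 1.0 / ((1.0 - f) + f / p)

# A coarse-grained parallel app (f ~ 0.95) uses 8 simple cores well...
print(amdahl_speedup(0.95, 8))   # ~5.93x
# ...but a non-parallelizable app (f = 0) sees no gain from extra cores,
# so a wide superscalar that speeds up serial code wins there.
print(amdahl_speedup(0.0, 8))    # 1.0x
```

This is the "hardware can't automagically do it anymore" point: unless software exposes a large parallel fraction f, the MP configuration has nothing to exploit.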

Page 14:

Niagara

Page 15:

 L2 cache management
◦ Cache coherency
◦  Inclusion vs. exclusion

 CMP for complete systems
◦  Interconnects/communication

 Mixing ILP techniques with TLP

  Specialized processing (MPSoC)
◦ Not just server/web, mobile also!

Page 16:

Page 17:

 Designed CMP to be scalable
◦ Network interface on every chip
◦  Separate I/O chip
◦ Cache coherence protocol

 BTB
 Virtually-indexed, physically-tagged L1
 Bypassed datapath

Page 18:

 Non-inclusive L2
◦  L1 aggregate capacity = L2
◦ Non-inclusive to increase capacity
  Behaves like a victim cache
◦  L1 tags in L2 to designate owner
  L2 access checks both sets of tags
◦ Ownership state in cache line
  L2 decides whether to write back
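The owner-tracking scheme above can be sketched as follows. The class and field names are hypothetical; only the lookup behavior (one L2 access checks the L2 tags and the duplicate L1 tags together) follows the slide:

```python
# Sketch of a non-inclusive L2 that keeps a duplicate copy of each
# core's L1 tags, so a single L2 access checks both sets of tags.
# Names and structure are illustrative, not from the paper.

class L2Directory:
    def __init__(self, num_cores):
        self.l2_lines = {}                                 # addr -> data in L2
        self.l1_tags = [set() for _ in range(num_cores)]   # duplicate L1 tags

    def lookup(self, addr, core):
        # One access checks L2 tags and the duplicate L1 tags together.
        if addr in self.l2_lines:
            return ("l2_hit", None)
        for owner, tags in enumerate(self.l1_tags):
            if addr in tags and owner != core:
                return ("owned_elsewhere", owner)  # forward from owner's L1
        return ("miss", None)

    def fill_l1(self, addr, core):
        # Non-inclusive: a line can live in an L1 with no copy in L2,
        # so aggregate capacity grows (L2 behaves like a victim cache).
        self.l1_tags[core].add(addr)

d = L2Directory(num_cores=4)
d.fill_l1(0x40, core=1)
print(d.lookup(0x40, core=0))   # ('owned_elsewhere', 1)
print(d.lookup(0x80, core=0))   # ('miss', None)
```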

Page 19:

  November 2005
  Sun’s answer to CMP
 Hybrid processor
◦ Multithreading (strands)
◦ Multicore

 Targeted at commercial servers
 32 hardware threads

Page 20:

  Find out what the consumer wants
◦ High throughput
◦  Low power

  Sell it to them

Page 21:

  Servers, web applications (integer)
 Lots of data
◦ High cache miss rate

 Throughput over latency

Page 22:

 32 threads
◦  8 thread groups
◦  4 threads per group

  Shared L2 between ALL 32 threads
 High-bandwidth crossbar interconnect
 High-bandwidth memory
◦  20 GB/s

 60 W
 One(!) FPU for all 32 threads

Page 23:

Page 24:

 4 threads
  Shared processing pipeline
◦  Sparc pipe

 0-cycle thread switch

Page 25:

  Shared:
◦  L1 I/D caches
◦  I/D TLBs
◦  Functional units
◦  Pipeline registers

 Dedicated:
◦ Data registers
◦  Instruction and store buffers

Page 26:

 Pick a thread for issue
◦ Also fetch from the same thread

 Thread priority
◦  Least recently used
◦  Long-latency ops
◦  Speculative load-dependent ops
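The priority scheme above can be sketched roughly as follows. The blocked-set bookkeeping is an assumption; only the idea of picking the least recently used thread that is not waiting on a long-latency or speculative load-dependent op follows the slides:

```python
# Sketch of Niagara-style thread select: each cycle, pick the least
# recently used thread that is not blocked on a long-latency op or a
# speculative load-dependent op. Bookkeeping details are illustrative.

from collections import deque

class ThreadSelect:
    def __init__(self, num_threads=4):
        self.lru = deque(range(num_threads))  # front = least recently used
        self.blocked = set()                  # long-latency / spec-load waits

    def pick(self):
        for tid in self.lru:
            if tid not in self.blocked:
                # Issue (and fetch) from this thread; it becomes MRU.
                self.lru.remove(tid)
                self.lru.append(tid)
                return tid
        return None  # all threads stalled this cycle

sel = ThreadSelect()
sel.blocked.add(0)          # thread 0 waiting on a long-latency op
print(sel.pick())           # 1 (LRU among the ready threads)
print(sel.pick())           # 2
```

Because the pipeline holds per-thread state (instruction buffers, registers), switching to the picked thread costs zero cycles, matching the "0-cycle thread switch" slide.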

Page 27:

  Single issue
  In-order execution
 6 stages with bypass
  Fetch 2 instructions per cycle
◦  Second is speculative
◦  Long-latency predecode bit

Page 28:

Page 29:

Page 30:

[Figure: thread-select timeline showing long-latency ops on threads P0 and P1, a speculative issue, and "load hits = do not squash"]

Page 31:

 Hardware call stack
◦ No ugly calling conventions

 Window = 8 in, 8 local, 8 out
◦  outs become ins on call

 Each thread has 8 windows
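The overlap above (a caller's outs become the callee's ins when the window slides) can be sketched as follows. The class and method names are hypothetical, and spill/fill traps on window overflow are omitted:

```python
# Sketch of SPARC-style register windows: each window has 8 in, 8 local,
# and 8 out registers; on a call the window pointer advances so the
# caller's outs overlap the callee's ins. Simplified: no spill/fill traps.

NWINDOWS = 8

class WindowedRegs:
    def __init__(self):
        # 16 registers physically per window (ins + locals); the outs
        # are an alias of the next window's ins.
        self.regs = [0] * (NWINDOWS * 16)
        self.cwp = 0  # current window pointer

    def _idx(self, kind, i):
        base = self.cwp * 16
        if kind == "out":                      # alias of next window's ins
            return (base + 16 + i) % len(self.regs)
        offset = {"in": 0, "local": 8}[kind]
        return (base + offset + i) % len(self.regs)

    def write(self, kind, i, val):
        self.regs[self._idx(kind, i)] = val

    def read(self, kind, i):
        return self.regs[self._idx(kind, i)]

    def call(self):
        self.cwp = (self.cwp + 1) % NWINDOWS   # outs become the callee's ins

    def ret(self):
        self.cwp = (self.cwp - 1) % NWINDOWS

r = WindowedRegs()
r.write("out", 0, 42)   # caller passes an argument in its first out reg
r.call()
print(r.read("in", 0))  # 42: the callee sees it as an in, no memory traffic
```

The "no ugly calling conventions" point is exactly this: arguments and results move between caller and callee by renaming, not by pushing to a memory stack.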

Page 32:

[Figure: register window diagram labeling the params/results overlap between windows and the currently visible window ("what you see")]

Page 33:

 16K/8K L1 I/D caches
◦  Large working sets

 L1 tag directory in L2
◦ Which L1?
◦ Centralized sharing

 L2 updated before L1
◦ Memory ordering
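The "L2 updated before L1" rule can be sketched as follows; the structure is illustrative, the point being only that a store reaches the shared L2 (the globally visible level) before the storing core's private L1, so all threads observe stores in one order:

```python
# Sketch of "L2 updated before L1": a store is applied to the shared L2
# before the storing core's own L1, making the L2 the single ordering
# point for memory. Names and structure are illustrative.

class MemHierarchy:
    def __init__(self, num_cores):
        self.l2 = {}                              # shared L2 contents
        self.l1 = [{} for _ in range(num_cores)]  # per-core private L1s
        self.log = []                             # order levels see the store

    def store(self, core, addr, val):
        self.l2[addr] = val           # 1) shared L2 first: global order point
        self.log.append(("L2", addr))
        self.l1[core][addr] = val     # 2) then the core's private L1
        self.log.append(("L1", addr))

m = MemHierarchy(num_cores=2)
m.store(0, 0x10, 7)
print(m.log)   # [('L2', 16), ('L1', 16)]
```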

Page 34:

 October 2007
 8 cores, 8 threads each
◦  16/4 too much area

  Increased power efficiency
◦  Stall

  Improved single-threaded FP
◦ One FPU per core

Page 35:

Page 36:

Page 37:

 2 INT EXUs per pipeline (up from 1)
 More cache associativity
 Larger TLBs
 More L2 banks
 Cryptography coprocessor
  Integrated networking
◦ Close to memory

  Integrated PCI-E

Page 38:

 2 thread groups
◦  4 threads each, statically partitioned
◦  1 EXU per group

 Pick stage
◦  Independent select within each group