
6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE

SPRING 2013

LECTURE 2

ILP, DLP AND TLP IN MODERN MULTICORES

DANIEL SANCHEZ AND JOEL EMER

Review: ILP Challenges

Clock frequency: getting close to pipelining limits

Clocking overheads, CPI degradation

Branch prediction & memory latency limit the practical benefits of out-of-order execution

Power grows superlinearly with higher clock & more OOO logic

Design complexity grows exponentially with issue width

Limited ILP → must exploit TLP and DLP

Thread-Level Parallelism: Multithreading and multicore

Data-Level Parallelism: SIMD


Review: Memory Hierarchy

Caching: Reduce latency, energy, BW of memory accesses

Why multilevel?

Why not just on-chip memories?

How does parallelism impact latency/BW constraints?

Prefetching: Trade-off latency for bandwidth, energy, capacity (pollution)


[Diagram: example 4-core memory hierarchy]

Per core: L1I 64KB + L1D 64KB (4 cycles, 72GB/s), L2 256KB (8 cycles, 48GB/s)

Shared: L3 6MB (26-31 cycles, 96GB/s total, 24GB/s/core)

Main memory: 16GB (150 cycles, 21GB/s)
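A small C++ sketch of how these levels combine into an average memory access time (the latencies come from the figure above; the miss rates are illustrative assumptions, not lecture data):

  #include <cstdio>

  int main() {
      // Latencies in cycles, from the hierarchy above (L3 is 26-31, use 28).
      const double l1 = 4, l2 = 8, l3 = 28, mem = 150;
      // Assumed miss rates, for illustration only.
      const double m1 = 0.05, m2 = 0.50, m3 = 0.30;
      const double amat = l1 + m1 * (l2 + m2 * (l3 + m3 * mem));
      std::printf("AMAT = %.1f cycles\n", amat);   // ~6.2 cycles for these rates
      return 0;
  }

With many cores or threads issuing misses concurrently, latency becomes easier to hide, but the per-level bandwidths above become the binding constraint.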

Flynn's Taxonomy

                   Single instruction    Multiple instruction
  Single data      SISD                  MISD (?)
  Multiple data    SIMD                  MIMD


SIMD Processing

Same instruction sequence applies to multiple elements

Vector processing: amortize instruction costs (fetch, decode, …) across multiple operations

Requires regular data parallelism (no or minimal divergence)

Exploiting SIMD:

Explicit & low-level, using vector intrinsics (see the sketch after this list)

Explicit & high-level, convey parallel semantics (e.g., foreach)

Implicitly: Parallelizing compiler infers loop dependencies

How easy is this in C++? Java?
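A minimal C++ sketch of the explicit, low-level option (not from the lecture; the function names and the multiple-of-4 trip count are illustrative assumptions):

  #include <immintrin.h>
  #include <cstddef>

  // Scalar version: the compiler may auto-vectorize this, but only if it can
  // prove a, b, and c do not alias, which is hard in general C++ or Java.
  void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i)
          c[i] = a[i] + b[i];
  }

  // Explicit SSE version: four 32-bit adds per instruction.
  // Assumes n is a multiple of 4; a real kernel also needs a scalar tail loop.
  void add_sse(const float* a, const float* b, float* c, std::size_t n) {
      for (std::size_t i = 0; i < n; i += 4) {
          __m128 va = _mm_loadu_ps(a + i);            // load 4 floats
          __m128 vb = _mm_loadu_ps(b + i);
          _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   // 4 adds, 1 store
      }
  }

The explicit, high-level option would express the same loop with parallel semantics (e.g., a foreach-style construct) and leave lane assignment to the compiler or runtime.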


SIMD Implementations

Modern CPUs: SIMD extensions & wider regs

SSE: 128-bit operands (4x32-bit or 2x64-bit)

AVX (2011): 256-bit operands (8x32-bit or 4x64-bit)

LRB (upcoming): 512-bit operands

Explicit SIMD: Parallelization performed at compile time

GPUs: Architected for SIMD from the ground up

32 to 64 32-bit floats

Implicit SIMD: Scalar binary, multiple instances always run in lockstep

How to handle divergence?
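One common answer, shown as a C++/SSE sketch (illustrative, not from the lecture): execute both sides of the branch on every lane and blend the results with a per-lane mask, which is essentially the predication that implicit-SIMD GPUs apply under divergence:

  #include <immintrin.h>

  // Per lane: c = (a > 0) ? a * 2 : a - 1, for 4 lanes at once.
  __m128 divergent_kernel(__m128 a) {
      __m128 mask     = _mm_cmpgt_ps(a, _mm_setzero_ps()); // lanes where a > 0
      __m128 then_val = _mm_mul_ps(a, _mm_set1_ps(2.0f));  // "if" path, all lanes
      __m128 else_val = _mm_sub_ps(a, _mm_set1_ps(1.0f));  // "else" path, all lanes
      return _mm_or_ps(_mm_and_ps(mask, then_val),         // keep then_val where mask is set,
                       _mm_andnot_ps(mask, else_val));     // else_val elsewhere
  }

The cost is that a divergent group of lanes pays for both paths, which is why SIMD wants regular data parallelism.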


Multithreading: Options

Motivation: hardware is underutilized on stalls → exploit TLP to increase utilization

CGMT and SMT typically increase throughput at moderate cost while maintaining single-thread performance

FGMT typically gains throughput and simplicity at the expense of single-thread performance


Example 1: SMT (Nehalem)

SMT design choices: for each component, replicate, partition statically, or share

Tradeoffs? Complexity, utilization, interference & fairness

Example: Intel Nehalem

4-wide superscalar, 2-way SMT

Replicated: Register file, RAS predictor, large-page ITLB

Partitioned: Load buffer, store buffer, ROB, small-page ITLB

Shared: Instruction window, execution units, predictors, caches, DTLBs

SMT policies:

Fetch policies: Utilization vs fairness (see the sketch below)

Long-latency stall tolerance: Flushing vs stalling


[See: "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", Tullsen et al., ISCA 1996]
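A sketch of an ICOUNT-style fetch policy in the spirit of the Tullsen et al. paper above (illustrative, not Nehalem's documented policy): each cycle, fetch from the ready thread with the fewest instructions in the front end, trading off utilization and fairness:

  #include <array>

  struct ThreadState {
      bool ready;       // not stalled (e.g., no I-cache miss outstanding)
      int  in_flight;   // instructions currently in decode/rename/issue queues
  };

  // Returns the thread to fetch from this cycle, or -1 if no thread is ready.
  int icount_pick(const std::array<ThreadState, 2>& t) {   // 2-way SMT
      int best = -1;
      for (int i = 0; i < static_cast<int>(t.size()); ++i) {
          if (!t[i].ready) continue;
          if (best < 0 || t[i].in_flight < t[best].in_flight) best = i;
      }
      return best;
  }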

Example 2: FGMT (Niagara)

4 threads/core, round-robin scheduling

No branch prediction, minimal bypasses → more stalls

Small L1 caches (can tolerate higher L1 miss rates)

But L2 is still large… performance with long-latency stalls?
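For contrast, a minimal sketch (illustrative, not Niagara's actual select logic) of fine-grained round-robin thread selection that skips stalled threads:

  #include <array>

  // 'last' is the thread issued last cycle; returns the next ready thread in
  // round-robin order, or -1 if all four threads are stalled on misses.
  int fgmt_pick(const std::array<bool, 4>& ready, int last) {
      for (int i = 1; i <= 4; ++i) {
          const int t = (last + i) % 4;
          if (ready[t]) return t;
      }
      return -1;   // every thread is waiting: the pipeline takes a bubble
  }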


Example 3: Extreme FGMT (Tera MTA)

Use FGMT to hide all instruction latencies

Worst-case instruction latency is 128 cycles → 128 threads

Benefits: no interlocks, no bypass, and no cache

Problem: single-thread performance

GPUs also exploit high FGMT for latency tolerance (e.g., Fermi, 48-way MT)

Throughput-oriented functional units: Longer latency, deeply pipelined

Throughput-oriented memory system: Small caches, aggressive memory scheduler
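A rough back-of-the-envelope reading of the 128-thread choice (not from the slides): with 128 threads issued round-robin, each thread issues at most once every 128 cycles, so even a worst-case 128-cycle instruction has completed before its thread issues again. That is what removes the need for interlocks, bypasses, and caches, provided 128 ready threads actually exist, which is exactly the single-thread-performance problem noted above.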


[Figure: pipeline timing diagram showing fine-grained multithreading; instructions InstT1-InstT8 from eight threads proceed through pipeline stages A-H in interleaved fashion, one thread issuing per cycle]

Multithreading: How Many Threads?

With more HW threads:

Larger/multiple register files

Replicated & partitioned resources → lower utilization, lower single-thread performance

Shared resources → utilization vs. interference and thrashing

Impact of MT/MC on memory hierarchy?


["Many-Core vs. Many-Thread Machines: Stay Away From the Valley", Guz et al., IEEE CAL 2009]

Amdahl's Law

Amdahl's Law: if a change improves a fraction f of the workload by a factor K, the total speedup is:

  Speedup = Time_before / Time_after = 1 / ((1 - f) + f / K)

Not only valid for performance! Energy, complexity, …

I/D/TLP techniques make different tradeoffs between K and f

SIMD vs. MIMD: how do their f and K compare?
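A quick worked instance with illustrative numbers: if f = 0.9 of the workload is improved by K = 10x, Speedup = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.3, far short of 10x.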

Amdahl's Law in the Multicore Era

[Hill & Marty, CACM 08]

Should we focus on a single approach to extract parallelism?

At what point should we trade ILP for TLP?

Assume a resource-limited multi-core

N base core equivalents (BCEs) due to area or power constraints

A 1-BCE core leads to performance of 1

An R-BCE core leads to performance of perf(R)

Assuming perf(R) = sqrt(R) in following drawings (Pollack’s rule)

How should we design the multi-core?

Select type & number of cores

Assume caches & interconnect are rather constant

Assume no application scaling (or equal scaling for seq/par portions)


Three Multicore Approaches

Large cores use R BCEs each; simple cores use 1 BCE each.

Symmetric CMP:  N/R large cores, Seq: Perf(R), Par: (N/R) * Perf(R); no simple cores

Asymmetric CMP: 1 large core, Seq: Perf(R), Par: Perf(R); N-R simple cores, Seq: -, Par: N-R

Dynamic CMP:    1 large core, Seq: Perf(R), Par: -; N simple cores, Seq: -, Par: N

Example (N = 16 BCEs):

Symmetric: 16 1-BCE cores, or 4 4-BCE cores

Asymmetric: 1 4-BCE core & 12 1-BCE cores

Dynamic: adapt between 16 1-BCEs and 1 16-BCE core


Amdahl’s Law x3

Symmetric CMP:

  Symmetric speedup = 1 / ( (1 - F) / Perf(R) + F * R / (Perf(R) * N) )

Asymmetric CMP:

  Asymmetric speedup = 1 / ( (1 - F) / Perf(R) + F / (Perf(R) + N - R) )

Dynamic CMP:

  Dynamic speedup = 1 / ( (1 - F) / Perf(R) + F / N )
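A small C++ sketch of the three formulas above with Perf(R) = sqrt(R), as the following plots assume; the sample points are taken from those plots:

  #include <cmath>
  #include <cstdio>

  double perf(double r) { return std::sqrt(r); }   // Pollack's rule

  double sym(double f, double n, double r) {
      return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n));
  }
  double asym(double f, double n, double r) {
      return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r));
  }
  double dyn(double f, double n, double r) {
      return 1.0 / ((1 - f) / perf(r) + f / n);
  }

  int main() {
      // Design points called out in the charts that follow (F = 0.99, N = 256).
      std::printf("symmetric  R=3:   %.1f\n", sym(0.99, 256, 3));     // ~80
      std::printf("asymmetric R=41:  %.1f\n", asym(0.99, 256, 41));   // ~166
      std::printf("dynamic    R=256: %.1f\n", dyn(0.99, 256, 256));   // ~223
      return 0;
  }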



Symmetric Multicore Chip

N = 256 BCEs

[Chart: symmetric speedup vs. R BCEs/core (R = 1 to 256), for F = 0.5, 0.9, 0.975, 0.99, 0.999; results in parentheses are for N = 16]

F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7): CORE ENHANCEMENTS!

F=0.999: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16): MORE CORES!

F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9): CORE ENHANCEMENTS & MORE CORES!

Higher N → higher R (more ILP) for a fixed F
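A short companion sketch (sweeping integer R, an assumption, since the chart samples powers of two) that reproduces the F = 0.99 design point called out above by maximizing the symmetric-speedup formula over R:

  #include <cmath>
  #include <cstdio>

  int main() {
      const double F = 0.99, N = 256;
      int best_r = 1;
      double best_s = 0;
      for (int r = 1; r <= 256; ++r) {
          const double p = std::sqrt(static_cast<double>(r));  // Perf(R) = sqrt(R)
          const double s = 1.0 / ((1 - F) / p + F * r / (p * N));
          if (s > best_s) { best_s = s; best_r = r; }
      }
      std::printf("best R = %d, speedup = %.1f\n", best_r, best_s);  // R = 3, ~80
      return 0;
  }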


Asymmetric Multicore Chip

N = 256 BCEs

[Chart: asymmetric speedup vs. R BCEs in the large core (R = 1 to 256), for F = 0.5, 0.9, 0.975, 0.99, 0.999; results in parentheses are the symmetric CMP results from the previous slide]

F=0.9: R=118 (vs. 28), Cores=139 (vs. 9), Speedup=65.6 (vs. 26.7)

F=0.99: R=41 (vs. 3), Cores=216 (vs. 85), Speedup=166 (vs. 80)

Better speedups than symmetric

Software complexity?


Dynamic Multicore Chip

N = 256 BCEs

Results in parentheses are the asymmetric CMP results from the previous slide

Dynamic offers even higher speedups than asymmetric

SW and HW complexity?

[Chart: dynamic speedup vs. R BCEs (R = 1 to 256), for F = 0.5, 0.9, 0.975, 0.99, 0.999]

F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)

Note: #cores is always N = 256


Readings for Wednesday

1. Is Dark Silicon Useful?

2. Dark Silicon and the End of Multicore Scaling

3. Single-Chip Heterogeneous Computing

