
Page 1: Parallel  Architectures

Parallel Architectures

Martino [email protected]

Page 2: Parallel  Architectures

Why Multicores?

The SPECint performance of the hottest chip grew by 52% per year from 1986 to 2002, and then grew only 20% in the next three years (about 6% per year).

Diminishing returns from uniprocessor designs

[from Patterson & Hennessy]

Page 3: Parallel  Architectures

Power Wall

• The design goal for the late 1990s and early 2000s was to drive the clock rate up.
• This was done by adding more transistors to a smaller chip.

• Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques.

[from Patterson & Hennessy]

Page 4: Parallel  Architectures

Roadmap for CPU Clock Speed: Circa 2005

Here is the result of the best thought in 2005. By 2015, the clock speed of the top "hot chip" would be in the 12-15 GHz range.

[from Patterson & Hennessy]

Page 5: Parallel  Architectures

The CPU Clock Speed Roadmap (A Few Revisions Later)

This reflects the practical experience gained with dense chips that were literally "hot": they radiated considerable thermal power and were difficult to cool.
Law of physics: all electrical power consumed is eventually radiated as heat.

[from Patterson & Hennessy]

Page 6: Parallel  Architectures

The MultiCore Approach

Multiple cores on the same chip:
– Simpler
– Slower
– Less power demanding

Page 7: Parallel  Architectures

The Memory Gap

• Bottom line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost

[Figure: processor vs. memory performance, 1980-2005, log scale from 1 to 100,000; the CPU curve pulls steadily away from the memory curve.]

[from Patterson & Hennessy]

Page 8: Parallel  Architectures


Transition to Multicore

Sequential App Performance

Page 9: Parallel  Architectures

Parallel Architectures

• Definition: “A parallel architecture is a collection of processing elements that cooperate and communicate to solve large problems fast”

• Questions about parallel architectures:
– How many processing elements are there?
– How powerful are the processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– What are the HW and SW primitives for the programmer?
– Does it translate into performance?

Page 10: Parallel  Architectures

Flynn's Taxonomy of Parallel Computers

                               Data streams
                               Single      Parallel
Instruction streams  Single    SISD        SIMD
                     Multiple  MISD        MIMD

M.J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.

• Flynn's taxonomy provides a simple, but very broad, classification of computer architectures:
• Single Instruction, Single Data (SISD)
– A single processor with a single instruction stream, operating sequentially on a single data stream.
• Single Instruction, Multiple Data (SIMD)
– A single instruction stream is broadcast to every processor; all processors execute the same instructions in lock-step on their own local data streams.
• Multiple Instruction, Multiple Data (MIMD)
– Each processor can independently execute its own instruction stream on its own local data stream.

• SISD machines are the traditional single-processor, sequential computers, also known as the von Neumann architecture, as opposed to "non-von" parallel computers.
• SIMD machines are synchronous, with more fine-grained parallelism: they run a large number of parallel processes, one for each data element in a parallel vector or array.
• MIMD machines are asynchronous, with more coarse-grained parallelism: they run a smaller number of parallel processes, one for each processor, each operating on the large chunk of data local to its processor.

Page 11: Parallel  Architectures

Single Instruction/Single Data Stream: SISD

• Sequential computer
• No parallelism in either the instruction or data streams
• Examples of SISD architecture are traditional uniprocessor machines

[Diagram: a single processing unit.]

Page 12: Parallel  Architectures

Multiple Instruction/Single Data Stream: MISD

• Computer that exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized
– for example, certain kinds of array processors
• No longer commonly encountered; mainly of historical interest

Page 13: Parallel  Architectures

Single Instruction/Multiple Data Stream: SIMD

• Computer that exploits multiple data streams against a single instruction stream, for operations that can be naturally parallelized
– e.g., SIMD instruction extensions or the Graphics Processing Unit (GPU)
• Single control unit
• Multiple datapaths (processing elements, PEs) running in parallel
– PEs are interconnected and exchange/share data as directed by the control unit
– Each PE performs the same operation on its own local data
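To make the idea concrete, here is a minimal CUDA kernel (an illustration of mine, not from the slides; the kernel name and sizes are arbitrary). Strictly speaking GPUs are SIMT rather than pure SIMD, but the lock-step execution of a warp matches the description above: the kernel body is the single instruction stream, and each thread applies it to its own data element.

#include <cuda_runtime.h>

// One instruction stream, many data streams: every thread executes the
// same kernel body on its own array element.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];   // same operation, local data
}

// Host-side launch covering n elements with 256-thread blocks:
//   scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);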

Page 14: Parallel  Architectures

Multiple Instruction/Multiple Data Streams: MIMD

• Multiple autonomous processors simultaneously executing different instructions on different data.

• MIMD architectures include multicore and Warehouse Scale Computers (datacenters)

Page 15: Parallel  Architectures

Parallel Computing Architectures: Memory Model

[Diagram: three memory organizations.
• Shared address space, centralized memory: UMA (Uniform Memory Access), the symmetric multiprocessor (SMP); processors reach one shared memory through an interconnection.
• Shared address space, physically distributed memory: NUMA (Non-Uniform Memory Access), the distributed-shared-memory multiprocessor; each processor has a local memory, all reachable through the interconnection.
• Private address spaces, physically distributed memory: MPP (Massively Parallel Processors), the message-passing (shared-nothing) multiprocessor; processors exchange data with send/receive.]

Parallel Architecture = Computer Architecture + Communication Architecture

Question: how do we organize and distribute memory in a multicore architecture?

Two classes of multiprocessors with respect to memory:
1. Centralized-memory multiprocessor
2. Physically distributed-memory multiprocessor

Two classes of multiprocessors with respect to addressing:
1. Shared
2. Private

Page 16: Parallel  Architectures

Memory Performance Metrics

• Latency is the overhead of setting up a connection between processors for passing data.
– This is the most crucial problem for all parallel architectures: obtaining good performance over a range of applications depends critically on low latency for accessing remote data.
• Bandwidth is the amount of data per unit time that can be passed between processors.
– This needs to be large enough to support efficient passing of large amounts of data between processors, as well as collective communications and I/O for large data sets.
• Scalability is how well latency and bandwidth scale with the addition of more processors.
– This is usually only a problem for architectures with many cores.
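Latency and bandwidth combine in the usual first-order cost model for moving n bytes, T(n) = latency + n / bandwidth (my formulation; it is not on the slide). A small helper makes the two regimes visible:

// First-order transfer-time model: fixed startup latency plus the
// bandwidth-limited streaming time.
double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
    return latency_s + bytes / bandwidth_Bps;
}
// With 1 us latency and 10 GB/s: a 64 B message is latency-dominated,
// while a 64 MB transfer is bandwidth-dominated.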

Page 17: Parallel  Architectures

Distributed Shared Memory Architecture: NUMA

• The data set is distributed among the processors:
– each processor accesses only its own data from local memory
– if data from another section of memory (i.e., another processor) is required, it is obtained by a remote access.
• Much larger latency for accessing non-local data, but can scale to large numbers (thousands) of processors for many applications.
– Advantage: scalability
– Disadvantage: locality problems and connection congestion
• The aggregated memory of the whole system appears as one single address space.

[Diagram: processors P1, P2, P3, each with a local memory M1, M2, M3, connected by a communication network to a host processor.]

Page 18: Parallel  Architectures

Distributed Memory: Message-Passing Architectures

• Each processor is connected to exclusive local memory
– i.e., no other CPU has direct access to it
• Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
• On each CPU runs a serial process that can communicate with processes on other CPUs by means of the network.
• Non-blocking vs. blocking communication
• MPI problems:
– All data layout must be handled by software
– Message passing has high software overhead (see the sketch after the diagram below)

[Diagram: eight nodes, each a processor (P) with private memory (Mem) and a network interface (NI), attached to an interconnect network.]
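A minimal sketch of this model in code, assuming the MPI library (the slides mention MPI but show no code): each process owns its memory, and all data movement is an explicit send/receive pair.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 0) {
        // Data layout and movement are handled entirely by software:
        // the "high software overhead" noted above.
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

MPI_Send/MPI_Recv are the blocking calls; the non-blocking communication mentioned above would use MPI_Isend/MPI_Irecv instead.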

Page 19: Parallel  Architectures

Shared Memory Architecture: UMA

[Diagram: four processors, each with its own primary and secondary cache, sharing a bus to a global memory.]

• Each processor has access to all the memory, through a shared memory bus and/or communication network
– Memory bandwidth and latency are the same for all processors and all memory locations.
• Lower latency for accessing non-local data, but difficult to scale to large numbers of processors; usually used for small processor counts (order 100 or fewer).

Page 20: Parallel  Architectures

Shared memory candidates

[Diagram: three cache organizations.
• Shared main memory: per-processor primary and secondary caches in front of a shared global memory.
• Shared secondary cache: per-processor primary caches in front of one shared secondary cache and global memory.
• Shared primary cache: all processors share even the primary cache.]

• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware to ensure that caches and memory are consistent (cache coherency)
• Must provide a hardware mechanism to support process synchronization (see the sketch below)
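As an illustration of the synchronization point (mine, not the slides'), here is a spinlock built on an atomic test-and-set, the kind of read-modify-write primitive that the coherence hardware must support:

#include <atomic>

// Spinlock over an atomic test-and-set. The atomic read-modify-write is
// exactly the hardware synchronization mechanism the slide calls for.
struct Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};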

Page 21: Parallel  Architectures

Challenge of Parallel Processing

• The two biggest performance challenges in using multiprocessors:
– Insufficient parallelism
• The problem of inadequate application parallelism must be attacked primarily in software, with new algorithms that offer better parallel performance.
– Long-latency remote communication
• Reducing the impact of long remote latency can be attacked both by the architecture and by the programmer.

Page 22: Parallel  Architectures

Amdahl's Law

• Speedup due to enhancement E is:

Speedup(E) = Exec time without E / Exec time with E

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1), and the remainder of the task is unaffected:

Execution time with E = Execution time without E × [(1 - F) + F/S]

Speedup(E) = 1 / [(1 - F) + F/S]

Page 23: Parallel  Architectures

Amdahl’s Law

Speedup = ?

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speedup?

Page 24: Parallel  Architectures

Amdahl’s Law

Speedup = 1 / [(1 - F) + F/S]    (non-sped-up part: 1 - F; sped-up part: F/S)

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speedup?

Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33
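The arithmetic is easy to check in code; a small hedged helper (the function name is mine):

#include <stdio.h>

// Amdahl's law: a fraction F of the work is sped up by a factor S.
double amdahl_speedup(double F, double S) {
    return 1.0 / ((1.0 - F) + F / S);
}

int main(void) {
    // The slide's example: half the program (F = 0.5) accelerated 2x.
    printf("%.2f\n", amdahl_speedup(0.5, 2.0));  // prints 1.33
    return 0;
}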

Page 25: Parallel  Architectures

Amdahl's Law

If the portion of the program that can be parallelized is small, then the speedup is limited.

The non-parallel portion limits the performance.

Page 26: Parallel  Architectures


Page 27: Parallel  Architectures

Strong and Weak Scaling

• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
– Strong scaling: speedup is achieved on a parallel processor without increasing the size of the problem
– Weak scaling: speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors

Needed to amortize sources of OVERHEAD (additional code, not present in the original sequential program, needed to execute the program in parallel)

Page 28: Parallel  Architectures

• Symmetric shared-memory machines usually support the caching of both shared and private data.
• Private data are used by a single processor, while shared data are used by multiple processors.
• When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Since no other processor uses the data, the program behavior is identical to that on a uniprocessor.
• When shared data are cached, the shared value may be replicated in multiple caches. This replication also reduces the contention that may exist for shared data items being read by multiple processors simultaneously.

Caching of shared data, however, introduces a new problem: cache coherence.

Symmetric Shared-Memory Architectures

[Diagram: four processors, each with private primary and secondary caches, on a shared bus to global memory.]

Page 29: Parallel  Architectures

Example Cache Coherence Problem

– Cores see different values for u after event 3
– With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, and in what order
– Unacceptable for programming, and it happens frequently!

[Diagram: processors P1, P2, P3, each with a cache ($), on a bus with memory and I/O devices. Events: (1) P1 reads u = 5; (2) P3 reads u = 5; (3) P3 writes u = 7; (4) P1 reads u = ?; (5) P2 reads u = ?]

Page 30: Parallel  Architectures

Keeping Multiple Caches Coherent

• Architect's job: with shared memory, keep cache values coherent
• Idea: when any processor has a cache miss or writes, notify the other processors via the interconnection network
– If only reading, many processors can have copies
– If a processor writes, invalidate all other copies
• A shared written result can "ping-pong" between caches (see the sketch below)
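The ping-pong effect is easy to provoke; a hedged C++ sketch (illustrative only): two threads update adjacent fields that typically share one cache line, so under write-invalidate each update invalidates the other core's copy and the line bounces between caches.

#include <thread>

struct Counters {
    // Adjacent fields usually share a cache line; alignas(64) padding on
    // each field would stop the bouncing (avoiding "false sharing").
    long a = 0;
    long b = 0;
};

int main() {
    Counters c;
    std::thread t1([&c] { for (int i = 0; i < 10000000; ++i) c.a++; });
    std::thread t2([&c] { for (int i = 0; i < 10000000; ++i) c.b++; });
    t1.join();
    t2.join();
    return 0;
}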

Page 31: Parallel  Architectures

Shared Memory Multiprocessor

Use a snoopy mechanism to keep all processors' view of memory coherent.

[Diagram: masters M1, M2, M3, each behind a snoopy cache, and a DMA engine serving disks, all on the memory bus to physical memory.]

Page 32: Parallel  Architectures

Example: Write-Through Invalidate

• Must invalidate before step 3
• Write-update would use more of the broadcast medium's bandwidth, so all recent SMP multicores use write-invalidate

[Diagram: same bus as before. Events: (1) P1 reads u = 5; (2) P3 reads u = 5; (3) P3 writes u = 7, invalidating the other cached copies and updating memory; (4) P1 reads u = 7; (5) P2 reads u = 7]

Page 33: Parallel  Architectures

Need for a More Scalable Protocol

• Snoopy schemes do not scale because they rely on broadcast
• Hierarchical snoopy schemes have the root as a bottleneck
• Directory-based schemes allow scaling
– They avoid broadcasts by keeping track of all CPUs caching a memory block, and then using point-to-point messages to maintain coherence
– They allow the flexibility to use any scalable point-to-point network

Page 34: Parallel  Architectures

Scalable Approach: Directories

• Every memory block has associated directory information
– keeps track of copies of cached blocks and their states
– on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
– in scalable networks, communication with the directory and the copies is through network transactions
• Many alternatives for organizing directory information

Page 35: Parallel  Architectures

Basic Operation of a Directory

• k processors
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit

• Read from main memory by processor i:
– If the dirty bit is OFF: read from main memory; turn p[i] ON.
– If the dirty bit is ON: recall the line from the dirty processor (downgrading its cache state to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i.

• Write to main memory by processor i:
– If the dirty bit is OFF: send invalidations to all caches that have the block; turn the dirty bit ON; supply data to i; turn p[i] ON; ...

[Diagram: processors with caches over an interconnection network; each memory block's directory entry holds the presence bits and the dirty bit.]
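A schematic C rendering of the protocol above (a sketch following the slide's own description; the type and function names are mine):

#include <stdbool.h>

#define K 16  // number of processors (the slide's k)

// Per-memory-block directory entry: k presence bits and one dirty bit.
typedef struct {
    bool present[K];
    bool dirty;
    int  owner;        // meaningful only while dirty is set
} DirEntry;

// Read miss from processor i.
void directory_read(DirEntry *d, int i) {
    if (d->dirty) {
        // Recall the line from the owner (downgrade it to shared)
        // and write it back to memory.
        d->dirty = false;
    }
    d->present[i] = true;  // supply the data to i
}

// Write miss from processor i.
void directory_write(DirEntry *d, int i) {
    for (int p = 0; p < K; p++)
        if (p != i && d->present[p])
            d->present[p] = false;   // point-to-point invalidation, no broadcast
    d->dirty = true;
    d->owner = i;
    d->present[i] = true;            // supply the data to i
}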

Page 36: Parallel  Architectures

Real Manycore Architectures

• ARM Cortex-A9
• GPU
• P2012

Page 37: Parallel  Architectures

ARM Cortex-A9 Processors

• 98% of mobile phones use at least one ARM processor
• 90% of embedded 32-bit systems use ARM
• The Cortex-A9 processors are the highest-performance ARM processors implementing the full richness of the widely supported ARMv7 architecture.

Page 38: Parallel  Architectures

Cortex-A9 CPU

• Superscalar out-of-order instruction execution

– Any of the four subsequent pipelines can select instructions from the issue queue

• Advanced processing of instruction fetch and branch prediction

• Up to four instruction cache line prefetches pending

– Further reduces the impact of memory latency so as to maintain instruction delivery

• Between two and four instructions per cycle forwarded continuously into instruction decode

• Counters for performance monitoring

Page 39: Parallel  Architectures

The Cortex-A9 MPCore Multicore Processor

• Design-configurable processor supporting between 1 and 4 CPUs
• Each processor may be independently configured for its cache sizes, FPU and NEON
• Snoop Control Unit
• Accelerator Coherence Port

Page 40: Parallel  Architectures

Snoop Control Unit and Accelerator Coherence Port

• The SCU is responsible for managing:
– the interconnect
– arbitration
– communication
– cache-to-cache and system memory transfers
– cache coherence
• The Cortex-A9 MPCore processor also exposes these capabilities to other system accelerators and non-cached DMA-driven mastering peripherals:
– to increase performance
– to reduce system-wide power consumption by sharing access to the processor's cache hierarchy
• This system coherence also reduces the software complexity otherwise involved in maintaining software coherence within each OS driver.

Page 41: Parallel  Architectures

What is GPGPU?

• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
– Programmability
– Precision
– Power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics
– The GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating-point (FP) computation
• Applications (see GPGPU.org)
– Game effects (FX), physics, image processing
– Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Page 42: Parallel  Architectures

Motivation 1: Computational Power

– GPUs are fast...
– GPUs are getting faster, faster

Page 43: Parallel  Architectures

Motivation 2: Flexible, Precise and Cheap

– Modern GPUs are deeply programmable
• Solidifying high-level language support
– Modern GPUs support high precision
• 32-bit floating point throughout the pipeline
• High enough for many (not all) applications

Page 44: Parallel  Architectures

CPU-"style" cores

Page 45: Parallel  Architectures

Slimming down

Page 46: Parallel  Architectures

Two cores

Page 47: Parallel  Architectures

Four cores

Page 48: Parallel  Architectures

Sixteen cores

Page 49: Parallel  Architectures

Add ALUs

Page 50: Parallel  Architectures

128 elements in parallel

Page 51: Parallel  Architectures

But what about branches?

Page 52: Parallel  Architectures

But what about branches?

Page 53: Parallel  Architectures

But what about branches?

Page 54: Parallel  Architectures

But what about branches?

Page 55: Parallel  Architectures

Stalls!

• Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
• Memory access latency = 100s to 1000s of cycles
• We've removed the fancy caches and logic that help avoid stalls.
• But we have LOTS of independent work items.
• Idea #3: interleave processing of many elements on a single core to avoid stalls caused by high-latency operations (see the sketch below).
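In CUDA terms, Idea #3 is oversubscription: launch far more threads than there are ALUs, so the scheduler always has another group ready when one stalls on memory. A hedged sketch (the names are mine):

__global__ void saxpy(float *y, const float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // each load can stall for 100s of cycles
}

// Launching many blocks gives each multiprocessor plenty of thread groups
// to switch to while one group waits on memory:
//   saxpy<<<4096, 256>>>(d_y, d_x, 2.0f, n);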

Page 56: Parallel  Architectures

Hiding stalls

Page 57: Parallel  Architectures

Hiding stalls

Page 58: Parallel  Architectures

Hiding stalls

Page 59: Parallel  Architectures

Hiding stalls

Page 60: Parallel  Architectures

Hiding stalls

Page 61: Parallel  Architectures

Throughput!

Page 62: Parallel  Architectures

NVIDIA Tesla

• Three key ideas:
– Use many "slimmed down" cores running in parallel
– Pack cores full of ALUs (by sharing the instruction stream across groups of work items)
– Avoid latency stalls by interleaving the execution of many groups of work items/threads
• When one group stalls, work on another group

Page 63: Parallel  Architectures

On-chip memory

• Each multiprocessor has on-chip memory of the four following types:
– one set of local 32-bit registers per processor;
– a parallel shared memory that is shared by all scalar processor cores and is where the shared memory space resides;
– a read-only constant cache that is shared by all scalar processor cores and speeds up reads from the constant memory space, a read-only region of device memory;
– a read-only texture cache that is shared by all scalar processor cores and speeds up reads from the texture memory space, a read-only region of device memory; each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering.
• The local and global memory spaces are read-write regions of device memory and are not cached.

Page 64: Parallel  Architectures

Shared Memory

• Is on-chip:
– much faster than global memory
– divided into equally-sized memory banks
– as fast as a register when there are no bank conflicts

• Successive 32-bit words are assigned to successive banks

• Each bank has a bandwidth of 32 bits per clock cycle.
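A standard illustration of the bank rule (my example, assuming the 32-bit-word banking just described): in a shared-memory tile, column accesses have a stride equal to the number of banks and would all hit one bank; padding each row by one word restores conflict-free access.

#define TILE 32

__global__ void transpose_tile(float *out, const float *in) {
    // The +1 padding makes the column accesses below fall in distinct
    // banks; without it, a stride of 32 words maps a whole warp onto
    // one bank and serializes the accesses.
    __shared__ float tile[TILE][TILE + 1];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];    // row access: conflict-free
    __syncthreads();
    out[y * TILE + x] = tile[x][y];   // column access: conflict-free only thanks to padding
}

// Launch for one 32x32 tile: transpose_tile<<<1, dim3(32, 32)>>>(d_out, d_in);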

Page 65: Parallel  Architectures

Shared Memory: Examples of Access Patterns without Bank Conflicts

Page 66: Parallel  Architectures

Shared Memory: Examples of Access Patterns with Bank Conflicts

Page 67: Parallel  Architectures

Global Memory: Coalescing

• The device can read 4-byte, 8-byte, or 16-byte words from global memory into registers in a single instruction.
• Global memory bandwidth is used most efficiently when simultaneous memory accesses can be coalesced into a single memory transaction of 32, 64, or 128 bytes.
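Two illustrative kernels (mine, not from the slides): in the first, consecutive threads touch consecutive 4-byte words, so a warp's loads coalesce into a few wide transactions; in the second, a large stride scatters the warp's accesses across many transactions and wastes bandwidth.

__global__ void coalesced_copy(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];               // thread k reads word k: coalesced
}

// Assumes out holds at least n/stride elements.
__global__ void strided_copy(float *out, const float *in, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];  // neighbors far apart: poorly coalesced
}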

Page 68: Parallel  Architectures

Coalescing Examples

Page 69: Parallel  Architectures

Coalescing Examples

Page 70: Parallel  Architectures

NVIDIA's Fermi Generation CUDA Compute Architecture

The key architectural highlights of Fermi are:
• Third-generation Streaming Multiprocessor (SM)
– 32 CUDA cores per SM, 4x over GT200
– 8x the peak double-precision floating-point performance over GT200
• Second-generation Parallel Thread Execution (PTX) ISA
– Unified address space with full C++ support
– Optimized for OpenCL and DirectCompute
• Improved memory subsystem
– NVIDIA Parallel DataCache hierarchy with configurable L1 and unified L2 caches
– Improved atomic memory operation performance
• NVIDIA GigaThread engine
– 10x faster application context switching
– Concurrent kernel execution
– Out-of-order thread block execution
– Dual overlapped memory transfer engines

Page 71: Parallel  Architectures

Third-Generation Streaming Multiprocessor

• 512 high-performance CUDA cores
– Each SM features 32 CUDA processors
– Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU)
• 16 load/store units
– Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock
– Supporting units load and store the data at each address to cache or DRAM
• Four Special Function Units
– Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, reciprocal, and square root.

Page 72: Parallel  Architectures

P2012 Introduction

• The P2012 cluster is the computing node of the P2012 fabric
• The P2012 cluster has two variants:
– a homogeneous computing variant
– a heterogeneous computing variant
• A single architecture serves both variants.

[Diagram: the P2012 fabric: a fabric controller and a 3x3 array of P2012 clusters, connected to system bridges.]

Page 73: Parallel  Architectures

P2012 Cluster Main Features

• Symmetric multi-processing
• Uniform memory access within the cluster
• Non-uniform memory access between clusters
• Up to 16 + 1 processors per cluster
• Up to 30.6 GOPS peak per cluster (assuming no SIMD extension) at 600 MHz
• Up to 20.4 GFLOPS (32-bit) peak per cluster at 600 MHz
• 2 DMA channels allowing up to 6.4 GB/s data transfer
• HW support for synchronization:
– fast barrier (within a cluster only) in ~4 cycles for 16 processors
– flexible barrier in ~20 cycles for 16 processors
• Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements
• High level of customization through:
– the number of STxP70 processing elements
– the STxP70 extensions (ISA customization)
– up to 32 user-defined HWPEs
– memory sizes
– the banking factor of the shared memory

Page 74: Parallel  Architectures

P2012 Cluster Overview: P2012 Cluster Architecture

[Diagram: the multi-core subsystem (ENCore<N>) behind the global interconnect interface.]

• N x STxP70 cores
• 2xN-banked shared data memory
• N-to-2M logarithmic interconnect (memory)
• Peripheral logarithmic interconnect
• Runtime accelerator (HWS)
• Timers
• Cluster interfaces (I/O)

Page 75: Parallel  Architectures

P2012 Cluster Overview: P2012 Cluster Architecture

[Diagram: as above, adding the cluster controller (CC).]

• 1 STxP70-based cluster processor
• 16KB P$ & TCDM
• CC peripheral (boot, ...)
• Clock, variability, power controller (CVP)
• Cluster controller interconnect

Page 76: Parallel  Architectures

P2012 Cluster Overview: P2012 Cluster Architecture

[Diagram: as above, adding the Debug and Test Unit (DTU).]

• Provides controllability and observability to the application developer
• Breakpoint propagation inside the cluster and across the fabric

Page 77: Parallel  Architectures

P2012 Cluster Overview: P2012 Cluster Architecture

[Diagram: as above, adding the custom HW processing elements and the streaming interface (SIF).]

• P x HW processing elements
• Stream-flow local interconnect (LIC)
• HWPE to/from LIC interfaces (HWPE_WPR)
• CC to/from LIC interface (SIF)

Page 78: Parallel  Architectures

P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture

[Diagram: the full cluster: N STxP70 cores (16KB P$ each; an FPx variant with 16 cores) and 2xN TCDM memory banks behind the logarithmic interconnect (TCDM); peripheral logarithmic interconnect; two DMA channels; HWS; timers; the STxP70 cluster processor (CP) with 16KB P$ and 32KB TCDM; HWPEs on the stream-flow local interconnect with HWPE_WPR and SIF interfaces; DTU; and the global interconnect interface.]

The STxP70 core:
• 32-bit RISC processor
• 16 KB P$, no local data memory
• 600 MHz in 32 nm
• Variable-length ISA
• Up to two instructions executed per cycle
• Configurable core
• Extensible through its ISA
• Complete software development tool chain

Page 79: Parallel  Architectures

P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture

[Diagram: as above; focus on the logarithmic interconnect (TCDM).]

• Parametric multi-core crossbar with a logarithmic structure
• Reduced arbitration complexity
• Round-robin arbitration scheme
• Up to N memory accesses per cycle
• Test-and-set support

Page 80: Parallel  Architectures

P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture

[Diagram: as above; focus on the DMA channels.]

• Supports 1D & 2D transfers
• Up to 3.2 GB/s peak per DMA channel
• Supports up to 16 outstanding transactions
• Supports out-of-order (OoO) completion

Page 81: Parallel  Architectures

P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture

[Diagram: as above; focus on the clock, variability and power (CVP) controller.]

• Ultra-fast frequency adaptation (power control)
• Continuous critical-path monitoring (dynamic bin sampling)
• Continuous thermal sensing (temperature control)

Page 82: Parallel  Architectures

P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture

[Diagram: as above; focus on the global interconnect interface.]

• Highly flexible and configurable interconnect
• Asynchronous implementation
• Low-area or high-performance targets
• Natural GALS enabler
• High robustness to variations


Page 84: Parallel  Architectures

LD/ST and DMA Memory Transfers

• Intra-cluster:
– LD/ST (UMA)
– DMA: from/to TCDM to/from HWPE
• Inter-cluster:
– LD/ST (NUMA)
– DMA: L1 to/from L1
• Cluster to/from L2 memory:
– LD/ST (NUMA)
– DMA: L1 to/from L2
• Cluster to/from L3 memory (through the system bridge):
– LD/ST (NUMA)
– DMA: L1 to/from L3

[Diagram: the P2012 fabric: fabric controller, 3x3 array of P2012 clusters, L2-MEM, and system bridges.]

Page 85: Parallel  Architectures

P2012 as GP Accelerator

[Diagram: an ARM host with L3 (DRAM), attached to the P2012 fabric: the fabric controller (FC), L2 memory, and clusters 0-3, each with its own L1 TCDM.]

Page 86: Parallel  Architectures

Summary

• The P2012 cluster includes up to 16 + 1 STxP70 cores, delivering up to 30.6 GOPS and 20.4 GFLOPS peak.
• ~7 GB/s DMA transfers
• Symmetric multi-processing in a UMA fashion within a cluster; shared data memory in a NUMA fashion between clusters.
• Fast multiprocessor synchronization thanks to HW support
• Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements

Page 87: Parallel  Architectures

Mobile SoC in 2012...

NVIDIA Tegra II SoC (2011). Features:
– TSMC 40nm (LP/G)
– Dual-core A9, 1-1.2 GHz (G)
– GPU, etc.: 330-400 MHz (LP)
– GeForce ULV (8 shaders)
– 2 separate Vdd rails
– 1MB L2$
– 32b LPDDR2 (600 MHz DR)

• A few (2, 4, 8) high-power processors (ARM): we need to handle power peaks
• Efficient accelerator fabrics with many (tens of) PEs: we need to improve efficiency
• Lots of (cool) memory, but we need more