Multi-Core Processors : Multi Core Architectures Part-III (Intel/AMD) 1 C-DAC hyPACK-2013 Lecture Topic: Multi-Core Processors : Multi-Core Architecture Part-I C-DAC Four Days Technology Workshop ON hyPACK-2013 (Mode-1:Multi-Core) Venue : CMSD, UoHYD ; Date : October 15-18, 2013 Hybrid Computing Coprocessors/Accelerators Power-Aware Computing Performance of Applications Kernels


Page 1: Lecture Topic - C-DAC

Multi-Core Processors : Multi Core Architectures Part-III (Intel/AMD) 1 C-DAC hyPACK-2013

Lecture Topic:

Multi-Core Processors : Multi-Core Architecture Part-I

C-DAC Four Days Technology Workshop

ON

hyPACK-2013

(Mode-1:Multi-Core)

Venue : CMSD, UoHYD ; Date : October 15-18, 2013

Hybrid Computing – Coprocessors/Accelerators – Power-Aware Computing – Performance of Application Kernels

Page 2: Lecture Topic - C-DAC

Multi-Core Processors

Lecture Outline

The following topics will be discussed:

• An overview of multi-core processors
• Understanding Intel/AMD multi-core architectures
• Performance issues

Source : Reference : [4], [6], [14],[17], [22], [28]

Source : http://www.intel.com ; http://www.amd.com


Page 4: Lecture Topic - C-DAC


Photo-real Synthesis

Real-world animation

Ray tracing

Global Illumination

Behavioral Synthesis

Physical simulation

Kinematics

Emotion synthesis

Audio synthesis

Video/Image synthesis

Document synthesis

Large dataset mining

Semantic Web/Grid Mining

Streaming Data Mining

Distributed Data Mining

Content-based Retrieval

Collaborative Filters

Multidimensional Indexing

Dimensionality Reduction

Efficient access to large, unstructured, sparse datasets

Stream Processing

Multimodal event/object Recognition

Statistical Computing

Machine Learning

Clustering / Classification

Model-based:

Bayesian network/Markov Model

Neural network / Probability networks

LP/IP/QP/Stochastic Optimization

Evolving towards model-based Computing

Source : http://www.intel.com ; Reference : [6]

Page 5: Lecture Topic - C-DAC

Killer Apps of Tomorrow

• Workload convergence: the basic algorithms shared by these high-end workloads
• Platform implications: how workload analysis guides future architectures
• Programmer productivity: optimized architectures will ease the development of software
• Call to action: benchmark suites are in critical need of redress

Page 6: Lecture Topic - C-DAC

The Memory Sub-system : Access Time

A lot of time is spent loading data from and storing data to memory, so it is important to keep in mind the relative access time of each memory type. Access time is important.

Approximate access times:

• CPU registers: 0 cycles (that is where the work is done!)
• L1 cache: 1 cycle (data and instruction caches); repeated access to a cache takes only 1 cycle
• L2 cache (static RAM): 3-5 cycles?
• Memory (DRAM): 10 cycles (cache miss); 30-60 cycles for a Translation Lookaside Buffer (TLB) update
• Disk: about 100,000 cycles!
• Connecting to other nodes: depends on network latency

[Figure: memory hierarchy, from CPU registers through Icache/Dcache and L2 cache to RAM and disk]
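One way to see why the order of memory accesses matters is to sum the same array sequentially and with a large stride. The following is a minimal sketch, not taken from the lecture; the function name `strided_sum` is a hypothetical helper chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum every element of a[0..n), visiting the elements in hops of `stride`.
 * stride == 1 walks memory sequentially (cache friendly); a large stride
 * tends to touch a different cache line on almost every access.
 * Both orders visit each element exactly once, so the sum is the same. */
long strided_sum(const int *a, size_t n, size_t stride)
{
    assert(stride > 0);
    long total = 0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < n; i += stride)
            total += a[i];
    return total;
}
```

Timing `strided_sum(a, n, 1)` against `strided_sum(a, n, 1024)` on an array much larger than the caches usually shows the strided walk to be slower, although the ratio depends on the hardware and the compiler.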

Page 7: Lecture Topic - C-DAC

Operational Flow of Threads

[Figure: Operational flow of threads for an application. The implementation source code contains parallel code blocks, each executed by threads T1, T2, ..., Tn. A parallel code block or section that needs multithread synchronization performs its synchronization operations using parallel constructs Bi and Bj.]

Source : Reference : [6]

Page 8: Lecture Topic - C-DAC

Multi Core : Programming Issues

• Out-of-order execution
• Preemptive and co-operative multitasking
• SMP to the rescue
• Super-threading with a multi-threaded processor
• Hyper-threading, the next step (implementation)
• Multitasking
• Caching and SMT

Source : http://www.intel.com ; Reference : [6], [29], [31]

Page 9: Lecture Topic - C-DAC

Hyper-threading : Partitioned Resources

Hyper-threading (HT) technology is a hardware mechanism in which multiple independent hardware threads get to execute in a single cycle on a single superscalar processor core.

[Figure: Single-processor system without and with Hyper-Threading Technology. Without HT, the physical package holds one architectural state, an execution engine, a local APIC, and a bus interface to the system bus. With HT, the physical package holds two logical processors (Logical Processor 0 and Logical Processor 1), each with its own architectural state and local APIC, sharing one execution engine and one bus interface.]

Source : http://www.intel.com ; Reference : [6], [10], [19], [23], [24],[29], [31]

Page 10: Lecture Topic - C-DAC

Hyper-threading Technology

[Figure: Multi-processor (MP) system with and without Hyper-threading (HT) technology. MP: two physical packages on the system bus, each with one architectural state, an execution engine, a local APIC, and a bus interface. MP HT: two physical packages on the system bus, each with two logical processors (Logical Processor 0 and Logical Processor 1, each with its own architectural state and local APIC) sharing one execution engine and one bus interface.]

Source : http://www.intel.com ; Reference : [6], [10], [19], [23], [24],[29], [31]

Page 11: Lecture Topic - C-DAC

Multi-threaded Processing using Hyper-Threading Technology

[Figure: Threads T0 to T5 from App 0, App 1, and App 2 wait in a thread pool. With multithreading on a single CPU, T0 to T5 run one after another over time. With Hyper-threading Technology, the CPU runs 2 threads per processor: logical processor LP0 runs T2, T0, T4 while logical processor LP1 runs T1, T3, T5.]

The time taken to process n threads on a single processor is significantly more than on a single-processor system with HT technology enabled, because with HT two threads execute at the same time on the two logical processors.

Source : http://www.intel.com ; Reference : [6], [29], [31]

Page 12: Lecture Topic - C-DAC

Simple Comparison of Single-core, Multi-processor, and Multi-core Architectures

[Figure, panels A to D:
A) Single core: one CPU state, interrupt logic, execution units, and cache.
B) Multiprocessor: two processors, each with its own CPU state, interrupt logic, execution units, and cache.
C) Hyper-Threading Technology: two sets of CPU state and interrupt logic sharing one set of execution units and one cache.
D) Multi-core: two cores, each with its own CPU state, interrupt logic, execution units, and cache.]

Source : http://www.intel.com ; Reference : [4],[6], [29], [31]

[Figure, panels E and F:
E) Multi-core with shared cache: two cores, each with its own CPU state, interrupt logic, and execution units, sharing one cache.
F) Multi-core with Hyper-threading Technology: two cores, each with two sets of CPU state and interrupt logic sharing that core's execution units and cache.]

Page 13: Lecture Topic - C-DAC

Multi Core Programming Tools

Intel Programming Tools : Intel Threading Building Blocks

Performance

• Tools to discover parallelism
• Use of math libraries
• Measure overheads of threads
• Hardware counters

Tools

• Auto parallelism
• Thread Analyzer
• Thread Checker
• Compilers & math libraries
• Thread performance analysis
• Thread Profiler

Source : http://www.intel.com ; Reference : [6]

Page 14: Lecture Topic - C-DAC

Programming Multicore Processors

Explicit Parallel Programming

• Thread-based programming models
• Data-parallel programming models
• Stream programming models

Automatic Parallelization

• A feature of most compilers for SMP systems, but it currently sees very little practical use
• Polyhedral framework for dependencies and loop transformations, enabling composition of complex transformations over multiple statements
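The dependence analysis behind automatic parallelization can be illustrated with two loops. This is a minimal sketch for illustration only; `vector_add` and `prefix_sum` are hypothetical names, and no particular compiler's behavior is implied.

```c
#include <assert.h>
#include <stddef.h>

/* Iterations are independent: c[i] depends only on a[i] and b[i], so the
 * iterations may be distributed across threads in any order
 * (assuming c does not overlap a or b). */
void vector_add(const double *a, const double *b, double *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Loop-carried dependence: iteration i reads the value that iteration
 * i-1 just wrote, so the loop cannot simply be split across threads. */
void prefix_sum(double *a, size_t n)
{
    assert(a);
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}
```

A parallelizing compiler has to prove the first property before it can transform a loop, which is one reason the technique is hard to apply to code with pointers and complex subscripts.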

Page 15: Lecture Topic - C-DAC

Spawning Threads

Initialize attributes (pthread_attr_init)

• Default attributes are OK
• Put the thread in system-wide scheduling contention: pthread_attr_setscope(&attrs, PTHREAD_SCOPE_SYSTEM);

Spawn the thread (pthread_create)

• Creates a thread identifier
• Needs an attribute structure for the thread
• Needs a function where the thread starts
• One pointer-sized parameter can be passed (void *)

Source : Reference : [4],[6], [29]
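The spawning steps can be sketched in C as follows. This is a minimal illustration rather than code from the lecture; the names `hello` and `spawn_one` and the argument string are assumptions made for the example.

```c
#include <pthread.h>
#include <stdio.h>

/* Thread start function: receives the single void * argument. */
static void *hello(void *arg)
{
    const char *name = arg;
    printf("hello from %s\n", name);
    return NULL;
}

/* Spawn one thread with system-wide contention scope and wait for it.
 * Returns 0 on success, or the first non-zero pthread error code. */
int spawn_one(void)
{
    pthread_attr_t attrs;
    pthread_t tid;
    int rc;

    rc = pthread_attr_init(&attrs);            /* default attributes */
    if (rc != 0)
        return rc;
    rc = pthread_attr_setscope(&attrs, PTHREAD_SCOPE_SYSTEM);
    if (rc == 0)
        rc = pthread_create(&tid, &attrs, hello, "worker");
    pthread_attr_destroy(&attrs);
    if (rc != 0)
        return rc;
    return pthread_join(tid, NULL);
}
```

On most systems the program must be compiled and linked with the `-pthread` option.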

Page 16: Lecture Topic - C-DAC

Thread Spawning Issues

How does a thread know which thread it is? Does it matter?

• Yes, it matters if threads are to work together
• Could pass some identifier in through the parameter
• Could contend for a shared counter in a critical section
• pthread_self() returns the thread ID, but that does not help

How big is a thread’s stack?

• By default, not very big. (What are the ramifications?)
• pthread_attr_setstacksize() changes the stack size
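Passing an identifier through the parameter and setting the stack size might look like the sketch below. The thread count, the 1 MiB stack size, and the names `worker`, `run_workers`, and `squares` are illustrative assumptions, not values from the lecture.

```c
#include <pthread.h>

#define NUM_THREADS 4            /* illustrative value */
#define STACK_BYTES (1u << 20)   /* 1 MiB, an arbitrary example size */

static int squares[NUM_THREADS];

/* Each thread learns its logical index from the argument and uses it to
 * pick its own slot, so no two threads write the same element. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    squares[id] = id * id;
    return NULL;
}

/* Returns 0 on success, or a pthread error code. */
int run_workers(void)
{
    pthread_t tids[NUM_THREADS];
    int ids[NUM_THREADS];        /* must outlive the threads: joined below */
    pthread_attr_t attrs;
    int started = 0;

    int rc = pthread_attr_init(&attrs);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(&attrs, STACK_BYTES);
    while (rc == 0 && started < NUM_THREADS) {
        ids[started] = started;
        rc = pthread_create(&tids[started], &attrs, worker, &ids[started]);
        if (rc == 0)
            ++started;
    }
    pthread_attr_destroy(&attrs);
    for (int i = 0; i < started; ++i)
        pthread_join(tids[i], NULL);
    return rc;
}
```

Each thread gets the address of its own `ids` element; passing the address of a single shared loop variable instead would let the value change before the thread reads it.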

Page 17: Lecture Topic - C-DAC

Join Issues

The main thread must join with its child threads (pthread_join)

• Why? So it knows when they are done.

pthread_join can pass back a pointer-sized value (void *)

• Can be used as a pointer to pass back a result
• What kind of variable can be passed back that way? Local? Static? Global? Heap?
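One possible answer to the question above is to return heap memory, which outlives the thread, whereas the address of a local variable of the thread function would dangle. A minimal sketch, with the hypothetical names `sum_to_n` and `threaded_sum`:

```c
#include <pthread.h>
#include <stdlib.h>

/* Sums 1..n and returns the result in heap memory, which stays valid
 * after the thread has exited. */
static void *sum_to_n(void *arg)
{
    long n = *(long *)arg;
    long *result = malloc(sizeof *result);
    if (result == NULL)
        return NULL;
    *result = n * (n + 1) / 2;
    return result;
}

/* Runs sum_to_n in a child thread; returns the sum, or -1 on failure. */
long threaded_sum(long n)
{
    pthread_t tid;
    void *ret = NULL;

    if (pthread_create(&tid, NULL, sum_to_n, &n) != 0)
        return -1;
    if (pthread_join(tid, &ret) != 0 || ret == NULL)
        return -1;
    long value = *(long *)ret;
    free(ret);                   /* the joiner owns the returned block */
    return value;
}
```

Static and global variables also work as return storage, but they are shared by all threads running the same function.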

Page 18: Lecture Topic - C-DAC

Thread Pitfalls

Shared data

• 2 threads perform A = A + 1
• Thread 1: 1) Load A into R1  2) Add 1 to R1  3) Store R1 to A
• Thread 2: 1) Load A into R1  2) Add 1 to R1  3) Store R1 to A
• If the two sequences interleave, one of the updates can be lost

Mutual exclusion preserves correctness

• Locks/mutexes
• Semaphores
• Monitors
• Java “synchronized”

False sharing

• Non-shared data packed into the same cache line:
  int thread1data;
  int thread2data;
• The cache line ping-pongs between CPUs when threads access their data

Locks for heap access

• malloc() is expensive because of mutual exclusion
• Use private
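Two of these pitfalls can be sketched together in C: a mutex protects the shared counter, and padding keeps each thread's private counter on its own cache line. The 64-byte line size, the thread and iteration counts, and all names here are assumptions for illustration.

```c
#include <pthread.h>

#define NUM_THREADS 4        /* illustrative */
#define INCREMENTS 100000    /* illustrative */
#define CACHE_LINE 64        /* assumed line size; hardware dependent */

static long shared_counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* One counter per thread, padded to CACHE_LINE bytes so that no two
 * counters' values fall in the same (assumed 64-byte) cache line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};
static struct padded_counter private_counters[NUM_THREADS];

static void *increment(void *arg)
{
    struct padded_counter *mine = arg;
    for (int i = 0; i < INCREMENTS; ++i) {
        mine->value++;                       /* private: no lock needed */
        pthread_mutex_lock(&counter_lock);   /* shared: the load, add and */
        shared_counter++;                    /* store must not interleave */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Returns the final shared counter, or -1 if a thread could not start. */
long run_counters(void)
{
    pthread_t tids[NUM_THREADS];
    int started = 0;

    shared_counter = 0;
    for (int i = 0; i < NUM_THREADS; ++i)
        private_counters[i].value = 0;
    while (started < NUM_THREADS &&
           pthread_create(&tids[started], NULL, increment,
                          &private_counters[started]) == 0)
        ++started;
    for (int i = 0; i < started; ++i)
        pthread_join(tids[i], NULL);
    return started == NUM_THREADS ? shared_counter : -1;
}
```

Removing the lock around `shared_counter++` would typically produce a total below the expected value, because increments from different threads overwrite each other.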

Page 19: Lecture Topic - C-DAC

Java Threads

Threading and synchronization are built in

• An object can have an associated thread
• Subclass Thread or implement Runnable
• The “run” method is the thread body
• “synchronized” methods provide mutual exclusion

Main program

• Calls the “start” method of Thread objects to spawn threads
• Calls “join” to wait for completion

Page 20: Lecture Topic - C-DAC


The Role of Intelligent Design : Multi-Core Processors Intel Quad Core (Clovertown) Server Processor with Blackford Chipset [2007]

Multi Cores Today

Source : http://www.intel.com ; http://www.amd.com


Page 22: Lecture Topic - C-DAC

Memory Performance of Dual Core Systems

• Latency from different levels of memory
• Bandwidth from different levels of memory
• Stream benchmark
• Prefetch streams: effective bandwidth
• Serial and parallel dot product
• Matrix-vector multiplication (DAXPY) loop
• Indirect dot product
• Remote versus local memory

Source : http://www.intel.com ; Reference : [6]
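The serial and parallel dot-product kernels listed above could be written along these lines. This is a sketch under stated assumptions, not the benchmark code used in the lecture: two threads are used to match a dual-core node, and the names `dot_serial`, `dot_parallel`, and `struct dot_task` are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 2   /* illustrative: one thread per core of a dual-core node */

struct dot_task {
    const double *x, *y;
    size_t begin, end;   /* half-open index range [begin, end) */
    double partial;      /* result for this range */
};

static void *dot_range(void *arg)
{
    struct dot_task *t = arg;
    double sum = 0.0;
    for (size_t i = t->begin; i < t->end; ++i)
        sum += t->x[i] * t->y[i];
    t->partial = sum;
    return NULL;
}

/* Serial reference version. */
double dot_serial(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Splits [0, n) into NUM_THREADS contiguous chunks, one per thread.
 * If a thread cannot be created, its chunk runs in the calling thread. */
double dot_parallel(const double *x, const double *y, size_t n)
{
    pthread_t tids[NUM_THREADS];
    struct dot_task tasks[NUM_THREADS];
    int created[NUM_THREADS];
    double total = 0.0;

    for (int t = 0; t < NUM_THREADS; ++t) {
        tasks[t].x = x;
        tasks[t].y = y;
        tasks[t].begin = n * (size_t)t / NUM_THREADS;
        tasks[t].end = n * (size_t)(t + 1) / NUM_THREADS;
        tasks[t].partial = 0.0;
        created[t] = pthread_create(&tids[t], NULL, dot_range, &tasks[t]) == 0;
        if (!created[t])
            dot_range(&tasks[t]);
    }
    for (int t = 0; t < NUM_THREADS; ++t) {
        if (created[t])
            pthread_join(tids[t], NULL);
        total += tasks[t].partial;
    }
    return total;
}
```

Because a dot product does very little arithmetic per byte loaded, its parallel speedup on large vectors is usually limited by memory bandwidth rather than by the number of cores.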

Page 23: Lecture Topic - C-DAC

Memory Architecture for a Dual-Processor System

Commodity PC Server : Shared Bus Micro-Architecture

Intel IA-32 processors: two processors on a single motherboard share a common northbridge and memory DIMMs.

• The processors share the 400 MHz front-side bus (FSB): 3.2 GB/s bandwidth
• 200 MHz dual-channel memory: 1.6 GB/s per channel

[Figure: two IA-32 processors connected over the shared 3.2 GB/s bus to the northbridge, which connects to two memory banks at 1.6 GB/s each]

Source : http://www.intel.com ; Reference : [6]
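The bandwidth figures on this slide are consistent with a simple peak-rate calculation. The sketch below assumes a 64-bit (8-byte) data path for both the front-side bus and each memory channel, and reads the quoted 400 MHz and 200 MHz as effective transfer rates; neither assumption is stated on the slide.

```latex
\begin{align*}
B_{\text{FSB}} &= 400 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times 8\ \tfrac{\text{bytes}}{\text{transfer}} = 3.2\ \text{GB/s} \\
B_{\text{channel}} &= 200 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times 8\ \tfrac{\text{bytes}}{\text{transfer}} = 1.6\ \text{GB/s} \\
B_{\text{memory}} &= 2 \times B_{\text{channel}} = 3.2\ \text{GB/s}
\end{align*}
```

Under these assumptions the two memory channels together just match the shared bus, so two processors streaming data at the same time each see at most about half of the 3.2 GB/s peak.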

Page 24: Lecture Topic - C-DAC

Multi Cores Today

[Figure: Block diagrams of the Core 2 Xeon, the Dual-Core Opteron, and the Core 2 Quad/Extreme. Each die pairs CPU1 and CPU2 with L2 cache and a system/memory interface. The Intel parts reach an external memory controller over the front side bus; the Opteron parts have a memory controller with directly attached memory and are connected by a HyperTransport link.]

Source : http://www.intel.com; http://www.amd.com

Page 25: Lecture Topic - C-DAC

Multi-Core Processors

The Role of Intelligent Design in the Evolution of Multi-Core Processors

Intel’s first dual-core designs: performance improvement

Dual independent front side buses (FSBs), one for each CPU socket (an enhancement of the single-core design)

• The FSB is shared by both cores
• Inter-processor cache snooping (to ensure coherency of cached data) must be performed over the FSB, which places an incremental load on the FSB

Source : http://www.intel.com

AMD Opteron – Dual Core; Source : http://www.amd.com

Sun Microsystems – Niagara processors (UltraSPARC T1); Source : http://www.sun.com

Page 26: Lecture Topic - C-DAC

Multi-Core Processors

The Role of Intelligent Design in the Evolution of Multi-Core Processors

Intel Quad Core : quad-core processors demand more memory bandwidth and end up starving for it

• Cache snooping over the front side bus (FSB)
• Contention for FSB access
• Overworked memory controller

Performance is marginally better in most applications, with significant improvement in some others.

Source : http://www.intel.com

Page 27: Lecture Topic - C-DAC

Memory Performance of Dual Core Systems

• The Intel Xeon memory architecture, single core: good memory bandwidth from all levels of the memory hierarchy.
• Performance of the shared-bus memory architecture when executing two memory-intensive processes in parallel on a dual-processor node.
• The effect of memory contention, and the resulting degradation in performance, is especially bad for random or strided memory access.
• Effect when the entire FSB bandwidth can be utilized.
• Good scalability of the memory bandwidth leads to efficient use of the second CPU in a dual-processor node.

Source : http://www.intel.com

Page 28: Lecture Topic - C-DAC

Multi-Core Processors

Intel Dual Core (Dempsey with Blackford Chipset)

• Common L2 cache
• Shared bus interface

Intel Dual Core (Yonah Mobile Processor)

• Inter-core sharing in dual-core architecture

Source : http://www.intel.com

Page 29: Lecture Topic - C-DAC

Memory Performance of Dual /Quad Core Systems

Intel’s first dual-core arena : Smithfield, Paxville, Dempsey & Presler

• Communication between the cores must be accomplished over the external front side bus that connects both cores to the northbridge.

Dempsey with the Blackford chipset

• Provides a separate front-side bus for each CPU socket in the system.
• Good scalability of the memory bandwidth leads to efficient use of the second CPU in a dual-processor node.

Yonah : Intel’s dual-core mobile processor

• The two cores share a common L2 cache, so they never need to go off-chip to ensure cache coherency.
• Eliminates front-side bus (FSB) contention issues.

Source : http://www.intel.com

Page 30: Lecture Topic - C-DAC

Multi-Core Processors

Intel Dual Core (Woodcrest) Server Processor with Blackford Chipset : improves FSB speed.

Source : http://www.intel.com

Page 31: Lecture Topic - C-DAC

Memory Performance of Dual /Quad Core Systems

Intel Dual Core (Woodcrest) Server Processor with Blackford Chipset

• Communication between the cores must be accomplished over the external front side bus that connects both cores to the northbridge.
• Dempsey with the Blackford chipset provides a separate front-side bus for each CPU socket in the system.
• Good scalability of the memory bandwidth leads to efficient use of the second CPU in a dual-processor node.

Source : http://www.intel.com

Page 32: Lecture Topic - C-DAC

Multi-Core Processors

Intel Quad Core (Clovertown) Server Processor with Blackford Chipset : Clovertown consists of two Woodcrest dies placed in a single package.

Source : http://www.intel.com

Page 33: Lecture Topic - C-DAC

Multi-Core Processors

The Role of Intelligent Design : Multi-Core Processors

Intel Quad Core (Caneland) Server Processor with Blackford Chipset [2007]

Source : http://www.intel.com


Page 35: Lecture Topic - C-DAC

AMD : Introducing Multi-core Technology

• AMD introduced the first multi-core technology for x86-based servers.
• Multi-core processors represent a major evolution in computing technology. Placing two or more powerful computing cores on a single processor opens up a world of important new possibilities.
• The AMD platform is leading the industry to pervasive 64-bit computing.
• AMD Opteron processor: server and workstation.

Source : http://www.amd.com

Page 36: Lecture Topic - C-DAC

Multi Cores Today

[Figure, left: Simple SMP block diagram for two processors. CPU 0 and CPU 1 share one memory.]

[Figure, right: Two-processor AMD Opteron system in ccNUMA configuration. AMD Opteron CPU0 and AMD Opteron CPU1 each have their own memory and are linked by HyperTransport.]

Two processor Dual Core

Page 37: Lecture Topic - C-DAC

AMD Multi Cores

AMD : Cache-coherent non-uniform memory access (ccNUMA)

• Two or more processors are connected together on the same motherboard.
• In a ccNUMA design, each processor has its own memory system.
• The phrase ‘non-uniform memory access’ refers to the potential difference in latency.

[Figure: Dual-Core AMD Opteron processor configuration. AMD Opteron Dual-Core Processor 0 (Core 0, Core 1) and AMD Opteron Dual-Core Processor 1 (Core 2, Core 3) each have their own memory and are connected by HyperTransport.]

Source : http://www.amd.com

Page 38: Lecture Topic - C-DAC

AMD Opteron : HyperTransport Technology

• HyperTransport technology is a high-speed, bi-directional, low-latency point-to-point communication link.
• It provides a high-bandwidth interconnect between computing cores.
• AMD Opteron supports up to three coherent HyperTransport links, yielding up to 24.0 GB/s peak bandwidth per processor.
• The AMD Opteron processor provides a scalable architecture that delivers next-generation performance.
• Integrated DDR memory controller : increases application performance by dramatically reducing memory latency.
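The 24.0 GB/s figure can be reproduced with a short calculation. The sketch below assumes 16-bit (2-byte) links in each direction running at 2 × 10^9 transfers per second, and counts both directions of a link; these link parameters are an assumption, since the slide gives only the total.

```latex
\begin{align*}
B_{\text{direction}} &= 2\ \text{bytes} \times 2 \times 10^{9}\ \tfrac{\text{transfers}}{\text{s}} = 4\ \text{GB/s} \\
B_{\text{link}} &= 2 \times B_{\text{direction}} = 8\ \text{GB/s} \\
B_{\text{processor}} &= 3\ \text{links} \times B_{\text{link}} = 24\ \text{GB/s}
\end{align*}
```

This is an aggregate peak over all three links and both directions, so a single stream of traffic in one direction over one link sees a much smaller share of it.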

Page 39: Lecture Topic - C-DAC

AMD : Scalable Memory Interconnect

• Memory access is optimized. First, controllers are built into the chips to create a direct path to each processor’s locally attached memory.
• Scaling memory in coherency is achieved through a virtual chipset, interconnected via a high-speed transport layer called HyperTransport.
• Each processor can share its local memory and devices with other processors through HyperTransport links, but each processor retains a local, direct path to its attached memory.
• It is a NUMA-like architecture because latencies differ when accessing local versus remote memory.

Source : http://www.amd.com

Page 40: Lecture Topic - C-DAC

AMD : Scalable Memory Interconnect

HyperTransport technology within the Opteron micro-architecture

• Scale in memory bandwidth : memory access is optimized
• I/O support

Source : http://www.amd.com

Page 41: Lecture Topic - C-DAC

Block Diagram of AMD Opteron Processor

Source : http://www.amd.com

Page 42: Lecture Topic - C-DAC

AMD Opteron Processor : Key Architecture Features

• I/O is directly connected to the CPU for more balanced throughput and I/O
• CPUs are directly connected to each other, allowing linear symmetric multiprocessing
• 128-bit wide integrated DDR DRAM memory controller, capable of supporting up to eight registered DDR DIMMs per processor
• Allows end users to run their existing 32-bit applications
• Designed to enable 64-bit computing
• Enables a single architecture across 32-bit and 64-bit environments
• Memory is directly connected to the CPU, optimizing performance

Page 43: Lecture Topic - C-DAC

AMD Opteron Dual Core Processor

AMD Dual-Core Opteron, circa 2005

Socket F, used by motherboard OEMs

Source : http://www.amd.com

Page 44: Lecture Topic - C-DAC

AMD Opteron Quad Core Processor

AMD Quad-Core Opteron, circa 2007

Source : http://www.amd.com

Page 45: Lecture Topic - C-DAC

AMD Opteron : Direct Connect Architecture

• AMD64 with Direct Connect Architecture can improve overall performance and efficiency
• There are no front side data buses; instead, the processors, memory controller, and I/O are directly connected to the CPU and communicate at CPU speed
• Available with all types of AMD processors (Opteron, Athlon)

Page 46: Lecture Topic - C-DAC

AMD Opteron : Direct Connect Architecture

Source : http://www.amd.com

Page 47: Lecture Topic - C-DAC

AMD Opteron : Direct Connect Architecture

Source : http://www.amd.com

Page 48: Lecture Topic - C-DAC

AMD Opteron : Processor Benefits

AMD64 core :

• Enables simultaneous 32-bit and 64-bit computing
• Eliminates the 4 GB memory barrier imposed by 32-bit-only systems

HyperTransport technology :

• Provides maximum peak bandwidth and reduces bottlenecks
• Directly connects CPUs, enabling scalability

Page 49: Lecture Topic - C-DAC

Multi Core Architecture : I/O Bottlenecks

A balanced architecture with dual independent buses reduces I/O bottlenecks

Page 50: Lecture Topic - C-DAC

Future CMP Technology

• 8 cores soon
• Room for improvement: multi-way caches are expensive, and coherence protocols perform poorly
• Stream programming: GPU or multi-core (see GPGPU.org for details)

[Figure: Possible hybrid AMD multi-core design. An Athlon 64 CPU and an ATI GPU are connected through a crossbar (XBAR) to HyperTransport and the memory controller.]

Page 51: Lecture Topic - C-DAC

Summary

• An overview of multi-core systems
• Examples of multi-core systems
• Memory performance issues

Page 52: Lecture Topic - C-DAC

References

1. Andrews, Gregory R. (2000), Foundations of Multithreaded, Parallel, and Distributed Programming, Boston, MA : Addison-Wesley

2. Butenhof, David R. (1997), Programming with POSIX Threads, Boston, MA : Addison-Wesley Professional

3. Culler, David E., Jaswinder Pal Singh (1999), Parallel Computer Architecture - A Hardware/Software Approach, San Francisco, CA : Morgan Kaufmann

4. Grama, Ananth, Anshul Gupta, George Karypis and Vipin Kumar (2003), Introduction to Parallel Computing, Boston, MA : Addison-Wesley

5. Intel Corporation (2003), Intel Hyper-Threading Technology, Technical User's Guide, Santa Clara, CA : Intel Corporation. Available at : http://www.intel.com

6. Shameem Akhter, Jason Roberts (April 2006), Multi-Core Programming - Increasing Performance through Software Multi-threading, Intel Press, Intel Corporation

7. Bradford Nichols, Dick Buttlar and Jacqueline Proulx Farrell (1996), Pthreads Programming, O'Reilly and Associates, Newton, MA 02164

8. James Reinders (2007), Intel Threading Building Blocks, O'Reilly

9. Laurence T. Yang & Minyi Guo (Editors) (2006), High Performance Computing - Paradigm and Infrastructure, Wiley Series on Parallel and Distributed Computing, Albert Y. Zomaya, Series Editor

10. Intel Corporation (March 2003), Intel Threading Methodology : Principles and Practices, Version 2.0

Page 53: Lecture Topic - C-DAC

References

11. William Gropp, Ewing Lusk, Rajeev Thakur (1999), Using MPI-2 : Advanced Features of the Message-Passing Interface, The MIT Press

12. Pacheco, Peter S. (1997), Parallel Programming with MPI, University of San Francisco, Morgan Kaufmann Publishers, Inc., San Francisco, California

13. Kai Hwang, Zhiwei Xu (1998), Scalable Parallel Computing (Technology, Architecture, Programming), New York : McGraw-Hill

14. Michael J. Quinn (2004), Parallel Programming in C with MPI and OpenMP, McGraw-Hill International Editions, Computer Science Series, New York : McGraw-Hill

15. Andrews, Gregory R. (2000), Foundations of Multithreaded, Parallel, and Distributed Programming, Boston, MA : Addison-Wesley

16. SunSoft, Solaris Multithreaded Programming Guide, SunSoft Press, Mountain View, CA

17. Chandra, Rohit, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon (2001), Parallel Programming in OpenMP, San Francisco : Morgan Kaufmann

18. S. Kleiman, D. Shah, and B. Smaalders (1995), Programming with Threads, SunSoft Press, Mountain View, CA

19. Mattson, Tim (2002), Nuts and Bolts of Multi-threaded Programming, Santa Clara, CA : Intel Corporation. Available at : http://www.intel.com

20. I. Foster (1995), Designing and Building Parallel Programs : Concepts and Tools for Parallel Software Engineering, Addison-Wesley

21. J. Dongarra, I. S. Duff, D. Sorensen, and H. V. Vorst (1999), Numerical Linear Algebra for High Performance Computers (Software, Environments, Tools), SIAM

Page 54: Lecture Topic - C-DAC

References

22. OpenMP Architecture Review Board (October 1998), OpenMP C and C++ Application Program Interface, Version 1.0

23. D. A. Lewine (1991), POSIX Programmer's Guide : Writing Portable UNIX Programs with the POSIX.1 Standard, O'Reilly & Associates

24. Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, Paul R. Wilson (November 2000), Hoard : A Scalable Memory Allocator for Multi-threaded Applications, The Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), Cambridge, MA. Web site URL : http://www.hoard.org/

25. Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker and Jack Dongarra (1998), MPI - The Complete Reference : Volume 1, The MPI Core, second edition [MCMPI-07]

26. William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir (1998), MPI - The Complete Reference : Volume 2, The MPI-2 Extensions

27. A. Zomaya (Editor) (1996), Parallel and Distributed Computing Handbook, McGraw-Hill

28. OpenMP C and C++ Application Program Interface, Version 2.5 (May 2005), from the OpenMP web site, URL : http://www.openmp.org/

29. Stokes, Jon (October 2002), Introduction to Multithreading, Superthreading and Hyperthreading, Ars Technica

30. Andrews, Gregory R. (2000), Foundations of Multithreaded, Parallel, and Distributed Programming, Boston, MA : Addison-Wesley

31. Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, Michael Upton, “Hyper-Threading Technology Architecture and Microarchitecture”, Intel (2000-01)

Page 55: Lecture Topic - C-DAC

Thank You

Any questions?