40
CPEG421-2001-F-Topic-3-II 1 Topic 3 -- II: System Software Fundamentals: Multithreaded Execution Models, Virtual Machines and Memory Models Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor Electrical & Computer Engineering University of Delaware [email protected]

Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

  • Upload
    xandy

  • View
    42

  • Download
    0

Embed Size (px)

DESCRIPTION

Topic 3 -- II: System Software Fundamentals: Multithreaded Execution Models, Virtual Machines and Memory Models. Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor Electrical & Computer Engineering University of Delaware [email protected]. Outline. - PowerPoint PPT Presentation

Citation preview

Page 1: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 1

Topic 3 -- II: System Software Fundamentals:

Multithreaded Execution Models, Virtual Machines

and Memory Models

Guang R. Gao

ACM Fellow and IEEE FellowEndowed Distinguished ProfessorElectrical & Computer Engineering

University of Delaware

[email protected]

Page 2: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 2

Outline• An introduction to parallel program execution

models• Coarse-grain vs. fine-grain multithreading• Evolution of fine-grain multithreaded program

execution models.• Memory and synchronization. models• Fine-Grain Multithreaded execution and virtual

machine models for peta-scale computing: a case study on HTMT/EARTH

Page 3: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 3

Terminology Clarification

• Parallel Model of Computation– Parallel Models for Algorithm Designers– Parallel Models for System Designers

• Parallel Programming Models• Parallel Execution Models• Parallel Architecture Models

Page 4: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 4

System Characterization

Questions:

Q1: What characteristics of a computational system are required …

Q2: The diversity of existing and potential multi-core architectures…

Response:

R1: An important characteristic of such a compiler should include, at both chip level and system level, a program execution model that should at least include the specification and API

Gao, ECCD Workshop, Washington D.C., Nov. 2007

Page 5: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 5

What Does Program Execution Model (PXM) Mean ?

• The notion of PXM

The program execution model (PXM) is the basic

low-level abstraction of the underlying system

architecture upon which our programming model,

compilation strategy, runtime system, and other software components are developed.

• The PXM (and its API) serves as an interface between the architecture and the software.

Page 6: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 6

Program Execution Model (PXM) – Cont’d

Unlike an instruction set architecture (ISA) specification, which usually focuses on lower level details (such as instruction encoding and organization of registers for a specific processor), the PXM refers to machine organization at a higher level for a whole class of high-end machines as view by the users

Gao, et. al., 2000

Page 7: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 7

What is your “Favorite”

Program Execution Model?

Page 8: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

A Generic MIMD Architecture

CPEG421-2001-F-Topic-3-II 8

Memory NICCommunication

Assist

$

P

$

P

IC

Node: Processor(s), Memory System plus Communication assist (Network Interface & Communication Controller)

Full Feature Interconnect Networks. Packet Switching Fabrics. Key: Scalable Network

Objective: Make efficient use of scarce communication resources – providing high bandwidth, low-latency communication between nodes with a minimum cost and energy

Page 9: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Programming Models for Multi-Processor Systems

• Message Passing Model– Multiple address

spaces

– Communication can only be achieved through “messages”

• Shared Memory Model– Memory address space

is accessible to all

– Communication is achieved through memory

CPEG421-2001-F-Topic-3-II 9

Local Memory

Processor

Local Memory

Processor

Messages

Processor Processor

Global Memory

Page 10: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Comparison

Message Passing

+ Less Contention

+ Highly Scalable

+ Simplified Synch – Message Passing Sync +

Comm.

– But does not mean highly programmable

- Load Balancing

- Deadlock prone

- Overhead of small messages

Shared Memory

+ global shared address space

+ Easy to program (?)

+ No (explicit) message passing (e.g. communication through memory put/get operations)

- Synchronization (memory consistency models, cache models)

- Scalability

CPEG421-2001-F-Topic-3-II 10

Page 11: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

What is A Shared Memory Execution Model?

CPEG421-2001-F-Topic-3-II 11

Thread ModelA set of rules for creating, destroying and managing threads

Thread ModelA set of rules for creating, destroying and managing threads

Memory ModelDictate the ordering of memory operations

Memory ModelDictate the ordering of memory operations

Synchronization ModelProvide a set of mechanisms to protect from data races

Synchronization ModelProvide a set of mechanisms to protect from data races

Execution Model

The Thread Virtual MachineThe Thread Virtual Machine

Page 12: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 12

Essential Aspects in User-Level Shared Memory Support?

• Shared address space support and management

• Access control and management

- Memory consistency model (MCM)

- Cache management mechanism

Page 13: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 13

Grand Challenge Problems

• How to build a shared-memory multiprocessor that is

scalable both within a (multi-core/many-core chip) and a

system with many chips ?

• How to program and optimize application programs?

Our view: One major obstacle in solving these problems in

the memory coherence assumption in today’s hardware-

centric memory consistency model.

Page 14: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

A Parallel Execution Model

CPEG421-2001-F-Topic-3-II 14

Application Programming Interface (API)

Execution / Architecture Model

Thread Model

Memory Model

Synchronization Model

Page 15: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

A Parallel Execution Model

CPEG421-2001-F-Topic-3-II 15

Application Programming Interface (API)

With Dataflow Origins

Execution / Architecture Model

Fine Grained Multithreaded

Model

Memory Adaptive /

Aware Model

Fine Grained Synchronization

Model

Our Model

Page 16: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 16

Comment on OS impact?

• Should compiler be OS-Aware too ? If so, how ?

• Or other alternatives ? Compiler-controlled runtime, of compiler-aware kernels, etc.

• Example: software pipelining …

Gao, ECCD Workshop, Washington D.C., Nov. 2007

Page 17: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 17

Outline

• An introduction to multithreaded program execution models

• Coarse-grain vs. fine-grain parallel execution models – a historical overview

• Fine-grain multithreaded program execution models.

• Memory and synchronization. models• Fine-grain multithreaded execution and virtual

machine models for extreme-scale machines: a case study on HTMT/EARTH

Page 18: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Course Grain Execution Models

CPEG421-2001-F-Topic-3-II 18

The Single Instruction Multiple Data (SIMD) Model

The Single Program Multiple Data (SPMD) Model

The Data Parallel Model

Pipelined Vector Unit orPipelined Vector Unit or

Array of ProcessorsArray of Processors

Program

Processor

Program

Processor

Program

Processor

Program

Processor

Task Task Task Task

Data Structure

Page 19: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Data Parallel Model

CPEG421-2001-F-Topic-3-II 19

Difficult to write unstructured programsDifficult to write unstructured programsConvenient only for problems with regular structured parallelism.

Limited composability!Limited composability!Inherent limitation of coarse-grain multi-threading

Compute

Communication

Compute

Communication

?

Limitations

Page 20: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Dataflow Model of Computation

CPEG421-2001-F-Topic-3-II 20

++

++**

a b c d e

1

3

4

3

Page 21: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Dataflow Model of Computation

CPEG421-2001-F-Topic-3-II 21

++

++**

a b c d e

4

3

4

Page 22: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Dataflow Model of Computation

CPEG421-2001-F-Topic-3-II 22

++

++**

a b c d e

7

4

Page 23: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Dataflow Model of Computation

CPEG421-2001-F-Topic-3-II 23

++

++**

a b c d e

28

Page 24: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Dataflow Model of Computation

CPEG421-2001-F-Topic-3-II 24

++

++**

a b c d e

1

3

4

3

28

Dataflow Software Pipelining

Page 25: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 25

Outline

• An introduction to multithreaded program execution models

• Coarse-grain vs. fine-grain parallel execution models – A Historical Overview

• Fine-grain multithreaded program execution models.

• Memory and synchronization. models• Fine-grain multithreaded execution and virtual

machine models for peta-scale machines: a case study on HTMT/EARTH

Page 26: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 26

CPU

Memory

Fine-Grain non-preemptive thread-The “hotel” model

ThreadUnit

ExecutorLocus

Coarse-Grain vs. Fine-Grain Multithreading

A PoolThread

CPU

Memory

ExecutorLocus

A SingleThread

Coarse-Grain thread-The family home model

ThreadUnit

[Gao: invited talk at Fran Allen’s Retirement Workshop, 07/2002]

Page 27: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 27

Evolution of Multithreaded Execution and Architecture Models

Non-dataflowbased

CDC 66001964

MASAHalstead1986

HEPB. Smith1978

Cosmic CubeSeiltz1985

J-MachineDally1988-93

M-MachineDally1994-98

Dataflowmodel inspired

MIT TTDAArvind1980

ManchesterGurd & Watson1982

*T/Start-NGMIT/Motorola1991-

SIGMA-IShimada1988

MonsoonPapadopoulos& Culler 1988

P-RISCNikhil & Arvind1989

EM-5/4/X RWC-11992-97

Iannuci’s1988-92

Others: Multiscalar (1994), SMT (1995), etc.

Flynn’sProcessor1969

CHoPP’77 CHoPP’87

TAMCuller1990

TeraB. Smith1990-

AlwifeAgarwal1989-96

CilkLeiserson

LAUSyre1976

Eldorado

CASCADE

StaticDataflowDennis 1972MIT

Arg-FetchingDataflowDennisGao1987-88

MDFAGao1989-93

MTAHumTheobaldGao 94

EARTH CAREPACT95’, ISCA96, Theobald99

Marquez04

Page 28: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

The Von Neumann-type Processing

CPEG421-2001-F-Topic-3-II 28

begin for i = 1 … … endforend

begin for i = 1 … … endforend

Source Code

CompilerSequential Machine

Representation

CPU

Load

Processor

Page 29: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

A Multithreaded Architecture

CPEG421-2001-F-Topic-3-II 29

To Other PE’s

One PE

Page 30: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 30

McGill Data FlowArchitecture Model

(MDFA)

Page 31: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 31

n1

n2 n3

stor

e

store

fetchfetch

n1

n2 n3

store

fetch fetch

Argument –flow Principle Argument –fetching Principle

Page 32: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

A Dataflow Program Tuple

CPEG421-2001-F-Topic-3-II 32

Program Tuple = { P-Code . S-Code }Program Tuple = { P-Code . S-Code }

P-CodeP-Code

N1: x = a + b;N2: y = c – d;N3: z = x * y;

S-CodeS-Code

22

33n1n1

a

b

22

33n2n2

c

d

22

33n1n1

IPUIPU ISUISU

Page 33: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

The McGill Dataflow Architecture Model

CPEG421-2001-F-Topic-3-II 33

Pipelined Instruction Processing Unit (PIPU)

Dataflow Instruction Scheduling Unit (DISU)

Enable Memory & Controller

Signal Processing

Fire Done

Page 34: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

The McGill Dataflow Architecture Model

CPEG421-2001-F-Topic-3-II 34

Pipelined Instruction Processing Unit (PIPU)

Dataflow Instruction Scheduling Unit (DISU)

Fire Done

Waiting Instructions

Enabled Instructions = PC

Important Features

Pipeline can be kept fully utilized provided that the program has sufficient parallelism

Page 35: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

The Scheduling Memory (Enable)

CPEG421-2001-F-Topic-3-II 35

Dataflow Instruction Scheduling Unit (DISU)

CONTROLLER

1 1

1 1

01

0 0

0 0

0

1 1

1

1 0

0 0

0 1

Signal Processing

Fire Done

Count Signal(s)

0 Waiting Instructions1 Enabled Instructions

Page 36: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 36

Advantages of the McGill Dataflow Architecture Model

• Eliminate unnecessary token copying and transmission overhead

• Instruction scheduling is separated from the main datapath of the processor (e.g. asynchronous, decoupled)

Page 37: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Von Neumann Threads as Macro Dataflow Nodes

CPEG421-2001-F-Topic-3-II 37

1

2

3

k

A sequence of instructions is “packed” into a macro-dataflow node

Synchronization is done at the macro-node level

Page 38: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 38

Hybrid Evaluation Von Neumann Style Instruction Execution” on

the McGill Dataflow Architecture• Group a “sequence” of dataflow instruction into a “thread” or

a macro dataflow node.• Data-driven synchronization among threads.• “Von Neumann style sequencing” within a thread.

Advantage:Preserves the parallelism among threads but avoids unnecessary fine-grain synchronization between instructions within a sequential thread.

Page 39: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

CPEG421-2001-F-Topic-3-II 39

What Do We Get?

• A hybrid architecture model without sacrificing the advantage of fine-grain parallelism!(latency-hiding, pipelining support)

Page 40: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

A Realization of the Hybrid Evaluation

CPEG421-2001-F-Topic-3-II 40

Pipelined Instruction Processing Unit (PIPU)

Dataflow Instruction Scheduling Unit (DISU)

Fire Done

Shortcut

1 2 k

Von Neumann bitVon Neumann bit