Intro Chp1


  • 8/18/2019 Intro Chp1

    1/109

    B.Tech – IIIrd


    CO322: PARALLEL PROCESSING AND ARCHITECTURE

    (EIS – II)

    L – 3, T – 0, P – 0, C – 3; 1 lecture/week [from my side]

    30 Marks Midsem

    50 Marks Endsem

    10 Marks Class tests/Quizzes/Assignments

    10 Marks Attendance

    100 Marks Total



    Parallel computer model - 4

    The state of Computing

    Multiprocessors and Multicomputers

    Multivector and SIMD Computers

    Program and network Properties - 4

    Conditions of parallelism

    Program Partitioning and scheduling

    Program Flow Mechanism

    System Interconnect Architecture


    Principles of scalable performance - 4

    Performance Metrics and Measures

    Parallel Processing Applications

    Speedup Performance Laws Scalability Analysis and Approaches

    Processors and Memory Hierarchy - 4

    Advanced Processor Technology

    Superscalar and vector Processors Memory Hierarchy Technology

    Virtual Memory Technology


    Multiprocessors and Multicomputers

    Multiprocessor system Interconnects

    Cache Coherence and synchronization

    Message Passing Mechanism

    Multivector and SIMD Computers

    Vector Processing Principles,

    Multivector Multiprocessors

    Compound Vector Processing

    SIMD Computer Organization,

    The Connection Machine CM-5.


    Scalable Multithreaded and dataflow Architecture

    Latency Hiding Techniques,

    Principles Of Multithreading,

    Fine Grain MultiComputers,

    Scalable and Multithreaded Architecture,

    Dataflow and Hybrid Architectures.

    Multicore Programming

    Single Core Processor Fundamentals,

    Introduction to Multi Core Architecture, System Overview of Threading,

    Fundamental Concepts of Parallel Programming,

    Threading and Parallel Programming


    1) Kai Hwang, F. Briggs, "Computer Architecture and Parallel Processing", McGraw-Hill International Edition, Reprint 2006.

    2) M. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", 1/E, Jones and Bartlett, 1995.

    3) Harry F. Jordan, "Fundamentals of Parallel Processing", 1/E, 2002.

    4) Hesham El-Rewini and Mostafa Abd-El-Barr, "Advanced Computer Architecture and Parallel Processing", Wiley-Interscience, 2005.

    5) Shameem Akhter & Jason Roberts, "Multi-Core Programming", Intel Press, 2006.



    The State of Computing

    Multiprocessors and Multicomputer

    Multivector and SIMD Computers


    Parallel processing

    It is a form of processing in which many calculations are carried out simultaneously.

    Driven by increasing demand for higher performance, lower costs, and nonstop productivity in real-life applications.

    Concurrent events take place in today's high-performance computers

    ▪ due to the common practice of multiprogramming, multiprocessing, or multicomputing.


    Processor = a programmable computing element that runs stored programs written using a pre-defined instruction set.

    A parallel computer (or multiple-processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks.
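The idea of dividing a problem into cooperating parallel tasks can be sketched in Python. This is an illustrative assumption, not from the slides: the chunking scheme, the worker count, and the use of threads in place of real processors are all hypothetical choices made just to show the decomposition.

```python
# Sketch: divide one large summation into parallel tasks and combine results.
# Chunk size, worker count, and use of threads are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each processing element solves its own sub-problem.
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Partition the problem into one task per worker, then combine.
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(1, 101))))  # → 5050
```

On a real parallel machine the sub-sums would run on separate processors; the partition/compute/combine structure is the point of the sketch.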


    Parallelism appears in various forms

    ▪ lookahead, pipelining, vectorization, concurrency, data parallelism, partitioning, interleaving, overlapping, replication, timesharing, spacesharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels.

    These forms model the physical architectures of parallel computers: vector supercomputers, multiprocessors, multicomputers, and massively parallel processors.


    Modern computers are equipped with powerful hardware facilities driven by extensive software packages.

    Historical milestones in the development of computers

    Crucial hardware and software elements

    Identifying and analyzing the performance of computers


    Computers have gone through two major stages of development: mechanical and electronic.

    Zuse's and Aiken's machines were designed for general-purpose computations.

    Computing and communication were carried out with moving mechanical parts, which limited the computing speed and reliability of mechanical computers.


    Modern computers

    Electronic components

    Moving parts in mechanical computers replaced by high-mobility electrons in electronic computers

    Information transmission by mechanical gears or levers replaced by electric signals


    Modern computer is an integrated system consisting

    of

    machine hardware, an instruction set, system software, application programs, and user interfaces

    The use of a computer is driven by real life problems

    demanding

    Numerical computing, transaction processing, and logical

    reasoning


    Computing Problems

    Numerical Computing: Science and engineering numerical problems demand intensive integer and floating-point computations

    Logical Reasoning: Artificial intelligence (AI) demands logic inferences, symbolic manipulations, and large space searches


    Algorithms and Data Structures

    Special algorithms and data structures are needed to specify the computations and communications involved in computing problems

    Most numerical algorithms are deterministic, using regular data structures

    Symbolic processing may use heuristics or non-deterministic searches

    Parallel algorithm development requires interdisciplinary interaction


    Hardware Resources

    Processors, memory and peripheral devices

    Special hardware interfaces built to I/O devices

    Software interface programs

    Processor connectivity (system interconnects, network) and memory organization influence the system architecture


    Operating System

    An effective operating system manages the allocation and

    deallocation of resources during the execution of user

    programs

    Mapping to match algorithmic structures with hardware

    architecture and vice versa: processor scheduling, memory

    mapping, interprocessor communication

    Parallelism utilization possible at:

    1- algorithm design,

    2- program writing,

    3- compilation, and

    4- run time


    System Software Support

    Needed for the development of efficient programs in high-level languages.

    HLL to object code – compiler

    Assembly code to machine code – assembler

    A loader is used to initiate the program execution


    Compiler Support – 3 approaches

    Preprocessor

    ▪ Uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs

    Precompiler

    ▪ Program flow analysis, dependence checking, and limited optimizations toward parallelism detection

    Parallelizing compiler

    ▪ A fully developed parallelizing compiler can automatically detect parallelism in source code and transform sequential code into parallel constructs


    Computing Resources and Computation Allocation

    The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.

    What portions of the computation and data are allocated or mapped to each PE

    Data access, Communication and Synchronization

    How the processing elements cooperate and communicate.

    How data is shared/transmitted between processors.

    Abstractions and primitives for cooperation/communication and

    synchronization.

    The characteristics and performance of parallel system network

    (System interconnects).


    Parallel Processing Performance and Scalability Goals

    Maximize performance enhancement of parallelism:

    Maximize Speedup.

    ▪ By minimizing parallelization overheads and balancing workload on

    processors

    Scalability of performance to larger systems


    Application demands:

    More computing cycles/memory needed

    Scientific/Engineering computing: CFD, Biology, Chemistry, Physics, ...

    General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming

    Mainstream multithreaded programs are similar to parallel programs


    Challenging Applications in Applied Science/Engineering

    Astrophysics

    Atmospheric and Ocean Modeling

    Bioinformatics

    Biomolecular simulation: Protein folding

    Computational Chemistry

    Computational Physics

    Computer vision and image understanding

    Data Mining and Data-intensive Computing

    Engineering analysis (CAD/CAM)

    Global climate modeling and forecasting

    Military applications

    Quantum chemistry

    VLSI design

    Such applications have very high computational and memory requirements that cannot be met with single-processor architectures.

    Many applications contain a large degree of computational parallelism.


    The study of architecture involves hardware organization and programming/software requirements

    Assembly language programmer point of view

    ▪ Instruction set, which includes opcodes (operation codes), addressing modes, registers, virtual memory

    Hardware implementation point of view

    ▪ CPUs, caches, buses, microcode, pipelines, physical memory

    Architecture covers the ISA plus the machine implementation


    The von Neumann architecture was built as a sequential machine executing scalar data

    Sequential computers improved from

    bit-serial to word-parallel operations

    fixed-point to floating-point operations

    The von Neumann architecture is slow due to sequential execution of instructions in programs


    Lookahead

    Techniques introduced to prefetch instructions in order to

    overlap I/E (instruction fetch/decode and execution)

    operations and to enable functional parallelism

    Functional parallelism

    To use multiple functional units simultaneously

    To practice pipelining at various processing levels


    Pipelining includes

    pipelined instruction execution

    pipelined arithmetic computations

    pipelined memory-access operations

    Pipelining is especially suited to performing identical operations repeatedly over vector data strings

    Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors


    SISD

    Conventional sequential machines


    SISD machines are also called scalar processors, i.e., they execute one instruction at a time, and each instruction has only one set of operands.

    Single instruction: only one instruction stream is

    being acted on by the CPU during any one clock

    cycle

    Single data: only one data stream is being used as

    input during any one clock cycle


    SIMD

    Vector computers are equipped with scalar and vector

    hardware


    Single instruction: All processing units execute the same instruction issued by the control unit at any given clock cycle, as shown in the figure, where multiple processors execute the instruction given by one control unit.

    Multiple data: Each processing unit can operate on a different data element, as shown in the figure, where each processing unit is supplied with its own data element.
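The SIMD idea above can be shown with a minimal pure-Python sketch (a hypothetical illustration, not the machine's actual mechanism): one "instruction" is broadcast by the control unit, and every processing element applies it to its own data element in lockstep.

```python
# SIMD sketch: one instruction stream, multiple data elements.
# The PE count and the example "add 10" instruction are illustrative assumptions.

def simd_step(instruction, data_elements):
    # The control unit broadcasts a single instruction; each PE applies it
    # to its own local data element in the same step.
    return [instruction(x) for x in data_elements]

# "Add 10" broadcast to 8 processing elements:
result = simd_step(lambda x: x + 10, [0, 1, 2, 3, 4, 5, 6, 7])
print(result)  # → [10, 11, 12, 13, 14, 15, 16, 17]
```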


    MIMD


    Multiple Instruction: every processor may be executing a different instruction stream

    Multiple Data: every processor may be working with a different data stream; as shown in the figure, the multiple data streams are provided by shared memory.

    Can be categorized as loosely coupled or tightly coupled depending on the sharing of data and control

    Execution can be synchronous or asynchronous, deterministic or non-deterministic


    MISD

    The same data stream flows through a linear array of

    processors executing different instruction streams


    A single data stream is fed into multiple processing units.

    Each processing unit operates on the data independently via independent instruction streams, as shown in the figure.

    The single data stream is forwarded to different processing units, each connected to its own control unit, and each processing unit executes the instructions given to it by the control unit to which it is attached.


    (Taxonomy)

    Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines or uniprocessors.

    Single Instruction stream over Multiple Data streams (SIMD): Vector computers, arrays of synchronized processing elements.

    Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution.

    Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers or multiprocessor systems, e.g., a distributed-memory multiprocessor system.

    CU = Control Unit, PE = Processing Element, M = Memory


    Parallel computers execute programs in MIMD mode

    Two major classes of parallel computers: shared-memory multiprocessors and message-passing multicomputers

    They differ in memory sharing and the mechanisms used for interprocessor communication


    Multiprocessor system

    Processors communicate with each other through shared variables in a common memory

    Multicomputer system

    Each computer node has a local memory, unshared with other nodes

    Interprocessor communication is done through message passing among the nodes


    Vector processors (implicit)

    Vector instructions

    Equipped with multiple vector pipelines

    Concurrently used under hardware or firmware control

    Two families of pipelined (explicit) vector processors:

    Memory-to-memory architecture

    ▪ Pipelined flow of vector operands directly from the memory to pipelines and then back to the memory

    Register-to-register architecture

    ▪ Uses vector registers to interface between the memory and functional pipelines


    Hardware configurations differ from machine to machine

    (even with the same Flynn classification)

    Address spaces of processors

    vary among different architectures, and

    depend on memory organization, and

    should match target application domain.

    The communication model and language environments

    should ideally be machine-independent to allow porting to many computers with minimum conversion costs.

    Application developers prefer architectural transparency


    Programmability depends on the programming environment provided to the users

    Conventional computers are used in a sequential programming environment with tools developed for a uniprocessor computer

    Parallel computers need

    parallel tools that allow specification or easy detection of parallelism

    operating systems that can perform parallel scheduling of concurrent events, shared memory allocation, and shared peripheral and communication links.


    Use a conventional language (like C, Fortran, Lisp, or Pascal)

    to write the program

    Use a parallelizing compiler to translate the source code into

    parallel code

    The compiler must detect parallelism and assign target

    machine resources

    Success relies heavily on the quality of the compiler.


    Programmers write explicit parallel code using parallel dialects of common languages

    The compiler has a reduced need to detect parallelism, but must still preserve existing parallelism and assign target machine resources


    (a) Implicit parallelism: source code written by the programmer in sequential languages (C, C++, FORTRAN, LISP); a parallelizing compiler produces parallel object code, executed by the runtime system.

    (b) Explicit parallelism: source code written by the programmer in concurrent dialects of C, C++, FORTRAN, LISP; a concurrency-preserving compiler produces concurrent object code, executed by the runtime system.


    Parallel extensions of conventional high-level languages

    Integrated environments provide

    different levels of program abstraction

    validation, testing, and debugging

    performance prediction and monitoring

    visualization support to aid program development and performance measurement

    graphics display and animation of computational results


    Shared-Memory Multiprocessors

    Distributed-Memory Multicomputers


    Shared-memory parallel computers generally let all processors access all memory as a global address space.

    Multiple processors can operate independently but share the same memory resources.

    Changes in a memory location effected by one processor are visible to all other processors.

    Shared-memory machines can be divided into main classes based upon memory access times: UMA, NUMA, and COMA.


    Three shared-memory multiprocessor models

    ▪ The Uniform Memory Access (UMA) model,

    ▪ The Non Uniform Memory Access (NUMA) model,

    ▪ The Cache Only Memory Architecture (COMA) model

    Models differ in how the memory and peripheral resources are shared or distributed.


    The UMA Model

    The physical memory is uniformly shared by all the processors

    All processors have equal access time to all memory words

    Each processor may use a private cache.

    Peripherals are also shared

    Called tightly coupled systems due to the high degree of resource sharing

    The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network

    Symmetric multiprocessor – all processors are equally capable of running the executive programs

    Asymmetric multiprocessor – only one or a subset of processors have executive capability


    The NUMA Model

    A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.

    The shared memory is physically distributed to all processors as local memories

    The collection of all local memories forms a global address space accessible by all processors

    Access to remote memory incurs a delay through the interconnection network


    Globally shared memory

    Three memory-access patterns

    ▪ The fastest is local memory access

    ▪ The next is global memory access

    ▪ The slowest is access of remote memory


    Hierarchically structured multiprocessor

    Processors are divided into several clusters

    Each cluster is itself a UMA or NUMA multiprocessor

    Clusters are connected to global shared-memory modules

    The entire system is considered a NUMA machine

    All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules

    All clusters have equal access to the global memory

    Access time to the cluster memory is shorter than that to the global memory


    The COMA Model

    A special case of a NUMA machine

    The distributed main memories are converted to caches

    There is no memory hierarchy at each processor node

    All the caches form a global address space

    Remote cache access is assisted by distributed cache directories (D)

    Initial data placement is not critical



    Other variants of shared-memory multiprocessors

    CC-NUMA

    Cache-coherent non-uniform memory access

    The model can be specified with distributed shared memory and cache directories

    CC-COMA

    Cache-coherent COMA


    The system consists of

    multiple computers, called nodes

    interconnected by a message-passing network

    which provides point-to-point static connections among the nodes

    Each node is an autonomous computer consisting of a processor, local memory, and attached disks or I/O peripherals

    All local memories are private and are accessible only by local processors

    NORMA - no-remote-memory-access machines

    Internode communication is carried out by passing messages through the static connection network


    Multicomputers use hardware routers to pass messages

    A computer node is attached to each router

    Boundary routers may be connected to I/O and peripheral devices

    Message passing between any two nodes involves a sequence of routers and channels

    Mixed types of nodes are allowed in a heterogeneous multicomputer

    Internode communication is achieved through compatible data representations and message-passing protocols


    The first generation was based on processor-board technology, using hypercube architecture and software-controlled message switching

    The second generation was implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grain distributed computing


    Important issues for multicomputers

    Common topologies include the ring, tree, mesh, torus, hypercube, cube-connected cycles, etc.

    Various communication patterns: one-to-one, broadcasting, permutations, and multicast patterns

    Message-routing schemes, network flow-control strategies, deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques.


    Introduce supercomputers and parallel processors for vector processing and data parallelism

    Supercomputers are classified as pipelined vector machines, using a few powerful processors equipped with vector hardware, or SIMD computers emphasizing massive data parallelism


    A vector computer is often built on top of a scalar processor

    The vector processor is attached to the scalar processor as an optional feature

    Program and data are loaded into main memory through a host computer

    All instructions are first decoded by the scalar control unit


    If the decoded instruction is a scalar operation or a program control operation,

    it will be directly executed by the scalar processor using the scalar functional pipelines

    If the decoded instruction is a vector operation,

    it will be sent to the vector control unit (CU)

    The CU supervises the flow of vector data between the main memory and the vector functional pipelines

    A number of vector functional pipelines may be built into a vector processor


    Register-to-register architecture

    Vector registers are

    ▪ used to hold the vector operands, intermediate and final vector results

    ▪ programmable in user instructions

    ▪ each equipped with a component counter, which keeps track of the component registers used in successive pipeline cycles.

    The length of each vector register is usually fixed

    ▪ 64-bit component registers in a vector register in a Cray Series supercomputer

    ▪ The vector functional pipelines retrieve operands from and put results into the vector registers


    Memory-to-memory architecture

    Differs in the use of a vector stream unit to replace the vector registers

    Vector operands and results are directly retrieved from the main memory in superwords

    e.g., 512 bits as in the Cyber 205


    An operational model of an SIMD computer is specified by a 5-tuple: M = { N, C, I, M, R }

    N - the number of processing elements (PEs) in the machine.

    C - the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions

    I - the set of instructions broadcast by the CU to all PEs for parallel execution

    M - the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.

    R - the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
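The masking component of the 5-tuple can be made concrete with a small sketch. The PE count, operand values, and the doubling instruction below are illustrative assumptions, not from the text: a mask partitions the PEs into enabled and disabled subsets, and only enabled PEs execute the broadcast instruction.

```python
# Sketch of SIMD masking: only enabled PEs execute the broadcast instruction.
# N, the operands, and the mask are illustrative assumptions.
N = 4                                  # number of processing elements
pe_data = [5, 6, 7, 8]                 # one local operand per PE
mask = [True, False, True, False]      # masking scheme: enabled/disabled PEs

def broadcast(instruction, data, enabled):
    # Disabled PEs keep their data unchanged during this step.
    return [instruction(x) if on else x for x, on in zip(data, enabled)]

pe_data = broadcast(lambda x: x * 2, pe_data, mask)
assert len(pe_data) == N
print(pe_data)  # → [10, 6, 14, 8]
```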


    Performance Measures

    The ideal performance of a computer system

    demands a perfect match between machine

    capability and program behavior.

    Machine capability can be enhanced with better:

    ▪ Hardware technology,

    ▪ Innovative architectural features, and

    ▪ Efficient resource management.


    Program behavior is difficult to predict due to its heavy dependence on application and run-time conditions.

    There are many factors affecting program behavior, including:

    Algorithm design, Data structures,

    Language efficiency,

    Programmer skill, and

    Compiler technology.


    Introduce some fundamental factors for projecting the performance of computers

    They can be used to guide system architects in designing better machines

    And to educate programmers and compiler writers in optimizing codes for more efficient execution


    The simplest measure of program performance is the turnaround time, which includes disk and memory accesses, input and output activities, compilation time, OS overhead, and CPU time.

    In order to shorten the turnaround time, one must reduce all these time factors.

    In a multiprogrammed computer, the I/O and system overheads of a given program may overlap with the CPU times required in other programs.

    It is therefore fair to compare just the total CPU time needed for program execution.


    Performance Measures

    Response Time (Execution time, Latency): The time elapsed between the start and the completion of an event.

    Throughput (Bandwidth): The amount of work done in a given time.

    Performance: The number of events occurring per unit of time.

    ▪ Note that execution time is the reciprocal of performance: lower execution time implies higher performance.


    Performance Measures

    A system X is faster than Y if, for a given task, the response time on X is lower than on Y:

    n = Execution timeY / Execution timeX = (1 / PerformanceY) / (1 / PerformanceX) = PerformanceX / PerformanceY


    Consequently, the statement that X is n% faster than Y means:

    PerformanceX / PerformanceY = Execution timeY / Execution timeX = 1 + n/100

    and hence,

    n = 100 × (Execution timeY − Execution timeX) / Execution timeX


    Example: Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds. Therefore:

    Execution timeB / Execution timeA = 1 + n/100, and hence,

    n = 100 × (Execution timeB − Execution timeA) / Execution timeA = 100 × (15 − 10) / 10 = 50

    so machine A is 50% faster than machine B.
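The same "n% faster" calculation can be sketched in Python (the function name is an illustrative assumption):

```python
# Sketch: n = 100 * (slower execution time - faster execution time) / faster time
def percent_faster(time_slow, time_fast):
    return 100 * (time_slow - time_fast) / time_fast

# Machine A: 10 s, machine B: 15 s, as in the example above:
print(percent_faster(15, 10))  # → 50.0
```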


    Clock Rate

    The processor is driven by a clock with a constant cycle time (t).

    The inverse of the cycle time is the clock rate (f = 1/t).

    CPI - cycles per instruction

    The size of a program is its instruction count (Ic) - the number of machine instructions to be executed.

    Different instructions require different numbers of clock cycles to execute


    CPI is an important parameter for measuring the time needed to execute each instruction

    For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program.

    An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time.

    CPI will be taken as an average value for a given instruction set and a given program mix.


    Let us define the average number of clock cycles per instruction (CPI) as:

    CPI = CPU clock cycles for a program / Ic
        = ( sum for i = 1..n of CPIi * Ii ) / Ic

    where Ii is the number of times instruction i is executed and CPIi is the average number of clock cycles for instruction i.
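A direct translation of the formula, with hypothetical instruction counts chosen only for illustration:

```python
# Average CPI = sum_i(CPI_i * I_i) / Ic, over instruction types i
counts = {"alu": 50_000, "load": 20_000, "branch": 10_000}  # I_i (hypothetical)
cycles = {"alu": 1, "load": 2, "branch": 3}                 # CPI_i (hypothetical)

Ic = sum(counts.values())                              # instruction count
cpi = sum(cycles[t] * counts[t] for t in counts) / Ic  # weighted average
print(cpi)  # 1.5
```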


    CPI and clock rate depend on the technology and architecture of the machine.

    Instruction count depends on the instruction set of the machine and on compiler technology.


    The CPU time (T), or execution time, is the time needed to execute a given program, excluding the waiting time for I/O or for other running programs.

    CPU time is further divided into user CPU time and system CPU time.

    The CPU time is estimated as:

    T = Ic * CPI * t = ( sum for i = 1..n of CPIi * Ii ) * t


    The execution of an instruction requires going through a cycle of events involving instruction fetch, decode, operand(s) fetch, execution, and storing the result(s):

    p is the number of processor cycles needed to decode and execute the instruction

    m is the number of memory references needed

    k is the ratio between the memory cycle time and the processor cycle time (a latency factor: how much slower the memory is with respect to the CPU)

    T = Ic * CPI * t = Ic * (p + m*k) * t
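Putting numbers into T = Ic * (p + m*k) * t (all values are hypothetical, chosen only to illustrate the decomposition):

```python
Ic = 200_000  # instructions executed (hypothetical)
p = 4         # processor cycles to decode and execute one instruction
m = 1         # memory references per instruction
k = 10        # memory cycle time / processor cycle time
t = 1 / 40e6  # cycle time of a 40 MHz clock, in seconds

T = Ic * (p + m * k) * t  # effective CPI is p + m*k = 14
print(T)      # ~0.07 seconds
```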


    Now let C be the total number of cycles required to execute a program:

    C = Ic * CPI

    The time to execute the program is then:

    T = C * t = C / f

    T = Ic * CPI * t = Ic * CPI / f


    MIPS - Million Instructions Per Second

    A measure of processor speed:

    MIPS = Ic / (T * 10^6)
         = f / (CPI * 10^6)
         = f * Ic / (C * 10^6)
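All three forms of the MIPS formula should agree; a quick consistency check with made-up values:

```python
f = 40e6      # clock rate in Hz (hypothetical)
CPI = 1.6     # average cycles per instruction (hypothetical)
Ic = 100_000  # instruction count (hypothetical)

T = Ic * CPI / f  # execution time in seconds
C = Ic * CPI      # total clock cycles

mips_from_time = Ic / (T * 1e6)
mips_from_cpi = f / (CPI * 1e6)
mips_from_cycles = f * Ic / (C * 1e6)
print(mips_from_cpi)  # ~25 MIPS; all three expressions agree
```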


    MFLOPS - Million Floating Point Operations Per Second

    Another performance measure used to evaluate computers:

    MFLOPS = Number of floating-point operations in a program / (Execution time * 10^6)
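The MFLOPS formula in code, with an invented operation count and runtime:

```python
fp_ops = 4_000_000  # floating-point operations in the program (hypothetical)
T = 0.5             # execution time in seconds (hypothetical)

mflops = fp_ops / (T * 1e6)
print(mflops)  # 8.0
```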


    Throughput Rate

    The number of programs executed per unit time.

    Ws = system throughput (programs/second), the rate at which the system as a whole completes programs.

    Wp = CPU throughput = 1 / T = f / (Ic * CPI) = MIPS * 10^6 / Ic, based on the MIPS rate and the average program length Ic.

    In general, Ws <= Wp.


    Throughput (W/Tn) - the execution rate on an n-processor system, measured in FLOPs per unit time or instructions per unit time.

    Speedup (Sn = T1/Tn) - how much faster an actual machine with n processors is compared to one with a single processor.

    Efficiency (En = Sn/n) - the fraction of the maximum speedup achieved by n processors.
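The speedup and efficiency definitions computed together (the timings are hypothetical):

```python
T1 = 100.0  # time on 1 processor, seconds (hypothetical)
Tn = 16.0   # time on n processors, seconds (hypothetical)
n = 8       # number of processors

Sn = T1 / Tn  # speedup: 6.25
En = Sn / n   # efficiency: 0.78125, i.e. ~78% of the ideal 8x speedup
print(Sn, En)
```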


    Scalability: the attributes of a computer system which allow it to be scaled linearly up or down in size, to handle smaller or larger workloads, or to obtain proportional decreases or increases in speed on a given application.

    Good scalability requires a good algorithm and a machine with the right properties.

    Thus, in general, there are five performance factors (Ic, p, m, k, t) which are influenced by four system attributes.


    System attributes versus performance factors (Ic, and CPI decomposed into p, m, k, plus the cycle time t); an X marks the factors each attribute influences:

    System Attributes                 | Ic |  p |  m |  k |  t
    ----------------------------------+----+----+----+----+----
    Instruction-set architecture      |  X |  X |    |    |
    Compiler technology               |  X |  X |  X |    |
    CPU implementation & technology   |    |  X |    |    |  X
    Memory hierarchy                  |    |    |    |  X |  X


    A benchmark program is executed on a 40 MHz processor. The benchmark program has the following statistics:

    Instruction Type   | Instruction Count | Clock Cycle Count
    -------------------+-------------------+------------------
    Arithmetic         | 45000             | 1
    Branch             | 32000             | 2
    Load/Store         | 15000             | 2
    Floating Point     | 8000              | 2

    Calculate the average CPI, the MIPS rate, and the execution time for the above benchmark program.

    Solved in class
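Since every quantity is given in the table, the answers can be checked with a few lines (a sketch of the arithmetic, not necessarily the in-class derivation):

```python
f = 40e6  # 40 MHz clock
counts = {"arithmetic": 45_000, "branch": 32_000, "load_store": 15_000, "float": 8_000}
cycles = {"arithmetic": 1, "branch": 2, "load_store": 2, "float": 2}

Ic = sum(counts.values())                              # 100000 instructions
cpi = sum(cycles[t] * counts[t] for t in counts) / Ic  # average CPI
mips = f / (cpi * 1e6)                                 # MIPS rate
T = Ic * cpi / f                                       # execution time, seconds
print(cpi, mips, T)  # 1.55, ~25.8 MIPS, ~3.9 ms
```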


    (Book Problem 1.4) Solved in class


    (Book Problem 1.6) Solved in class


    Operation  | Frequency | CPI
    -----------+-----------+-----
    ALU ops    | 35%       | 1
    Loads      | 25%       | 2
    Stores     | 15%       | 2
    Branches   | 25%       | 3

    Compute the average CPI.

    Solved in class
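With frequencies instead of raw counts, the average CPI is just a weighted sum (a sketch of the computation):

```python
freq = {"alu": 0.35, "load": 0.25, "store": 0.15, "branch": 0.25}  # from the table
cpi_per_op = {"alu": 1, "load": 2, "store": 2, "branch": 3}

# Average CPI = sum over operation types of frequency * CPI
avg_cpi = sum(freq[op] * cpi_per_op[op] for op in freq)
print(avg_cpi)  # ~1.9
```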


    For the purpose of solving a given application problem, you benchmark a program on two computer systems.

    On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.

    On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.

    In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.

    A. Compute the relative frequency of occurrence of each type of instruction executed in both systems.

    B. Find the average CPI for each system.
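The two sub-questions reduce to a few divisions, so the answers can be checked mechanically (counts in millions; a sketch, not the required solution method):

```python
cpi = {"alu": 1, "load": 3, "branch": 5}  # cycles per instruction type

def stats(counts):
    """Return (relative frequencies, average CPI) for an instruction mix."""
    total = sum(counts.values())
    freqs = {op: c / total for op, c in counts.items()}
    avg_cpi = sum(cpi[op] * c for op, c in counts.items()) / total
    return freqs, avg_cpi

counts_A = {"alu": 80, "load": 40, "branch": 25}  # millions, system A
counts_B = {"alu": 50, "load": 50, "branch": 40}  # millions, system B

freqs_A, cpi_A = stats(counts_A)  # CPI_A = 325/145, about 2.24
freqs_B, cpi_B = stats(counts_B)  # CPI_B = 400/140, about 2.86
print(freqs_A, cpi_A)
print(freqs_B, cpi_B)
```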