
Government Engineering College, Thrissur

Department of Electronics and Communication

    Seminar Report 2004

    HYPER-THREADING

    Presented by

    MANU SAGAR K.V

    REG NO: 01-633

VIIth SEMESTER


    ACKNOWLEDGEMENT

    I extend my sincere gratitude to Prof. N Premachandran,

    Principal, and Prof. K.P Indira Devi, Head of Electronics & Communication

    Department, Govt. Engineering College Thrissur, for providing me with the

    necessary infrastructure in completing my seminar. I would like to convey a

    deep sense of gratitude to the seminar coordinator Mrs. C.R Muneera, Asst.

    Prof Dept. of Electronics & Communication, for giving me unstinted support on

my endeavor in the development of this report. I am also grateful to Mr. C.D Anil Kumar, Lecturer, Dept. of Electronics & Communication, for his timely advice.


    CONTENTS

ABSTRACT
INTRODUCTION
PROCESSOR MICRO-ARCHITECTURE
CONCEPT OF SIMULTANEOUS MULTITHREADING
CONVERTING TLP TO ILP VIA SMT
SMT AND MULTIPROCESSORS
SYNCHRONIZATION MECHANISMS AND MEMORY HIERARCHY
IMPLICITLY MULTITHREADED PROCESSORS
POWER CONSUMPTION IN MULTITHREADED PROCESSORS
OPERATING SYSTEM CONSIDERATIONS
HT PLATFORM REQUIREMENTS
CONCLUSION
REFERENCES


Hyper-Threading

    Abstract

Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
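From the software side, the doubling of logical processors is directly visible to anything that counts CPUs. As a minimal illustration (mine, not the report's; the counts shown assume a hypothetical 2-core machine with Hyper-Threading enabled):

    import os

    # With Hyper-Threading enabled, each physical processor exposes two
    # logical processors, so the OS-visible CPU count doubles.
    logical = os.cpu_count()
    print(f"logical processors visible to the OS: {logical}")  # e.g., 4

    # Threads are scheduled onto these logical processors exactly as they
    # would be onto physical ones; no application changes are required.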

The first implementation of Hyper-Threading Technology was done on the Intel Xeon processor MP. In this implementation there are two logical processors on each physical processor. The logical processors have their own independent architecture state, but they share nearly all the physical execution and hardware resources of the processor. The goal was to implement the technology at minimum cost while ensuring forward progress on each logical processor, even when the other is stalled, and to deliver full performance even when there is only one active logical processor.

The potential for Hyper-Threading Technology is tremendous; the current implementation has only just begun to tap into this potential. Hyper-Threading Technology is expected to be viable from mobile processors to servers; its introduction into market segments other than servers is


    only gated by the availability and prevalence of threaded applications and

    workloads in those markets.

    INTRODUCTION

    The amazing growth of the Internet and telecommunications is powered

    by ever-faster systems demanding increasingly higher levels of processor

    performance. To keep up with this demand, we cannot rely entirely on

traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly complex, with more transistors and higher power consumption. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.

    Making Hyper-Threading Technology a reality was the result

    of enormous dedication, planning, and sheer hard work from a large number of

    designers, validators, architects, and others. There was incredible teamwork

    from the operating system developers, BIOS writers, and software developers

who helped with innovations and provided support for many decisions that were

    made during the definition process of Hyper-Threading Technology. Many

    dedicated engineers are continuing to work with our ISV partners to analyze

    application performance for this technology. Their contributions and hard work

    have already made and will continue to make a real difference to our customers.


    PROCESSOR MICRO-ARCHITECTURE

    Traditional approaches to processor design have focused on

    higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques

to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can

    greatly improve performance by increasing the number of instructions that can

    be executed each second. Because there will be far more instructions in-flight in

    a superpipelined microarchitecture, handling of events that disrupt the pipeline,

e.g., cache misses, interrupts, and branch mispredictions, can be costly. ILP

    refers to techniques to increase the number of instructions executed each clock

    cycle. For example, a super-scalar processor has multiple parallel execution

    units that can process instructions simultaneously. With super-scalar execution,

    several instructions can be executed each clock cycle. However, with simple

in-order execution, it is not enough to simply have multiple execution units.

    The challenge is to find enough instructions to execute.

    One technique is out-of-order execution where a large window of

    instructions is simultaneously evaluated and sent to execution units, based on

instruction dependencies rather than program order.

Accesses to DRAM

    memory are slow compared to execution speeds of the processor. One

    technique to reduce this latency is to add fast caches close to the processor.

    Caches can provide fast memory access to frequently accessed data or

    instructions. However, caches can only be fast when they are small. For this

    reason, processors often are designed with a cache hierarchy in which fast,

    small caches are located and operated at access latencies very close to that of


    the processor core, and progressively larger caches, which handle less

    frequently accessed data or instructions, are implemented with longer access

    latencies. However, there will always be times when the data needed will not be

in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss.
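The payoff of such a hierarchy can be made concrete with the standard average memory access time relation (a textbook formula, not one from the report), here with made-up latencies and miss rates:

    AMAT = t_{L1} + m_{L1} \cdot (t_{L2} + m_{L2} \cdot t_{mem})

For example, assuming t_{L1} = 2 cycles, m_{L1} = 0.1, t_{L2} = 10 cycles, m_{L2} = 0.2, and t_{mem} = 200 cycles, AMAT = 2 + 0.1(10 + 0.2 x 200) = 7 cycles, far closer to the L1 latency than to DRAM.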

The vast majority of techniques used to improve processor performance from one generation to the next are complex and often add significant die-size and power costs. These techniques increase performance, but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double the performance of the processor, due to limited parallelism in instruction flows. Similarly, simply doubling the clock rate does not double the performance, due to the number of processor cycles lost to branch mispredictions.

The figure shows the relative increase in performance and the costs, such as die size and power, over the last ten years on Intel processors. In order to isolate the microarchitecture impact, this comparison assumes that the four generations of processors are on the same silicon process technology and that the speed-ups are normalized to the performance of an Intel 486 processor.


Although we use Intel's processor history in this example, other high-performance processor manufacturers during this time period would show similar trends. Intel's processor performance, due to microarchitecture advances alone, has improved integer performance five- or six-fold. Most integer applications have limited ILP and the instruction flow can be hard to predict.

    Over the same period, the relative die size has gone up fifteen-fold, a three-

    times-higher rate than the gains in integer performance. Fortunately, advances

    in silicon process technology allow more transistors to be packed into a given

    amount of die area so that the actual measured die size of each generation

    microarchitecture has not increased significantly.

The relative power increased almost eighteen-fold during this period. Fortunately, there exist a number of known techniques to significantly reduce power consumption on processors, and there is much ongoing research in this area. However, current processor power dissipation is at

    the limit of what can be easily dealt with in desktop platforms and we must put

    greater emphasis on improving performance in conjunction with new

    technology, specifically to control power.

    CONVENTIONAL MULTI-THREADING

A look at today's software trends reveals that server applications

    consist of multiple threads or processes that can be executed in parallel. On-line

    transaction processing and Web services have an abundance of software threads

    that can be executed simultaneously for faster performance. Even desktop

    applications are becoming increasingly parallel. Intel architects have been

    trying to leverage this so-called thread-level parallelism (TLP) to gain a better

performance vs. transistor count and power ratio.

    In both the high-end and mid-range server markets,

    multiprocessors have been commonly used to get more performance from the

system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on


    multiple processors at the same time. These threads might be from the same

    application, from different applications running simultaneously, from operating

    system services, or from operating system threads doing background

maintenance. Multiprocessor systems have been used for many years, and high-end programmers are familiar with the techniques to exploit multiprocessors for

    higher performance levels.

    In recent years a number of other techniques to further

    exploit TLP have been discussed and some products have been announced. One

    of these techniques is chip multiprocessing (CMP), where two processors are

    put on a single die.

    The two processors each have a full set of execution and architectural

resources. The processors may or may not share a large on-chip cache. CMP is

    largely orthogonal to conventional multiprocessor systems, as you can have

    multiple CMP processors in a multiprocessor configuration. Recently

    announced processors incorporate two processors on each die. However, a CMP

    chip is significantly larger than the size of a single-core chip and therefore more

    expensive to manufacture; moreover, it does not begin to address the die size

    and power considerations.

    TIME-SLICE MULTI-THREADING

    Time-slice multithreading is where the processor switches

between software threads after a fixed time period. Quite a bit of what a CPU does is illusion. For instance, modern out-of-order processor architectures don't actually execute code sequentially in the order in which it was written. An OOE (out-of-order execution) architecture takes code that was written and compiled to be executed in a specific order, reschedules the sequence of instructions (if possible) so that they make maximum use of the processor resources, executes them, and then arranges them back in their original order so that the results can be written out to memory. To the

programmer and the user, it looks as if an ordered, sequential stream of instructions went into the CPU and an identically ordered, sequential stream of


    computational results emerged. Only the CPU knows in what order the

program's instructions were actually executed, and in that respect the processor

    is like a black box to both the programmer and the user.
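A toy model of that reordering (my sketch, not from the report; instruction names and latencies are invented) shows independent work overtaking a stalled instruction while retirement stays in program order:

    # Each entry: (name, result, operands, latency in cycles).
    program = [
        ("i0", "a", set(),      3),  # slow load: result ready after 3 cycles
        ("i1", "b", {"a"},      1),  # depends on the load
        ("i2", "c", set(),      1),  # independent work
        ("i3", "d", {"b", "c"}, 1),
    ]

    ready_at, issued, cycle = {}, [], 0
    while len(issued) < len(program):
        avail = {r for r, t in ready_at.items() if t <= cycle}
        for name, res, ops, lat in program:
            if name not in [n for n, _ in issued] and ops <= avail:
                issued.append((name, cycle))   # issue when operands ready
                ready_at[res] = cycle + lat
                break                          # one issue per cycle here
        cycle += 1

    print(issued)  # [('i0', 0), ('i2', 1), ('i1', 3), ('i3', 4)]
    # i2 overtakes the stalled i1, yet results are still retired i0..i3.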

    The same kind of sleight-of-hand happens when

    we run multiple programs at once, except that this time the operating system is

    also involved in the scam. To the end user, it appears as if the processor is

    "running" more than one program at the same time, and indeed, there actually

    are multiple programs loaded into memory. But the CPU can execute only one

of these programs at a time. The OS maintains the illusion of concurrency by

    rapidly switching between running programs at a fixed interval, called a time

    slice. The time slice has to be small enough that the user doesn't notice any

    degradation in the usability and performance of the running programs, and it

    has to be large enough that each program has a sufficient amount of CPU time

    in which to get useful work done. Most modern operating systems include a

way to change the size of an individual program's time slice. So a program with a larger time slice gets more actual execution time on the CPU relative to its lower-priority peers, and hence it runs faster.
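That policy is easy to mimic (a toy scheduler of my own, with invented program names and slice sizes; real OS schedulers are far more elaborate):

    from collections import deque

    def run(programs, slices, total_time):
        """programs: {name: remaining work}; slices: {name: slice length}."""
        queue, t = deque(programs), 0
        while t < total_time and queue:
            name = queue.popleft()
            quantum = min(slices[name], programs[name], total_time - t)
            programs[name] -= quantum      # "execute" for one time slice
            t += quantum
            if programs[name] > 0:
                queue.append(name)         # unfinished: back of the queue
        return programs

    left = run({"editor": 30, "compiler": 90},
               {"editor": 5, "compiler": 10}, total_time=60)
    print(left)  # {'editor': 10, 'compiler': 50}: the larger slice earned
                 # twice the CPU time, as a higher-priority program would.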

Time-slice multithreading can result in wasted execution slots, but it can effectively minimize the effects of long latencies to memory. Switch-on-event multithreading would instead switch threads on long-latency events such as cache misses. This approach can work well for server applications that have large numbers of cache misses and where the two threads are executing similar tasks. However, neither the time-slice nor the switch-on-event multithreading technique achieves optimal overlap of many sources of inefficient resource usage, such as branch mispredictions, instruction dependencies, etc.


    Finally, there is simultaneous multi-threading, where

    multiple threads can execute on a single processor without switching. The

    threads execute simultaneously and make much better use of the resources.

    This approach makes the most effective use of processor

resources: it maximizes performance relative to transistor count and power consumption.


    CONCEPT OF SIMULTANEOUS MULTI-THREADING

    Simultaneous Multi-threading is a processor design that

    combines hardware multithreading with superscalar processor technology to

    allow multiple threads to issue instructions each cycle. Unlike other hardware

    multithreaded architectures (such as the Tera MTA), in which only a single

    hardware context (i.e., thread) is active on any given cycle, SMT permits all

    thread contexts to simultaneously compete for and share processor resources.

    Unlike conventional superscalar processors, which suffer from a lack of per-

    thread instruction-level parallelism, simultaneous multithreading uses multiple

    threads to compensate for low single-thread ILP. The performance consequence

    is significantly higher instruction throughput and program speedups on a variety

    of workloads that include commercial databases, web servers and scientific

    applications in both multi programmed and parallel environments.

Simultaneous multithreading has already had an impact in both the academic and

    commercial communities.

    It has been found that the throughput gains from

    simultaneous multithreading can be achieved without extensive changes to a

    conventional wide-issue superscalar, either in hardware structures or sizes. It is

seen that the architecture for simultaneous multithreading achieves three

    goals: (1) it minimizes the architectural impact on the conventional superscalar

    design, (2) it has minimal performance impact on a single thread executing

    alone, and (3) it achieves significant throughput gains when running multiple

threads. Simultaneous multithreading enjoys a 2.5-fold improvement over an unmodified superscalar with the same hardware resources. This speedup is enabled by an advantage of multithreading previously unexploited in other architectures: the ability to favor, for fetch and issue, those threads that are using the processor most efficiently each cycle, thereby providing the "best"


instructions to the processor. Several heuristics that help to identify and use the best threads for fetch and issue have been found, and such heuristics can increase throughput by as much as 37%. Using the best fetch and issue alternatives, we then use bottleneck analysis to identify opportunities for further gains on the improved architecture.

Simultaneous Multithreading: Maximizing On-Chip Parallelism

    The increase in component density on modern microprocessors has led

    to a substantial increase in on-chip parallelism. In particular, modern

    superscalar RISCs can issue several instructions to independent functional units

    each cycle. However, the benefit of such superscalar architectures is ultimately

    limited by the parallelism available in a single thread.

Simultaneous multithreading is a technique permitting several independent

    threads to issue instructions to a super-scalar's multiple functional units in a

    single cycle. In the most general case, the binding between thread and

    functional unit is completely dynamic. We present several models of

    simultaneous multithreading and compare them with wide superscalar, fine-

    grain multithreaded, and single-chip, multiple-issue multiprocessing

architectures. To perform these evaluations, a simultaneous multithreaded architecture was simulated based on the DEC Alpha 21164 design, executing code generated by the Multiflow trace-scheduling compiler. The results show

    that: (1) No single latency-hiding technique is likely to produce acceptable

    utilization of wide superscalar processors. Increasing processor utilization will

    therefore require a new approach, one that attacks multiple causes of processor

    idle cycles. (2) Simultaneous multithreading is such a technique. With our


    machine model, an 8-thread, 8-issue simultaneous multithreaded processor

    sustains over 5 instructions per cycle, while a single-threaded processor can

    sustain fewer than 1.5 instructions per cycle with similar resources and issue

bandwidth. (3) Multithreaded workloads degrade cache performance relative to single-thread performance, as previous studies have shown. We evaluate several

    cache configurations and demonstrate that private instruction and shared data

    caches provide excellent performance regardless of the number of threads. (4)

    Simultaneous multithreading is an attractive alternative to single-chip

    multiprocessors. We show that simultaneous multithreaded processors with a

    variety of organizations are all superior to conventional multiprocessors with

    similar resources.

    While simultaneous multithreading has excellent potential to increase processor

    utilization, it can add substantial complexity to the design.


    CONVERTING TLP TO ILP VIA SMT

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-

    level parallelism (ILP) and thread-level parallelism (TLP).

    Although they correspond to different granularities of

    parallelism, ILP and TLP are fundamentally identical: they both

    identify independent instructions that can execute in parallel

    and therefore can utilize parallel hardware. Wide-issue

    superscalar processors exploit ILP by executing multiple

    instructions from a single program in a single

cycle. Multiprocessors exploit TLP by executing different threads

    in parallel on different processors. Unfortunately, neither

    parallel processing style is capable of adapting to dynamically

    changing levels of ILP and TLP, because the hardware enforces

the distinction between the two types of parallelism. A multiprocessor must statically partition its resources among the

    multiple CPUs (see Figure 1); if insufficient TLP is available,

    some of the processors will be idle. A superscalar executes only

    a single thread; if insufficient ILP exists, much of that

processor's multiple-issue hardware will be wasted.

    Simultaneous multithreading (SMT) [Tullsen et al. 1995; 1996;

Gulati et al. 1996; Hirata et al. 1992] allows multiple threads to

    compete for and share available processor resources every

    cycle. One of its key advantages when executing parallel

    applications is its ability to use thread-level parallelism and

    instruction-level parallelism interchangeably. By allowing

multiple threads to share the processor's functional units

    simultaneously, thread-level parallelism is essentially

converted into instruction-level parallelism. An SMT processor


    can therefore accommodate variations in ILP and TLP. When a

program has only a single thread (i.e., it lacks TLP), all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP.

Figure 1. A comparison of issue slot (functional unit) utilization in various

    architectures. Each square corresponds to an issue slot, with white

    squares signifying unutilized slots. Hardware utilization suffers when

    a program exhibits insufficient parallelism or when available

    parallelism is not used effectively. A superscalar processor achieves

    low utilization because of low ILP in its single thread. Multiprocessors

    physically partition hardware to exploit TLP, and therefore

    performance suffers when TLP is low (e.g., in sequential portions of

    parallel programs). In contrast, simultaneous multithreading avoids

    resource partitioning. Because it allows multiple threads to compete

    for all resources in the same cycle, SMT can cope with varying levels

    of ILP and TLP; consequently utilization is higher, and performance is

    better.


    An SMT processor can uniquely exploit whichever type of

    parallelism is available, thereby utilizing the functional units

    more effectively to achieve the goals of greater throughput and

    significant program speedups.

    SIMULTANEOUS MULTITHREADING AND MULTIPROCESSORS

Both the SMT (simultaneous multithreading)

    processor and the on-chip shared-memory MPs we examine

    are built from a common out-of-order superscalar base

    processor. The multiprocessor combines several of these

    superscalar CPUs in a small-scale MP, whereas simultaneous

multithreading uses a wider-issue superscalar, and then adds support for multiple contexts.

    Base Processor Architecture

The base processor is a sophisticated, out-of-order superscalar processor with a dynamic scheduling core similar to the MIPS R10000 [Yeager 1996]. Figure 2 illustrates the organization of this processor,

    and Figure 3 shows its processor pipeline. On each cycle, the

    processor fetches a block of instructions from the instruction

cache. After decoding these instructions, the register-renaming

    logic maps the logical registers to a pool of physical renaming


    registers to remove false dependencies. Instructions are then

    fed to either the integer or floating-point instruction queues.

    When their operands become available, instructions are issued

from these queues to the corresponding functional units. Instructions are retired in order.

Figure 2. Organization of the base processor.

Figure 3. The base processor pipeline.

    SMT ARCHITECTURE

    The SMT architecture, which can simultaneously

    execute threads from up to eight hardware contexts, is a


    straightforward extension of the base processor. To support

    simultaneous multithreading, the base processor architecture

    requires significant changes in only two primary areas: the

instruction fetch mechanism and the register file. A conventional system of branch prediction hardware (branch

    target buffer and pattern history table) drives instruction

    fetching, although we now have 8 program counters and 8

    subroutine return stacks (1 per context). On each cycle, the

    fetch mechanism selects up to 2 threads (among threads not

    already incurring I-cache misses) and fetches up to 4

    instructions from each thread (the 2.4 scheme from Tullsen et

    al. [1996]). The total fetch bandwidth of 8 instructions is

    therefore equivalent to that required for an 8-wide superscalar

processor, and only 2 I-cache ports are required. Additional

    logic, however, is necessary in the SMT to prioritize thread

selection. Thread priorities are assigned using the ICOUNT feedback technique, which favors threads that are using processor resources most effectively. Under ICOUNT, highest

    priority is given to the threads that have the least number of

instructions in the decode, renaming, and queue pipeline

    stages. This approach prevents a single thread from

    clogging the instruction queue, avoids thread starvation, and

    provides a more even distribution of instructions from all

    threads, thereby heightening interthread parallelism. The peak

    throughput of our machine is limited by the fetch and decode

    bandwidth of 8 instructions per cycle.
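A minimal sketch of that selection rule (mine, not the report's; thread IDs and counts are invented):

    def icount_select(frontend_counts, icache_missing, n=2):
        """Pick up to n threads with the fewest instructions currently
        in the decode/rename/queue stages, skipping threads that are
        incurring I-cache misses."""
        eligible = [t for t in frontend_counts if t not in icache_missing]
        return sorted(eligible, key=lambda t: frontend_counts[t])[:n]

    counts = {0: 12, 1: 3, 2: 7, 3: 9}          # 4 of the 8 contexts active
    print(icount_select(counts, icache_missing={2}))
    # [1, 3] -> fetch up to 4 instructions from each of these two threads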

    Following instruction fetch and decode,

register renaming is performed, as in the base processor. Each thread can address 32 architectural integer (and FP) registers. The register-renaming mechanism maps these architectural registers (1 set per thread) onto the machine's physical


registers. An 8-context SMT will require at least 8 * 32 = 256 physical registers, plus additional physical registers for register renaming. With a larger register file, longer access times will be required, so the SMT processor pipeline is extended by 2 cycles to avoid an increase in the cycle time. Figure 3 compares the

    pipeline for the SMT versus that of the base superscalar. On the

    SMT, register reads take 2 pipe stages and are pipelined.

Writes to the register file behave in a similar manner, also using an extra pipeline stage. In practice, we found that the

    lengthened pipeline degraded performance by less than 2%

    when running a single thread. The additional pipe stage

    requires an extra level of bypass logic, but the number of

    stages has a smaller impact on the complexity and delay of this

logic (O(n)) than the issue width (O(n^2)) [Palacharla et al.

    1997]. Our previous study contains more details regarding the

    effects of the two-stage register read/write pipelines on the

    architecture and performance.In addition to the new fetch

    mechanism and the larger register file and longer pipeline, only

    three processor resources are replicated or added to support

    SMT: per-thread instruction retirement, trap mechanisms, and

    an additional thread id field in each branch target buffer entry.

    No additional hardware is required to perform multithreaded

    scheduling of instructions to functional units. The register-

renaming phase removes any apparent interthread register dependencies, so that the conventional instruction queues can be used to dynamically schedule instructions from multiple threads. Instructions from all threads are dumped into the instruction queues, and an instruction from any thread can be issued from the queues once its operands become available.
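As a sketch of the renaming idea (illustrative only; register numbers and pool size are invented): each context has its own map from architectural to physical registers, all drawing on one shared physical pool, so the same architectural name in two threads never collides.

    free_regs = list(range(256))            # 8 contexts * 32 = 256 registers
    rename = [dict() for _ in range(8)]     # one rename table per context

    def write_dest(thread, arch_reg):
        """Allocate a fresh physical register for a destination write."""
        phys = free_regs.pop(0)
        rename[thread][arch_reg] = phys
        return phys

    # Two threads both write architectural r5: distinct physical registers,
    # so no false interthread dependence is created.
    print(write_dest(0, "r5"))  # 0
    print(write_dest(3, "r5"))  # 1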


In this design, few resources are statically partitioned among contexts; consequently, almost all hardware resources are

    available even when only one thread is executing. The

architecture allows us to achieve the performance advantages of simultaneous multithreading, while keeping intact the design

    and single-thread peak performance of the dynamically

    scheduled CPU core present in modern superscalar

    architectures.

Single-Chip Multiprocessor Hardware Configurations

    In our analysis of SMT and multiprocessing, we focus on a

    particular region of the MP design space, specifically, small-

    scale, single-chip, shared-memory

    multiprocessors. As chip densities increase, single-chip

    multiprocessing will be possible, and some architects have

    already begun to investigate this use of chip real estate

    [Olukotun et al. 1996]. An SMT processor and a small-scale, on-

chip multiprocessor have many similarities: for example, both have large numbers of registers and functional units, on-chip caches, and the ability to issue multiple instructions each cycle.

    In this study, we keep these resources approximately similar

for the SMT and MP comparisons, and in some cases we give a hardware advantage to the MP. We look at both two- and four-

    processor multiprocessors, partitioning the scheduling unit

resources of the multiprocessor CPUs (the functional units, instruction queues, and renaming registers) differently for

    each case. In the two-processor MP (MP2), each processor

    receives half of the on-chip execution resources previously

    described, so that the total resources relative to an SMT are


    comparable (Table I). For a four-processor MP (MP4), each

    processor contains approximately one-fourth of the chip

    resources. The issue width for each processor in these two MP

models is indicated by the total number of functional units. Note that even within the MP design space, these two

    alternatives (MP2 and MP4) represent an interesting tradeoff

    between TLP and ILP. The two-processor machine can exploit

    more ILP, because each processor has more functional units

    than its MP4 counterpart, whereas MP4 has additional

processors to take advantage of more TLP. Table I also includes

    several multiprocessor configurations in which we increase

    hardware resources. These configurations are designed to

reduce bottlenecks in resource usage in order to improve aggregate performance. MP2fu (MP4fu), MP2q (MP4q), and MP2r (MP4r) address bottlenecks of functional units, instruction queues, and renaming registers, respectively. MP2a increases all three of these resources, so that the total execution resources of each processor are equivalent to a single SMT processor. (MP4a is similarly augmented in all three resource classes, so that the entire MP4a multiprocessor also has twice as many resources as our SMT.) For all MP configurations, the base processor uses the out-of-order scheduling processor described earlier and the base pipeline from Figure 3. Each MP

    processor only supports one context; therefore its register file

    will be smaller, and access will be faster than the SMT. Hence

    the shorter pipeline is more appropriate.


    Synchronization Mechanisms and Memory Hierarchy

    SMT has three key advantages over multiprocessing:

    flexible use of ILP and TLP, the potential for fast

synchronization, and a shared L1 cache. This study focuses on SMT's ability to exploit the first advantage by determining the

    costs of partitioning execution resources; we therefore allow

the multiprocessor to use SMT's synchronization mechanisms

    and cache hierarchy to avoid tainting our results with effects of

the latter two. We implement a set of synchronization primitives

    for thread creation and termination, as well as hardware

    blocking locks. Because the threads in an SMT processor share

    the same scheduling core, inexpensive hardware blocking locks

can be implemented in a synchronization functional unit. This cheap synchronization is not available to multiprocessors,


    because the distinct processors cannot share functional units.

    In our workload, most interthread synchronization is in the form

    of barrier synchronization or

simple locks, and we found that synchronization time is not critical to performance. Therefore, we allow the MPs to use the

    same cheap synchronization techniques, so that our

    comparisons are not colored by synchronization effects. The

    entire cache hierarchy, including the L1 caches, is shared by all

    threads in an SMT processor. Multiprocessors typically do not

    use a shared L1 cache to exploit data sharing between parallel

    threads. Each processor in an MP usually has its own private

    cache (as in the commercial multiprocessors described by Sun

    Microsystems [1997], Slater [1992], IBM [1997],and Silicon

    Graphics [1996]) and therefore incurs some coherence

    overhead when data sharing occurs. Because we allow the MP

to use the SMT's shared L1 cache, this coherence overhead is eliminated. Although multiple threads may have working sets that interfere in a shared cache, that interthread interference was not found to be a problem.

    Comparing SMT and MP

    In comparing the total hardware dedicated to our

    multiprocessor and SMT configurations, we have not taken into

    account chip area required for buses or the cycle-time effects

    of a wider-issue machine. In our study, the intent is not to claim

that SMT has an absolute x-percent performance advantage

    over MP, but instead to demonstrate that SMT can overcome

    some fundamental limitations of multiprocessors, namely, their

    inability to exploit changing levels of ILP and TLP. We believe


    that in the target design space we are studying, the intrinsic

    flaws resulting from resource partitioning in MPs will limit their

    effectiveness relative to SMT, even taking into consideration

    cycle time.

PARALLEL APPLICATIONS

    SMT is most effective when threads have complementary

    hardware resource requirements. Multiprogrammed workloads

    and workloads consisting of parallel applications both provide

TLP via independent streams of control, but they compete for hardware resources differently. Because a multiprogrammed

    workload does not share memory references across threads, it

    places more stress on the caches. Furthermore, its threads

    have different instruction execution patterns, causing

    interference in branch prediction hardware. On the other hand,

    multiprogrammed workloads are less likely to compete for

identical functional units. Although parallel applications have

    the benefit of sharing the caches and branch prediction

    hardware, they are an interesting and different test of SMT for

    several reasons. First, unlike the multiprogrammed workload,

    all threads in a parallel application execute the same code and,

    therefore, have similar execution resource requirements,

    memory reference patterns, and levels of ILP. Because all

    threads tend to have the same resource needs at the same

    time, there is potentially more contention for these resources

    compared to a multiprogrammed workload.


    Secondly, parallel applications illustrate the promise of SMT as

    an architecture

    for improving the performance of single applications. By using

    threads to parallelize programs, SMT can improve processor

utilization, but more importantly, it can achieve program

    speedups. Finally, parallel applications are a natural workload

    for traditional parallel architectures and therefore serve as a

    fair basis for comparing SMT and multiprocessors.

Effectively Using Parallelism on an SMT Processor

    Rather than adding more execution resources to improve

    performance, SMT boosts performance and improves

    utilization of existing resources by using parallelism more

effectively. Unlike multiprocessors that suffer from rigid


    partitioning, simultaneous multithreading permits dynamic

    resource sharing, so that resources can be flexibly partitioned

    on a per-cycle basis to match the ILP and TLP needs of the

program. When a thread has a lot of ILP, it can access all processor resources; and TLP can compensate for a lack of per-

    thread ILP.

As more threads are used, speedups increase (up to 2.68 on

    average with 8 threads), exceeding the performance gains

attained by the enhanced MP configurations. The degree of SMT's improvement varies across the benchmarks, depending

    on the amount of per-thread ILP. The five programs with the

    least ILP (radix, tomcatv, hydro2d, water-spatial, and shallow)

get the five largest speedups for SMT.T8, because TLP compensates for low ILP; programs that already have a large

    amount of ILP (LU and FFT) benefit less from using additional

    threads, because resources are already busy executing useful

instructions. In linpack, performance tails off after two threads, because the granularity of parallelism in the program is very

    small. The gain from parallelism is outweighed by the overhead

    of parallelization (not only thread creation, but also the work

    required to set up the loops in each thread).


IMPLICITLY-MULTITHREADED PROCESSORS (IMT)

IMT executes compiler-specified speculative threads from a sequential program on a wide-issue SMT pipeline. IMT is based on the fundamental

observation that Multiscalar's execution model, i.e., compiler-specified speculative threads [11], can be decoupled from the processor organization, i.e., distributed processing cores. Multiscalar [11] employs sophisticated

    specialized hardware, the register ring and address resolution buffer, which are

    strongly coupled to the distributed core organization. In contrast, IMT proposes

to map speculative threads onto a generic SMT. IMT differs fundamentally from

    prior proposals, TME and DMT, for speculative threading on SMT. While TME

    executes multiple threads only in the uncommon case of branch mispredictions,

IMT invokes threads in the common case of correct predictions, thereby enhancing execution parallelism. Unlike

    IMT, DMT creates threads in hardware. Because of the lack of compile-time

    information, DMT uses value prediction to break data dependence across

    threads. Unfortunately, inaccurate value prediction incurs frequent

    misspeculation stalls, prohibiting DMT from extracting thread-level parallelism

effectively. Moreover, selective recovery from misspeculation in DMT requires

    fast and frequent searches through prohibitively large (e.g., ~1000 entries)

custom instruction trace buffers that are difficult to implement efficiently. In this paper, we find that a naive mapping of compiler-specified speculative threads onto SMT performs poorly. Despite using an advanced compiler [14] to


    generate threads, a Naive IMT (N-IMT) implementation performs only

comparably to an aggressive superscalar. N-IMT's key shortcoming is its

    indiscriminate approach to fetching/executing instructions from threads,

    without accounting for resource availability, thread resource usage, and inter-thread dependence information. The resulting poor utilization of pipeline

    resources (e.g., issue queue, load/store queues, and register file) in N-IMT

    negatively offsets the advantages of speculative threading.

The Implicitly-MultiThreaded (IMT) processor utilizes SMT's support for multithreading by executing speculative threads. Figure 3 depicts the anatomy

    of an IMT processor derived from SMT. IMT uses the rename tables for

    register renaming, the issue queue for out-of-order scheduling, the per-context

    load/store queue (LSQ) and active list for memory dependences and instruction

    reordering prior to commit. As in SMT, IMT shares the functional units,

physical registers, issue queue, and memory hierarchy among all contexts. IMT exploits implicit parallelism, as opposed to the programmer-specified, explicit parallelism exploited by conventional SMT and multiprocessors. Like

    Multiscalar, IMT predicts the threads in succession and maps them to execution

resources, with the earliest thread as the non-speculative (head) thread, followed

    by subsequent speculative threads [11]. IMT honors the inter-thread control-

flow and register dependences specified by the compiler. IMT uses the LSQ to

    enforce inter-thread memory dependences. Upon completion, IMT commits the

threads in program order. Two IMT variations are presented: (1) a Naive IMT (N-IMT) that performs comparably to an aggressive superscalar, and (2) an Optimized IMT (O-IMT) that uses novel microarchitectural techniques to enhance performance.


    order. For instance, later (program-order) threads may result in resource (e.g.,

    physical registers, issue queue and LSQ entries) starvation in earlier threads,

    forcing the later threads to squash and relinquish the resources for use by earlier

threads. Unfortunately, frequent thread squashing due to indiscriminate resource allocation without regard to demand incurs high overhead. Moreover,

treating (control- and data-) dependent and independent threads alike is

    suboptimal. Fetching and executing instructions from later threads that are

    dependent on earlier threads may be counter-productive because it increases

    inter-thread dependence delays by taking away front-end fetch and processing

    bandwidth from earlier threads. Finally, dependent instructions from later

    threads exacerbate issue queue contention because they remain in the queue

    until the dependences are resolved.

    To mitigate the above shortcomings, O-IMT employs a novel resource- and

    dependence-based fetch policy that is bimodal. In the dependent mode, the

    policy biases fetch towards the non-speculative thread when the threads are

    likely to be dependent, fetching sequentially to the highest extent possible. In

    the independent mode, the policy uses ICOUNT when the threads are

    potentially independent, enhancing overlap among multiple threads. Because

loop iterations are typically independent, the policy employs an Inter-Thread

    Dependence Heuristic (ITDH) to identify loop iterations for the independent

    mode, otherwise considering threads to be dependent. ITDH predicts that

subsequent threads are loop iterations if the next two threads' start PCs are the same as the non-speculative (head) thread's start PC. To reduce resource contention among threads, the policy employs a Dynamic Resource Predictor

    (DRP) to initiate fetch from an invoked thread only if the available hardware

    resources exceed the predicted demand by the thread. The DRP dynamically

monitors the threads' activity and allows fetch to be initiated from newly

    invoked threads when earlier threads commit and resources become available.

    Figure 4 (a) depicts an example of DRP. O-IMT indexes into a table using the

    start PC of a thread. Each table entry holds the numbers of active list and LSQ


slots, and physical registers used by the thread's last four execution instances.

The pipeline monitors a thread's resource needs and, upon thread commit, updates the thread's DRP entry. DRP supplies the maximum among the four

instances for each resource as the prediction for the next instance's resource requirement. The policy alleviates inter-thread data dependence by processing producer instructions earlier and decreasing instruction execution stalls, thereby reducing pipeline resource contention.
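A compact sketch of these two mechanisms as described (my reconstruction; the data structures, PCs, and resource counts are invented):

    from collections import defaultdict, deque

    history = defaultdict(lambda: deque(maxlen=4))  # DRP: last 4 instances

    def drp_update(start_pc, used):
        """On thread commit, record the resources the instance used."""
        history[start_pc].append(used)

    def drp_predict(start_pc):
        """Predicted demand: the maximum over the last four instances."""
        past = history[start_pc]
        return max(past) if past else 0

    def may_fetch(start_pc, free_resources):
        # Initiate fetch from a newly invoked thread only if the free
        # resources exceed the predicted demand.
        return free_resources > drp_predict(start_pc)

    def itdh_independent(head_pc, next_pcs):
        """ITDH: independent (loop-iteration) mode iff the next two
        threads' start PCs equal the head thread's start PC."""
        return list(next_pcs[:2]) == [head_pc, head_pc]

    drp_update(0x400, 28); drp_update(0x400, 35)
    print(itdh_independent(0x400, [0x400, 0x400]))  # True -> use ICOUNT
    print(may_fetch(0x400, free_resources=40))      # True (40 > 35)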

In contrast to O-IMT, prior proposals for speculative threading using SMT use variants of conventional fetch policies.

    TME uses biased-ICOUNT, a variant of ICOUNT that does not consider

resource availability and thread-level independence. DMT's fetch policy

    statically partitions two fetch ports, and allocates one port for the non-

    speculative thread and the other for speculative threads in a round-robin

    manner. However, DMT does not suffer from resource contention because the

    design assumes prohibitively large custom instruction trace buffers (holding

    thousands of instructions) allowing for threads to make forward progress

    without regards to resource availability and thread-level

    independence.Unfortunately, frequent associative searches through such large

    buffers are slow and impractical.


Figure 4. Dynamic resource prediction (a) and context multiplexing (b).

    Multiplexing Hardware Contexts

    Much like prior proposals, N-IMT assigns a single thread to a hardware context.

    Because many programs have short threads [14] and real SMT implementations

    are bound to have only a few (e.g., 2-8) contexts, this approach often leads to

insufficient instruction overlap. Larger threads, however, increase both the

    likelihood of dependence misspeculation [14] and the number of instructions

discarded per misspeculation, and cause speculative buffer overflow [5]. Instead,

    to increase instruction overlap without the unwanted side-effects of large

    threads, O-IMT multiplexes the hardware contexts by mapping as many threads

    as allowed by the resources in one context (typically 3-6 threads for SPEC2K).

    Context multiplexing requires for each context only an additional fetch PC

    register and rename table pointer per thread for a given maximum number of

    threads per context. Context multiplexing differs from prior proposals for

mapping multiple threads onto a single processing core [12, 3] to alleviate load


imbalance, in that multiplexing allows instructions from multiple threads within a context to execute and share resources simultaneously.

Two design complexities arise due to sharing resources in context multiplexing. First,

conventional active list and LSQ designs assume that instructions enter these queues in (the predicted) program order. Such an assumption enables the active

    list to be a non-searchable (potentially large) structure, and allows honoring

memory dependences via an ordered (associative) search in the LSQ. If care is

    not taken, multiplexing would invalidate this assumption if multiple threads

    were to place instructions out of program order in the shared active list and

    LSQ. Such out-of-order placement would require an associative search on the

    active list to determine the correct instruction(s) to be removed upon commit or

misspeculation. In the case of the LSQ, the requirements would be even more

    complicated. A memory access would have to search through the LSQ for an

    address match among the entries from the accessing thread, and then

(conceptually) repeat the search among entries from the thread preceding the

accessing thread, working towards older threads. Unfortunately, the active list

    and LSQ cannot afford these additional design complications because active

lists are made large and therefore non-searchable by design, and the LSQ's ordered, associative search is already complex and time-critical. Second,

    allowing a single context to have multiple out-of-program-order threads

    complicates managing interthread dependence. Because two in-program-order

    threads may be mapped to different contexts, honoring memory dependences

would require memory accesses to search through multiple contexts, thereby prohibitively increasing LSQ search time and design complexity. O-IMT avoids the second design complexity by mapping threads to a context in program

    order. Inter-thread and intra-thread dependences within a single context are

treated similarly. Figure 4 (b) shows how in-program-order threads X and X+1

    are mapped to a context. In addition to program order within contexts, O-IMT

    tracks the global program order among the contexts themselves for precise

    interrupts.
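The bookkeeping this requires is modest; a sketch (illustrative structures and values only, not the paper's hardware):

    from dataclasses import dataclass, field

    @dataclass
    class ThreadSlot:
        fetch_pc: int        # per-thread fetch PC within the context
        rename_ptr: int      # per-thread rename-table pointer

    @dataclass
    class Context:
        threads: list = field(default_factory=list)  # in program order

    contexts = [Context() for _ in range(8)]
    global_order = []        # global program order across contexts

    def map_thread(ctx_id, fetch_pc, rename_ptr, max_per_ctx=4):
        ctx = contexts[ctx_id]
        if len(ctx.threads) >= max_per_ctx:
            return False                   # context full: defer the thread
        ctx.threads.append(ThreadSlot(fetch_pc, rename_ptr))
        global_order.append((ctx_id, fetch_pc))
        return True

    # Threads X and X+1 land in context 0, in program order.
    map_thread(0, fetch_pc=0x400, rename_ptr=0)
    map_thread(0, fetch_pc=0x420, rename_ptr=1)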


POWER CONSUMPTION IN MULTITHREADED PROCESSORS

Along with circuit-level techniques such as voltage and frequency reduction and low-power circuit design, optimizations can be applied at the architecture level as well. The other environment where power consumption is important is that of high-performance processors. These processors are used in environments where energy supply is not typically a limitation. For high-performance computing, clock frequencies continue to

    increase, causing the power dissipated to reach the thresholds of current

packaging technology. When the maximum power dissipation becomes a critical

    design constraint, that architecture which maximizes the performance/power

    ratio thereby maximizes performance. Multithreading is a processor

    architecture technique which has been shown to provide significant

    performance advantage over conventional architectures which can only follow a

    single stream of execution. Simultaneous multithreading (SMT) [15, 14] can

    provide up to twice the throughput of a dynamic superscalar single-threaded

    processor. Announced architectures that will feature multithreading include the

    Compaq Alpha EV8 [5] and the Sun MAJC Java processor [12], joining the

    existing Tera MTA supercomputer architecture [1]. We can show that a

    multithreading processor is attractive in the context of low-power or power-

    constrained devices for many of the same reasons that enable its high

    throughput. First, it supplies extra parallelism via multiple threads, allowing the

    processor to rely much less heavily on speculation; thus, it wastes fewer

    resources on speculative, never-committed instructions. Second, it provides

    both higher and more even parallelism over time when running multiple

    threads, wasting less power on underutilized execution resources. A

multithreading architecture also allows different design decisions than are available in a single-threaded processor, such as the ability to impact power

    through the thread selection mechanism.

    Modelling Power

To describe the power and energy results of the processor, a particular power model must be defined. This power model is integrated into a detailed cycle-

    by-cycle instruction-level architectural simulator, allowing the model to

    accurately model both useful and non-useful (incorrectly speculated) use of all


processor resources. The power model utilized for this study is an area-based model. In this power model, the microprocessor is divided into several high-level units, and the corresponding activity of each unit is used in the computation of the overall power consumption. The following processor units are modeled: L1 instruction cache, L1 data cache, L2 unified cache, fetch unit,

    integer and floating point instruction queues, branch predictor, instruction TLB,

    data TLB, load-store queue, integer and floating-point functional units, register

file, register renamer, completion queue, and return stack. The total processor energy consumption is the summation of the unit energies, where each unit energy is equal to:
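(A form consistent with the description that follows; this is a reconstruction, not necessarily the report's exact formula. Here AF_i is an activity factor, w_i the fraction of the unit that activity exercises, r_c the unit's ratio of circuit type c, and d_c that circuit type's energy density.)

    E_{unit} = Area_{unit} \cdot \Big( \sum_{i} AF_i \, w_i \Big) \cdot \Big( \sum_{c} r_c \, d_c \Big)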

An activity factor is a defined statistic measuring how many architectural events a program or workload generates for a particular unit. For a given unit, there

    may be multiple activity factors each representing a different action that

    can occur within the unit. Activity factors represent high level actions and

    therefore are independent of the data values causing a particular unit to become

    active. We can compute the overall power consumption of the processor on a

    cycle-by-cycle basis by summing the energy usage of all units in the processor.

The entire power model consists of 44 activity factors. Each of these activity

    factors is a measurement of the activity of a particular unit or function of the

    processor (e.g. number of instruction queue writes, L2 cache reads). The model

    does not assume that a given factor or combination of factors exercises a

    microprocessor unit in its entirety, but instead is modeled as exercising a

    fraction of the area depending on the particular activity factor or combination of

activity factors. The weight of how much of the unit is exercised by a particular activity is an estimate based on knowledge of the unit's functionality and


    assumed implementation. Each unit is also assigned a relative area value which

is an estimate of the unit's size given its parameters. In addition to the overall

    area of the unit, each unit is modeled as consisting of a particular ratio of 4

different circuit types: dynamic logic, static logic, memory, and clock circuitry. Each circuit type is given an energy density. The ratios of circuit types

    for a given logic unit are estimates based on engineering experience and

    consultation with processor designers. The advantage of an area-based design is

    its adaptability. It can model an aggregate of several existing processors, as

    long as average area breakdowns can be derived for each of the units. And it

    can be adapted for future processors for which no circuit implementations exist,

    as long as an estimate of the area expansion can be made. In both cases, we

    rely on the somewhat coarse assumption that the general circuit makeup of

    these units remains fairly constant across designs. This assumption should be

    relatively valid for the types of high-level architectural comparisons made in

    this paper, but might not be for more low-level design options. Our power

    model is an architecture-level power model,and is not intended to produce

    precise estimates of absolute whole-processor power consumption. For

    example, it does not include dynamic leakage factors, I/O power (e.g.,pin

    drivers), etc. However, the power model is useful for obtaining relative power

    numbers for comparison against results obtained from the same power model.

    The numbers that are obtained from the power model are independent of clock

    frequency (again, because we focus on relative values).
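The shape of such an area-based model is easy to convey in miniature (all areas, ratios, densities, and activity numbers below are invented for illustration):

    DENSITY = {"dynamic": 1.0, "static": 0.5, "memory": 0.8, "clock": 1.2}

    UNITS = {
        # name: (relative area, circuit-type ratios summing to 1)
        "icache": (3.0, {"memory": 0.8, "static": 0.1, "clock": 0.1}),
        "int_fu": (2.0, {"dynamic": 0.6, "static": 0.3, "clock": 0.1}),
    }

    def unit_energy(name, activity, exercised_fraction):
        """Energy for one unit this cycle: activity count, times the
        fraction of the unit's area exercised, times area and density."""
        area, ratios = UNITS[name]
        density = sum(ratios[c] * DENSITY[c] for c in ratios)
        return activity * exercised_fraction * area * density

    def cycle_energy(activities):
        # activities: {unit: (events this cycle, fraction exercised)}
        return sum(unit_energy(u, a, f) for u, (a, f) in activities.items())

    print(cycle_energy({"icache": (2, 0.5), "int_fu": (4, 0.25)}))  # ~4.17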

    Power Consumption of a Multithreaded Architecture

The power-energy characteristics of multithreaded execution can now be examined. We examine performance (IPC), energy efficiency (energy per useful instruction executed, E/UI), and power (the average power utilized during each simulation run) for both single-thread and multithreaded execution.
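Spelled out as a formula (my restatement of the metric's name, not an equation quoted from the report), useful instructions are those that actually commit:

    E/UI = E_{total} / N_{committed}

Lower is better: speculative work that is later squashed inflates E_{total} without increasing N_{committed}.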


the lowest single-thread value). This is done for two reasons: first, the power model is not intended to be an indicator of absolute power or energy, so this allows us to focus on the relative values; second, it allows us to view the diverse metrics on a single graph. Later results are normalized to other baseline values. The single-thread results (Figure 5) show diversity in almost all metrics.

Performance and power appear to be positively correlated; the more of the processor used, the more power used. However, performance and energy per useful instruction are somewhat negatively correlated; the fewer unused resources and the fewer wastefully used resources, the more efficient the processor. Also included in the graph (the Ideal E/UI bar) is the energy efficiency of each application with perfect branch prediction. This gives an indication of the energy lost to incorrect speculation (when compared with the E/UI result). Gcc and go suffer most from low branch prediction accuracy. Gcc, for example, only commits 39% of fetched instructions and 56% of executed instructions. In cases such as gcc and go, the energy effect of mispredicted branches is quite large (35%-40%). Figure 6 shows that multithreaded execution has significantly improved energy efficiency over conventional processor execution. Multithreading attacks the two primary sources of wasted energy/power: unutilized resources and, most importantly, resources wasted due to incorrect speculation. Multithreaded processors rely far less on speculation to achieve high throughput than conventional processors.


Figure 5. The relative performance, energy efficiency, and power results for the six benchmarks (single-thread execution).

Figure 6. The relative performance, energy efficiency, and power results as the number of threads varies.

With multithreading, then, we achieve as much as a 90% increase in performance (the 4-thread result), while actually decreasing the energy dissipation per useful instruction of a given workload. This is achieved via a relatively modest increase in power. The increase in power occurs because of the overall increase in utilization and throughput of the processor, that is, because the processor completes the same work in a shorter period of time.

Figure 7 shows the effect of the reduced misspeculation on the individual components of the power model. The greatest reductions come in the front of the pipeline (e.g. fetch), which is always the slowest to recover from a branch misprediction. The reduction in energy is achieved despite an increase in L2 cache utilization and power. Multithreading, particularly with a multiprogrammed

    workload, reduces the locality of accesses to the cache, resulting in more cache

    misses; however, this has a bigger impact on L2 power than L1. The L1 caches

    see more cache fills, while the L2 sees more accesses per instruction. The

    increase in L1 cache fills is also mitigated by the same overall efficiency gains,

    as it sees fewer speculative accesses.


Figure 7. Contributors to the overall energy efficiency for different numbers of threads.

The power advantage of simultaneous multithreading can provide a significant gain in both the mobile and high-performance domains. The E/UI values demonstrate that a multithreaded

    processor operates more efficiently, which is desirable in a mobile computing

environment. For a high-performance processor, the constraint is more likely to

    be average power or peak power. SMT can make better use of a given average


    power budget, but it really shines in being able to take greater advantage of a

    peak power constraint. That is, an SMT processor will achieve sustained

    performance that is much closer to the peak performance (power) the processor

was designed for than a single-threaded processor.

    Power Optimizations

    In this section, we examine power optimizations that will either reduce the

    overall energy usage of a multithreading processor or will reduce the power

    consumption of the processor. We examine the following possible power

optimizations, each made practical in some way by multithreading: reduced execution bandwidth, dynamic power consumption controls, and thread fetch optimizations. Reduced execution bandwidth exploits the higher execution efficiency of multithreading to achieve higher performance while maintaining moderate power consumption. Dynamic power consumption controls allow a high-performance processor to degrade activity gracefully when power consumption thresholds are exceeded. The thread fetch optimization looks to achieve better processor resource utilization via thread selection algorithms.

    Reduced Execution Bandwidth

    Prior studies have shown that an SMT processor has the potential to achieve

    double the instruction throughput of a single-threaded processor with similar

execution resources. This implies that a given performance goal can be met

    with a less aggressive architecture, possibly consuming less power. Therefore, a

multi-threaded processor with a smaller execution bandwidth may achieve comparable performance to a more aggressive single-threaded processor while

    consuming less power. In this section, for example, we compare the power

    consumption and energy efficiency of an 8-issue single-threaded processor with

    a 4-issue multithreaded processor.

    Controlling Power Consumption via Feedback


We strive to reduce the peak power dissipation while still maximizing performance, a goal that future high-performance architects are likely to face. In this section, the processor is given feedback regarding power dissipation in order to limit its activity. Such a feedback mechanism does indeed provide reduced average power, but the real attraction of this technique is the ability to reduce peak power to an arbitrary level. A feedback mechanism that guaranteed that the actual power consumption did not exceed a threshold could make peak and achieved power consumption arbitrarily close. Thus, a processor could still be 8-issue (because that width is

    useful some fraction of the time), but might still be designed knowing that the

    peak power corresponded to 5 IPC sustained, as long as there is some

    mechanism in the processor that can guarantee sustained power does not exceed

    that level. This section models a mechanism, applied to an SMT processor,

    which utilizes the power consumption of the processor as the feedback input.

    For this optimization, we would like to set an average power consumption

    target and have the processor stay within a reasonable tolerance of the target

while achieving higher performance than the single-threaded processor could

    achieve. Our target power, in this case, is exactly the average power of all the

    benchmarks running on the single-threaded processor. This is a somewhat

    arbitrary threshold, but makes it easier to interpret the results. The feedback is

    an estimated value for the current power the processor is utilizing. An

    implementation of this mechanism could use either on-die sensors or a power

    value which is computed from some performance counters. The advantage of

using performance counters is that the lag time for the value is much less than that of a physical temperature sensor. For the experiments listed here, the power value used is the number obtained from the simulator power model. We used a power sampling interval of 5 cycles (this is basically the delay before the feedback information can be used by the processor). In each

    of the techniques we define thresholds whereby if the power consumption

    exceeds a given threshold, fetching or branch prediction activity will be

    modified to curb power consumption. The threshold values are fixed throughout

  • 8/7/2019 Manu (Hyper Threading)

    44/53

    a given simulation. In all cases, fetching is halted for all threads when power

    consumption reaches 110% of the desired power value.
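A minimal sketch of such a feedback mechanism is shown below. The 5-cycle sampling interval and the halt-all-fetch threshold at 110% of the target come from the description above; the intermediate thresholds and action names are assumptions made for illustration.

    #include <stdio.h>

    struct power_feedback {
        double target;     /* desired average power                */
        int    interval;   /* cycles between feedback updates      */
        long   cycle;
        double estimate;   /* last power value visible to the core */
    };

    static void feedback_update(struct power_feedback *fb, double measured)
    {
        /* The measured value only becomes visible every `interval`
         * cycles, modeling the feedback delay. */
        if (++fb->cycle % fb->interval == 0)
            fb->estimate = measured;
    }

    static const char *throttle_action(const struct power_feedback *fb)
    {
        double ratio = fb->estimate / fb->target;
        if (ratio >= 1.10) return "halt fetch for all threads";          /* from the text */
        if (ratio >= 1.05) return "fetch least-speculative threads only"; /* assumed step */
        if (ratio >= 1.00) return "reduce fetch bandwidth";               /* assumed step */
        return "no restriction";
    }

    int main(void)
    {
        struct power_feedback fb = { 10.0, 5, 0, 0.0 };
        for (int cycle = 1; cycle <= 15; cycle++) {
            feedback_update(&fb, 10.0 + cycle * 0.1); /* rising modeled power */
            printf("cycle %2d: %s\n", cycle, throttle_action(&fb));
        }
        return 0;
    }

The graded thresholds reflect the point made below: a multithreaded processor has several dimensions on which to scale back, so activity can be reduced incrementally rather than all at once.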

Figure 8. The average over all benchmarks of the performance and power of 2 and 4 threads with feedback.

Power feedback control is a technique which enables the designer to lower the peak power constraints on the processor. This technique is particularly effective in a multithreading environment for two reasons. First, even with the drastic step of eliminating speculation, even for fetch, an SMT processor can still make much better progress than a single-threaded processor. Second, we have more dimensions on which to scale back execution; the results were best when we took advantage of the opportunity to scale back threads incrementally.

    Thread Selection


    This optimization examines the effect of thread selection algorithms on power

    and performance. The ability to select from among multiple threads provides

    the opportunity to optimize for power consumption and operating efficiency

when making thread fetch decisions. The optimization we model involves two threads that are selected each cycle to attempt to fetch instructions from the I-cache. The heuristic used to select which threads fetch can have a large impact

    on performance [14]. This mechanism can also impact power if we use it to bias

    against the most speculative threads. This heuristic does not, in fact, favor low-

    power threads over high-power threads, because that only delays the running of

    the high-power threads. Rather, it favors less speculative threads over more

    speculative threads. This works because the threads that get stalled become less

    speculative over time (as branches are resolved in the processor), and quickly

    become good candidates for fetch. We modify the ICOUNT thread selection

    scheme, from [14], by adding a branch confidence metric to thread fetch

priority. This could be viewed as a derivative of pipeline gating [11] applied to a multithreaded processor; however, we use it to change the fetch decision, not to stop fetching altogether. The low-conf scheme biases heavily against threads with more unresolved low-confidence branches, while the all branches scheme biases against the threads with more branches overall in the processor (regardless of confidence). Both use ICOUNT as the secondary metric.
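The following sketch shows one way the low-conf selection could be realized on top of ICOUNT. The thread-state fields and the exact tie-breaking are illustrative assumptions; the all branches variant would simply compare the total unresolved-branch counts instead.

    #include <stdio.h>
    #include <stdlib.h>

    struct thread_state {
        int id;
        int icount;             /* instructions in pre-issue pipeline stages */
        int branches;           /* all unresolved branches                   */
        int low_conf_branches;  /* unresolved low-confidence branches        */
    };

    /* low-conf scheme: fewer unresolved low-confidence branches wins;
     * ICOUNT (fewer in-flight instructions) breaks ties. */
    static int cmp_low_conf(const void *a, const void *b)
    {
        const struct thread_state *x = a, *y = b;
        if (x->low_conf_branches != y->low_conf_branches)
            return x->low_conf_branches - y->low_conf_branches;
        return x->icount - y->icount;
    }

    int main(void)
    {
        struct thread_state threads[] = {
            { 0, 12, 3, 2 },   /* deeply speculative thread */
            { 1,  8, 1, 0 },   /* few unresolved branches   */
            { 2, 20, 2, 1 },
            { 3,  5, 4, 3 },
        };
        size_t n = sizeof threads / sizeof threads[0];

        /* Sort by fetch priority; the first two entries are the threads
         * allowed to fetch this cycle (the first has higher priority). */
        qsort(threads, n, sizeof threads[0], cmp_low_conf);
        printf("fetch this cycle: thread %d (primary), thread %d\n",
               threads[0].id, threads[1].id);
        return 0;
    }

Note how a stalled thread's priority rises on its own: as its branches resolve, its low-confidence count drops and it again becomes a good candidate for fetch.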


Figure 9. The performance, power, and energy effects of using branch status to direct thread selection.

Figure 9 shows that this mechanism has the potential to improve both raw performance and energy efficiency at the same time, particularly with the confidence predictor. For the 4-thread simulations, performance increased by 6.3% and 4.9% for the low-conf and all branches schemes, respectively. The efficiency gains should be enough to outweigh the additional power required for the confidence counters (not modeled), assuming the architecture did not already need them for other purposes. If not, the technique without the confidence counters was equally effective at improving power efficiency,

lagging only slightly in performance. This figure shows an improvement even with two threads because, when two threads are chosen for fetch in a cycle, one is still chosen to have higher priority and may consume more of the available fetch bandwidth.

OPERATING SYSTEM CONSIDERATIONS

The operating system essentially manages the Hyper-Threaded processor as if it were running in a true MP arrangement, with some minor modifications. The Hyper-Threaded processor can be toggled by the OS between MultiTask (MT) mode and two Single Task (ST) modes, ST0 and ST1.

In cases where a logical processor is idle, the operating system (which must be at Ring 0 privilege level) may issue a HLT (Halt) instruction, which enables either ST0 or ST1 mode, and the partitioned processor resources can be recombined and applied to the single active thread. This is in line with Intel's statement that a single-threaded application should run as fast on a Hyper-Threaded processor as on a non-Hyper-Threaded processor. An interrupt sent to the Halted thread can wake it up and move the processor back to MT mode.

    In cases where more than one physical Hyper-Threaded

    processor is installed, the OS should schedule threads across the physical

processors before scheduling two threads on a single processor. In a dual-

    processor system the BIOS might aid in this procedure by assigning processor

    mappings as follows:

    Processor 1 = Physical processor1, logical processor1

    Processor 2 = Physical processor2, logical processor1

    Processor 3 = Physical processor1, logical processor2

Processor 4 = Physical processor2, logical processor2
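The mapping above amounts to enumerating all first logical processors before any second logical processors, as the small sketch below illustrates; the loop structure is only an illustration of the ordering, not BIOS code.

    #include <stdio.h>

    int main(void)
    {
        const int n_physical = 2, n_logical = 2;
        /* All first logical processors are enumerated before any second
         * logical processors, so the OS naturally spreads threads across
         * physical packages before doubling up on one. */
        for (int l = 0; l < n_logical; l++)
            for (int p = 0; p < n_physical; p++)
                printf("Processor %d = Physical processor%d, logical processor%d\n",
                       l * n_physical + p + 1, p + 1, l + 1);
        return 0;
    }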

Looking at this functionality in MP-capable Windows OSs, Windows 2000 unfortunately has some limitations in the way it counts physical and logical processors. In a four-HT-CPU-licensed Windows 2000 server, only the first logical processor of each of the four physical processors is usable; the second logical processor on each physical processor goes unused. In contrast, under Windows XP Pro or .NET Server using a dual-HT-CPU configuration, all four logical processors will be recognized and used. A four-HT-CPU-licensed .NET Server can use 8 logical processors. This is the key reason why Intel recommends XP in the Windows arena.


Hyper-Threading implementation in Windows XP

Hyper-Threading (HT) is a feature of the CPU itself, and not the motherboard, chipset, or other component, that allows the operating system to see more than one logical CPU. It's called logical because there is only one CPU; it just appears as two. HT does this by providing a special set of protocols to the OS which can be executed/queried at bootup. The OS determines if HT is available and then installs itself accordingly. There is a distinct difference between HT and SMP (Symmetric Multi-Processing: multiple CPUs on a main board). SMP is chipset-dependent and requires a coordinated effort by many factors off the CPU itself (mostly for cache coherency). This is one of the main reasons Microsoft has been complaining lately about implementing HT in its Windows NT-based operating systems (NT, 2K, and XP). These versions of Windows use something called a HAL (Hardware Abstraction Layer) which logically isolates everything a PC can do from the PC itself. The HAL is specifically designed to accommodate the cumulative capabilities derived from any and all PCs Microsoft supports (if any PC out there can do it, then there's a part of the HAL which accommodates its range of features). The problem with HT is that it doesn't fit into their existing HAL model. Since HT is a chip component in and of itself, the basic CPU aspects of the HAL have to be extended or modified to accommodate it. This is where the problem lies. It is ironic, actually, as just about the time Microsoft releases its greatest operating system (Windows XP), the very HAL it's designed on has to be modified extensively to accommodate HT.
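One concrete form of that boot-time query is the x86 CPUID instruction: leaf 1 reports an HTT feature flag in EDX bit 28 and a logical-processor count per package in EBX bits 23:16. The sketch below (using the GCC/Clang <cpuid.h> helper) shows the idea; it is a user-level illustration, not the code an OS actually runs at boot.

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 1 not supported");
            return 1;
        }
        int htt     = (edx >> 28) & 1;    /* HTT feature flag                  */
        int logical = (ebx >> 16) & 0xff; /* logical CPUs per physical package
                                           * (meaningful when HTT is set)      */
        printf("HTT: %d, logical processors per physical package: %d\n",
               htt, logical);
        return 0;
    }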

    HT PLATFORM REQUIREMENTS


Intel specifies that a P4 3.06 GHz or higher processor is required to support Hyper-Threading on the desktop. Even though older P4 desktop processors had HT technology built in, it was permanently disabled at the fab. All 533 MHz FSB-capable Intel chipsets will support Hyper-Threading except the A1 (initial) stepping of the 845G. The newer chipsets implement

    necessary APIC (Advanced Programmable Interrupt Controller) updates for

    Hyper-Threaded processors per Intel (to support two logical processors within a

physical processor, we suspect, as each logical processor has its own Local

    APIC).

    Other Hyper-Threading requirements:

    Motherboards must meet P4 3.06 GHz power and

    thermal specifications, and have an updated BIOS and drivers. As of this

    writing, Intel only recommends Windows XP (both Pro and Home editions) and

    Linux 2.4.x for use in Hyper-Threading-enabled PCs.

    CONCLUSION


Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. This is a significant new technology direction for Intel's processors. It will become increasingly important for all future applications as it adds a new technique for obtaining additional performance at lower transistor and power costs. The first implementation of Hyper-Threading Technology was

    done on the Intel Xeon processor MP. In this implementation there

    were two logical processors on each physical processor. The logical

    processors had their own independent architecture state, but they shared

    nearly all the physical execution and hardware resources of the processor.

    The goal was to implement the technology at minimum cost while

ensuring forward progress on one logical processor, even if the other is stalled,

    and to deliver full performance even when there is only one active logical

    processor. These goals were achieved through efficient logical processor

    selection algorithms and the creative partitioning and recombining

    algorithms of many key resources. Measured performance on the Intel Xeon

    processor MP with Hyper-Threading Technology shows performance gains

    of up to 30% on common server application benchmarks for this

    technology.

    The startling growth of the Internet and telecommunications

    demands increasingly higher levels of processor performance. But this

enhancement cannot come at the expense of power and cost, so previous methods of attaining higher processor performance can no longer be afforded. In this context, Intel's Hyper-Threading becomes very important, as it achieves maximum processor performance using minimum hardware complexity, and thereby much reduced cost and acceptable power consumption.

    REFERENCES


1. T. N. Vijaykumar and G. S. Sohi, Task Selection for a Multiscalar Processor, in Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-31), December 1998.

2. S. Wallace, B. Calder, and D. M. Tullsen, Threaded Multiple Path Execution, in Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.

3. M. Gulati and N. Bagherzadeh, Performance Study of a Multithreaded Superscalar Microprocessor, in Proc. Int'l Symp. High-Performance Computer Architecture, ACM, 1996.

4. W. Yamamoto and M. Nemirovsky, Increasing Superscalar Performance through Multistreaming, in Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, IFIP.

5. H. Hirata et al., An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads, in Proc. Int'l Symp. Computer Architecture, ACM, New York, 1992.

6. A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz, APRIL: A Processor Architecture for Multiprocessing, in Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990.

7. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porter, and B. Smith, The Tera Computer System, in International Conference on Supercomputing, June 1990.


8. L. A. Barroso et al., Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing, in Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

9. M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee, The M-Machine Multicomputer, in 28th Annual International Symposium on Microarchitecture, November 1995.

10. L. Hammond, B. Nayfeh, and K. Olukotun, A Single-Chip Multiprocessor, Computer, September 1997.

11. D. J. C. Johnson, HP's Mako Processor, Microprocessor Forum, October 2001.