
Government Engineering College, Thrissur

Department of Electronics and Communication

    Seminar Report 2004

    HYPER-THREADING

    Presented by

    MANU SAGAR K.V

    REG NO: 01-633

VIIth SEMESTER


    ACKNOWLEDGEMENT

    I extend my sincere gratitude to Prof. N Premachandran,

    Principal, and Prof. K.P Indira Devi, Head of Electronics & Communication

    Department, Govt. Engineering College Thrissur, for providing me with the

    necessary infrastructure in completing my seminar. I would like to convey a

    deep sense of gratitude to the seminar coordinator Mrs. C.R Muneera, Asst.

    Prof Dept. of Electronics & Communication, for giving me unstinted support on

my endeavor in the development of this report. I am also grateful to Mr. C.D Anil Kumar, Lecturer, Dept. of Electronics & Communication, for his timely advice.


    CONTENTS

ABSTRACT
INTRODUCTION
PROCESSOR MICRO-ARCHITECTURE
CONCEPT OF SIMULTANEOUS MULTITHREADING
CONVERTING TLP TO ILP VIA SMT
SMT AND MULTIPROCESSORS
SYNCHRONIZATION MECHANISMS AND MEMORY HIERARCHY
IMPLICITLY MULTITHREADED PROCESSORS
POWER CONSUMPTION IN MULTITHREADED PROCESSORS
OPERATING SYSTEM CONSIDERATIONS
HT PLATFORM REQUIREMENTS
CONCLUSION
REFERENCES


Hyper-Threading

    Abstract

Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
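From the software side, the doubling of logical processors is directly visible to anything that counts CPUs. As a minimal illustration (mine, not the report's; the counts shown assume a hypothetical 2-core machine with Hyper-Threading enabled):

    import os

    # With Hyper-Threading enabled, each physical processor exposes two
    # logical processors, so the OS-visible CPU count doubles.
    logical = os.cpu_count()
    print(f"logical processors visible to the OS: {logical}")  # e.g., 4

    # Threads are scheduled onto these logical processors exactly as they
    # would be onto physical ones; no application changes are required.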

The first implementation of Hyper-Threading Technology was done on the Intel Xeon processor MP. In this implementation there are two logical processors on each physical processor. The logical processors have their own independent architecture state, but they share nearly all the physical execution and hardware resources of the processor. The goal was to implement the technology at minimum cost while ensuring forward progress on each logical processor, even when the other is stalled, and to deliver full performance even when there is only one active logical processor.

The potential for Hyper-Threading Technology is tremendous; the current implementation has only just begun to tap into this potential. Hyper-Threading Technology is expected to be viable from mobile processors to servers; its introduction into market segments other than servers is


    only gated by the availability and prevalence of threaded applications and

    workloads in those markets.

    INTRODUCTION

    The amazing growth of the Internet and telecommunications is powered

    by ever-faster systems demanding increasingly higher levels of processor

    performance. To keep up with this demand, we cannot rely entirely on

traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly complex, with more transistors and higher power consumption. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.

    Making Hyper-Threading Technology a reality was the result

    of enormous dedication, planning, and sheer hard work from a large number of

    designers, validators, architects, and others. There was incredible teamwork

    from the operating system developers, BIOS writers, and software developers

who helped with innovations and provided support for many decisions that were

    made during the definition process of Hyper-Threading Technology. Many

    dedicated engineers are continuing to work with our ISV partners to analyze

    application performance for this technology. Their contributions and hard work

    have already made and will continue to make a real difference to our customers.


    PROCESSOR MICRO-ARCHITECTURE

    Traditional approaches to processor design have focused on

    higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques

to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can

    greatly improve performance by increasing the number of instructions that can

    be executed each second. Because there will be far more instructions in-flight in

    a superpipelined microarchitecture, handling of events that disrupt the pipeline,

e.g., cache misses, interrupts, and branch mispredictions, can be costly. ILP

    refers to techniques to increase the number of instructions executed each clock

    cycle. For example, a super-scalar processor has multiple parallel execution

    units that can process instructions simultaneously. With super-scalar execution,

    several instructions can be executed each clock cycle. However, with simple

in-order execution, it is not enough to simply have multiple execution units.

    The challenge is to find enough instructions to execute.

    One technique is out-of-order execution where a large window of

    instructions is simultaneously evaluated and sent to execution units, based on

instruction dependencies rather than program order.

Accesses to DRAM

    memory are slow compared to execution speeds of the processor. One

    technique to reduce this latency is to add fast caches close to the processor.

    Caches can provide fast memory access to frequently accessed data or

    instructions. However, caches can only be fast when they are small. For this

    reason, processors often are designed with a cache hierarchy in which fast,

    small caches are located and operated at access latencies very close to that of


    the processor core, and progressively larger caches, which handle less

    frequently accessed data or instructions, are implemented with longer access

    latencies. However, there will always be times when the data needed will not be

in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss.
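The payoff of such a hierarchy can be made concrete with the standard average memory access time relation (a textbook formula, not one from the report), here with made-up latencies and miss rates:

    AMAT = t_{L1} + m_{L1} \cdot (t_{L2} + m_{L2} \cdot t_{mem})

For example, assuming t_{L1} = 2 cycles, m_{L1} = 0.1, t_{L2} = 10 cycles, m_{L2} = 0.2, and t_{mem} = 200 cycles, AMAT = 2 + 0.1(10 + 0.2 x 200) = 7 cycles, far closer to the L1 latency than to DRAM.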

The vast majority of techniques used to improve processor performance from one generation to the next are complex and often add significant die-size and power costs. These techniques increase performance, but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double the performance of the processor, due to limited parallelism in instruction flows. Similarly, simply doubling the clock rate does not double the performance, due to the number of processor cycles lost to branch mispredictions.

The figure shows the relative increase in performance and the costs, such as die size and power, over the last ten years on Intel processors. In order to isolate the microarchitecture impact, this comparison assumes that the four generations of processors are on the same silicon process technology and that the speed-ups are normalized to the performance of an Intel 486 processor.


Although we use Intel's processor history in this example, other high-performance processor manufacturers during this time period would show similar trends. Intel's processor performance, due to microarchitecture advances alone, has improved integer performance five- or six-fold. Most integer applications have limited ILP and the instruction flow can be hard to predict.

    Over the same period, the relative die size has gone up fifteen-fold, a three-

    times-higher rate than the gains in integer performance. Fortunately, advances

    in silicon process technology allow more transistors to be packed into a given

    amount of die area so that the actual measured die size of each generation

    microarchitecture has not increased significantly.

The relative power increased almost eighteen-fold during this period. Fortunately, there exist a number of known techniques to significantly reduce power consumption on processors, and there is much ongoing research in this area. However, current processor power dissipation is at

    the limit of what can be easily dealt with in desktop platforms and we must put

    greater emphasis on improving performance in conjunction with new

    technology, specifically to control power.

    CONVENTIONAL MULTI-THREADING

A look at today's software trends reveals that server applications

    consist of multiple threads or processes that can be executed in parallel. On-line

    transaction processing and Web services have an abundance of software threads

    that can be executed simultaneously for faster performance. Even desktop

    applications are becoming increasingly parallel. Intel architects have been

    trying to leverage this so-called thread-level parallelism (TLP) to gain a better

performance vs. transistor count and power ratio.

    In both the high-end and mid-range server markets,

    multiprocessors have been commonly used to get more performance from the

system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on


    multiple processors at the same time. These threads might be from the same

    application, from different applications running simultaneously, from operating

    system services, or from operating system threads doing background

maintenance. Multiprocessor systems have been used for many years, and high-end programmers are familiar with the techniques to exploit multiprocessors for

    higher performance levels.

    In recent years a number of other techniques to further

    exploit TLP have been discussed and some products have been announced. One

    of these techniques is chip multiprocessing (CMP), where two processors are

    put on a single die.

    The two processors each have a full set of execution and architectural

resources. The processors may or may not share a large on-chip cache. CMP is

    largely orthogonal to conventional multiprocessor systems, as you can have

    multiple CMP processors in a multiprocessor configuration. Recently

    announced processors incorporate two processors on each die. However, a CMP

    chip is significantly larger than the size of a single-core chip and therefore more

    expensive to manufacture; moreover, it does not begin to address the die size

    and power considerations.

    TIME-SLICE MULTI-THREADING

    Time-slice multithreading is where the processor switches

between software threads after a fixed time period. Quite a bit of what a CPU does is illusion. For instance, modern out-of-order processor architectures don't actually execute code sequentially in the order in which it was written. An OOE (out-of-order execution) architecture takes code that was written and compiled to be executed in a specific order, reschedules the sequence of instructions (if possible) so that they make maximum use of the processor resources, executes them, and then arranges them back in their original order so that the results can be written out to memory. To the

programmer and the user, it looks as if an ordered, sequential stream of instructions went into the CPU and an identically ordered, sequential stream of


    computational results emerged. Only the CPU knows in what order the

program's instructions were actually executed, and in that respect the processor

    is like a black box to both the programmer and the user.
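A toy model of that reordering (my sketch, not from the report; instruction names and latencies are invented) shows independent work overtaking a stalled instruction while retirement stays in program order:

    # Each entry: (name, result, operands, latency in cycles).
    program = [
        ("i0", "a", set(),      3),  # slow load: result ready after 3 cycles
        ("i1", "b", {"a"},      1),  # depends on the load
        ("i2", "c", set(),      1),  # independent work
        ("i3", "d", {"b", "c"}, 1),
    ]

    ready_at, issued, cycle = {}, [], 0
    while len(issued) < len(program):
        avail = {r for r, t in ready_at.items() if t <= cycle}
        for name, res, ops, lat in program:
            if name not in [n for n, _ in issued] and ops <= avail:
                issued.append((name, cycle))   # issue when operands ready
                ready_at[res] = cycle + lat
                break                          # one issue per cycle here
        cycle += 1

    print(issued)  # [('i0', 0), ('i2', 1), ('i1', 3), ('i3', 4)]
    # i2 overtakes the stalled i1, yet results are still retired i0..i3.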

    The same kind of sleight-of-hand happens when

    we run multiple programs at once, except that this time the operating system is

    also involved in the scam. To the end user, it appears as if the processor is

    "running" more than one program at the same time, and indeed, there actually

    are multiple programs loaded into memory. But the CPU can execute only one

of these programs at a time. The OS maintains the illusion of concurrency by

    rapidly switching between running programs at a fixed interval, called a time

    slice. The time slice has to be small enough that the user doesn't notice any

    degradation in the usability and performance of the running programs, and it

    has to be large enough that each program has a sufficient amount of CPU time

    in which to get useful work done. Most modern operating systems include a

way to change the size of an individual program's time slice. So a program with a larger time slice gets more actual execution time on the CPU relative to its lower-priority peers, and hence it runs faster.
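That policy is easy to mimic (a toy scheduler of my own, with invented program names and slice sizes; real OS schedulers are far more elaborate):

    from collections import deque

    def run(programs, slices, total_time):
        """programs: {name: remaining work}; slices: {name: slice length}."""
        queue, t = deque(programs), 0
        while t < total_time and queue:
            name = queue.popleft()
            quantum = min(slices[name], programs[name], total_time - t)
            programs[name] -= quantum      # "execute" for one time slice
            t += quantum
            if programs[name] > 0:
                queue.append(name)         # unfinished: back of the queue
        return programs

    left = run({"editor": 30, "compiler": 90},
               {"editor": 5, "compiler": 10}, total_time=60)
    print(left)  # {'editor': 10, 'compiler': 50}: the larger slice earned
                 # twice the CPU time, as a higher-priority program would.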

Time-slice multithreading can result in wasted execution slots, but it can effectively minimize the effects of long latencies to memory. Switch-on-event multithreading would instead switch threads on long-latency events such as cache misses. This approach can work well for server applications that have large numbers of cache misses and where the two threads are executing similar tasks. However, neither the time-slice nor the switch-on-event multithreading technique achieves optimal overlap of many sources of inefficient resource usage, such as branch mispredictions, instruction dependencies, etc.


    Finally, there is simultaneous multi-threading, where

    multiple threads can execute on a single processor without switching. The

    threads execute simultaneously and make much better use of the resources.

    This approach makes the most effective use of processor

resources: it maximizes performance relative to transistor count and power consumption.


    CONCEPT OF SIMULTANEOUS MULTI-THREADING

    Simultaneous Multi-threading is a processor design that

    combines hardware multithreading with superscalar processor technology to

    allow multiple threads to issue instructions each cycle. Unlike other hardware

    multithreaded architectures (such as the Tera MTA), in which only a single

    hardware context (i.e., thread) is active on any given cycle, SMT permits all

    thread contexts to simultaneously compete for and share processor resources.

    Unlike conventional superscalar processors, which suffer from a lack of per-

    thread instruction-level parallelism, simultaneous multithreading uses multiple

    threads to compensate for low single-thread ILP. The performance consequence

    is significantly higher instruction throughput and program speedups on a variety

    of workloads that include commercial databases, web servers and scientific

    applications in both multi programmed and parallel environments.

Simultaneous multithreading has already had an impact in both the academic and

    commercial communities.

    It has been found that the throughput gains from

    simultaneous multithreading can be achieved without extensive changes to a

    conventional wide-issue superscalar, either in hardware structures or sizes. It is

seen that the architecture for simultaneous multithreading achieves three

    goals: (1) it minimizes the architectural impact on the conventional superscalar

    design, (2) it has minimal performance impact on a single thread executing

    alone, and (3) it achieves significant throughput gains when running multiple

threads. Simultaneous multithreading enjoys a 2.5-fold improvement over an unmodified superscalar with the same hardware resources. This speedup is enabled by an advantage of multithreading previously unexploited in other architectures: the ability to favor, for fetch and issue, those threads that are using the processor most efficiently each cycle, thereby providing the "best"


instructions to the processor. Several heuristics that help to identify and use the best threads for fetch and issue have been found, and such heuristics can increase throughput by as much as 37%. Using the best fetch and issue alternatives, we then use bottleneck analysis to identify opportunities for further gains on the improved architecture.

Simultaneous Multithreading: Maximizing On-Chip Parallelism

    The increase in component density on modern microprocessors has led

    to a substantial increase in on-chip parallelism. In particular, modern

    superscalar RISCs can issue several instructions to independent functional units

    each cycle. However, the benefit of such superscalar architectures is ultimately

    limited by the parallelism available in a single thread.

Simultaneous multithreading is a technique permitting several independent

    threads to issue instructions to a super-scalar's multiple functional units in a

    single cycle. In the most general case, the binding between thread and

    functional unit is completely dynamic. We present several models of

    simultaneous multithreading and compare them with wide superscalar, fine-

    grain multithreaded, and single-chip, multiple-issue multiprocessing

architectures. To perform these evaluations, a simultaneous multithreaded architecture was simulated based on the DEC Alpha 21164 design, executing code generated by the Multiflow trace-scheduling compiler. The results show

    that: (1) No single latency-hiding technique is likely to produce acceptable

    utilization of wide superscalar processors. Increasing processor utilization will

    therefore require a new approach, one that attacks multiple causes of processor

    idle cycles. (2) Simultaneous multithreading is such a technique. With our


    machine model, an 8-thread, 8-issue simultaneous multithreaded processor

    sustains over 5 instructions per cycle, while a single-threaded processor can

    sustain fewer than 1.5 instructions per cycle with similar resources and issue

bandwidth. (3) Multithreaded workloads degrade cache performance relative to single-thread performance, as previous studies have shown. We evaluate several

    cache configurations and demonstrate that private instruction and shared data

    caches provide excellent performance regardless of the number of threads. (4)

    Simultaneous multithreading is an attractive alternative to single-chip

    multiprocessors. We show that simultaneous multithreaded processors with a

    variety of organizations are all superior to conventional multiprocessors with

    similar resources.

    While simultaneous multithreading has excellent potential to increase processor

    utilization, it can add substantial complexity to the design.


    CONVERTING TLP TO ILP VIA SMT

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-

    level parallelism (ILP) and thread-level parallelism (TLP).

    Although they correspond to different granularities of

    parallelism, ILP and TLP are fundamentally identical: they both

    identify independent instructions that can execute in parallel

    and therefore can utilize parallel hardware. Wide-issue

    superscalar processors exploit ILP by executing multiple

    instructions from a single program in a single

cycle. Multiprocessors exploit TLP by executing different threads

    in parallel on different processors. Unfortunately, neither

    parallel processing style is capable of adapting to dynamically

    changing levels of ILP and TLP, because the hardware enforces

the distinction between the two types of parallelism. A multiprocessor must statically partition its resources among the

    multiple CPUs (see Figure 1); if insufficient TLP is available,

    some of the processors will be idle. A superscalar executes only

    a single thread; if insufficient ILP exists, much of that

processor's multiple-issue hardware will be wasted.

    Simultaneous multithreading (SMT) [Tullsen et al. 1995; 1996;

Gulati et al. 1996; Hirata et al. 1992] allows multiple threads to

    compete for and share available processor resources every

    cycle. One of its key advantages when executing parallel

    applications is its ability to use thread-level parallelism and

    instruction-level parallelism interchangeably. By allowing

multiple threads to share the processor's functional units

    simultaneously, thread-level parallelism is essentially

converted into instruction-level parallelism. An SMT processor


    can therefore accommodate variations in ILP and TLP. When a

program has only a single thread (i.e., it lacks TLP), all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP.

Figure 1. A comparison of issue slot (functional unit) utilization in various

    architectures. Each square corresponds to an issue slot, with white

    squares signifying unutilized slots. Hardware utilization suffers when

    a program exhibits insufficient parallelism or when available

    parallelism is not used effectively. A superscalar processor achieves

    low utilization because of low ILP in its single thread. Multiprocessors

    physically partition hardware to exploit TLP, and therefore

    performance suffers when TLP is low (e.g., in sequential portions of

    parallel programs). In contrast, simultaneous multithreading avoids

    resource partitioning. Because it allows multiple threads to compete

    for all resources in the same cycle, SMT can cope with varying levels

    of ILP and TLP; consequently utilization is higher, and performance is

    better.


    An SMT processor can uniquely exploit whichever type of

    parallelism is available, thereby utilizing the functional units

    more effectively to achieve the goals of greater throughput and

    significant program speedups.

    SIMULTANEOUS MULTITHREADING AND MULTIPROCESSORS

Both the SMT (simultaneous multithreading)

    processor and the on-chip shared-memory MPs we examine

    are built from a common out-of-order superscalar base

    processor. The multiprocessor combines several of these

    superscalar CPUs in a small-scale MP, whereas simultaneous

multithreading uses a wider-issue superscalar, and then adds support for multiple contexts.

    Base Processor Architecture

The base processor is a sophisticated, out-of-order superscalar processor with a dynamic scheduling core similar to the MIPS R10000 [Yeager 1996]. Figure 2 illustrates the organization of this processor,

    and Figure 3 shows its processor pipeline. On each cycle, the

    processor fetches a block of instructions from the instruction

cache. After decoding these instructions, the register-renaming

    logic maps the logical registers to a pool of physical renaming


    registers to remove false dependencies. Instructions are then

    fed to either the integer or floating-point instruction queues.

    When their operands become available, instructions are issued

from these queues to the corresponding functional units. Instructions are retired in order.

Figure 2. Organization of the base processor.

Figure 3. The base processor pipeline.

    SMT ARCHITECTURE

    The SMT architecture, which can simultaneously

    execute threads from up to eight hardware contexts, is a


    straightforward extension of the base processor. To support

    simultaneous multithreading, the base processor architecture

    requires significant changes in only two primary areas: the

instruction fetch mechanism and the register file. A conventional system of branch prediction hardware (branch

    target buffer and pattern history table) drives instruction

    fetching, although we now have 8 program counters and 8

    subroutine return stacks (1 per context). On each cycle, the

    fetch mechanism selects up to 2 threads (among threads not

    already incurring I-cache misses) and fetches up to 4

    instructions from each thread (the 2.4 scheme from Tullsen et

    al. [1996]). The total fetch bandwidth of 8 instructions is

    therefore equivalent to that required for an 8-wide superscalar

processor, and only 2 I-cache ports are required. Additional

    logic, however, is necessary in the SMT to prioritize thread

selection. Thread priorities are assigned using the ICOUNT feedback technique, which favors threads that are using processor resources most effectively. Under ICOUNT, highest

    priority is given to the threads that have the least number of

instructions in the decode, renaming, and queue pipeline

    stages. This approach prevents a single thread from

    clogging the instruction queue, avoids thread starvation, and

    provides a more even distribution of instructions from all

    threads, thereby heightening interthread parallelism. The peak

    throughput of our machine is limited by the fetch and decode

    bandwidth of 8 instructions per cycle.
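A minimal sketch of that selection rule (mine, not the report's; thread IDs and counts are invented):

    def icount_select(frontend_counts, icache_missing, n=2):
        """Pick up to n threads with the fewest instructions currently
        in the decode/rename/queue stages, skipping threads that are
        incurring I-cache misses."""
        eligible = [t for t in frontend_counts if t not in icache_missing]
        return sorted(eligible, key=lambda t: frontend_counts[t])[:n]

    counts = {0: 12, 1: 3, 2: 7, 3: 9}          # 4 of the 8 contexts active
    print(icount_select(counts, icache_missing={2}))
    # [1, 3] -> fetch up to 4 instructions from each of these two threads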

    Following instruction fetch and decode,

register renaming is performed, as in the base processor. Each thread can address 32 architectural integer (and FP) registers. The register-renaming mechanism maps these architectural registers (1 set per thread) onto the machine's physical


registers. An 8-context SMT will require at least 8 * 32 = 256 physical registers, plus additional physical registers for register renaming. With a larger register file, longer access times will be required, so the SMT processor pipeline is extended by 2 cycles to avoid an increase in the cycle time. Figure 3 compares the

    pipeline for the SMT versus that of the base superscalar. On the

    SMT, register reads take 2 pipe stages and are pipelined.

Writes to the register file behave in a similar manner, also using an extra pipeline stage. In practice, we found that the

    lengthened pipeline degraded performance by less than 2%

    when running a single thread. The additional pipe stage

    requires an extra level of bypass logic, but the number of

    stages has a smaller impact on the complexity and delay of this

logic (O(n)) than the issue width (O(n^2)) [Palacharla et al.

    1997]. Our previous study contains more details regarding the

    effects of the two-stage register read/write pipelines on the

    architecture and performance.In addition to the new fetch

    mechanism and the larger register file and longer pipeline, only

    three processor resources are replicated or added to support

    SMT: per-thread instruction retirement, trap mechanisms, and

    an additional thread id field in each branch target buffer entry.

    No additional hardware is required to perform multithreaded

    scheduling of instructions to functional units. The register-

renaming phase removes any apparent interthread register dependencies, so that the conventional instruction queues can be used to dynamically schedule instructions from multiple threads. Instructions from all threads are dumped into the instruction queues, and an instruction from any thread can be issued from the queues once its operands become available.
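As a sketch of the renaming idea (illustrative only; register numbers and pool size are invented): each context has its own map from architectural to physical registers, all drawing on one shared physical pool, so the same architectural name in two threads never collides.

    free_regs = list(range(256))            # 8 contexts * 32 = 256 registers
    rename = [dict() for _ in range(8)]     # one rename table per context

    def write_dest(thread, arch_reg):
        """Allocate a fresh physical register for a destination write."""
        phys = free_regs.pop(0)
        rename[thread][arch_reg] = phys
        return phys

    # Two threads both write architectural r5: distinct physical registers,
    # so no false interthread dependence is created.
    print(write_dest(0, "r5"))  # 0
    print(write_dest(3, "r5"))  # 1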


In this design, few resources are statically partitioned among contexts; consequently, almost all hardware resources are

    available even when only one thread is executing. The

architecture allows us to achieve the performance advantages of simultaneous multithreading, while keeping intact the design

    and single-thread peak performance of the dynamically

    scheduled CPU core present in modern superscalar

    architectures.

Single-Chip Multiprocessor Hardware Configurations

    In our analysis of SMT and multiprocessing, we focus on a

    particular region of the MP design space, specifically, small-

    scale, single-chip, shared-memory

    multiprocessors. As chip densities increase, single-chip

    multiprocessing will be possible, and some architects have

    already begun to investigate this use of chip real estate

    [Olukotun et al. 1996]. An SMT processor and a small-scale, on-

chip multiprocessor have many similarities: for example, both have large numbers of registers and functional units, on-chip caches, and the ability to issue multiple instructions each cycle.

    In this study, we keep these resources approximately similar

for the SMT and MP comparisons, and in some cases we give a hardware advantage to the MP. We look at both two- and four-

    processor multiprocessors, partitioning the scheduling unit

resources of the multiprocessor CPUs (the functional units, instruction queues, and renaming registers) differently for

    each case. In the two-processor MP (MP2), each processor

    receives half of the on-chip execution resources previously

    described, so that the total resources relative to an SMT are


    comparable (Table I). For a four-processor MP (MP4), each

    processor contains approximately one-fourth of the chip

    resources. The issue width for each processor in these two MP

models is indicated by the total number of functional units. Note that even within the MP design space, these two

    alternatives (MP2 and MP4) represent an interesting tradeoff

    between TLP and ILP. The two-processor machine can exploit

    more ILP, because each processor has more functional units

    than its MP4 counterpart, whereas MP4 has additional

processors to take advantage of more TLP. Table I also includes

    several multiprocessor configurations in which we increase

    hardware resources. These configurations are designed to

reduce bottlenecks in resource usage in order to improve aggregate performance. MP2fu (MP4fu), MP2q (MP4q), and MP2r (MP4r) address bottlenecks of functional units, instruction queues, and renaming registers, respectively. MP2a increases all three of these resources, so that the total execution resources of each processor are equivalent to a single SMT processor. (MP4a is similarly augmented in all three resource classes, so that the entire MP4a multiprocessor also has twice as many resources as our SMT.) For all MP configurations, the base processor uses the out-of-order scheduling processor described earlier and the base pipeline from Figure 3. Each MP

    processor only supports one context; therefore its register file

    will be smaller, and access will be faster than the SMT. Hence

    the shorter pipeline is more appropriate.


    Synchronization Mechanisms and Memory Hierarchy

    SMT has three key advantages over multiprocessing:

    flexible use of ILP and TLP, the potential for fast

synchronization, and a shared L1 cache. This study focuses on SMT's ability to exploit the first advantage by determining the

    costs of partitioning execution resources; we therefore allow

the multiprocessor to use SMT's synchronization mechanisms

    and cache hierarchy to avoid tainting our results with effects of

the latter two. We implement a set of synchronization primitives

    for thread creation and termination, as well as hardware

    blocking locks. Because the threads in an SMT processor share

    the same scheduling core, inexpensive hardware blocking locks

can be implemented in a synchronization functional unit. This cheap synchronization is not available to multiprocessors,


    because the distinct processors cannot share functional units.

    In our workload, most interthread synchronization is in the form

    of barrier synchronization or

simple locks, and we found that synchronization time is not critical to performance. Therefore, we allow the MPs to use the

    same cheap synchronization techniques, so that our

    comparisons are not colored by synchronization effects. The

    entire cache hierarchy, including the L1 caches, is shared by all

    threads in an SMT processor. Multiprocessors typically do not

    use a shared L1 cache to exploit data sharing between parallel

    threads. Each processor in an MP usually has its own private

    cache (as in the commercial multiprocessors described by Sun

    Microsystems [1997], Slater [1992], IBM [1997],and Silicon

    Graphics [1996]) and therefore incurs some coherence

    overhead when data sharing occurs. Because we allow the MP

to use the SMT's shared L1 cache, this coherence overhead is eliminated. Although multiple threads may have working sets that interfere in a shared cache, that interthread interference was not found to be a problem.

    Comparing SMT and MP

    In comparing the total hardware dedicated to our

    multiprocessor and SMT configurations, we have not taken into

    account chip area required for buses or the cycle-time effects

    of a wider-issue machine. In our study, the intent is not to claim

that SMT has an absolute x-percent performance advantage

    over MP, but instead to demonstrate that SMT can overcome

    some fundamental limitations of multiprocessors, namely, their

    inability to exploit changing levels of ILP and TLP. We believe


    that in the target design space we are studying, the intrinsic

    flaws resulting from resource partitioning in MPs will limit their

    effectiveness relative to SMT, even taking into consideration

    cycle time.

PARALLEL APPLICATIONS

    SMT is most effective when threads have complementary

    hardware resource requirements. Multiprogrammed workloads

    and workloads consisting of parallel applications both provide

TLP via independent streams of control, but they compete for hardware resources differently. Because a multiprogrammed

    workload does not share memory references across threads, it

    places more stress on the caches. Furthermore, its threads

    have different instruction execution patterns, causing

    interference in branch prediction hardware. On the other hand,

    multiprogrammed workloads are less likely to compete for

identical functional units. Although parallel applications have

    the benefit of sharing the caches and branch prediction

    hardware, they are an interesting and different test of SMT for

    several reasons. First, unlike the multiprogrammed workload,

    all threads in a parallel application execute the same code and,

    therefore, have similar execution resource requirements,

    memory reference patterns, and levels of ILP. Because all

    threads tend to have the same resource needs at the same

    time, there is potentially more contention for these resources

    compared to a multiprogrammed workload.


    Secondly, parallel applications illustrate the promise of SMT as

    an architecture

    for improving the performance of single applications. By using

    threads to parallelize programs, SMT can improve processor

utilization, but more importantly, it can achieve program

    speedups. Finally, parallel applications are a natural workload

    for traditional parallel architectures and therefore serve as a

    fair basis for comparing SMT and multiprocessors.

Effectively Using Parallelism on an SMT Processor

    Rather than adding more execution resources to improve

    performance, SMT boosts performance and improves

    utilization of existing resources by using parallelism more

effectively. Unlike multiprocessors that suffer from rigid


    partitioning, simultaneous multithreading permits dynamic

    resource sharing, so that resources can be flexibly partitioned

    on a per-cycle basis to match the ILP and TLP needs of the

program. When a thread has a lot of ILP, it can access all processor resources; and TLP can compensate for a lack of per-

    thread ILP.

As more threads are used, speedups increase (up to 2.68 on

    average with 8 threads), exceeding the performance gains

attained by the enhanced MP configurations. The degree of SMT's improvement varies across the benchmarks, depending

    on the amount of per-thread ILP. The five programs with the

    least ILP (radix, tomcatv, hydro2d, water-spatial, and shallow)

get the five largest speedups for SMT.T8, because TLP compensates for low ILP; programs that already have a large

    amount of ILP (LU and FFT) benefit less from using additional

    threads, because resources are already busy executing useful

instructions. In linpack, performance tails off after two threads, because the granularity of parallelism in the program is very

    small. The gain from parallelism is outweighed by the overhead

    of parallelization (not only thread creation, but also the work

    required to set up the loops in each thread).


IMPLICITLY-MULTITHREADED PROCESSORS (IMT)

IMT executes compiler-specified speculative threads from a sequential program on a wide-issue SMT pipeline. IMT is based on the fundamental

observation that Multiscalar's execution model, i.e., compiler-specified speculative threads [11], can be decoupled from the processor organization, i.e., distributed processing cores. Multiscalar [11] employs sophisticated

    specialized hardware, the register ring and address resolution buffer, which are

    strongly coupled to the distributed core organization. In contrast, IMT proposes

to map speculative threads onto a generic SMT. IMT differs fundamentally from

    prior proposals, TME and DMT, for speculative threading on SMT. While TME

    executes multiple threads only in the uncommon case of branch mispredictions,

IMT invokes threads in the common case of correct predictions, thereby enhancing execution parallelism. Unlike

    IMT, DMT creates threads in hardware. Because of the lack of compile-time

    information, DMT uses value prediction to break data dependence across

    threads. Unfortunately, inaccurate value prediction incurs frequent

    misspeculation stalls, prohibiting DMT from extracting thread-level parallelism

effectively. Moreover, selective recovery from misspeculation in DMT requires

    fast and frequent searches through prohibitively large (e.g., ~1000 entries)

custom instruction trace buffers that are difficult to implement efficiently. In this paper, we find that a naive mapping of compiler-specified speculative threads onto SMT performs poorly. Despite using an advanced compiler [14] to


    generate threads, a Naive IMT (N-IMT) implementation performs only

comparably to an aggressive superscalar. N-IMT's key shortcoming is its

    indiscriminate approach to fetching/executing instructions from threads,

    without accounting for resource availability, thread resource usage, and inter-thread dependence information. The resulting poor utilization of pipeline

    resources (e.g., issue queue, load/store queues, and register file) in N-IMT

    negatively offsets the advantages of speculative threading.

The Implicitly-MultiThreaded (IMT) processor utilizes SMT's support for multithreading by executing speculative threads. Figure 3 depicts the anatomy

    of an IMT processor derived from SMT. IMT uses the rename tables for

    register renaming, the issue queue for out-of-order scheduling, the per-context

    load/store queue (LSQ) and active list for memory dependences and instruction

    reordering prior to commit. As in SMT, IMT shares the functional units,

physical registers, issue queue, and memory hierarchy among all contexts. IMT exploits implicit parallelism, as opposed to the programmer-specified, explicit parallelism exploited by conventional SMT and multiprocessors. Like

    Multiscalar, IMT predicts the threads in succession and maps them to execution

resources, with the earliest thread as the non-speculative (head) thread, followed

    by subsequent speculative threads [11]. IMT honors the inter-thread control-

flow and register dependences specified by the compiler. IMT uses the LSQ to

    enforce inter-thread memory dependences. Upon completion, IMT commits the

threads in program order. Two IMT variations are presented: (1) a Naive IMT (N-IMT) that performs comparably to an aggressive superscalar, and (2) an Optimized IMT (O-IMT) that uses novel microarchitectural techniques to enhance performance.


    order. For instance, later (program-order) threads may result in resource (e.g.,

    physical registers, issue queue and LSQ entries) starvation in earlier threads,

    forcing the later threads to squash and relinquish the resources for use by earlier

threads. Unfortunately, frequent thread squashing due to indiscriminate resource allocation without regard to demand incurs high overhead. Moreover,

treating (control- and data-) dependent and independent threads alike is

    suboptimal. Fetching and executing instructions from later threads that are

    dependent on earlier threads may be counter-productive because it increases

    inter-thread dependence delays by taking away front-end fetch and processing

    bandwidth from earlier threads. Finally, dependent instructions from later

    threads exacerbate issue queue contention because they remain in the queue

    until the dependences are resolved.

    To mitigate the above shortcomings, O-IMT employs a novel resource- and

    dependence-based fetch policy that is bimodal. In the dependent mode, the

    policy biases fetch towards the non-speculative thread when the threads are

    likely to be dependent, fetching sequentially to the highest extent possible. In

    the independent mode, the policy uses ICOUNT when the threads are

    potentially independent, enhancing overlap among multiple threads. Because

loop iterations are typically independent, the policy employs an Inter-Thread

    Dependence Heuristic (ITDH) to identify loop iterations for the independent

    mode, otherwise considering threads to be dependent. ITDH predicts that

subsequent threads are loop iterations if the next two threads' start PCs are the same as the non-speculative (head) thread's start PC. To reduce resource contention among threads, the policy employs a Dynamic Resource Predictor

    (DRP) to initiate fetch from an invoked thread only if the available hardware

    resources exceed the predicted demand by the thread. The DRP dynamically

monitors the threads' activity and allows fetch to be initiated from newly

    invoked threads when earlier threads commit and resources become available.

    Figure 4 (a) depicts an example of DRP. O-IMT indexes into a table using the

    start PC of a thread. Each table entry holds the numbers of active list and LSQ


slots, and physical registers used by the thread's last four execution instances.

The pipeline monitors a thread's resource needs and, upon thread commit, updates the thread's DRP entry. DRP supplies the maximum among the four

instances for each resource as the prediction for the next instance's resource requirement. The policy alleviates inter-thread data dependence by processing producer instructions earlier and decreasing instruction execution stalls, thereby reducing pipeline resource contention.
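A compact sketch of these two mechanisms as described (my reconstruction; the data structures, PCs, and resource counts are invented):

    from collections import defaultdict, deque

    history = defaultdict(lambda: deque(maxlen=4))  # DRP: last 4 instances

    def drp_update(start_pc, used):
        """On thread commit, record the resources the instance used."""
        history[start_pc].append(used)

    def drp_predict(start_pc):
        """Predicted demand: the maximum over the last four instances."""
        past = history[start_pc]
        return max(past) if past else 0

    def may_fetch(start_pc, free_resources):
        # Initiate fetch from a newly invoked thread only if the free
        # resources exceed the predicted demand.
        return free_resources > drp_predict(start_pc)

    def itdh_independent(head_pc, next_pcs):
        """ITDH: independent (loop-iteration) mode iff the next two
        threads' start PCs equal the head thread's start PC."""
        return list(next_pcs[:2]) == [head_pc, head_pc]

    drp_update(0x400, 28); drp_update(0x400, 35)
    print(itdh_independent(0x400, [0x400, 0x400]))  # True -> use ICOUNT
    print(may_fetch(0x400, free_resources=40))      # True (40 > 35)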

In contrast to O-IMT, prior proposals for speculative threading using SMT use variants of conventional fetch policies.

    TME uses biased-ICOUNT, a variant of ICOUNT that does not consider

resource availability and thread-level independence. DMT's fetch policy

    statically partitions two fetch ports, and allocates one port for the non-

    speculative thread and the other for speculative threads in a round-robin

    manner. However, DMT does not suffer from resource contention because the

    design assumes prohibitively large custom instruction trace buffers (holding

    thousands of instructions) allowing for threads to make forward progress

    without regards to resource availability and thread-level

    independence.Unfortunately, frequent associative searches through such large

    buffers are slow and impractical.


Figure 4. Dynamic resource prediction (a) and context multiplexing (b).

    Multiplexing Hardware Contexts

    Much like prior proposals, N-IMT assigns a single thread to a hardware context.

    Because many programs have short threads [14] and real SMT implementations

    are bound to have only a few (e.g., 2-8) contexts, this approach often leads to

insufficient instruction overlap. Larger threads, however, increase both the

    likelihood of dependence misspeculation [14] and the number of instructions

discarded per misspeculation, and cause speculative buffer overflow [5]. Instead,

    to increase instruction overlap without the unwanted side-effects of large

    threads, O-IMT multiplexes the hardware contexts by mapping as many threads

    as allowed by the resources in one context (typically 3-6 threads for SPEC2K).

    Context multiplexing requires for each context only an additional fetch PC

    register and rename table pointer per thread for a given maximum number of

    threads per context. Context multiplexing differs from prior proposals for

mapping multiple threads onto a single processing core [12, 3] to alleviate load


imbalance, in that multiplexing allows instructions from multiple threads within a context to execute and share resources simultaneously.

Two design complexities arise due to sharing resources in context multiplexing. First,

conventional active list and LSQ designs assume that instructions enter these queues in (the predicted) program order. Such an assumption enables the active

    list to be a non-searchable (potentially large) structure, and allows honoring

memory dependences via an ordered (associative) search in the LSQ. If care is

    not taken, multiplexing would invalidate this assumption if multiple threads

    were to place instructions out of program order in the shared active list and

    LSQ. Such out-of-order placement would require an associative search on the

    active list to determine the correct instruction(s) to be removed upon commit or

misspeculation. In the case of the LSQ, the requirements would be even more

    complicated. A memory access would have to search through the LSQ for an

    address match among the entries from the accessing thread, and then

(conceptually) repeat the search among entries from the thread preceding the

accessing thread, working towards older threads. Unfortunately, the active list

    and LSQ cannot afford these additional design complications because active

lists are made large and therefore non-searchable by design, and the LSQ's ordered, associative search is already complex and time-critical. Second,

    allowing a single context to have multiple out-of-program-order threads

    complicates managing interthread dependence. Because two in-program-order

    threads may be mapped to different contexts, honoring memory dependences

would require memory accesses to search through multiple contexts, thereby prohibitively increasing LSQ search time and design complexity. O-IMT avoids the second design complexity by mapping threads to a context in program

    order. Inter-thread and intra-thread dependences within a single context are

treated similarly. Figure 4 (b) shows how in-program-order threads X and X+1

    are mapped to a context. In addition to program order within contexts, O-IMT

    tracks the global program order among the contexts themselves for precise

    interrupts.
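The bookkeeping this requires is modest; a sketch (illustrative structures and values only, not the paper's hardware):

    from dataclasses import dataclass, field

    @dataclass
    class ThreadSlot:
        fetch_pc: int        # per-thread fetch PC within the context
        rename_ptr: int      # per-thread rename-table pointer

    @dataclass
    class Context:
        threads: list = field(default_factory=list)  # in program order

    contexts = [Context() for _ in range(8)]
    global_order = []        # global program order across contexts

    def map_thread(ctx_id, fetch_pc, rename_ptr, max_per_ctx=4):
        ctx = contexts[ctx_id]
        if len(ctx.threads) >= max_per_ctx:
            return False                   # context full: defer the thread
        ctx.threads.append(ThreadSlot(fetch_pc, rename_ptr))
        global_order.append((ctx_id, fetch_pc))
        return True

    # Threads X and X+1 land in context 0, in program order.
    map_thread(0, fetch_pc=0x400, rename_ptr=0)
    map_thread(0, fetch_pc=0x420, rename_ptr=1)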


POWER CONSUMPTION IN MULTITHREADED PROCESSORS

Along with circuit-level techniques such as voltage and frequency reduction and low-power circuit design, optimizations can be applied at the architecture level as well. The other environment where power consumption is important is that of high-performance processors. These processors are used in environments where energy supply is not typically a limitation. For high-performance computing, clock frequencies continue to

    increase, causing the power dissipated to reach the thresholds of current

packaging technology. When the maximum power dissipation becomes a critical

    design constraint, that architecture which maximizes the performance/power

    ratio thereby maximizes performance. Multithreading is a processor

    architecture technique which has been shown to provide significant

    performance advantage over conventional architectures which can only follow a

    single stream of execution. Simultaneous multithreading (SMT) [15, 14] can

    provide up to twice the throughput of a dynamic superscalar single-threaded

    processor. Announced architectures that will feature multithreading include the

    Compaq Alpha EV8 [5] and the Sun MAJC Java processor [12], joining the

    existing Tera MTA supercomputer architecture [1]. We can show that a

    multithreading processor is attractive in the context of low-power or power-

    constrained devices for many of the same reasons that enable its high

    throughput. First, it supplies extra parallelism via multiple threads, allowing the

    processor to rely much less heavily on speculation; thus, it wastes fewer

    resources on speculative, never-committed instructions. Second, it provides

    both higher and more even parallelism over time when running multiple

    threads, wasting less power on underutilized execution resources. A

multithreading architecture also allows different design decisions than are available in a single-threaded processor, such as the ability to impact power

    through the thread selection mechanism.

    Modelling Power

To describe the power and energy results of the processor, a particular power model must be defined. This power model is integrated into a detailed cycle-

    by-cycle instruction-level architectural simulator, allowing the model to

    accurately model both useful and non-useful (incorrectly speculated) use of all


processor resources. The power model utilized for this study is an area-based model. In this power model, the microprocessor is divided into several high-level units, and the corresponding activity of each unit is used in the computation of the overall power consumption. The following processor units are modeled: L1 instruction cache, L1 data cache, L2 unified cache, fetch unit,

    integer and floating point instruction queues, branch predictor, instruction TLB,

    data TLB, load-store queue, integer and floating-point functional units, register

file, register renamer, completion queue, and return stack. The total processor energy consumption is the summation of the unit energies, where each unit energy is equal to:
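(A form consistent with the description that follows; this is a reconstruction, not necessarily the report's exact formula. Here AF_i is an activity factor, w_i the fraction of the unit that activity exercises, r_c the unit's ratio of circuit type c, and d_c that circuit type's energy density.)

    E_{unit} = Area_{unit} \cdot \Big( \sum_{i} AF_i \, w_i \Big) \cdot \Big( \sum_{c} r_c \, d_c \Big)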

An activity factor is a defined statistic measuring how many architectural events a program or workload generates for a particular unit. For a given unit, there

    may be multiple activity factors each representing a different action that

    can occur within the unit. Activity factors represent high level actions and

    therefore are independent of the data values causing a particular unit to become

    active. We can compute the overall power consumption of the processor on a

    cycle-by-cycle basis by summing the energy usage of all units in the processor.

The entire power model consists of 44 activity factors. Each of these activity

    factors is a measurement of the activity of a particular unit or function of the

    processor (e.g. number of instruction queue writes, L2 cache reads). The model

    does not assume that a given factor or combination of factors exercises a

    microprocessor unit in its entirety, but instead is modeled as exercising a

    fraction of the area depending on the particular activity factor or combination of

activity factors. The weight of how much of the unit is exercised by a particular activity is an estimate based on knowledge of the unit's functionality and


    assumed implementation. Each unit is also assigned a relative area value which

is an estimate of the unit's size given its parameters. In addition to the overall

    area of the unit, each unit is modeled as consisting of a particular ratio of 4

different circuit types: dynamic logic, static logic, memory, and clock circuitry. Each circuit type is given an energy density. The ratios of circuit types

    for a given logic unit are estimates based on engineering experience and

    consultation with processor designers. The advantage of an area-based design is

    its adaptability. It can model an aggregate of several existing processors, as

    long as average area breakdowns can be derived for each of the units. And it

    can be adapted for future processors for which no circuit implementations exist,

    as long as an estimate of the area expansion can be made. In both cases, we

    rely on the somewhat coarse assumption that the general circuit makeup of

    these units remains fairly constant across designs. This assumption should be

    relatively valid for the types of high-level architectural comparisons made in

    this paper, but might not be for more low-level design options. Our power

    model is an architecture-level power model,and is not intended to produce

    precise estimates of absolute whole-processor power consumption. For

    example, it does not include dynamic leakage factors, I/O power (e.g.,pin

    drivers), etc. However, the power model is useful for obtaining relative power

    numbers for comparison against results obtained from the same power model.

    The numbers that are obtained from the power model are independent of clock

    frequency (again, because we focus on relative values).
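The shape of such an area-based model is easy to convey in miniature (all areas, ratios, densities, and activity numbers below are invented for illustration):

    DENSITY = {"dynamic": 1.0, "static": 0.5, "memory": 0.8, "clock": 1.2}

    UNITS = {
        # name: (relative area, circuit-type ratios summing to 1)
        "icache": (3.0, {"memory": 0.8, "static": 0.1, "clock": 0.1}),
        "int_fu": (2.0, {"dynamic": 0.6, "static": 0.3, "clock": 0.1}),
    }

    def unit_energy(name, activity, exercised_fraction):
        """Energy for one unit this cycle: activity count, times the
        fraction of the unit's area exercised, times area and density."""
        area, ratios = UNITS[name]
        density = sum(ratios[c] * DENSITY[c] for c in ratios)
        return activity * exercised_fraction * area * density

    def cycle_energy(activities):
        # activities: {unit: (events this cycle, fraction exercised)}
        return sum(unit_energy(u, a, f) for u, (a, f) in activities.items())

    print(cycle_energy({"icache": (2, 0.5), "int_fu": (4, 0.25)}))  # ~4.17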

    Power Consumption of a Multithreaded Architecture

The power-energy characteristics of multithreaded execution can now be examined. We examine performance (IPC), energy efficiency (energy per useful instruction executed, E/UI), and power (the average power utilized during each simulation run) for both single-thread and multithreaded execution.
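Spelled out as a formula (my restatement of the metric's name, not an equation quoted from the report), useful instructions are those that actually commit:

    E/UI = E_{total} / N_{committed}

Lower is better: speculative work that is later squashed inflates E_{total} without increasing N_{committed}.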


the lowest single-thread value). This is done for two reasons: first, the power model is not intended to be an indicator of absolute power or energy, so this allows us to focus on the relative values; second, it allows us to view the diverse metrics on a single graph. Later results are normalized to other baseline values. The single-thread results (Figure 5) show diversity in almost all metrics.

Performance and power appear to be positively correlated; the more of the processor used, the more power used. However, performance and energy per useful instruction are somewhat negatively correlated; the fewer unused resources and the fewer wastefully used resources, the more efficient the processor. Also included in the graph (the Ideal E/UI bar) is the energy efficiency of each application with perfect branch prediction. This gives an indication of the energy lost to incorrect speculation (when compared with the E/UI result). Gcc and go suffer most from low branch prediction accuracy. Gcc, for example, only commits 39% of fetched instructions and 56% of executed instructions. In cases such as gcc and go, the energy effect of mispredicted branches is quite large (35%-40%). Figure 6 shows that multithreaded execution has significantly improved energy efficiency over conventional processor execution. Multithreading attacks the two primary sources of wasted energy/power: unutilized resources and, most importantly, resources wasted due to incorrect speculation. Multithreaded processors rely far less on speculation to achieve high throughput than conventional processors.


Figure 5. The relative performance, energy efficiency, and power results for the six benchmarks (single-thread execution).

Figure 6. The relative performance, energy efficiency, and power results as the number of threads varies.

With multithreading, then, we achieve as much as a 90% increase in performance (the 4-thread result), while actually decreasing the energy dissipation per useful instruction of a given workload. This is achieved via a relatively modest increase in power. The increase in power occurs because of the overall increase in utilization and throughput of the processor, that is, because the processor completes the same work in a shorter period of time.

Figure 7 shows the effect of the reduced misspeculation on the individual components of the power model. The greatest reductions come in the front of the pipeline (e.g. fetch), which is always the slowest to recover from a branch misprediction. The reduction in energy is achieved despite an increase in L2 cache utilization and power. Multithreading, particularly with a multiprogrammed

    workload, reduces the locality of accesses to the cache, resulting in more cache

    misses; however, this has a bigger impact on L2 power than L1. The L1 caches

    see more cache fills, while the L2 sees more accesses per instruction. The

    increase in L1 cache fills is also mitigated by the same overall efficiency gains,

    as it sees fewer speculative accesses.


Figure 7. Contributors to the overall energy efficiency for different numbers of threads.

The power advantage of simultaneous multithreading can provide a significant gain in both the mobile and high-performance domains. The E/UI values demonstrate that a multithreaded

    processor operates more efficiently, which is desirable in a mobile computing

environment. For a high-performance processor, the constraint is more likely to

    be average power or peak power. SMT can make better use of a given average


    power budget, but it really shines in being able to take greater advantage of a

    peak power constraint. That is, an SMT processor will achieve sustained

    performance that is much closer to the peak performance (power) the processor

was designed for than a single-threaded processor.

    Power Optimizations

    In this section, we examine power optimizations that will either reduce the

    overall energy usage of a multithreading processor or will reduce the power

    consumption of the processor. We examine the following possible power

optimizations, each made practical in some way by multithreading: reduced execution bandwidth, dynamic power consumption controls, and thread fetch optimizations. Reduced execution bandwidth exploits the higher execution efficiency of multithreading to achieve higher performance while maintaining moderate power consumption. Dynamic power consumption controls allow a high-performance processor to degrade activity gracefully when power consumption thresholds are exceeded. The thread fetch optimization looks to achieve better processor resource utilization via thread selection algorithms.

    Reduced Execution Bandwidth

    Prior studies have shown that an SMT processor has the potential to achieve

    double the instruction throughput of a single-threaded processor with similar

execution resources. This implies that a given performance goal can be met

    with a less aggressive architecture, possibly consuming less power. Therefore, a

multi-threaded processor with a smaller execution bandwidth may achieve comparable performance to a more aggressive single-threaded processor while

    consuming less power. In this section, for example, we compare the power

    consumption and energy efficiency of an 8-issue single-threaded processor with

    a 4-issue multithreaded processor.

    Controlling Power Consumption via Feedback


We strive to reduce the peak power dissipation while still maximizing performance, a goal that future high-performance architects are likely to face. In this section, the processor is given feedback regarding power dissipation in order to limit its activity. Such a feedback mechanism does indeed provide reduced average power, but the real attraction of this technique is the ability to reduce peak power to an arbitrary level. A feedback mechanism that guaranteed that the actual power consumption did not exceed a threshold could make peak and achieved power consumption arbitrarily close. Thus, a processor could still be 8-issue (because that width is

    useful some fraction of the time), but might still be designed knowing that the

    peak power corresponded to 5 IPC sustained, as long as there is some

    mechanism in the processor that can guarantee sustained power does not exceed

    that level. This section models a mechanism, applied to an SMT processor,

    which utilizes the power consumption of the processor as the feedback input.

    For this optimization, we would like to set an average power consumption

    target and have the processor stay within a reasonable tolerance of the target

while achieving higher performance than the single-threaded processor could

    achieve. Our target power, in this case, is exactly the average power of all the

    benchmarks running on the single-threaded processor. This is a somewhat

    arbitrary threshold, but makes it easier to interpret the results. The feedback is

    an estimated value for the current power the processor is utilizing. An

    implementation of this mechanism could use either on-die sensors or a power

    value which is computed from some performance counters. The advantage of

using performance counters is that the lag time for the value is much less than that of a physical temperature sensor. For the experiments listed here, the power value used is the number obtained from the simulator power model. We used a power sampling interval of 5 cycles (this is basically the delay before the feedback information can be used by the processor). In each

    of the techniques we define thresholds whereby if the power consumption

    exceeds a given threshold, fetching or branch prediction activity will be

    modified to curb power consumption. The threshold values are fixed throughout

  • 8/7/2019 Manu (Hyper Threading)

    44/53

    a given simulation. In all cases, fetching is halted for all threads when power

    consumption reaches 110% of the desired power value.
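A minimal sketch of such a feedback mechanism is shown below. The 5-cycle sampling interval and the halt-all-fetch threshold at 110% of the target come from the description above; the intermediate thresholds and action names are assumptions made for illustration.

    #include <stdio.h>

    struct power_feedback {
        double target;     /* desired average power                */
        int    interval;   /* cycles between feedback updates      */
        long   cycle;
        double estimate;   /* last power value visible to the core */
    };

    static void feedback_update(struct power_feedback *fb, double measured)
    {
        /* The measured value only becomes visible every `interval`
         * cycles, modeling the feedback delay. */
        if (++fb->cycle % fb->interval == 0)
            fb->estimate = measured;
    }

    static const char *throttle_action(const struct power_feedback *fb)
    {
        double ratio = fb->estimate / fb->target;
        if (ratio >= 1.10) return "halt fetch for all threads";          /* from the text */
        if (ratio >= 1.05) return "fetch least-speculative threads only"; /* assumed step */
        if (ratio >= 1.00) return "reduce fetch bandwidth";               /* assumed step */
        return "no restriction";
    }

    int main(void)
    {
        struct power_feedback fb = { 10.0, 5, 0, 0.0 };
        for (int cycle = 1; cycle <= 15; cycle++) {
            feedback_update(&fb, 10.0 + cycle * 0.1); /* rising modeled power */
            printf("cycle %2d: %s\n", cycle, throttle_action(&fb));
        }
        return 0;
    }

The graded thresholds reflect the point made below: a multithreaded processor has several dimensions on which to scale back, so activity can be reduced incrementally rather than all at once.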

Figure 8. The average over all benchmarks of the performance and power of 2 and 4 threads with feedback.

Power feedback control is a technique which enables the designer to lower the peak power constraints on the processor. This technique is particularly effective in a multithreading environment for two reasons. First, even with the drastic step of eliminating speculation, even for fetch, an SMT processor can still make much better progress than a single-threaded processor. Second, we have more dimensions on which to scale back execution; the results were best when we took advantage of the opportunity to scale back threads incrementally.

    Thread Selection


    This optimization examines the effect of thread selection algorithms on power

    and performance. The ability to select from among multiple threads provides

    the opportunity to optimize for power consumption and operating efficiency

when making thread fetch decisions. The optimization we model involves two threads that are selected each cycle to attempt to fetch instructions from the I-cache. The heuristic used to select which threads fetch can have a large impact

    on performance [14]. This mechanism can also impact power if we use it to bias

    against the most speculative threads. This heuristic does not, in fact, favor low-

    power threads over high-power threads, because that only delays the running of

    the high-power threads. Rather, it favors less speculative threads over more

    speculative threads. This works because the threads that get stalled become less

    speculative over time (as branches are resolved in the processor), and quickly

    become good candidates for fetch. We modify the ICOUNT thread selection

    scheme, from [14], by adding a branch confidence metric to thread fetch

priority. This could be viewed as a derivative of pipeline gating [11] applied to a multithreaded processor; however, we use it to change the fetch decision, not to stop fetching altogether. The low-conf scheme biases heavily against threads with more unresolved low-confidence branches, while the all branches scheme biases against the threads with more branches overall in the processor (regardless of confidence). Both use ICOUNT as the secondary metric.
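The following sketch shows one way the low-conf selection could be realized on top of ICOUNT. The thread-state fields and the exact tie-breaking are illustrative assumptions; the all branches variant would simply compare the total unresolved-branch counts instead.

    #include <stdio.h>
    #include <stdlib.h>

    struct thread_state {
        int id;
        int icount;             /* instructions in pre-issue pipeline stages */
        int branches;           /* all unresolved branches                   */
        int low_conf_branches;  /* unresolved low-confidence branches        */
    };

    /* low-conf scheme: fewer unresolved low-confidence branches wins;
     * ICOUNT (fewer in-flight instructions) breaks ties. */
    static int cmp_low_conf(const void *a, const void *b)
    {
        const struct thread_state *x = a, *y = b;
        if (x->low_conf_branches != y->low_conf_branches)
            return x->low_conf_branches - y->low_conf_branches;
        return x->icount - y->icount;
    }

    int main(void)
    {
        struct thread_state threads[] = {
            { 0, 12, 3, 2 },   /* deeply speculative thread */
            { 1,  8, 1, 0 },   /* few unresolved branches   */
            { 2, 20, 2, 1 },
            { 3,  5, 4, 3 },
        };
        size_t n = sizeof threads / sizeof threads[0];

        /* Sort by fetch priority; the first two entries are the threads
         * allowed to fetch this cycle (the first has higher priority). */
        qsort(threads, n, sizeof threads[0], cmp_low_conf);
        printf("fetch this cycle: thread %d (primary), thread %d\n",
               threads[0].id, threads[1].id);
        return 0;
    }

Note how a stalled thread's priority rises on its own: as its branches resolve, its low-confidence count drops and it again becomes a good candidate for fetch.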


Figure 9. The performance, power, and energy effects of using branch status to direct thread selection.

Figure 9 shows that this mechanism has the potential to improve both raw performance and energy efficiency at the same time, particularly with the confidence predictor. For the 4-thread simulations, performance increased by 6.3% and 4.9% for the low-conf and all branches schemes, respectively. The efficiency gains should be enough to outweigh the additional power required for the confidence counters (not modeled), assuming the architecture did not already need them for other purposes. If not, the technique without the confidence counters was equally effective at improving power efficiency,

lagging only slightly in performance. This figure shows an improvement even with two threads because, when two threads are chosen for fetch in a cycle, one is still chosen to have higher priority and may consume more of the available fetch bandwidth.

OPERATING SYSTEM CONSIDERATIONS

The operating system essentially manages the Hyper-Threaded processor as if it were running in a true MP arrangement, with some minor modifications. The Hyper-Threaded processor can be toggled by the OS between MultiTask (MT) mode and two Single Task (ST) modes, ST0 and ST1.

In cases where a logical processor is idle, the operating system (which must be at Ring 0 privilege level) may issue a HLT (Halt) instruction, which enables either ST0 or ST1 mode, and the partitioned processor resources can be recombined and applied to the single active thread. This is in line with Intel's statement that a single-threaded application should run as fast on a Hyper-Threaded processor as on a non-Hyper-Threaded processor. An interrupt sent to the Halted thread can wake it up and move the processor back to MT mode.

    In cases where more than one physical Hyper-Threaded

    processor is installed, the OS should schedule threads across the physical

processors before scheduling two threads on a single processor. In a dual-

    processor system the BIOS might aid in this procedure by assigning processor

    mappings as follows:

    Processor 1 = Physical processor1, logical processor1

    Processor 2 = Physical processor2, logical processor1

    Processor 3 = Physical processor1, logical processor2

Processor 4 = Physical processor2, logical processor2
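The mapping above amounts to enumerating all first logical processors before any second logical processors, as the small sketch below illustrates; the loop structure is only an illustration of the ordering, not BIOS code.

    #include <stdio.h>

    int main(void)
    {
        const int n_physical = 2, n_logical = 2;
        /* All first logical processors are enumerated before any second
         * logical processors, so the OS naturally spreads threads across
         * physical packages before doubling up on one. */
        for (int l = 0; l < n_logical; l++)
            for (int p = 0; p < n_physical; p++)
                printf("Processor %d = Physical processor%d, logical processor%d\n",
                       l * n_physical + p + 1, p + 1, l + 1);
        return 0;
    }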

Looking at this functionality in MP-capable Windows OSs, Windows 2000 unfortunately has some limitations in the way it counts physical and logical processors. In a four-HT-CPU-licensed Windows 2000 server, only the first logical processor of each of the four physical processors is usable; the second logical processor on each physical processor goes unused. In contrast, under Windows XP Pro or .NET Server using a dual-HT-CPU configuration, all four logical processors will be recognized and used. A four-HT-CPU-licensed .NET Server can use 8 logical processors. This is the key reason why Intel recommends XP in the Windows arena.


Hyper-Threading implementation in Windows XP

Hyper-Threading (HT) is a feature of the CPU itself, and not the motherboard, chipset, or other component, that allows the operating system to see more than one logical CPU. It's called logical because there is only one CPU; it just appears as two. HT does this by providing a special set of protocols to the OS which can be executed/queried at bootup. The OS determines if HT is available and then installs itself accordingly. There is a distinct difference between HT and SMP (Symmetric Multi-Processing: multiple CPUs on a main board). SMP is chipset-dependent and requires a coordinated effort by many factors off the CPU itself (mostly for cache coherency). This is one of the main reasons Microsoft has been complaining lately about implementing HT in its Windows NT-based operating systems (NT, 2K, and XP). These versions of Windows use something called a HAL (Hardware Abstraction Layer) which logically isolates everything a PC can do from the PC itself. The HAL is specifically designed to accommodate the cumulative capabilities derived from any and all PCs Microsoft supports (if any PC out there can do it, then there's a part of the HAL which accommodates its range of features). The problem with HT is that it doesn't fit into their existing HAL model. Since HT is a chip component in and of itself, the basic CPU aspects of the HAL have to be extended or modified to accommodate it. This is where the problem lies. It is ironic, actually, as just about the time Microsoft releases its greatest operating system (Windows XP), the very HAL it's designed on has to be modified extensively to accommodate HT.
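One concrete form of that boot-time query is the x86 CPUID instruction: leaf 1 reports an HTT feature flag in EDX bit 28 and a logical-processor count per package in EBX bits 23:16. The sketch below (using the GCC/Clang <cpuid.h> helper) shows the idea; it is a user-level illustration, not the code an OS actually runs at boot.

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 1 not supported");
            return 1;
        }
        int htt     = (edx >> 28) & 1;    /* HTT feature flag                  */
        int logical = (ebx >> 16) & 0xff; /* logical CPUs per physical package
                                           * (meaningful when HTT is set)      */
        printf("HTT: %d, logical processors per physical package: %d\n",
               htt, logical);
        return 0;
    }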

    HT PLATFORM REQUIREMENTS


Intel specifies that a P4 3.06 GHz or higher processor is required to support Hyper-Threading on the desktop. Even though older P4 desktop processors had HT technology built in, it was permanently disabled at the fab. All 533 MHz FSB-capable Intel chipsets will support Hyper-Threading except the A1 (initial) stepping of the 845G. The newer chipsets implement

    necessary APIC (Advanced Programmable Interrupt Controller) updates for

    Hyper-Threaded processors per Intel (to support two logical processors within a

physical processor, we suspect, as each logical processor has its own Local

    APIC).

    Other Hyper-Threading requirements:

    Motherboards must meet P4 3.06 GHz power and

    thermal specifications, and have an updated BIOS and drivers. As of this

    writing, Intel only recommends Windows XP (both Pro and Home editions) and

    Linux 2.4.x for use in Hyper-Threading-enabled PCs.

    CONCLUSION


Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. This is a significant new technology direction for Intel's processors. It will become increasingly important for all future applications as it adds a new technique for obtaining additional performance at lower transistor and power costs. The first implementation of Hyper-Threading Technology was

    done on the Intel Xeon processor MP. In this implementation there

    were two logical processors on each physical processor. The logical

    processors had their own independent architecture state, but they shared

    nearly all the physical execution and hardware resources of the processor.

    The goal was to implement the technology at minimum cost while

ensuring forward progress on one logical processor, even if the other is stalled,

    and to deliver full performance even when there is only one active logical

    processor. These goals were achieved through efficient logical processor

    selection algorithms and the creative partitioning and recombining

    algorithms of many key resources. Measured performance on the Intel Xeon

    processor MP with Hyper-Threading Technology shows performance gains

    of up to 30% on common server application benchmarks for this

    technology.

    The startling growth of the Internet and telecommunications

    demands increasingly higher levels of processor performance. But this

enhancement cannot come at the expense of power and cost, so previous methods of attaining higher processor performance can no longer be afforded. In this context, Intel's Hyper-Threading becomes very important, as it achieves maximum processor performance using minimum hardware complexity, and thereby much reduced cost and acceptable power consumption.

    REFERENCES


1. T. N. Vijaykumar and G. S. Sohi, Task Selection for a Multiscalar Processor, in Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-31), December 1998.

2. S. Wallace, B. Calder, and D. M. Tullsen, Threaded Multiple Path Execution, in Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.

3. M. Gulati and N. Bagherzadeh, Performance Study of a Multithreaded Superscalar Microprocessor, in Proc. Int'l Symp. High-Performance Computer Architecture, ACM, 1996.

4. W. Yamamoto and M. Nemirovsky, Increasing Superscalar Performance through Multistreaming, in Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, IFIP.

5. H. Hirata et al., An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads, in Proc. Int'l Symp. Computer Architecture, ACM, New York, 1992.

6. A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz, APRIL: A Processor Architecture for Multiprocessing, in Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990.

7. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porter, and B. Smith, The Tera Computer System, in International Conference on Supercomputing, June 1990.


8. L. A. Barroso et al., Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing, in Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

9. M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee, The M-Machine Multicomputer, in 28th Annual International Symposium on Microarchitecture, November 1995.

10. L. Hammond, B. Nayfeh, and K. Olukotun, A Single-Chip Multiprocessor, Computer, September 1997.

11. D. J. C. Johnson, HP's Mako Processor, Microprocessor Forum, October 2001.