Government Engineering College, Thrissur
Department of Electronics and Communication
Seminar Report 2004
HYPER-THREADING
Presented by
MANU SAGAR K.V
REG NO: 01-633
VIIth SEMESTER
ACKNOWLEDGEMENT
I extend my sincere gratitude to Prof. N Premachandran,
Principal, and Prof. K.P Indira Devi, Head of Electronics & Communication
Department, Govt. Engineering College Thrissur, for providing me with the
necessary infrastructure in completing my seminar. I would like to convey a
deep sense of gratitude to the seminar coordinator Mrs. C.R Muneera, Asst.
Prof Dept. of Electronics & Communication, for giving me unstinted support on
my endeavor in the development of this report. I am also grateful to Mr. C.D
Anil Kumar, Lecturer, Dept. of Electronics & Communication, for his timely
advice.
CONTENTS
ABSTRACT
INTRODUCTION
PROCESSOR MICRO-ARCHITECTURE
CONCEPT OF SIMULTANEOUS MULTITHREADING
CONVERTING TLP TO ILP VIA SMT
SMT AND MULTIPROCESSORS
SYNCHRONIZATION MECHANISMS AND MEMORY HIERARCHY
IMPLICITLY MULTITHREADED PROCESSORS
POWER CONSUMPTION IN MULTI-THREADED PROCESSORS
OPERATING SYSTEM CONSIDERATIONS
HT PLATFORM REQUIREMENTS
CONCLUSION
REFERENCES
Hyper Threading
Abstract
Intel's Hyper-Threading Technology brings the concept of simultaneous
multi-threading to the Intel Architecture. Hyper-Threading
Technology makes a single physical processor appear as two logical processors;
the physical execution resources are shared and the architecture state is
duplicated for the two logical processors. From a software or architecture
perspective, this means operating systems and user programs can schedule
processes or threads to logical processors as they would on multiple physical
processors. From a microarchitecture perspective, this means that instructions
from both logical processors will persist and execute simultaneously on shared
execution resources.
The first implementation of Hyper-Threading
Technology was done on the Intel Xeon processor MP. In this implementation
there are two logical processors on each physical processor.
The logical processors have their own independent architecture state, but they
share nearly all the physical execution and hardware resources of the processor.
The goal was to implement the technology at minimum cost while ensuring
forward progress on each logical processor, even if the other is stalled, and to
deliver full performance even when there is only one active logical processor.
The potential for Hyper-Threading Technology is
tremendous; our current implementation has only just begun to tap into this
potential. Hyper-Threading Technology is expected to be viable from mobile
processors to servers; its introduction into market segments other than servers is
only gated by the availability and prevalence of threaded applications and
workloads in those markets.
INTRODUCTION
The amazing growth of the Internet and telecommunications is powered
by ever-faster systems demanding increasingly higher levels of processor
performance. To keep up with this demand, we cannot rely entirely on
traditional approaches to processor design. Microarchitecture techniques used
to achieve past processor performance improvements (super-pipelining, branch
prediction, super-scalar execution, out-of-order execution, caches) have made
microprocessors increasingly complex, with more transistors and higher power
consumption. In fact, transistor counts and power are increasing at
rates greater than processor performance. Processor architects are therefore
looking for ways to improve performance at a greater rate than transistor counts
and power dissipation. Intel's Hyper-Threading Technology is one solution.
Making Hyper-Threading Technology a reality was the result
of enormous dedication, planning, and sheer hard work from a large number of
designers, validators, architects, and others. There was incredible teamwork
from the operating system developers, BIOS writers, and software developers
who helped with innovations and provided support for many decisions that were
made during the definition process of Hyper-Threading Technology. Many
dedicated engineers are continuing to work with our ISV partners to analyze
application performance for this technology. Their contributions and hard work
have already made and will continue to make a real difference to our customers.
PROCESSOR MICRO-ARCHITECTURE
Traditional approaches to processor design have focused on
higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques
to achieve higher clock speeds involve pipelining the microarchitecture to
finer granularities, also called super-pipelining. Higher clock frequencies can
greatly improve performance by increasing the number of instructions that can
be executed each second. Because there will be far more instructions in-flight in
a superpipelined microarchitecture, handling of events that disrupt the pipeline,
e.g., cache misses, interrupts and branch mispredictions, can be costly. ILP
refers to techniques to increase the number of instructions executed each clock
cycle. For example, a super-scalar processor has multiple parallel execution
units that can process instructions simultaneously. With super-scalar execution,
several instructions can be executed each clock cycle. However, with simple
in-order execution, it is not enough to simply have multiple execution units.
The challenge is to find enough instructions to execute.
One technique is out-of-order execution where a large window of
instructions is simultaneously evaluated and sent to execution units, based on
instruction dependencies rather than program order. Accesses to DRAM
memory are slow compared to execution speeds of the processor. One
technique to reduce this latency is to add fast caches close to the processor.
Caches can provide fast memory access to frequently accessed data or
instructions. However, caches can only be fast when they are small. For this
reason, processors often are designed with a cache hierarchy in which fast,
small caches are located and operated at access latencies very close to that of
the processor core, and progressively larger caches, which handle less
frequently accessed data or instructions, are implemented with longer access
latencies. However, there will always be times when the data needed will not be
in any processor cache. Handling such cache misses requires accessing
memory, and the processor is likely to quickly run out of instructions to execute
before stalling on the cache miss.
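The cost of such a hierarchy is often summarized by the average memory access time (AMAT): each level's miss rate multiplies the penalty of going to the next level. The sketch below is illustrative only; the latencies and miss rates are assumed values, not figures from any processor discussed here.

    # Illustrative sketch: average memory access time (AMAT) across a cache
    # hierarchy. All latencies (in cycles) and miss rates are assumed values.
    def amat(levels, memory_latency):
        """levels: list of (hit_time, miss_rate) pairs, ordered L1 outward."""
        # Work from the last cache level back: a miss at one level pays the
        # average access time of the level below it.
        penalty = memory_latency
        for hit_time, miss_rate in reversed(levels):
            penalty = hit_time + miss_rate * penalty
        return penalty

    # L1: 2-cycle hit, 5% misses; L2: 12-cycle hit, 20% misses; DRAM: 200 cycles.
    print(amat([(2, 0.05), (12, 0.20)], 200))  # -> 4.6 cycles on average

Even with small miss rates, the 200-cycle DRAM term dominates whenever a request falls through both caches, which is why a stalled cache miss can idle the whole pipeline.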
The vast majority of techniques to improve processor performance from one
generation to the next are complex and often add significant die-size and power
costs. These techniques increase performance but not with 100% efficiency;
i.e., doubling the number of execution units in a processor does not double the
performance of the processor, due to limited parallelism in instruction flows.
Similarly, simply doubling the clock rate does not double the performance due
to the number of processor cycles lost to branch mispredictions.
A figure in the report compares the relative increase in performance and the
costs, such as die size and power, over the last ten years on Intel processors.
To isolate the microarchitecture impact, this comparison assumes that the four
generations of processors are on the same silicon process technology and that
the speed-ups are normalized to the performance of an Intel486 processor.
Although we use Intel's processor history in this example, other high-
performance processor manufacturers during this time period would show
similar trends. Intel's processor performance, due to microarchitecture advances
alone, has improved integer performance five- or six-fold. Most integer
applications have limited ILP and the instruction flow can be hard to predict.
Over the same period, the relative die size has gone up fifteen-fold, a three-
times-higher rate than the gains in integer performance. Fortunately, advances
in silicon process technology allow more transistors to be packed into a given
amount of die area so that the actual measured die size of each generation
microarchitecture has not increased significantly.
The relative power increased almost eighteen-fold
during this period. Fortunately, there exist a number of known techniques to
significantly reduce power consumption on processors and there is much on-
going research in this area. However, current processor power dissipation is at
the limit of what can be easily dealt with in desktop platforms and we must put
greater emphasis on improving performance in conjunction with new
technology, specifically to control power.
CONVENTIONAL MULTI-THREADING
A look at today's software trends reveals that server applications
consist of multiple threads or processes that can be executed in parallel. On-line
transaction processing and Web services have an abundance of software threads
that can be executed simultaneously for faster performance. Even desktop
applications are becoming increasingly parallel. Intel architects have been
trying to leverage this so-called thread-level parallelism (TLP) to gain a better
ratio of performance to transistor count and power.
In both the high-end and mid-range server markets,
multiprocessors have been commonly used to get more performance from the
system. By adding more processors, applications potentially get substantial
performance improvement by executing multiple threads on
multiple processors at the same time. These threads might be from the same
application, from different applications running simultaneously, from operating
system services, or from operating system threads doing background
maintenance. Multiprocessor systems have been used for many years, and high-
end programmers are familiar with the techniques to exploit multiprocessors for
higher performance levels.
In recent years a number of other techniques to further
exploit TLP have been discussed and some products have been announced. One
of these techniques is chip multiprocessing (CMP), where two processors are
put on a single die.
The two processors each have a full set of execution and architectural
resources. The processors may or may not share a large on-chip cache. CMP is
largely orthogonal to conventional multiprocessor systems, as you can have
multiple CMP processors in a multiprocessor configuration. Recently
announced processors incorporate two processors on each die. However, a CMP
chip is significantly larger than the size of a single-core chip and therefore more
expensive to manufacture; moreover, it does not begin to address the die size
and power considerations.
TIME-SLICE MULTI-THREADING
Time-slice multithreading is where the processor switches
between software threads after a fixed time period. Quite a bit of what a CPU
does is illusion. For instance, modern out-of-order processor architectures don't
actually execute code sequentially in the order in which it was written. An
out-of-order execution (OOE) architecture takes code that was
written and compiled to be executed in a specific order, reschedules the
sequence of instructions (if possible) so that they make maximum use of the
processor resources, executes them, and then arranges them back in their
original order so that the results can be written out to memory. To the
programmer and the user, it looks as if an ordered, sequential stream of
instructions went into the CPU and an identically ordered, sequential stream of
computational results emerged. Only the CPU knows in what order the
program's instructions were actually executed, and in that respect the processor
is like a black box to both the programmer and the user.
The same kind of sleight-of-hand happens when
we run multiple programs at once, except that this time the operating system is
also involved in the scam. To the end user, it appears as if the processor is
"running" more than one program at the same time, and indeed, there actually
are multiple programs loaded into memory. But the CPU can execute only one
of these programs at a time. The OS maintains the illusion of concurrency by
rapidly switching between running programs at a fixed interval, called a time
slice. The time slice has to be small enough that the user doesn't notice any
degradation in the usability and performance of the running programs, and it
has to be large enough that each program has a sufficient amount of CPU time
in which to get useful work done. Most modern operating systems include a
way to change the size of an individual program's time slice. A program with
a larger time slice gets more actual execution time on the CPU relative to its
lower priority peers, and hence it runs faster.
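As a software analogy, the scheduling loop can be sketched as below. This is a minimal illustration in Python, with invented program names and slice sizes; real schedulers add priorities, preemption, and accounting.

    # Illustrative sketch of time-slice scheduling: one program runs at a
    # time, and rapid switching creates the illusion of concurrency.
    from collections import deque

    def run_time_sliced(work, slices):
        """work: {program: remaining work units}; slices: {program: time slice}."""
        ready = deque(work)
        while ready:
            name = ready.popleft()
            # A program with a larger slice gets more CPU time per turn.
            done = min(slices[name], work[name])
            work[name] -= done
            print(name, "ran for", done, "units")
            if work[name] > 0:
                ready.append(name)   # unfinished: go to the back of the queue

    run_time_sliced({"editor": 30, "compiler": 90},
                    {"editor": 10, "compiler": 20})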
Time-slice multithreading can result in
wasted execution slots, but it can effectively minimize the effects of long latencies
to memory. Switch-on-event multithreading instead switches threads on long-
latency events such as cache misses. This approach can work well for server
applications that have large numbers of cache misses and where the two threads
are executing similar tasks. However, neither the time-slice nor the switch-
on-event multi-threading technique achieves optimal overlap of
many sources of inefficient resource usage, such as branch mispredictions,
instruction dependencies, etc.
Finally, there is simultaneous multi-threading, where
multiple threads can execute on a single processor without switching. The
threads execute simultaneously and make much better use of the resources.
This approach makes the most effective use of processor
resources: it maximizes performance relative to transistor count and power
consumption.
CONCEPT OF SIMULTANEOUS MULTI-THREADING
Simultaneous Multi-threading is a processor design that
combines hardware multithreading with superscalar processor technology to
allow multiple threads to issue instructions each cycle. Unlike other hardware
multithreaded architectures (such as the Tera MTA), in which only a single
hardware context (i.e., thread) is active on any given cycle, SMT permits all
thread contexts to simultaneously compete for and share processor resources.
Unlike conventional superscalar processors, which suffer from a lack of per-
thread instruction-level parallelism, simultaneous multithreading uses multiple
threads to compensate for low single-thread ILP. The performance consequence
is significantly higher instruction throughput and program speedups on a variety
of workloads that include commercial databases, web servers and scientific
applications in both multiprogrammed and parallel environments.
Simultaneous multithreading has already had an impact in both the academic and
commercial communities.
It has been found that the throughput gains from
simultaneous multithreading can be achieved without extensive changes to a
conventional wide-issue superscalar, either in hardware structures or sizes. It is
seen that the architecture for simultaneous multithreading achieves three
goals: (1) it minimizes the architectural impact on the conventional superscalar
design, (2) it has minimal performance impact on a single thread executing
alone, and (3) it achieves significant throughput gains when running multiple
threads. Simultaneous multithreading enjoys a 2.5-fold improvement over
an unmodified superscalar with the same hardware resources. This speedup is
enabled by an advantage of multithreading previously unexploited in other
architectures: the ability to favor, for fetch and issue, those threads most
efficiently using the processor each cycle, thereby providing the "best"
instructions to the processor. Several heuristics that help to identify and use
the best threads for fetch and issue have been found, and such heuristics can
increase throughput by as much as 37%. Using the best fetch and issue
alternatives, we then use bottleneck analysis to identify opportunities for further
gains on the improved architecture.
Simultaneous Multithreading: Maximizing On-Chip
Parallelism
The increase in component density on modern microprocessors has led
to a substantial increase in on-chip parallelism. In particular, modern
superscalar RISCs can issue several instructions to independent functional units
each cycle. However, the benefit of such superscalar architectures is ultimately
limited by the parallelism available in a single thread.
Simultaneous multithreading is a technique permitting several independent
threads to issue instructions to a super-scalar's multiple functional units in a
single cycle. In the most general case, the binding between thread and
functional unit is completely dynamic. We present several models of
simultaneous multithreading and compare them with wide superscalar, fine-
grain multithreaded, and single-chip, multiple-issue multiprocessing
architectures. To perform these evaluations, a simultaneous multithreaded
architecture was simulated based on the DEC Alpha 21164 design, executing
code generated by the Multiflow trace-scheduling compiler. The results show
that: (1) No single latency-hiding technique is likely to produce acceptable
utilization of wide superscalar processors. Increasing processor utilization will
therefore require a new approach, one that attacks multiple causes of processor
idle cycles. (2) Simultaneous multithreading is such a technique. With our
machine model, an 8-thread, 8-issue simultaneous multithreaded processor
sustains over 5 instructions per cycle, while a single-threaded processor can
sustain fewer than 1.5 instructions per cycle with similar resources and issue
bandwidth. (3) Multithreaded workloads degrade cache performance relative to
single-thread performance, as previous studies have shown. We evaluate several
cache configurations and demonstrate that private instruction and shared data
caches provide excellent performance regardless of the number of threads. (4)
Simultaneous multithreading is an attractive alternative to single-chip
multiprocessors. We show that simultaneous multithreaded processors with a
variety of organizations are all superior to conventional multiprocessors with
similar resources.
While simultaneous multithreading has excellent potential to increase processor
utilization, it can add substantial complexity to the design.
CONVERTING TLP TO ILP VIA SMT
To achieve high performance, contemporary
computer systems rely on two forms of parallelism: instruction-
level parallelism (ILP) and thread-level parallelism (TLP).
Although they correspond to different granularities of
parallelism, ILP and TLP are fundamentally identical: they both
identify independent instructions that can execute in parallel
and therefore can utilize parallel hardware. Wide-issue
superscalar processors exploit ILP by executing multiple
instructions from a single program in a single
cycle. Multiprocessors exploit TLP by executing different threads
in parallel on different processors. Unfortunately, neither
parallel processing style is capable of adapting to dynamically
changing levels of ILP and TLP, because the hardware enforces
the distinction between the two types of parallelism. A
multiprocessor must statically partition its resources among the
multiple CPUs (see Figure 1); if insufficient TLP is available,
some of the processors will be idle. A superscalar executes only
a single thread; if insufficient ILP exists, much of that
processor's multiple-issue hardware will be wasted.
Simultaneous multithreading (SMT) [Tullsen et al. 1995; 1996;
Gulati et al. 1996; Hirata et al. 1992] allows multiple threads to
compete for and share available processor resources every
cycle. One of its key advantages when executing parallel
applications is its ability to use thread-level parallelism and
instruction-level parallelism interchangeably. By allowing
multiple threads to share the processor's functional units
simultaneously, thread-level parallelism is essentially
converted into instruction-level parallelism. An SMT processor
can therefore accommodate variations in ILP and TLP. When a
program has only a single thread (i.e., it lacks TLP) all of the
SMT processor's resources can be dedicated to that thread;
when more TLP exists, this parallelism can compensate for a
lack of per-thread ILP.
Figure 1. A comparison of issue slot (functional unit) utilization in various
architectures. Each square corresponds to an issue slot, with white
squares signifying unutilized slots. Hardware utilization suffers when
a program exhibits insufficient parallelism or when available
parallelism is not used effectively. A superscalar processor achieves
low utilization because of low ILP in its single thread. Multiprocessors
physically partition hardware to exploit TLP, and therefore
performance suffers when TLP is low (e.g., in sequential portions of
parallel programs). In contrast, simultaneous multithreading avoids
resource partitioning. Because it allows multiple threads to compete
for all resources in the same cycle, SMT can cope with varying levels
of ILP and TLP; consequently utilization is higher, and performance is
better.
An SMT processor can uniquely exploit whichever type of
parallelism is available, thereby utilizing the functional units
more effectively to achieve the goals of greater throughput and
significant program speedups.
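The contrast in the figure can be made concrete with a toy utilization model. The sketch below assumes an 8-wide machine and invented per-thread ILP values; it is not a simulation of any real processor.

    # Toy model of Figure 1: fraction of issue slots filled each cycle by an
    # 8-wide superscalar, a 2x4-wide multiprocessor, and an 8-wide SMT.
    def utilization(kind, ilp_per_thread, width=8):
        if kind == "superscalar":   # one thread owns all 8 slots
            return min(ilp_per_thread[0], width) / width
        if kind == "mp2":           # hardware statically split into 4 + 4
            half = width // 2
            return sum(min(ilp, half) for ilp in ilp_per_thread[:2]) / width
        if kind == "smt":           # every thread competes for all 8 slots
            return min(sum(ilp_per_thread), width) / width

    ilp = [3, 2, 2, 1]              # four threads, each with modest ILP
    for kind in ("superscalar", "mp2", "smt"):
        print(kind, utilization(kind, ilp))
    # superscalar 0.375, mp2 0.625, smt 1.0: SMT fills the slots with
    # whichever parallelism (ILP or TLP) happens to be available.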
SIMULTANEOUS MULTITHREADING AND MULTIPROCESSORS
Both the SMT (simultaneous multithreading)
processor and the on-chip shared-memory MPs we examine
are built from a common out-of-order superscalar base
processor. The multiprocessor combines several of these
superscalar CPUs in a small-scale MP, whereas simultaneous
multithreading uses a wider-issue superscalar and then adds
support for multiple contexts.
Base Processor Architecture
The base processor is a
sophisticated, out-of-order superscalar processor with a
dynamic scheduling core similar to the MIPS R10000 [Yeager
1996]. Figure 2 illustrates the organization of this processor,
and Figure 3 shows its processor pipeline. On each cycle, the
processor fetches a block of instructions from the instruction
cache. After decoding these instructions, the register-renaming
logic maps the logical registers to a pool of physical renaming
registers to remove false dependencies. Instructions are then
fed to either the integer or floating-point instruction queues.
When their operands become available, instructions are issued
from these queues to the corresponding functional units.
Instructions are retired in order.
Figure 2. Organization of the base processor.
Figure 3. The base processor pipeline and the extended SMT pipeline.
SMT ARCHITECTURE
The SMT architecture, which can simultaneously
execute threads from up to eight hardware contexts, is a
straightforward extension of the base processor. To support
simultaneous multithreading, the base processor architecture
requires significant changes in only two primary areas: the
instruction fetch mechanism and the register file. A
conventional system of branch prediction hardware (branch
target buffer and pattern history table) drives instruction
fetching, although we now have 8 program counters and 8
subroutine return stacks (1 per context). On each cycle, the
fetch mechanism selects up to 2 threads (among threads not
already incurring I-cache misses) and fetches up to 4
instructions from each thread (the 2.4 scheme from Tullsen et
al. [1996]). The total fetch bandwidth of 8 instructions is
therefore equivalent to that required for an 8-wide superscalar
processor, and only 2 I-cache ports are required. Additional
logic, however, is necessary in the SMT to prioritize thread
selection. Thread priorities are assigned using the ICOUNT
feedback technique, which favors threads that are using
processor resources most effectively. Under ICOUNT, highest
priority is given to the threads that have the least number of
instructions in the decode, renaming, and queue pipeline
stages. This approach prevents a single thread from
clogging the instruction queue, avoids thread starvation, and
provides a more even distribution of instructions from all
threads, thereby heightening interthread parallelism. The peak
throughput of our machine is limited by the fetch and decode
bandwidth of 8 instructions per cycle.
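The ICOUNT selection step can be sketched as follows. The structure is an illustration of the policy as described above (two threads selected per cycle, I-cache-missing threads skipped); the counts are invented.

    # Illustrative sketch of ICOUNT thread selection: each cycle, fetch from
    # the two threads with the fewest instructions in the decode, renaming,
    # and queue stages, skipping threads waiting on I-cache misses.
    def icount_select(front_end_counts, icache_miss, n_select=2):
        """front_end_counts: {thread: instructions in pre-issue stages}."""
        candidates = [t for t in front_end_counts if not icache_miss[t]]
        candidates.sort(key=lambda t: front_end_counts[t])  # fewest first
        return candidates[:n_select]

    counts = {0: 12, 1: 3, 2: 25, 3: 7}
    misses = {0: False, 1: False, 2: False, 3: True}
    print(icount_select(counts, misses))  # -> [1, 0]: threads 1 and 0 fetch

Favoring low counts is what keeps a stalled thread from clogging the queues: a thread whose instructions pile up in the front end simply stops being selected.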
Following instruction fetch and decode,
register renaming is performed, as in the base processor. Each
thread can address 32 architectural integer (and FP) registers.
The register-renaming mechanism maps these architectural
registers (1 set per thread) onto the machine's physical
registers. An 8-context SMT will require at least 8 × 32 = 256
physical registers, plus additional physical registers for register
renaming. With a larger register file, longer access times will be
required, so the SMT processor pipeline is extended by 2 cycles
to avoid an increase in the cycle time. Figure 3 compares the
pipeline for the SMT versus that of the base superscalar. On the
SMT, register reads take 2 pipe stages and are pipelined.
Writes to the register file behave in a similar manner, also
using an extra pipeline stage. In practice, we found that the
lengthened pipeline degraded performance by less than 2%
when running a single thread. The additional pipe stage
requires an extra level of bypass logic, but the number of
stages has a smaller impact on the complexity and delay of this
logic (O(n)) than the issue width (O(n²)) [Palacharla et al.
1997]. Our previous study contains more details regarding the
effects of the two-stage register read/write pipelines on the
architecture and performance. In addition to the new fetch
mechanism and the larger register file and longer pipeline, only
three processor resources are replicated or added to support
SMT: per-thread instruction retirement, trap mechanisms, and
an additional thread id field in each branch target buffer entry.
No additional hardware is required to perform multithreaded
scheduling of instructions to functional units. The register-
renaming phase removes any apparent interthread register
dependencies, so that the conventional instruction queues can be used to
dynamically schedule instructions from multiple threads.
Instructions from all threads are dumped into the instruction
queues, and an instruction from any thread can be issued fromthe queues once its operands become available.In this design,
few resources are statically partitioned among
contexts; consequently, almost all hardware resources are
available even when only one thread is executing. The
architecture allows us to achieve the performance advantages
of simultaneous multithreading, while keeping intact the design
and single-thread peak performance of the dynamically
scheduled CPU core present in modern superscalar
architectures.
Single-Chip Multiprocessor Hardware
Configurations
In our analysis of SMT and multiprocessing, we focus on a
particular region of the MP design space, specifically, small-
scale, single-chip, shared-memory
multiprocessors. As chip densities increase, single-chip
multiprocessing will be possible, and some architects have
already begun to investigate this use of chip real estate
[Olukotun et al. 1996]. An SMT processor and a small-scale, on-
chip multiprocessor have many similarities: for example, both
have large numbers of registers and functional units, on-chip
caches, and the ability to issue multiple instructions each cycle.
In this study, we keep these resources approximately similar
for the SMT and MP comparisons, and in some cases we give a
hardware advantage to the MP. We look at both two- and four-
processor multiprocessors, partitioning the scheduling unit
resources of the multiprocessor CPUs (the functional
units, instruction queues, and renaming registers) differently for
each case. In the two-processor MP (MP2), each processor
receives half of the on-chip execution resources previously
described, so that the total resources relative to an SMT are
comparable (Table I). For a four-processor MP (MP4), each
processor contains approximately one-fourth of the chip
resources. The issue width for each processor in these two MP
models is indicated by the total number of functional units.
Note that even within the MP design space, these two
alternatives (MP2 and MP4) represent an interesting tradeoff
between TLP and ILP. The two-processor machine can exploit
more ILP, because each processor has more functional units
than its MP4 counterpart, whereas MP4 has additional
processors to take advantage of more TLP. Table I also includes
several multiprocessor configurations in which we increase
hardware resources. These configurations are designed to
reduce bottlenecks in resource usage in order to improve
aggregate performance. MP2fu (MP4fu), MP2q (MP4q), and MP2r
(MP4r) address bottlenecks of functional units, instruction
queues, and renaming registers, respectively. MP2a increases
all three of these resources, so that the total execution
resources of each processor are equivalent to a single SMT
processor. (MP4a is similarly augmented in all three resource
classes, so that the entire MP4a multiprocessor also has twice
as many resources as our SMT.) For all MP configurations, the
base processor uses the out-of-order scheduling processor
described earlier and the base pipeline from Figure 3. Each MP
processor only supports one context; therefore its register file
will be smaller, and access will be faster than in the SMT. Hence
the shorter pipeline is more appropriate.
Synchronization Mechanisms and Memory Hierarchy
SMT has three key advantages over multiprocessing:
flexible use of ILP and TLP, the potential for fast
synchronization, and a shared L1 cache. This study focuses on
SMT's ability to exploit the first advantage by determining the
costs of partitioning execution resources; we therefore allow
the multiprocessor to use SMT's synchronization mechanisms
and cache hierarchy to avoid tainting our results with effects of
the latter two. We implement a set of synchronization primitives
for thread creation and termination, as well as hardware
blocking locks. Because the threads in an SMT processor share
the same scheduling core, inexpensive hardware blocking locks
can be implemented in a synchronization functional unit. This
cheap synchronization is not available to multiprocessors,
because the distinct processors cannot share functional units.
In our workload, most interthread synchronization is in the form
of barrier synchronization or
simple locks, and we found that synchronization time is not
critical to performance. Therefore, we allow the MPs to use the
same cheap synchronization techniques, so that our
comparisons are not colored by synchronization effects. The
entire cache hierarchy, including the L1 caches, is shared by all
threads in an SMT processor. Multiprocessors typically do not
use a shared L1 cache to exploit data sharing between parallel
threads. Each processor in an MP usually has its own private
cache (as in the commercial multiprocessors described by Sun
Microsystems [1997], Slater [1992], IBM [1997], and Silicon
Graphics [1996]) and therefore incurs some coherence
overhead when data sharing occurs. Because we allow the MP
to use the SMT's shared L1 cache, this coherence overhead is
eliminated. Although multiple threads may have working sets
that interfere in a shared cache, that interthread interference
did not prove to be a problem.
Comparing SMT and MP
In comparing the total hardware dedicated to our
multiprocessor and SMT configurations, we have not taken into
account chip area required for buses or the cycle-time effects
of a wider-issue machine. In our study, the intent is not to claim
that SMT has an absolute x percent performance advantage
over MP, but instead to demonstrate that SMT can overcome
some fundamental limitations of multiprocessors, namely, their
inability to exploit changing levels of ILP and TLP. We believe
that in the target design space we are studying, the intrinsic
flaws resulting from resource partitioning in MPs will limit their
effectiveness relative to SMT, even taking into consideration
cycle time.
PARALLEL APPLICATIONS
SMT is most effective when threads have complementary
hardware resource requirements. Multiprogrammed workloads
and workloads consisting of parallel applications both provide
TLP via independent streams of control, but they compete for
hardware resources differently. Because a multiprogrammed
workload does not share memory references across threads, it
places more stress on the caches. Furthermore, its threads
have different instruction execution patterns, causing
interference in branch prediction hardware. On the other hand,
multiprogrammed workloads are less likely to compete for
identical functional units. Although parallel applications have
the benefit of sharing the caches and branch prediction
hardware, they are an interesting and different test of SMT for
several reasons. First, unlike the multiprogrammed workload,
all threads in a parallel application execute the same code and,
therefore, have similar execution resource requirements,
memory reference patterns, and levels of ILP. Because all
threads tend to have the same resource needs at the same
time, there is potentially more contention for these resources
compared to a multiprogrammed workload.
Secondly, parallel applications illustrate the promise of SMT as
an architecture
for improving the performance of single applications. By using
threads to parallelize programs, SMT can improve processor
utilization, but more importantly, it can achieve program
speedups. Finally, parallel applications are a natural workload
for traditional parallel architectures and therefore serve as a
fair basis for comparing SMT and multiprocessors.
Effectively Using Parallelism on an SMT
Processor
Rather than adding more execution resources to improve
performance, SMT boosts performance and improves
utilization of existing resources by using parallelism more
effectively. Unlike multiprocessors that suffer from rigid
partitioning, simultaneous multithreading permits dynamic
resource sharing, so that resources can be flexibly partitioned
on a per-cycle basis to match the ILP and TLP needs of the
program. When a thread has a lot of ILP, it can access all
processor resources, and TLP can compensate for a lack of per-
thread ILP.
As more threads are used, speedups increase (up to 2.68 on
average with 8 threads), exceeding the performance gains
attained by the enhanced MP configurations. The degree of
SMT's improvement varies across the benchmarks, depending
on the amount of per-thread ILP. The five programs with the
least ILP (radix, tomcatv, hydro2d, water-spatial, and shallow)
get the five largest speedups on SMT.T8, because TLP
compensates for low ILP; programs that already have a large
amount of ILP (LU and FFT) benefit less from using additional
threads, because resources are already busy executing useful
instructions. In linpack, performance tails off after two threads,
because the granularity of parallelism in the program is very
small. The gain from parallelism is outweighed by the overhead
of parallelization (not only thread creation, but also the work
required to set up the loops in each thread).
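This tail-off follows from a simple model of parallel speedup with a fixed per-thread cost. The numbers below are invented for illustration, not measurements from linpack.

    # Illustrative model: speedup when each extra thread adds a fixed
    # parallelization overhead (thread creation plus per-thread loop setup).
    def speedup(work, threads, overhead_per_thread):
        parallel_time = work / threads + overhead_per_thread * (threads - 1)
        return work / parallel_time

    for t in (1, 2, 4, 8):
        print(t, round(speedup(work=100.0, threads=t,
                               overhead_per_thread=4.0), 2))
    # 1 -> 1.0, 2 -> 1.85, 4 -> 2.7, 8 -> 2.47: past some point the
    # overhead outweighs the gain, so performance tails off.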
IMPLICITLY-MULTITHREADED PROCESSORS (IMT)

IMT executes compiler-specified speculative threads from a sequential
program on a wide-issue SMT pipeline. IMT is based on the fundamental
observation that Multiscalar's execution model, i.e., compiler-specified
speculative threads [11], can be decoupled from the processor organization,
i.e., distributed processing cores. Multiscalar [11] employs sophisticated
specialized hardware, the register ring and address resolution buffer, which are
strongly coupled to the distributed core organization. In contrast, IMT proposes
to map speculative threads onto a generic SMT. IMT differs fundamentally from
prior proposals, TME and DMT, for speculative threading on SMT. While TME
executes multiple threads only in the uncommon case of branch mispredictions,
IMT invokes threads in the common
case of correct predictions, thereby enhancing execution parallelism. Unlike
IMT, DMT creates threads in hardware. Because of the lack of compile-time
information, DMT uses value prediction to break data dependences across
threads. Unfortunately, inaccurate value prediction incurs frequent
misspeculation stalls, prohibiting DMT from extracting thread-level parallelism
effectively. Moreover, selective recovery from misspeculation in DMT requires
fast and frequent searches through prohibitively large (e.g., ~1000 entries)
custom instruction trace buffers that are difficult to implement efficiently. In this
paper, we find that a naive mapping of compiler-specified speculative threads
onto SMT performs poorly. Despite using an advanced compiler [14] to
generate threads, a Naive IMT (N-IMT) implementation performs only
comparably to an aggressive superscalar. N-IMT's key shortcoming is its
indiscriminate approach to fetching/executing instructions from threads,
without accounting for resource availability, thread resource usage, and inter-
thread dependence information. The resulting poor utilization of pipeline
resources (e.g., issue queue, load/store queues, and register file) in N-IMT
negatively offsets the advantages of speculative threading.
The Implicitly-Multithreaded (IMT) processor utilizes SMT's support for
multithreading by executing speculative threads. Figure 3 depicts the anatomy
of an IMT processor derived from SMT. IMT uses the rename tables for
register renaming, the issue queue for out-of-order scheduling, the per-context
load/store queue (LSQ) and active list for memory dependences and instruction
reordering prior to commit. As in SMT, IMT shares the functional units,
physical registers, issue queue, and memory hierarchy among all contexts. IMT
exploits implicit parallelism, as opposed to the programmer-specified, explicit
parallelism exploited by conventional SMT and multiprocessors. Like
Multiscalar, IMT predicts the threads in succession and maps them to execution
resources, with the earliest thread as the nonspeculative (head) thread, followed
by subsequent speculative threads [11]. IMT honors the inter-thread control-
flow and register dependences specified by the compiler. IMT uses the LSQ to
enforce inter-thread memory dependences. Upon completion, IMT commits the
threads in program order. Two IMT variations are presented: (1) a Naive IMT
(N-IMT) that performs comparably to an aggressive superscalar, and (2) an
Optimized IMT (O-IMT) that uses novel microarchitectural techniques to
enhance performance.
N-IMT's indiscriminate fetch policy is also oblivious to resource demand and
thread order. For instance, later (program-order) threads may result in resource (e.g.,
physical registers, issue queue and LSQ entries) starvation in earlier threads,
forcing the later threads to squash and relinquish the resources for use by earlier
threads. Unfortunately, frequent thread squashing due to indiscriminate
resource allocation without regard to demand incurs high overhead. Moreover,
treating (control- and data-) dependent and independent threads alike is
suboptimal. Fetching and executing instructions from later threads that are
dependent on earlier threads may be counter-productive because it increases
inter-thread dependence delays by taking away front-end fetch and processing
bandwidth from earlier threads. Finally, dependent instructions from later
threads exacerbate issue queue contention because they remain in the queue
until the dependences are resolved.
To mitigate the above shortcomings, O-IMT employs a novel resource- and
dependence-based fetch policy that is bimodal. In the dependent mode, the
policy biases fetch towards the non-speculative thread when the threads are
likely to be dependent, fetching sequentially to the highest extent possible. In
the independent mode, the policy uses ICOUNT when the threads are
potentially independent, enhancing overlap among multiple threads. Because
loop iterations are typically independent, the policy employs an Inter-Thread
Dependence Heuristic (ITDH) to identify loop iterations for the independent
mode, otherwise considering threads to be dependent. ITDH predicts that
subsequent threads are loop iterations if the next two threads' start PCs are the
same as the nonspeculative (head) thread's start PC.

To reduce resource contention among threads, the policy employs a Dynamic
Resource Predictor (DRP) to initiate fetch from an invoked thread only if the
available hardware resources exceed the predicted demand by the thread. The
DRP dynamically monitors the thread's activity and allows fetch to be initiated
from newly invoked threads when earlier threads commit and resources become
available. Figure 4 (a) depicts an example of DRP. O-IMT indexes into a table
using the start PC of a thread. Each table entry holds the numbers of active list
and LSQ slots, and physical registers used by the thread's last four execution
instances. The pipeline monitors a thread's resource needs, and upon thread
commit, updates the thread's DRP entry. DRP supplies the maximum among the
four instances for each resource as the prediction for the next instance's
resource requirement. The policy alleviates inter-thread data dependence by
processing producer instructions earlier and decreasing instruction execution
stalls, thereby reducing pipeline resource contention.

In contrast to O-IMT, prior proposals for speculative threading using SMT use
variants of conventional fetch policies. TME uses biased-ICOUNT, a variant of
ICOUNT that does not consider resource availability and thread-level
independence. DMT's fetch policy statically partitions two fetch ports, and
allocates one port for the non-speculative thread and the other for speculative
threads in a round-robin manner. However, DMT does not suffer from resource
contention because the design assumes prohibitively large custom instruction
trace buffers (holding thousands of instructions), allowing threads to make
forward progress without regard to resource availability and thread-level
independence. Unfortunately, frequent associative searches through such large
buffers are slow and impractical.
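The two predictors can be sketched together as below. The data-structure shapes (a four-entry history per start PC, a three-resource demand vector) follow the description above; everything else, including the field choices, is an illustrative assumption.

    # Illustrative sketch of O-IMT's fetch gating. DRP: per-start-PC table of
    # resource usage over the last four executions; predict the maximum.
    # ITDH: treat upcoming threads as independent loop iterations when the
    # next two start PCs match the head thread's start PC.
    from collections import defaultdict, deque

    class DRP:
        def __init__(self):
            self.table = defaultdict(lambda: deque(maxlen=4))  # keyed by start PC

        def update(self, start_pc, usage):        # called on thread commit;
            self.table[start_pc].append(usage)    # usage = (regs, lsq, active list)

        def predict(self, start_pc):
            hist = self.table[start_pc]
            if not hist:
                return None                       # no history: do not gate on it
            return tuple(max(u[i] for u in hist) for i in range(3))

    def may_fetch(drp, start_pc, free):
        """Initiate fetch only if free resources exceed the predicted demand."""
        need = drp.predict(start_pc)
        return need is None or all(f >= n for f, n in zip(free, need))

    def itdh_independent(head_pc, next_pcs):
        """Independent mode (use ICOUNT) only for likely loop iterations."""
        return len(next_pcs) >= 2 and next_pcs[0] == next_pcs[1] == head_pc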
Figure 4. Dynamic resource prediction (a) and context multiplexing (b).
Multiplexing Hardware Contexts
Much like prior proposals, N-IMT assigns a single thread to a hardware context.
Because many programs have short threads [14] and real SMT implementations
are bound to have only a few (e.g., 2-8) contexts, this approach often leads to
insufficient instruction overlap. Larger threads, however, increase both the
likelihood of dependence misspeculation [14] and the number of instructions
discarded per misspeculation, and cause speculative buffer overflow [5]. Instead,
to increase instruction overlap without the unwanted side-effects of large
threads, O-IMT multiplexes the hardware contexts by mapping as many threads
as allowed by the resources in one context (typically 3-6 threads for SPEC2K).
Context multiplexing requires for each context only an additional fetch PC
register and rename table pointer per thread for a given maximum number of
threads per context. Context multiplexing differs from prior proposals for
mapping multiple threads onto a single processing core [12,3] to alleviate load
imbalance, in that multiplexing allows instructions from multiple threads within
a context to execute and share resources simultaneously.

Two design complexities arise due to sharing resources in context multiplexing.
First, conventional active list and LSQ designs assume that instructions enter
these queues in (the predicted) program order. Such an assumption enables the
active list to be a non-searchable (potentially large) structure, and allows
honoring memory dependences via an ordered (associative) search in the LSQ.
If care is not taken, multiplexing would invalidate this assumption if multiple
threads were to place instructions out of program order in the shared active list
and LSQ. Such out-of-order placement would require an associative search on
the active list to determine the correct instruction(s) to be removed upon commit
or misspeculation. In the case of the LSQ, the requirements would be even more
complicated. A memory access would have to search through the LSQ for an
address match among the entries from the accessing thread, and then
(conceptually) repeat the search among entries from the thread preceding the
accessing thread, working towards older threads. Unfortunately, the active list
and LSQ cannot afford these additional design complications, because active
lists are made large and therefore non-searchable by design, and the LSQ's
ordered, associative search is already complex and time-critical.

Second, allowing a single context to have multiple out-of-program-order threads
complicates managing inter-thread dependence. Because two in-program-order
threads may be mapped to different contexts, honoring memory dependences
would require memory accesses to search through multiple contexts, thereby
prohibitively increasing LSQ search time and design complexity. O-IMT avoids
the second design complexity by mapping threads to a context in program
order. Inter-thread and intra-thread dependences within a single context are
treated similarly. Figure 4 (b) shows how in-program-order threads X and X+1
are mapped to a context. In addition to program order within contexts, O-IMT
tracks the global program order among the contexts themselves for precise
interrupts.
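The per-context bookkeeping this requires is small, as the following sketch suggests. The field names are invented; only the two per-thread items (a fetch PC and a rename-table pointer) and the in-program-order mapping rule come from the description above.

    # Illustrative sketch of context multiplexing: one hardware context holds
    # several threads, kept in program order, each needing only a fetch PC
    # and a rename-table pointer beyond the shared context state.
    from dataclasses import dataclass, field

    @dataclass
    class ThreadSlot:
        fetch_pc: int
        rename_table_ptr: int          # points into shared rename storage

    @dataclass
    class HardwareContext:
        max_threads: int = 4           # typically 3-6 threads fit for SPEC2K
        threads: list = field(default_factory=list)   # kept in program order

        def map_thread(self, fetch_pc, rename_ptr):
            # Mapping in program order preserves the assumption that the
            # shared active list and LSQ see instructions in program order.
            if len(self.threads) == self.max_threads:
                return False           # full: wait for the oldest to commit
            self.threads.append(ThreadSlot(fetch_pc, rename_ptr))
            return True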
POWER CONSUMPTION IN MULTI-THREADED PROCESSORS

In power-constrained environments such as mobile devices, power is commonly
reduced through circuit-level techniques such as clock frequency reduction and
low-power circuit design; optimizations can be applied
at the architecture level as well. The other environment where power
consumption is important is that of high performance processors. These
processors are used in environments where energy supply is not typically alimitation. For high performance computing, clock frequencies continue to
increase, causing the power dissipated to reach the thresholds of current
packaging technology. When the maximum power dissipation becomes a critical
design constraint, that architecture which maximizes the performance/power
ratio thereby maximizes performance. Multithreading is a processor
architecture technique which has been shown to provide a significant
performance advantage over conventional architectures which can only follow a
single stream of execution. Simultaneous multithreading (SMT) [15, 14] can
provide up to twice the throughput of a dynamic superscalar single-threaded
processor. Announced architectures that will feature multithreading include the
Compaq Alpha EV8 [5] and the Sun MAJC Java processor [12], joining the
existing Tera MTA supercomputer architecture [1]. We can show that a
multithreading processor is attractive in the context of low-power or power-
constrained devices for many of the same reasons that enable its high
throughput. First, it supplies extra parallelism via multiple threads, allowing the
processor to rely much less heavily on speculation; thus, it wastes fewer
resources on speculative, never-committed instructions. Second, it provides
both higher and more even parallelism over time when running multiple
threads, wasting less power on underutilized execution resources. A
multithreading architecture also allows different design decisions than areavailable in a single threaded processor, such as the ability to impact power
through the thread selection mechanism.
Modelling Power
To describe the power and energy results of the processor, a power
model is needed. This power model is integrated into a detailed cycle-
by-cycle, instruction-level architectural simulator, allowing the model to
accurately capture both useful and non-useful (incorrectly speculated) use of all
processor resources. The power model utilized for this study is an area-based
model. In this power model, the microprocessor is divided into several high-
level units, and the corresponding activity of each unit is used in the
computation of the overall power consumption. The following processor units
are modeled: L1 instruction cache, L1 data cache, L2 unified cache, fetch unit,
integer and floating point instruction queues, branch predictor, instruction TLB,
data TLB, load-store queue, integer and floating-point functional units, register
file, register renamer, completion queue, and return stack. The total processor
energy consumption is the summation of the unit energies, where each unit
energy is computed from the unit's activity, its area, and its circuit makeup.
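A plausible form, assembled from the activity factors, area fractions, and circuit-type energy densities described below, is

    E_unit = Area_unit × Σ_i ( AF_i × w_i × ρ_i )

where AF_i counts the architectural events of type i in the unit, w_i is the fraction of the unit's area those events exercise, and ρ_i is the energy density of the circuit type involved.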
An activity factor is a statistic measuring how many architectural events
a program or workload generates for a particular unit. For a given unit, there
may be multiple activity factors each representing a different action that
can occur within the unit. Activity factors represent high level actions and
therefore are independent of the data values causing a particular unit to become
active. We can compute the overall power consumption of the processor on a
cycle-by-cycle basis by summing the energy usage of all units in the processor.
The entire power model consists of 44 activity factors. Each of these activity
factors is a measurement of the activity of a particular unit or function of the
processor (e.g. number of instruction queue writes, L2 cache reads). The model
does not assume that a given factor or combination of factors exercises a
microprocessor unit in its entirety, but instead is modeled as exercising a
fraction of the area depending on the particular activity factor or combination of
activity factors. The weight of how much of the unit is exercised by a particular
activity is an estimate based on knowledge of the unit's functionality and
assumed implementation. Each unit is also assigned a relative area value which
is an estimate of the unit's size given its parameters. In addition to the overall
area of the unit, each unit is modeled as consisting of a particular ratio of 4
different circuit types: dynamic logic, static logic, memory, and clock
circuitry. Each circuit type is given an energy density. The ratio of circuit types
for a given logic unit are estimates based on engineering experience and
consultation with processor designers. The advantage of an area-based design is
its adaptability. It can model an aggregate of several existing processors, as
long as average area breakdowns can be derived for each of the units. And it
can be adapted for future processors for which no circuit implementations exist,
as long as an estimate of the area expansion can be made. In both cases, we
rely on the somewhat coarse assumption that the general circuit makeup of
these units remains fairly constant across designs. This assumption should be
relatively valid for the types of high-level architectural comparisons made in
this paper, but might not be for more low-level design options. Our power
model is an architecture-level power model,and is not intended to produce
precise estimates of absolute whole-processor power consumption. For
example, it does not include dynamic leakage factors, I/O power (e.g.,pin
drivers), etc. However, the power model is useful for obtaining relative power
numbers for comparison against results obtained from the same power model.
The numbers that are obtained from the power model are independent of clock
frequency (again, because we focus on relative values).
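In code, the per-cycle accounting described above might look like the sketch below. The unit list, areas, fractions, and densities are all invented placeholders; only the structure (relative area × fraction exercised × circuit-type energy density, summed each cycle) follows the model.

    # Illustrative sketch of the area-based power model: each activity factor
    # exercises a fraction of its unit's area at a circuit-type energy
    # density. All numbers are invented; results are relative values only.
    CIRCUIT_DENSITY = {"dynamic": 1.0, "static": 0.6, "memory": 0.8, "clock": 1.2}

    # unit -> (relative area, {activity: (fraction of area exercised, circuit type)})
    UNITS = {
        "icache":  (3.0, {"read":  (0.7, "memory")}),
        "iq":      (1.0, {"write": (0.5, "dynamic"), "wakeup": (0.4, "dynamic")}),
        "regfile": (2.0, {"read":  (0.4, "memory"),  "write":  (0.3, "memory")}),
    }

    def cycle_energy(events):
        """events: {(unit, activity): count this cycle}, from the simulator."""
        total = 0.0
        for (unit, activity), count in events.items():
            area, activities = UNITS[unit]
            fraction, circuit = activities[activity]
            total += count * area * fraction * CIRCUIT_DENSITY[circuit]
        return total   # summed over all cycles, this gives relative energy

    print(cycle_energy({("icache", "read"): 1, ("iq", "write"): 4}))  # 3.68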
Power Consumption of a Multithreaded Architecture
The power-energy characteristics of multithreaded
execution can now be examined. We examine performance (IPC), energy
efficiency (energy per useful instruction executed, E/UI), and power (the
average power utilized during each simulation run) for both single-thread and
multithreaded execution. All results (including IPC) are normalized to a baseline (in each case,
the lowest single-thread value). This is done for two reasons: first, the power
model is not intended to be an indicator of absolute power or energy, so this
allows us to focus on the relative values; second, it allows us to view the
diverse metrics on a single graph. Later results are normalized to other baseline
values. The single-thread results (Figure 5) show diversity in almost all metrics.
Performance and power appear to be positively correlated; the more of the
processor used, the more power used. However, performance and energy per
useful instruction are somewhat negatively correlated; the fewer unused
resources and the fewer wastefully used resources, the more efficient the
processor. Also included in the graph (the Ideal E/UI bar) is the energy
efficiency of each application with perfect branch prediction. This gives an
indication of the energy lost to incorrect speculation (when compared with the
E/UI result). Gcc and go suffer most from low branch prediction
accuracy. Gcc, for example, only commits 39% of fetched instructions and 56%
of executed instructions. In cases such as gcc and go, the energy effects of
mispredicted branches are quite large (35%-40%). Figure 6 shows that
multithreaded execution has significantly improved energy efficiency over
conventional processor execution. Multithreading attacks the two primary
sources of wasted energy/power: unutilized resources and, most importantly,
wasted resources due to incorrect speculation. Multithreaded processors rely far
less on speculation to achieve high throughput than conventional processors.
Figure 5. Relative performance, energy efficiency, and power
results for the six benchmarks run with a single thread.

Figure 6. Relative performance, energy efficiency, and power
results as the number of threads varies.
With multithreading, then, we achieve as much as a 90% increase in
performance (the 4-thread result), while actually decreasing the energydissipation per useful instruction of a given workload. This is achieved via a
relatively modest increase in power. The increase in power occurs because of
the overall increase in utilization and throughput of the processor, that is,
because the processor completes the same work in a shorter period of time.
Figure 7 shows the effect of the reduced misspeculation on the individual
components of the power model. The greatest reductions come in the front of
the pipeline (e.g. fetch), which is always the slowest to recover from a branch
misprediction. The reduction in energy is achieved despite an increase in L2
cache utilization and power. Multithreading, particularly with a multiprogrammed
workload, reduces the locality of accesses to the cache, resulting in more cache
misses; however, this has a bigger impact on L2 power than L1. The L1 caches
see more cache fills, while the L2 sees more accesses per instruction. The
increase in L1 cache fills is also mitigated by the same overall efficiency gains,
as it sees fewer speculative accesses.
Figure 7. Contributors to overall energy efficiency for different numbers of threads.
The power advantage of simultaneous
multithreading can provide a significant gain in both the mobile and high
performance domains. The E/UI values demonstrate that a multithreaded
processor operates more efficiently, which is desirable in a mobile computing
environment. For a high performance processor, the constraint is more likely to
be average power or peak power. SMT can make better use of a given average
power budget, but it really shines in being able to take greater advantage of a
peak power constraint. That is, an SMT processor will achieve sustained
performance that is much closer to the peak performance (power) the processor
was designed for than a single threaded processor.
Power Optimizations
In this section, we examine power optimizations that will either reduce the
overall energy usage of a multithreaded processor or reduce its power
consumption. We examine the following possible power optimizations, each made
practical in some way by multithreading: reduced execution bandwidth, dynamic
power consumption controls, and thread fetch optimizations. Reduced execution
bandwidth exploits the higher execution efficiency of multithreading to achieve
high performance while maintaining moderate power consumption. Dynamic power
consumption controls allow a high-performance processor to gracefully degrade
activity when power consumption thresholds are exceeded. The thread fetch
optimization looks to achieve better processor resource utilization via thread
selection algorithms.
Reduced Execution Bandwidth
Prior studies have shown that an SMT processor has the potential to achieve
double the instruction throughput of a single-threaded processor with similar
execution resources. This implies that a given performance goal can be met
with a less aggressive architecture, possibly consuming less power. Therefore,
a multithreaded processor with a smaller execution bandwidth may achieve
comparable performance to a more aggressive single-threaded processor while
consuming less power. In this section, for example, we compare the power
consumption and energy efficiency of an 8-issue single-threaded processor with
a 4-issue multithreaded processor.
Controlling Power Consumption via Feedback
We strive to reduce the peak power dissipation while still maximizing
performance, a goal likely to face future high-performance architects. In this
section, the processor is given feedback regarding power dissipation in order
to limit its activity. Such a feedback mechanism does indeed provide reduced
average power, but the real attraction of this technique is the ability to
reduce peak power to an arbitrary level. A feedback mechanism that guaranteed
that the actual power consumption did not exceed a threshold could make peak
and achieved power consumption arbitrarily close. Thus, a processor could
still be 8-issue (because that width is useful some fraction of the time), but
might be designed knowing that the peak power corresponded to a sustained
5 IPC, as long as there is some mechanism in the processor that can guarantee
sustained power does not exceed that level. This section models such a
mechanism, applied to an SMT processor, which utilizes the power consumption
of the processor as the feedback input. For this optimization, we would like
to set an average power consumption target and have the processor stay within
a reasonable tolerance of the target while achieving higher performance than
the single-threaded processor could achieve. Our target power, in this case,
is exactly the average power of all the benchmarks running on the
single-threaded processor. This is a somewhat arbitrary threshold, but it
makes the results easier to interpret. The feedback is an estimated value for
the current power the processor is using. An implementation of this mechanism
could use either on-die sensors or a power value computed from performance
counters. The advantage of using performance counters is that the lag time for
the value is much less than that of a physical temperature sensor. For the
experiments listed here, the power value used is the number obtained from the
simulator power model. We used a power sampling interval of 5 cycles (this is
essentially the delay before the feedback information can be used by the
processor). In each of the techniques, we define thresholds whereby, if the
power consumption exceeds a given threshold, fetching or branch prediction
activity is modified to curb power consumption. The threshold values are fixed
throughout
a given simulation. In all cases, fetching is halted for all threads when power
consumption reaches 110% of the desired power value.
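A minimal C sketch of such a feedback loop follows. The graded intermediate threshold and all the names here are our own assumptions, illustrating the mechanism rather than reproducing the simulated one; only the 110% halt point comes from the text above.

    /* Sketch of a power-feedback fetch throttle, sampled every few cycles.
     * Field names and the intermediate step are illustrative assumptions. */
    #include <stdio.h>

    struct power_ctl {
        double target;    /* target average power (single-thread average)  */
        double current;   /* estimate from sensors or performance counters */
    };

    /* How many threads may fetch this interval, out of max_threads.
     * Above 110% of target, fetch halts for all threads, as in the text. */
    static int threads_allowed_to_fetch(const struct power_ctl *pc,
                                        int max_threads)
    {
        double ratio = pc->current / pc->target;

        if (ratio >= 1.10)
            return 0;                  /* halt all fetch                */
        if (ratio >= 1.00)
            return max_threads / 2;    /* scale threads back gradually  */
        return max_threads;            /* under budget: fetch freely    */
    }

    int main(void)
    {
        struct power_ctl pc = { .target = 40.0, .current = 43.0 };
        printf("threads fetching: %d of 4\n",
               threads_allowed_to_fetch(&pc, 4));  /* 43/40 = 1.075 -> 2 */
        return 0;
    }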
Figure 8. The average over all benchmarks of the performance and power of 2 and 4 threads with feedback.
Power feedback control is a technique which enables the designer to lower the
peak power constraints on the processor. This technique is particularly
effective in a multithreading environment for two reasons. First, even with
the drastic step of eliminating speculation, even for fetch, an SMT processor
can still make much better progress than a single-threaded processor. Second,
we have more dimensions on which to scale back execution; the results were
best when we took advantage of the opportunity to scale back threads
incrementally.
Thread Selection
This optimization examines the effect of thread selection algorithms on power
and performance. The ability to select from among multiple threads provides
the opportunity to optimize for power consumption and operating efficiency
when making thread fetch decisions. The optimization we model involves two
threads that are selected each cycle to attempt to fetch instructions from the
I-cache. The heuristic used to select which threads fetch can have a large
impact on performance [14]. This mechanism can also impact power if we use it
to bias against the most speculative threads. This heuristic does not, in
fact, favor low-power threads over high-power threads, because that only
delays the running of the high-power threads. Rather, it favors less
speculative threads over more speculative threads. This works because the
threads that get stalled become less speculative over time (as branches are
resolved in the processor) and quickly become good candidates for fetch. We
modify the ICOUNT thread selection scheme from [14] by adding a branch
confidence metric to thread fetch priority. This could be viewed as a
derivative of pipeline gating [11] applied to a multithreaded processor;
however, we use it to change the fetch decision, not to stop fetching
altogether. The low-conf scheme biases heavily against threads with more
unresolved low-confidence branches, while the all-branches scheme biases
against the threads with more branches overall in the processor (regardless of
confidence). Both use ICOUNT as the secondary metric.
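The C sketch below illustrates the low-conf variant: unresolved low-confidence branches dominate the priority key, with ICOUNT breaking ties. The field names and the weighting constant are our assumptions, not the scheme's published parameters.

    /* Sketch of the modified ICOUNT fetch policy: threads with fewer
     * unresolved low-confidence branches fetch first; ICOUNT is secondary. */
    #include <stdio.h>

    struct thread_state {
        int icount;    /* instructions in front-end pipeline stages      */
        int low_conf;  /* unresolved low-confidence branches in flight   */
    };

    /* Lower key = higher fetch priority. The 1024 weight just keeps the
     * branch count dominant over ICOUNT; it is an assumed constant. */
    static int fetch_key(const struct thread_state *t)
    {
        return t->low_conf * 1024 + t->icount;
    }

    /* Pick the (up to) two highest-priority threads to fetch this cycle. */
    static void select_two(const struct thread_state *ts, int n,
                           int *first, int *second)
    {
        *first = *second = -1;
        for (int i = 0; i < n; i++) {
            if (*first < 0 || fetch_key(&ts[i]) < fetch_key(&ts[*first])) {
                *second = *first;
                *first  = i;
            } else if (*second < 0 ||
                       fetch_key(&ts[i]) < fetch_key(&ts[*second])) {
                *second = i;
            }
        }
    }

    int main(void)
    {
        /* Four hypothetical threads: thread 2 is least speculative. */
        struct thread_state ts[4] = { {12, 3}, {8, 1}, {20, 0}, {5, 2} };
        int a, b;

        select_two(ts, 4, &a, &b);
        printf("fetch from threads %d and %d this cycle\n", a, b);
        return 0;
    }

Note how stalled threads resolve their branches over time, so their keys shrink and they naturally regain fetch priority; this is why the scheme avoids starving the more speculative threads.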
Figure 9. The performance, power, and energy effects of using branch status to direct thread selection.
Figure 9 shows that this mechanism has the potential to improve both raw
performance and energy efficiency at the same time, particularly with the
confidence predictor. For the 4-thread simulations, performance increased by
6.3% and 4.9% for the low-conf and all-branches schemes, respectively. The
efficiency gains should be enough to outweigh the additional power required
for the confidence counters (not modeled), assuming the architecture did not
already need them for other purposes. If not, the technique without the
confidence counters was equally effective at improving power efficiency,
lagging only slightly in performance. This figure shows an improvement even
with two threads, because when two threads are chosen for fetch in a cycle,
one is still chosen to have higher priority and may consume more of the
available fetch bandwidth.
OPERATING SYSTEM CONSIDERATIONS
The operating system essentially manages the Hyper-Threaded processor as if it
were running in a true MP arrangement, with some minor modifications. The
Hyper-Threaded processor can be toggled by the OS between Multi-Task (MT) mode
and two Single-Task (ST) modes, ST0 and ST1.

In cases where a logical processor is idle, the operating system (which runs
at Ring 0 privilege level) may issue a HLT (halt) instruction, which enables
either ST0 or ST1 mode, and the partitioned processor resources can be
recombined and applied to the single active thread. This is in line with
Intel's statement that a single-threaded application should run as fast on a
Hyper-Threaded processor as on a non-Hyper-Threaded processor. An interrupt
sent to the halted logical processor wakes it up and moves the processor back
to MT mode.
In cases where more than one physical Hyper-Threaded processor is installed,
the OS should schedule threads across the physical processors before
scheduling two threads on a single processor. In a dual-processor system the
BIOS might aid in this procedure by assigning processor mappings as follows:

Processor 1 = Physical processor 1, logical processor 1
Processor 2 = Physical processor 2, logical processor 1
Processor 3 = Physical processor 1, logical processor 2
Processor 4 = Physical processor 2, logical processor 2
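The short C sketch below generates exactly this ordering for arbitrary counts, showing why an OS that simply walks the processor list in order spreads threads across physical packages first. The loop structure is our illustration, not BIOS code.

    /* Sketch: enumerate logical CPUs so each physical package receives one
     * thread before any package receives a second. Counts are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        const int physical = 2, logical_per_pkg = 2;
        int n = 0;

        /* Logical index in the outer loop reproduces the mapping above. */
        for (int l = 1; l <= logical_per_pkg; l++)
            for (int p = 1; p <= physical; p++)
                printf("Processor %d = Physical processor %d, "
                       "logical processor %d\n", ++n, p, l);
        return 0;
    }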
Looking at this functionality in MP-capable Windows OSs, Windows 2000
unfortunately has some limitations in the way it counts physical and logical
processors. On a four-HT-CPU-licensed Windows 2000 server, only the first
logical processor of each of the four physical processors is usable; the
second logical processor on each physical processor goes unused. In contrast,
under Windows XP Pro or .NET Server using a dual-HT-CPU configuration, all
four logical processors will be recognized and used. A four-HT-CPU-licensed
.NET server can use 8 logical processors. This is the key reason why Intel
recommends XP in the Windows arena.
Hyper-Threading Implementation in Windows XP
Hyper-Threading (HT) is a feature of the CPU itself, not of the motherboard,
chipset, or any other component, that allows the operating system to see more
than one logical CPU. It is called "logical" because there is only one CPU; it
just appears as two. HT does this by providing a special set of protocols to
the OS which can be executed/queried at boot-up; the OS determines whether HT
is available and then installs itself accordingly. There is a distinct
difference between HT and SMP (Symmetric Multi-Processing, i.e., multiple CPUs
on one main board). SMP is chipset-dependent and requires a coordinated effort
by many components off the CPU itself (mostly for cache coherency). This is
one of the main reasons Microsoft has been complaining lately about
implementing HT in its Windows NT-based operating systems (NT, 2K, and XP).
These versions of Windows use something called a HAL (Hardware Abstraction
Layer), which logically isolates everything a PC can do from the PC itself.
The HAL is specifically designed to accommodate the cumulative capabilities
derived from any and all PCs Microsoft supports (if any PC out there can do
it, then there is a part of the HAL which accommodates its range of features).
The problem with HT is that it does not fit into this existing HAL model.
Since HT is a chip component in and of itself, the basic CPU aspects of the
HAL have to be extended or modified to accommodate it. This is where the
problem lies. It is ironic, actually, as just about the time Microsoft
releases its greatest operating system (Windows XP), the very HAL it is
designed on will have to be modified extensively to accommodate HT.
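For reference, the boot-time query mentioned above is the CPUID instruction. A minimal C sketch that checks the HT feature flag might look like the following; it uses GCC/Clang's <cpuid.h> helper, which postdates the era of this report, so treat it as an illustration rather than OS boot code.

    /* Sketch: detect Hyper-Threading support via CPUID leaf 1. EDX bit 28
     * is the HTT flag; when set, EBX bits 23:16 report the number of
     * logical processors per physical package. x86 with GCC/Clang only. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;                        /* CPUID leaf 1 unsupported */

        if (edx & (1u << 28))
            printf("HT capable: %u logical processors per package\n",
                   (ebx >> 16) & 0xffu);
        else
            printf("No Hyper-Threading support reported\n");
        return 0;
    }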
HT PLATFORM REQUIREMENTS
Intel specifies that a P4 3.06 GHz or higher processor is required to support
Hyper-Threading on the desktop. Even though older P4 desktop processors had HT
technology built in, it was permanently disabled at the fab. All 533 MHz
FSB-capable Intel chipsets will support Hyper-Threading except the A1
(initial) stepping of the 845G. The newer chipsets implement the necessary
APIC (Advanced Programmable Interrupt Controller) updates for Hyper-Threaded
processors, per Intel (to support two logical processors within a physical
processor, we suspect, as each logical processor has its own local APIC).
Other Hyper-Threading requirements: motherboards must meet P4 3.06 GHz power
and thermal specifications and have an updated BIOS and drivers. As of this
writing, Intel recommends only Windows XP (both Pro and Home editions) and
Linux 2.4.x for use in Hyper-Threading-enabled PCs.
CONCLUSION
Intel's Hyper-Threading Technology brings the concept of simultaneous
multi-threading to the Intel Architecture. This is a significant new
technology direction for Intel's processors. It will become increasingly
important for all future applications as it adds a new technique for obtaining
additional performance at lower transistor and power costs. The first
implementation of Hyper-Threading Technology was done on the Intel Xeon
processor MP. In this implementation there were two logical processors on each
physical processor. The logical processors had their own independent
architecture state, but they shared nearly all the physical execution and
hardware resources of the processor. The goal was to implement the technology
at minimum cost while ensuring forward progress on one logical processor even
if the other is stalled, and to deliver full performance even when there is
only one active logical processor. These goals were achieved through efficient
logical processor selection algorithms and the creative partitioning and
recombining algorithms of many key resources. Measured performance on the
Intel Xeon processor MP with Hyper-Threading Technology shows gains of up to
30% on common server application benchmarks for this technology.
The startling growth of the Internet and telecommunications demands
increasingly higher levels of processor performance, but this enhancement
cannot come at the expense of power and cost; therefore, previous methods of
attaining higher processor performance can no longer be afforded. In this
context Intel's Hyper-Threading becomes very important, as it achieves maximum
processor performance using minimum hardware complexity, and thereby much
reduced cost and acceptable power consumption.
REFERENCES
1. T. N. Vijaykumar and G. S. Sohi, "Task Selection for a Multiscalar
Processor," in Proceedings of the 31st Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO-31), December 1998.

2. S. Wallace, B. Calder, and D. M. Tullsen, "Threaded Multiple Path
Execution," in Proceedings of the 25th Annual International Symposium on
Computer Architecture, June 1998.

3. M. Gulati and N. Bagherzadeh, "Performance Study of a Multithreaded
Superscalar Microprocessor," in Proc. Int'l Symp. on High-Performance
Computer Architecture, ACM, 1996.

4. W. Yamamoto and M. Nemirovsky, "Increasing Superscalar Performance
through Multistreaming," in Proc. Int'l Conf. on Parallel Architectures and
Compilation Techniques, IFIP.

5. H. Hirata et al., "An Elementary Processor Architecture with Simultaneous
Instruction Issuing from Multiple Threads," in Proc. Int'l Symp. on Computer
Architecture, ACM, 1992.

6. A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A Processor
Architecture for Multiprocessing," in Proceedings of the 17th Annual
International Symposium on Computer Architecture, May 1990.

7. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porter, and
B. Smith, "The Tera Computer System," in International Conference on
Supercomputing, June 1990.
8. L. A. Barroso et al., "Piranha: A Scalable Architecture Based on
Single-Chip Multiprocessing," in Proceedings of the 27th Annual International
Symposium on Computer Architecture, June 2000.

9. M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and
W. Lee, "The M-Machine Multicomputer," in Proceedings of the 28th Annual
International Symposium on Microarchitecture, November 1995.

10. L. Hammond, B. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor,"
IEEE Computer, September 1997.

11. D. J. C. Johnson, "HP's Mako Processor," Microprocessor Forum,
October 2001.