
The VLIW and EPIC processor architectures

Nicholas FitzRoy-Dale

Document Revision: 1 Date: 2005/07/1 22:31:24

[email protected]
http://www.cse.unsw.edu.au/~disy/

Operating Systems and Distributed Systems Group
School of Computer Science and Engineering

The University of New South Wales
UNSW Sydney 2052, Australia

1 VLIW and EPIC architectures

Modern superscalar processors are complex, power-hungry devices that present an antiquated view of processor architecture to the programmer in the interests of backwards compatibility, and do a lot of work to achieve high performance while maintaining this illusion. The alternative to superscalar is the VLIW architecture, but VLIWs have traditionally been actively backwards-incompatible, with performance highly dependent on the (frequently mediocre) abilities of the compiler.

Neither VLIW nor superscalar is a perfect architecture: each has its own set of trade-offs. This report discusses the relative strengths and weaknesses of the two, focusing on the benefits of VLIW and the closely-related EPIC architecture as used in Intel's Itanium processor family.

An introduction to the motivation behind VLIW is given, VLIW and EPIC are discussed in detail, and then two case studies are presented: the Analog Devices SHARC family of DSPs, demonstrating the VLIW influences present in a modern DSP; and Intel's Itanium processor family, which is to date the only implementation of EPIC.

2 Instruction-level parallelism

A common design goal for general-purpose processors is to maximise throughput, which may be defined broadly as the amount of work performed in a given time.

Average processor throughput is a function of two variables: the average number of clock cycles required to execute an instruction, and the frequency of clock cycles. To increase throughput, then, a designer could increase the clock rate of the architecture, or increase the average instruction-level parallelism (ILP) of the architecture.
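The relationship between the two variables can be made concrete with a small worked example. The figures below are purely illustrative, not measurements of any real processor:

```python
# Throughput (instructions per second) = clock rate / average CPI,
# where CPI is the average number of clock cycles per instruction.
def throughput(clock_hz, avg_cpi):
    """Instructions retired per second."""
    return clock_hz / avg_cpi

base = throughput(1e9, 2.0)          # 1 GHz clock, 2 cycles per instruction
faster_clock = throughput(2e9, 2.0)  # lever 1: double the clock rate
more_ilp = throughput(1e9, 1.0)      # lever 2: halve average CPI via ILP

print(base, faster_clock, more_ilp)  # 500000000.0 1000000000.0 1000000000.0
```

Either lever doubles throughput; the rest of this section is about the second one.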

Modern processor design has focused on executing more instructions in a given number of clock cycles, that is, on increasing ILP. A number of techniques may be used. One technique, pipelining, is particularly popular because it is relatively simple and can be used in conjunction with superscalar and VLIW techniques. All modern CPU architectures are pipelined.

2.1 Pipelining

All instructions are executed in multiple stages. For example, a simple processor may have five stages: first the instruction is fetched from cache, then it is decoded, then it is executed, then any memory referenced by the instruction is loaded or stored (Figure 1), and finally the result of the instruction is written back to registers. The output from one stage serves as the input to the next stage, forming a pipeline of instruction implementation. These stages are frequently independent of each other, so, if separate hardware is used to perform each stage, multiple instructions may be “in flight” at once, with each instruction at a different stage in the pipeline. Ignoring potential problems, the theoretical increase in speed is proportional to the length of the pipeline: a longer pipeline means more simultaneous in-flight instructions and therefore fewer average cycles per instruction.
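The ideal-case arithmetic can be sketched in a few lines. With a pipeline of depth D and no hazards, N instructions finish in D + N − 1 cycles rather than D × N, so speedup approaches the pipeline depth for long instruction streams:

```python
# Toy model of an ideal five-stage pipeline (no hazards, no stalls).
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def cycles_unpipelined(n_instructions, depth=len(STAGES)):
    # Each instruction runs all stages before the next one starts.
    return depth * n_instructions

def cycles_pipelined(n_instructions, depth=len(STAGES)):
    # First instruction fills the pipe (depth cycles); after that,
    # one instruction completes per cycle.
    return depth + n_instructions - 1

n = 1000
print(cycles_unpipelined(n))  # 5000
print(cycles_pipelined(n))    # 1004
```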

The major potential problem with pipelining is the potential for hazards. A hazard occurs when an instruction in the pipeline cannot be executed. Hennessy and Patterson identify three types of hazards: structural hazards, where there simply isn't sufficient hardware to execute all parallelisable instructions at once; data hazards, where an instruction depends on the result of a previous instruction; and control hazards, which arise from instructions which change the program counter (i.e., branch instructions). Various techniques exist for managing hazards. The simplest of these is simply to stall the pipeline until the instruction causing the hazard has completed.
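A data hazard and the naive stall policy can be illustrated with a minimal sketch. The three-address program representation and the two-cycle result latency below are invented for illustration, not taken from any real ISA:

```python
# Count the stall cycles the "stall until ready" policy inserts when an
# instruction reads a register written by the immediately preceding one
# (a read-after-write dependency).
def stalls(program, latency=2):
    """program: list of (dest, src1, src2) tuples. Returns stall cycles."""
    total = 0
    for prev, curr in zip(program, program[1:]):
        dest = prev[0]
        if dest in curr[1:]:   # current instruction reads the pending result
            total += latency
    return total

prog = [("r1", "r2", "r3"),  # r1 = r2 op r3
        ("r4", "r1", "r5"),  # reads r1 before it is ready -> stall
        ("r6", "r7", "r8")]  # independent, no stall
print(stalls(prog))  # 2
```

Real hardware avoids many of these stalls with forwarding paths; this sketch shows only the baseline cost.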

2.1.1 Designing for pipelining

Despite pipelining being an almost-universal practice, some architectures are more amenable to pipelining than others. Architectures that are designed for pipelining tend to require all instructions to take


Figure 1: The 5-stage MIPS pipeline

roughly the same amount of time to complete, so that the pipeline can operate at the same speed regardless of what is being executed. In order to keep this speed high, instructions tend to be simple. A small set of simple, quick-to-execute instructions is the hallmark of RISC (reduced instruction set computing) architectures. It is no surprise, then, that RISCs tend to be well-suited to pipelining.

Nowadays, the RISC/non-RISC distinction is somewhat outdated. The key metric is performance, and an architecture can be fast without being RISC. Similarly, it is possible to apply traditional RISC techniques such as pipelining to non-RISC instruction sets such as Intel's x86 instruction set. This instruction set is not well-suited to pipelining, for two reasons. Firstly, it contains a number of instructions that take a long time to execute. For example, all x86 compatibles support instructions to work with binary-coded decimal (BCD) numbers. Secondly, x86 supports so-called complex addressing modes. For example, it is possible to move data from one location in main memory to another location in main memory, without requiring the data to go through a CPU register as an intermediate stage. Such instructions are difficult to pipeline, because memory accesses are slow, and data should ideally be present before it is required.

To deal with these sorts of obstacles to pipelining, the pipelined processor translates CISC instructions into an internal instruction set before pipelining the instructions. This translation phase adds to the complexity of the architecture.

2.2 Superscalar

Usually, the execution phase of the pipeline takes the longest. On modern hardware, the execution of the instruction may be performed by one of a number of functional units. For example, integer instructions may be executed by the ALU, whereas floating-point operations are performed by the FPU. On a traditional, scalar pipelined architecture, either one or the other of these units will always be idle, depending on the instruction being executed. On a superscalar architecture, instructions may be executed in parallel on multiple functional units. The pipeline is essentially split after instruction issue.

Executing multiple instructions simultaneously brings several problems. The first problem is that the possibility of hazards is increased, because more instructions are in flight at once. Secondly, instructions must be retired (executed, and their results written back) in the correct order, to correctly follow the semantics of a scalar machine. On a superscalar machine, where the instruction stream may be re-ordered on the fly by the hardware (so-called dynamic scheduling), the processor must still retire instructions in



Figure 2: Superscalar architecture

program order.

2.3 The branch problem

Typical program code contains a branch every six or seven instructions. For a pipelined superscalar architecture, this represents a potentially significant performance problem. It would be helpful if the processor knew ahead of time whether a branch would be taken, so that the correct instructions could be fetched and pipelined. Obviously this is not possible, but it is possible to predict the target of a branch with better-than-chance probability by storing the previous results of the branch and using them to predict the outcome of future executions of that branch.
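One classic form of this history-based prediction is the two-bit saturating counter; the sketch below models the state machine, not any particular CPU's predictor. States 0–1 predict not-taken and 2–3 predict taken, and each actual outcome nudges the counter one step:

```python
# Two-bit saturating counter branch predictor.
class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start at "weakly taken"

    def predict(self):
        return self.counter >= 2  # True means "predict taken"

    def update(self, taken):
        # Saturating increment/decrement toward the observed outcome.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

# A loop branch: taken 8 times, not taken once (loop exit), taken 8 more.
p = TwoBitPredictor()
history = [True] * 8 + [False] + [True] * 8
correct = sum(p.predict() == outcome or p.update(outcome) for outcome in [])
correct = 0
for outcome in history:
    correct += (p.predict() == outcome)
    p.update(outcome)
print(correct, len(history))  # 16 17
```

The two-bit hysteresis is what lets the predictor mispredict only once per loop exit instead of twice.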

Such branch prediction hardware helps to reduce the cost, in terms of instructions unnecessarily pipelined, of branches. However, the hardware must now deal with the consequences of incorrectly-predicted branches when “wrong” instructions were speculatively executed. If these speculated instructions were permitted to modify hardware registers or memory, the program would be in an inconsistent state. The simple solution, employed by superscalar Pentiums, is simply not to allow speculatively-executed instructions to modify real state until the branch target is known. However, for better performance it is necessary to implement a reorder buffer, to store the results of instructions executed speculatively (and out of order). Information stored in the reorder buffer is only used to update real state when the corresponding instructions are known to be correct.

Figure 2 shows a typical superscalar architecture. The superscalar's complexity is evident in the instruction decoder and the reorder buffer.



Figure 3: VLIW architecture

3 VLIW

All this additional hardware is complex, and contributes to the transistor count of the processor. All other things being equal, more transistors means more power consumption, more heat, and less on-die space for cache.

Thus it seems beneficial to expose more of the architecture's parallelism to the programmer. This way, not only is the architecture simplified, but programmers have more control over the hardware, and can take better advantage of it.

VLIW is an architecture designed to help software designers extract more parallelism from their software than would be possible using a traditional RISC design. It is an alternative to the better-known superscalar architectures. VLIW is a lot simpler than superscalar designs, but has not so far been commercially successful.

Figure 3 shows a typical VLIW architecture. Note the simplified instruction decode and dispatch logic, and the lack of a reorder buffer.

3.1 ILP in VLIW

VLIW and superscalar approach the ILP problem differently. The key difference between the two is where instruction scheduling is performed: in a superscalar architecture, scheduling is performed in hardware (and is called dynamic scheduling, because the schedule of a given piece of code may differ depending on the code path followed), whereas in a VLIW, scheduling is performed in software (static scheduling, because the schedule is “built in to the binary” by the compiler or assembly-language programmer).

A traditional VLIW architecture divides the instruction word into separate regions, each of which corresponds to a dedicated processor unit. For example, a VLIW architecture may contain two integer


Figure 4: Instruction word for the Multiflow Trace 7 series

ALUs, two floating-point units, a memory access unit, and a branch unit. The instruction word may then be divided into an integer portion, a floating-point portion, a memory load/store portion, and a branch portion. Each portion of the instruction word constitutes a “mini-instruction” for the processing unit to which it refers. All mini-instructions are implicitly parallel, which gives the processor greater flexibility in scheduling the instructions among available execution units. Figure 4 shows the 256-bit instruction word of an early VLIW, the Multiflow Trace 7 series. This machine supported seven operations per instruction word.
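The one-field-per-unit layout can be sketched as follows. The field names and NOP encoding are invented for illustration; real machines such as the Trace used different encodings:

```python
# A VLIW instruction word as a fixed set of fields, one per functional
# unit. Any field the compiler cannot fill holds an explicit NOP.
NOP = 0
FIELDS = ["int0", "int1", "float0", "float1", "mem", "branch"]

def pack(ops):
    """ops: dict mapping field name -> opcode. Missing fields become NOPs."""
    return [ops.get(f, NOP) for f in FIELDS]

# A word using only one integer unit and the memory unit:
word = pack({"int0": 0x12, "mem": 0x34})
print(word)             # [18, 0, 0, 0, 52, 0]
print(word.count(NOP))  # 4 of the 6 slots are wasted on NOPs
```

The NOP count in the example previews the code-size problem discussed next.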

Making the architecture this explicit provides several advantages in terms of performance and reduced die size. The job of arranging code so that the processor is best utilised is left, to a large extent, to the compiler. Thus VLIW architectures can execute code strictly in order, without requiring scheduling hardware or reorder buffers. This makes for – theoretically at least – a simpler, less power-hungry chip. The downside is that writing a good compiler for a VLIW is much more difficult than for a superscalar architecture, and the difference between a good compiler and a bad one is far more noticeable.

Another problem with traditional VLIW is code size. Often it is simply not possible to completely utilise all processor execution units at once. Thus many instructions contain no-ops in portions of the instruction word, with a corresponding increase in the size of the object code. Increased code size has obvious implications for the efficacy of caches and bus bandwidth. Modern VLIWs deal with this problem in different ways. One simple method is to offer several instruction templates, and allow the compiler to pick the most appropriate one – in this case, the one that utilises the most bits of the instruction word. Another is to apply traditional data-compression techniques to the code.

Because a VLIW exposes more information about the processor's architecture to the programmer than a superscalar does, VLIW instruction sets are architecture-specific to a significant degree. In a superscalar implementation (presenting the illusion of a scalar architecture to the programmer), hardware designers are free to add, for example, additional ALUs, increasing parallelism without affecting existing programs. In a VLIW, the obvious solution is to increase the length of the instruction word; this obvious solution has similarly-obvious compatibility problems. Another alternative is to add more instructions to the instruction word, making use of the extra processing unit without causing compatibility problems. This is the approach taken by the Analog Devices SHARC family. Yet another alternative is to remove the implicit parallelism implied by the end of the instruction word, and instead make the limits of parallelism explicit. This is the basis of the EPIC architecture, present in the Intel Itanium.

3.2 Interlocking

Another architectural feature present in some RISC and VLIW architectures, but never in superscalars, is the lack of interlocks. In a pipelined processor, it is important to ensure that a stall somewhere in the pipeline won't result in the machine performing incorrectly. This could happen if later stages of the pipeline do not detect the stall, and thus proceed as if the stalled stage had completed. To prevent this, most architectures incorporate interlocks on the pipeline stages. Removing interlocks from the architecture is beneficial, because they complicate the design and can take time to set up, lowering the overall clock rate. However, doing so means that the compiler (or assembly-language programmer) must know details

6 3 VLIW

about the timing of pipeline stages for each instruction in the processor, and insert NOPs into the code to ensure correctness. This makes code incredibly hardware-specific. Both the architectures studied in detail below are fully interlocked, though Sun's ill-fated MAJC architecture was not, and relied on fast, universal JIT compilation to solve the hardware problems.¹

3.3 Code generation

The realisation behind VLIW is that the compiler (or, occasionally, the assembly-language programmer) has more opportunity than the processor to exploit software parallelism, because it has better knowledge of the code. The processor's lack of global knowledge means that the scheduling it performs (dynamic scheduling) must be conservative to ensure safety. Thus compilers are in a position to perform more and better optimisations than the hardware could. Unfortunately compiler writers have been slow to fully exploit VLIW, resulting in a number of lacklustre implementations. This section discusses a number of optimisations that compilers could perform for VLIW that are traditionally performed by hardware in superscalar implementations.

3.3.1 Loop parallelism

An important compiler technique, perhaps the most important compiler technique, for exploiting ILP is loop parallelism. This simply refers to finding parallelisable loops and generating parallel code. Determining whether a loop is parallelisable for a given system may become arbitrarily complex, because before it can parallelise the instructions the compiler must ensure that there are no data dependencies between them. For example, consider the simple case of a loop copying one array to another. Ensuring that the two arrays do not overlap could involve dataflow analysis along many code paths. However, the compiler is certainly better-equipped to do this job than the scheduling hardware in a superscalar processor, because more information, such as array bounds, is available. Also, compilers can spend significantly more time performing the analysis than a processor can.
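The array-copy case reduces to a disjointness test. The sketch below phrases as a runtime check what a compiler would try to prove statically from array bounds; the region representation is an assumption for illustration:

```python
# An array copy of n elements from src_start to dst_start is
# parallelisable only if the two regions are disjoint.
def regions_overlap(dst_start, src_start, n):
    """True if [dst_start, dst_start+n) and [src_start, src_start+n) overlap."""
    return abs(dst_start - src_start) < n

# Copying a[0:100] to a[200:300]: disjoint, so iterations are independent.
print(regions_overlap(200, 0, 100))  # False -> safe to parallelise
# Copying a[0:100] to a[50:150]: overlapping, so iteration order matters.
print(regions_overlap(50, 0, 100))   # True  -> must run sequentially
```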

3.3.2 Branch speculation support

Branching can reduce throughput in a VLIW for the same reasons as in a superscalar architecture. VLIWs reduce the cost in two ways. First, instructions may be predicated – i.e., a section of the instruction word is devoted to a conditional test, and the instruction is only executed if the condition is true. Handling a single predicated instruction is far cheaper than handling a branch, because the program counter does not change in unpredictable ways. On multiple-issue architectures, predication may allow, for example, mutually-exclusive branches of code to execute simultaneously (such as both the if clause and the else clause of an if statement), with only the correct branch actually having its results committed: the incorrect branch effectively becomes a sequence of no-ops. Obviously there is a point where the benefit of avoiding branches is outweighed by the performance lost by discarding incorrect predicated instructions, but for short sequences predicates can dramatically improve performance.
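The if-conversion idea can be modelled in miniature. This is a software sketch of the hardware behaviour, not any real ISA's semantics:

```python
# Predicated execution: both arms of an if/else issue unconditionally,
# but only the arm whose predicate is true commits its result.
def predicated_select(pred, then_value, else_value):
    # Both values are already computed (as both arms would be on a
    # multiple-issue machine); the predicate selects which one commits.
    results = [(pred, then_value), (not pred, else_value)]
    committed = [value for p, value in results if p]
    return committed[0]

print(predicated_select(True, "then-arm", "else-arm"))   # then-arm
print(predicated_select(False, "then-arm", "else-arm"))  # else-arm
```

The control dependency (which arm to jump to) has become a data dependency (which value to keep), which is exactly the trade the next paragraph describes.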

Note that the advantages of predication aren't limited to VLIW processors. Superscalars also stand to benefit. The reason is that by turning a control dependency (a branch) into a data dependency (a condition check), predication moves the decision as to what to do with the instruction from near the start of the pipeline (in instruction decode and execution) to the end (in writeback).

In addition to predication, compilers may implement speculation. That is, the compiler may use heuristics to determine the likely outcome of a branch. If the architecture supports it, the compiler may speculatively load, and even speculatively execute, instructions at the likely branch target. The problem of register renaming encountered by superscalar architectures can be solved simply by providing more registers, so the compiler is not as constrained and need not re-use names.

¹The problems associated with removing interlocks are not unique to VLIW. The most famous non-interlocked RISC architecture, MIPS, addressed the hardware-specificity problem by mandating that each instruction take exactly one clock cycle.


If the hardware supports software speculation, it needs to provide mechanisms to ensure that speculated instructions that raise an exception do not affect the state of the machine until they become non-speculative. VLIWs deal with these problems in various ways. The simplest method is never to raise exceptions for speculated instructions, though other techniques exist, such as deferring the exceptions.

3.3.3 Code libraries

An alternative to smart compilers is to utilise a collection of highly-optimised code libraries written in assembly language by platform experts. These libraries are common on superscalars to support special-purpose SIMD instruction sets such as AltiVec on the Power family.

3.4 History

VLIW is not a new architecture. In fact, it was originally more popular than superscalar designs (which arose to alleviate backwards-compatibility problems). One of the first VLIW machines was the AP-120B, described by Charlesworth in 1981. The mid 80s saw several attempts to introduce VLIW processors, notably the Multiflow Trace and the Cydrome Cydra 5. The compiler for the Multiflow Trace was the first to employ trace scheduling, as invented by Multiflow's founder, Joseph Fisher. The Cydra 5 included hardware support for software pipelining very similar to that now present on the Itanium. Specifically, it supported an iteration frame pointer which could be used as an offset into its register data file. The similarity is not a coincidence: Cydrome's chief architect, Bob Rau, was later involved in the development of Itanium.

The Trace and the Cydra 5 both had very large instruction words: 256 bits for both machines, with later versions of the Trace supporting even larger words (the earliest Trace machines supported 7 instructions per 256-bit word; later models supported 14 or even 28 instructions per word). The machines dealt with the corresponding bandwidth and storage problems in different ways: the Trace supported instruction compression, and the Cydra 5 supported a sequential mode, where each of the 7 opcodes contained within its instruction word was executed one at a time. Interestingly, the Multiflow and the Cydra 5 did not contain caches of any description.

These products failed commercially; the prevailing opinion attributes this to their makers' position as start-ups. That is, several small technical mistakes combined with the difficulty of selling expensive machines from a poorly-established company, rather than any major architectural deficiencies, caused their downfall.

The importance of efficient compilers also contributed to poor early acceptance of VLIW architectures. For example, Intel's i860 RISC processor, introduced in the early 1990s, had a simple VLIW mode where each instruction consisted of an integer portion and a floating-point portion. Compilers for the i860 were expected to carefully order instructions in order to keep pipelines filled, but unfortunately were not of sufficient quality to produce good code for the chip.

4 Case study: Analog Devices SHARC ADSP-2136x family

The SHARC family of DSPs is aimed at real-time audio and visual applications. Because embedded devices are generally not user-programmable, and run only a very limited set of applications, there is less of a requirement to maintain a legacy instruction set, a condition sometimes referred to as low instruction-set inertia. Like many DSP manufacturers, Analog Devices has taken advantage of low instruction-set inertia to create its own instruction set for the SHARC, using VLIW techniques to allow programmers to get the most from the design while keeping the architecture simple.

The SHARC contains two separate ALUs, named PEx and PEy, each with its own register file. These processing elements are not separately accessible, but work together when the chip is placed in SIMD mode for increased throughput.

The focus on data throughput extends throughout the architecture. Address generation is separate from the processing units, allowing addresses to be generated in parallel with data processing. The chip also


uses a super Harvard architecture, making use of multiple instruction, data, and I/O buses. Almost all registers are general-purpose, and the type of the register (fixed-point or floating-point) is determined by bits set in the instruction (not all operations are permitted for all number formats).

4.1 Instruction format

The SHARC uses a number of instruction groups. Each group contains a related family of instructions, such as ALU-related instructions (Groups I and II) and memory access (Group III). There is some commonality between bits allocated in the instruction word inside groups, but allocation is not wholly regular.

Despite containing two ALUs, SHARC opcodes do not permit independent addressing of these functional units. Instead, they work in tandem in SIMD mode. This is perhaps because the ADSP-2136x is an evolutionary advance on previous processors in the SHARC family, which contained only a single ALU. In the default (SISD) mode, the secondary ALU is disabled. SIMD mode may be enabled by setting a bit in the CPU status register MODE1.

Many instructions include a 23-bit section called the compute field. This “mini-instruction” is essentially the part of the instruction directing an ALU, and thus supports its own set of opcodes for all arithmetic operations. As mentioned above, in SIMD mode, one compute field controls both ALUs; the source and destination of data for PEy is defined by the architecture as an offset from the source and destination of PEx. Many instructions which include a compute field also support predicated execution.

4.2 Instruction-level parallelism in the SHARC family

Despite the lack of independently-addressable ALUs, the SHARC is capable of several forms of parallelism, mostly concerned with the efficient transformation of large amounts of data. Generally, compute operations may be combined with predicates, data movement to/from memory, and register manipulations such as shifts and transfers.

SHARC does not support speculation, but has a relatively short pipeline consisting of five stages, so the cost of a branch is low.

5 Case study: Intel Itanium

Itanium is based around the explicitly parallel instruction computing (EPIC) architecture, a fairly recent architecture that emerged, circa 1997, from Hewlett-Packard's PlayDoh research architecture. The EPIC architecture is based on VLIW, but was designed to overcome the key limitations of VLIW (in particular, hardware dependence) while simultaneously giving more flexibility to compiler writers. So far the only implementation of this architecture is as part of the IA-64 processor architecture in the Itanium family of processors.

5.1 Instruction bundles

The major problem addressed by EPIC is hardware dependence. VLIW is designed around the concept that the limits of a processor's parallelism are expressed by a single instruction word. Thus, processors capable of a greater degree of parallelism require a different instruction set. EPIC's solution to this problem is to define several reasonably-abstract categories of mini-instructions, such as ALU operations, floating-point operations, and branches. Mini-instructions are combined in groups of three into a bundle. In addition to three 41-bit mini-instructions, bundles contain a 5-bit template type, for a total bundle size of 128 bits. Figure 5 shows the general format of an EPIC bundle.
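The arithmetic (3 × 41 + 5 = 128) can be checked with a bit-packing sketch. The field order used below (template in the low bits, then the three slots) matches common descriptions of IA-64, but treat it as an assumption for illustration rather than a definitive encoding:

```python
# Pack a 5-bit template and three 41-bit instruction slots into a
# 128-bit bundle.
def make_bundle(template, slot0, slot1, slot2):
    assert template < (1 << 5)
    for s in (slot0, slot1, slot2):
        assert s < (1 << 41)
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)

bundle = make_bundle(0b00001, (1 << 41) - 1, 0, 0x2)
print(bundle.bit_length() <= 128)  # True: the bundle fits in 128 bits
print(bundle & 0x1F)               # 1: the template is recovered from the low bits
```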



Figure 5: An EPIC bundle

Template 0: MII
Template 1: MII*
Template 2: MI*I
Template 3: MI*I*

Figure 6: The first four EPIC bundle templates. Stops are indicated by asterisks.

5.2 Instruction-level parallelism in EPIC

Crucially, the length of a bundle does not define the limits of parallelism; the bundle template type indicates, by the presence or absence of a stop, whether instructions following the bundle can execute in parallel with instructions in the bundle. The claim of EPIC is that as the processor family evolves, processors with greater support for parallelism will simply issue more bundles simultaneously.

Figure 6 illustrates the first four EPIC bundle templates, of the 32 available. Note that the first four templates all contain memory (M) and integer ALU (I) mini-instructions; higher-numbered templates contain different types of mini-instructions. Template 0 contains one memory and two integer instructions and no stops, meaning that a sequence of template-0 bundles will be executed in parallel as far as the hardware is capable, whereas template 1 contains a stop after the second integer instruction but is otherwise identical. The hardware ensures that all instructions before the stop have been retired before executing instructions after the stop.
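The effect of stops, then, is to partition the instruction stream into issue groups: everything between two stops may execute in parallel. A sketch, using an invented list representation in which "|" marks a stop:

```python
# Split a stream of bundles into issue groups at each stop ("|").
def issue_groups(bundles):
    groups, current = [], []
    for bundle in bundles:
        for item in bundle:
            if item == "|":          # stop: close the current issue group
                groups.append(current)
                current = []
            else:
                current.append(item)
    if current:
        groups.append(current)
    return groups

# Two template-1-style bundles (MII with a trailing stop):
code = [["M", "I", "I", "|"], ["M", "I", "I", "|"]]
print(issue_groups(code))  # [['M', 'I', 'I'], ['M', 'I', 'I']]
```

A wider future implementation could execute each group's instructions in parallel without any change to the code, which is the portability argument EPIC makes.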

To put it another way, compilers simply target a theoretical processor with support for an infinite amount of parallelism (or, at least, a register-limited amount of parallelism), and the implementation performs as much as it can. For example, all current Itaniums issue two bundles at a time through a technique known as dispersal. The first part of a bundle is issued, and then the bundle is logically shifted so that the next part of the bundle is available for execution. If a mini-instruction cannot be executed, split issue occurs: the bundle continues to occupy its bundle slot, and another bundle is loaded to occupy the next slot. Since some instructions from the bundle have been executed, leaving them in the bundle slot reduces parallelism. EPIC trades this performance decrease against the relatively simple hardware required to implement dispersal.

Itanium relies heavily on predication. All Itanium instructions are predicated, reducing the cost of a branch significantly if it can be rewritten as predicated instructions. Itanium has 64 one-bit predicate registers, meaning that 6 bits of every mini-instruction are devoted to specifying a predicate register.

Multiple predicated streams can execute in parallel, but only those whose predicate is true are retired. The reasoning behind predication is that even though modern branch predictors are very efficient, the cost of misprediction is still high because branches occur so frequently.

Itanium does not perform automatic speculation in hardware. This decision is definitely in the spirit of VLIW: the cost of speculation is moved to program development time, where there are more resources to deal with it. The architecture supports speculative loads, where a given load instruction may be moved earlier in the instruction stream, further away from where its result is required. The advantage of this is that it hides memory latency. The potential problem is that the load may fail, triggering an exception.

Speculative load exceptions are handled using poison bits, an extra bit on every register that is set if the result of a speculative load triggered an exception. The bit, named NaT (“not a thing”) for integer registers and NaTVal (“not a thing value”) for floating-point registers, may be copied to other registers through store operations, but any other attempt to make use of it results in an exception.
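The deferred-exception mechanism can be modelled in a few lines. This is a sketch of the idea only; IA-64's exact propagation rules differ in detail:

```python
# Poison-bit model: a faulting speculative load sets NaT instead of
# raising; the poison propagates through copies and only faults when
# the value is actually consumed.
class Reg:
    def __init__(self, value=0, nat=False):
        self.value, self.nat = value, nat

def speculative_load(address_valid, value):
    # A failing speculative load defers its exception via the NaT bit.
    return Reg(value, nat=not address_valid)

def use(reg):
    # Consuming a poisoned value raises the deferred exception.
    if reg.nat:
        raise RuntimeError("deferred exception: consumed a NaT value")
    return reg.value

r1 = speculative_load(address_valid=False, value=0)
r2 = Reg(r1.value, r1.nat)   # copying propagates the poison bit
try:
    use(r2)
except RuntimeError:
    print("faulted only on use")  # faulted only on use
```

The payoff is that a load hoisted above a branch costs nothing if the branch later goes the other way and the poisoned value is never consumed.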

If all else fails, and Itanium is forced to take an unexpected branch, the cost is reduced by its


relatively short pipeline: 8 stages in Itanium 2, compared with 30 for later revisions of the Pentium 4.

5.3 Problems with EPIC

Despite the advantages of EPIC over VLIW, IA-64 does not solve all of VLIW's problems. The foremost problem is program size: it is not always possible to completely fill all slots in a bundle, and empty slots are filled with NOPs (IA-64 does not perform compression beyond that offered by bundle templates). As discussed above, increases in code size negatively impact cache performance and result in more bus traffic. Itanium compensates for this by using large, fast caches on-die. Cache is relatively easy to add to Itanium, because the lack of hardware dedicated to speculation, dynamic scheduling and the like results in a small core size. However, cache increases die size and power consumption – though cache consumes far less power than core logic.

Another problem common to VLIWs in general is the importance of compiler optimisation. Poor compiler support can significantly impact the performance of EPIC code. Historically this has been a problem for Itanium, but should improve in the future as compiler support improves.

6 Conclusions

Despite its history, VLIW has yet to see significant commercial success in general-purpose computers. One reason for this is backwards-compatibility issues, which newer architectures, such as EPIC, are starting to address. Another potential problem facing VLIWs is the widening gap between CPU performance and memory bandwidth. Perhaps updates to the EPIC architecture, or some future VLIW-based architecture, will see optional support for some form of instruction compression, to reduce Itanium's reliance on caching for performance.