
Page 1: Compressed Instruction Cache

Prepared By:
Nicholas Meloche, David Lautenschlager, and Prashanth Janardanan
Team Lugnuts

Page 2: Introduction

We want to prove that a processor’s instruction code can be compressed after compilation and decompressed in real time during the processor’s fetch cycle.

Encoding is performed by a software encoder; decoding is performed by a hardware decoder.

Page 3: Introduction

[Diagram: software side – a compiler and assembler produce an executable, which the encoder compresses; hardware side – memory, cache, and processor, with the decoder in the fetch path.]

The encoder processes the machine code and compresses it. It also inserts a small set of instructions to tell the decoder how to decode.

At run time, the decoder decompresses the machine code, and the processor receives the original instructions.

Page 4: Motivation

Previous work has focused on either encoding instructions [1], decoding instructions [2], or both, but without implementation [3].

[1] Cool Code for Hot RISC – Hampton and Zhang
[2] Instruction Cache Compression for Embedded Systems – Jin and Chen
[3] A Compression/Decompression Scheme for Embedded Systems – Nikolova, Chouliaras, and Nunez-Yanez

Page 5: Motivation

Fit more memory into cache at a time to decrease the likelihood of memory misses during the fetch cycle.

[Animation: program instructions are loaded into the instruction cache until it is full ("CACHE FULL!"), then the fetch begins. Let’s remember this amount: the amount not stored in cache.]

Page 6: Motivation

[Animation: now try the same load with encoded files.]

Page 7: Motivation

[Animation: the encoded instructions are loaded into the cache until it is full.]

Page 8: Motivation

[Animation: more instructions fit in the cache this time because they were encoded!]

Page 9: Motivation

More code fits in cache = fewer cache misses.

Fewer cache misses = faster average fetch time.

This is useful for time-critical systems such as real-time embedded systems.

Page 10: Hardware Design Decisions

We used a VHDL model of the LEON2 processor, provided under the GNU license.

The decoder was implemented in VHDL to easily integrate it with the LEON2 processor.

Page 11: Decoder Implementation

The Decoder has three modes:
- No_Decode – each 32-bit fetch from memory is passed to the Instruction Fetch logic unchanged.
- Algorithm_Load – the header block on code in memory is processed to load the decode algorithm for the code that follows.
- Decode – memory is decoded, and the reconstructed 32-bit instructions are passed to the Instruction Fetch logic.
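A minimal software model of these three modes is sketched below. The names (DecoderModel, table) are hypothetical and the codes are simplified to a fixed 8 bits per 16-bit decode; the real decoder is VHDL, and its variable-length realignment is described on the next pages.

```cpp
// Minimal software model of the decoder's three modes (hypothetical names;
// the real decoder is VHDL). Assumes simplified fixed 8-bit codes: each
// 8-bit code maps to a 16-bit decode, so two codes rebuild one instruction.
#include <cstdint>
#include <optional>
#include <unordered_map>

enum class Mode { No_Decode, Algorithm_Load, Decode };

struct DecoderModel {
    Mode mode = Mode::No_Decode;
    std::unordered_map<uint8_t, uint16_t> table;  // decode table (see CAM pages)

    // One 32-bit word arrives from memory per call; returns the word to hand
    // to the instruction-fetch logic, or nothing while loading the algorithm.
    std::optional<uint32_t> step(uint32_t word) {
        switch (mode) {
        case Mode::No_Decode:
            return word;  // pass each 32-bit fetch through unchanged
        case Mode::Algorithm_Load:
            // Header block: install one (code -> decode) entry per word.
            table[uint8_t(word >> 16)] = uint16_t(word & 0xFFFF);
            return std::nullopt;
        case Mode::Decode: {
            // Two 8-bit codes reconstruct the two 16-bit halves of one instruction.
            uint16_t hi = table.at(uint8_t(word >> 24));
            uint16_t lo = table.at(uint8_t((word >> 16) & 0xFF));
            return (uint32_t(hi) << 16) | lo;
        }
        }
        return std::nullopt;
    }
};
```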

Page 12: Decoder Implementation

A variable shifter provides the required realignment.

Two lookup-and-shift operations are performed each clock cycle to produce one 32-bit result per cycle.

The Decoder contains input buffering to sustain one instruction output per clock cycle unless there is a sustained run of uncompressible instructions in the input.

Page 13: CAM Sample Path

[Diagram: incoming data is registered and muxed into two parallel 16-bit paths; each path has shift-16 logic and a 128 x 20 RAM with a TCAM, followed by shift logic and PC increment logic. Outputs: the decoded instruction and the PC increment.]

Page 14: Decoder Implementation

The core of the Decoder is a CAM (Content Addressable Memory):
- 8 bits of the incoming code are used to address the CAM.
- The CAM returns a corresponding 16-bit decode.
- The CAM also returns the shift required to left-align the next encoded instruction.
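A software sketch of this lookup path follows, under two assumptions: a direct-indexed 256-entry table stands in for the 128 x 20 RAM + TCAM (a code shorter than 8 bits would occupy every entry sharing its prefix, the matching the TCAM does natively), and escape/uncompressed codes are omitted for brevity.

```cpp
// Software sketch of the CAM sample path: 8 bits of incoming code address
// the table; the entry supplies the 16-bit decode plus the shift that
// left-aligns the next encoded instruction. Layout is hypothetical.
#include <array>
#include <cstddef>
#include <cstdint>

struct CamEntry {
    uint16_t decode;  // the reconstructed 16 bits
    uint8_t  shift;   // code length in bits (1..8); left-aligns the next code
};

// Direct-indexed 256-entry stand-in for the 128 x 20 RAM + TCAM.
using CamTable = std::array<CamEntry, 256>;

struct BitStream {
    const uint8_t* data;  // encoded bytes (assumed zero-padded at the end)
    size_t bitPos = 0;
    // Peek the next 8 bits of the stream, left-aligned, without consuming them.
    uint8_t peek8() const {
        size_t byte = bitPos / 8, off = bitPos % 8;
        uint16_t window = uint16_t((uint16_t(data[byte]) << 8) | data[byte + 1]);
        return uint8_t(window >> (8 - off));
    }
};

// One lookup-and-shift operation: address the CAM, take the decode, advance.
inline uint16_t lookup(const CamTable& cam, BitStream& in) {
    const CamEntry& e = cam[in.peek8()];
    in.bitPos += e.shift;
    return e.decode;
}

// Two lookup-and-shift operations per clock cycle yield one 32-bit result.
inline uint32_t decodeOne(const CamTable& cam, BitStream& in) {
    uint16_t hi = lookup(cam, in);
    uint16_t lo = lookup(cam, in);
    return (uint32_t(hi) << 16) | lo;
}
```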


Page 20: Encoding Scheme

The computer is no better than its program.

~ Elting Elmore Morison

Page 21: Encoder Implementation

The encoder was created in C++. It chooses an encoding scheme based on an analysis of the file content.

The input file is a set of instructions for the LEON2 processor, and the output is the set of encoded instructions for the decoder to decode.

The encoder adds a set of instructions (a header) to the beginning of each output file; this communicates the decoding algorithm.
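The deck does not specify the header format; the following hypothetical C++ layout only illustrates what each decode-table entry has to communicate to the decoder's Algorithm_Load mode.

```cpp
// Hypothetical header records for the decode algorithm; the actual on-file
// format used by the team's encoder is not specified in the deck.
#include <cstdint>

struct DecodeEntry {
    uint8_t  code;     // the (left-aligned) code bits the decoder matches
    uint8_t  codeLen;  // code length in bits (1..8)
    uint16_t decode;   // the original 16-bit chunk this code stands for
};

struct EncodedFileHeader {
    uint16_t entryCount;  // number of DecodeEntry records that follow;
                          // the encoded instruction stream begins after them
};
```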

Page 22: Encoding Algorithm

We experimented with using a Huffman Tree to encode the files.

[Diagram: a Huffman tree with example leaves A, B, and C.]

But with a Huffman Tree, the encoding can become 2^N bits deep (where N is the number of bits encoded)... a lot!


Page 24: Encoding Algorithm

Instead, we cut the tree off short and lump everything below that point into an “uncompressed” case.

[Diagram: the truncated Huffman tree; A, B, and C keep their short codes, and everything below the cut falls into the uncompressed case.]

Since A, B, and C are still common, and encoded in a short number of bits, we still get savings!

Page 25: Encoding Implementation

Empirical evidence suggested we encode 16 bits at a time.

We chop our Huffman tree off at a depth of 8, so a final encoding is at most 8 bits.

Uncompressed code is the 8 escape bits plus the original 16 bits, for a total of 24 bits (an 8-bit overhead). We make up for this with the other compression: each compressed chunk saves at least 8 bits, so the scheme wins whenever compressed chunks are at least as common as escaped ones.
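A minimal sketch of the escape scheme with these parameters, assuming fixed 8-bit codes for the 255 most frequent 16-bit chunks plus one reserved escape code; the real encoder derives variable-length truncated Huffman codes, and the header is omitted here.

```cpp
// Sketch of the truncated encoding: common 16-bit chunks get an 8-bit code;
// everything else is escaped as 8 + 16 = 24 bits. Fixed-width codes are a
// simplification of the variable-length truncated Huffman codes.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct BitWriter {
    std::vector<uint8_t> bytes;
    int bitsFree = 0;  // unused bits remaining in the last byte
    void put(uint32_t value, int nbits) {  // MSB-first bit output
        for (int i = nbits - 1; i >= 0; --i) {
            if (bitsFree == 0) { bytes.push_back(0); bitsFree = 8; }
            bytes.back() |= uint8_t(((value >> i) & 1u) << --bitsFree);
        }
    }
};

constexpr uint8_t kEscape = 0xFF;  // reserved code: the next 16 bits are raw

std::vector<uint8_t> encodeChunks(const std::vector<uint16_t>& chunks) {
    // Count 16-bit chunk frequencies.
    std::unordered_map<uint16_t, size_t> freq;
    for (uint16_t c : chunks) ++freq[c];

    // The 255 most frequent chunks get codes 0x00..0xFE; the rest escape.
    std::vector<std::pair<uint16_t, size_t>> ranked(freq.begin(), freq.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::unordered_map<uint16_t, uint8_t> codeOf;
    for (size_t i = 0; i < ranked.size() && i < 255; ++i)
        codeOf[ranked[i].first] = uint8_t(i);

    BitWriter out;
    for (uint16_t c : chunks) {
        auto it = codeOf.find(c);
        if (it != codeOf.end()) {
            out.put(it->second, 8);  // compressed: 8 bits stand for 16
        } else {
            out.put(kEscape, 8);     // uncompressed: 8-bit escape + raw 16
            out.put(c, 16);
        }
    }
    return out.bytes;
}
```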

Page 26: Encoding Implementation

Encoding takes three passes.

First pass – analyze the instructions in 16-bit chunks and record the locations of branch instructions and branch targets.
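An illustrative first-pass sketch, assuming it recognizes only SPARC V8 Bicc branches (LEON2 is SPARC V8) by their op/op2 fields; the real pass records all branch and jump forms and works in 16-bit chunks.

```cpp
// First-pass sketch: scan LEON2 (SPARC V8) instruction words and record
// which word indices are branch targets, so the encoder can align them.
// Only Bicc branches (op = 00, op2 = 010) are recognized in this sketch.
#include <cstdint>
#include <set>
#include <vector>

std::set<size_t> findBranchTargets(const std::vector<uint32_t>& insns) {
    std::set<size_t> targets;
    for (size_t i = 0; i < insns.size(); ++i) {
        const uint32_t w = insns[i];
        const bool isBicc = ((w >> 30) == 0u) && (((w >> 22) & 7u) == 2u);
        if (!isBicc) continue;
        const int32_t disp22 = int32_t(w << 10) >> 10;  // sign-extend disp22
        const int64_t target = int64_t(i) + disp22;     // PC-relative, in words
        if (target >= 0 && size_t(target) < insns.size())
            targets.insert(size_t(target));
    }
    return targets;
}
```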

Page 27: Encoding Implementation

Second pass – encode the instructions:
- Place the target addresses at the beginning of a new instruction word.
- Leave jump instructions un-encoded.
- Analyze where the new target instructions will be located.

Third pass – write the encoding to an output file.

Page 28: Compression Analysis

We used test instruction sets that came with the GNU-licensed VHDL LEON2 processor.

[Chart: “Savings Gained on the LEON2” – percent saved per encoded file (fram.dat, mmram.dat, mram.dat, ram.dat, rom.dat, romsd.dat, romsdm.dat), on a scale of 0.0% to 14.0%.]

Page 29: Results

We are seeing 5% to 12% savings in instruction size.

More compression could be realized if the algorithm descriptions were themselves compressed.


Page 30: Conclusions

There is an obtainable gain from pursuing compression this way:
- The hardware implementation is unobtrusive.
- A compiler could easily run the encoder after link time.
- The savings are positive.

Page 31: Questions?

Team Lugnuts
