Submitters: Vitaly Panor, Tal Joffe
Instructors: Zvika Guz, Koby Gottlieb
Software Laboratory, Electrical Engineering Faculty
Technion, Israel
Project Goal
Gain knowledge on software optimization.
Learn and implement different optimization techniques.
Get acquainted with different performance analysis tools.
Optimization Approaches
Multithreading (main part).
Implementation considerations.
Architectural considerations.
Chosen Program
Called EOCF.
Implements the Burrows–Wheeler lossless compression algorithm, by M. Burrows and D.J. Wheeler.
Can compress and decompress files.
We chose to work on the compression part.
Algorithm Description
Compression:
The source file is read in blocks of bytes.
The Burrows–Wheeler transform, followed by the Move-To-Front transform, is applied to each block.
Each processed block is written to a temp file.
After all the blocks have been written, Huffman compression is applied to the temp file.
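The per-block compression steps above can be sketched in a few lines. EOCF itself is a WIN32 C/C++ program whose source is not shown here, so this is an illustrative Python stand-in (the naive sorted-rotations BWT shown is O(n² log n); real implementations use suffix sorting), with function names of our own choosing:

```python
def bwt(block: bytes):
    """Burrows-Wheeler transform: sort all rotations of the block and
    keep the last column, plus the index needed to invert the transform."""
    n = len(block)
    rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
    last_column = bytes(block[(i - 1) % n] for i in rotations)
    primary_index = rotations.index(0)  # where the original block landed
    return last_column, primary_index

def mtf(data: bytes) -> bytes:
    """Move-To-Front: emit each byte's position in a self-organizing list,
    so runs of similar bytes (which BWT produces) become small numbers."""
    table = list(range(256))
    out = bytearray()
    for b in data:
        idx = table.index(b)
        out.append(idx)
        table.pop(idx)
        table.insert(0, b)
    return bytes(out)

last, idx = bwt(b"banana")
encoded = mtf(last)
```

The MTF output is then what gets written to the temp file for the Huffman pass.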
Algorithm Description
Decompression:
Performing the compression algorithm in reverse order.
EOCF – Program Structure
Block processing section (for each block in the file): Read Block → BW transformation → MTF transformation → Write Block to temporary file.
After the block processing section, Huffman compression turns the Temp File into the Output File.
Code Analysis
The following two functions account for about 2/3 of the runtime:
Code Analysis
The conclusion:
The code spends most of the runtime performing the transformations.
Multi-Threading
Based on the results, the block processing section was multi-threaded.
The Huffman compression section was not multi-threaded.
A data decomposition approach was used.
Data Decomposition
The data decomposition approach in general:
Instead of a single pipeline (Func1 → Func2 → … → FuncN) processing the whole input, the input is split into N parts, and each of K threads (Thread 1 … Thread K) runs the full Func1 → Func2 → … → FuncN chain on its own part (Input/N); the partial results are then combined into the output.
Data Decomposition
The data decomposition approach applied to EOCF:
The Input File is read into input buffers; each of threads 1…n runs BW + MTF on blocks taken from the input buffers and writes the results to output buffers, which are flushed to the Temp File; Huffman compression then produces the Output File.
Thread Design
Read a block from the input buffer.
Perform the transformations.
Write to the output buffer.
Fill the input buffer or empty the output buffer if needed.
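The worker loop above can be sketched with thread-safe queues standing in for the input/output buffers. This is a hedged illustration, not EOCF's code (which uses WIN32 threads in C/C++); `transform` here is a trivial placeholder for the BW + MTF pass, and the `(seq, block)` tagging is our own device for keeping blocks identifiable:

```python
import queue
import threading

def transform(data: bytes) -> bytes:
    # placeholder for the real BW + MTF transformations on a block
    return data[::-1]

def worker(input_buf: queue.Queue, output_buf: queue.Queue) -> None:
    while True:
        item = input_buf.get()       # read a block from the input buffer
        if item is None:             # sentinel: no more blocks
            break
        seq, block = item
        output_buf.put((seq, transform(block)))  # write to the output buffer

inp, out = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(inp, out)) for _ in range(2)]
for t in threads:
    t.start()
for i, blk in enumerate([b"abc", b"def", b"ghi"]):
    inp.put((i, blk))
for _ in threads:
    inp.put(None)                    # one sentinel per worker
for t in threads:
    t.join()
results = dict(out.get() for _ in range(3))
```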
Thread Design
The worker-thread flowchart:
1. If the current read buffer is empty, fill the buffer from the input file.
2. Read the next input block; if the current block is the last block, finish.
3. Perform the transformations.
4. Write the block to the output buffer.
5. If the current write buffer is full, write the buffer to the temp file.
6. Repeat from step 1.
Implementation
The WIN32 API was used rather than the OpenMP API.
It yields better performance, according to research based on previous projects and internet articles.
Synchronization
Critical Section objects were used.
They provide a slightly faster, more efficient mechanism for mutual exclusion than Mutex objects.
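The mutual-exclusion pattern is the same regardless of primitive. As a hedged, portable sketch (a `threading.Lock` standing in for the WIN32 `CRITICAL_SECTION`; the counter and function names are ours, not EOCF's):

```python
import threading

lock = threading.Lock()        # stand-in for a WIN32 CRITICAL_SECTION
blocks_done = 0

def count_blocks(times: int) -> None:
    global blocks_done
    for _ in range(times):
        with lock:             # EnterCriticalSection / LeaveCriticalSection
            blocks_done += 1   # shared state touched only inside the lock

threads = [threading.Thread(target=count_blocks, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, the concurrent read-modify-write on `blocks_done` would race; with it, the final count is exact.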
Thread Performance
The threads share the load almost equally, and about 2/3 of the time is spent in the parallel section, as expected.
Thread Checker
Thread Checker found no errors.*
*The warning is due to the fact that we have a thread (main) that waits for the worker threads to finish.
Number of Threads
Best performance is achieved when the number of threads equals the number of cores.
On a Dual Core:
Input Buffers
We implement the double-buffering technique.
While one buffer is being filled, other threads continue to read from the second buffer.
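The double-buffering idea can be approximated with a two-slot bounded queue: the reader thread fills one slot while consumers drain the other, so file reads overlap with block processing. This is an illustrative sketch (a `queue.Queue(maxsize=2)` standing in for EOCF's two input buffers), not the project's actual buffer code:

```python
import io
import queue
import threading

BUF_SIZE = 4  # illustrative; the project's measured optimum was 16KB

def reader(stream, slots: queue.Queue) -> None:
    """Producer: keeps one buffer full while consumers drain the other."""
    while True:
        chunk = stream.read(BUF_SIZE)
        slots.put(chunk)        # blocks only when both slots are full
        if not chunk:           # empty read = end of file, acts as sentinel
            break

slots = queue.Queue(maxsize=2)  # exactly two buffers
t = threading.Thread(target=reader,
                     args=(io.BytesIO(b"abcdefghij"), slots))
t.start()
consumed = []
while True:
    chunk = slots.get()
    if not chunk:
        break
    consumed.append(chunk)
t.join()
```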
Output Buffers
To comply with the decompression algorithm, sequential output had to be achieved.
Based on empirical observation, we hold enough buffers that each thread can write at least four blocks.
The minimum number of buffers is two.
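Since threads may finish blocks out of order, sequential output requires a small reorder stage that releases blocks strictly by sequence number. The sketch below is our own illustration of that requirement, not EOCF's implementation:

```python
def ordered_writer(results, write) -> None:
    """Buffer out-of-order (seq, block) pairs and emit them strictly
    in sequence order, as the decompression algorithm requires."""
    pending, next_seq = {}, 0
    for seq, block in results:
        pending[seq] = block
        # flush every block that is now contiguous with what was written
        while next_seq in pending:
            write(pending.pop(next_seq))
            next_seq += 1

out = []
# blocks arrive out of order: 1, 0, 2
ordered_writer([(1, b"B"), (0, b"A"), (2, b"C")], out.append)
```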
Buffer Size
Based on our observations when using a Dual Core processor, the optimal buffer size is 16KB:
Data Sharing and Alignment
To eliminate false sharing, the following steps were taken:
Moving as much shared data as possible into each thread's private data.
Aligning shared arrays of data to the cache-line size, when each individual element is accessed by a different thread.
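The cache-line-alignment step means spacing per-thread elements so that no two threads write to the same 64-byte line. In C/C++ this is done with alignment/padding attributes; the Python sketch below only mimics the memory layout (and assumes a 64-byte line and 8-byte elements) to show the arithmetic:

```python
import array

CACHE_LINE = 64                   # bytes; typical for Intel CPUs
ELEM_SIZE = 8                     # bytes per 'q' (signed 64-bit) element
STRIDE = CACHE_LINE // ELEM_SIZE  # elements needed to span one cache line

def make_padded_counters(n_threads: int) -> array.array:
    # one full cache line per thread, instead of packing the per-thread
    # counters back to back (which would put several on one line)
    return array.array("q", [0] * (n_threads * STRIDE))

def slot(tid: int) -> int:
    return tid * STRIDE           # each thread writes only to its own line

counters = make_padded_counters(4)
counters[slot(2)] += 1
```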
Data Sharing and Alignment
Runtime without cache alignment: 198.2 sec
Runtime with cache alignment: 197.4 sec
Overall improvement of about 0.4%
Optimization Achieved
Using a Dual Core processor, the ideal speed-up would be ×2.
Since we multi-threaded only about 2/3 of the code, by Amdahl's law we could expect a speed-up of about ×1.5:
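The expected bound follows from Amdahl's law with parallel fraction p = 2/3 on n = 2 cores (a one-line worked check, not from the original slides):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: p = fraction of runtime parallelised, n = cores."""
    return 1.0 / ((1.0 - p) + p / n)

# 2/3 of the runtime parallelised, run on a dual-core machine
expected = amdahl_speedup(2 / 3, 2)   # 1 / (1/3 + 1/3) = 1.5
```

The measured ×1.47 below sits just under this ×1.5 bound, consistent with the thread-management overhead noted next.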
Optimization Achieved
We have achieved a speed-up of:
×1.47
Unavoidably, we lose time on managing and synchronizing threads.
Comparison to Other Intel Architectures
We ran our program on 2 other computers: Intel® Core™2 Quad and Intel® Core™ i7.
Measured speed-ups (chart values): ×1.47, ×1.9, ×1.96, ×2.17.