Submitters: Vitaly Panor, Tal Joffe
Instructors: Zvika Guz, Koby Gottlieb
Software Laboratory, Electrical Engineering Faculty
Technion, Israel
Project Goal
Gain knowledge on software optimization.
Learn and implement different optimization techniques.
Get acquainted with different performance analysis tools.
Optimization Approaches
Multithreading (main part).
Implementation considerations.
Architectural considerations.
Chosen Program
Called EOCF.
Implements the Burrows–Wheeler lossless compression algorithm, by M. Burrows and D.J. Wheeler.
Can compress and decompress files.
We chose to work on the compression part.
Algorithm Description
Compression:
The source file is read in blocks of bytes.
The Burrows–Wheeler transform, followed by the Move-To-Front transform, is applied to each block.
Each processed block is written to a temp file.
After all the blocks have been written, Huffman compression is applied to the temp file.
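The per-block compression steps above can be sketched in a few lines. EOCF itself is a WIN32 C/C++ program whose source is not shown here, so this is an illustrative Python stand-in (the naive sorted-rotations BWT shown is O(n² log n); real implementations use suffix sorting), with function names of our own choosing:

```python
def bwt(block: bytes):
    """Burrows-Wheeler transform: sort all rotations of the block and
    keep the last column, plus the index needed to invert the transform."""
    n = len(block)
    rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
    last_column = bytes(block[(i - 1) % n] for i in rotations)
    primary_index = rotations.index(0)  # where the original block landed
    return last_column, primary_index

def mtf(data: bytes) -> bytes:
    """Move-To-Front: emit each byte's position in a self-organizing list,
    so runs of similar bytes (which BWT produces) become small numbers."""
    table = list(range(256))
    out = bytearray()
    for b in data:
        idx = table.index(b)
        out.append(idx)
        table.pop(idx)
        table.insert(0, b)
    return bytes(out)

last, idx = bwt(b"banana")
encoded = mtf(last)
```

The MTF output is then what gets written to the temp file for the Huffman pass.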
Algorithm Description
Decompression:
Performing the compression algorithm in reverse order.
EOCF – Program Structure
Block processing section (for each block in the file): Read Block → BW transformation → MTF transformation → Write Block to temporary file.
After the block processing section, Huffman compression turns the Temp File into the Output File.
Code Analysis
The following two functions account for about 2/3 of the runtime:
Code Analysis
The conclusion:
The code spends most of the runtime performing the transformations.
Multi-Threading
Based on the results, the block processing section was multi-threaded.
The Huffman compression section was not multi-threaded.
A data decomposition approach was used.
Data Decomposition
The data decomposition approach in general:
Instead of a single pipeline (Func1 → Func2 → … → FuncN) processing the whole input, the input is split into N parts, and each of K threads (Thread 1 … Thread K) runs the full Func1 → Func2 → … → FuncN chain on its own part (Input/N); the partial results are then combined into the output.
Data Decomposition
The data decomposition approach applied to EOCF:
The Input File is read into input buffers; each of threads 1…n runs BW + MTF on blocks taken from the input buffers and writes the results to output buffers, which are flushed to the Temp File; Huffman compression then produces the Output File.
Thread Design
Read a block from the input buffer.
Perform the transformations.
Write to the output buffer.
Fill the input buffer or empty the output buffer if needed.
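The worker loop above can be sketched with thread-safe queues standing in for the input/output buffers. This is a hedged illustration, not EOCF's code (which uses WIN32 threads in C/C++); `transform` here is a trivial placeholder for the BW + MTF pass, and the `(seq, block)` tagging is our own device for keeping blocks identifiable:

```python
import queue
import threading

def transform(data: bytes) -> bytes:
    # placeholder for the real BW + MTF transformations on a block
    return data[::-1]

def worker(input_buf: queue.Queue, output_buf: queue.Queue) -> None:
    while True:
        item = input_buf.get()       # read a block from the input buffer
        if item is None:             # sentinel: no more blocks
            break
        seq, block = item
        output_buf.put((seq, transform(block)))  # write to the output buffer

inp, out = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(inp, out)) for _ in range(2)]
for t in threads:
    t.start()
for i, blk in enumerate([b"abc", b"def", b"ghi"]):
    inp.put((i, blk))
for _ in threads:
    inp.put(None)                    # one sentinel per worker
for t in threads:
    t.join()
results = dict(out.get() for _ in range(3))
```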
Thread Design
The worker-thread flowchart:
1. If the current read buffer is empty, fill the buffer from the input file.
2. Read the next input block; if the current block is the last block, finish.
3. Perform the transformations.
4. Write the block to the output buffer.
5. If the current write buffer is full, write the buffer to the temp file.
6. Repeat from step 1.
Implementation
The WIN32 API was used rather than the OpenMP API.
It yields better performance, according to research based on previous projects and internet articles.
Synchronization
Critical Section objects were used.
They provide a slightly faster, more efficient mechanism for mutual exclusion than Mutex objects.
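The mutual-exclusion pattern is the same regardless of primitive. As a hedged, portable sketch (a `threading.Lock` standing in for the WIN32 `CRITICAL_SECTION`; the counter and function names are ours, not EOCF's):

```python
import threading

lock = threading.Lock()        # stand-in for a WIN32 CRITICAL_SECTION
blocks_done = 0

def count_blocks(times: int) -> None:
    global blocks_done
    for _ in range(times):
        with lock:             # EnterCriticalSection / LeaveCriticalSection
            blocks_done += 1   # shared state touched only inside the lock

threads = [threading.Thread(target=count_blocks, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, the concurrent read-modify-write on `blocks_done` would race; with it, the final count is exact.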
Thread Performance
The threads share the load almost equally, and about 2/3 of the time is spent in the parallel section, as expected.
Thread Checker
Thread Checker found no errors.*
*The warning is due to the fact that we have a thread (main) that waits for the worker threads to finish.
Number of Threads
Best performance is achieved when the number of threads equals the number of cores.
On a Dual Core:
Input Buffers
We implement the double-buffering technique.
While one buffer is being filled, other threads continue to read from the second buffer.
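The double-buffering idea can be approximated with a two-slot bounded queue: the reader thread fills one slot while consumers drain the other, so file reads overlap with block processing. This is an illustrative sketch (a `queue.Queue(maxsize=2)` standing in for EOCF's two input buffers), not the project's actual buffer code:

```python
import io
import queue
import threading

BUF_SIZE = 4  # illustrative; the project's measured optimum was 16KB

def reader(stream, slots: queue.Queue) -> None:
    """Producer: keeps one buffer full while consumers drain the other."""
    while True:
        chunk = stream.read(BUF_SIZE)
        slots.put(chunk)        # blocks only when both slots are full
        if not chunk:           # empty read = end of file, acts as sentinel
            break

slots = queue.Queue(maxsize=2)  # exactly two buffers
t = threading.Thread(target=reader,
                     args=(io.BytesIO(b"abcdefghij"), slots))
t.start()
consumed = []
while True:
    chunk = slots.get()
    if not chunk:
        break
    consumed.append(chunk)
t.join()
```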
Output Buffers
To comply with the decompression algorithm, sequential output had to be achieved.
Based on empirical observation, we hold enough buffers that each thread can write at least four blocks.
The minimum number of buffers is two.
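Since threads may finish blocks out of order, sequential output requires a small reorder stage that releases blocks strictly by sequence number. The sketch below is our own illustration of that requirement, not EOCF's implementation:

```python
def ordered_writer(results, write) -> None:
    """Buffer out-of-order (seq, block) pairs and emit them strictly
    in sequence order, as the decompression algorithm requires."""
    pending, next_seq = {}, 0
    for seq, block in results:
        pending[seq] = block
        # flush every block that is now contiguous with what was written
        while next_seq in pending:
            write(pending.pop(next_seq))
            next_seq += 1

out = []
# blocks arrive out of order: 1, 0, 2
ordered_writer([(1, b"B"), (0, b"A"), (2, b"C")], out.append)
```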
Buffer Size
Based on our observations when using a Dual Core processor, the optimal buffer size is 16KB:
Data Sharing and Alignment
To eliminate false sharing, the following steps were taken:
Moving as much shared data as possible into each thread's private data.
Aligning shared arrays of data to the cache-line size, when each individual element is accessed by a different thread.
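The cache-line-alignment step means spacing per-thread elements so that no two threads write to the same 64-byte line. In C/C++ this is done with alignment/padding attributes; the Python sketch below only mimics the memory layout (and assumes a 64-byte line and 8-byte elements) to show the arithmetic:

```python
import array

CACHE_LINE = 64                   # bytes; typical for Intel CPUs
ELEM_SIZE = 8                     # bytes per 'q' (signed 64-bit) element
STRIDE = CACHE_LINE // ELEM_SIZE  # elements needed to span one cache line

def make_padded_counters(n_threads: int) -> array.array:
    # one full cache line per thread, instead of packing the per-thread
    # counters back to back (which would put several on one line)
    return array.array("q", [0] * (n_threads * STRIDE))

def slot(tid: int) -> int:
    return tid * STRIDE           # each thread writes only to its own line

counters = make_padded_counters(4)
counters[slot(2)] += 1
```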
Data Sharing and Alignment
Runtime without cache alignment: 198.2 sec
Runtime with cache alignment: 197.4 sec
Overall improvement of about 0.4%
Optimization Achieved
Using a Dual Core processor, the ideal speed-up would be ×2.
Since we multi-threaded only about 2/3 of the code, by Amdahl's law we could expect a speed-up of about ×1.5:
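The expected bound follows from Amdahl's law with parallel fraction p = 2/3 on n = 2 cores (a one-line worked check, not from the original slides):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: p = fraction of runtime parallelised, n = cores."""
    return 1.0 / ((1.0 - p) + p / n)

# 2/3 of the runtime parallelised, run on a dual-core machine
expected = amdahl_speedup(2 / 3, 2)   # 1 / (1/3 + 1/3) = 1.5
```

The measured ×1.47 below sits just under this ×1.5 bound, consistent with the thread-management overhead noted next.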
Optimization Achieved
We have achieved a speed-up of:
×1.47
Unavoidably, we lose time on managing and synchronizing threads.
Comparison to Other Intel Architectures
We ran our program on 2 other computers: Intel® Core™2 Quad and Intel® Core™ i7.
Measured speed-ups (chart values): ×1.47, ×1.9, ×1.96, ×2.17.