LYU0703 Parallel Distributed Programming on PS3
Huang Hiu Fung 05700512
Wong Chung Hoi 05596742
Supervised by Prof. Michael R. Lyu
Department of Computer Science and Engineering, CUHK
2007-2008 Final Year Project Presentation (1st term)
Agenda
• Background Information
• Architecture of PlayStation®3
• Principles of Parallel Programming
• Optimization of the ADVISER program:
  1. Sequential Approach
  2. Parallel Approach
• Conclusion
• Future Work
• Q&A
Background Information
Limitations of single-core processors:
1. Memory Access Latency
2. Wire Delays
3. Power Consumption
Power Consumption
P = C × V² × F
P = power
C = capacitance
V = voltage
F = processor frequency (cycles per second)
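A small worked sketch may make the power argument concrete. It assumes the standard dynamic-power relation P = C·V²·F; the function names and the numeric values in the usage note are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Dynamic power P = C * V^2 * F, in arbitrary but consistent units. */
double dynamic_power(double capacitance, double voltage, double frequency)
{
    assert(capacitance >= 0.0 && frequency >= 0.0);
    return capacitance * voltage * voltage * frequency;
}

/* Total power of `cores` identical cores running at the same V and F. */
double multicore_power(int cores, double capacitance, double voltage,
                       double frequency)
{
    assert(cores > 0);
    return cores * dynamic_power(capacitance, voltage, frequency);
}
```

With made-up values, one core at V = 1.0 and F = 4.0 gives dynamic_power(1.0, 1.0, 4.0) = 4.0, while two cores at V = 0.5 and F = 2.0 give multicore_power(2, 1.0, 0.5, 2.0) = 1.0. The combined clock rate is the same, but because voltage enters squared, the lower-voltage pair draws less power in this model.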
Development of Multi-Core Processor
Fig. 1.4 Growth of No. of Cores in Processors
Development of Multi-Core Processor
• Reduce power consumption
  - use multiple cores at low frequency instead of one core at high frequency
• Efficient processing of multiple tasks
  - divide the computation work
  - execute it among the cores concurrently
Project Objectives
• Parallel programming is needed to optimize computation-intensive applications
• Study the features of parallel programming and compare the sequential and parallel approaches
• Optimize an application to demonstrate the improvement achieved by parallel programming
Architecture of PlayStation®3 (PS3)
• A multi-core machine produced by Sony, with the Cell Broadband Engine
• Strong Computation Power
• Open platform for other applications and development
Cell Broadband Engine (Cell BE)
PPE – Power Processor Element
SPE – Synergistic Processor Element
EIB – Element Interconnect Bus
Power Processor Element (PPE)
• Based on the 64-bit PowerPC architecture
• General-purpose operation
• Designed for control-intensive tasks
• Controls I/O of main memory and other devices through the OS
• Controls all 8 SPEs
Fig. 2.5 Design of PPE
Synergistic Processor Element (SPE)
• Designed to provide computation performance
• SPU – performs the allocated task
• LS – the only memory the SPU accesses directly
• MFC – controls data transfer
• 8 SPEs in total in the Cell
• Only 6 accessible: 1 is reserved for system software, 1 is disabled
Fig. 2.6 Design of an SPE
Element Interconnect Bus (EIB)
• Internal communication bus inside the Cell
• Connects the different elements: PPE, SPEs, memory controller
Fig. 2.7 Data Flow and Program Control
Principles of Parallel Programming
Parallel algorithm                  Serial algorithm
multiple processing units           single processing unit
communication overhead              no communication overhead
higher code complexity              straightforward code
must balance load between PUs       everything is done by one CPU
Concept of Load Balance
• Distribute data evenly
• Total runtime depends on the busiest processing element
• Computation time is wasted on idle processing elements
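The load-balance idea can be sketched in a few lines of C. This is a minimal illustration, not the project's code: `share_of` and `busiest_load` are hypothetical helper names, and the work is modelled simply as a count of items per processing element.

```c
#include <assert.h>
#include <stddef.h>

/* Items given to worker `w` when `n` items are split as evenly as possible
   over `workers` workers (the first n % workers workers get one extra). */
size_t share_of(size_t n, size_t workers, size_t w)
{
    assert(workers > 0 && w < workers);
    return n / workers + (w < n % workers ? 1 : 0);
}

/* Total runtime is set by the busiest worker, i.e. the maximum load. */
size_t busiest_load(const size_t *load, size_t workers)
{
    assert(load != NULL && workers > 0);
    size_t max = load[0];
    for (size_t w = 1; w < workers; ++w)
        if (load[w] > max)
            max = load[w];
    return max;
}
```

With an even split, no worker holds more than one item above any other, so the busiest load stays close to n / workers. With a skewed split such as loads {10, 2, 3}, the run takes as long as the worker holding 10 items while the others sit idle.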
Method of parallelism
• Data parallelism
• Task parallelism
Parallel Architecture
Flynn's taxonomy
                 Single Instruction    Multiple Instruction
Single Data      SISD                  MISD
Multiple Data    SIMD                  MIMD
SISD
• Traditional Computer
• von Neumann model
SIMD
• Same instruction on all data
• Data parallelism
• SIMD intrinsic function
MISD
• No well-known system
• Mentioned for completeness
MIMD
• Different instructions on different data
• Task parallelism
• Further broken down into
  – Shared Memory System
  – Distributed Memory System
Shared Memory System
• Access to a central memory for data
• PS3: achieved by the MFC issuing DMA commands
Distributed Memory System
• Each PE has its own memory
• PS3: each SPE has a 256 KB Local Store
• PS3 is a hybrid shared-distributed memory system
ADVISER
• Compares 2 video clips
1. Generate meaningful data (in the form of numbers) for frames from the video
2. Compare and look for the most similar frames
3. Locate the similar segment, which consists of a series of very similar frames
Input
• 2 folders: “Repository” and “Target”
• hl3 file = vector of 1024 double-precision values
Input                   No. of hl3 files
Target directory        5473
Repository directory    7547
Processing
• hl3 file = vector of 1024 double precision values
• File P
• File Q
• Similarity = \sum_{i=1}^{1024} (P_i - Q_i)^2
• The smaller the value, the better the match
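The comparison loop shown later in the deck accumulates squared differences between corresponding values of the two files, so the difference value can be sketched as a sum of squared differences. Treat this as an assumption about the exact formula (for instance, whether a square root is applied afterwards is not recoverable here); the function name `hl3_difference` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define HL3_LEN 1024  /* values per hl3 file, as stated on the slide */

/* Sum of squared element-wise differences; smaller means more similar.
   ASSUMPTION: the metric is inferred from the SIMD loop in the deck. */
double hl3_difference(const double *p, const double *q, size_t n)
{
    assert(p != NULL && q != NULL);
    double diff = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = p[i] - q[i];
        diff += d * d;
    }
    return diff;
}
```

For two full files the call would be hl3_difference(p, q, HL3_LEN). Identical vectors give 0, which is the best possible match under this metric.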
Output
• M “Target” files, N “Repository” files
• Complexity: O(M × N) comparisons
• Computation time = 633 sec
• Flash demo
target hl3 1 most match repository A difference value = ??
target hl3 2 most match repository B difference value = ??
target hl3 3 most match repository C difference value = ??
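The O(M × N) search behind this output can be sketched as a brute-force nearest-match loop: every target vector is compared against every repository vector and the smallest difference wins. This is an illustrative sketch; `best_matches` and its row-major array layout are assumptions, and the metric is the assumed sum of squared differences.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Assumed metric: sum of squared element-wise differences. */
static double squared_difference(const double *p, const double *q, size_t len)
{
    double diff = 0.0;
    for (size_t i = 0; i < len; ++i) {
        double d = p[i] - q[i];
        diff += d * d;
    }
    return diff;
}

/* For each of the m target vectors, record the index and difference value
   of the closest of the n repository vectors. Vectors are stored back to
   back, `len` values each. Performs m * n comparisons. */
void best_matches(const double *target, size_t m,
                  const double *repo, size_t n, size_t len,
                  size_t *best_index, double *best_diff)
{
    assert(n > 0);
    for (size_t t = 0; t < m; ++t) {
        best_index[t] = 0;
        best_diff[t] = DBL_MAX;
        for (size_t r = 0; r < n; ++r) {
            double d = squared_difference(target + t * len,
                                          repo + r * len, len);
            if (d < best_diff[t]) {
                best_diff[t] = d;
                best_index[t] = r;
            }
        }
    }
}
```

With the file counts quoted in the deck (5473 targets, 7547 repository files), the inner comparison runs about 41 million times, each over 1024 values, which is why this loop dominates the runtime.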
Parallel Version
• Data parallelism
• Split data to 6 SPEs evenly
• Computation time for 6 SPEs = 330 sec
• Flash demo
Parallel Version
• Expected speed up 6X
• Actual speed up 2X
• The PC CPU, the PPU and an SPE all run at different speeds
• Computation time with PC CPU = 633 sec
• Computation time with 1 SPE = 1928 sec
• Computation time with PPU = 3119 sec
• Speed: CPU > SPE > PPU
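The gap between the expected and actual speed-up depends on which baseline is used. A short sketch of the arithmetic, using only the timings quoted in the deck (633 sec on the PC, 1928 sec on 1 SPE, 330 sec on 6 SPEs); the helper names are hypothetical.

```c
#include <assert.h>

/* Speed-up of an optimized run relative to a baseline run. */
double speedup(double baseline_sec, double optimized_sec)
{
    assert(baseline_sec > 0.0 && optimized_sec > 0.0);
    return baseline_sec / optimized_sec;
}

/* Parallel efficiency: the fraction of ideal linear speed-up achieved. */
double efficiency(double baseline_sec, double optimized_sec, int workers)
{
    assert(workers > 0);
    return speedup(baseline_sec, optimized_sec) / workers;
}
```

Measured against a single SPE, speedup(1928, 330) is about 5.8, close to the expected 6X. Measured against the PC, speedup(633, 330) is only about 1.9, which is the "2X" on the slide: the six SPEs scale well, but each one starts out roughly three times slower than the PC CPU on this unoptimized code.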
Time Attack
1. SIMD intrinsic function
2. Changing data type
3. Double Buffering
4. Parallel Read
5. Distributing jobs to the idle PPE
6. SIMD on loop counter
7. Loop unrolling
SIMD intrinsic function
• Addition, subtraction, multiplication, etc.
• Operates on 128-bit registers
• Data type: double (64 bits)
• Speed up 2X
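The 2X figure follows from how many values fit in one register. A minimal sketch of that arithmetic, assuming the ideal case where speed-up equals the number of lanes; `lanes_per_register` is a hypothetical helper.

```c
#include <assert.h>

#define REGISTER_BITS 128  /* SIMD register width quoted on the slide */

/* Number of elements of the given width processed by one SIMD instruction. */
int lanes_per_register(int element_bits)
{
    assert(element_bits > 0 && REGISTER_BITS % element_bits == 0);
    return REGISTER_BITS / element_bits;
}
```

A 64-bit double gives 2 lanes, so at best two values are processed per instruction. A 32-bit type gives 4 lanes, which is the motivation for moving to narrower data types.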
Changing Data Type to int
• Precision not important
• Major speed up from SIMD intrinsics
• Data type: int (32 bits)
• Total speed up 4X
• Computation time = 71 sec
[Chart: Running Time of Parallel PS3, with SIMD Version. x-axis: No. of SPUs used (1 to 6); y-axis: seconds; series: Parallel version, Parallel + SIMD version]
Changing Data Type to float
• SPE is specialized for high-precision computation
• No intrinsics for the int data type at all
• Data type: float (32 bits)
• Saves data conversion time
• Speed up by 30%
• Computation time = 49 sec
[Chart: Running Time of Parallel, with SIMD, float input PS3 version. x-axis: No. of SPUs used (1 to 6); y-axis: seconds; series: Parallel + SIMD + int, Parallel + SIMD + float]
Double buffering
• Saves communication time
• MFC and SPU work concurrently
• 2 buffers
  – Prefetching
  – Processing
• The program is not communication-heavy
• Minor speed up
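The two-buffer pattern can be sketched in portable C. This is an emulation under stated assumptions: `fetch_chunk` is a hypothetical stand-in for an asynchronous MFC DMA transfer and is implemented here as a plain synchronous copy, so the sketch shows the buffer rotation but not the real overlap of transfer and computation. `CHUNK` is an arbitrary small size chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* values per transfer; hypothetical, kept small */

/* Stand-in for an asynchronous DMA "get" from main memory into local
   store. On a real SPE this would be an MFC DMA command; here it is a
   synchronous memcpy. */
static void fetch_chunk(float *local, const float *main_mem, size_t count)
{
    memcpy(local, main_mem, count * sizeof *local);
}

/* Sums n floats (n a multiple of CHUNK) with two buffers: while one buffer
   is processed, the next chunk is fetched into the other. */
float sum_double_buffered(const float *main_mem, size_t n)
{
    assert(main_mem != NULL && n % CHUNK == 0);
    float buf[2][CHUNK];
    float total = 0.0f;
    size_t chunks = n / CHUNK;
    int cur = 0;

    if (chunks == 0)
        return total;

    fetch_chunk(buf[cur], main_mem, CHUNK);        /* prefetch first chunk */
    for (size_t c = 0; c < chunks; ++c) {
        int next = cur ^ 1;
        if (c + 1 < chunks)                        /* start next transfer */
            fetch_chunk(buf[next], main_mem + (c + 1) * CHUNK, CHUNK);
        for (size_t i = 0; i < CHUNK; ++i)         /* process current chunk */
            total += buf[cur][i];
        cur = next;                                /* swap buffer roles */
    }
    return total;
}
```

On real hardware the processing step would first wait for the transfer into the current buffer to complete. The benefit is limited to the transfer time that can be hidden, which is consistent with the minor speed-up reported for a program that spends little time on communication.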
Parallel Reading for All Files
• Read “Target” and “Repository” concurrently
• Share file reading job among SPEs
• Performance did not improve as predicted; it was even slower
• Reason: the hard disk cannot handle concurrent requests
• Failed attempt
Distributing Job to Idling PPE
• PPE's current job: read files, distribute files, collect results
• Use its stall time to do some computation
• Relatively low computation power of the PPE
• No significant improvement
• Increased program complexity
• This approach was abandoned
Applying SIMD for Loop Counter
Most computation time is spent in this loop:
1. Initialize i = 0, diff = (0, 0, 0, 0).
2. While i < (number of floats in a file) / (number of floats packed in a register):
   A. temp = SIMD subtraction on vector i in the “Target” and “Repository” files.
   B. diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = i + 1.
4. Loop back to 2.
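The loop above can be sketched in portable C by emulating a 128-bit register as a struct of four floats. This is an emulation, not SPU code: `vec4f`, `v_sub` and `v_madd` are hypothetical stand-ins for the SIMD subtract and multiply-add operations, and the final sum across the four lanes is an assumption about how the four partial results are combined.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* floats packed in one 128-bit register */

typedef struct { float lane[LANES]; } vec4f;  /* emulated SIMD register */

/* Emulated SIMD subtraction: a - b in every lane. */
static vec4f v_sub(vec4f a, vec4f b)
{
    vec4f r;
    for (int k = 0; k < LANES; ++k)
        r.lane[k] = a.lane[k] - b.lane[k];
    return r;
}

/* Emulated SIMD multiply-add: a * b + c in every lane. */
static vec4f v_madd(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    for (int k = 0; k < LANES; ++k)
        r.lane[k] = a.lane[k] * b.lane[k] + c.lane[k];
    return r;
}

/* Steps 1 to 4: accumulate squared differences four floats at a time. */
float squared_difference(const vec4f *target, const vec4f *repo,
                         size_t n_floats)
{
    assert(n_floats % LANES == 0);
    vec4f diff = {{0.0f, 0.0f, 0.0f, 0.0f}};            /* step 1 */
    for (size_t i = 0; i < n_floats / LANES; ++i) {     /* steps 2 to 4 */
        vec4f temp = v_sub(target[i], repo[i]);         /* step A */
        diff = v_madd(temp, temp, diff);                /* step B */
    }
    /* ASSUMPTION: the four partial sums are added at the end. */
    return diff.lane[0] + diff.lane[1] + diff.lane[2] + diff.lane[3];
}
```

For a 1024-float file the loop body runs 256 times, so the counter increment and comparison in steps 3 and 4 are executed 256 times per file pair.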
Applying SIMD for Loop Counter
• Try to optimize step 3
• Apply SIMD to the loop counter
• Counter additions and comparisons are reduced by a factor of 8
Applying SIMD for Loop Counter
1. Initialize i = (0, 1, 2, 3, 4, 5, 6, 7), diff = (0, 0, 0, 0).
2. While i[0] < (number of floats in a file) / (number of floats packed in a register):
   temp = SIMD subtraction on vector i[0] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[1] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[2] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[3] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[4] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[5] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[6] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[7] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = SIMD addition (i, (8, 8, 8, 8, 8, 8, 8, 8)).
4. Loop back to 2.
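The effect of processing eight vectors per iteration can be sketched in portable C. Note the differences from the slide: this sketch uses an ordinary scalar counter that steps by 8 rather than a vector counter, and `vec4f` and `sq_diff_acc` are hypothetical emulations of a 128-bit register and of the subtract, multiply, add sequence. The lane sum at the end is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4   /* floats per emulated 128-bit register */
#define UNROLL 8  /* vectors handled per loop iteration */

typedef struct { float lane[LANES]; } vec4f;  /* emulated SIMD register */

/* Emulates: temp = a - b; acc = temp * temp + acc, lane by lane. */
static vec4f sq_diff_acc(vec4f a, vec4f b, vec4f acc)
{
    for (int k = 0; k < LANES; ++k) {
        float d = a.lane[k] - b.lane[k];
        acc.lane[k] += d * d;
    }
    return acc;
}

/* Squared difference with the loop body unrolled 8 times, so the counter
   update and loop test run once per 8 vectors instead of once per vector. */
float squared_difference_unrolled(const vec4f *t, const vec4f *r,
                                  size_t n_vectors)
{
    assert(n_vectors % UNROLL == 0);
    vec4f diff = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (size_t i = 0; i < n_vectors; i += UNROLL) {
        diff = sq_diff_acc(t[i + 0], r[i + 0], diff);
        diff = sq_diff_acc(t[i + 1], r[i + 1], diff);
        diff = sq_diff_acc(t[i + 2], r[i + 2], diff);
        diff = sq_diff_acc(t[i + 3], r[i + 3], diff);
        diff = sq_diff_acc(t[i + 4], r[i + 4], diff);
        diff = sq_diff_acc(t[i + 5], r[i + 5], diff);
        diff = sq_diff_acc(t[i + 6], r[i + 6], diff);
        diff = sq_diff_acc(t[i + 7], r[i + 7], diff);
    }
    /* ASSUMPTION: the four partial sums are added at the end. */
    return diff.lane[0] + diff.lane[1] + diff.lane[2] + diff.lane[3];
}
```

A 1024-float file is 256 vectors, which divides evenly by 8, so the loop test runs 32 times per file pair instead of 256. The arithmetic work is unchanged; only the loop overhead shrinks.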
Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version
No. of SPUs used            1     2     3     4     5     6
Read input time (sec)       4     5     3     4     4     4
Total elapsed time (sec)    286   146   97    75    60    51
Net elapsed time (sec)      282   141   94    71    56    47
[Chart: Running Time of Parallel, with SIMD, float input, SIMD for loop counter PS3 version. x-axis: No. of SPUs used (1 to 6); y-axis: seconds; series: Parallel+SIMD+float, Parallel+SIMD+float+SIMD for i]
Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version
• Little improvement (about 4%)
• Shows that further loop unrolling could give faster performance
• The best performance becomes 47 sec
Loop Unrolling
• The previous step showed that optimizing the loop can improve performance
• Complete loop unrolling
• More obvious speed up
Result of the parallel, with SIMD, float input, loop unrolling PS3 version
No. of SPUs used            1     2     3     4     5     6
Read input time (sec)       3     4     3     3     4     3
Total elapsed time (sec)    159   82    55    42    35    30
Net elapsed time (sec)      156   78    52    39    31    27
[Chart: Running Time of Parallel, with SIMD, float input, loop unrolling PS3 version. x-axis: No. of SPUs used (1 to 6); y-axis: seconds; series: Parallel+SIMD+float+SIMD for i, Parallel+SIMD+float+loop unrolling]
Result of the parallel, with SIMD, float input, loop unrolling PS3 version
• 45% faster
• The best overall performance becomes 27 sec
Conclusion of Optimization
• PC version: 633 sec
• PS3 with 1 SPU (i.e. sequential version on PS3): 1928 sec
• Final optimized version on PS3: 27 sec
• 23 times faster than the PC version
• 71 times faster than the sequential version on PS3
Conclusion of Optimization
[Chart: Elapsed time as each approach is applied, with 6 SPUs. x-axis: parallel, SIMD, float type, SIMD for i, loop unrolling; y-axis: elapsed time in seconds]
Future Work
• Port the whole ADVISER application to PlayStation®3
• Optimize throughout the whole application
Q&A
The End