LYU0703 Parallel Distributed Programming on PS3

Huang Hiu Fung 05700512
Wong Chung Hoi 05596742

Supervised by Prof. Michael R. Lyu

Department of Computer Science and Engineering, CUHK
2007-2008 Final Year Project Presentation (1st term)

Agenda

• Background Information
• Architecture of PlayStation®3
• Principles of Parallel Programming
• Optimization of the ADVISER program:
  1. Sequential Approach
  2. Parallel Approach
• Conclusion
• Future Works
• Q&A


Background Information

Limitations of single-core processors:

1. Memory Access Latency

2. Wire Delays

3. Power Consumption


Power Consumption

$P = C V^{2} F$

P = power
C = capacitance
V = voltage
F = processor frequency (cycles per second)
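As a rough illustration of why the dynamic-power relation P = C V^2 F favors several slower cores, the sketch below compares one fast core with two cores at half the frequency. All numbers are made up for illustration, and the assumption that a lower frequency permits a lower supply voltage is ours, not a figure from the slides.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power P = C * V^2 * F, in arbitrary but consistent units."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical values, not measurements of any real processor.
single_fast = dynamic_power(1.0, 1.2, 4.0e9)

# Two cores at half the frequency; the lower voltage is an assumption.
two_slow = 2 * dynamic_power(1.0, 0.9, 2.0e9)

# Ratio below 1 means the two slower cores draw less power in this model.
ratio = two_slow / single_fast
```

Under these assumed values the two slower cores deliver the same total cycles per second while drawing a bit more than half the power.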


Development of Multi-Core Processor

Fig. 1.4 Growth of No. of Cores in Processors


Development of Multi-Core Processor

• Reduce power consumption
  - use multiple cores with low frequency instead of one with high frequency

• Efficient processing of multiple tasks
  - divide the computation work
  - execute among the cores concurrently


Project Objectives

• Show the need for parallel programming to optimize computation-intensive applications

• Study the features of parallel programming and compare the sequential and parallel approaches

• Optimize an application, showing great improvement from parallel programming


Architecture of PlayStation®3 (PS3)

• A multi-core machine produced by Sony, with the Cell Broadband Engine

• Strong Computation Power

• Open platform for other applications and development


Cell Broadband Engine (Cell BE)

PPE – Power Processor Element

SPE – Synergistic Processor Element

EIB – Element Interconnect Bus


Power Processor Element (PPE)

• Based on the 64-bit PowerPC architecture

• General-purpose operation

• Designed to be control-intensive

• Controls I/O of main memory and other devices via the OS

• Controls all 8 SPEs

Fig. 2.5 Design of PPE


Synergistic Processor Element (SPE)

• Designed to provide computation performance

• SPU: performs the allocated task
• LS (Local Store): the only memory of the SPE
• MFC: controls data transfer
• 8 SPEs in total in the Cell
• Only 6 accessible: 1 is reserved for system software and 1 is disabled

Fig. 2.6 Design of a SPE


Element Interconnect Bus (EIB)

• Internal communication bus inside Cell

• Connects different elements: PPE, SPEs, memory controller

Fig. 2.7 Data Flow and Program Control


Principles of Parallel Programming

Parallel algorithm                     Serial algorithm
multiple processing units              single processing unit
communication overhead                 no communication overhead
higher complexity in code              straightforward code
must ensure load balance between PUs   everything is done by one CPU


Concept of Load Balance

• Distribute data evenly

• Total runtime depends on the busiest processing element

• Computation time is wasted on idling processing elements
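The effect of load balance can be modeled in a few lines. This is a minimal sketch with hypothetical workloads and a hypothetical cost of one second per item: the elapsed time is set by the busiest processing element, and every other element idles until it finishes.

```python
def total_runtime(work_per_pe, seconds_per_item=1.0):
    """Elapsed time of a parallel step is set by the busiest processing element."""
    return max(work_per_pe) * seconds_per_item

def idle_time(work_per_pe, seconds_per_item=1.0):
    """Total time the other processing elements spend waiting for the busiest one."""
    finish = total_runtime(work_per_pe, seconds_per_item)
    return sum(finish - items * seconds_per_item for items in work_per_pe)

# Both distributions hold 600 items in total (hypothetical numbers).
balanced = [100, 100, 100, 100, 100, 100]
skewed = [350, 50, 50, 50, 50, 50]
```

With the same total amount of work, the skewed distribution takes 3.5 times as long in this model, and most of the machine sits idle.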


Method of parallelism

• Data parallelism

• Task parallelism

Parallel Architecture

Flynn's taxonomy

                 Single Instruction   Multiple Instruction
Single Data      SISD                 MISD
Multiple Data    SIMD                 MIMD

SISD

• Traditional Computer

• von Neumann model


SIMD

• Same instruction on all data

• Data parallelism

• SIMD intrinsic function


MISD

• No well-known system

• Mentioned for completeness

MIMD

• Different instructions on different data

• Task parallelism

• Further broken down into:
  – Shared Memory System
  – Distributed Memory System

Shared Memory System

• Access to central memory for data

• PS3: achieved by the MFC issuing DMA commands

Distributed Memory System

• Each PE has its own memory

• PS3: each SPE has a 256 KB Local Store

• PS3 is a hybrid shared-distributed memory system

ADVISER

• Compares 2 video clips:

1. Generating meaningful data (in the form of numbers) for frames of the video

2. Comparing and looking for the most similar frames

3. Locating the similar segment, which consists of a series of very similar frames

Input

• 2 folders: “Repository” and “Target”

• hl3 file = vector of 1024 double-precision values

Input                  No. of hl3 files
Target directory       5473
Repository directory   7547

Processing

• hl3 file = vector of 1024 double precision values

• File P

• File Q

• Similarity =

• The smaller the value, the better the match
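The similarity formula itself did not survive on this slide. The loop shown on the later "Applying SIMD for Loop Counter" slides accumulates (temp * temp) over element-wise differences, so the sketch below assumes the measure is a sum of squared differences between the two vectors. The function name is hypothetical.

```python
def similarity(p, q):
    """Sum of squared element-wise differences (assumed measure); 0 means identical."""
    if len(p) != len(q):
        raise ValueError("both vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Tiny vectors stand in for the 1024-value hl3 vectors.
example = similarity([1.0, 2.0, 3.0], [1.0, 0.0, 5.0])
```

Here the first elements match and the other two each differ by 2, giving 0 + 4 + 4.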


Output

• M “Target” files, N “Repository” files

• O(M * N) comparisons

• Computation time = 633 sec

• Flash demo

Sample output:
target hl3 1 most match repository A difference value = ??
target hl3 2 most match repository B difference value = ??
target hl3 3 most match repository C difference value = ??
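The O(M * N) cost comes from comparing every target vector against every repository vector and keeping the closest one. A minimal sequential sketch follows, assuming a sum-of-squared-differences measure; the function name and the toy data are hypothetical.

```python
def best_matches(targets, repository):
    """For each target vector, return (index of closest repository vector, difference)."""
    results = []
    for t in targets:                          # M iterations
        best_index, best_diff = None, float("inf")
        for j, r in enumerate(repository):     # N iterations each
            diff = sum((a - b) ** 2 for a, b in zip(t, r))
            if diff < best_diff:
                best_index, best_diff = j, diff
        results.append((best_index, best_diff))
    return results

# Toy 2-element vectors stand in for the 1024-value hl3 vectors.
matches = best_matches([[0, 0], [5, 5]], [[4, 4], [1, 0], [6, 5]])
```

With the file counts on the Input slide, the inner comparison runs 5473 * 7547 times, which is why this step dominates the running time.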


Parallel Version

• Data parallelism

• Split data to 6 SPEs evenly

• Computation time for 6 SPEs = 330 sec

• Flash demo
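Splitting the data evenly can be sketched as computing contiguous index ranges whose sizes differ by at most one. The slides do not say which set of files is divided; this example assumes it is the 5473 target files, and the helper name is hypothetical.

```python
def split_evenly(n_items, n_workers):
    """Return (start, stop) index ranges whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# Assumption: the target files are the ones divided among the 6 SPEs.
chunks = split_evenly(5473, 6)
```

Each range would then be handed to one SPE, so that no SPE receives more than one file beyond any other.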


Parallel Version

• Expected speed up: 6X

• Actual speed up: 2X

• The PC CPU, the PPU and the SPE all run at different speeds

• Computation time with CPU = 633 sec

• Computation time with 1 SPE = 1928 sec

• Computation time with PPU = 3119 sec

• Speed: CPU > SPE > PPU
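The gap between the expected and the actual speed-up depends on the baseline chosen. Using only the timings reported on these slides (330 sec is the 6-SPE figure from the previous slide), the arithmetic can be checked as follows; the helper name is ours.

```python
def speedup(baseline_sec, optimized_sec):
    """How many times faster the optimized run is than the baseline."""
    return baseline_sec / optimized_sec

pc_sec, one_spe_sec, six_spe_sec = 633, 1928, 330

vs_pc = speedup(pc_sec, six_spe_sec)            # about 1.9
vs_one_spe = speedup(one_spe_sec, six_spe_sec)  # about 5.8
```

Measured against a single SPE, six SPEs come close to the expected 6X; the 2X figure appears only because the PC CPU is much faster than one SPE.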


Time Attack

1. SIMD intrinsic function

2. Changing data type

3. Double Buffering

4. Parallel Read

5. Distributing Job to idling PPE

6. SIMD on loop counter

7. Loop unrolling

SIMD intrinsic function

• Addition, subtraction, multiplication, etc.

• Operates on 128-bit registers

• Data type: double (64 bits)

• Speed up: 2X
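The 2X figure follows from how many elements fit in one 128-bit register. The sketch below is plain arithmetic, not Cell SDK code, and it gives an idealized upper bound that ignores memory traffic and other overheads.

```python
REGISTER_BITS = 128

def lanes(element_bits):
    """Number of elements one SIMD instruction processes at once."""
    return REGISTER_BITS // element_bits

double_lanes = lanes(64)   # 2 doubles per register
single_lanes = lanes(32)   # 4 values per register for 32-bit types
```

Moving from a 64-bit to a 32-bit element type therefore doubles the ideal work done per instruction.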


Changing Data Type to int

• Precision is not important

• Major speed up comes from SIMD intrinsics

• Data type: int (32 bits)

• Total speed up: 4X

• Computation time = 71 sec

[Figure: Running Time of Parallel PS3, with SIMD Version. x-axis: No. of SPU used (1 to 6); y-axis: Sec; series: Parallel version, Parallel + SIMD version]

Changing Data Type to float

• SPE is specialized for high precision computation

• No intrinsic for int data type at all

• Data type: float (32 bits)

• Saves data conversion time

• Speed up by 30%

• Computation time = 49 sec

[Figure: Running Time of Parallel, with SIMD, float input PS3 version. x-axis: No. of SPU used (1 to 6); y-axis: Sec; series: Parallel + SIMD + int, Parallel + SIMD + float]

Double buffering

• Saves communication time

• MFC and SPU

• 2 buffers:
  – Prefetching
  – Processing

• Not heavy in communication

• Minor speed up
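The buffer rotation behind double buffering can be sketched as below. This is only a model of the ordering: plain Python runs the fetch and the computation one after the other, whereas on the SPE the fetch would be an asynchronous DMA transfer handled by the MFC while the SPU computes. The `fetch` and `compute` callables are hypothetical placeholders.

```python
def process_double_buffered(blocks, fetch, compute):
    """Process blocks with two buffers: one being filled, one being computed on."""
    results = []
    if not blocks:
        return results
    buffers = [None, None]
    buffers[0] = fetch(blocks[0])               # prime the first buffer
    for i in range(len(blocks)):
        current = i % 2
        following = 1 - current
        if i + 1 < len(blocks):
            # Start loading the next block before working on the current one.
            buffers[following] = fetch(blocks[i + 1])
        results.append(compute(buffers[current]))
    return results

# Toy fetch and compute functions for illustration only.
totals = process_double_buffered([1, 2, 3], lambda b: [b, b], sum)
```

The benefit depends on how much transfer time there is to hide, which is consistent with the minor gain reported for this computation-bound program.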


Parallel Reading for All Files

• Read “Target” and “Repository” concurrently

• Share file reading job among SPEs

• No improvement as predicted; it was even slower

• Reason: the hard disk cannot handle concurrent requests

• Failed Attempt


Distributing Job to Idling PPE

• PPE current job: read files, distribute files, collect result

• Use stall time to do some computation

• Relatively low computation power of PPE

• No significant improvement

• Increase program complexity

• Abandoned this approach


Applying SIMD for Loop Counter

Major computation power is consumed in:

1. Initialize i = 0, diff = (0, 0, 0, 0).
2. While i < (number of float numbers in a file) / (number of floats packed in a register):
   A. temp = SIMD subtraction on vector i in the “Target” and “Repository” files.
   B. diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = i + 1.
4. Loop back to 2.
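The loop above can be modeled in plain Python by treating a 4-element list as one 128-bit register of floats. This is not SPU intrinsic code; the helper names are hypothetical, and the final step that adds the four lanes of diff together is an assumption, since the slide does not show it.

```python
LANES = 4  # floats packed in one 128-bit register

def simd_sub(a, b):
    return [x - y for x, y in zip(a, b)]

def simd_mul(a, b):
    return [x * y for x, y in zip(a, b)]

def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]

def squared_difference(target, repository):
    """Accumulate per-lane squared differences, then add the lanes together."""
    if len(target) != len(repository) or len(target) % LANES != 0:
        raise ValueError("vectors must have equal length, a multiple of LANES")
    diff = [0.0] * LANES
    i = 0
    while i < len(target) // LANES:
        t = target[i * LANES:(i + 1) * LANES]
        r = repository[i * LANES:(i + 1) * LANES]
        temp = simd_sub(t, r)
        diff = simd_add(simd_mul(temp, temp), diff)
        i += 1
    return sum(diff)  # assumed final reduction across lanes

result = squared_difference([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [0.0] * 8)
```

For a 1024-float file the loop body runs 256 times, each with one increment and one comparison of i, which is the overhead the following slides try to reduce.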


Applying SIMD for Loop Counter

• Try to optimize step 3

• Apply SIMD to the loop counter

• Addition and comparison operations are reduced by a factor of 8


Applying SIMD for Loop Counter

1. Initialize i = (0, 1, 2, 3, 4, 5, 6, 7), diff = (0, 0, 0, 0).
2. While i[0] < (number of float numbers in a file) / (number of floats packed in a register):
   temp = SIMD subtraction on vector i[0] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[1] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[2] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[3] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[4] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[5] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[6] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[7] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = SIMD addition (i, (8, 8, 8, 8, 8, 8, 8, 8)).
4. Loop back to 2.
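A plain-Python model of this eight-way version is sketched below. It keeps a scalar counter that advances by 8 instead of holding eight indices in a SIMD register, so it illustrates the reduced number of loop checks rather than the register trick itself. The helper names are hypothetical, and the final sum across lanes is an assumption.

```python
LANES = 4    # floats per 128-bit register
UNROLL = 8   # loop bodies written out per iteration, as on the slide

def accumulate(diff, target, repository, v):
    """Return diff + (t - r)^2, lane by lane, for vector number v."""
    base = v * LANES
    return [d + (target[base + k] - repository[base + k]) ** 2
            for k, d in enumerate(diff)]

def squared_difference_unrolled(target, repository):
    """Return (difference, number of loop iterations executed)."""
    n_vectors = len(target) // LANES
    if (len(target) != len(repository) or n_vectors * LANES != len(target)
            or n_vectors % UNROLL != 0):
        raise ValueError("length must match and be a multiple of LANES * UNROLL")
    diff = [0.0] * LANES
    iterations = 0
    i = 0
    while i < n_vectors:
        iterations += 1
        diff = accumulate(diff, target, repository, i)
        diff = accumulate(diff, target, repository, i + 1)
        diff = accumulate(diff, target, repository, i + 2)
        diff = accumulate(diff, target, repository, i + 3)
        diff = accumulate(diff, target, repository, i + 4)
        diff = accumulate(diff, target, repository, i + 5)
        diff = accumulate(diff, target, repository, i + 6)
        diff = accumulate(diff, target, repository, i + 7)
        i += UNROLL
    return sum(diff), iterations

target = [float(k) for k in range(1024)]
total, iterations = squared_difference_unrolled(target, [0.0] * 1024)
```

For a 1024-float file there are 256 vectors, so the loop condition and counter update run 32 times instead of 256.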


Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version

No. of SPU used            1    2    3    4    5    6
Read input time (sec)      4    5    3    4    4    4
Total elapsed time (sec)   286  146  97   75   60   51
Net elapsed time (sec)     282  141  94   71   56   47

[Figure: Running Time of Parallel, with SIMD, float input, SIMD for loop counter PS3 version. x-axis: No. of SPU used (1 to 6); y-axis: Sec; series: Parallel+SIMD+float, Parallel+SIMD+float+SIMD for i]

• Little improvement (about 4%)

• Shows the possibility of faster performance through further loop unrolling

• The best performance becomes 47 sec


Loop Unrolling

• Proved that optimizing the loop can improve performance

• Complete loop unrolling

• More obvious speed up


Result of the parallel, with SIMD, float input, loop unrolling PS3 version

No. of SPU used            1    2    3    4    5    6
Read input time (sec)      3    4    3    3    4    3
Total elapsed time (sec)   159  82   55   42   35   30
Net elapsed time (sec)     156  78   52   39   31   27
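How well this version scales with the number of SPUs can be read off the net elapsed times in the table. The sketch below computes the speed-up relative to one SPU and the parallel efficiency (speed-up divided by the number of SPUs); the function name is ours, and the timings are the ones reported above.

```python
net_elapsed = {1: 156, 2: 78, 3: 52, 4: 39, 5: 31, 6: 27}

def scaling(times):
    """Map SPU count to (speed-up over 1 SPU, parallel efficiency)."""
    t1 = times[1]
    return {n: (t1 / t, t1 / (t * n)) for n, t in times.items()}

table = scaling(net_elapsed)
```

The efficiency stays close to 1 up to 6 SPUs (about 0.96 at 6), which suggests the work is divided almost evenly and communication cost is small.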

[Figure: Running Time of Parallel, with SIMD, float input, loop unrolling PS3 version. x-axis: No. of SPU used (1 to 6); y-axis: Sec; series: Parallel+SIMD+float+SIMD for i, Parallel+SIMD+float+loop unrolling]

• 45% faster

• ultimate best performance becomes 27 sec


Conclusion of Optimization

• PC version: 633 sec

• PS3 with 1 SPU (i.e. sequential version on PS3): 1928 sec

• Final optimized version on PS3: 27 sec

• 23 times faster than the PC version
• 71 times faster than the sequential version on PS3

[Figure: Elapsed time change with different approaches applied, in a 6-SPU condition. x-axis: parallel, SIMD, float type, SIMD for i, loop unrolling; y-axis: sec; series: Elapsed time]

Future Works

• Port the whole ADVISER application to PlayStation®3

• Optimization throughout the whole application


Q&A


The End