CSC 4250 Computer Architectures October 20, 2006 Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation




One More Example on Tomasulo’s Algorithm

L.D F0,0(R0)

ADD.D F0,F0,F2

MUL.D F0,F0,F4

ADD.D F0,F0,F2

MUL.D F0,F0,F4

S.D F0,0(R0)

ADD.D F0,F4,F2


IBM 360 Assembly Language

Only two operands. Advantage? Disadvantage? Example:

L.D F0,0(R0)

ADD.D F0,F2

MUL.D F0,F4

ADD.D F0,F2

MUL.D F0,F4

S.D F0,0(R0)

… …
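One way to approach the question above: a two-operand instruction overwrites its first source, so the encoding needs one fewer register field, but code that wants a separate destination pays for an extra copy. The sketch below is a hypothetical lowering from the three-operand form to the two-operand form. The function name is made up for illustration, and `MOV.D` stands in for a register copy in the slide's MIPS-style notation; it is not actual IBM 360 syntax.

```python
# Hypothetical helper: lower "op dest,src1,src2" to two-operand form,
# where the destination doubles as the first source.
COMMUTATIVE = {"ADD.D", "MUL.D"}


def to_two_operand(op, dest, src1, src2):
    """Return the two-operand instruction(s) equivalent to op dest,src1,src2."""
    if dest == src1:
        # Destination already holds the first source: one instruction.
        return [f"{op} {dest},{src2}"]
    if dest == src2:
        if op in COMMUTATIVE:
            # Swap the sources so the destination is the first operand.
            return [f"{op} {dest},{src1}"]
        raise ValueError("non-commutative op needs a scratch register")
    # Separate destination: copy first, then operate (one extra instruction).
    return [f"MOV.D {dest},{src1}", f"{op} {dest},{src2}"]


if __name__ == "__main__":
    print(to_two_operand("ADD.D", "F0", "F0", "F2"))
    print(to_two_operand("ADD.D", "F0", "F4", "F2"))
```

Under these assumptions, `ADD.D F0,F0,F2` maps directly to `ADD.D F0,F2`, while `ADD.D F0,F4,F2` needs a copy first.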

Figure 0.1

Instruction       Issue   Execute   Write Result
L.D F0,0(R0)      √
ADD.D F0,F0,F2
MUL.D F0,F0,F4
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     No
Add2     No
Add3     No
Mult1    No
Mult2    No
Store1   No

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Load1

Figure 0.2

Instruction       Issue   Execute   Write Result
L.D F0,0(R0)      √       √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     No
Add3     No
Mult1    No
Mult2    No
Store1   No

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Add1

Figure 0.3

Instruction       Issue   Execute   Write Result
L.D F0,0(R0)      √       √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     No
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    No
Store1   No

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Mult1

Figure 0.4

Instruction       Issue   Execute   Write Result
L.D F0,0(R0)      √       √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    No
Store1   No

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Add2

Figure 0.5

Instruction       Issue   Execute   Write Result
L.D F0,0(R0)      √       √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    Yes    Mult              Reg[F4]   Add2
Store1   No

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Mult2

Figure 0.6

Instruction       Issue   Execute   Write Result
L.D F0,0(R0)      √       √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
S.D F0,0(R0)      √
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    Yes    Mult              Reg[F4]   Add2
Store1   Yes    Store                               Mult2   0+Reg[R0]

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Mult2

Figure 0.7

Instruction       Issue   Execute   Write Result
L.D F0,0(R0)      √       √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
S.D F0,0(R0)      √
ADD.D F0,F4,F2    √

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     Yes    Add     Reg[F4]   Reg[F2]
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    Yes    Mult              Reg[F4]   Add2
Store1   Yes    Store                               Mult2   0+Reg[R0]

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Add3

Figure 0.8

Instruction       Issue   Execute   Write Result
L.D F0,0(R0)      √       √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
S.D F0,0(R0)      √
ADD.D F0,F4,F2    √       √         √

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    Yes    Mult              Reg[F4]   Add2
Store1   Yes    Store                               Mult2   0+Reg[R0]

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi
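The issue steps in Figures 0.1 through 0.7 can be reproduced with a short sketch. It models only the issue stage, and assumes that no instruction completes while the seven instructions issue, so no station is freed. Each instruction takes a free reservation station, reads each source either as a value (Vj/Vk) or as the tag of the producing station (Qj/Qk), and then claims its destination in the register status table. The tuple layout and the function name are illustrative choices, not part of the slides.

```python
# Issue stage only. Station names and counts follow the figures.
FREE_STATIONS = {
    "Load": ["Load1"],
    "Add": ["Add1", "Add2", "Add3"],
    "Mult": ["Mult1", "Mult2"],
    "Store": ["Store1"],
}

# (station kind, destination, first source, second source, address)
PROGRAM = [
    ("Load", "F0", None, None, "0+Reg[R0]"),   # L.D   F0,0(R0)
    ("Add", "F0", "F0", "F2", None),           # ADD.D F0,F0,F2
    ("Mult", "F0", "F0", "F4", None),          # MUL.D F0,F0,F4
    ("Add", "F0", "F0", "F2", None),           # ADD.D F0,F0,F2
    ("Mult", "F0", "F0", "F4", None),          # MUL.D F0,F0,F4
    ("Store", None, None, "F0", "0+Reg[R0]"),  # S.D   F0,0(R0)
    ("Add", "F0", "F4", "F2", None),           # ADD.D F0,F4,F2
]


def issue_all(program):
    """Issue every instruction in order; return stations and F0's Qi history."""
    free = {kind: list(names) for kind, names in FREE_STATIONS.items()}
    qi = {}          # register -> station that will produce its value
    stations = {}    # station name -> reservation station fields
    f0_history = []  # Qi entry of F0 after each issue
    for kind, dest, src_j, src_k, addr in program:
        name = free[kind].pop(0)
        entry = {"Op": kind, "Vj": None, "Vk": None,
                 "Qj": None, "Qk": None, "A": addr}
        for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
            if src is None:
                continue
            if src in qi:
                entry[q_field] = qi[src]         # wait on the producing station
            else:
                entry[v_field] = f"Reg[{src}]"   # value is already available
        stations[name] = entry
        if dest is not None:
            qi[dest] = name                      # rename the destination
        f0_history.append(qi.get("F0"))
    return stations, f0_history


if __name__ == "__main__":
    stations, f0_history = issue_all(PROGRAM)
    print(f0_history)
    print(stations["Add3"])
```

The last instruction, `ADD.D F0,F4,F2`, ends up in Add3 with both operands as values, which is why it can execute and write its result ahead of the older instructions in Figure 0.8.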


Modified Loop-Based Example

Loop: L.D F0,0(R1)

MUL.D F0,F0,F2

ADD.D F0,F0,F4

S.D F0,0(R1)

DADDIU R1,R1,#−8

BNE R1,R2,Loop

Figure 0.1. One active iteration of loop

Instruction       Iteration   Issue   Execute   Write Result
L.D F0,0(R1)      1           √       √
MUL.D F0,F0,F2    1           √
ADD.D F0,F0,F4    1           √
S.D F0,0(R1)      1           √
L.D F0,0(R1)      2
MUL.D F0,F0,F2    2
ADD.D F0,F0,F4    2
S.D F0,0(R1)      2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        Reg[R1]
Load2    No
Add1     Yes    Add               Reg[F4]   Mult1
Add2     No
Mult1    Yes    Mult              Reg[F2]   Load1
Mult2    No
Store1   Yes    Store                               Add1    Reg[R1]
Store2   No

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Add1

Figure 0.2. Two active iterations of loop

Instruction       Iteration   Issue   Execute   Write Result
L.D F0,0(R1)      1           √       √
MUL.D F0,F0,F2    1           √
ADD.D F0,F0,F4    1           √
S.D F0,0(R1)      1           √
L.D F0,0(R1)      2           √       √
MUL.D F0,F0,F2    2           √
ADD.D F0,F0,F4    2           √
S.D F0,0(R1)      2           √

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        Reg[R1]
Load2    Yes    Load                                        Reg[R1]-8
Add1     Yes    Add               Reg[F4]   Mult1
Add2     Yes    Add               Reg[F4]   Mult2
Mult1    Yes    Mult              Reg[F2]   Load1
Mult2    Yes    Mult              Reg[F2]   Load2
Store1   Yes    Store                               Add1    Reg[R1]
Store2   Yes    Store                               Add2    Reg[R1]-8

      F0      F2   F4   F6   F8   F10   F12   …   F30
Qi    Add2
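The two-iteration state above can be sketched as repeated issue of the loop body, with each iteration taking its own set of stations. This is a simplified model under stated assumptions: the branch is taken, `DADDIU` and `BNE` are handled elsewhere and are not modeled, nothing completes during issue, and there are two stations of each kind as in the figure. The function name and data layout are illustrative only.

```python
# Floating-point part of the loop body:
# (station kind, destination, first source, second source)
BODY = [
    ("Load", "F0", None, None),    # L.D   F0,0(R1)
    ("Mult", "F0", "F0", "F2"),    # MUL.D F0,F0,F2
    ("Add", "F0", "F0", "F4"),     # ADD.D F0,F0,F4
    ("Store", None, None, "F0"),   # S.D   F0,0(R1)
]


def issue_iterations(n_iterations, stations_per_kind=2):
    """Issue n_iterations copies of BODY, one station set per iteration."""
    if n_iterations > stations_per_kind:
        raise RuntimeError("no free reservation station: issue would stall")
    qi = {}        # register -> station that will produce its value
    stations = {}  # station name -> reservation station fields
    for it in range(n_iterations):
        # R1 drops by 8 per iteration, so later iterations use Reg[R1]-8, ...
        address = "Reg[R1]" if it == 0 else f"Reg[R1]-{8 * it}"
        for kind, dest, src_j, src_k in BODY:
            name = f"{kind}{it + 1}"
            entry = {"Op": kind, "Vj": None, "Vk": None, "Qj": None, "Qk": None,
                     "A": address if kind in ("Load", "Store") else None}
            for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
                if src is None:
                    continue
                if src in qi:
                    entry[q_field] = qi[src]        # tag of the producer
                else:
                    entry[v_field] = f"Reg[{src}]"  # value read from the register
            stations[name] = entry
            if dest is not None:
                qi[dest] = name
    return stations, qi


if __name__ == "__main__":
    stations, qi = issue_iterations(2)
    for name in sorted(stations):
        print(name, stations[name])
    print("Qi[F0] =", qi["F0"])
```

Both iterations write F0, yet each consumer waits on the station tag from its own iteration, so the hardware effectively unrolls the loop without the compiler renaming any register.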



Dynamic Branch Prediction

Static branch prediction is covered in Appendix A.

Branch prediction buffer: a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not.

The buffer has no tags, so the prediction bit may have been placed there by another branch whose address has the same low-order bits.

Figure 3.14. A Branch Prediction Buffer

Use the 4 low-order address bits of the branch (word address) to choose a row.
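A minimal sketch of such a buffer, assuming 4-byte instructions (so the word address is the byte address shifted right by 2), a 1-bit entry per row, and every row initially predicting not taken. The class and method names are hypothetical. The example also shows how two different branches can share a row because the buffer keeps no tags.

```python
class BranchPredictionBuffer:
    """1-bit branch prediction buffer indexed by low-order word-address bits."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # One prediction bit per row; False means "predict not taken".
        self.bits = [False] * (1 << index_bits)

    def _row(self, pc):
        # Drop the 2 byte-offset bits, then keep the low-order index bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.bits[self._row(pc)]

    def update(self, pc, taken):
        # Record the most recent outcome; there is no tag to compare.
        self.bits[self._row(pc)] = taken


if __name__ == "__main__":
    bpb = BranchPredictionBuffer(index_bits=4)
    bpb.update(0x1000, True)      # branch at 0x1000 was taken
    print(bpb.predict(0x1000))    # same branch
    print(bpb.predict(0x1040))    # different branch, same row (16 words apart)
    print(bpb.predict(0x1004))    # neighbouring word, different row
```

With 4 index bits, branches whose word addresses differ by a multiple of 16 land in the same row, so the branch at `0x1040` inherits the bit written by the branch at `0x1000`.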

Nested Loops

Loop1: L.D F2,1600(R1)
       DADDIU R2,R0,#80
Loop2: L.D F0,1000(R2)
       ADD.D F0,F0,F2
       S.D F0,1000(R2)
       DADDIU R2,R2,#−8
       BNEZ R2,Loop2
       DADDIU R1,R1,#−8
       BNEZ R1,Loop1


Figure 3.7. States in 2-bit Prediction Scheme
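To see why a second bit helps on the nested loops, the sketch below counts mispredictions for the inner branch (`BNEZ R2,Loop2`) only. In each pass of the outer loop, R2 runs from 80 down to 0 in steps of 8, so the inner branch is taken 9 times and then falls through once. The number of outer passes is not given on the slide, so `n_outer` is a hypothetical parameter. The 2-bit predictor here is written as a saturating counter that starts at 0, which may differ in detail from the state diagram in Figure 3.7; both predictors start out predicting not taken.

```python
def inner_branch_outcomes(n_outer, n_inner=10):
    """Yield True (taken) or False (not taken) for each inner-branch execution."""
    for _ in range(n_outer):
        for i in range(n_inner):
            yield i < n_inner - 1   # taken on every pass except the last


def mispredicts_1bit(outcomes):
    """Count misses for a 1-bit predictor that remembers the last outcome."""
    bit = False
    misses = 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
        bit = taken
    return misses


def mispredicts_2bit(outcomes):
    """Count misses for a 2-bit saturating counter (0..3, taken if >= 2)."""
    counter = 0
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


if __name__ == "__main__":
    n_outer = 100  # hypothetical outer trip count
    print(mispredicts_1bit(inner_branch_outcomes(n_outer)))
    print(mispredicts_2bit(inner_branch_outcomes(n_outer)))
```

The 1-bit scheme misses twice per outer pass: once on the loop exit, and again on re-entry because the exit flipped the bit. After warm-up, the 2-bit scheme misses only on the exit.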


Figure 3.8. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer for SPEC89 Benchmarks


Figure 3.9. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer versus an infinite 2-bit Prediction Buffer for SPEC89