Status – Week 265
Victor Moya


Summary

  • ShaderEmulator
  • ShaderFetch
  • ShaderDecodeExecute
  • Communication storage classes
  • GPU design
  • PS2
  • PS3
  • Imagine

ShaderEmulator

  • Decoded information for emulation moved from ShaderEmulator to ShaderInstruction.
  • Used macros for shader instructions.
  • Functions renamed.
  • Not tested yet.
  • Shader assembler?

Shader Box Diagram

[Figure: box diagram. Boxes: COMMAND, SHADERFETCH, SHADERDECODEEXECUTE, CONSUMER. Signals: ShaderState, ShaderCommand, ShaderInstruction, ShaderPC, ShaderDecState, ShaderOutput, ConsumerState.]

ShaderFetch

  • Parameters:
    – numThreads: Number of threads and input buffers.
    – numActiveThreads: Number of threads.
    – issueRate: Max. instructions fetched per cycle.
    – retireRate: Max. newPC inputs per signal???

ShaderFetch Signals

  • Input:
    – ShaderCommand:
      • NEW_INPUT: Inputs to the shader.
      • LOAD_PROGRAM: Load a new Shader Program.
      • LOAD_PARAMETERS: Load new parameter values.
    – ShaderNewPC:
      • Thread flow changes.
      • End of thread (EXIT).
    – ShaderDecodeState:
      • Decoder/Execute ready/busy state.
    – ConsumerState:
      • Consumer state: ready/busy to receive shader outputs.

ShaderFetch Signals

  • Output:
    – ShaderState:
      • Shader ready to receive new inputs.
      • Shader ready to receive new program or new parameters (all previous inputs must have finished).
    – ShaderInstruction:
      • Shader instructions fetched.
    – ShaderOutput:
      • Shader output (an array of QuadFloats) sent to the consumer.

ShaderFetch Clock

  • Read from ShaderCommand:
    – Panic if Shader not ready for command.
    – NEW_INPUT:
      • Get free thread or free input buffer.
      • Reset thread state.
    – LOAD_PROGRAM:
      • Stores new program in ShaderEmulator.
    – LOAD_PARAMETERS:
      • Stores new parameters in ShaderEmulator.
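
The command handling above could look roughly like the following C++ sketch. Everything here is illustrative: the names ThreadSlot, FetchCommandSketch and onCommand are hypothetical, the two counters only stand in for storing data in the ShaderEmulator, and assert plays the role of the "panic".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Command kinds named after the ShaderCommand values in the slides.
enum class CommandKind { NEW_INPUT, LOAD_PROGRAM, LOAD_PARAMETERS };

// Hypothetical per-thread state kept by the fetch box.
struct ThreadSlot {
    bool free = true;
    bool finished = false;
    std::uint32_t pc = 0;
    std::uint32_t instructionCount = 0;
};

struct FetchCommandSketch {
    std::vector<ThreadSlot> threads;
    int programsLoaded = 0;   // stands in for "store program in ShaderEmulator"
    int parameterLoads = 0;   // stands in for "store parameters in ShaderEmulator"

    explicit FetchCommandSketch(std::size_t numThreads) : threads(numThreads) {}

    // True when no thread is still working on an input.
    bool allInputsFinished() const {
        for (const ThreadSlot& t : threads)
            if (!t.free && !t.finished) return false;
        return true;
    }

    // Returns the thread that took the input, or -1 for the load commands.
    int onCommand(CommandKind kind) {
        switch (kind) {
        case CommandKind::NEW_INPUT:
            for (std::size_t i = 0; i < threads.size(); ++i) {
                if (threads[i].free) {
                    threads[i] = ThreadSlot{};   // reset thread state
                    threads[i].free = false;
                    return static_cast<int>(i);
                }
            }
            assert(false && "panic: shader not ready for NEW_INPUT");
            return -1;
        case CommandKind::LOAD_PROGRAM:
            // All previous inputs must have finished before a new program.
            assert(allInputsFinished() && "panic: shader not ready for LOAD_PROGRAM");
            ++programsLoaded;
            return -1;
        case CommandKind::LOAD_PARAMETERS:
            assert(allInputsFinished() && "panic: shader not ready for LOAD_PARAMETERS");
            ++parameterLoads;
            return -1;
        }
        return -1;
    }
};
```

The sketch only hands out free threads; treating finished threads as output buffers is left out.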

ShaderFetch Clock

  • Read ShaderNewPC:
    – N reads.
    – NEW_PC:
      • Update the thread PC.
    – END_THREAD:
      • Mark thread as finished.
  • Read ShaderDecodeState:
    – If decoder busy:
      • Update Shader state.
      • Write ShaderState.

ShaderFetch Clock

  • Fetch new instructions:
    – Fetch N instructions.
    – Check if thread is ready (not finished, not free).
    – Fetch instruction from ShaderEmulator.
    – Write ShaderInstruction.
    – Update thread PC (+1).
    – Update thread instruction counter.
      • If crosses instruction execution limit: Finish thread.
    – Update nextThread pointer (Round Robin).
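
A minimal C++ sketch of the fetch step listed above, under stated assumptions: the type and member names (FetchThread, Fetched, FetchLoopSketch, clockFetch, instructionLimit) are hypothetical, a fetched instruction is represented only by its thread and PC, and a thread is taken to "cross" the limit once its counter reaches it. A thread that is not ready still uses up its slot, as in pure Round Robin.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FetchThread {
    bool free = true;
    bool finished = false;
    std::uint32_t pc = 0;
    std::uint32_t instructionCount = 0;
};

// Stands in for one entry written to the ShaderInstruction signal.
struct Fetched {
    std::size_t thread;
    std::uint32_t pc;
};

struct FetchLoopSketch {
    std::vector<FetchThread> threads;
    std::size_t nextThread = 0;
    std::size_t issueRate;            // max. instructions fetched per cycle
    std::uint32_t instructionLimit;   // assumed per-thread execution limit

    FetchLoopSketch(std::size_t numThreads, std::size_t rate, std::uint32_t limit)
        : threads(numThreads), issueRate(rate), instructionLimit(limit) {
        assert(numThreads > 0);
    }

    // One cycle of fetch: visit issueRate threads in round-robin order.
    std::vector<Fetched> clockFetch() {
        std::vector<Fetched> out;
        for (std::size_t n = 0; n < issueRate; ++n) {
            FetchThread& t = threads[nextThread];
            if (!t.free && !t.finished) {          // thread is ready
                out.push_back({nextThread, t.pc}); // "write ShaderInstruction"
                t.pc += 1;
                t.instructionCount += 1;
                if (t.instructionCount >= instructionLimit)
                    t.finished = true;             // limit reached: finish thread
            }
            nextThread = (nextThread + 1) % threads.size();
        }
        return out;
    }
};
```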

ShaderFetch Clock

  • Check shader output state:
    – Output transmission in progress:
      • Send more data. Write to ShaderOutput.
      • Update state.
    – Output to send:
      • Check consumer state.
      • Start sending data. Write to ShaderOutput.
      • Set state as send in progress.
    – NOT IMPLEMENTED YET.
  • Update Shader state:
    – Write ShaderState.

ShaderFetch Comments

  • Finished threads act as buffers for output.
    – Add output buffers?
  • Consumer/Output protocol.
    – Output sizes.
    – Output latency.
    – Multicycle transmission.
  • Order of operations in clock().

ShaderFetch Comments

  • ‘Scheduling’ policy for threads?
    – Pure Round Robin:
      • Free/Finished threads fetch NOPs.
    – Round Robin with priority:
      • Free/Finished threads are skipped.
    – Other?
  • Memory:
    – THREAD_BLOCK/THREAD_RESUME from Shader Memory box or from ShaderDecodeExecute box.
    – New thread state: blocked.
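
The two Round Robin policies above differ only in what happens when the next thread is not ready. A small C++ sketch of both, with hypothetical names (ThreadState, pickPureRoundRobin, pickSkippingRoundRobin) and -1 used to mean "fetch a NOP this slot":

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ThreadState { Free, Ready, Finished };

// Pure Round Robin: always advance to the next slot.
// A free or finished thread yields a NOP (-1) for that slot.
int pickPureRoundRobin(const std::vector<ThreadState>& threads, std::size_t& next) {
    assert(!threads.empty());
    std::size_t chosen = next;
    next = (next + 1) % threads.size();
    if (threads[chosen] != ThreadState::Ready) return -1;
    return static_cast<int>(chosen);
}

// Round Robin with priority: free and finished threads are skipped,
// so a NOP (-1) is only returned when no thread at all is ready.
int pickSkippingRoundRobin(const std::vector<ThreadState>& threads, std::size_t& next) {
    assert(!threads.empty());
    for (std::size_t i = 0; i < threads.size(); ++i) {
        std::size_t candidate = (next + i) % threads.size();
        if (threads[candidate] == ThreadState::Ready) {
            next = (candidate + 1) % threads.size();
            return static_cast<int>(candidate);
        }
    }
    return -1;
}
```

A blocked state would fit in as one more ThreadState value that both functions treat as not ready.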

ShaderDecodeExecute

  • Parameters:
    – numThreads: ShaderEmulator threads.
    – issueRate: Max. number of instructions received per cycle.
    – retireRate: Max. number of NewPC signals per cycle.

ShaderDecodeExecute

  • Signals:
    – Input:
      • ShaderInstruction:
        – New instructions from Fetch.
      • ShaderExec:
        – Instructions that end their execution.

ShaderDecodeExecute

    – Output:
      • ShaderDecodeState:
        – Decoder ready/busy.
      • ShaderNewPC:
        – NEW_PC: PC changes from branch, call and ret instructions. Instruction refetch? -> Thread blocked?
        – END_THREAD: EXIT instruction.
      • ShaderExec:
        – Instructions that start their execution.

ShaderDecodeExecute Clock

  • Read ShaderExec:
    – N finished instructions.
    – Update dependence tables.
    – Free resource tables (limited UFs?).

ShaderDecodeExecute Clock

  • If decode is blocked:
    – Check resources for the blocked instruction.
    – If resources available:
      • Execute instruction in ShaderEmulator.
      • Update dependence table.
      • Read exec. latency table.
      • Write ShaderExec.
      • Unblock decoder.
    – Write ShaderDecodeState.

ShaderDecodeExecute

  • Read ShaderInstruction:
    – N instructions.
    – Check dependences.
      • If instruction has dependences:
        – Throw away instruction.
        – Send NewPC with the same PC.
    – Check resource tables.
      • If there are no resources for the instruction:
        – Block decoder.
        – Write ShaderDecodeState.
        – Exit.
    – Execute instruction in ShaderEmulator.
    – Read instruction exec. latency table.
    – Write ShaderInstruction.
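
The dependence check with refetch can be sketched in C++ as below. This is an assumption-heavy illustration, not the simulator's code: the names (DecInstr, NewPC, DecodeSketch, decode, retire) are hypothetical, each instruction is reduced to one source and one destination register, the dependence table is a per-thread "write pending" bit per register, and the resource check is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified instruction: one source and one destination register.
struct DecInstr {
    std::size_t thread;
    std::uint32_t pc;
    std::size_t src;
    std::size_t dst;
};

struct NewPC {
    std::size_t thread;
    std::uint32_t pc;
};

struct DecodeSketch {
    // pending[thread][reg] is true while a write to reg is still in flight.
    std::vector<std::vector<bool> > pending;
    std::vector<NewPC> newPCs;       // stands in for the ShaderNewPC signal
    std::vector<DecInstr> started;   // stands in for the ShaderExec signal

    DecodeSketch(std::size_t numThreads, std::size_t numRegs)
        : pending(numThreads, std::vector<bool>(numRegs, false)) {}

    void decode(const DecInstr& in) {
        assert(in.thread < pending.size());
        std::vector<bool>& regs = pending[in.thread];
        assert(in.src < regs.size() && in.dst < regs.size());
        if (regs[in.src] || regs[in.dst]) {
            // Dependence found: throw the instruction away and
            // ask fetch to send it again (NewPC with the same PC).
            NewPC refetch = {in.thread, in.pc};
            newPCs.push_back(refetch);
            return;
        }
        regs[in.dst] = true;     // update dependence table
        started.push_back(in);   // instruction starts its execution
    }

    // Called for each finished instruction read from ShaderExec.
    void retire(const DecInstr& in) {
        pending[in.thread][in.dst] = false;
    }
};
```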

ShaderDecodeExecute Comments

  • Additional features to implement:
    – Out-of-order support.
    – Instruction variable latency support.
      • Simulate UF latency:
        – Ex. DIV/SQRT unit.
    – Load/Store support.
    – Add Memory box:
      • Variable latency.
      • Thread block at first level memory miss.
        – Signal THREAD_BLOCK (to fetch and decode).
      • Thread wakeup when read ends.
        – Signal THREAD_RESUME (to fetch and decode).

ShaderDecodeExecute Comments

  • Resource hazards:
    – None:
      • All UFs are fully pipelined with 1 cycle input latency.
    – Supported:
      • Block decoder.
        – Other threads could use free UFs.
      • Block thread.
        – Signal to fetch (ShaderNewPC): BLOCK_THREAD.
        – Needs a RESUME_THREAD signal?
        – Fetch hardware implementation? (not pure Round Robin?).

ShaderDecodeExecute Comments

  • Dependence hazards:
    – None:
      • Same latency for all instructions.
      • Enough threads to fill instruction latency.
      • ‘Empty’ (free or already finished) threads fetch NOPs (pure Round Robin).
      • Hardware is less complex.

ShaderDecodeExecute Comments

  • Dependence hazards:
    – Supported:
      • Block decoder:
        – With a multithreaded Shader it is a waste.
      • Block thread:
        – Send a BLOCK_THREAD signal to fetch.
        – Needs a RESUME_THREAD signal.
      • Ignore:
        – Just ignore current instruction.
        – Send REFETCH (ShaderNewPC) with old PC.
        – Instruction limit counter must not be updated!!!

Communication storage

  • Communication between boxes:
    – ShaderExecInstruction
    – ShaderCommand
    – ShaderDecodeCommand
  • Dynamic: creation/destruction.
  • Class model or struct model?
  • Inherit from a ‘dynamic data’ class.
    – Modified new/delete implementation.
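
One way a ‘dynamic data’ base class with modified new/delete could work is to recycle fixed-size blocks instead of going to the heap for every signal object. The C++ sketch below is only one possible reading of that idea: DynamicData here, its 64-byte block size, the free list, reusedCount and the ShaderCommandSketch payload are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical base class for objects sent between boxes.
class DynamicData {
public:
    virtual ~DynamicData() {}

    static void* operator new(std::size_t size) {
        if (size <= blockSize() && !freeList().empty()) {
            void* p = freeList().back();   // reuse a released block
            freeList().pop_back();
            ++reused();
            return p;
        }
        // Small objects always get a full block, so any released
        // block can later hold any other small object.
        return ::operator new(size <= blockSize() ? blockSize() : size);
    }

    static void operator delete(void* p, std::size_t size) {
        if (p == 0) return;
        if (size <= blockSize()) freeList().push_back(p);  // keep for reuse
        else ::operator delete(p);
    }

    static std::size_t reusedCount() { return reused(); }

private:
    static std::size_t blockSize() { return 64; }  // assumed block size
    // Blocks left in the list are never returned to the heap in this sketch.
    static std::vector<void*>& freeList() {
        static std::vector<void*> list;
        return list;
    }
    static std::size_t& reused() {
        static std::size_t count = 0;
        return count;
    }
};

// Hypothetical signal payload; the single field is illustrative only.
class ShaderCommandSketch : public DynamicData {
public:
    explicit ShaderCommandSketch(int k) : kind(k) {}
    int kind;
};
```

With this layout, `delete` on a DynamicData pointer goes through the virtual destructor and then the class's own operator delete, so derived signal classes get the pooling without extra code. Whether a class or a struct model is used does not change the mechanism.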

GPU design

  • Target architecture?
    – NV30
    – DX9
    – DX10
    – OpenGL2
    – PS3
    – Imagine
    – Vector
    – Multithreaded
  • Are we really going for it?
  • Do we really know what we are doing?

PS2

  • I got the EE, VU and GS programming manuals :).

PS3

  • Sony patent.
  • I haven’t read it yet.

Imagine

  • ‘Computer Graphics on a Stream Architecture’, John Douglas Owens, PhD dissertation.
  • Not read yet either.