26
A ug. 15, 2005 1 Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved. Programming and Performance Programming and Performance Evaluation of the Cell Processor Evaluation of the Cell Processor Ryuji Sakai, Seiji Maeda, Christopher Crookes, Mitsuru Shimbayashi, Katsuhisa Yano, Tadashi Nakatani, Hirokuni Yano, Shigehiro Asano, Masaya Kato, Hiroshi Nozue, Tatsunori Kanai, Tomofumi Shimada and Koichi Awazu Toshiba Corporation

HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Embed Size (px)

Citation preview

Page 1: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 1Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Programming and PerformanceProgramming and Performance

Evaluation of the Cell ProcessorEvaluation of the Cell Processor

Ryuji Sakai, Seiji Maeda, Christopher Crookes,

Mitsuru Shimbayashi, Katsuhisa Yano, Tadashi Nakatani,

Hirokuni Yano, Shigehiro Asano, Masaya Kato, Hiroshi Nozue,

Tatsunori Kanai, Tomofumi Shimada and Koichi Awazu

Toshiba Corporation

Page 2: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 2Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

OutlineOutline

Cell ProcessorCell Processor

SPE SchedulingSPE Scheduling

Programming Model

Real-time Resource Scheduler

Memory Bandwidth Reservation

SPE Programming (H.264 encoder as an example)SPE Programming (H.264 encoder as an example)

SPE Programming Steps

Parallel Execution Model for Sub Modules

Evaluation of the H.264 Encoder

ConclusionsConclusions

Page 3: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 3Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Cell ProcessorCell Processor

Heterogeneous Multi-core processorHeterogeneous Multi-core processor1 Power Processor Element (PPE)

8 Synergistic Processor Elements (SPEs)

2ch XDR DRAM Memory

Super Companion ChipXDR DRAM

Cell Processor

EIB

PPE PPUL2 L1 MIC IOC

SPE

SPU

LS

MFC

SPE

SPU

LS

MFC

SPE

SPU

LS

MFC

SPE

SPU

LS

MFC

SPE

SPU

LS

MFC

SPE

SPU

LS

MFC

SPE

SPU

LS

MFC

SPE

SPU

LS

MFC

Page 4: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 4Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Multi-layered Programming ModelMulti-layered Programming Model

SP

E S

chedulin

gS

PE

Pro

gra

mm

ing

SPE Thread Layer

SPE 0

SPE 1

SPE 2

PPE/SPE Module Layer

SPE Overlay Layer

Audio

DecoderStream

DemuxVideo

Decoder

Video

Proc.

AV

RendererTuner

SPE Thread

SPE Thread

SPE Thread

SPE Module

Function A Function B Function C Function D

SPE Thread

FunctionLocal

Storage

XDR

DRAM

Page 5: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 5Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Multi-layered Programming ModelMulti-layered Programming Model

SPE moduleSPE moduleDesigned for parts of stream processing model

ex. viewing digital TV, trans-coding digital contents, …

Multi SPE threads can be used to utilize multi SPEs

SPE threadSPE threadAn execution context running on an SPE

Context save and restore can be applicable

Context includes SPE registers, LS, outstanding DMA requests

SPE overlaySPE overlayText and/or data on LS can be replaced withoutintervention of PPE

Symbol resolution is supported

Page 6: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 6Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Real-time Resource SchedulerReal-time Resource Scheduler

Extends existing OS running on PPEExtends existing OS running on PPE

Existing OS only needs to implement glue codes

to manage specific features of Cell Processor

Schedules Schedules ““ResourcesResources”” with 2 sched. types with 2 sched. types

Resources: SPEs and Memory bandwidth

Sched. types: Time-shared and Dedicated

Ensures real-time constraints of all SPEEnsures real-time constraints of all SPE

modulesmodules

Detects whether scheduling is feasible or not

before accepting new SPE modules

Page 7: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 7Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Issue: Performance StabilityIssue: Performance Stability

To stabilize performance, processing rate and dataTo stabilize performance, processing rate and datarate must be adjustedrate must be adjusted

Processor performance increases continuouslyProcessor performance increases continuously4-way SIMD processing requires 4 times wider bandwidth

Multi-core processor multiplies access rate by the numberof cores

Cell processor tackles this issueCell processor tackles this issueSPE has 128 registers and high speed local storage

Bandwidth of 2-ch XDR DRAM is up to 25.6 GB/s

Leveling memory access to XDR DRAM is a key forLeveling memory access to XDR DRAM is a key forperformance stabilityperformance stability

Peak memory access initiated by 8 SPEs exceeds evenwide bandwidth accomplished by XDR DRAM

Page 8: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 8Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

AnsAns: Memory Bandwidth Reservation: Memory Bandwidth Reservation

Resource Scheduler treats memory bandwidth as aResource Scheduler treats memory bandwidth as atime-shared resourcetime-shared resource

Resources: SPEs and Memory bandwidth

Bandwidth is only reserved while SPE thread is running

Memory bandwidth can be specified as one ofMemory bandwidth can be specified as one ofscheduling parameters of SPE modulescheduling parameters of SPE module

Scheduling parameters:

The number of SPE threads, Processing ratio of SPEs,Memory bandwidth, Precedence constraints between SPEmodules

Real-time Resource Scheduler ensures that totalReal-time Resource Scheduler ensures that totalbandwidth never exceeds max capacity of XDR DRAMbandwidth never exceeds max capacity of XDR DRAM

Keeping other constraints such as precedence constraints

Page 9: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 9Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

ex. Trans-coder from MPEG-2 to H.264ex. Trans-coder from MPEG-2 to H.264

2 trans-coders are running on a Cell Processor2 trans-coders are running on a Cell Processor

Focus on H.264 encoder in following sheets

Trans-coder (MPEG-2 to H.264)

MPEG-2

Decoder

Demux

AAC

Decoder

AC3

Encoder

HDD

Writer

HDD

Reader

H.264

Encoder

Mux

H.264 Encoder: 3 SPE threads

Page 10: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 10Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Scheduling ImageScheduling Image

100%

t

Inte

rfere

SPE 0

SPE 1

SPE 2

SPE 3

SPE 4

SPE 5

Mem

ory

Bandw

idth

Thr

BW of

H.264 enc 1

a) without BW reservation b) with BW reservation

Thread 1

Thread 2

Thread 3

H.264 enc 1

Thread 1

Thread 2

Thread 3

H.264 enc 2

Thr

ThrThr Thr

BW ofH.264 enc 2

BW of

H.264 enc 1

BW ofH.264 enc 2

Thread 3

Thread 1

Thread 2

H.264 enc 1

Thread 1

Thread 2

Thread 3

H.264 enc 2

enough

room

2 encodersstart at the same

time

BW shortagedeclines performance

Performance isstabilized with reserved

BW

ThrThr

Thr ThrThr

ThrThr

ThrThr Thr

ThrThr

ThrThr Thr

H.264 enc 2runs after H.264 enc 1

All threads arescheduled properly

Page 11: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 11Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Evaluation on Cell ProcessorEvaluation on Cell Processor

Performance monitoring features arePerformance monitoring features areimplemented in Cell Processorimplemented in Cell Processor

Hardware counters and event flags can be monitoredand recorded sequentially

Performance monitoring tool is already availablePerformance monitoring tool is already availableThe tool is very flexible with configuration file

SPE behaviors and utilization of memorySPE behaviors and utilization of memorybandwidth had been monitoredbandwidth had been monitored

Utilization of memory bandwidth of H.264 encodershad been monitored

SPE behaviors are presented later

Page 12: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 12Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Utilization of Memory BandwidthUtilization of Memory Bandwidth

H.264 enc 2H.264 enc 1H.264 enc 2H.264 enc 1

H.264 enc 2

H.264 enc 1

H.264 enc 2

H.264 enc 1

with BW reservation

without BW

reservation

Slower performancecauses wider access

Page 13: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 13Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Performance of H.264 EncodersPerformance of H.264 Encoders

With bandwidth reservation,performance scales in a linear fashion

Without bandwidthreservation, performance

declines by 50%

Page 14: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 14Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

SP

E P

rogra

mm

ing

SPE 0

SPE 1

SPE 2

Multi-layered Programming ModelMulti-layered Programming Model

SPE Thread Layer

PPE/SPE Module Layer

SPE Overlay Layer

H.264 Encoder

SPE Thread

SPE Thread

SPE Thread

Function A Function B Function C Function D

SPE Thread

FunctionLocal

Storage

XDR

DRAM

MPEG-2

Decoder

Demux

AAC

Decoder

AC3

Encoder

HDD

Writer

HDD

ReaderH.264

Encoder

Mux

Page 15: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 15Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

OutlineOutline

Cell ProcessorCell Processor

SPE SchedulingSPE Scheduling

Programming Model

Real-time Resource Scheduler

Memory Bandwidth Reservation

SPE ProgrammingSPE Programming

SPE Programming Steps

Parallel Execution Model for Sub Modules

Evaluation of the H.264 Encoder

ConclusionsConclusions

Page 16: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 16Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

SPE Programming StepsSPE Programming Steps

Partition program code into sub modulesPartition program code into sub modulesNot only for parallelization but also for overlay

Implement sub modules by using the extended CImplement sub modules by using the extended CNo assembler programming needed

Coding in high level language is important forperformance tuning step

Since sophisticated algorithm easily applied, higherperformance than assembler can be achieved

Assign sub modules to the SPE threadsAssign sub modules to the SPE threadsAlso define program overlay order

Configure parallel execution scheme

Performance tuningPerformance tuningEffectively repeats the cycle of evaluation, refinement anddebugging for performance tuning

Page 17: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 17Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Tips for Implementing Sub ModulesTips for Implementing Sub Modules

DMA latency hidingDMA latency hidingReorder the program segment execution

Double buffer or execution queue techniquesare useful

Effective use of 128 registersEffective use of 128 registersUse local (auto) variables and inline functions

Loop unrolling in compiler optimization realizethat local defined array can be allocated toregister

Reduce the penalty of branch missReduce the penalty of branch missEliminate conditional branch by using logicaloperation or select instruction

Page 18: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 18Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Parallel Execution Model for Sub ModuleParallel Execution Model for Sub Module

Program

SPE 0

SPE 1

SPE 2

Data

Data

SPE 0

SPE 1

SPE 2

&Data

Data

Data ParallelPipelining

Page 19: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 19Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Structure of H.264 EncoderStructure of H.264 Encoder

CABAC

ME

IntraPred

T/Qー

Filter

IQ/IT

MC

GenRef

Input Video Data

Reference Frame

Output Encoded Bits

Motion Estimation

Motion Compensation

Transform/Quantization

Intra Prediction

Context-based

Adaptive Binary

Arithmetic Coding

Inverse Transform/

Inverse Quantization

Generate Reference

Deblocking Filter

Very fine grain

data dependency,

100 – 1000 cycles

Intra Prediction

predicts pixel data

based on surroundings

Page 20: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 20Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Partition the H.264 Encoder CodePartition the H.264 Encoder Code

CABAC

ME

IntraPredT/Q

Filter

IQ/IT+

MC

GenRef

ME1 ME2 ME3

Input Video Data

Reference Frame

Coding module

was composed

from multiple

encoder functions

ME module was

divided into 3

sub modules to

give priority to

the execution

speed

Output Encoded Bits

Motion Estimation

is the hotspot of

the H.264 encoder

Page 21: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 21Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Assign Sub Modules to the SPE ThreadsAssign Sub Modules to the SPE Threads

CABAC

ME

Filter

MC

GenRef

ME1 ME2 ME3

Coding

Input Video Data

Reference Frame

Output Encoded Bits

Thread 1

Thread 2

Thread 3

Page 22: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 22Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Overlay by Frame PartitioningOverlay by Frame Partitioning

GenRef

Filter

spu_overlayspu_overlay spu_overlay

CABACCoding

MCThread 1 Thread 2 Thread 3

ME3ME2

ME1

Video frames are partitioned into processing unitsfor pipelining and program code overlay.

H.264 bit stream

CABAC generated encoded bitstream

Thread 3 waits until the 2nd

partition is available, because

Thread 3 refers to the top of the

next partition

Page 23: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 23Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

3 Thread Execution Image3 Thread Execution Image

Thread 3

Thread 1

Thread 2

Time

12345

1 Frame

divided into 5

partitions

11 11 22 22 33 33 44 44 55 5555 55

33 3333 44 4444 55 555522 2222

44 44 44 55 55 5533 33 33

Processing

elements for

1 Frame

11 11 22 22 33 33 44 44 55 55 11 11

11 1111 22 2222 33 3333 44 4444 55 5555 11 1111 22 2222 33 3333

11 11 11 22 22 22 33 33 33 44 44 44 55 55 55 11 11 11 22 22 22 33 33 44 44 4433

Reference

frame

dependency

Page 24: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 24Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Evaluation of the H.264 EncoderEvaluation of the H.264 Encoder

SPE Core Cycle UtilizationSPE Core Cycle Utilization

ME3ME2

ME1Thread 1

spu_ov erla

y

100

80

60

40

20

05 10 15 20 25 30 35

time [ms]

SP

E C

ore

Cycle

Ratio [

%]

branch

miss penalty

DMA

channel

stallsingle commit

dual commit

1 Frame

1 Partition

ME1 ME3

ME2

ME1 ME3

ME2 ME2 ME2 ME2

ME1 ME1 ME1ME3 ME3 ME3

Page 25: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 25Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

Evaluation of the H.264 EncoderEvaluation of the H.264 Encoder

Memory Bandwidth by SPE ReadMemory Bandwidth by SPE Read

5 10 15 20 25 30 35

time [ms]

ME3ME2

ME1Thread 1

spu_ov erla

y

10

8

6

4

2

0

Mem

ory

Bandw

idth

[G

B/s

]

spu_ov erla

y

CABACCoding

MC

Thread 2

DMA read

from Thread 1

DMA read

from Thread 2

1 Frame

ME3 ME3 ME3 ME3 ME3

1 Partition

Page 26: HC17.S1T4 Programming and Performance … HC17.S1T4 Programming and Performance Evaluation of the Cell Processor.ppt Author: Gordon Garb Created Date: 7/27/2013 12:49:00 PM

Aug. 15, 2005 26Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.

ConclusionsConclusions

Real-time Resource Scheduler with MemoryReal-time Resource Scheduler with MemoryBandwidth Reservation can level memory accessBandwidth Reservation can level memory accessand stabilize the performance of SPE modulesand stabilize the performance of SPE modules

H.264 Encoder was successfully implemented onH.264 Encoder was successfully implemented onSPE thread and overlay programming environment,SPE thread and overlay programming environment,and evaluatedand evaluated

Effectiveness of SPE scheduling and SPEEffectiveness of SPE scheduling and SPEprogramming is evaluated on programming is evaluated on ““realreal”” Cell Processor Cell Processorwith performance monitoring toolwith performance monitoring tool

Using this infrastructure, next generationUsing this infrastructure, next generation’’s multi-s multi-stream media applications will be accomplished onstream media applications will be accomplished onthe Cell Processor.the Cell Processor.

Have fun programming on the Cell!Have fun programming on the Cell!