Upload
doanthuy
View
222
Download
0
Embed Size (px)
Citation preview
Aug. 15, 2005 1Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Programming and PerformanceProgramming and Performance
Evaluation of the Cell ProcessorEvaluation of the Cell Processor
Ryuji Sakai, Seiji Maeda, Christopher Crookes,
Mitsuru Shimbayashi, Katsuhisa Yano, Tadashi Nakatani,
Hirokuni Yano, Shigehiro Asano, Masaya Kato, Hiroshi Nozue,
Tatsunori Kanai, Tomofumi Shimada and Koichi Awazu
Toshiba Corporation
Aug. 15, 2005 2Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
OutlineOutline
Cell ProcessorCell Processor
SPE SchedulingSPE Scheduling
Programming Model
Real-time Resource Scheduler
Memory Bandwidth Reservation
SPE Programming (H.264 encoder as an example)SPE Programming (H.264 encoder as an example)
SPE Programming Steps
Parallel Execution Model for Sub Modules
Evaluation of the H.264 Encoder
ConclusionsConclusions
Aug. 15, 2005 3Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Cell ProcessorCell Processor
Heterogeneous Multi-core processorHeterogeneous Multi-core processor1 Power Processor Element (PPE)
8 Synergistic Processor Elements (SPEs)
2ch XDR DRAM Memory
Super Companion ChipXDR DRAM
Cell Processor
EIB
PPE PPUL2 L1 MIC IOC
SPE
SPU
LS
MFC
SPE
SPU
LS
MFC
SPE
SPU
LS
MFC
SPE
SPU
LS
MFC
SPE
SPU
LS
MFC
SPE
SPU
LS
MFC
SPE
SPU
LS
MFC
SPE
SPU
LS
MFC
Aug. 15, 2005 4Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Multi-layered Programming ModelMulti-layered Programming Model
SP
E S
chedulin
gS
PE
Pro
gra
mm
ing
SPE Thread Layer
SPE 0
SPE 1
SPE 2
PPE/SPE Module Layer
SPE Overlay Layer
Audio
DecoderStream
DemuxVideo
Decoder
Video
Proc.
AV
RendererTuner
SPE Thread
SPE Thread
SPE Thread
SPE Module
Function A Function B Function C Function D
SPE Thread
FunctionLocal
Storage
XDR
DRAM
Aug. 15, 2005 5Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Multi-layered Programming ModelMulti-layered Programming Model
SPE moduleSPE moduleDesigned for parts of stream processing model
ex. viewing digital TV, trans-coding digital contents, …
Multi SPE threads can be used to utilize multi SPEs
SPE threadSPE threadAn execution context running on an SPE
Context save and restore can be applicable
Context includes SPE registers, LS, outstanding DMA requests
SPE overlaySPE overlayText and/or data on LS can be replaced withoutintervention of PPE
Symbol resolution is supported
Aug. 15, 2005 6Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Real-time Resource SchedulerReal-time Resource Scheduler
Extends existing OS running on PPEExtends existing OS running on PPE
Existing OS only needs to implement glue codes
to manage specific features of Cell Processor
Schedules Schedules ““ResourcesResources”” with 2 sched. types with 2 sched. types
Resources: SPEs and Memory bandwidth
Sched. types: Time-shared and Dedicated
Ensures real-time constraints of all SPEEnsures real-time constraints of all SPE
modulesmodules
Detects whether scheduling is feasible or not
before accepting new SPE modules
Aug. 15, 2005 7Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Issue: Performance StabilityIssue: Performance Stability
To stabilize performance, processing rate and dataTo stabilize performance, processing rate and datarate must be adjustedrate must be adjusted
Processor performance increases continuouslyProcessor performance increases continuously4-way SIMD processing requires 4 times wider bandwidth
Multi-core processor multiplies access rate by the numberof cores
Cell processor tackles this issueCell processor tackles this issueSPE has 128 registers and high speed local storage
Bandwidth of 2-ch XDR DRAM is up to 25.6 GB/s
Leveling memory access to XDR DRAM is a key forLeveling memory access to XDR DRAM is a key forperformance stabilityperformance stability
Peak memory access initiated by 8 SPEs exceeds evenwide bandwidth accomplished by XDR DRAM
Aug. 15, 2005 8Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
AnsAns: Memory Bandwidth Reservation: Memory Bandwidth Reservation
Resource Scheduler treats memory bandwidth as aResource Scheduler treats memory bandwidth as atime-shared resourcetime-shared resource
Resources: SPEs and Memory bandwidth
Bandwidth is only reserved while SPE thread is running
Memory bandwidth can be specified as one ofMemory bandwidth can be specified as one ofscheduling parameters of SPE modulescheduling parameters of SPE module
Scheduling parameters:
The number of SPE threads, Processing ratio of SPEs,Memory bandwidth, Precedence constraints between SPEmodules
Real-time Resource Scheduler ensures that totalReal-time Resource Scheduler ensures that totalbandwidth never exceeds max capacity of XDR DRAMbandwidth never exceeds max capacity of XDR DRAM
Keeping other constraints such as precedence constraints
Aug. 15, 2005 9Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
ex. Trans-coder from MPEG-2 to H.264ex. Trans-coder from MPEG-2 to H.264
2 trans-coders are running on a Cell Processor2 trans-coders are running on a Cell Processor
Focus on H.264 encoder in following sheets
Trans-coder (MPEG-2 to H.264)
MPEG-2
Decoder
Demux
AAC
Decoder
AC3
Encoder
HDD
Writer
HDD
Reader
H.264
Encoder
Mux
H.264 Encoder: 3 SPE threads
Aug. 15, 2005 10Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Scheduling ImageScheduling Image
100%
t
Inte
rfere
SPE 0
SPE 1
SPE 2
SPE 3
SPE 4
SPE 5
Mem
ory
Bandw
idth
Thr
BW of
H.264 enc 1
a) without BW reservation b) with BW reservation
Thread 1
Thread 2
Thread 3
H.264 enc 1
Thread 1
Thread 2
Thread 3
H.264 enc 2
Thr
ThrThr Thr
BW ofH.264 enc 2
BW of
H.264 enc 1
BW ofH.264 enc 2
Thread 3
Thread 1
Thread 2
H.264 enc 1
Thread 1
Thread 2
Thread 3
H.264 enc 2
enough
room
2 encodersstart at the same
time
BW shortagedeclines performance
Performance isstabilized with reserved
BW
ThrThr
Thr ThrThr
ThrThr
ThrThr Thr
ThrThr
ThrThr Thr
H.264 enc 2runs after H.264 enc 1
All threads arescheduled properly
Aug. 15, 2005 11Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Evaluation on Cell ProcessorEvaluation on Cell Processor
Performance monitoring features arePerformance monitoring features areimplemented in Cell Processorimplemented in Cell Processor
Hardware counters and event flags can be monitoredand recorded sequentially
Performance monitoring tool is already availablePerformance monitoring tool is already availableThe tool is very flexible with configuration file
SPE behaviors and utilization of memorySPE behaviors and utilization of memorybandwidth had been monitoredbandwidth had been monitored
Utilization of memory bandwidth of H.264 encodershad been monitored
SPE behaviors are presented later
Aug. 15, 2005 12Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Utilization of Memory BandwidthUtilization of Memory Bandwidth
H.264 enc 2H.264 enc 1H.264 enc 2H.264 enc 1
H.264 enc 2
H.264 enc 1
H.264 enc 2
H.264 enc 1
with BW reservation
without BW
reservation
Slower performancecauses wider access
Aug. 15, 2005 13Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Performance of H.264 EncodersPerformance of H.264 Encoders
With bandwidth reservation,performance scales in a linear fashion
Without bandwidthreservation, performance
declines by 50%
Aug. 15, 2005 14Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
SP
E P
rogra
mm
ing
SPE 0
SPE 1
SPE 2
Multi-layered Programming ModelMulti-layered Programming Model
SPE Thread Layer
PPE/SPE Module Layer
SPE Overlay Layer
H.264 Encoder
SPE Thread
SPE Thread
SPE Thread
Function A Function B Function C Function D
SPE Thread
FunctionLocal
Storage
XDR
DRAM
MPEG-2
Decoder
Demux
AAC
Decoder
AC3
Encoder
HDD
Writer
HDD
ReaderH.264
Encoder
Mux
Aug. 15, 2005 15Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
OutlineOutline
Cell ProcessorCell Processor
SPE SchedulingSPE Scheduling
Programming Model
Real-time Resource Scheduler
Memory Bandwidth Reservation
SPE ProgrammingSPE Programming
SPE Programming Steps
Parallel Execution Model for Sub Modules
Evaluation of the H.264 Encoder
ConclusionsConclusions
Aug. 15, 2005 16Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
SPE Programming StepsSPE Programming Steps
Partition program code into sub modulesPartition program code into sub modulesNot only for parallelization but also for overlay
Implement sub modules by using the extended CImplement sub modules by using the extended CNo assembler programming needed
Coding in high level language is important forperformance tuning step
Since sophisticated algorithm easily applied, higherperformance than assembler can be achieved
Assign sub modules to the SPE threadsAssign sub modules to the SPE threadsAlso define program overlay order
Configure parallel execution scheme
Performance tuningPerformance tuningEffectively repeats the cycle of evaluation, refinement anddebugging for performance tuning
Aug. 15, 2005 17Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Tips for Implementing Sub ModulesTips for Implementing Sub Modules
DMA latency hidingDMA latency hidingReorder the program segment execution
Double buffer or execution queue techniquesare useful
Effective use of 128 registersEffective use of 128 registersUse local (auto) variables and inline functions
Loop unrolling in compiler optimization realizethat local defined array can be allocated toregister
Reduce the penalty of branch missReduce the penalty of branch missEliminate conditional branch by using logicaloperation or select instruction
Aug. 15, 2005 18Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Parallel Execution Model for Sub ModuleParallel Execution Model for Sub Module
Program
SPE 0
SPE 1
SPE 2
Data
Data
SPE 0
SPE 1
SPE 2
&Data
Data
Data ParallelPipelining
Aug. 15, 2005 19Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Structure of H.264 EncoderStructure of H.264 Encoder
CABAC
ME
IntraPred
T/Qー
Filter
IQ/IT
+
MC
GenRef
Input Video Data
Reference Frame
Output Encoded Bits
Motion Estimation
Motion Compensation
Transform/Quantization
Intra Prediction
Context-based
Adaptive Binary
Arithmetic Coding
Inverse Transform/
Inverse Quantization
Generate Reference
Deblocking Filter
Very fine grain
data dependency,
100 – 1000 cycles
…
Intra Prediction
predicts pixel data
based on surroundings
Aug. 15, 2005 20Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Partition the H.264 Encoder CodePartition the H.264 Encoder Code
CABAC
ME
IntraPredT/Q
ー
Filter
IQ/IT+
MC
GenRef
ME1 ME2 ME3
Input Video Data
Reference Frame
Coding module
was composed
from multiple
encoder functions
ME module was
divided into 3
sub modules to
give priority to
the execution
speed
Output Encoded Bits
Motion Estimation
is the hotspot of
the H.264 encoder
…
Aug. 15, 2005 21Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Assign Sub Modules to the SPE ThreadsAssign Sub Modules to the SPE Threads
CABAC
ME
Filter
MC
GenRef
ME1 ME2 ME3
Coding
Input Video Data
Reference Frame
Output Encoded Bits
Thread 1
Thread 2
Thread 3
Aug. 15, 2005 22Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Overlay by Frame PartitioningOverlay by Frame Partitioning
GenRef
Filter
spu_overlayspu_overlay spu_overlay
CABACCoding
MCThread 1 Thread 2 Thread 3
ME3ME2
ME1
Video frames are partitioned into processing unitsfor pipelining and program code overlay.
H.264 bit stream
CABAC generated encoded bitstream
Thread 3 waits until the 2nd
partition is available, because
Thread 3 refers to the top of the
next partition
Aug. 15, 2005 23Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
3 Thread Execution Image3 Thread Execution Image
Thread 3
Thread 1
Thread 2
Time
12345
1 Frame
divided into 5
partitions
11 11 22 22 33 33 44 44 55 5555 55
33 3333 44 4444 55 555522 2222
44 44 44 55 55 5533 33 33
Processing
elements for
1 Frame
11 11 22 22 33 33 44 44 55 55 11 11
11 1111 22 2222 33 3333 44 4444 55 5555 11 1111 22 2222 33 3333
11 11 11 22 22 22 33 33 33 44 44 44 55 55 55 11 11 11 22 22 22 33 33 44 44 4433
Reference
frame
dependency
Aug. 15, 2005 24Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Evaluation of the H.264 EncoderEvaluation of the H.264 Encoder
SPE Core Cycle UtilizationSPE Core Cycle Utilization
ME3ME2
ME1Thread 1
spu_ov erla
y
100
80
60
40
20
05 10 15 20 25 30 35
time [ms]
SP
E C
ore
Cycle
Ratio [
%]
branch
miss penalty
DMA
channel
stallsingle commit
dual commit
1 Frame
1 Partition
ME1 ME3
ME2
ME1 ME3
ME2 ME2 ME2 ME2
ME1 ME1 ME1ME3 ME3 ME3
Aug. 15, 2005 25Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
Evaluation of the H.264 EncoderEvaluation of the H.264 Encoder
Memory Bandwidth by SPE ReadMemory Bandwidth by SPE Read
5 10 15 20 25 30 35
time [ms]
ME3ME2
ME1Thread 1
spu_ov erla
y
10
8
6
4
2
0
Mem
ory
Bandw
idth
[G
B/s
]
spu_ov erla
y
CABACCoding
MC
Thread 2
DMA read
from Thread 1
DMA read
from Thread 2
1 Frame
ME3 ME3 ME3 ME3 ME3
1 Partition
Aug. 15, 2005 26Copyright © 2005 TOSHIBA CORPORATION, All Rights Reserved.
ConclusionsConclusions
Real-time Resource Scheduler with MemoryReal-time Resource Scheduler with MemoryBandwidth Reservation can level memory accessBandwidth Reservation can level memory accessand stabilize the performance of SPE modulesand stabilize the performance of SPE modules
H.264 Encoder was successfully implemented onH.264 Encoder was successfully implemented onSPE thread and overlay programming environment,SPE thread and overlay programming environment,and evaluatedand evaluated
Effectiveness of SPE scheduling and SPEEffectiveness of SPE scheduling and SPEprogramming is evaluated on programming is evaluated on ““realreal”” Cell Processor Cell Processorwith performance monitoring toolwith performance monitoring tool
Using this infrastructure, next generationUsing this infrastructure, next generation’’s multi-s multi-stream media applications will be accomplished onstream media applications will be accomplished onthe Cell Processor.the Cell Processor.
Have fun programming on the Cell!Have fun programming on the Cell!