A Comparison of the VIRAM-1 and Embedded VLIW architectures for use on SVD
CS 252, Spring 2000
Jeff Herman, John Loo, Xiaoyi Tang
Motivation
- SVD Applications
  - Smart antennas
  - Image processing
  - Medical imaging
- VLIW
  - Trend in high performance embedded computing
- Vector
  - Out of favor
  - Flynn bottleneck is a limiting factor in parallelism
  - Known for linear algebra performance
C67 Architecture (mapped)
- Instruction RAM (cache optional)
- Data RAM (>4 banks)
- Decode Logic (8-way)
- A Register File
- B Register File
- Functional units: L1, S1, M1, D1, D2, M2, S2, L2
C67 Architecture
- Split Register Files
  - 16 registers per register file
  - One cross path per register file
- Instruction Latencies
  - Branches: 6 cycles
  - Load: 5 cycles
  - FP add/multiply: 4 cycles
TM 1100 VLIW Processor Core Architecture
- 5-issue VLIW
- 2 FP adders/multipliers
- 2 Load/Store Units
- 128 general purpose 32-bit registers
- 16KB data cache, 32KB instruction cache
- Instruction Latencies
  - 3 cycles for Branches, Load, FP add/multiply
VIRAM-1 Microarchitecture
- 2-way-issue superscalar MIPS IV core
- Asynchronous vector unit
  - Communication to scalar core through queue
- 32 general purpose vector and flag registers
- 32 scalar and control registers
- 2 VAFU, 2 FFU, 1 VMFU
- 4-lane standard configuration
Testing Conditions
- SVD routine from CLAPACK
- Random test matrices with a rank of 10
- Matrix dimension ratio of 10
- Sizes range from 100x10 to 300x30
- Suboptimal parameters used
  - Trends should still hold
- Assumed 200 MHz clock rate
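The test set described above can be sketched in a few lines. This is only an illustration, not the authors' generator: the function names, the Gaussian entries, and the fixed seed are assumptions, and the step of 5 columns is taken from the sizes that appear in the charts (100x10, 150x15, ..., 300x30). A rank-10 matrix is built as the product of a rows x 10 factor and a 10 x cols factor.

```python
import random


def random_low_rank_matrix(rows, cols, rank, seed=0):
    """Return a rows x cols matrix (list of lists) of rank at most `rank`.

    The matrix is the product of a rows x rank factor and a rank x cols
    factor with Gaussian entries (an assumed distribution).
    """
    rng = random.Random(seed)
    left = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(rows)]
    right = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rank)]
    return [
        [sum(left[i][k] * right[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


def test_sizes(ratio=10, min_cols=10, max_cols=30, step=5):
    """List (rows, cols) pairs with rows = ratio * cols."""
    return [(ratio * cols, cols) for cols in range(min_cols, max_cols + 1, step)]


if __name__ == "__main__":
    for rows, cols in test_sizes():
        matrix = random_low_rank_matrix(rows, cols, rank=10)
        print(f"{rows}x{cols}: {len(matrix)} rows, {len(matrix[0])} columns")
```

Each generated matrix would then be handed to the SVD routine under test; that step is left out here because it depends on the CLAPACK binding in use.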
Columns vs. Cycles (chart; x-axis: Columns, y-axis: Cycles in millions)

Cycle counts by number of matrix columns:

Columns  TI 'C67 Ideal  TI 'C6711 Cache  TM1100 Cache  TM1100 Ideal  IRAM (4-lane)  IRAM (16-lane)
10              560394           908837       1428220       1118000         709470          693096
15             1423163          4403180       3606983       2810000        1551688         1492161
20             2966400          9706172       8069312       5705000        2788131         2650600
25             5416843         18262783      13689721       9990000        4509161         4235554
30             8898635         32285485      23659675      16628000        7107818         6614237

Note from the data sheet: "iram, 250x20 was the 4 lane version"
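To relate the cycle counts in the "Columns vs. Cycles" data to wall-clock time, a small sketch can apply the 200 MHz clock rate assumed in the testing conditions. The helper names `cycles_to_ms` and `ratio_to` are hypothetical; the numbers are the chart's source data.

```python
CLOCK_HZ = 200e6  # clock rate assumed in the testing conditions

# Cycle counts from the "Columns vs. Cycles" data, keyed by column count.
CYCLES = {
    "TI 'C67 Ideal": {10: 560394, 15: 1423163, 20: 2966400, 25: 5416843, 30: 8898635},
    "TI 'C6711 Cache": {10: 908837, 15: 4403180, 20: 9706172, 25: 18262783, 30: 32285485},
    "TM1100 Cache": {10: 1428220, 15: 3606983, 20: 8069312, 25: 13689721, 30: 23659675},
    "TM1100 Ideal": {10: 1118000, 15: 2810000, 20: 5705000, 25: 9990000, 30: 16628000},
    "IRAM (4-lane)": {10: 709470, 15: 1551688, 20: 2788131, 25: 4509161, 30: 7107818},
    "IRAM (16-lane)": {10: 693096, 15: 1492161, 20: 2650600, 25: 4235554, 30: 6614237},
}


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / clock_hz * 1e3


def ratio_to(reference, cols):
    """Cycle count of every series relative to `reference` at `cols` columns."""
    ref = CYCLES[reference][cols]
    return {name: series[cols] / ref for name, series in CYCLES.items()}


if __name__ == "__main__":
    for name, ratio in ratio_to("IRAM (4-lane)", 30).items():
        ms = cycles_to_ms(CYCLES[name][30])
        print(f"{name:16s} {ms:8.2f} ms  {ratio:5.2f}x of IRAM (4-lane)")
```

Ratios above 1 mean the series needs more cycles than the 4-lane IRAM configuration at that matrix size.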
Ideal C67 and TM 1100 Performance Gap
- Same memory bottlenecks in both processors
- Programming model
  - C67
    - Assembly coded kernels
    - 1700 lines
  - TM 1100
    - Only C level optimizations
VIRAM-1 Vector Core Scalability (chart; x-axis: Lane Count, y-axis: Gain vs. standard MIPS Core, one series per matrix size)

Cycle counts:

Lanes      100X10   150X15    200X20    250X25    300X30
baseline  2446081  7485848  17153880  33165015  57263924
1          921459  2261514   4532494   7915642  13015437
2          749779  1723245   3234960   5438995   8779807
4          709470  1551688   2788131   4509161   7107818
8          697400  1508747   2688376   4311352   6760020
16         693096  1492161   2650600   4235554   6614237
scalar     546701  1091058   1867021   2890722   4342629
vector    1882172  5775978  12943075  24448699  41879336

Gain:

Lanes        100X10        150X15        200X20        250X25        300X30
1      2.6545738877  3.3101046467  3.7846448335   4.189807346  4.3996927648
2      3.2624026546  4.3440416192  5.3026559834  6.0976366038  6.5222303862
4      3.4477581857  4.8243255087  6.1524655764  7.3550301264  8.0564702135
8      3.5074290221  4.9616324009   6.380759239  7.6924860229  8.4709696125
16     3.5292095179  5.0167830415  6.4716969743  7.8301480751  8.6576764637

Utilization of Vector Core (chart; x-axis: Lane Count, y-axis: sustained/peak bandwidth, one series per matrix size)

Efficiency:

Lanes        100X10        150X15        200X20        250X25        300X30
1      0.5106499584  0.6385078757  0.7139046957  0.7721641214  0.8044166323
2      0.3137877961  0.4189753924  0.5001250015  0.5618845715  0.5962451111
4       0.165807927  0.2326489765  0.2901377975  0.3388753889  0.3682506361
8      0.0843387941   0.119635242  0.1504518318  0.1772116598  0.1935984287
16     0.0424312613  0.0604825191  0.0762980257   0.090191489  0.0989327454
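The gain and efficiency figures can be recomputed from the cycle counts in the data sheet. The sketch below divides the baseline count by each lane configuration's count to get the gain. For efficiency, the sheet's values are matched by dividing the "vector" row by (4 x lanes x cycles). The factor of 4 is inferred from the numbers and is not stated in the slides, so treat `PER_LANE_FACTOR` as an assumption. Function names are hypothetical.

```python
SIZES = ["100X10", "150X15", "200X20", "250X25", "300X30"]

# Rows copied from the data sheet.
BASELINE = [2446081, 7485848, 17153880, 33165015, 57263924]
VECTOR = [1882172, 5775978, 12943075, 24448699, 41879336]
CYCLES_BY_LANES = {
    1: [921459, 2261514, 4532494, 7915642, 13015437],
    2: [749779, 1723245, 3234960, 5438995, 8779807],
    4: [709470, 1551688, 2788131, 4509161, 7107818],
    8: [697400, 1508747, 2688376, 4311352, 6760020],
    16: [693096, 1492161, 2650600, 4235554, 6614237],
}

PER_LANE_FACTOR = 4  # assumption: inferred from the sheet, not stated in the slides


def gain(lanes):
    """Baseline cycles divided by the cycles of the given lane count."""
    return [b / c for b, c in zip(BASELINE, CYCLES_BY_LANES[lanes])]


def efficiency(lanes):
    """'vector' row divided by (factor * lanes * cycles), per matrix size."""
    return [
        v / (PER_LANE_FACTOR * lanes * c)
        for v, c in zip(VECTOR, CYCLES_BY_LANES[lanes])
    ]


if __name__ == "__main__":
    for lanes in CYCLES_BY_LANES:
        gains = ", ".join(f"{g:.3f}" for g in gain(lanes))
        effs = ", ".join(f"{e:.3f}" for e in efficiency(lanes))
        print(f"{lanes:2d} lanes  gain: {gains}  efficiency: {effs}")
```

Going from 4 to 16 lanes changes the gain only slightly at every matrix size, while the efficiency falls by roughly the same factor as the lane count grows.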
VIRAM Performance Summary
- Gains from vector unit limited by Amdahl's law.
  - Vector instructions comprise only ~15% of total code.
  - Not much else of SVD can be vectorized.
  - Gains limited by what cannot be vectorized.
  - Perhaps streamline LAPACK or handcode assembly?
- Sub-linear scalability.
  - Scaling IRAM is cheap but gains diminish.
  - Efficiency and scalability increase with size of data set.
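The Amdahl's law limit mentioned in the summary can be illustrated with a short sketch. Here `vector_fraction` is the fraction of baseline execution time that can be vectorized, which is not the same quantity as the ~15% share of code quoted in the slides. The value 0.8 used below is purely hypothetical, as are the function names.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when `vector_fraction` of the time is sped up
    by a factor of `vector_speedup` and the rest is unchanged."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


def amdahl_limit(vector_fraction):
    """Upper bound on speedup as the vector speedup grows without limit."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if vector_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - vector_fraction)


if __name__ == "__main__":
    fraction = 0.8  # hypothetical value, for illustration only
    for speedup in (1, 2, 4, 8, 16):
        print(f"vector speedup {speedup:2d}x -> overall "
              f"{amdahl_speedup(fraction, speedup):.2f}x")
    print(f"limit: {amdahl_limit(fraction):.2f}x")
```

Whatever the fraction, the overall speedup approaches `amdahl_limit` as the vector part gets faster, which is the sense in which the gains are limited by what cannot be vectorized.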
Concluding Remarks
- The limitations of the two architectures differ.
  - VIRAM: Scalar core
  - VLIW: Memory bandwidth
- VLIW cannot match performance of VIRAM when computing SVD.
- VLIW with vector coprocessor?