A Comparison of the VIRAM-1 and Embedded VLIW architectures for use on SVD

CS 252, Spring 2000
Jeff Herman, John Loo, Xiaoyi Tang

Motivation
  SVD Applications
    Smart antennas
    Image processing
    Medical imaging
  VLIW
    Trend in high performance embedded computing
  Vector
    Out of favor
    Flynn bottleneck is a limiting factor in parallelism
    Known for linear algebra performance

C67 Architecture (mapped)
  Instruction RAM (cache optional)
  Data RAM (>4 banks)
  Decode Logic (8-way)
  A Register File
  B Register File
  Functional units: L1, S1, M1, D1, D2, M2, S2, L2

C67 Architecture
  Split Register Files
    16 registers per register file
    One cross path per register file
  Instruction Latencies
    Branches - 6 cycles
    Load - 5 cycles
    FP add/multiply - 4 cycles

TM 1100 VLIW Processor
  Core Architecture
    5-issue VLIW
    2 FP adders/multipliers
    2 Load/Store Units
    128 general purpose 32 bit registers
    16KB data cache, 32KB instruction cache
  Instruction Latencies
    3 cycles for Branches, Load, FP add/multiply

VIRAM-1 Microarchitecture
  2-way-issue superscalar MIPS IV core
  Asynchronous vector unit
    Communication to scalar core through queue
  32 general purpose vector and flag registers
  32 scalar and control registers
  2 VAFU, 2 FFU, 1 VMFU
  4-lane standard configuration

VIRAM-1 Microarchitecture

Testing Conditions
  SVD routine from CLAPACK
  Random test matrices with a rank of 10
  Matrix dimension ratio of 10
    Sizes range from 100x10 to 300x30
  Suboptimal parameters used
    Trends should still hold
  Assumed 200 MHz clock rate

Figure: Columns vs. Cycles (x-axis: Columns; y-axis: Cycles in millions)

Worksheet data (cycles):

Columns  TI 'C67 Ideal  TI 'C6711 Cache  TM1100 Cache  TM1100 Ideal  IRAM (4-lane)  IRAM (16-lane)
10       560394         908837           1428220       1118000       709470         693096
15       1423163        4403180          3606983       2810000       1551688        1492161
20       2966400        9706172          8069312       5705000       2788131        2650600
25       5416843        18262783         13689721      9990000       4509161        4235554
30       8898635        32285485         23659675      16628000      7107818        6614237

Worksheet note: iram, 250x20 was the 4 lane version
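
To put the cycle counts in terms of wall-clock time, the sketch below converts the 300x30 (30-column) figures to milliseconds and compares each configuration with the 16-lane IRAM. It assumes the 200 MHz clock from the testing conditions applies to every configuration; the helper names `cycles_to_ms` and `relative_to` are illustrative and not part of the original study.

```python
# Cycle counts for the 300x30 case (30 columns), copied from the chart data.
CYCLES_300X30 = {
    "TI 'C67 Ideal": 8898635,
    "TI 'C6711 Cache": 32285485,
    "TM1100 Cache": 23659675,
    "TM1100 Ideal": 16628000,
    "IRAM (4-lane)": 7107818,
    "IRAM (16-lane)": 6614237,
}

# Assumption: the 200 MHz clock rate is applied uniformly to all configurations.
CLOCK_HZ = 200e6


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / clock_hz * 1e3


def relative_to(reference, table):
    """Return each configuration's cycle count divided by the reference's."""
    ref = table[reference]
    return {name: cycles / ref for name, cycles in table.items()}


if __name__ == "__main__":
    ratios = relative_to("IRAM (16-lane)", CYCLES_300X30)
    for name, cycles in CYCLES_300X30.items():
        print(f"{name:16s} {cycles_to_ms(cycles):8.2f} ms  {ratios[name]:5.2f}x")
```

A ratio above 1 means the configuration needs more cycles than the 16-lane IRAM for the same matrix.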

Ideal C67 and TM 1100 Performance Gap
  Same memory bottlenecks in both processors
  Programming model
    C67
      Assembly coded kernels
      1700 lines
    TM 1100
      Only C level optimizations

Figure: VIRAM-1 Vector Core Scalability (x-axis: Lane Count; y-axis: Gain vs. standard MIPS Core; one series per matrix size)

Worksheet data (cycles):

Lanes     100X10   150X15   200X20    250X25    300X30
baseline  2446081  7485848  17153880  33165015  57263924
1         921459   2261514  4532494   7915642   13015437
2         749779   1723245  3234960   5438995   8779807
4         709470   1551688  2788131   4509161   7107818
8         697400   1508747  2688376   4311352   6760020
16        693096   1492161  2650600   4235554   6614237

Gain (baseline cycles divided by cycles at each lane count):

Lanes  100X10        150X15        200X20        250X25        300X30
1      2.6545738877  3.3101046467  3.7846448335  4.189807346   4.3996927648
2      3.2624026546  4.3440416192  5.3026559834  6.0976366038  6.5222303862
4      3.4477581857  4.8243255087  6.1524655764  7.3550301264  8.0564702135
8      3.5074290221  4.9616324009  6.380759239   7.6924860229  8.4709696125
16     3.5292095179  5.0167830415  6.4716969743  7.8301480751  8.6576764637

Scalar and vector rows from the worksheet:

        100X10   150X15   200X20    250X25    300X30
scalar  546701   1091058  1867021   2890722   4342629
vector  1882172  5775978  12943075  24448699  41879336
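
The gain figures for the scalability chart can be reproduced from the cycle counts: each gain equals the baseline cycle count divided by the cycle count at a given lane count. The sketch below recomputes them; the names `BASELINE`, `VIRAM_CYCLES`, and `gain` are illustrative only.

```python
SIZES = ["100X10", "150X15", "200X20", "250X25", "300X30"]

# Baseline cycle counts, copied from the worksheet.
BASELINE = dict(zip(SIZES, [2446081, 7485848, 17153880, 33165015, 57263924]))

# Cycle counts per lane count, copied from the worksheet.
VIRAM_CYCLES = {
    1: dict(zip(SIZES, [921459, 2261514, 4532494, 7915642, 13015437])),
    2: dict(zip(SIZES, [749779, 1723245, 3234960, 5438995, 8779807])),
    4: dict(zip(SIZES, [709470, 1551688, 2788131, 4509161, 7107818])),
    8: dict(zip(SIZES, [697400, 1508747, 2688376, 4311352, 6760020])),
    16: dict(zip(SIZES, [693096, 1492161, 2650600, 4235554, 6614237])),
}


def gain(lanes, size):
    """Gain over the baseline: baseline cycles / cycles at this lane count."""
    return BASELINE[size] / VIRAM_CYCLES[lanes][size]


if __name__ == "__main__":
    print("Lanes  " + "  ".join(f"{s:>7s}" for s in SIZES))
    for lanes in sorted(VIRAM_CYCLES):
        row = "  ".join(f"{gain(lanes, s):7.4f}" for s in SIZES)
        print(f"{lanes:5d}  {row}")
```

Reading across a row shows the gain growing with matrix size; reading down a column shows how little each doubling of the lane count adds.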

Figure: Utilization of Vector Core (x-axis: Lane Count; y-axis: sustained/peak bandwidth; one series per matrix size)

Efficiency:

Lanes  100X10        150X15        200X20        250X25        300X30
1      0.5106499584  0.6385078757  0.7139046957  0.7721641214  0.8044166323
2      0.3137877961  0.4189753924  0.5001250015  0.5618845715  0.5962451111
4      0.165807927   0.2326489765  0.2901377975  0.3388753889  0.3682506361
8      0.0843387941  0.119635242   0.1504518318  0.1772116598  0.1935984287
16     0.0424312613  0.0604825191  0.0762980257  0.090191489   0.0989327454
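
The efficiency values in the worksheet are numerically consistent with dividing the "vector" row by four times the lane count times the cycle count. The slides do not state this formula, so the relation and the factor of 4 are inferences from the numbers, not a documented definition. A minimal sketch for the 300x30 case, with illustrative names:

```python
# "vector" worksheet entry for 300X30 (units are not stated in the slides).
VECTOR_300X30 = 41879336

# Cycle counts per lane count for 300X30, copied from the worksheet.
CYCLES_300X30 = {1: 13015437, 2: 8779807, 4: 7107818, 8: 6760020, 16: 6614237}

# Assumption: a factor of 4 per lane per cycle, inferred from the worksheet
# numbers; the slides do not say what it represents.
PER_LANE_FACTOR = 4


def efficiency(lanes, vector=VECTOR_300X30, cycles=CYCLES_300X30):
    """Inferred relation: vector / (factor * lanes * cycles)."""
    return vector / (PER_LANE_FACTOR * lanes * cycles[lanes])


if __name__ == "__main__":
    for lanes in sorted(CYCLES_300X30):
        print(f"{lanes:2d} lanes: {efficiency(lanes):.4f}")
```

Under this reading, utilization falls roughly in proportion to the lane count because the cycle count barely drops as lanes are added.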

VIRAM Performance Summary
  Gains from vector unit limited by Amdahl's law.
    Vector instructions comprise only ~15% of total code.
    Not much else of SVD can be vectorized.
    Gains limited by what cannot be vectorized.
    Perhaps streamline LAPACK or hand-code assembly?
  Sub-linear scalability.
    Scaling IRAM is cheap but gains diminish.
    Efficiency and scalability increase with size of data set.
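
The Amdahl's law limit can be illustrated with the measured 300x30 cycle counts. The sketch below assumes a simple model in which a fraction f of the 1-lane run time scales perfectly with lane count and the rest does not scale at all; it then solves for f from the 1-lane and 16-lane measurements. This two-point fit is an illustration rather than the authors' analysis, and the intermediate lane counts need not fall exactly on the fitted curve.

```python
def amdahl_speedup(f, s):
    """Amdahl's law: overall speedup when a fraction f is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s)


def fraction_from_speedup(speedup, s):
    """Invert Amdahl's law: the fraction f that yields `speedup` at factor s."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / s)


def speedup_limit(f):
    """Upper bound on speedup as s grows without bound."""
    return 1.0 / (1.0 - f)


# Measured cycle counts for 300X30, copied from the worksheet.
CYCLES_1_LANE = 13015437
CYCLES_16_LANE = 6614237

if __name__ == "__main__":
    measured = CYCLES_1_LANE / CYCLES_16_LANE      # speedup of 16 lanes over 1
    f = fraction_from_speedup(measured, 16)        # implied scalable fraction
    print(f"measured speedup, 16 vs 1 lane: {measured:.3f}")
    print(f"implied scalable fraction f:    {f:.3f}")
    print(f"limit with unlimited lanes:     {speedup_limit(f):.3f}")
```

Under this model, adding lanes beyond 16 could recover only the small gap between the measured speedup and the limit, which matches the observation that scaling yields diminishing gains.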

Concluding Remarks
  The limitations of the two architectures are different
    VIRAM: Scalar core
    VLIW: Memory bandwidth
  VLIW cannot match the performance of VIRAM when computing SVD.
  VLIW with vector coprocessor?
