Transcript
  • A Comparison of the VIRAM-1 and Embedded VLIW architectures for use on SVD

    CS 252, Spring 2000. Jeff Herman, John Loo, Xiaoyi Tang

  • Motivation
    SVD applications
      Smart antennas
      Image processing
      Medical imaging
    VLIW
      Trend in high performance embedded computing
    Vector
      Out of favor
      Flynn bottleneck is a limiting factor in parallelism
      Known for linear algebra performance

  • C67 Architecture (mapped)
    Instruction RAM (cache optional)
    Data RAM (>4 banks)
    Decode logic (8-way)
    A register file
    B register file
    Functional units: L1, S1, M1, D1, D2, M2, S2, L2

  • C67 Architecture
    Split register files
      16 registers per register file
      One cross path per register file
    Instruction latencies
      Branches: 6 cycles
      Load: 5 cycles
      FP add/multiply: 4 cycles

  • TM 1100 VLIW Processor Core Architecture
    5-issue VLIW
    2 FP adders/multipliers
    2 load/store units
    128 general purpose 32-bit registers
    16KB data cache, 32KB instruction cache
    Instruction latencies
      3 cycles for branches, load, FP add/multiply

  • VIRAM-1 Microarchitecture
    2-way-issue superscalar MIPS IV core
    Asynchronous vector unit
    Communication to scalar core through queue
    32 general purpose vector and flag registers
    32 scalar and control registers
    2 VAFU, 2 FFU, 1 VMFU
    4-lane standard configuration

  • VIRAM-1 Microarchitecture

  • Testing Conditions
    SVD routine from CLAPACK
    Random test matrices with a rank of 10
    Matrix dimension ratio of 10
    Sizes range from 100x10 to 300x30
    Suboptimal parameters used
      Trends should still hold
    Assumed 200 MHz clock rate
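
    The test sizes and the assumed clock rate can be turned into a small helper. This is a minimal sketch, not the authors' test harness: the function names `matrix_sizes` and `cycles_to_ms` are hypothetical, and the step of 5 columns is taken from the column values (10, 15, 20, 25, 30) that appear in the chart data. It lists the row-by-column sizes implied by a dimension ratio of 10 and converts a cycle count to wall-clock time at 200 MHz.

```python
CLOCK_HZ = 200e6  # assumed 200 MHz clock rate, as stated on the slide


def matrix_sizes(ratio=10, first_cols=10, last_cols=30, step=5):
    """Return (rows, cols) pairs with rows = ratio * cols.

    The step of 5 columns is an assumption read off the chart data.
    """
    return [(ratio * cols, cols) for cols in range(first_cols, last_cols + 1, step)]


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / clock_hz * 1e3


if __name__ == "__main__":
    print(matrix_sizes())
    # 709470 is the IRAM (4-lane) cycle count for the 100x10 case in the chart data.
    print(f"{cycles_to_ms(709470):.3f} ms")
```

    At the assumed clock, the 709470-cycle IRAM (4-lane) result for the 100x10 case corresponds to roughly 3.5 ms.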

  • Chart6: Columns vs. Cycles
    X axis: Columns. Y axis: Cycles in millions.
    Data (Sheet1), in cycles:

    Columns  TI 'C67 Ideal  TI 'C6711 Cache  TM1100 Cache  TM1100 Ideal  IRAM (4-lane)  IRAM (16-lane)
    10              560394           908837       1428220       1118000         709470          693096
    15             1423163          4403180       3606983       2810000        1551688         1492161
    20             2966400          9706172       8069312       5705000        2788131         2650600
    25             5416843         18262783      13689721       9990000        4509161         4235554
    30             8898635         32285485      23659675      16628000        7107818         6614237

    Note: iram, 250x20 was the 4 lane version

  • Ideal C67 and TM 1100 Performance Gap
    Same memory bottlenecks in both processors
    Programming model
      C67: assembly coded kernels, 1700 lines
      TM 1100: only C level optimizations

  • Chart3: VIRAM-1 Vector Core Scalability
    X axis: Lane Count. Y axis: Gain vs. standard MIPS Core. One series per matrix size.
    Data (Sheet1):

    Cycles      100X10   150X15    200X20    250X25    300X30
    baseline   2446081  7485848  17153880  33165015  57263924
    1 lane      921459  2261514   4532494   7915642  13015437
    2 lanes     749779  1723245   3234960   5438995   8779807
    4 lanes     709470  1551688   2788131   4509161   7107818
    8 lanes     697400  1508747   2688376   4311352   6760020
    16 lanes    693096  1492161   2650600   4235554   6614237
    scalar      546701  1091058   1867021   2890722   4342629
    vector     1882172  5775978  12943075  24448699  41879336

    Gain            100X10        150X15        200X20        250X25        300X30
    1 lane    2.6545738877  3.3101046467  3.7846448335   4.189807346  4.3996927648
    2 lanes   3.2624026546  4.3440416192  5.3026559834  6.0976366038  6.5222303862
    4 lanes   3.4477581857  4.8243255087  6.1524655764  7.3550301264  8.0564702135
    8 lanes   3.5074290221  4.9616324009   6.380759239  7.6924860229  8.4709696125
    16 lanes  3.5292095179  5.0167830415  6.4716969743  7.8301480751  8.6576764637
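
    The sub-linear scaling in the gain data can be quantified with a short calculation. This is a minimal sketch using the 300X30 gain figures from the chart data; the helper names `relative_speedup` and `scaling_efficiency` are hypothetical, and the scaling efficiency computed here (relative speedup divided by the increase in lane count) is an illustrative measure, not the sustained/peak bandwidth figure plotted in the utilization chart.

```python
# Gain vs. standard MIPS core for the 300X30 case, keyed by lane count
# (values taken from the chart data).
GAIN_300X30 = {
    1: 4.3996927648,
    2: 6.5222303862,
    4: 8.0564702135,
    8: 8.4709696125,
    16: 8.6576764637,
}


def relative_speedup(gains, base_lanes=1):
    """Speedup of each lane count relative to the base lane count."""
    base = gains[base_lanes]
    return {lanes: gain / base for lanes, gain in gains.items()}


def scaling_efficiency(gains, base_lanes=1):
    """Relative speedup divided by the factor by which the lane count grew."""
    rel = relative_speedup(gains, base_lanes)
    return {lanes: r / (lanes / base_lanes) for lanes, r in rel.items()}


if __name__ == "__main__":
    rel = relative_speedup(GAIN_300X30)
    eff = scaling_efficiency(GAIN_300X30)
    for lanes in sorted(GAIN_300X30):
        print(f"{lanes:2d} lanes: {rel[lanes]:.2f}x over 1 lane, "
              f"scaling efficiency {eff[lanes]:.2f}")
```

    For the largest matrix, going from 1 lane to 16 lanes yields a bit less than a 2x speedup, which is the diminishing return the scalability chart shows.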

  • Chart5: Utilization of Vector Core
    X axis: Lane Count. Y axis: sustained/peak bandwidth. One series per matrix size.
    Data (Sheet1, efficiency):

    Lanes        100X10        150X15        200X20        250X25        300X30
    1      0.5106499584  0.6385078757  0.7139046957  0.7721641214  0.8044166323
    2      0.3137877961  0.4189753924  0.5001250015  0.5618845715  0.5962451111
    4       0.165807927  0.2326489765  0.2901377975  0.3388753889  0.3682506361
    8      0.0843387941   0.119635242  0.1504518318  0.1772116598  0.1935984287
    16     0.0424312613  0.0604825191  0.0762980257   0.090191489  0.0989327454

  • VIRAM Performance Summary
    Gains from vector unit limited by Amdahl's law.
      Vector instructions comprise only ~15% of total code.
      Not much else of SVD can be vectorized.
      Gains limited by what cannot be vectorized.
      Perhaps streamline LAPACK or handcode assembly?
    Sub-linear scalability.
      Scaling IRAM is cheap but gains diminish.
      Efficiency and scalability increase with size of data set.
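
    The Amdahl's law limit mentioned above can be made concrete with a short sketch. Amdahl's law gives the overall speedup as 1 / ((1 - f) + f / s), where f is the fraction of execution time that benefits and s is the speedup of that fraction. The value f = 0.8 below is a hypothetical illustration, not a figure from the slides (the ~15% on the slide refers to the share of vector instructions in the code, not to execution time), and treating the lane count as the speedup s of the vectorized part is a simplifying assumption.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def amdahl_limit(f):
    """Upper bound on the speedup as s grows without bound."""
    if f == 1.0:
        return float("inf")
    return 1.0 / (1.0 - f)


if __name__ == "__main__":
    f = 0.8  # hypothetical vectorizable fraction of execution time
    for lanes in (1, 2, 4, 8, 16):
        # Assumption: the vectorized part speeds up in proportion to lane count.
        print(f"{lanes:2d} lanes: {amdahl_speedup(f, lanes):.2f}x")
    print(f"limit: {amdahl_limit(f):.2f}x")
```

    However many lanes are added, the speedup stays below 1 / (1 - f), so the part that cannot be vectorized sets the ceiling.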

  • Concluding Remarks
    The two architectures have different limitations
      VIRAM: Scalar core
      VLIW: Memory bandwidth
    VLIW cannot match performance of VIRAM when computing SVD.
    VLIW with vector coprocessor?

