A. Sazegari
AltiVec Technical Lead
Introduction
  AltiVec™ is an extension to the PowerPC Instruction Set Architecture
  Designed to extend Apple's leadership position in multimedia processing
  AltiVec is a trademark of Motorola, Inc.
What You'll Learn
  About the AltiVec Architecture
  Its performance potential
  AltiVec programming
AltiVec Technology
  Vector/SIMD technology
  Fixed-length vector operands (packed data)
  Single Instruction Multiple Data
  RISC-style instruction set
  Optimized for digital signal processing
  Elevates multimedia to first-class data type
  Useful wherever data-parallelism exists
AltiVec Architecture
  New Vector Register File:
    32 new 128-bit wide registers
  New data types:
    Packed byte, halfword, and word integers
    Packed IEEE single-precision floats
  Saturation arithmetic capability
  160 new PowerPC instructions
PowerPC Architecture
  [Block diagram: Instruction Stream, Branch Unit, IU, FPU, GRF, FPRF, Memory; datapath widths 32 and 64 bits]
AltiVec Architecture
  [Block diagram: Instruction Stream, Branch Unit, IU, FPU, Vector Unit, GRF, FPRF, Vector Register File, Memory; datapath widths 32, 64 and 128 bits]
Programming Model
  [Register diagram: General Reg. File GPR0 to GPR31 (32 bits); Floating-Point Register File FPR0 to FPR31 (64 bits); Vector Register File VR0 to VR31 (128 bits); 32 registers; Branch Registers Cond, Count, Link; Time, Time; XER, FPSCR, VRSave, VSCR]
  Separate Vector Register File
    More space for coefficients, variables, etc.
    More names for scheduling
    Wider for more parallelism
    No interference with FP or integer
Vector Data Types
  One Vector (128 bits) holds:
    16 signed or unsigned integer bytes
    8 signed or unsigned integer halfwords
    4 signed or unsigned integer words
    or 4 IEEE single-precision floating-point numbers
Simple SIMD Example
  [Diagram: eight element-wise "+" operations, VRA + VRB -> VRT]
  8 halfword additions in one instruction
  Saturation arithmetic (clamp to max or min on overflow)
  T = vec_adds (A, B); // vector signed short T, A, B
  vaddshs T, A, B
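
To make the saturation behaviour concrete, here is a minimal scalar sketch in portable C of what a saturating add of eight signed halfwords computes. The helper names saturating_add_s16 and model_vec_adds_s16 are hypothetical and introduced only for illustration; this models the arithmetic, it is not the AltiVec intrinsic itself.

```c
#include <stdint.h>

#define HALFWORDS_PER_VECTOR 8  /* 128 bits / 16 bits per halfword */

/* One lane of a signed saturating halfword add: compute in 32 bits,
 * then clamp to the int16_t range instead of wrapping. */
static int16_t saturating_add_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Hypothetical scalar model of T = vec_adds(A, B) on vector signed short:
 * the same operation applied to all eight lanes. */
static void model_vec_adds_s16(int16_t t[HALFWORDS_PER_VECTOR],
                               const int16_t a[HALFWORDS_PER_VECTOR],
                               const int16_t b[HALFWORDS_PER_VECTOR])
{
    for (int i = 0; i < HALFWORDS_PER_VECTOR; i++)
        t[i] = saturating_add_s16(a[i], b[i]);
}
```

With this model, a lane holding 32767 plus 1 stays at 32767 rather than wrapping to -32768, which is the property that matters for pixel and audio sample arithmetic.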
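Vector Dot Product
  [Diagram: vec_msum( ) multiplies the elements of VRA1 and VRB1 and accumulates the products with VRC1 into VRT1; vec_sums( ) takes VRT1 as A2, together with VRB2, and sums across the vector into VRT2]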
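
The two-step dot product can be sketched in scalar C as follows. This is a model under stated assumptions, not the intrinsics themselves: it assumes the signed-halfword form of vec_msum (eight products, adjacent pairs accumulated into four words, wrapping modulo 2^32) and the signed-word form of vec_sums (saturating sum across the vector, delivered in the last word). All function names are hypothetical.

```c
#include <stdint.h>

/* Model of the multiply-sum step: 8 halfword products, adjacent pairs
 * added to the matching word of c. The result wraps modulo 2^32. */
static void model_vec_msum_s16(int32_t t[4], const int16_t a[8],
                               const int16_t b[8], const int32_t c[4])
{
    for (int w = 0; w < 4; w++) {
        int64_t acc = (int64_t)c[w]
                    + (int64_t)a[2 * w] * b[2 * w]
                    + (int64_t)a[2 * w + 1] * b[2 * w + 1];
        /* Narrowing to 32 bits; wraps on common two's-complement targets. */
        t[w] = (int32_t)(uint32_t)acc;
    }
}

/* Model of the sum-across step: the four words of a plus the last word
 * of b, saturated to 32 bits, placed in the last word of the result. */
static void model_vec_sums_s32(int32_t t[4], const int32_t a[4],
                               const int32_t b[4])
{
    int64_t sum = (int64_t)b[3];
    for (int w = 0; w < 4; w++)
        sum += a[w];
    if (sum > INT32_MAX) sum = INT32_MAX;
    if (sum < INT32_MIN) sum = INT32_MIN;
    t[0] = t[1] = t[2] = 0;
    t[3] = (int32_t)sum;
}

/* Dot product of two 8-element halfword vectors using the two steps. */
static int32_t model_dot_product_8(const int16_t a[8], const int16_t b[8])
{
    const int32_t zero[4] = {0, 0, 0, 0};
    int32_t partial[4];
    int32_t total[4];
    model_vec_msum_s16(partial, a, b, zero);
    model_vec_sums_s32(total, partial, zero);
    return total[3];
}
```

For longer arrays, the multiply-sum step would be repeated with its previous result fed back in as the accumulator, and the sum-across step applied once at the end.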
Arithmetic Operations
  Add, Subtract, Average
  Multiply, Multiply-add, Multiply-sum
  Logicals (and, andc, or, nor, xor)
  Rotates and shifts
  Compares
  Convert float <-> fixed (scaled)
  Divide and square root via Newton-Raphson refinement of reciprocal estimate
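
The Newton-Raphson refinement mentioned above can be illustrated with a scalar sketch. Given an estimate x0 of 1/d, one step computes x1 = x0 * (2 - d * x0), which roughly squares the relative error. The function rough_reciprocal below is a hypothetical stand-in for a hardware estimate; the injected relative error of 1/4096 is an assumption chosen for illustration.

```c
/* Hypothetical stand-in for a hardware reciprocal estimate: the true
 * reciprocal with a relative error of 1/4096 deliberately injected. */
static float rough_reciprocal(float d)
{
    return (1.0f / d) * (1.0f + 1.0f / 4096.0f);
}

/* One Newton-Raphson step for f(x) = 1/x - d:
 *   x1 = x0 * (2 - d * x0)
 * If x0 = (1/d)(1 + e), then x1 = (1/d)(1 - e*e). */
static float refine_reciprocal(float d, float estimate)
{
    return estimate * (2.0f - d * estimate);
}

/* Division built from estimate, one refinement step, and a multiply. */
static float model_divide(float n, float d)
{
    return n * refine_reciprocal(d, rough_reciprocal(d));
}
```

Starting from an error near 2^-12, a single step brings the error near 2^-24, which is about the resolution of single precision. A reciprocal square root can be refined in the same spirit with its own iteration formula.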
Vector Permute
  [Diagram: source bytes numbered 00 to 0F in VRA and 10 to 1F in VRB; each byte of the control vector VRC holds the index of the source byte copied to the same position in VRT]
  Arbitrary bytewise data reorganization
  Small table-lookup
  T = vec_perm (A, B, C);
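
A scalar sketch of the permute operation may help: A and B are treated as one 32-byte table, and each control byte selects one entry by its low five bits. The function name model_vec_perm is hypothetical, and the example control values in the usage note are illustrative rather than taken from the slide.

```c
#include <stdint.h>

/* Hypothetical scalar model of T = vec_perm(A, B, C).
 * Indices 0x00-0x0F select from a, 0x10-0x1F select from b;
 * only the low 5 bits of each control byte are used. */
static void model_vec_perm(uint8_t t[16], const uint8_t a[16],
                           const uint8_t b[16], const uint8_t c[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t index = c[i] & 0x1F;
        t[i] = (index < 16) ? a[index] : b[index - 16];
    }
}
```

A control vector of 0F, 0E, ..., 00 reverses the bytes of A, and a control vector filled with data-dependent indices turns the same operation into a 32-entry table lookup.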
Compare and Select
  [Diagram: vec_cmpeq( ) compares a data vector with an all-zero vector and writes a byte mask to VRT1/C2; vec_sel( ) uses that mask to choose between the data (VRA1/A2) and an all-9A vector (VRB2), giving VRT2]
  data:    C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
  mask:    00 FF FF FF 00 00 00 00 FF 00 FF FF 00 FF 00 00
  result:  C1 9A 9A 9A 1A 1A C1 1A 9A C1 9A 9A 1A 9A 1A C1
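
The compare-then-select idiom replaces a branch per element with a mask. Below is a scalar sketch on 16 bytes; the function names are hypothetical, and the select follows the rule "where a mask bit is 1 take b, otherwise take a".

```c
#include <stdint.h>

/* Model of a bytewise equality compare: 0xFF where equal, 0x00 elsewhere. */
static void model_vec_cmpeq_u8(uint8_t mask[16], const uint8_t a[16],
                               const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        mask[i] = (a[i] == b[i]) ? 0xFF : 0x00;
}

/* Model of a bitwise select: each result bit comes from b where the
 * mask bit is 1 and from a where it is 0. */
static void model_vec_sel_u8(uint8_t t[16], const uint8_t a[16],
                             const uint8_t b[16], const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        t[i] = (uint8_t)((a[i] & (uint8_t)~mask[i]) | (b[i] & mask[i]));
}

/* Hypothetical helper: replace every byte equal to key with replacement,
 * with no per-byte branching on the data path. */
static void model_replace_bytes(uint8_t out[16], const uint8_t data[16],
                                uint8_t key, uint8_t replacement)
{
    uint8_t keys[16], repl[16], mask[16];
    for (int i = 0; i < 16; i++) {   /* "splat" the scalars to vectors */
        keys[i] = key;
        repl[i] = replacement;
    }
    model_vec_cmpeq_u8(mask, data, keys);
    model_vec_sel_u8(out, data, repl, mask);
}
```

Called with key 0x00 and replacement 0x9A on the byte pattern C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1, the model turns every zero byte into 9A and leaves the rest untouched.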
Other AltiVec Instructions
  Load and Store (vector or scalar element)
  Pack, Unpack, and Merge elements
  Splat (element or literal replication)
  Bitwise vector shifts
  Double-vector bytewise shifts
Data Stream Prefetch
  Software directed prefetch into cache
  4 simultaneous streams
  Independent and asynchronous
  Can be non-contiguous
  [Diagram: blocks 1, 2, 3, ... N in memory; Block Size = 0-32 Vectors; Stride = 32 KBytes; 0-256 Blocks]
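
A stream is described by three numbers: block size in vectors, block count, and stride in bytes. The sketch below packs them into one 32-bit control word. The field layout (5-bit size, 8-bit count, 16-bit stride) is an assumption not stated on the slide, and the function names are hypothetical.

```c
#include <stdint.h>

/* Pack a stream description into one control word.
 * ASSUMED layout (not from the slide):
 *   bits 28..24  block size in 16-byte vectors
 *   bits 23..16  number of blocks
 *   bits 15..0   stride in bytes between block starts
 * Masking means a size of 32 or a count of 256 is stored as 0. */
static uint32_t model_stream_control(unsigned block_size_vectors,
                                     unsigned block_count,
                                     int stride_bytes)
{
    return ((uint32_t)(block_size_vectors & 0x1F) << 24)
         | ((uint32_t)(block_count & 0xFF) << 16)
         | ((uint32_t)stride_bytes & 0xFFFF);
}

/* Byte offset of block i from the start of the stream. */
static long model_block_offset(unsigned i, int stride_bytes)
{
    return (long)i * (long)stride_bytes;
}
```

For example, prefetching 8 rows of an image, 2 vectors (32 bytes) per row, with rows 1024 bytes apart, would use a size of 2, a count of 8, and a stride of 1024.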
Typical Implementation
  ALL instructions fully pipelined with single-cycle throughput
    Simple ops: 1 cycle latency
    Compound ops: 3-4 cycle latency
  Dual AltiVec instruction issue
    One arithmetic, one permute
  No restriction on issue with scalar instructions
AltiVec vs. MMX
  Both SIMD, but AltiVec:
    Does everything MMX does, plus
    Twice the SIMD parallelism
    4x the register namespace
    8x the register storage space
    No mode switch or use overhead
    Permute
    Richer set of DSP instructions
AltiVec Performance
  Peak Performance
  Multimedia kernels
  DSP benchmarks
  Performance based on cycle-accurate simulator with real memory effects included
  Performance stated relative to optimized PowerPC scalar code
Peak Performance
  Vector operations at 400 MHz:
  Integer
    12.8 billion arithmetic ops/sec
    + 6.4 billion byte crossbar ops/sec
  Floating-point
    3.2 gigaflops
    + 1.6 billion FP crossbar ops/sec
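
These peak figures are consistent with simple arithmetic on the clock rate and the lane count. The sketch below reproduces them under one assumption that the slide does not state: arithmetic instructions are counted as two operations per lane (a multiply and an add), while crossbar (permute) operations are counted as one per lane. The function name is hypothetical.

```c
#include <stdint.h>

/* Peak rate = clock * lanes per vector * operations counted per lane. */
static uint64_t model_peak_ops_per_second(uint64_t clock_hz,
                                          unsigned lanes,
                                          unsigned ops_per_lane)
{
    return clock_hz * lanes * ops_per_lane;
}

/*
 * At 400 MHz, one instruction per cycle per unit:
 *   bytes,  arithmetic: 400e6 * 16 * 2 = 12.8e9
 *   bytes,  crossbar:   400e6 * 16 * 1 =  6.4e9
 *   floats, arithmetic: 400e6 *  4 * 2 =  3.2e9
 *   floats, crossbar:   400e6 *  4 * 1 =  1.6e9
 */
```

The arithmetic and crossbar figures add because the two kinds of instruction can issue in the same cycle.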
Multimedia Kernels
  Video and Audio
    11.4x   Discrete Cosine Transform (DCT)
    16.1x*  Motion estimation (* by |A-B|)
    12.5x   Quantization
     9.6x   RGB -> YCbCr (CCIR601)
     3.6x   Inverse FFT (FP)
     4.9x   Windowing (FP)
Multimedia Kernels
  Image Processing
    6.2x       Bilinear interpolation
    1.1 cy/px  Separable convolution
    2.2 cy/px  RGB to YUV
    1.3 cy/px  Median Filter (3x3)
Multimedia Kernels
  Graphics
     6.2x  Vector-matrix multiply (FP)
    17.5x  Buffer accumulation
     6.6x  Line clipping
     6.3x  Bezier curves
Communication Kernels
  Modems and Telephony
     2.5x  CRC-32
    10.5x  64-QAM Demodulator
     7.6x  Linear prediction
     9.3x  Real 13-tap FIR
    30.7x  Autocorrelation
    12.5x  GSM Module 4.2.11
Miscellaneous DSP Kernels
  Miscellaneous
    2.5 to 20x     Parallel table lookup
    10.0x          Sorting
     5.8x          Associative search
    16.0x          Galois field multiply
     4.0x          Gamma Correction
    12.0 cy/block  Haar Transform (wavelet)
DSP Benchmarks
  Results from an independent DSP benchmarking firm indicate that AltiVec on integer DSP algorithms (FIR, FFT, etc.) is:
    Twice as fast as the world's fastest DSP (TMS320C6201) per clock, and four times faster including frequency
    2 to 5 times faster than Pentium II per clock (but P would still be 35% smaller)
AltiVec Tools
  Programming Model and ABI
  Compilers and assemblers
    Motorola's MCC CodeWarrior plug-in
    Apple's MrC and PPCASM in MPW and MW
    Metrowerks C/C++
  Emulator/Trace generator
  MacsBug
  Cycle-accurate simulator
  Performance profiler
Programming in C
  11 new fundamental packed data types
  AltiVec operators
    Parse like function calls
    Specific operators -> assembly instructions
    Generic operators: type sensitive
  sizeof(), a=b, &a, *p, etc.
  Compiler does register allocation, inlining, code scheduling, etc.
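C Program Example
  zero = ( vector unsigned long ) ( 0 );    // zero = vec_xor ( zero, zero );
  shiftFactor = vec_splat_u8 ( 11 );         // shift count 11 in every byte
  z = vec_sro ( x, shiftFactor );            // shift right by whole bytes
  z = vec_srl ( z, shiftFactor );            // shift right by the remaining bits

  do
  {
      carry = vec_addc ( z, y );             // carry out of each 32-bit word
      z     = vec_add ( z, y );              // word-wise sum, modulo 2^32
      y     = vec_sld ( carry, zero, 4 );    // move each carry up one word
  } while ( !vec_all_eq ( y, zero ) );       // repeat until no carries remain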
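
The do/while loop above performs a 128-bit addition out of four independent 32-bit additions by propagating carries until none remain. A scalar sketch of the same idea follows. It assumes word 0 is the most significant word (PowerPC big-endian element order), and the function name model_add128 is hypothetical.

```c
#include <stdint.h>

/* Scalar model of the carry-propagation loop: z += y over 128 bits.
 * Words are stored most significant first, so a carry out of word i
 * is added into word i - 1 on the next pass. */
static void model_add128(uint32_t z[4], const uint32_t y_in[4])
{
    uint32_t y[4] = { y_in[0], y_in[1], y_in[2], y_in[3] };
    int pending;

    do {
        uint32_t carry[4];

        for (int i = 0; i < 4; i++) {
            uint32_t sum = z[i] + y[i];       /* word-wise add, wraps     */
            carry[i] = (sum < z[i]) ? 1 : 0;  /* carry out of this word   */
            z[i] = sum;
        }

        /* Shift the carries left by one word (4 bytes), filling with 0.
         * The carry out of word 0 is dropped, so the sum wraps at 2^128. */
        y[0] = carry[1];
        y[1] = carry[2];
        y[2] = carry[3];
        y[3] = 0;

        pending = (y[0] | y[1] | y[2] | y[3]) != 0;
    } while (pending);
}
```

In the common case no word overflows and the loop runs once; the worst case, a carry rippling through all four words, takes four passes.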
Vector Shifts
  The shiftFactor vector is populated in 2 sections: one for vector shift right by octet (vsro) and one for vector shift right (vsr).

  bit #    ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
  used by  ... ||         vsro          ||       vsr       ||

  vsro is based on the permute crossbar and shifts by whole bytes; vsr is a 0 to 7 bit shift.
  Used sequentially, the combination of these instructions will shift a vector register right (or left) from 0 to 127 bits, as specified in bits 121:127 of shiftFactor.

  bit #          ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
  shiftFactor =  ... ||  0  |  0  |  0  |  1  ||  0  |  1  |  1  ||
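
The split of the 7-bit shift count into a byte part and a bit part can be sketched in scalar C. The model below treats byte 0 as the most significant byte of the vector, so bits 121:124 are bits 6..3 of the last byte of shiftFactor and bits 125:127 are its low three bits. All function names are hypothetical.

```c
#include <stdint.h>

/* Bits 121:124 of shiftFactor: number of whole bytes to shift (0-15). */
static unsigned model_byte_count(uint8_t shift_factor)
{
    return (shift_factor >> 3) & 0xF;
}

/* Bits 125:127 of shiftFactor: number of remaining bits to shift (0-7). */
static unsigned model_bit_count(uint8_t shift_factor)
{
    return shift_factor & 0x7;
}

/* Scalar model of a 0-127 bit right shift done in two steps.
 * v[0] is the most significant byte. */
static void model_shift_right_128(uint8_t v[16], uint8_t shift_factor)
{
    int bytes = (int)model_byte_count(shift_factor);
    unsigned bits = model_bit_count(shift_factor);

    /* Step 1, bytewise: walk from the least significant byte so that
     * every source byte is read before it is overwritten. */
    for (int i = 15; i >= 0; i--)
        v[i] = (i >= bytes) ? v[i - bytes] : 0;

    /* Step 2, bitwise: each byte takes its high bits from the byte to
     * its left. Skipped when there is nothing left to shift. */
    if (bits != 0) {
        for (int i = 15; i >= 0; i--) {
            uint8_t from_left =
                (i > 0) ? (uint8_t)(v[i - 1] << (8 - bits)) : 0;
            v[i] = (uint8_t)((v[i] >> bits) | from_left);
        }
    }
}
```

With the value 11 (binary 0001 011 in bits 121:127), the model shifts by 1 byte and then by 3 bits, 11 bits in total.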
AltiVec at Apple
  Mac OS (blockmove, etc.)
  QuickDraw
  QTML (codecs, rasterizers)
  Media source code [email protected]
AltiVec Summary
  Major architectural extension will make future PowerPCs great media processors
  Early programming tools available now
  Development systems 2H98 (Now)
  AltiVec-based systems in 1H99