L33:Low Power Reconfigurable system Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab


L33:Low Power Reconfigurable system

Jun-Dong Cho, SungKyunKwan Univ.

Dept. of ECE, Vada Lab. http://vada.skku.ac.kr

Answer IV: Reconfigurable Processor

• Configurable datapaths (e.g., splittable ALUs, complex operations)

• Configurable interconnect (e.g., nearest neighbor, k buses)

• SIMD processor, many functional units, preferably VLIW, possibly superscalar

ULTRA-LOW-POWER DOMAIN-SPECIFIC MULTIMEDIA PROCESSORS

• Arthur Abnous and Jan Rabaey

• Programmability requires generalized computation, storage, and communication resources that can be used to implement different kinds of algorithms

• Domain-specific processors trade some of the flexibility of a general-purpose programmable device for higher levels of energy efficiency, while retaining the flexibility to handle a variety of algorithms

Flexibility vs. Energy-Efficiency

• Trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.

• For parallel signal processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications with common features on dedicated, optimized processing elements with minimum energy overhead.

Application Domains

CELP-Based Speech Coding

• LPC Analysis and Synthesis

• Codebook Search

• Lag Computation

DCT-Based Video Compression and Decompression

• DCT and Inverse DCT

• Motion Estimation and Compensation

• Huffman Coding and Decoding

Baseband Processing for Digital Radios

• Demodulation, Channel Equalization

• Timing Recovery, Error Correction

The Re-configurable Terminal

Low-Power Multimedia Processing

• Hybrid, Re-configurable Architecture

– application-specific, parallelism, pipelining,

– locality, minimum control overhead, zero power when idle

• Task Scheduling and Miscellaneous Functions on Embedded Core Processor (low speed, minimum functionality)

• Standardized Communication Protocols reduce Design Cycle and enable High Level Support

• Use extensive set of low-power circuit techniques

– Reduced swing, variable voltages and frequency, self-timing, locally generated clocks

Arithmetic Energy Profile: VSELP Speech Coder

Lag Computation + Basic Vector Filtering + Codebook Search = 76% of total time

Hybrid Architecture Template

The dominant, energy-intensive computational kernels of a given domain of algorithms are implemented as a set of independent, concurrent threads of computation on the satellite processors.

The Proposed Architecture, Arthur Abnous and Jan Rabaey, UC-Berkeley

Energy-Efficiency + Domain-Specific Programmability

Control Processor

• The main task of the control processor is to configure the satellite processors and the communication network, and to manage the overall control flow of a given signal processing algorithm

• It uses the available satellite processors and the re-configurable interconnect to compose, in hardware, the data flow graph corresponding to a given kernel of computation

Overlay operation

• Control processor configures network and co-processors

• Co-processors operate in distributed “data-driven” mode

• At completion, control returns to the core processor for next reconfiguration
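The three overlay steps above can be sketched in a few lines of Python. This is only an illustrative model, not the Pleiades implementation: the `Satellite` and `ControlProcessor` classes, the `run_kernel` method, and the use of a simple chain in place of the reconfigurable network are all assumptions made for the example.

```python
class Satellite:
    """Hypothetical satellite processor: it applies whatever function it was configured with."""

    def __init__(self, name):
        self.name = name
        self.func = None

    def configure(self, func):
        self.func = func


class ControlProcessor:
    """Hypothetical control processor that sequences kernels but does no data processing."""

    def __init__(self, satellites):
        self.satellites = {s.name: s for s in satellites}
        self.log = []

    def run_kernel(self, kernel_name, chain, data):
        # 1. Configure the satellites; the order of `chain` stands in for the network setup.
        for sat_name, func in chain:
            self.satellites[sat_name].configure(func)
        self.log.append("configured " + kernel_name)
        # 2. Data flows through the configured satellites without control intervention.
        for sat_name, _ in chain:
            data = self.satellites[sat_name].func(data)
        # 3. Control returns to the core processor, ready for the next reconfiguration.
        self.log.append("completed " + kernel_name)
        return data


control = ControlProcessor([Satellite("mul"), Satellite("acc")])
dot_chain = [
    ("mul", lambda pair: [a * b for a, b in zip(*pair)]),
    ("acc", sum),
]
result = control.run_kernel("dot_product", dot_chain, ([1, 2, 3], [4, 5, 6]))
```

Here the same two satellites could be reconfigured with a different chain for the next kernel, which is the point of the overlay scheme.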

Satellite Processors

Elements of Energy-Efficiency

Multi-Processor Implementation

Communication Network

Distributed Data-Driven Control

Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.
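A minimal sketch of this token-driven firing rule, assuming a hypothetical `Module` class with one queue per input port. The `firings` counter is only a stand-in for switching activity; it is not a power model.

```python
from collections import deque


class Module:
    """Hypothetical data-driven module: it fires only when every input holds a token."""

    def __init__(self, name, func, n_inputs):
        self.name = name
        self.func = func
        self.inputs = [deque() for _ in range(n_inputs)]
        self.output = deque()
        self.firings = 0  # stands in for switching activity

    def put(self, port, token):
        self.inputs[port].append(token)

    def step(self):
        """Fire once if all inputs have a token; otherwise stay idle."""
        if all(self.inputs):  # an empty deque is falsy
            args = [queue.popleft() for queue in self.inputs]
            self.output.append(self.func(*args))
            self.firings += 1
            return True
        return False


mul = Module("mul", lambda a, b: a * b, n_inputs=2)
idle_fired = mul.step()       # no tokens: nothing happens
mul.put(0, 3)
partial_fired = mul.step()    # only one operand present: still idle
mul.put(1, 4)
ready_fired = mul.step()      # both operands present: the module fires
```

With no tokens queued, `step` leaves the module state untouched, which mirrors the claim that an idle module sees no switching activity.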

Implementation of Handshaking

Single-Wire, Two-Phase Asynchronous Handshaking

Protocol

Low Power Circuit Techniques

• Reduced swing interconnect (communication network, memories, programmable logic modules)

• On-chip dc-dc conversion + multiple supply voltages

• Locally synchronous, globally asynchronous

• Automatic power-down

• Optimized libraries (0.6 µm CMOS + Cadence/Synopsys design flow)

Power-Variable Performance


Design Methodology

Switching Activity Reduction

(a) Average activity in a multiplier as a function of the constant value

(b) Parallel and serial implementations of an adder tree

VSELP Synthesis Filter Mapped onto Satellite Processors

Mappings of VSELP Kernel

• The most energy efficient CELP-based speech algorithm

- dissipates 36 mW (Vdd = 1.8 V, 0.5 um CMOS)

- requires 23.4 MOPS

• Proposed VSELP speech coder

- 0.6 um CMOS

- dissipates under 5 mW

Case Studies

• Voice coder for cellular

• Video decoder

• Baseband radio modem

• Security - encryption processor

Architecture for vector dot product

[Figure: block diagram showing the configuration bus (strobe, address, data), two memories, two address generators (AddGen), a MAC, two input ports (IP1, IP2), one output port (OP), and the network (6 buses)]

• 0.6 µm CMOS process

• Supply voltage: 1.5 V

• Power estimation tool

– PowerMill

• 1 MAC, 2 SRAMs, 2 address generators, 2 external input ports, 1 external output port

• All data and address values are 16 bits.
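As a rough behavioral sketch of this datapath, the code below streams operands from two memories, each addressed by its own address generator, into a single multiply-accumulate unit. The class and function names, the signed interpretation of the 16-bit data, and the unbounded accumulator width are assumptions made for the example; the slides do not specify them.

```python
MASK16 = 0xFFFF  # data and address values are 16 bits wide


class AddressGenerator:
    """Hypothetical address generator: emits base, base + stride, ... for `count` accesses."""

    def __init__(self, base, stride, count):
        self.base, self.stride, self.count = base, stride, count

    def __iter__(self):
        for k in range(self.count):
            yield (self.base + k * self.stride) & MASK16


def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement value (an assumption)."""
    word &= MASK16
    return word - 0x10000 if word & 0x8000 else word


def dot_product(mem_a, mem_b, gen_a, gen_b):
    acc = 0  # accumulator width is not modeled (an assumption)
    for addr_a, addr_b in zip(gen_a, gen_b):
        a = to_signed16(mem_a[addr_a])
        b = to_signed16(mem_b[addr_b])
        acc += a * b  # one multiply-accumulate per operand pair
    return acc


mem_a = [1, 2, 3, 4]
mem_b = [5, 6, 7, 8]
total = dot_product(mem_a, mem_b, AddressGenerator(0, 1, 4), AddressGenerator(0, 1, 4))
```

Changing only the stride or base of a generator changes which vector elements are paired, without touching the MAC.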

Result

• The most energy efficient CELP-based speech algorithm

- dissipates 36 mW ( Vdd = 1.8V, 0.5 um CMOS)

- requires 23.4 MOPS

• Proposed VSELP speech coder

- 0.6 um CMOS

- dissipates under 5 mW
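To put these figures on a common footing, power can be divided by operation rate to get energy per operation. The sketch below assumes the 36 mW design sustains exactly the 23.4 MOPS the algorithm requires, and treats "under 5 mW" as an upper bound. The two designs use different processes and supply voltages, so the ratio is only indicative.

```python
def energy_per_operation(power_watts, ops_per_second):
    """Average energy per operation in joules."""
    return power_watts / ops_per_second


baseline_power = 36e-3        # W, most energy efficient CELP-based implementation
proposed_power_bound = 5e-3   # W, upper bound for the proposed VSELP coder
vselp_rate = 23.4e6           # operations per second required by the algorithm

baseline_energy = energy_per_operation(baseline_power, vselp_rate)
min_power_ratio = baseline_power / proposed_power_bound
```

Under these assumptions the baseline spends roughly 1.5 nJ per operation, and the proposed coder improves on its power by a factor of at least 7.2.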

IIR Mapping

IIR Comparison

FFT Mapping

FFT Comparison

Result

FIR Results

Processor                StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)          169        20          40           6        14
# of Multipliers         0.5        1           1            5        1
Throughput (cycle/tap)   17         1           1            0.2      1
Energy/tap (J)           37.4n      1.3n        600p         2.2n     205p

IIR Results

Processor                StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)          169        20          40           2.1      14
# of Multipliers         0.5        1           1            9        2
Throughput (cycle/IIR)   114        20          13           1        8
Energy/IIR (J)           277n       19.1n       9.5n         103n     1.9n

FFT Results

Processor                StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)          169        20          40           -        14
# of Multipliers         0.5        1           1            -        4
Throughput (cycle/stage) 766        152         76           -        8
Energy/stage (J)         1870n      131n        49.3n        -        13.3n

StrongARM: microprocessor [2]

TMS320C2xx: DSP chip [3,4,5,6]

TMS320LC54x: DSP chip [7,8,12]

XC4003A: FPGA chip [9,10]

Conclusions

• The StrongARM has the worst performance of all because it takes many instructions and cycles to execute a kernel in a highly sequential manner.

– The lack of a single-cycle multiplier exacerbates this problem.

– The other architectures have more internal parallelism, which allows them to achieve superior performance.

• Pleiades (architecture for vector dot product) does much better on the energy scale than the TI DSPs.

– Because DSPs are general-purpose, and instruction execution involves a great deal of overhead.

– Pleiades has the ability to create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.

• Pleiades outperforms the other processors by a large margin owing to its ability to exploit higher levels of parallelism by creating a dedicated parallel structure from its computational resources and flexible interconnect.
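The energy claim can be made concrete with the FIR energy-per-tap figures from the results slide. The sketch below assumes the per-processor assignment of those figures as read from that slide (37.4 nJ, 1.3 nJ, 600 pJ, 2.2 nJ and 205 pJ in the order the processors are listed); the helper name `ratio_to` is hypothetical.

```python
# FIR energy per tap in joules, as read from the results slide (assignment is an assumption)
fir_energy_per_tap = {
    "StrongARM": 37.4e-9,
    "TMS320C2xx": 1.3e-9,
    "TMS320LC54x": 600e-12,
    "XC4003A": 2.2e-9,
    "Pleiades": 205e-12,
}


def ratio_to(reference, table):
    """Express every entry as a multiple of the reference entry."""
    ref = table[reference]
    return {name: value / ref for name, value in table.items()}


fir_ratios = ratio_to("Pleiades", fir_energy_per_tap)
```

On these numbers the best DSP spends about 2.9 times the energy per tap of Pleiades, and the StrongARM about 180 times.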

Reconfiguration for Power Saving in Real-Time Motion Estimation, S. R. Park, UMASS

Motion Estimation

Block Matching Algorithm

Configurable H/W Paradigms

Programmable Logic Modules

Why Hardware for Motion Estimation?

• Most Computationally demanding part of Video Encoding

• Example: CCIR 601 format

• 720 by 576 pixel

• 16 by 16 macro block (n = 16)

• 32 by 32 search area (p = 8)

• 25 Hz Frame rate (f_frame = 25)

• 9 Giga Operations/Sec is needed for Full Search Block Matching Algorithm.
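The 9 GOPS figure can be reproduced from the parameters above. The sketch assumes three operations per pixel comparison (subtract, absolute value, accumulate) and (2p + 1)^2 candidate positions per macroblock; neither assumption is stated on the slide, but together they give the quoted number.

```python
def full_search_ops_per_second(width, height, n, p, frame_rate, ops_per_pixel=3):
    """Operation rate of full-search block matching under the assumptions above."""
    macroblocks = (width // n) * (height // n)   # blocks per frame
    candidates = (2 * p + 1) ** 2                # displacements from -p to +p on each axis
    ops_per_candidate = ops_per_pixel * n * n    # one comparison per pixel of the block
    return macroblocks * candidates * ops_per_candidate * frame_rate


ccir601_ops = full_search_ops_per_second(720, 576, n=16, p=8, frame_rate=25)
```

For the CCIR 601 example this evaluates to about 8.99e9 operations per second, which rounds to the 9 Giga Operations/Sec quoted above.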

Why Reconfiguration in Motion Estimation?

• Adjusting the search area at frame-rate according to the changing characteristics of video sequences

• Reducing Power Consumption by avoiding unnecessary computation
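The effect of the search range on the amount of computation can be seen in a small software model of full-search block matching. This is a generic sum-of-absolute-differences search written for illustration, not the hardware architecture discussed here; the function names and the synthetic test frames are assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between a current block and a displaced reference block."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Return (best_vector, best_sad, candidates_evaluated) for one n-by-n block."""
    height, width = len(ref), len(ref[0])
    best_vec, best_sad, evaluated = None, None, 0
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the reference frame.
            if by + dy < 0 or by + dy + n > height or bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            evaluated += 1
            if best_sad is None or cost < best_sad:
                best_vec, best_sad = (dx, dy), cost
    return best_vec, best_sad, evaluated


# Synthetic frames: the current frame is the reference shifted by (2, 1).
SIZE = 16
ref = [[x * x + 40 * y * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[min(y + 1, SIZE - 1)][min(x + 2, SIZE - 1)] for x in range(SIZE)] for y in range(SIZE)]

wide = full_search(cur, ref, bx=6, by=6, n=4, p=3)
narrow = full_search(cur, ref, bx=6, by=6, n=4, p=2)
```

Both searches find the same motion vector, but the narrower one evaluates about half as many candidates. That is the saving an adaptive search area aims for when motion is small.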

Motion Vector Distributions

Architecture for Motion Estimation

From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of the IEEE, 1995

Re-configurable Architecture for ME

Power Estimation in Reconfigurable Architecture

Power vs Search area

Resource Reuse in FPGAs

Conclusion

• By adjusting the search area according to the changing characteristics of a picture, power can be saved. Further power saving can be achieved by utilizing freed up resources for local memory

• Extension of Adaptive Search Space Method to Software Implementation

– Varying p still reduces computation and hence power

– Resource reuse may also be applicable in S/W implementation by freeing up cache space and compute power for more power efficient use of memory

Future Works

• Reconfiguration to support more sophisticated motion estimation algorithms (intelligent search, object-based, ...)

• More detailed performance studies over a wider range of video sequences

• Generalization of this concept to other algorithms and architectures (not just video)

• Modification to FPGA architectures to support the use of logic and configuration cells as local memory

Motion Estimation - Conventional

Motion Estimation - Data Reuse

[Equation: P_a and P_b expressed in terms of P_add and P_abs; the expressions are not recoverable from the extracted text]

Therefore, the power reduction factor is 11%.

Kernel Scheduling in Reconfigurable Computing

• R. Maestre, F. J. Kurdahi, N. Bagherzadeh, H. Singh, R. Hermida, M. Fernandez, Design and Test in Europe, DATE99, Munich, Germany, Mar 99

The Partition

Partitioning finds subsets of kernels that may be scheduled (executed) independently of the other kernels.

Partitioning of the application DFG

The Scheduling

Scheduling is then performed in detail within each partition.

Scheduling within a given partition

The Major Criteria

[Figure: kernels ME, MC, DCT, Q, IQ, IDCT, IMC with their number of contexts and granularity of computation]

a) MPEG sequence and granularity, b) a possible schedule of an image frame, c) an alternative schedule

• Context reloading

– Minimize

• Data reuse

– Maximize

• Computation and data movement overlapping

– Maximize
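The first criterion can be illustrated by counting context loads for two orderings of the MPEG kernels over a frame of 396 macroblocks. This is a simplified sketch: it assumes only one kernel's context is resident at a time and that a load is needed whenever the running kernel changes. The function names are hypothetical and the other two criteria are not modeled.

```python
KERNELS = ["ME", "MC", "DCT", "Q", "IQ", "IDCT", "IMC"]


def schedule_per_macroblock(kernels, macroblocks):
    """Run the whole kernel sequence on one macroblock before moving to the next."""
    return [k for _ in range(macroblocks) for k in kernels]


def schedule_per_kernel(kernels, macroblocks):
    """Run each kernel over every macroblock of the frame before switching kernels."""
    return [k for k in kernels for _ in range(macroblocks)]


def context_loads(schedule):
    """Count how often the running kernel changes (one context load per change)."""
    loads, current = 0, None
    for kernel in schedule:
        if kernel != current:
            loads += 1
            current = kernel
    return loads


loads_interleaved = context_loads(schedule_per_macroblock(KERNELS, 396))
loads_grouped = context_loads(schedule_per_kernel(KERNELS, 396))
```

Grouping by kernel needs far fewer context loads, but it also has to hold a whole frame of intermediate results between kernels, which works against the data reuse criterion. A scheduler therefore has to balance the three criteria rather than optimize one.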

Scheduling

Partition = {k1, k2, k3}. A possible schedule:

[Figure: timeline of CM, FB set 1 and FB set 2 activity across iterations i-1, i and i+1, showing context loads, data loads, computations, result reads and idle time]

α^i = event α in iteration i.

k_i = Computation time.

kc_i = Possible overlap of computation and context loading.

C_i = Context loading time.

D_i = Data loading time.

R_i = Result reading time.

< Execution model representation >

Algorithm

[Figure: kernels K_i, K_j, K_m, K_p connected by edges 1 to 4; successive exploration steps with LEE = φ, {1}, {1, 2}, {1, 3}, {1, 3, 4}, {1, 4}, and BC = TRUE at each step]

< Some steps of an exploration sequence >

References

[1] A. Abnous and J. Rabaey, “Ultra-Low-Power Domain-Specific Multimedia Processors”, Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, Oct. 1996.

[2] Digital Semiconductor, Digital Semiconductor SA-110 Microprocessor Technical Reference Manual, Digital Equipment Corporation, 1996.

[3] TMS320C5x General-Purpose Applications User’s Guide, Literature Number SPRU164, TI, 1997.

[4] T. Anderson, The TMS320C2xx Sum-of-Products Methodology, Technical Application Report SPRA068, TI, 1996.

[5] M. Tsai, IIR Filter Design on the TMS320C54x DSP, Technical Application Report SPRA079, TI, 1996.

[6] ftp://ftp.ti.com/pub/tms320bbs/c5xxfiles/54xffts.exe, ‘C54x Software Support Files, TI.

[7] C. Turner, Calculation of TMS320LC54x Power Dissipation, Technical Application Report SPRA164, TI, 1997.

[8] C. Turner, Calculation of TMS320LC54x Power Dissipation, Technical Application Report SPRA088, TI, 1996.

[9] E. Kusse, Personal communication, 1996.

[10] J. Rabaey et al., “Fast Prototyping of Data Path Intensive Architecture”, IEEE Design & Test Magazine, Vol. 8, No. 2, pp. 40-51, 1991.

[11] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor”, IEEE Journal of Solid-State Circuits, Vol. 31, No. 11, pp. 1703-1714, Nov. 1996.

[12] A. Fischman and P. Rowland, Designing Low-Power Applications with TMS320LC54x, Technical Application Report SPRA281, TI, 1997.

[13] Daniel D. Gajski, Nikil D. Dutt, Allen C-H Wu, Steve Y-L Lin, “High-Level Synthesis: Introduction to Chip and System Design,” Kluwer Academic Publishers, 1992.

[14] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, “Splash 2: FPGAs in a Custom Computing Machine,” IEEE Computer Society Press, Los Alamitos, California.

[15] Jonathan Babb, Russell Tessier, Mathew Dahl, Silvina Zimi Hanono, David M. Hoki, and Anant Agarwal, “Logic emulation with virtual wires,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 16, No. 6, June 1997.

[16] M. Vasilko, Djamel Ait-Boudaoud, “Architectural synthesis techniques for dynamically reconfigurable logic,” Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25, 1996.

[17] Patrick Lysaght, Gordon McGregor and Jonathan Stockwood, “Configuration Controller Synthesis for Dynamically Reconfigurable Systems,” IEE Colloquium on Hardware-Software Cosynthesis for Reconfigurable Systems, 1996.

[18] M. Vasilko, Djamel Ait-Boudaoud, “Scheduling for dynamically reconfigurable FPGAs,” Proceedings of International Workshop on Logic and Architecture Synthesis, pp. 328-336, IFIP TC10 WG10.5, Dec. 18-19, 1995.

[19] Doug Smith, Dinesh Bhatia, “RACE: Reconfigurable and Adaptive Computing Environment,” Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25, 1996. See http://www.ececs.uc.edu/~dal.

[20] Xilinx Netlist Format (XNF) Specification, Version 6.1, June 1, 1995.

[21] Xilinx XABEL reference manual.
