L33: Low Power Reconfigurable System

Jun-Dong Cho, SungKyunKwan Univ.

Dept. of ECE, Vada Lab. http://vada.skku.ac.kr

Answer IV: Reconfigurable Processor

• Configurable datapaths (e.g., splittable ALUs, complex operations)

• Configurable interconnect (e.g., nearest neighbor, k buses)

• SIMD processor, many functional units, preferably VLIW, possibly superscalar

ULTRA-LOW-POWER DOMAIN-SPECIFIC MULTIMEDIA PROCESSORS

• Arthur Abnous and Jan Rabaey

• Programmability requires generalized computation, storage, and communication systems that can be used to implement different kinds of algorithms

• Domain-specific processors trade some of the flexibility of a general-purpose programmable device for higher energy efficiency, while retaining enough flexibility to handle a variety of algorithms

Flexibility vs. Energy-Efficiency

• There is a trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.

• Signal processing algorithms can achieve significant power savings by executing the dominant computational kernels of a given class of applications with common features on dedicated, optimized processing elements with minimum energy overhead.

Application Domains

CELP-Based Speech Coding

• LPC Analysis and Synthesis

• Codebook Search

• Lag Computation

DCT-Based Video Compression and Decompression

• DCT and Inverse DCT

• Motion Estimation and Compensation

• Huffman Coding and Decoding

Baseband Processing for Digital Radios

• Demodulation, Channel Equalization

• Timing Recovery, Error Correction

The Re-configurable Terminal

Low-Power Multimedia Processing

• Hybrid, Re-configurable Architecture

– application-specific, parallelism, pipelining

– locality, minimum control overhead, zero power when idle

• Task Scheduling and Miscellaneous Functions on Embedded Core Processor (low speed, minimum functionality)

• Standardized Communication Protocols reduce Design Cycle and enable High Level Support

• Use extensive set of low-power circuit techniques

– Reduced swing, variable voltages and frequency, self-timing, locally generated clocks

Arithmetic Energy Profile: VSELP Speech Coder

Lag Computation + Basic Vector Filtering + Codebook Search = 76% of total time

Hybrid Architecture Template

The dominant, energy-intensive computational kernels of a given domain of algorithms are implemented as a set of independent, concurrent threads of computation on the satellite processors.

The Proposed Architecture, Arthur Abnous and Jan Rabaey, UC Berkeley

Energy-Efficiency + Domain-Specific Programmability

Control Processor

• The main task of the control processor is to configure the satellite processors and the communication network and to manage the overall control flow of a given signal processing algorithm

• It uses the available satellite processors and the re-configurable interconnect to compose, in hardware, the data flow graph corresponding to a given kernel of computation

Overlay operation

• Control processor configures network and co-processors

• Co-processors operate in distributed “data-driven” mode

• At completion, control returns to the core processor for next reconfiguration

Satellite Processors

Elements of Energy-Efficiency

Multi-Processor Implementation

Communication Network

Distributed Data-Driven Control

Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.
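The token-triggered behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Module` class, the `run` loop, and the use of an activation counter as a stand-in for switching activity are all hypothetical and are not taken from the slides or from the Pleiades hardware.

```python
from collections import deque


class Module:
    """Hypothetical module that fires only when every input holds a token."""

    def __init__(self, name, func, n_inputs):
        self.name = name
        self.func = func
        self.inputs = [deque() for _ in range(n_inputs)]
        self.consumers = []   # (module, input_port) pairs fed by this module
        self.fired = 0        # activation count, a stand-in for switching activity
        self.results = []

    def connect(self, consumer, port):
        self.consumers.append((consumer, port))

    def ready(self):
        # An empty deque is falsy, so this is True only if all inputs hold tokens.
        return all(self.inputs)

    def fire(self):
        args = [queue.popleft() for queue in self.inputs]
        value = self.func(*args)
        self.fired += 1
        self.results.append(value)
        for consumer, port in self.consumers:
            consumer.inputs[port].append(value)


def run(modules):
    """Fire ready modules until no module has tokens left to process."""
    progress = True
    while progress:
        progress = False
        for module in modules:
            if module.ready():
                module.fire()
                progress = True


acc_state = {"sum": 0}


def accumulate(x):
    acc_state["sum"] += x
    return acc_state["sum"]


mul = Module("mul", lambda a, b: a * b, 2)
acc = Module("acc", accumulate, 1)
idle = Module("idle", lambda a: a, 1)   # never receives a token
mul.connect(acc, 0)

for a, b in [(1, 2), (3, 4), (5, 6)]:
    mul.inputs[0].append(a)
    mul.inputs[1].append(b)

run([mul, acc, idle])
print(acc.results, mul.fired, acc.fired, idle.fired)
```

In this model the `idle` module is never activated because no token reaches it, which mirrors the statement that a module without tokens shows no switching activity.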

Implementation of Handshaking

Single-Wire, Two-Phase Asynchronous Handshaking Protocol

Low Power Circuit Techniques

• Reduced swing interconnect (communication network, memories, programmable logic modules)

• On-chip dc-dc conversion + multiple supply voltages

• Locally synchronous, globally asynchronous

• Automatic power-down

• Optimized libraries (0.6 µm CMOS + Cadence/Synopsys design flow)

Power-Variable Performance

Design Methodology

Switching Activity Reduction

(a) Average activity in a multiplier as a function of the constant value

(b) Parallel and serial implementations of an adder tree

VSELP Synthesis Filter Mapped onto Satellite Processors

Mappings of VSELP Kernel

• The most energy-efficient CELP-based speech algorithm

- dissipates 36 mW (Vdd = 1.8 V, 0.5 µm CMOS)

- requires 23.4 MOPS

• Proposed VSELP speech coder

- 0.6 µm CMOS

- dissipates under 5 mW

Case Studies

• Voice coder for cellular

• Video decoder

• Baseband radio modem

• Security - encryption processor

Architecture for vector dot product

[Figure: block diagram with two Memory blocks, two AddGen blocks, a MAC, two IPorts and one OPort connected through a Network (6 buses); control signals include Configuration Bus (Strobe, Address, Data), Network Reset, Satellite Reset, Slow Mode, and AutoAck Mode]

• 0.6 µm CMOS process

• Supply voltage: 1.5 V

• Power estimation tool

– PowerMill

• 1 MAC, 2 SRAMs, 2 address generators, 2 external input ports, 1 external output port

• All data and address values are 16 bits.
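The resource list above (one MAC fed by two memories, each with its own address generator) can be mimicked in software. The following is a minimal behavioural sketch, not the hardware design: the function names, the unit-stride addressing, the signed interpretation of the 16-bit words, and the unbounded accumulator width are all assumptions made for illustration.

```python
MASK16 = 0xFFFF


def address_generator(start, stride, count):
    """Yield `count` 16-bit addresses (a hypothetical stand-in for an AddGen)."""
    addr = start
    for _ in range(count):
        yield addr & MASK16
        addr += stride


def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement value (assumed format)."""
    word &= MASK16
    return word - 0x10000 if word & 0x8000 else word


def dot_product(mem_a, mem_b, length):
    """Stream two operand vectors through a single multiply-accumulate step."""
    acc = 0        # accumulator width is not modelled here
    mac_ops = 0
    for addr_a, addr_b in zip(address_generator(0, 1, length),
                              address_generator(0, 1, length)):
        acc += to_signed16(mem_a[addr_a]) * to_signed16(mem_b[addr_b])
        mac_ops += 1
    return acc, mac_ops


mem_a = [1, 2, 3, 4]
mem_b = [5, 6, 7, 0xFFFF]   # 0xFFFF is -1 as a signed 16-bit value
result, ops = dot_product(mem_a, mem_b, 4)
print(result, ops)
```

One multiply-accumulate is issued per element pair, so the MAC count equals the vector length.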

Result

• The most energy-efficient CELP-based speech algorithm

- dissipates 36 mW (Vdd = 1.8 V, 0.5 µm CMOS)

- requires 23.4 MOPS

• Proposed VSELP speech coder

- 0.6 µm CMOS

- dissipates under 5 mW
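A quick calculation puts the two figures above side by side. It is a sketch only: the variable names are illustrative, the 5 mW value is an upper bound rather than a measurement, and the two designs use different process nodes, so the ratio is indicative rather than a like-for-like comparison.

```python
reference_power_w = 36e-3        # reference CELP-based coder, from the slide
reference_rate_ops = 23.4e6      # 23.4 MOPS, from the slide
proposed_power_bound_w = 5e-3    # "under 5 mW", treated as an upper bound

# Average energy per operation of the reference design.
energy_per_op_j = reference_power_w / reference_rate_ops

# Lower bound on the power ratio, since the proposed coder is below 5 mW.
min_power_ratio = reference_power_w / proposed_power_bound_w

print(f"{energy_per_op_j * 1e9:.2f} nJ/op, at least {min_power_ratio:.1f}x lower power")
```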

IIR Mapping

IIR Comparison

FFT Mapping

FFT Comparison

Result

FIR Results

Processor                 StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)           169        20          40           6        14
# of Multipliers          0.5        1           1            5        1
Throughput (cycle/tap)    17         1           1            0.2      1
Energy/tap (J)            37.4n      1.3n        600p         2.2n     205p

IIR Results

Processor                 StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)           169        20          40           2.1      14
# of Multipliers          0.5        1           1            9        2
Throughput (cycle/IIR)    114        20          13           1        8
Energy/IIR (J)            277n       19.1n       9.5n         103n     1.9n

FFT Results

Processor                 StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)           169        20          40           -        14
# of Multipliers          0.5        1           1            -        4
Throughput (cycle/stage)  766        152         76           -        8
Energy/stage (J)          1870n      131n        49.3n        -        13.3n

StrongARM: microprocessor [2]

TMS320C2xx: DSP chip [3,4,5,6]

TMS320LC54x: DSP chip [7,8,12]

XC4003A: FPGA chip [9,10]

Conclusions

• The StrongARM has the worst performance of all because it takes many instructions and cycles to execute a kernel in a highly sequential manner.

– The lack of a single-cycle multiplier exacerbates this problem.

– The other architectures have more internal parallelism, which allows them to achieve superior performance.

• Pleiades (architecture for vector dot product) does much better on the energy scale than the TI DSPs.

– DSPs are general-purpose, and instruction execution involves a great deal of overhead.

– Pleiades has the ability to create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.

• Pleiades outperforms the other processors by a large margin owing to its ability to exploit higher levels of parallelism by creating a dedicated parallel structure from its computational resources and flexible interconnect.
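The energy claims in these conclusions can be put into numbers using the per-operation energies from the FIR, IIR, and FFT results. The sketch below is hedged accordingly: the values are as read from the flattened results tables, so the assignment of each value to a processor is an inference, and the dictionary layout and function name are illustrative only. The XC4003A has no FFT entry in the source and is left out of that row.

```python
# Energy per operation in joules, as read from the results tables.
ENERGY_J = {
    "FIR, per tap": {
        "StrongARM": 37.4e-9, "TMS320C2xx": 1.3e-9, "TMS320LC54x": 600e-12,
        "XC4003A": 2.2e-9, "Pleiades": 205e-12,
    },
    "IIR": {
        "StrongARM": 277e-9, "TMS320C2xx": 19.1e-9, "TMS320LC54x": 9.5e-9,
        "XC4003A": 103e-9, "Pleiades": 1.9e-9,
    },
    "FFT, per stage": {
        "StrongARM": 1870e-9, "TMS320C2xx": 131e-9, "TMS320LC54x": 49.3e-9,
        "Pleiades": 13.3e-9,
    },
}


def energy_ratio_to(baseline, energies):
    """Energy of every other processor divided by the baseline's energy."""
    base = energies[baseline]
    return {name: value / base
            for name, value in energies.items() if name != baseline}


for kernel, energies in ENERGY_J.items():
    ratios = energy_ratio_to("Pleiades", energies)
    print(kernel, {name: round(ratio, 1) for name, ratio in ratios.items()})
```

A ratio above 1 means the processor spends more energy per operation than Pleiades on that kernel.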

Reconfiguration for Power Saving in Real-Time Motion Estimation, S. R. Park, UMASS

Motion Estimation

Block Matching Algorithm

Configurable H/W Paradigms

Programmable Logic Modules

Why Hardware for Motion Estimation?

• Most computationally demanding part of video encoding

• Example: CCIR 601 format

– 720 by 576 pixels

– 16 by 16 macroblock (n = 16)

– 32 by 32 search area (p = 8)

– 25 Hz frame rate (f_frame = 25)

• About 9 giga-operations per second are needed for the Full Search Block Matching Algorithm.
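The 9 giga-operations figure can be reproduced from the example parameters. The sketch below assumes, since the slide does not state it, that full search evaluates (2p + 1)^2 candidate positions per macroblock and that each pixel comparison costs three operations (subtract, absolute value, accumulate); the function and parameter names are illustrative.

```python
def full_search_ops_per_second(width, height, n, p, frame_rate, ops_per_pixel=3):
    """Operation count of full-search block matching under the stated assumptions."""
    blocks_per_frame = (width // n) * (height // n)
    candidates = (2 * p + 1) ** 2                 # displacements in [-p, p] on both axes
    ops_per_block = candidates * n * n * ops_per_pixel
    return blocks_per_frame * ops_per_block * frame_rate


ops = full_search_ops_per_second(width=720, height=576, n=16, p=8, frame_rate=25)
print(f"{ops / 1e9:.2f} GOPS")
```

With these assumptions the count comes to just under 9 × 10^9 operations per second, in line with the figure on the slide.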

Why Reconfiguration in Motion Estimation?

• Adjusting the search area at frame-rate according to the changing characteristics of video sequences

• Reducing Power Consumption by avoiding unnecessary computation

Motion Vector Distributions
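To illustrate why shrinking the search area avoids unnecessary computation, the following sketch runs a plain full-search block matcher with an adjustable search range p. It is an assumption-laden illustration: the sum of absolute differences (SAD) cost, the function names, and the synthetic 16 by 16 test frame are all hypothetical and are not the architecture presented in the slides.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between a block and a displaced reference block."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Return (best motion vector, number of candidates evaluated) for one block."""
    height, width = len(ref), len(ref[0])
    best, best_sad, evaluated = (0, 0), None, 0
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            inside = (0 <= by + dy and by + dy + n <= height and
                      0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue   # skip candidates that fall outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            evaluated += 1
            if best_sad is None or cost < best_sad:
                best, best_sad = (dx, dy), cost
    return best, evaluated


# Synthetic frames: every reference pixel has a unique value, and the current
# frame is the reference displaced by (2, 1), so the true vector is known.
SIZE = 16
ref = [[16 * y + x for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[min(y + 1, SIZE - 1)][min(x + 2, SIZE - 1)] for x in range(SIZE)]
       for y in range(SIZE)]

wide = full_search(cur, ref, bx=6, by=6, n=4, p=4)
narrow = full_search(cur, ref, bx=6, by=6, n=4, p=2)
print(wide, narrow)
```

When the motion is small, the narrower range finds the same vector while evaluating far fewer candidates, which is the saving that frame-rate adjustment of the search area aims to exploit.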

Architecture for Motion Estimation

From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of the IEEE, 1995

Re-configurable Architecture for ME

Power Estimation in Reconfigurable Architecture

Power vs Search area

Resource Reuse in FPGAs

Conclusion

• By adjusting the search area according to the changing characteristics of a picture, power can be saved. Further power savings can be achieved by using the freed-up resources as local memory

• Extension of the Adaptive Search Space Method to Software Implementation

– Varying p still reduces computation and hence power

– Resource reuse may also be applicable in S/W implementation by freeing up cache space and compute power for more power-efficient use of memory

Future Works

• Reconfiguration to support more sophisticated motion estimation algorithms (intelligent search, object-based, ...)

• More detailed performance studies over a wider range of video sequences

• Generalization of this concept to other algorithms and architectures (not just video)

• Modification to FPGA architectures to support the use of logic and configuration cells as local memory

Motion Estimation - Conventional

Motion Estimation - Data Reuse

[Equation: power expressions P_a and P_b in terms of P_add and P_abs; the expressions are not recoverable from the source]

Therefore, the power reduction factor is 11%.

Kernel Scheduling in Reconfigurable Computing

• R. Maestre, F. J. Kurdahi, N. Bagherzadeh, H. Singh, R. Hermida, M. Fernandez, Design, Automation and Test in Europe (DATE 99), Munich, Germany, Mar. 1999

The Partition

Partitioning finds subsets of kernels that may be scheduled (executed) independently of the other kernels.

Partitioning of the application DFG

The Scheduling

Scheduling is then performed in detail within each partition.

Scheduling within a given partition

The Major Criteria

[Figure: a) MPEG sequence (ME, MC, DCT, Q, IQ, IDCT, IMC) with the number of contexts and the granularity of computation of each kernel, b) a possible schedule of an image frame, c) an alternative schedule]

• Context reloading

– Minimize

• Data reuse

– Maximize

• Computation and data movement overlapping

– Maximize
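The tension between the first two criteria can be sketched by counting context loads for two orderings of the same work. This is a toy model and not the scheduling algorithm from the cited paper: it assumes only one kernel's context is resident at a time, it takes the kernel chain from the MPEG sequence in the figure, and it treats the figure's per-frame count of 396 as the number of iterations per frame. All function names are hypothetical.

```python
KERNELS = ["ME", "MC", "DCT", "Q", "IQ", "IDCT", "IMC"]
ITERATIONS_PER_FRAME = 396   # per-frame count shown in the figure (assumed unit)


def kernel_major(kernels, iterations):
    """Run each kernel over the whole frame before moving to the next kernel."""
    return [kernel for kernel in kernels for _ in range(iterations)]


def iteration_major(kernels, iterations):
    """Run the whole kernel chain on one unit of data before moving to the next."""
    return [kernel for _ in range(iterations) for kernel in kernels]


def context_loads(schedule):
    """Count context loads, assuming a single resident context at any time."""
    loads, resident = 0, None
    for kernel in schedule:
        if kernel != resident:
            loads += 1
            resident = kernel
    return loads


by_kernel = context_loads(kernel_major(KERNELS, ITERATIONS_PER_FRAME))
by_iteration = context_loads(iteration_major(KERNELS, ITERATIONS_PER_FRAME))
print(by_kernel, by_iteration)
```

The kernel-major order minimizes context reloading but keeps intermediate results for the whole frame alive between kernels, while the iteration-major order reuses data immediately at the cost of many more context loads. A practical schedule sits between the two extremes.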

Scheduling

[Figure: execution model representation. A possible schedule for the partition {k1, k2, k3}, drawn as time lines for CM, FB set 1, and FB set 2, with kc1 = 0 and idle time marked]

α^i = event α in iteration i.

k_i = computation time.

kc_i = possible overlap of computation and context loading.

C_i = context loading time.

D_i = data loading time.

R_i = result reading time.
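One simplified reading of these quantities is that context loading for a kernel costs time only to the extent that it cannot hide behind the allowed overlap. The sketch below encodes that reading. It is not the cost model of the cited paper: the `Kernel` dataclass, the stall formula max(0, C_i - kc_i), the decision to ignore data loading and result reading (assumed fully overlapped), and the example numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    k: float    # computation time k_i
    c: float    # context loading time C_i
    kc: float   # possible overlap of computation and context loading kc_i


def partition_time(kernels):
    """Total time when only the non-overlapped part of context loading stalls."""
    total = 0.0
    for kernel in kernels:
        stall = max(0.0, kernel.c - kernel.kc)   # context loading left exposed
        total += stall + kernel.k
    return total


def partition_time_no_overlap(kernels):
    """Reference case in which every context load is fully exposed."""
    return sum(kernel.c + kernel.k for kernel in kernels)


# Hypothetical partition {k1, k2, k3}; kc is 0 for the first kernel, as in the figure.
partition = [Kernel(k=10, c=4, kc=0), Kernel(k=8, c=3, kc=3), Kernel(k=6, c=5, kc=2)]
print(partition_time(partition), partition_time_no_overlap(partition))
```

Under this reading, a larger kc_i directly shortens the schedule, which is why overlapping computation and data movement is listed as a criterion to maximize.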

Algorithm

[Figure: some steps of an exploration sequence over the kernels K_i, K_j, K_m, K_p, with edges numbered 1 to 4. The successive panels show LEE = φ, {1}, {1, 2}, {1, 3}, {1, 3, 4}, and {1, 4}, with BC = TRUE in each panel]

References

[1] A. Abnous and J. Rabaey, “Ultra-Low-Power Domain-Specific Multimedia Processors”, Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, Oct. 1996.

[2] Digital Semiconductor, Digital Semiconductor SA-110 Microprocessor Technical Reference Manual, Digital Equipment Corporation, 1996.

[3] TMS320C5x General-Purpose Application User’s Guides, Literature Number SPRU164, TI, 1997.

[4] T. Anderson, The TMS320C2xx Sum-of-Products Methodology, Technical Application Report SPRA068, TI, 1996.

[5] M. Tsai, IIR Filter Design on the TMS320C54x DSP, Technical Application Report SPRA079, TI, 1996.

[6] Ftp://ftp.ti.com/pub/tms320bbs/c5xxfiles/54xffts.exe, ‘C54x Software Support Files, TI.

[7] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA164, TI, 1997.

[8] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA088, TI, 1996.

[9] E. Kusse, Personal communication, 1996.

[10] J. Rabaey et al., “Fast Prototyping of Data Path Intensive Architecture”, IEEE Design & Test Magazine, Vol. 8, No. 2, pp. 40-51, 1991.

[11] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor”, IEEE Journal of Solid-State Circuits, Vol. 31, No. 11, pp. 1703-1714, Nov. 1996.

[12] A. Fischman and P. Rowland, Designing Low-Power Applications with TMS320LC54x, Technical Application Report SPRA281, TI, 1997.

[13] Daniel D. Gajski, Nikil D. Dutt, Allen C-H Wu, Steve Y-L Lin, “High-Level Synthesis: Introduction to Chip and System Design,” Kluwer Academic Publishers, 1992.

[14] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, “Splash 2: FPGAs in a Custom Computing Machine,” IEEE Computer Society Press, Los Alamitos, California.

[15] Jonathan Babb, Russell Tessier, Mathew Dahl, Silvina Zimi Hanono, David M. Hoki, and Anant Agarwal, “Logic emulation with virtual wires,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 16, No. 6, June 1997.

[16] M. Vasilko, Djamel Ait-Boudaoud, “Architectural synthesis techniques for dynamically reconfigurable logic,” Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25, 1996.

[17] Patrick Lysaght, Gordon McGregor and Jonathan Stockwood, “Configuration Controller Synthesis for Dynamically Reconfigurable Systems,” IEE Colloquium on Hardware Software Cosynthesis for Reconfigurable Systems, 1996.

[18] M. Vasilko, Djamel Ait-Boudaoud, “Scheduling for dynamically reconfigurable FPGAs,” Proceedings of International Workshop on Logic and Architecture Synthesis, pp. 328-336, IFIP TC10 WG10.5, Dec. 18-19, 1995.

[19] Doug Smith, Dinesh Bhatia, “RACE: Reconfigurable and Adaptive Computing Environment,” Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25, 1996. See http://www.ececs.uc.edu/~dal.

[20] Xilinx Netlist Format (XNF) Specification, Version 6.1, June 1, 1995.

[21] Xilinx XABEL reference manual.