DESIGNER FORUM 2010
March 24 to 26, 2010
Ipojuca | Porto de Galinhas Beach, BRAZIL

PROCEEDINGS

Edval J. P. Santos, Cristiano C. Araújo, Valentin O. Roda
Editors


UFPE

www.fade.org.br


Design House


TABLE OF CONTENTS

AVR-compatible microcontroller, debug interface and WISHBONE bus .......... 1
Tropea S. E. and Caruso D. M., Instituto Nacional de Tecnología Industrial, Argentina

Academic experience on the incorporation of HDL-based design methodology into an electronic engineering program .......... 7
Martinez R., Corti R., D'Agostino E., Belmonte J. and Giandomenico E., Universidad Nacional de Rosario, Argentina

Audio over Ethernet: implementation using FPGA .......... 13
Mosquera J., Stoliar A., Pedre S., Sacco M. and Borensztejn P., Universidad de Buenos Aires, Argentina

Use of self-checking logic to minimize the effects of single event transients in space applications .......... 19
Ortega-Ruiz J. and Boemo E., Universidad Autónoma de Madrid, Spain

Wireless Internet configurable network module .......... 25
Schiavon M. I., Crepaldo D. A. and Martin R. L., Laboratorio de Microelectrónica, FCEIA, UNR, Argentina

MIC: a new compression method of instructions in hardware for embedded systems .......... 29
Dias W. R. A., Barreto R. da S. and Moreno E. D., Department of Computer Science, Federal University of Amazonas, Brazil, and Department of Computer Science, Federal University of Sergipe, Brazil

Embedded system that simulates ECG waveforms .......... 35
De Farias T. M. T. and De Lima J. A. G., Universidade Federal da Paraíba, Brazil

An FPGA based converter from fixed point to logarithmic number system for real time applications .......... 39
De Maria E. A. A., Maidana C. E. and Szklanny F. I., Universidad Nacional de La Matanza, Argentina

Hardware co-processing unit for real-time scheduling analysis .......... 43
Urriza J., Cayssials R. and Ferro E., Universidad Nacional del Sur, Argentina

Hardware implementation of the Minkowski method for fractal dimension computation .......... 47
Maximiliam Luppe, Universidade de São Paulo, Brazil

An entry level platform for teaching high performance reconfigurable computing .......... 53
Viana P., Soares D. and Torquato L., Federal University of Alagoas, Brazil

Derivation of PBKDF2 keys using FPGA .......... 57
Pedre S., Stoliar A. and Borensztejn P., Universidad de Buenos Aires, Argentina

Automatic synthesis of synchronous controllers with low activity of the clock .......... 63
Del Rios J., Oliveira D. L. and Romano L., Instituto Tecnológico de Aeronáutica, Brazil, and Centro Universitário da FEI, Brazil

Memory hierarchy tuning for energy consumption reduction based on particle swarm optimization (PSO) .......... 69
Cordeiro F. R., Caraciolo M. P., Ferreira L. P. and Silva-Filho A. G., Federal University of Pernambuco, Brazil

IP core of a reconfigurable cache memory .......... 75
Gazineu G. M., Silva-Filho A. G., Prado R. G., Carvalho G. R., Araujo A. H. C. B. S. and De Lima M. E., Universidade Federal de Pernambuco, Brazil

A note on modeling pulsed sequential circuits with VHDL .......... 81
Mesquita Junior A. C., Universidade Federal de Pernambuco, Brazil

Comparative study between the implementations of digital waveforms free of third harmonic on FPGA and microcontroller .......... 85
Freitas D. R. R. and Santos E. J. P., Universidade Federal de Pernambuco, Brazil

AVR-COMPATIBLE MICROCONTROLLER WITH DEBUG INTERFACE AND WISHBONE BUS

S. E. Tropea and D. M. Caruso

Electrónica e Informática, Instituto Nacional de Tecnología Industrial

Buenos Aires, Argentina. Email: [email protected]

ABSTRACT

In this work we present a microcontroller compatible with Atmel's AVR line. The implementation can be configured to be compatible with second- (e.g. ATtiny22), third- (e.g. ATmega103) and fourth-generation (e.g. ATmega8) AVRs.

The design includes the following compatible peripherals: interrupt controller, input/output ports, timers and counters, UART and watchdog.

To adapt it to different needs, an expansion interface based on the WISHBONE interconnection standard is included.

To ease application development on this platform, the core was fitted with a debug unit, and the necessary software was adapted to allow high-level debugging through a simple and intuitive user interface.

The design was verified using simulators and Xilinx FPGAs (Spartan II and 3A).

1. INTRODUCTION

Our laboratory frequently develops applications based on embedded microcontrollers. For this reason, when we took up FPGA technology we wanted to implement microcontrollers compatible with those used in such projects, so that team members not devoted to this new technology could still take part in FPGA developments.

With this goal we developed a microcontroller compatible with the PIC 16C84 [1], and later added a debug interface for it [2]. This microcontroller proved useful and was transferred to the aerospace industry [3]. However, its architecture is not well suited to programming in C. In recent years our laboratory has moved to Atmel's AVR line, which was indeed designed to be programmed in C, so we decided to develop an FPGA equivalent.

In this paper we present the characteristics of the developed microcontroller and of a debug interface that allows programs written in C to be debugged from a PC using simple and intuitive software.

2. MICROCONTROLLER

2.1. Architecture

The AVR is an 8-bit RISC microcontroller with two completely independent memory spaces: program memory and data memory.

The program memory holds the code to be executed. It is a 16-bit memory and most instructions are of this size; some instructions need two memory positions (32 bits).

The data memory is 8 bits wide and is divided into three sections. There are instructions specific to each of these sections, but there are also instructions that can access the whole memory space indistinctly. The lowest part of this memory holds 32 8-bit registers, six of which can be grouped in pairs to form three 16-bit registers, usually used as pointers. Next comes the input/output space, with a total of 64 8-bit locations. The rest of the memory is RAM.
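As a rough illustration of this memory map, the sketch below decodes a data-memory address into the three sections just described (register file, I/O space, internal RAM) for an ATmega-style map; the entity and signal names are hypothetical and not taken from the actual core.

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.numeric_std.all;

   -- Illustrative decode of the AVR data-memory address space:
   -- 0x00-0x1F registers, 0x20-0x5F I/O, RAM above.
   entity dm_decode is
      port(addr    : in  unsigned(15 downto 0);
           sel_reg : out std_logic;   -- 32 general purpose registers
           sel_io  : out std_logic;   -- 64 I/O locations
           sel_ram : out std_logic);  -- internal SRAM
   end entity dm_decode;

   architecture rtl of dm_decode is
   begin
      sel_reg <= '1' when addr <= 31 else '0';                   -- up to 0x1F
      sel_io  <= '1' when addr >= 32 and addr <= 95 else '0';    -- 0x20 to 0x5F
      sel_ram <= '1' when addr >= 96 else '0';                   -- 0x60 and above
   end architecture rtl;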

From the second generation onwards the AVR line implements a stack pointer. It is decremented every time something is pushed onto the stack, and it is usually initialized with the highest RAM address. This resource eases the compilation of applications written in C.

Most instructions execute in a single clock cycle, but some can take up to four cycles, as in the case of CALL.

The stack pointer and the status register are mapped into the input/output memory space.

The ALU implements the basic operations of addition, subtraction and shifting. From the fourth generation onwards, signed and unsigned integer multiplication and multiplication in fixed-point (1.7) format are introduced.

2.2. Implementation

The development tools used were those recommended by the FPGALibre project [4] [5]. The work was carried out on workstations running Debian [6] GNU [7]/Linux, with VHDL as the hardware description language.

To simplify the task and start from a working code base, it was decided to build upon the AVR Core project [8] from OpenCores.org. The code was first adapted to the FPGALibre project guidelines. The original VHDL code is written at a very low level, without exploiting the expressiveness of the language, so it was rewritten to obtain more compact and maintainable code. Modules that had been kept separate, such as the ALU and the bit-oriented operation processor, were also unified.

The original project only implements the third AVR generation; in particular, it models the ATmega103. To gain flexibility the CPU was made parameterizable, allowing one of three possible generations to be selected. In this way the FPGA area can be better exploited by choosing among three different versions according to the complexity of the project at hand.

The main differences between the second and third generations are the size of the stack pointer (8 bits in the second, 16 in the later ones) and the lack of the absolute jump instructions (JMP and CALL). The fourth generation adds multiplication, moves between register pairs (16 bits) and an improved LPM instruction (program memory access).
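A minimal sketch of how such generation-dependent differences can be captured with a VHDL generic is shown below; the generic and port names (GENERATION, sp_we and so on) are illustrative and are not the identifiers used in the actual core.

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.numeric_std.all;

   -- Hypothetical generation-selectable stack pointer register:
   -- 8 bits for the second generation, 16 bits for the later ones.
   entity sp_reg is
      generic(GENERATION : natural range 2 to 4 := 3);
      port(clk    : in  std_logic;
           rst    : in  std_logic;
           sp_we  : in  std_logic;                 -- write from the I/O space
           push   : in  std_logic;                 -- decrement on push
           pop    : in  std_logic;                 -- increment on pop
           sp_in  : in  unsigned(15 downto 0);
           sp_out : out unsigned(15 downto 0));
   end entity sp_reg;

   architecture rtl of sp_reg is
      function sp_width(gen : natural) return natural is
      begin
         if gen = 2 then return 8; else return 16; end if;
      end function;
      constant W  : natural := sp_width(GENERATION);
      signal   sp : unsigned(W-1 downto 0) := (others => '1');
   begin
      process(clk)
      begin
         if rising_edge(clk) then
            if rst = '1' then
               sp <= (others => '1');
            elsif sp_we = '1' then
               sp <= sp_in(W-1 downto 0);
            elsif push = '1' then
               sp <= sp - 1;
            elsif pop = '1' then
               sp <= sp + 1;
            end if;
         end if;
      end process;
      sp_out <= resize(sp, 16);
   end architecture rtl;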

2.3. Peripherals

To ease the use of existing applications for the AVR line, some of the most common peripherals were implemented.

Interrupt controller: it allows the external interrupts to be configured, masked, etc. Two different implementations were made, one compatible with the modern lines (e.g. ATtiny22 and ATmega8) and another with the older ones (e.g. ATmega103).

Input and output ports: they allow pins to be configured as inputs or outputs. A new, flexible implementation was written, able to model all possible cases.

Timer and counters: the implementation from the original project was adapted. It implements Timers 0 and 2 of the ATmega103. The code was modified to allow more reuse, since both timers are very similar. These timers are 8 bits wide and can be used as counters or PWM generators.

USART: the implementation from the original project was adapted. It is very flexible and allows different transmission rates and data lengths to be selected.

Watchdog: this peripheral was not present in the original implementation; since it is related to a CPU instruction, it was decided to implement it.

2.4. Expansion Bus

In a microcontroller the available peripherals are fixed, but in an FPGA implementation it is desirable that they can be easily added and/or removed. For this reason an expansion bus was implemented.

Following the same criteria adopted in the past [1], the WISHBONE interconnection standard [9] was selected. It has the following advantages:

For simple cases (one master and one or more slaves) it reduces to little or no additional logic.

It was conceived for more complex cases (more than one master, retry, error notification, etc.).

It is royalty free and can be used at no cost; the complete specification is available on the internet.

To access the WISHBONE bus, two registers were implemented in the input/output space. Since our implementation left out the EEPROM of the AVRs, addresses 0x1C to 0x1F were available, and 0x1E and 0x1F were chosen. Register 0x1F indicates the address of the peripheral to be used on the WISHBONE bus; any subsequent operation on register 0x1E is then transferred through the WISHBONE bus. This allows up to 256 8-bit registers on the WISHBONE bus to be accessed.

The WISHBONE bus supports slow peripherals, so if a peripheral needs more than one clock cycle to complete an operation it can stall the microcontroller until the operation has finished.
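A minimal sketch of this register pair acting as a WISHBONE master, including the ACK-based stall, could look as follows; all entity, port and signal names are hypothetical and not the ones used in the real design.

   library ieee;
   use ieee.std_logic_1164.all;

   -- Illustrative bridge: a write to 0x1F latches the peripheral address,
   -- any access to 0x1E becomes a WISHBONE cycle that stalls the CPU until ACK.
   entity io_wb_bridge is
      port(
         clk, rst  : in  std_logic;
         -- CPU I/O space side
         io_adr    : in  std_logic_vector(5 downto 0);   -- I/O address (0x00 to 0x3F)
         io_we     : in  std_logic;
         io_re     : in  std_logic;
         io_dat_i  : in  std_logic_vector(7 downto 0);
         io_dat_o  : out std_logic_vector(7 downto 0);
         cpu_stall : out std_logic;                       -- holds the CPU until ACK
         -- WISHBONE master side
         wb_adr_o  : out std_logic_vector(7 downto 0);
         wb_dat_o  : out std_logic_vector(7 downto 0);
         wb_dat_i  : in  std_logic_vector(7 downto 0);
         wb_we_o   : out std_logic;
         wb_cyc_o  : out std_logic;
         wb_stb_o  : out std_logic;
         wb_ack_i  : in  std_logic);
   end entity io_wb_bridge;

   architecture rtl of io_wb_bridge is
      signal wb_adr    : std_logic_vector(7 downto 0) := (others => '0');
      signal access_1e : std_logic;
   begin
      -- Any read or write to 0x1E ("011110") becomes a WISHBONE cycle
      access_1e <= '1' when io_adr = "011110" and (io_we = '1' or io_re = '1') else '0';

      -- A write to 0x1F ("011111") latches the peripheral address used on the bus
      process(clk)
      begin
         if rising_edge(clk) then
            if rst = '1' then
               wb_adr <= (others => '0');
            elsif io_adr = "011111" and io_we = '1' then
               wb_adr <= io_dat_i;
            end if;
         end if;
      end process;

      wb_adr_o  <= wb_adr;
      wb_dat_o  <= io_dat_i;
      wb_we_o   <= io_we;
      wb_cyc_o  <= access_1e;
      wb_stb_o  <= access_1e;
      cpu_stall <= access_1e and not wb_ack_i;  -- slow peripherals stall the CPU
      io_dat_o  <= wb_dat_i;
   end architecture rtl;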

2.5. Equivalent Configurations

To provide the end user with configurations similar to existing microcontrollers, three microcontrollers were implemented, one for each supported generation:

ATtiny22: second generation, a single 5-bit input/output port, an 8-bit timer, WISHBONE bus, 8-bit stack pointer, watchdog, one external interrupt source, 128 bytes of RAM, 1024 words of program memory.


ATmega103: third generation, 6 input/output ports, UART, WISHBONE bus, two 8-bit timers, 16-bit stack pointer, watchdog, eight external interrupt sources, 4096 bytes of RAM, up to 65536 words of program memory.

ATmega8: fourth generation, 3 input/output ports, UART, WISHBONE bus, two 8-bit timers, 16-bit stack pointer, watchdog, two external interrupt sources, 1024 bytes of RAM, up to 65536 words of program memory.

All configurations are parameterizable, so unused peripherals can be removed.

3. DEVELOPMENT TOOLS

Since this implementation includes the whole instruction set of the original processor, and since three configurations equivalent to existing microcontrollers were implemented, most of the tools available for the AVR line can be used.

Very good free software tools are available for the AVR line and can be used with our microcontroller.

3.1. Assembler

To assemble sources written for Atmel's assembler (avrasm) it is possible to use avra [10]. It is distributed under the GPL license and can be compiled for the most popular platforms. Besides being compatible with Atmel's assembler, it offers improved macro support and conditional assembly.

3.2. C/C++ Compiler

Even though this is an 8-bit platform, a version of the GNU project's C compiler [11] capable of generating code for the AVR is available. This version of gcc is known as gcc-avr and generates highly optimized code for second- to fifth-generation AVRs.

3.3. Standard C Library

A very complete implementation of the standard C library, specially designed for the AVRs, is available from the avr-libc project [12]. It is heavily optimized for the AVRs and allows flexible and compact applications to be built.

One interesting feature of this library is that the standard input and output (stdin and stdout) can be redirected to any device, for example the serial port.

3.4. Debugger

The GNU project debugger, gdb [13], can be compiled with AVR support. This version of gdb is known as avr-gdb and is able to debug programs written for the AVRs. Since gdb is a huge application, it runs on a PC and communicates with an AVR simulator or with the microcontroller itself.

GDB is a command line application, but several user interfaces are available to make it easier to use.

3.5. Simulator

A simulator capable of reproducing the behavior of an AVR is simulavr [14]. It runs on a PC and can be driven from avr-gdb to debug the simulated code without needing a real AVR.

4. DEBUG INTERFACE

4.1. Introduction

The development of electronic systems based on embedded systems poses a challenge when it comes to removing implementation errors. The microcontrollers used for these tasks are usually small and cannot run the complex tasks involved in debugging. The task becomes even harder when such devices lack a friendly user interface, with no video output to attach a monitor and no keyboard or similar input to enter data.

To solve these problems, functionality is usually included in the microcontroller that allows the debugging tasks to be performed remotely from a personal computer, where the aforementioned resources are available.

4.2. Architecture Selection

Modern devices of the AVR line provide debug facilities through a JTAG (Joint Test Action Group) port. One possible solution would have been to implement a compatible interface. The advantage of this solution is that any available software could have been used without modifications. The disadvantage is that PCs have no JTAG interface, which implies a special cable. And that is not the only problem: in practice the programs do not handle JTAG directly but rather a special protocol that is normally implemented with a microcontroller. These cables therefore contain a microcontroller, which is the one that really talks JTAG with the microcontroller we want to debug. A possible solution was to implement this second CPU in the same FPGA.

On the other hand, it would have been necessary to implement a debug unit compatible with that of the AVR and accept its limitations. Our team already had experience in developing a debug unit [2], more flexible than the AVR one, so it was decided to adapt our debug unit and avoid the need for a second microcontroller. The drawback of this approach is that the software has to be adapted to work with our microcontroller.

4.3. Communication with the PC

Our original debug unit is a peripheral that supports the WISHBONE interconnection standard. To control this kind of peripheral it is necessary to access the WISHBONE bus, which lies inside the FPGA. One way to achieve this access is through some kind of bridge. In our previous work we used a parallel port (EPP mode) to WISHBONE bridge [15], developed by our team. In this case, and since the parallel port has been almost completely replaced by USB, we opted for a USB to WISHBONE bridge [16] [17], also developed by our team.

4.4. Features

Our debug unit can:

Stop/resume the execution of the microcontroller at any time.

Execute the program step by step.

Stop execution when a given memory position is reached (breakpoint). The number of breakpoints is configurable between 1 and 256 (see the sketch after this list).

Reset the microcontroller.

Access all the registers, including the program counter.

Access the input/output space.

Inspect the calling stack.

Stop execution when a data memory position is accessed (watchpoint). The accesses can be selected to stop on read, write or both. The number of watchpoints is configurable between 1 and 256.

Modify the program memory.

Detect stack overflows and stop execution when they occur.
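As a rough sketch of how a configurable number of breakpoints can be built, the hypothetical unit below compares the program counter against NUM_BP stored addresses and raises a halt request on any match; watchpoints could be handled the same way on the data-memory address. Entity, generic and port names are illustrative, not those of the actual debug unit.

   library ieee;
   use ieee.std_logic_1164.all;

   entity bp_unit is
      generic(NUM_BP : positive range 1 to 256 := 4;  -- number of breakpoints
              PC_W   : positive := 16);
      port(clk    : in  std_logic;
           -- breakpoint table written by the debugger (e.g. through WISHBONE)
           wr     : in  std_logic;
           wr_idx : in  natural range 0 to NUM_BP-1;
           wr_pc  : in  std_logic_vector(PC_W-1 downto 0);
           wr_en  : in  std_logic;                    -- enable bit for the entry
           -- CPU side
           pc     : in  std_logic_vector(PC_W-1 downto 0);
           halt   : out std_logic);
   end entity bp_unit;

   architecture rtl of bp_unit is
      type pc_array is array(0 to NUM_BP-1) of std_logic_vector(PC_W-1 downto 0);
      signal bp_pc : pc_array := (others => (others => '0'));
      signal bp_en : std_logic_vector(NUM_BP-1 downto 0) := (others => '0');
   begin
      -- breakpoint table
      process(clk)
      begin
         if rising_edge(clk) then
            if wr = '1' then
               bp_pc(wr_idx) <= wr_pc;
               bp_en(wr_idx) <= wr_en;
            end if;
         end if;
      end process;

      -- combinational match: any enabled entry equal to the current PC halts the CPU
      process(pc, bp_pc, bp_en)
         variable hit : std_logic;
      begin
         hit := '0';
         for i in 0 to NUM_BP-1 loop
            if bp_en(i) = '1' and pc = bp_pc(i) then
               hit := '1';
            end if;
         end loop;
         halt <= hit;
      end process;
   end architecture rtl;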

4.5. Complementary Software

To debug programs running on AVR microcontrollers it is possible to use avr-gdb, but this program needs to communicate with another program that actually controls the microcontroller. A widely used one is AVaRICE [18]; being free software, it was possible to modify it. Support was added to AVaRICE to control our debug unit through the USB port.

As the user interface to drive avr-gdb, the SETEdit program [19] was selected. It is the working environment recommended by the FPGALibre project. SETEdit implements the GDB/MI protocol for debugging C/C++ and assembler programs, so no major changes were needed for it to handle debugging of programs written in C or assembler running on this microcontroller.

4.6. Complementary Hardware

The microcontroller does not implement the reprogramming functionality included in the original. This is not a major problem because FPGAs are reconfigurable, so re-synthesizing the design is enough to change the program executed by the microcontroller. It is common, however, for remote debuggers to be able to modify the program executed by the embedded system, so to complement this development a WISHBONE peripheral capable of accessing the program memory of the microcontroller was designed. This made it possible to reload that memory without reconfiguring the whole FPGA.

Fig. 1 illustrates the interconnection of the hardware components mentioned above; the blocks drawn with a solid fill correspond to the developments described in this work. Fig. 2 shows the data flow inside the computer: the data enters through the USB port, is taken by the operating system (Linux) using basic input/output operations and passed to user space through the libusb library; it is then processed by AVaRICE and translated into the gdb remote protocol, and finally the debugger (avr-gdb) sends it to the user interface (SETEdit) using the GDB/MI protocol.

5. RESULTS

Fig. 3 shows a debugging session using SETEdit. It displays the source code of the program, the disassembled code and a window used to watch the value of a variable.

This development was verified using Xilinx Spartan II and Spartan 3A FPGAs. Synthesis was performed with XST 10.1.02 K.37.


Fig. 1. Interconnection diagram of the hardware blocks.

Fig. 2. Data flow inside the computer.

The occupied area depends on several parameters; some examples are described below.

ATmega103 configuration with a 1024x16 program memory, no internal peripherals and only a small UART on the WISHBONE bus: 644 slices (245 FFs and 1124 LUTs), 3 BRAMs.

ATmega8 configuration with a 1024x16 program memory, no internal peripherals and only a small UART on the WISHBONE bus: 707 slices (275 FFs and 1224 LUTs), 2 BRAMs and one multiplier.

ATtiny22 configuration with a 1024x16 program memory, no internal peripherals and only a small UART on the WISHBONE bus: 606 slices (237 FFs and 1053 LUTs), 2 BRAMs.

ATmega8 configuration with a 1024x16 program memory, port B enabled, a small UART on the WISHBONE bus, debug interface with 3 breakpoints and 3 watchpoints (USB): 1548 slices (902 FFs and 2477 LUTs), 4 BRAMs and one multiplier. Approximately 500 of the slices are needed for the USB to WISHBONE bridge.

Only in the last case was the tool asked to meet a target clock frequency of 24 MHz for the whole circuit, except for the USB PHY, which had to run at 48 MHz. In the other cases no particular constraint was requested and the tool reported clock frequencies between 30 and 37 MHz for a speed grade 4 Spartan 3A.

5.1. DC Motor Control

To verify the correct operation of the microcontroller and the capabilities of the debug unit, a position control for DC motors was developed. A PID controller was chosen for this control. The design uses the following peripherals connected to the WISHBONE bus:

PWM modulator with 15 bits of resolution (a sketch follows this list).

Incremental encoder decoder with 16 bits of resolution.

Small UART working at 115200 baud.
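A minimal sketch of a free-running PWM modulator of the kind listed above is shown below; the names and the assumption that the duty-cycle register is loaded through WISHBONE are illustrative, not taken from the actual peripheral.

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.numeric_std.all;

   entity pwm_mod is
      generic(WIDTH : positive := 15);                 -- 15-bit resolution
      port(clk, rst : in  std_logic;
           duty     : in  unsigned(WIDTH-1 downto 0);  -- written through the bus
           pwm_out  : out std_logic);
   end entity pwm_mod;

   architecture rtl of pwm_mod is
      signal cnt : unsigned(WIDTH-1 downto 0) := (others => '0');
   begin
      -- free-running counter, wraps naturally at 2**WIDTH
      process(clk)
      begin
         if rising_edge(clk) then
            if rst = '1' then
               cnt <= (others => '0');
            else
               cnt <= cnt + 1;
            end if;
         end if;
      end process;

      -- output is high while the counter is below the programmed duty cycle
      pwm_out <= '1' when cnt < duty else '0';
   end architecture rtl;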

The configuration used is compatible with the ATmega8, with a 4096x16 program memory and the debug unit enabled. Of the internal peripherals only port B is enabled. This configuration took 1732 slices (1009 FFs and 2786 LUTs), 7 BRAMs and one multiplier (Spartan 3A).

With the first successful results already obtained, what remains is to refine the mechanism for determining the PID constants.

6. CONCLUSIONS

Choosing the AVR architecture gave access to a wide range of tools and libraries. At the same time it was verified that programming this device is as easy as programming a commercial AVR.

The resulting debug unit is powerful, able to perform most of the operations offered by debuggers used on personal computers, and it is a great help when looking for errors in systems running in real time.

Using FPGAs has the advantage that the debug unit can be removed from the final version of the design, so it does not take up resources in the final device. If the design occupies practically the whole FPGA, it is enough to use a larger FPGA during the development stage in order to include debug units such as this one.

Selecting the WISHBONE interconnection standard allowed the reuse of a USB bridge and opens the possibility of implementing other communication mechanisms, such as RS-232 or Ethernet.

Choosing to modify a program such as AVaRICE notably sped up the development and allowed the reuse of existing user interfaces with which our team was already familiar.

The tools proposed by the FPGALibre project proved adequate for this development.


Fig. 3. Debugging session.

7. REFERENCES

[1] S. E. Tropea and J. P. D. Borgna, "Microcontrolador compatible con PIC16C84, bus WISHBONE y video," in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006, pp. 117–122.

[2] S. E. Tropea, "Interfaz de depuración para microcontrolador," in 2008 4th Southern Conference on Programmable Logic Designer Forum Proceedings, Bariloche, 2008, pp. 105–108.

[3] R. M. Cibils, A. Busto, J. L. Gonella, R. Martinez, A. J. Chielens, J. M. Otero, M. Nunez, and S. E. Tropea, "Wide range neutron flux measuring channel for aerospace application," in Space Technology and Applications International Forum - STAIF 2008 Proceedings, vol. 969, New Mexico, 2008, pp. 316–325.

[4] S. E. Tropea, D. J. Brengi, and J. P. D. Borgna, "FPGALibre: Herramientas de software libre para diseño con FPGAs," in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006, pp. 173–180.

[5] INTI Electrónica e Informática et al., "Proyecto FPGA Libre," http://fpgalibre.sourceforge.net/.

[6] Debian, "Sistema operativo Debian GNU/Linux," http://www.debian.org.

[7] "GNU project," http://www.gnu.org/.

[8] R. Lepetenok. (2009, Nov.) AVR Core. OpenCores.org. [Online]. Available: http://www.opencores.org/project,avr_core

[9] Silicore and OpenCores.Org, "WISHBONE System-on-Chip (SoC) interconnection architecture for portable IP cores," http://prdownloads.sf.net/fpgalibre/wbspec_b3-2.pdf?download.

[10] (2009, Nov.) AVR assembler (avra). [Online]. Available: http://avra.sourceforge.net/

[11] (2009, Nov.) GCC, the GNU compiler collection. [Online]. Available: http://gcc.gnu.org/

[12] M. Michalkiewicz, J. Wunsch et al. (2009, Nov.) AVR C runtime library. [Online]. Available: http://www.nongnu.org/avr-libc/

[13] (2009, Nov.) GDB: The GNU project debugger. [Online]. Available: http://www.gnu.org/software/gdb/

[14] T. A. Roth et al. (2009, Nov.) Simulavr: an AVR simulator. [Online]. Available: http://savannah.nongnu.org/projects/simulavr/

[15] A. Trapanotto, D. J. Brengi, and S. E. Tropea, "Puente IEEE 1284 en modo EPP a bus WISHBONE," in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006, pp. 257–264.

[16] R. A. Melo and S. E. Tropea, "IP core puente USB a WISHBONE," in XV Workshop Iberchip, vol. 2, Buenos Aires, 2009, pp. 531–533.

[17] S. E. Tropea and R. A. Melo, "USB framework - IP core and related software," in XV Workshop Iberchip, vol. 1, Buenos Aires, 2009, pp. 309–313.

[18] S. Finneran. (2009, Nov.) AVR in circuit emulator. [Online]. Available: http://avarice.sourceforge.net/

[19] Salvador E. Tropea et al., "SETEdit, un editor de texto amigable," http://setedit.sourceforge.net.


ACADEMIC EXPERIENCE ON THE INCORPORATION OF HDL-BASED DESIGN METHODOLOGY INTO AN ELECTRONIC ENGINEERING PROGRAM

Roberto Martínez, Rosa Corti, Estela D’Agostino, Javier Belmonte, Enrique Giandoménico

Facultad de Ciencias Exactas, Ingeniería y Agrimensura, Universidad Nacional de Rosario (FCEIA/UNR), Avenida Pellegrini 250, (2000) Rosario, Argentina

email: romamar, rcorti, estelad, belmonte, [email protected]

ABSTRACT

This paper describes the planning and partial implementation of the changes needed to introduce HDLs as a design methodology in the digital area of an Electronic Engineering program. To achieve a smooth integration of the topic with the conceptual contents of the courses, the advantages of HDLs and their associated design environments are exploited as didactic tools. The impact of the modifications carried out in two courses during the current year is assessed by comparing the 2009 examination results with those of previous years and through opinion surveys of the enrolled students. Finally, these measurements are used to draw conclusions and to determine future lines of work.

1. INTRODUCTION

Electronic engineering, like other technology-based disciplines, advances at a fast pace, incorporating a wide range of novelties which, in turn, drive the inclusion of others. This process causes obsolete technologies to be replaced by state-of-the-art ones, so the useful life cycle of some technological instruments can be very short. Faced with this situation, the teaching staff has a standing concern: selecting the most significant contents of the study program and, in some cases, those expected to remain valid the longest. Likewise, the contents and their nature determine the most appropriate teaching methodology. A constant aspiration of the teacher (mainly in the final courses), regarding the "usefulness" of the topics, is that they remain relevant at least during the first years of the student's future professional practice. The study plan of a technological program such as Electronic Engineering must meet two objectives: the student must learn the scientific foundations of electronics while, at the same time, becoming proficient with the latest emerging technologies. We also note that addressing this problem is hindered by the absence of studies on the epistemological side of our discipline; little has been said about the way this body of knowledge builds its categories and methods. It must also be kept in mind that the structuring of the contents carries methodological implications of its own. In particular, the design of digital systems from the disciplinary perspective, that is, as a body of knowledge structured for teaching, must be accompanied by an adequate pedagogical framework that supports it. The constructivist theory of the teaching-learning process is based on the idea that the reality we believe we know is actively constructed by the knowing subject, the student. In constructivism, coming to know is an adaptive process that organizes the subject's experiential world. Many authors have contributed to this line of thought; among them, Ausubel [1] introduces two key concepts, meaningful learning and subsumers. Meaningful learning is that in which ideas expressed symbolically are substantially related to what the student already knows, to prior knowledge, that is, referenced to some essential aspect of the student's cognitive structure. For Ausubel, the most relevant form of meaningful learning occurs when new ideas relate, in a subordinate way, to relevant ideas of a higher level of abstraction, generality and inclusiveness. He calls this prior knowledge, which anchors the new concepts, subsumers. Within this model, Bruner [2] proposes, from the curricular perspective, that the curriculum should be organized as a spiral, periodically revisiting the same contents each time in greater depth. This helps the student to modify, in a continuous process, the mental representations built so far. These considerations were taken into account when planning the changes described here. The rest of the paper is organized as follows: Section 2 describes the current state of the digital area, Section 3 presents the proposed modifications, whose implementation is analyzed in Section 4. Finally, Section 5 lists the conclusions reached and outlines future lines of work.

2. CURRENT STATE OF THE DIGITAL AREA

The rapid evolution of digital electronics, with its associated tools and technologies, has revolutionized the way digital systems are analyzed, designed and synthesized. Hardware description languages (HDLs), together with electronic design automation (EDA) environments, embody the top-down design methodology and, as the authors in [3] argue, are the driving forces behind the development of microelectronics. In this context we agree with what is stated in [4], namely that teaching design-oriented HDLs is a high-quality tool for learning digital systems. Likewise, the availability of FPGA development boards provided to academia by the main vendors of this technology (Xilinx, Altera) [5], [6] through their university programs has made it possible to adopt a project-based learning (PBL) approach, whose application and potential in the teaching of this discipline have been reported in several works [7], [8], [9]. In the Electronic Engineering program at FCEIA/UNR the digital area comprises three compulsory courses, Digital I, II and III, and several elective courses whose syllabi can be consulted at [10]. The HDL-based design methodology is covered only in an elective course which, by its very nature, is taken by just a fraction of the students. However, given the importance of the topic, it was deemed necessary to incorporate it into the compulsory courses so that it would become part of the comprehensive education of future engineers.

3. MODIFICATIONS

The goal of the modifications described here was to incorporate the HDL-based design methodology into the compulsory courses of the digital area of the Electronic Engineering program. To achieve this, we built upon the changes, agreed with the teaching staff, that have been introduced in the program since 2001 and are detailed in [11]. The work was planned so as to integrate the topic of interest with the concepts covered in each course, in order to exploit the potential of HDLs as didactic tools [12]. The planning took into account the curricular contents of each course and the appropriate sequencing for teaching the new knowledge. The topics were distributed as follows. Digital I: an introduction to VHDL, design units, description of combinational systems, basic concepts of sequential systems, and the description of a Petri net using the direct translation method described in [13]. Digital II: the VHDL description of combinational and sequential systems started in Digital I is deepened, incorporating other standard RTL-level blocks and analyzing their customization; using the top-down methodology, schematic and VHDL descriptions are integrated in a system of medium complexity. Digital III: emphasis is placed on the different description styles for finite state machines (FSM) and their application in the control/data-path model (an illustrative sketch is given below); work is also planned on the structural description style, library management and module reuse, relating them to processor architecture and peripheral interconnection. These modifications will be implemented in the first semester of 2010.
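As an illustration of the kind of FSM descriptions referred to above, a minimal two-process Moore machine is sketched here; the detector itself (a "two consecutive ones" detector) is an invented classroom-style example, not actual course material.

   library ieee;
   use ieee.std_logic_1164.all;

   entity seq_detect is
      port(clk, rst : in  std_logic;
           d        : in  std_logic;
           found    : out std_logic);
   end entity seq_detect;

   architecture rtl of seq_detect is
      type state_t is (IDLE, ONE, TWO);
      signal state, nxt : state_t;
   begin
      -- state register
      process(clk)
      begin
         if rising_edge(clk) then
            if rst = '1' then
               state <= IDLE;
            else
               state <= nxt;
            end if;
         end if;
      end process;

      -- next-state logic
      process(state, d)
      begin
         case state is
            when IDLE => if d = '1' then nxt <= ONE; else nxt <= IDLE; end if;
            when ONE  => if d = '1' then nxt <= TWO; else nxt <= IDLE; end if;
            when TWO  => if d = '1' then nxt <= TWO; else nxt <= IDLE; end if;
         end case;
      end process;

      -- Moore output: depends only on the current state
      found <= '1' when state = TWO else '0';
   end architecture rtl;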

4. IMPLEMENTATION OF THE CHANGES

The modifications were implemented during 2009. In the first semester the work focused on Digital I and, in view of the results achieved, the modifications in Digital II went ahead in the second semester.

4.1. Digital I

Digital I is the first course dealing with digital systems. The program is structured so that combinational systems and their different description and synthesis styles are studied first. The design of sequential systems is addressed next: the general theoretical framework for their modeling (Mealy and Moore representations) is given, and Petri nets are used to model sequential systems of low and medium complexity. Problem solving is oriented towards sequential systems of an industrial nature, characterized by a large number of inputs, strongly unspecified behavior, and parallel evolutions or shared resources, systems for which Petri nets prove very efficient as a modeling tool. Finally, hardware implementation is addressed through wired synthesis (gates, flip-flops and PROM) and programmed synthesis (PLC). The students carry out practical laboratory work on a PLC, implementing a low-complexity digital system modeled with a Petri net.

VHDL was introduced as one more representation of the behavior of the elementary combinational modules. That is, the truth table and the boolean expression that model a module were complemented with a VHDL statement in data-flow style. Of course, at this point it was not possible to go deeper into language topics such as data types or description styles. When the time came to deal with combinational circuits, where several gates must be interconnected to form the complete circuit, the concepts of entity and architecture were introduced. Later, the concept of the sequential description of a concurrent element was introduced through the process statement; modeling a flip-flop and a counter proved suitable for exemplifying algorithmic description in VHDL (a sketch of this kind of description is given below). In the last phase, once the topic of Petri nets had been developed, the VHDL description of the Petri net, using the direct translation method proposed in [13], was added to the traditional implementations (wired and PLC). Throughout the development of the new topic, the guiding idea was to teach this hardware description language with a design orientation [4]. The use of the simulation tool within the development environment was very helpful for the teaching-learning process. Since this was the first experience of early incorporation of VHDL, the laboratory classes on the topic were attended by only a group of fourteen students (the pilot group) out of the sixty-two enrolled in the course; the remaining students (48) did the laboratory work in the usual way. The pilot group carried out a laboratory consisting of modeling a low-complexity sequential system by means of a Petri net; the resulting model was described in VHDL and its behavior verified through simulation in the ISE WebPack environment [5]. In the stage of assessing the knowledge acquired about design-oriented VHDL, a written test was administered to the entire course.
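The sketch below illustrates the progression described above (a data-flow statement for a combinational module, then a process describing a flip-flop acting as a counter); it is a hypothetical classroom-style example, not the actual course hand-out.

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.numeric_std.all;

   entity intro_examples is
      port(clk, rst : in  std_logic;
           a, b     : in  std_logic;
           y        : out std_logic;
           count    : out unsigned(3 downto 0));
   end entity intro_examples;

   architecture behavioral of intro_examples is
      signal cnt : unsigned(3 downto 0) := (others => '0');
   begin
      -- data-flow style: the boolean expression of a combinational module
      y <= a and not b;

      -- sequential (algorithmic) style: a D-type register used as a counter
      process(clk)
      begin
         if rising_edge(clk) then
            if rst = '1' then
               cnt <= (others => '0');
            else
               cnt <= cnt + 1;
            end if;
         end if;
      end process;

      count <= cnt;
   end architecture behavioral;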

Table 1 summarizes the results.

Table 1. Results of the evaluation of the VHDL topic in Digital I

Grade (0 to 100 points)

Group                      Students   80 or more   79 to 60   Below 60
Did not do the VHDL lab       48          19            6         23
Did the VHDL lab              14          11            2          1
Total                         62          30            8         24

A marked difference can be observed between the pilot group and the rest of the students. Indeed, in the pilot group 79% obtained 80 or more points out of 100, while in the other group only 40% reached that score. Likewise, in the pilot group only 7% failed to pass, while in the other group 47% did not reach a satisfactory grade. We can summarize that working in the development environment, with simulation and behavioral verification activities, had a highly beneficial influence on learning. Finally, we cannot ignore that these results were also favorably influenced by the student-student and student-teacher interaction established in the laboratory work groups.

4.2. Digital II

Digital II belongs to the upper cycle of the Electronic Engineering program and is the second of the three compulsory courses of the digital area. It is characterized as an applied technology course, since its theoretical contents aim directly at solving practical problems of the area using concrete resources and devices. The course covers two blocks of knowledge which, although strongly related, have distinct characteristics: constructive system design based on functional blocks and programmable logic, and basic microprocessor architecture and assembly programming. The course introduces digital system design with a top-down approach, by which the problem at hand is divided into simpler modules so that they can be described as an interconnection of functional blocks at the RTL level. The ISE WebPack environment [5] is used with the schematic entry flow, incorporating the required blocks as library elements or as custom modules when the required functionality is not available. Within this framework, using the basic VHDL design concepts introduced the previous semester in Digital I, the modules obtained from the top-down partition of the system were implemented in VHDL. The methodology was based on solving problems from a set of requirements: the teacher proposes and analyzes different solution options, followed by the students at their workstations with the design environment installed, studying the impact of the different descriptions on the characteristics and behavior of the resulting circuits. In this regard, work was organized around three closely related axes:

VHDL description style of the devices.

Associated RTL-level schematic.

Behavioral simulation.

Working on these three aspects within the same design placed the students in the context of describing hardware elements and their connections, since they could visualize the system as an interconnection of already known RTL blocks, verify behavioral changes through simulation, and relate them to the way the circuit had been described in VHDL. The problems tackled combined schematic and VHDL design in their solution. This approach made it possible to compare the two ways of describing a circuit, highlighting the benefits of each. The VHDL descriptions showed great flexibility and offered the possibility of parameterizing the code; these characteristics eased the customization of the functionality of the modules. The students also verified that this type of description simplified changes during the debugging process. It became clear that these characteristics are fundamental as the complexity of the designs grows. On the other hand, if the system is not very complex, schematics present the functionality of the circuit very clearly by showing the graphical connection of modules, and they were an important support during the development of the topic. From the methodological point of view, knowing the blocks available in the library simplified the analysis of the different descriptions and helped in understanding the impact that changes in the VHDL description have on the characteristics and behavior of the circuit.
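As a small illustration of the parameterization mentioned above, the sketch below describes a loadable register whose width is a generic, something a fixed schematic library block cannot offer directly; it is a hypothetical example, not taken from the course material.

   library ieee;
   use ieee.std_logic_1164.all;

   entity gen_reg is
      generic(WIDTH : positive := 8);  -- customize the module width when instantiating
      port(clk, rst, load : in  std_logic;
           d              : in  std_logic_vector(WIDTH-1 downto 0);
           q              : out std_logic_vector(WIDTH-1 downto 0));
   end entity gen_reg;

   architecture rtl of gen_reg is
      signal r : std_logic_vector(WIDTH-1 downto 0) := (others => '0');
   begin
      process(clk)
      begin
         if rising_edge(clk) then
            if rst = '1' then
               r <= (others => '0');
            elsif load = '1' then
               r <= d;
            end if;
         end if;
      end process;
      q <= r;
   end architecture rtl;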

Fig. 1. Conduct of the classes: survey results on the clarity of the lectures, the integration of theory and practice, and the quality of the course material, each rated high, medium, low, very low or no opinion.

Table 2. Importance of and interest in the topic

                          High     Medium   Low      Very low   No opinion
Importance of the topic   32.6%    37.2%    7.0%     4.7%       18.6%
Interest aroused          41.9%    44.2%    11.6%    2.3%       0.0%

Table 3. Evaluations

                                                  High     Medium   Low     Very low   No opinion
Difficulty of the evaluations                     27.9%    69.8%    0.0%    0.0%       2.3%
Consistency between class level and evaluations   34.9%    51.2%    9.3%    2.3%       2.3%

The new contents were evaluated through an individual mid-term exam and a practical assignment carried out by teams of two students. The mid-term exam consisted of the VHDL description of a simple module, whose interface and functional requirements were set taking into account the duration of the test (1 hour). The practical assignment was of a higher level of complexity: the students worked in groups over a period of two weeks, during which they had the support of the teaching staff to solve the proposed problem. They had to tackle the design of a digital system with a hierarchical approach in which several of the constituent modules had to be developed in VHDL. It is worth noting that the percentage of students with grades above 6 (pass) in the first thematic block of the course rose from an average of 45% (2006 to 2008) to 55% in 2009. To evaluate the impact of the changes on the students, an anonymous and voluntary survey was carried out once the evaluations of the first module of the course were over; it was answered by forty-three students. The survey included questions on different aspects of the work done and on the interest aroused by the new knowledge, rating each aspect on the scale: high, medium, low, very low, no opinion. Table 2 shows that 86% of the respondents declared a high or medium interest in the new topic, and 70% rated in the same way the importance that, in their view, the new knowledge has for their professional education. Fig. 1 shows that most students were satisfied with the clarity of the lectures, the integration of theoretical and practical aspects, and the quality of the teaching material handed out by the course staff. The values in Table 3 show that most respondents consider the consistency between the depth at which the topics were developed and the difficulty of the evaluations to be reasonable. In a free-response field for suggestions, the students expressed their interest in completing the design flow by implementing the circuits on development boards. This would be very valuable since, as verified in the elective courses of the area, being able to implement the design results helps to consolidate knowledge and to stimulate the students' interest. It is planned to incorporate this kind of activity in future editions of the course, based on a laboratory upgrade that is under way.

5. CONCLUSIONS AND FUTURE WORK

The results reported in this work are still partial, since the implementation stage is still in progress; nevertheless, we consider them satisfactory. The usefulness of HDLs and EDA environments as didactic tools in the teaching of digital design has been verified, and a powerful instrument for modeling, simulating and implementing digital systems has been incorporated into the compulsory courses. The methodology provided flexibility to make changes and allowed the designs to be parameterized. However, despite the emphasis placed on relating HDL descriptions to the associated hardware, the students' tendency to use them as traditional high-level languages persists. Students tend to keep thinking of the description as purely sequential, forgetting that the object now being described is of a completely different nature, and they strive for compact code rather than for a good description that meets the requirements. Finally, a new elective course, "SoC-oriented digital design", is planned for the area. Its central topics will be: IP cores (software and hardware), 8- and 16-bit embedded processors (architecture and programming), integrated work environments, and signal processing, control and RF communication applications.

6. REFERENCES

[1] D. P. Ausubel, J. D. Novak, and H. Hanesian, "Psicología Educativa: un punto de vista cognoscitivo", Editorial Trillas, México, 1986 (orig. 1978).

[2] J. S. Bruner, "Desarrollo cognitivo y educación", Editorial Morata, Madrid, 1988.

[3] M. Castro, S. Acha, J. Perez, A. Hilario, J. V. Miguez, F. Mur, F. Yeves, J. Peire, "Digital systems and electronics curricula proposal and tool integration," in Proc. 30th ASEE/IEEE Annual Frontiers in Education, vol. 2, 2000, pp. F2E/1-F2E/6.

[4] V. A. Pedroni, "Teaching design-oriented VHDL", in Proc. of the 2003 IEEE International Conference on Microelectronic Systems Education (MSE'03), 2003, pp. 6-7.

[5] Xilinx Inc., www.xilinx.com.

[6] Altera, www.altera.com.

[7] J. Macías-Guarasa et al., "A project-based learning approach to design electronic system curricula", IEEE Trans. Education, 49(3), 2006.

[8] J. Northern, "Project-based learning for a digital circuits design sequence", in Proc. IEEE Region 5 Technical Conf., 2007, USA.

[9] F. Machado, S. Borromeo, N. Malpica, "Project based learning experience in VHDL digital electronic circuit design", in Proc. IEEE International Conference on Microelectronic Systems Education (MSE '09), 2009, pp. 49-52.

[10] www.dsi.fceia.unr.edu.ar

[11] R. Corti, R. Martínez, E. D'Agostino, E. Giandoménico, "Experiencia didáctica en una carrera de Ingeniería Electrónica. Actualización de los contenidos del área digital", Revista de Enseñanza de la Ingeniería, año 7, no. 13, pp. 61-72, Dic. 2006.

[12] G. Baliga, J. Robinson, L. Weiss, "Revitalizing CS hardware curricula: object oriented hardware design", The Journal of Computing Sciences in Colleges, 25(3): 60-66, 2010.

[13] R. Martínez, J. Belmonte, R. Corti, E. D'Agostino, E. Giandoménico, "Descripción en VHDL de un sistema digital a partir de su modelización por medio de una Red de Petri", in Proc. FPGA Designer Forum, 2009 SPL, pp. 7-11.


AUDIO OVER ETHERNET: IMPLEMENTATION USING FPGA

J. Mosquera, A. Stoliar, S. Pedre, M. Sacco and P. Borensztejn

Universidad de Buenos Aires, Argentina

��-��.����������4��������������� �/�����+ ��E���M����+�� ��N������������� ������������OM��+��+��

��� �� �

$�� ����� ��#���� ��� �������� ��� �������� � �� ��������������012������������������ &�������� AE�������� ������ �������� ���� � ���� ��� ����� F08�)H+� *��������������������� � ��������������� ��'��������������.��������������������������-<4�>PP���� �������������� ��� ����� �� .����� �)85PP37P>�� � ��� ������������������012�����Q���%�������%5>�0Q���E��������� ���.������� ��� ���������� ��� � �E����� �� ����� ����.���� � � ���� .�� ��� �������� $�������� ��������� ��� ���+�����

!"�# $�%& ''#J �

*������������� ����������������'���������� ����"��������� ��������� � ���������� ��� ������ E��� ��� �������������� �� �� ���� ������ ��� ����� � ��� ��� � E���� ������� �������� ���� ��� ����������� ����� E��� �� A�������� ���� �A�� ����� ���������� � ������� F�� � �����"����������������������������H+�! ��&������������E��� ������ ������ ������ ��� � R�� � ����� ���.����������� �������� ��� ����� +� *�� �AE����� ���$������ 0������� F0������ 8���� ������H� ���� < ���� ���)���� F)�� 1��H� ������������ ��� ������"#� �� ����� ������� ��� �� ����'�� ��� ����� �� ��"����+�4�A������� ��� ���� ������"#�� ��� ����� �������� ��� ����������1��"�1+�<���S�T���� ���� �� �������������� ���"��� �����������������������E���������������������� �������������� �E������������.#�� ���.&���������������� $�������+� *� �������� ��� ��� ��� ��� ��� ���� ������ 012�� ��� � � �� �������� ��� ���������� ���������������� ���������.��� ��� ������ ���+� $� ������.�� ����������������������� �� ���������������� ��<������$���������� �"��� ��� �������������� �� ���� �������� ������"#� 08�)� F08�� ���� )�� 1��H� E��� ��� ������ ����� � "��������� � ��� ����"�� ()*F� (������)����������� *��"���H+� -�� ������#����� E��� ��������������������������"������������������������������������������������E�������������� ����������������������.#�����.&���������+�$����������������.�������������

��� "�������������.��� ���������� ����������#��������������������������������+��$������#�������A���"��������� � ��"������� ���/���� �8�������:��������������������������������������������������� � 8������� 3� ��� ����� �� ����'�� ��� ����� �� �� ��� �8�������>� � � �� ����������������� � ������"#��������'����"��+�1���R�� ������8�������;��������������������������+��

("�&��'�#�'#J �&�)���%U�'$%�

$� ��������� ��������� ��� � ���������� ��� ��� 012�� ���� �� ������"���E������ ��������������������.#������������ $��������� ��'��� ��� ����� ���.��������� ��� ��� ��������+�

*� ������� ��� � ��"��������� ��� ��'��� ��� ��������������� �������������������������������������������������������6��������� �������E���������������������������� $�������� �$$$� IP:+3� S:T� � ���� �� ��.#�� ����E������������������������ �"������������������ ��������.#���������E���������������� ����������������+��

$��� ���������� ��� �� ���� ��� ��������� ������ ������������������E������� ���.����������� �� �������������� � ����"������������012�+�$���0�"+������ ������ ��E��������������������+�

1�� � � �� �������� ��� ��������� ��� ������� �� 6���V�����%5>�0Q�:V���� � � ������.���� $����������� �� ���������������012��Q���%������%5>� S3T+� 0�� ������� ��� �� ��6�����������$���������PW�PP�����!�� ��:� �������������� ���������� "����� F�.4��H� S>T� � ��.&�� ��� ��� ���������������������������������S;T+�

�������������������������������� �������������������������������� +��

("!"�&�*���������+�� !���������������� ����

*� ��"��������� ��� ����� ��� ���� ��� �� ������ ��� �����-<4�>PP� SCT� �������� ��� � ��� ��� ������ �� �����+� -�������� ��� ��� ���������.�� E��� �������� ��� ������������"���� �� ��"����� ��� ������ �� ����� �������� ����.������� ��"��5��"��� �� ��"��5��"����� ��A�� ����+


!���+�8����������������'���-<4�>PP�8�'� )���������������X���X���� �:+:II��(�������������6�

����X�����8�'��� �����������<7,����������E������������� �������������� �+�

����X� � �� ���������-<4�>PP+�����X��!���X��� -<4�>PP��������������� +�

!��:+�8����������������'������$��<�8�'� )�����������

�"!#�!� 8�������������.#������E������$��<+�

�"!� 4������������� ������

�"���� <����� ��������$��<������ ������������� �����+�

�"$�!����������$��<�E�������������������E�����E������������������� �������+�

���� ��������.�����������������������#���+�

0�"+��+���E��������������������+�

0�"+�:+�)����������������+�

$�-<4�>PP���������������������&�������:P5����������������� ��"��� �� ��� ���� � �������������� �<Y7,���.������ :+�� S,T�� � ��� ������� � � �<Y7,� �� �� ����������� ����� ��"��� ������������ ����"�� ��� ������������Z�<[7,�����������\��������Z�<[7,���������\+��

$���!���������������������������������������'������-<4�>PP�E���������������&�������������������+�

*����������������������������������������������������������� ��� �� ��� F�� �H� ��� �3� ����� F�����H� �����%���� ��� �� ��� ��� ������ � ��'�����X�!���X��+�<���� ��������:;C�����������������������P��������C��������������������:�������������:P�����+��

$� ���� P� ��������� ����� ����� E��� ������ �������������� ��� �� ������� ��"������� ��� ��� ������� ���#�� ��&���������������������������"����+�*���:P� ����� ����������������� � � ������ ��� � ��'� ��� ������������A�����������������3+�

$��������6��������������<Y7,��������:+:II�����W��"������� �����������-<4�>PP���������������������������� �����������>I�6(�+�

%&%&�#�'�� �(��')����'��

*� �������������� $�������� �$$$� IP:+3� ��� ��� ���A���� ��������� ��� �� ��������� ������� �� ������#������ ��������� �� ��'������� ��� ��.�� �#����� �� ��� ��� ���� ����� �������������.������������������������� �����G8��SIT��

!��� ����������.�����0�"+������ �� ������������ �������� �$$$� IP:+3� �PP� ����� ���A� ���� ��� �� �����)1I3I>,� S7T� ���������� ��� � ��� �#���� �� �� �����$��<�S�PT�F�1�������� �����������012��Q���%������%5>H�����"����������������������� ����+�

$���!��:�����������������������������������'������$��<�E���������������&�������������+�

*��������������������������$��<����A������ ���������������)1I3I>,��E�������������������������������PP����������������:�;��(�+�$�����������������������'����� F.���!��:H������������ � �������������������E������$��������������1������$��<+�

*&�&#��]+��

1�� �� ����'�� ��� � �������� ��� ������� � �&����� ���08�)� F0������ 8���� ������� ����� )����H� S�T+� *��

������������������������������ �������������������������������8� ��������������������1�������������


0�"+�>+�08�)��<[7,���������+�

0�"+�;+�08������<[7,����������

08�)� �� ����� ��� AE���� ��� ������� F������� ���H����� ����� ��������� ��������� F)�� 1��H+� )���� �&���������������������'����������� ���������� �����������������������������������E�����������"��� AE��������������E����������������� ���+��

$�� �������� ��������� ��� ����� ���.��������� ��� �������"�������� ���������.&��������'�����X�!���X���F.���!�� �H�� �� ������ �����"��������� ������ �$��<����.&�� ��� � ��'� �"!� F.��� !�� :H+� $���� � ���� ������.������� �����5����� ��� �� � I� ������ ��� �� ��� ����������� ������� ��� ������ ��"������ F��"������ ����� �����H�����������+�<� ���� ���������"��������������:P�������������������.�����������������������������.�������������>������ ����� ��"�������.��� ��� ������� ����� ��� ������������ �����R��+�

*�� ������� ��"������� ��� ��������� ������ ������� �� ��� ����� ���� ������ �� ��� �E�����$�������+�8�������������� ������������.������������ ��������"������������ � ����0�0G�F�����5�������������H+�

*�0�0G���� ����E������ ���������.�� �������� �� �� ��� ���E��� ��� ���� ����� ����E������ ����������+�$� ����� �������� ��� � 0�0G� ��� ���������� ��� I������ ��� ���'� !Q)� ��� $��<�� ������� E��� � ����������� F��������� ��� ������ � ����H� ����� ���� �� E��� ��� ���"����� � ����� ��� �E����� �� ����� �� �� ���������������� ������� �� �� �� ��� ���� ���+� 8�� ��������

������� �E������ ��� � '�� ����+� $�� � �������� >+3� �������� ��� �� � '�� ��� � 0�0G� �� ���� ��� ������ ��.��������� ��� �� ���������� �� � ������� ��� �E������ � ������� ����+�

��*��� �E������ $�������� ������ ������� ���������� ���������������������������������+������������������������ ������ ��� �����%��� E��� ��� ���� �������� ������ ���'��"!��������������������������"���������������0�0G+� 8�� �������� � �� ����� ���������� ������ ���������� �� ��������� � ���+�$�� �0�"+�:� ����������������������������+��

�)��� E��� �� ��'��� ����X�!���X��� �� �"!�

����������� � �� ������ ��� ����� ����������� ������ ���� �� ����� ��� AE���� ��� ������� ���������� ��� ������� ��� ��� �� ������ ��� ����+� <� �� ������������ ��� �������������E������������ ���������� ��������� �������� �������� F���� �� ������ ��"�����H� �� �� ����� ��������$��������F������ �����%������ � ����������������H��E���������������.������������������ ������ �������������+�

$�� S��T� ��� ��������� ����� ������ ��� ���������� ��������������"��������� ��������������������� �������������+� -�� ��� ����� ���������� ��� � ���������� ��� ��� � ����0�0G������������������� �����������������������������������.��������� ����������������%��#�������������������������� �����E�����������������+�

0�"+�3+������������ ��������������������+�


0�"+�C+�08�)�0� ��+�

0�"+�,+�08��0� ��+�

0�"+�I+������������������������������+�

)�����������A�������������������������������������� �����/�

�� ������ ��� �������/� �� ������� ���� �� �����-<4�>PP� �� � �"��� ��� ������� � � E������� �� ���Z�<[7,���������\�+�

�� ������ ��� �������� ��� ����������� ��� �E�����/��� ������������0�0G����������+�

�� ������ �������� $�������� �$$$� IP:+3� �PP� ���/��� ������� ���� �� ����� )1I3I>,� 1(,�� �� �1� �����$��<� �� �"������ ������� � � E������� �� ���0� ��+�

$�� � 0�"+� 3� ��� ������ � �E��������� ��� � ��������������� ��� ��� ����� ���� �� ��������� ��������+� $�������"���������'������� ���������� �� ��������������012��Q���%������%5>+�

,&�#-.)�-� $�'#J ��

$���������������������������� �� ������������������������ ����������012�+�

,&/&�-��0�������0��'���1��'2-.���'��������

<� �� �%��� ��� ��� � �������� ��� ����'��� �� ������ �������� ��������A��� ����������������������"���������I�����+�

*�0�"+�>�� �������08�)��� �������0�"+�;������� AE���� ��� ������+� �� ������������ . ��� � �������������� �� ��������������� �����+�

�/�������������������������������� ����������� �+��)�����������������%���3��08�����A�������������

����P�������� ���E��������������������������������������� �� � .A��� �� ���� .A���+� )������ ����� ������� ���A����.����'�����X����+�

�4� ���.� ���� ���������������� ���������� � �������������+� )������ ������ ������� ���A� ���.� � ��'� ��� ��"���X����������������"�����+�$��8,�������.����'�*G�)�����0�0G����������������������������"�����+������������������������������0�0G��� �������������

��'����F���.������������0�0GH�

� � ���������������������������5�� ����

<� �� �%��� ��� ��� � �������� ��� ����'��� �� ������ �������� ����� ���A� �� ������� ���� ��� � ���� ������������������ �����%���E��������������������������E������.����������������������0�0G+�

*�0�"+�C� �������08�)��� �������0�"+�,������ AE���� ��� ������+� �� ������������ . ��� � �������������� �� ��������������� �����+��!������������������������������������'��$�)"�

E���������E����0�0G�����������E��������������+���� ��� �� ������ ������ ��� ������� ���.���� � ��'�

#$�%��� �� ��� ������ ������ ����� ��� � $��<� F��'�!Q�<BH��������� �����+���&� ��� �� ������ ������ ��� ���� ���� �� ��������� ���

�E�����F��������A���������.&����� �����%��H������������ ���� ���� ��� ������ ���.��������� ��� � 0�0G�F��������A������ � ��.&�� ��� �����%��H� ���.���� ���'�' ��+��(���������������������F��'�#$)�')����.H+�


0�"+�7+�!�� ��������������.���������0�0G+�

� & ��������������"��*����������*���*��+�,�� "�-������ �

!���� �� ����� ��� �������� �� �� �� ����� ���������$$$� IP:+3� ������� ��� ���������� ��� ��������� ����������� ���� ���� ���� ���������� �<Y7,� �� �$$$� IP:+3���������. ������������ ���������������������"�����������'��������������� ������������.����������+�

$�� ������������� ��� ����� ��� �������� �������������� �� � �E������ ������ � ������������ �������� ����� ������ ���� �� ������ ��� ���.+� !� �� ���������� ��� ��� � �������� 3� �������� ��� ��� 0�0G�������������������+�

$���0�"�I������������������ ���������'���������������� ������ ����������� ������� ����+� ������� �������������������'���������������� ������������.+�

$� ��������������������������������� �E��������� ��� ��� � ���������� ��� ��� 0�0G� ���������������"����� ��� � 012�� Q���%� �����%5>+� )���� 012����� ���� � � �� �������� ��� 0�0G�� ����������� ������������� � ��.&�� ��� �"��� ������� �� 4�������������� ��������� ��� ���� ��������� ��� �� ������ ��� ��"��+�*���4��6�����F4���H�������E������� � ��������� ��������������������������012�+�

1�������"�����0�0G�������������.� ���E������������ ������� ��� �E������ � ����� �� �� � '�� ��� ��� �� ��+�

��������� ��� � '�� ��� �E������ �� E��� ��� ���# ����������������� ��� ������ � ��� ����������� ���� ������ ����������� ������ ��� ���������� �� A%� �� � '�� E��� �

�������������� �$$$� IP:+3� �������� ����� ��� ���� �������� ����� ������.�������������� ������F �����������������������H+�8����"������R �����>P���������� R��������3��������� A%� ������>,:�������������"�R��+��

1�������� ���.������ ��� ��� � '����������������������"���� � � ����0�0G� ��� � 012��Q���%������%5>�������������� ��� 0�0G� ��� :B� %� 7� ����� E��� ��� ���� ����� ���������E����������������������������������������������������E�����+�

)���������������������������0�"+�7��������������� ������� �������.��������� ���E��� ���.� �0�0G+�*���'� ���/�������0�0G���������"����������.��������������� '���������E�����������A�������� ����'�����$�)"������ �����$��<+��

( �/��0501�10J ��

1���� ������������������������ ���������������������� ������� ��� ������� �� &����� ��� � �� �����+� *����� ��������������������8� �S�:T+��

8�� ���������� ��� ���������� ���� ��� ����� ����'��/��<7,� <��������� 0� ���� 0�0G�� � ������ ��� ������������������/�--!�F-����-�����!���H���)���$������"������� "��������� ��� ��� ���# ���+� $� ��������� ����� ������ �� ��������� ������� ������ ��� ����� �������.��� ���������"� ������� ����"������+�

( ! �$�+�#��*���123�*��������� �

1��"������������# ���������� ������������������������ ��������� �<7,� S,T� �� ��� ��� � � �� �������� ����<7,���������+��

*�0�"+��P� ��������� �������������������� �.A��� �� �� �� ���� .A���+� G����.�� E��� � ��'� �� 4���������� �����������������0�0G��������.������.����+�G����.�� � ��&�� E��� � ��'� �X���� ��������� ��� ������������������������������ ��������� ��������+�

G����� ����� ��� ������ �������� ������/� �� � ���.A������ ���������.A������0�0G���+�

0�"+�P+�)�"� ������� �������������������������;+��


0�"+���+�)�"� ������� �������������������������;+:+�

!��3+�-���������������������.��<� �������� 1������.������������������������������� ��^�

������� ��� ��^�

��������������� :�^�

����������� >�^�

� ���� �������� C�^�

��������������� :�^�

������ �PP�^�

!"!�$�+�#��*#�5$%&�$!�

����������� ������������ ���E������'����������� �#����������������������������E����������������������������������������'��'+�

1�� ���������� � � �� �������� ��� �������� �������������� ��E��'� ����������� ��� � 08�� E��� ������ ��������� ��� 0� ���� ���.��������� � ��'� ����� ��� �����'��������������������������� ���E������'��'� ������� ������A�� ����� � ��'� ����+�G����.�� ��� �0�"+��������'����'�������$�

!(!�$�+�#��*#�����%�505+�%+��*$)��*%!�

$�� ����� ���������� ��� ��������� � 0�0G� ����������� �� ���"�������� ��� ���# ���� �� ����� �� ���������� ��� ��� �������<7,������������0� ��+��

8����������E������������������������0�0G��������<7,� ��������� ����� ���������� � ��'� ��� ����������.� ������� ��� ���� ����� ��� ���*�� ��� �� ������� ���������������� ��� 0�0G� �1� <���� "�������� ��� � ����� 3������� ��� ���*+� <����"������ ����� ��� �������� �� ������<7,�������������E����� �������������E���� �����+�

5!��0+$��0��

8�������������������������������!��3���� �����������������������������.�+�

,!�1++1) �0J+�

8������'���� ��� ����� ������������ ��������������� ������� ��� ����� ������ � ����� ��������� � ������"#���������� ��� S�T���� �� ����'�� ��� ����� �� ��"����+� )���A����� � ��� � �������� ��� ����.������ ���� �������� ������ ����'���� ��� 08�)�� �� ��� �������� ������ ������ ������������ �����E�������������0�0G����������+�

)�� � �������������� �� ���� ��� ����� �� ��� ���� � ������������� ��� ��� ��� ()*�� � ��� .����������� �� � ��������������6�������������+�

-!���5���+1���

S�T� 1��"� 1+� <���� Z012�� 1�G!G!.1��2� 4.� �$��*G2�$Q��1*$8\���������ZC�/�08�)\������375�,3��0��:PPI+�

S:T� �$$$� IP:+35:PP:� �$$$� 8������ ���� ����� ������������"�58�������� ��E���� ����� 5� 1��� 3/� <������ 8������������ ������� ����� <������� )��������� F<�8�W<)H������������������1������*����8�����������+�

S3T� �����%5>�012��-����2����+�-2P,P�F.:+>H�������P��:PPI+�

S>T� Q���%_������%`5>�0Q�:�$.������B���-����2�������)85PP;:P>��:PP;��.��������+�

S;T� �����W������ ������ ����[�� "������ �)85PP37P>�� :PP>��.��������+�

SCT� -<4�>PP�� ������ ������ ����� ������ ������� ��������� ��������� �"� ���� ������+���.+�P:5:��0����:PP:+�1�������)�+�

S,T� ������ <����� a7,�� ��.������ :+3� ��.������ �+P�� ����+� ������:PP:�

SIT� G8�����������������b�!����8G��������������������������G����8���� �� �����������������(������1� �� ���� �$$$�!�������������<� �����������.�+�:I����+�>��������7IP����+�>:;�5�>3:+�

S7T� )1I3I>,� )�1(.!$�� ���� 8��"�� �PW�PP� $��������!������.��� )�������� ������ 8� �����������<�����������:PP:+�

S�PT� �����%5>� 012�� $ ������� !��5������ $�������� ��<� -����2������-2P,>�F.�+IH�0�������C��:PPI+�

S��T� 8��.�� B����� V��.����� 012�� )���"�/� ��������������� �� ��������� ��� G��� ������V�� ������� VC� 5� <��6��� ���V�����I35�PP��:PP,+�

S�:T� ����/WW���+ �����+�� W��������W�.W ����� W����W�������+���+�


USE OF SELF-CHECKING LOGIC TO MINIMIZE THE EFFECTS OF SINGLE EVENT TRANSIENTS IN SPACE APPLICATIONS

Juan Ortega-Ruiz and Eduardo Boemo

School of Computer Engineering, Universidad Autonoma de Madrid

Ctra. de Colmenar Km 15, 28049 Madrid
email: [email protected], [email protected]

ABSTRACT

The use of Self-Checking circuits has been explored as a means to minimize the effects of Single Event Transients in combinational logic. Different types of circuits have been described in VHDL RTL, synthesized and simulated at gate level. During their gate-level simulation, faults were injected and their outputs were checked against a fault-free simulation. Results show that Self-Checking circuits can be used to either detect or correct errors in combinational logic, minimizing the effects of SETs in combinational logic.

1. INTRODUCTION

FPGAs are increasingly utilized in space applications: the recognized virtues of that technology, like high density, versatility and availability, are also attractive in a sector that suffers from a lack of varied components as well as increasing budget restrictions. However, the reprogrammability - another distinctive characteristic of these circuits - can be a drawback in safety-critical missions. In effect, the intrinsic possibility of correcting errors during design time can produce an unnoticed relaxation of the strict design and verification rules usually applied during the development of a masked ASIC [1].

Radiation can affect FPGAs by producing soft or hard errors. The first ones are represented by upsets in flip-flops, latches or SRAM cells (SEU, Single Event Upset), the activation of disabled functionality (SEFI, Single Event Functional Interrupt), or the generation of glitches in combinational logic (SET, Single Event Transient). Permanent damage to the device, like degradation by accumulated radiation (Total Dose) or by particle impacts which produce immediate hardware damage, like latch-up (SEL, Single Event Latch-up), are examples of hard errors.

This paper deals with the effects of Single Event Transients (SET) in the user plane of the logic of the FPGA, and the utilization of self-checking circuits to minimize them. A SET in combinational logic might produce glitches which might be captured by flip-flops at the end of the logic cone. Such wrong values would propagate through the logic, with unpredictable functional results.

The probability of capturing a SET increases with clock frequency, because the duration of the glitches and the clock period become similar. Additionally, conventional TMR (Triple Module Redundant) flip-flops can do little to avoid capturing such glitches. As a rule of thumb, SET becomes a problem at frequencies above 150 MHz in space-applied FPGAs.

This paper is organized in the following way. In Section 2, a definition of totally self-checking circuits is given. Section 3 describes an architecture for a totally self-checking network and its properties. In Section 4 a set of 4 totally self-checking circuits is evaluated, to study the properties which make them self-checking. Section 5 summarizes the conclusions, and finally, Section 6 contains the acknowledgments.

2. TOTALLY SELF-CHECKING CIRCUITS

A circuit whose output is encoded in some error detecting code is called a self-checking circuit [2]. Self-checking circuits have the properties of Self-Testing and Fault-Secure, which were formally defined by [3, 4]. The following definitions are extracted from [5, 6, 4].

• X is the set of all possible input words.

• Xc is the set of valid input codes. Xc ⊂ X

• Z is the set of all possible output words

• Zc is the set of valid output codes. Zc ⊂ Z

• F is the set of all possible faults. λ ∈ F is the null fault.

• A function z : (X,F ) → Z defines a circuit.

Fault-Secure A circuit is fault-secure for an input set I ⊂ Xc and a fault set Fs ⊂ F if the circuit, in the presence of a fault, produces either the right output code word or an output non-code word, but never an incorrect output code word [6, 4].


Fig. 1. Totally self-checking circuit example.

Self-Testing A circuit is self-testing for an input set Xc and a fault set Ft ⊂ F if for every fault in the fault set there exists an input code word for which the circuit produces an output non-code word in the presence of the fault [6, 4].

Totally Self-Checking A circuit is Totally Self-Checking (TSC) if the circuit is self-testing for Xc ⊂ X, Ft ⊂ F, and the circuit is fault-secure for Xc ⊂ X, Fs ⊂ F [6, 5]. It is assumed that Fs ⊂ Ft. Fig. 1 shows the relationship between the different code and fault spaces in a TSC circuit.

Code-Disjoint A circuit is code-disjoint if input code words are mapped into output code words, and input non-code words are mapped into output non-code words [5].

Fig. 1 shows that faults from Fs produce either the right output code word, like z(x1, f2), z(x2, f2), or an output non-code word, like z(x3, f2), but never an incorrect output code word. However, faults in Ft might produce incorrect output code words, like z(x2, f1), but all faults in Ft can be detected because for every fault in Ft there exists an input which produces an output non-code word in the presence of the fault, like z(x1, f1).

Unfortunately, in normal operation it is not guaranteed that all valid input codes are applied [7]; nevertheless, the totally self-checking property guarantees that an output code word is really a good one, and an output non-code word indicates the presence of a fault.

3. TOTALLY SELF-CHECKING NETWORKS

Totally self-checking networks are built from totally self-checking functional circuits, which are monitored by totally self-checking checkers, whose outputs are coded in some error detecting code [3]. The 1-out-of-2 code is commonly used because m-out-of-n codes are able to detect unidirectional multiple errors. The authors of [6, 5] explain m-out-of-n coding theory.

In fault-less operation, the functional circuit transforms input code words from the input code space Xcf ⊂ {0, 1}^n into output code words in the output code space Zcf ⊂ {0, 1}^m.

The checker circuit transforms input code words from Zcf into code words of the checker output code space Zcc ⊂ {(0, 1), (1, 0)}. The checker constitutes a code-disjoint circuit. Fig. 2 shows both the functional and the checker circuits of a TSC network.

Fig. 2. Totally self-checking network.

By observing the output of the checker, it is possible to detect any fault in the network, but it is not possible to decide whether the fault is in the functional circuit or in the checker itself [5].

4. CASES OF STUDY

The following circuits have been implemented and checked for the totally self-checking properties:

1. Single bit parity checker

2. Two rail checker

3. 2-out-of-5 code checker.

4. Berger prediction code checker

For each checker the evaluation consisted of the following steps (a behavioral sketch of this flow is given after the list):

1. build a VHDL RTL model of the checker.

2. synthesize it using a generic target technology. Produce two implementations: one based on "xor" gates, another based on "and-or" gates.

3. simulate the synthesized gate-level model of the checker with fault injection. Simulation consists of the following steps:

(a) generate the set of all possible faults. Only single faults will be considered.

(b) generate the set of valid input code words.

(c) for every fault of the fault set, simulate the whole set of valid input code words.

(d) compare results with the fault-free checker simulation, searching for errors.

4. catalog the fault effects.
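The flow above can be illustrated with a small behavioral sketch in Python (the paper's evaluation used VHDL RTL models and gate-level simulation). The netlist below models the xor-based single-bit parity checker of Section 4.1; the gate and signal names, and the choice of odd-parity words as the valid input set, are assumptions of the sketch, not taken from the paper.

from itertools import product

# Behavioral model of the xor-based single-bit parity checker of Section 4.1:
# z(1) = x0^x1^x2 and z(0) = x3^x4, so the two outputs differ exactly when the
# whole input word has odd parity (gate names u0-u2 are illustrative).
NETLIST = {                        # gate name -> (function, input nets)
    "u0": ("xor", ("x0", "x1")),
    "u1": ("xor", ("u0", "x2")),   # z(1)
    "u2": ("xor", ("x3", "x4")),   # z(0)
}
OUTPUTS = ("u1", "u2")
INPUTS = ("x0", "x1", "x2", "x3", "x4")

def simulate(vector, fault=None):
    """Evaluate the netlist; fault is (net, kind) with kind in {'inv','sa0','sa1'}."""
    def strike(net, val):
        if fault and fault[0] == net:
            return {"inv": 1 - val, "sa0": 0, "sa1": 1}[fault[1]]
        return val
    nets = {n: strike(n, v) for n, v in vector.items()}
    for name, (fn, ins) in NETLIST.items():
        a, b = nets[ins[0]], nets[ins[1]]
        nets[name] = strike(name, a ^ b if fn == "xor" else a & b if fn == "and" else a | b)
    return tuple(nets[o] for o in OUTPUTS)

valid_inputs = [dict(zip(INPUTS, bits))
                for bits in product((0, 1), repeat=5) if sum(bits) % 2 == 1]
faults = [(net, kind) for net in INPUTS + tuple(NETLIST)
          for kind in ("inv", "sa0", "sa1")]

for f in faults:                        # steps (a)-(d) of the list above
    unidir = bidir = 0
    for vec in valid_inputs:
        good, bad = simulate(vec), simulate(vec, f)
        if bad == good:
            continue                    # fault masked for this pattern
        if bad in ((0, 1), (1, 0)):
            bidir += 1                  # wrong but valid code word: violates fault-secure
        else:
            unidir += 1                 # non-code word: detectable, supports self-testing
    print(f, "unidirectional:", unidir, "bidirectional:", bidir)

The per-fault counts produced this way correspond to the columns reported in the result tables of Section 4.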



Fig. 3. Single bit parity checker, xor based.

If, for a certain input pattern, the fault produces a valid output code word which is not the expected one, then the circuit is not fault-secure.

If the fault does not produce any output non-code word for any input in the set of valid input code words, then the circuit is not self-testing.

The dependency of the checkers on their synthesis implementation is also evaluated: both synthesized netlists, the "xor"-based and the "and-or"-based one, are simulated with fault injection.

4.1. Single bit parity checker

A single bit parity checker generates a 2-bit vector output whose value is {"10", "01"} when the input vector has the right parity, and {"00", "11"} when the input vector has the wrong parity. Two different implementations are proposed. Fig. 3 shows an xor-based implementation of a simple parity checker. Fig. 4 describes an and-or-based implementation.
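The two implementation styles can be contrasted with a small behavioral sketch (Python, not the authors' VHDL); the specific sum-of-products form shown for the and-or variant is an assumption of the sketch.

from itertools import product

# Two behavioral forms of a 3-input parity function such as the one feeding one
# of the checker outputs: the first mirrors the xor style of Fig. 3, the second
# the and-or (sum-of-products) style of Fig. 4, with explicit input inverters.
def parity_xor(a, b, c):
    return a ^ b ^ c

def parity_and_or(a, b, c):
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (a & nb & nc) | (na & b & nc) | (na & nb & c) | (a & b & c)

# The two styles compute the same Boolean function; the paper additionally
# reports that their fault-injection results are identical (Table 1).
assert all(parity_xor(*v) == parity_and_or(*v) for v in product((0, 1), repeat=3))
print("xor-based and and-or-based forms agree on all 8 input combinations")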

1. The input code space is {0, 1}^5.

2. The output code space is the 1-out-of-2 code.

3. The fault set consists of inversions, stuck-at-0 and stuck-at-1 faults on input bits and gate outputs.

Table 1 shows the results of the simulations with fault injection.

1. For single errors the circuit is self-testing: there is always an input pattern which produces an output non-code word, i.e. a unidirectional error. There are no undetected faults.

2. For single errors the circuit is fault-secure: no fault produces an unexpected valid output code, i.e. a bidirectional error. A fault does not convert "10" into "01" and vice versa.

The parity checker is then Totally Self-Checking. Results are identical for the two implementations proposed.


Fig. 4. Single bit parity checker, and-or based.

Table 1. Single bit parity checker results.

Fault  Loc   #Pat  Unidir Err  Bidir Err  No Det
Inv    x(*)  16    16          0          0
SA0    x(*)  16    8           0          0
SA1    x(*)  16    8           0          0
Inv    U*    16    16          0          0
SA0    U*    16    8           0          0
SA1    U*    16    8           0          0

4.2. Two-rail checker

A two-rail checker compares two complementary sets of input vectors of n bits. Its output is coded in a 1-out-of-2 code. The output is "01" or "10" when the inputs are complementary, and "11" or "00" otherwise.

Equations 1a and 1b describe a two-rail checker of 2 bits.

z1 = x1 · y0 + x0 · y1    (1a)
z0 = x1 · x0 + y1 · y0    (1b)

There are methods to build n-bit two-rail checkers, described in [6, 3]. Fig. 5 shows two architectures for a two-rail checker of 4 bits.

A 4-bit two-rail checker, type "tree B", shown in Fig. 6, has been implemented for the evaluation.
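A behavioral sketch of the cell of equations (1a)/(1b) and of a 4-bit tree of such cells follows (Python); the pairing of bits at each tree level is an assumption of the sketch, not the authors' exact netlist. The exhaustive check illustrates the code-disjoint behaviour defined in Section 2.

from itertools import product

# Cell of equations (1a)/(1b) and a 4-bit checker built as a tree of such cells,
# in the spirit of the "tree B" structure of Figs. 5 and 6.
def trc2(x, y):
    """x = (x1, x0), y = (y1, y0); y is expected to be the bitwise complement of x."""
    x1, x0 = x
    y1, y0 = y
    z1 = (x1 & y0) | (x0 & y1)      # eq. (1a)
    z0 = (x1 & x0) | (y1 & y0)      # eq. (1b)
    return z1, z0

def trc4(x, y):
    """Tree of cells: two first-level cells, one second-level cell on their rails."""
    a = trc2((x[3], x[2]), (y[3], y[2]))
    b = trc2((x[1], x[0]), (y[1], y[0]))
    return trc2((a[0], b[0]), (a[1], b[1]))

VALID_OUT = {(0, 1), (1, 0)}        # 1-out-of-2 output code space

# Exhaustive check of code-disjoint behaviour: the output is a code word exactly
# when y is the bitwise complement of x.
for xb in product((0, 1), repeat=4):
    for yb in product((0, 1), repeat=4):
        complementary = all(xi != yi for xi, yi in zip(xb, yb))
        assert (trc4(xb, yb) in VALID_OUT) == complementary
print("the 4-bit tree maps code inputs to code words and non-code inputs to non-code words")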

Table 2 shows the results of the simulations with fault injection.

1. For single errors the circuit is self-testing: there is always an input pattern which produces a unidirectional error, i.e. a non-code word. There are no undetected faults.

2. For single errors the circuit is fault-secure: no fault produces an unexpected valid output code, i.e. a bidirectional error. A fault does not convert "10" into "01" and vice versa.

The 4-bit two-rail checker is then Totally Self Checking.


Fig. 5. A 4-bit dual rail checker.


Fig. 6. netlist of the 4-bit dual rail checker “tree B”.

4.3. 2-out-of-5 code checker

An m-out-of-n code checker checks whether the input word is a valid m-out-of-n code word. For an m-out-of-n checker to be self-testing, the code words must contain the same number of 0's and 1's, that is, the input code space must be a k-out-of-2k code [4].

However, general m-out-of-n code checkers, where n ≠ 2m, can be built by translating the code words into k-out-of-2k code words using code-disjoint translators [4, 5].

A k-out-of-2k code checker is built by dividing the input code word into two parts, which are processed independently by two circuits, which together generate a 1-out-of-2 output code space.

The 2-out-of-5 code described in [5] has been evaluated. The code is translated into a 3-out-of-6 code, which is finally checked. Equation 2 defines the translation [5], and equation 3 defines the checker.

Table 2. 4-bit two-rail checker results.

Fault  Loc     #Pat  Unidir Err  Bidir Err  No Det
Inv    x/y(*)  16    16          0          0
SA0    x/y(*)  16    8           0          0
SA1    x/y(*)  16    8           0          0
Inv    U*      16    [12,16]     0          0
SA0    U*      16    [4,8]       0          0
SA1    U*      16    [4,8]       0          0

Table 3. 2-out-of-5 checker results.

Fault  Loc   #Pat  Unidir Err  Bidir Err  No Det
Inv    x(*)  10    10          0          0
SA0    x(*)  10    4           0          0
SA1    x(*)  10    6           0          0
Inv    U*    10    [3,10]      0          0
SA0    U*    10    [1,6]       0          0
SA1    U*    10    [1,7]       0          0

y9 = x4x3    y8 = x4x2    y7 = x4x1    y6 = x4x0    y5 = x3x2
y4 = x3x1    y3 = x3x0    y2 = x2x1    y1 = x2x0    y0 = x1x0

z5 = y9 + y8 + y7 + y6 + y0
z4 = y9 + y8 + y5 + y4 + y1
z3 = y9 + y7 + y5 + y3 + y0
z2 = y5 + y4 + y3 + y2 + y1
z1 = y7 + y6 + y3 + y2 + y1
z0 = y8 + y6 + y4 + y2 + y1        (2)

f3 = (z5 + z4 + z3)(z2z1 + z2z0 + z1z0) + (z5z4z3)
g3 = (z2z1z0) + (z5z4 + z5z3 + z4z3)(z2 + z1 + z0)        (3)
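The second stage, the 3-out-of-6 checker of equation (3), can be exercised exhaustively with a short behavioral sketch (Python); only the two equations come from the text, the function name and harness are illustrative.

from itertools import product

# k-out-of-2k (here 3-out-of-6) checker circuit of equation (3).
def check_3_of_6(z5, z4, z3, z2, z1, z0):
    f3 = ((z5 | z4 | z3) & ((z2 & z1) | (z2 & z0) | (z1 & z0))) | (z5 & z4 & z3)
    g3 = (z2 & z1 & z0) | (((z5 & z4) | (z5 & z3) | (z4 & z3)) & (z2 | z1 | z0))
    return f3, g3

VALID_OUT = {(0, 1), (1, 0)}

# Over all 64 possible z words, (f3, g3) is a 1-out-of-2 code word exactly for
# the twenty words of weight 3, i.e. equation (3) realizes a code-disjoint
# 3-out-of-6 checker.
for z in product((0, 1), repeat=6):
    assert (check_3_of_6(*z) in VALID_OUT) == (sum(z) == 3)
print("equation (3) accepts exactly the 3-out-of-6 code words")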

Table 3 shows the results of the simulations with fault injection.

1. For single errors the circuit is self-testing: there is always an input pattern which produces a unidirectional error, i.e. a non-code word. There are no undetected faults.

2. For single errors the circuit is fault-secure: no fault produces an unexpected valid output code, i.e. a bidirectional error. A fault does not convert "10" into "01" and vice versa.

The implemented 2-out-of-5 checker is then Totally Self-Checking.

Fig. 7. Berger prediction code checker.

4.4. Berger Prediction Code Checker

Arithmetic operations can be performed on Berger-coded data by using prediction circuitry [8, 9].

Although the Berger check bits of the result can be calculated from the arithmetic result itself, it is also possible to predict the Berger code of the result [8].

Based on these assumptions, a Berger prediction code checker can be built by comparing the output Berger code and the predicted Berger code. Fig. 7 shows the Berger prediction code checker architecture.

The comparator must be totally self-checking, which can be implemented with a totally self-checking two-rail checker where one of the inputs has been inverted.

The fault tolerance of the code checker therefore reduces to the fault tolerance of the TSC two-rail checker. Results for a 4-bit two-rail checker are summarized in Table 2.
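A behavioral sketch of the comparison path of Fig. 7 follows (Python). The Berger convention used (check symbol = number of 0s in the word), the 8-bit width and the example values are assumptions of the sketch; the two-rail checker is abstracted to a simple complementarity test (in hardware it is the TSC checker of Section 4.2).

def berger_check(word, width=8, check_bits=4):
    # Check symbol computed from the data word itself (count of 0 bits).
    zeros = sum(1 - ((word >> i) & 1) for i in range(width))
    return [(zeros >> i) & 1 for i in range(check_bits)]

def two_rail_ok(a_bits, b_bits):
    # Stand-in for the TSC two-rail checker: valid iff the vectors are complementary.
    return all(x != y for x, y in zip(a_bits, b_bits))

result = 0b10110010                       # hypothetical ALU result
predicted = berger_check(result)          # what a fault-free predictor would supply
computed = berger_check(result)           # check bits recomputed from the result

# One checker input is inverted so that, fault-free, the two inputs are
# complementary and the two-rail checker output stays inside the 1-out-of-2 code.
print("fault-free comparison valid:", two_rail_ok(predicted, [1 - b for b in computed]))

corrupted = result ^ 0b00000100           # a single-bit error in the result
print("error flagged:", not two_rail_ok(predicted,
                                         [1 - b for b in berger_check(corrupted)]))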

5. CONCLUSION

The totally self-checking properties of four circuits have been evaluated with positive results. Their properties make them suitable to minimize SET effects in combinational circuits, because the percentage of nets which remain SET sensitive, with no SET protection, is reduced overall. Nets which are part of TSC logic are less sensitive to SET because, in the presence of a SET, the TSC logic either recovers or indicates an error at the end of the logic cone, which is normally connected to registers. In contrast, normal logic in the presence of a SET has no possibility of detecting errors at the end of the cone of logic. The number of nets affected by SETs which produce undetected errors is therefore reduced.

Complex code checkers can be made totally self-checking by using simple and smaller TSC checkers, like the two-rail checker, at a not very expensive cost in terms of area and speed. However, inputs and outputs must be coded in some error detection code [5].

TSC combinational logic combined with TMR sequential logic will increase the reliability of SET-sensitive systems, especially when the operating frequency is high enough that the glitch duration becomes comparable to the clock period and the probability for a glitch to be captured in a TMR flip-flop increases. In addition, the cases of study described in this paper demonstrate that the chosen implementation, either and-or-based or xor-based, has the same TSC properties.

6. ACKNOWLEDGMENTS

This work has been funded by the CICYT of Spain under contract TEC2007-68074-C02-02/MIC.

7. REFERENCES

[1] S. Habinc, "Lessons learned from FPGA developments," Gaisler Research, Technical Report, September 2002.

[2] W. C. Carter and P. R. Schneider, "Design of dynamically checked computers," in IFIP, vol. 2, North-Holland, 1968, pp. 878–883.

[3] D. A. Anderson, "Design of self-checking digital networks using coding techniques," Computer Science, University of Illinois at Urbana-Champaign, 1971.

[4] D. A. Anderson and G. Metze, "Design of totally self-checking check circuits for m-out-of-n codes," in Proceedings of FTCS, vol. 3, no. 25, IEEE, 1995, pp. 244–248.

[5] P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, D. E. M. Penrose, Ed., Morgan Kaufmann, 2001.

[6] J. F. Wakerly, Error Detecting Codes, Self-Checking Circuits and Applications, ser. Computer Design and Architecture Series, E. J. McCluskey, Ed., Elsevier North-Holland, Inc., 1978.

[7] F. Ozguner, "Design of totally self-checking embedded two-rail code checkers," in Electronics Letters, vol. 27, no. 4, IEEE, 1991, pp. 382–384.

[8] J.-C. Lo, S. Thanawastien, and T. R. N. Rao, "An SFS Berger check prediction ALU and its applications to self-checking processor designs," IEEE Transactions on Computer-Aided Design, vol. 11, no. 4, pp. 525–540, April 1992.

[9] J. H. Kim, T. R. N. Rao, and G. L. Feng, "The efficient design of a strongly fault-secure ALU using a reduced Berger code for WSI processor array," in International Conference on Wafer Scale Integration, 1993.


WIRELESS INTERNET CONFIGURABLE NETWORK MODULE

María Isabel Schiavon Laboratorio de Microelectrónica

FCEIA, UNR Rosario, Argentina,

[email protected],

Daniel Alberto Crepaldo Laboratorio de Microelectrónica

FCEIA, UNR Rosario, Argentina,

[email protected]

Raul Lisandro Martin Laboratorio de Microelectrónica

FCEIA, UNR Rosario, Argentina,

[email protected]

Abstract— A wireless Internet configurable network node based on field programmable gate array (FPGA) devices was developed. Internally, three modules can be distinguished: an 802.11 compatible transmitter/receiver, an ETHERNET compatible dedicated communication module and another module for receiving field sensor activity signals. Implementation results on a SPARTAN III development board are presented together with simulation results.

Keywords: wireless; reconfigurable network node; fpga

I. INTRODUCTION

A wireless Internet configurable network node implemented with field programmable logic devices (FPGA) is presented. The system has reconfiguration capability achieved through the Internet connection in a wireless ETHERNET local area network [1] [2].

Each one of the net nodes contains three modules: a sensor subsystem that receives activity signals from field sensors and generates the data telegram, a dedicated communication module that builds the message frame to transmit and decodes the incoming frames, and another to manage wireless data interchange, identified as TRANSMITTER/RECEIVER.

Node operation and configuration data are interchanged via a wireless ETHERNET network, and this is achieved remotely using the IP protocol [3]. The data interchange protocols for wireless Ethernet networks with connectivity to the INTERNET are defined by the IEEE 802.11 standard rules. These rules are independent of technology and internal structure. A minimum and necessary subset of the rules was selected. Each network node has a physical address (MAC) and an IP address.

II. DESCRIPTION

The block diagram of a typical net node is shown in Figure 1. It shows six blocks. The first block is a wireless ETHERNET compatible transmitter/receiver. The next three blocks constitute the dedicated communication module: the ETHERNET frames encoding and decoding block (CODE/DECO ETHERNET), the IP packets encoding and decoding block (CODE/DECO IP) and a memory block (ETHERNET DATA MEMORY). The last two blocks correspond to the sensor subsystem: one (SENSOR MANAGER) to receive data from sensors and to prepare them for transmission, and the other a memory block to store configuration parameters (CONFIGURATION MEMORY).

Figure 1. System node block diagram

A. TRANSMITTER/RECEIVER

The transmitter/receiver to be used in this application will be a wireless ETHERNET IEEE 802.11 compatible transmitter/receiver, and its description is out of the scope of this paper.

B. DEDICATED COMMUNICATION MODULE

The ETHERNET frames encoding and decoding block (CODE/DECO ETHERNET) is a bidirectional block that manages data transmission and reception. As a receptor, it recognizes, decodes and processes the incoming frame according to ETHERNET rules. In data transmission, the reverse process is managed. The internal block diagram is shown in Figure 2.

The analyzer (AT) block is designed as a finite state machine with an initial state to select the transmission or reception process. Each one of those processes is accomplished by its own state chain.

In the transmission process, when the PACKET OUT OK signal from the CODE/DECO IP is detected by the AT block, the process to build the output message is started, assembling the memory-stored data telegram, the destination/origin MAC addresses and the control bits. If the channel is busy the transmission is inhibited. When the transmission medium is free, a signal (CCA INDICATION) is generated by the transmitter. After a random backoff time to prevent possible collisions, the transmission is enabled.


25

Page 30: DF2010ProceedingsBody Revisado Cca2.PDF A4 v9

Figure 2. Code/deco ethernet internal block diagram

If the channel is free, the AT block generates a signal (TXSTARTREQUEST) to request the start of transmission. If the transmitter activates the TXSTARTCONFIRM signal, the AT block sends the first data octet and sets the DATAREQUEST signal. It waits until the DATACONFIRM signal is set, indicating octet reception, before sending the next octet. When the complete frame is transmitted, the AT block activates the TXENDREQUEST signal to stop transmission. The transmitter responds by activating the TXENDCONFIRM signal.
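The handshake just described can be summarized with a small behavioral sketch (Python, not the VHDL implementation); the MockTransmitter and its always-confirming behaviour are assumptions of the sketch, only the signal names come from the text.

class MockTransmitter:
    def __init__(self):
        self.sent = []
    def txstart(self):            # TXSTARTREQUEST answered by TXSTARTCONFIRM
        return True
    def data(self, octet):        # DATAREQUEST answered by DATACONFIRM
        self.sent.append(octet)
        return True
    def txend(self):              # TXENDREQUEST answered by TXENDCONFIRM
        return True

def at_transmit(frame_octets, tx, channel_free):
    """Returns True when the whole frame has been handed to the transmitter."""
    if not channel_free:          # CCA INDICATION not asserted: transmission inhibited
        return False
    if not tx.txstart():
        return False
    for octet in frame_octets:    # one DATAREQUEST/DATACONFIRM cycle per octet
        if not tx.data(octet):
            return False
    return tx.txend()

tx = MockTransmitter()
print(at_transmit([0xAA, 0x55, 0x01], tx, channel_free=True), [hex(b) for b in tx.sent])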

In the reception process, when the transmitter/receiver detects that a valid frame is starting, the RXSTARTINDICATION signal is set to activate the AT block in reception mode. The presence of a valid data octet at the output of the transmitter is indicated by the DATAINDICATION signal. The RXENDINDICATION signal is set when the complete frame has been received.

Data are received by the AT block and the CRC CHECKER block. Data are processed by the AT block and stored in the COMMUNICATION MEMORY. Simultaneously, the CRC CHECKER block checks redundancies through a feedback shift register, as proposed in the XILINX application note [4].
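For reference, a bit-serial CRC-32 computed with a feedback shift register can be sketched as follows (Python), in the spirit of the Xilinx application note [4]. The sketch assumes the standard IEEE 802.3 polynomial and LSB-first bit order; it is not the authors' VHDL.

import zlib

POLY_REFLECTED = 0xEDB88320       # reversed form of the 802.3 polynomial 0x04C11DB7

def crc32_serial(data: bytes) -> int:
    crc = 0xFFFFFFFF              # shift register preset to all ones
    for byte in data:
        for i in range(8):        # one register shift per incoming bit
            feedback = (crc ^ (byte >> i)) & 1
            crc >>= 1
            if feedback:
                crc ^= POLY_REFLECTED
    return crc ^ 0xFFFFFFFF       # final inversion as in 802.3

frame = bytes(range(16))
assert crc32_serial(frame) == zlib.crc32(frame)   # cross-check against a reference
print(hex(crc32_serial(frame)))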

When the CRC is validated, the stored data packet is ready to be used by the CODE/DECO IP. In any other situation the process stops and the data are rejected.

The COMMUNICATION MEMORY was implemented in one of the block RAMs with two read/write ports available in SPARTAN III devices. The origin and destination MAC addresses are stored to be used in the construction of the answer message.

When the corresponding fields of the incoming ETHERNET frame are loaded in the COMMUNICATION MEMORY by the CODE/DECO ETHERNET block, the PACKET OK signal validates data presence, and the IP protocol encoder/decoder block (CODE/DECO IP block) manages the IP protocol and extracts the CONFIGURATION DATA telegrams.

The IP protocol encoder/decoder block is designed as a finite state machine, with an initial waiting state and one state for each field of the received packet. If the state corresponds to a transmission control information field, data are verified and, when they are validated, the system changes to the next state. Otherwise the packet is discarded and the system returns to the initial state.

If the corresponding IP address is detected, the decoding process is completed and data are stored in the sensor subsystem memory (CONFIGURATION MEMORY).

When the sensor unit (SENSOR MANAGER) generates a data telegram, a valid data signal is activated and the IP block starts the transmission procedure, storing this data in the COMMUNICATION MEMORY and setting the PACKET OK signal.

III. RESULTS

A. Simulation results

Simulation results for an incoming appointed ETHERNET frame are shown in Figure 3. Once an ETHERNET frame is detected by the transmitter, the RXSTARTINDICATION signal turns high, and the DATAINDICATION signal validates the incoming bytes. When the destination MAC address is recognized, the DIR OK signal turns high, and when the IBSS identifier is verified, the BSSOK signal turns high. As data come in, the duration field, destination address and data bytes are stored (see the DATAAIN signal). The CRC is checked and the FRAME OK signal turns high to validate the frame.

(a) First 16 bytes

(b) Rest of frame

Figure 3. Reception of an appointed frame

IP packet decoding simulation results are shown in Figure 4. Packet data stored in the communication memory are read and the decoding process is started by the CODE/DECO IP stage (see the ADDRB_IN and DATAB_IN signals). Once this process is completed, data are stored in the configuration memory (see the ADDRA_OUT and DATAA_OUT signals).



Figure 4. IP packet decoding

Figure 5. Sensor data telegram generation

When an activation signal from one sensor is detected, the SENSOR1 signal is activated and the corresponding telegram generation is started. The FREE signal indicates that the channel is clear, and the telegram bytes are transmitted through the DATA and VALID DATA signals, as shown in Figure 5.

Simulation results for the transmission of a test frame are shown in figure 6.

Once the transmission request from the CODE/DECO IP is detected by the AT block, the output message is assembled. When the transmission medium is free (signal CCA INDICATION set), the TXSTARTREQUEST signal from the AT block turns high to request the start of transmission. The transmitter activates the TXSTARTCONFIRM signal, and the AT block sends the first data octet, setting the DATAREQUEST signal. When the DATACONFIRM signal is set to indicate octet reception, the next octet is sent. When the complete frame is transmitted, the AT block activates the TXENDREQUEST signal to stop transmission. The transmitter responds by activating the TXENDCONFIRM signal.

(a) Start of transmission

(b) End of transmission

Figure 6. Transmission of a test frame

B. Hardware Results

The prototype was implemented using Digilent S3 SKB development boards for SPARTAN 3 devices [5]. The designed node occupies 124 out of a total of 1920 slices (about 6% of the FPGA total capacity).

IV. CONCLUSIONS

The SPARTAN 3 design and implementation of a domotic network configurable via INTERNET was presented. Meaningful simulation results obtained with the XILINX ISE platform simulation software are shown. The simulation results were validated with successful communication tests done over the prototype implemented on Digilent S3 SKB development boards.

V. REFERENCES

[1] Schiavon M. I., Crepaldo D., Martín R. L., Varela C., "Dedicated system configurable via Internet embedded communication manager module," V Southern Conference on Programmable Logic, San Carlos, Brasil (2009), pp. 193-197.

[2] IEEE, IEEE STD 802.11-2007, "Revision of IEEE STD 802.11-1999," June 2007.

[3] Waisbrot, J., "Request For Comments: 791," http://www.rfc-es.org/rfc/rfc0826-es.txt

[4] Borrelli C., "IEEE 802.3 Cyclic Redundancy Check," XILINX, App. Note XAPP209, March 2001.

[5] Digilent S3 SKB development boards, SPARTAN 3 FPGA, and ISE platform, http://www.xilinx.com


MIC – A NEW COMPRESSION METHOD OF INSTRUCTIONS IN HARDWARE FOR EMBEDDED SYSTEMS

Wanderson R. A. Dias, Raimundo da S. Barreto

Department of Computer Science - DCC Federal University of Amazonas - UFAM

[email protected], [email protected]

Edward David Moreno

Department of Computer Science - DCOMP Federal University of Sergipe - UFS

[email protected]

ABSTRACT

Several factors are considered in the development of embedded systems, among which may be mentioned: physical size, weight, mobility, energy, memory, freshness, safety, all combined with low cost and ease of use. There are several techniques to optimize the execution time and power consumption of embedded systems. One such technique is code compression; the majority of existing proposals focus on decompression, assuming the code is compressed at compile time. This article proposes a new code compression/decompression method implemented in VHDL and prototyped on an FPGA, called MIC (Middle Instruction Compression). The proposed method was compared with the traditional Huffman method, also implemented in hardware. MIC showed better results than Huffman for some MiBench programs, widely used in embedded systems: a 71% higher clock frequency (in MHz) and 36% better code compression compared with the Huffman method, while allowing compression and decompression at runtime.

1. INTRODUCTION

Embedded systems are any digital systems incorporated into other systems in order to add or optimize features [16]. Embedded systems have the task of monitoring and/or controlling the environment in which they are inserted. These environments may be present in electronic devices, appliances, vehicles, machinery, engines and many other applications.

The growing demand for embedded systems has made them increasingly common, prompting the implementation of complex systems on a single chip, called System-on-Chip (SoC). In this case, the embedded processor is a key component of embedded computer systems [4]. Today, many embedded processors found in the market are based on high-performance architectures (e.g., 32-bit RISC architectures) to ensure better computational performance for the tasks to be performed. Therefore, the design of embedded systems with high-performance processors is not a simple task.

It is known that many embedded systems are powered by batteries. For this reason, it is critical that these systems are able to control and manage power, thus enabling a reduction in energy consumption and control of heating. Therefore, designers and researchers have focused on developing techniques that reduce energy consumption while maintaining performance requirements. One such technique is the compression of the instruction code in memory.

Most of the techniques, methodologies and standards for software development aimed at the control and management of energy consumption do not seem feasible for the development of embedded systems, because such systems have several limitations in computing and physical resources. Current strategies designed to control and manage energy consumption have been developed for general-purpose systems, where the cost of additional processors or memory is usually insignificant.

The code size increases significantly as systems become more heterogeneous and complex. In this context, several techniques have been proposed that compress the code at compile time, with decompression, in turn, performed at run time [12, 13, 14].

Code compression was originally developed in order to reduce code size [15]. But over time, groups of researchers found that this technique could also greatly benefit performance and energy consumption in general-purpose systems and embedded systems. Once the code is compressed in memory, each processor request can obtain a much larger amount of instructions from memory. There is thus a decrease in the switching activity on the memory access pins, leading to a possible increase in system performance and a possible reduction in the energy consumption of the circuit [15].

Likewise, storing compressed instructions in the cache increases the number of instructions stored in the cache and increases its hit rate, reducing searches in main memory, increasing system performance and, therefore, reducing energy consumption.

This article presents the development of a new method for compressing and decompressing instructions (at runtime), which was implemented in VHDL (VHSIC Hardware Description Language) [5] and prototyped in an FPGA (Field Programmable Gate Array) [3]. The method, called MIC (Middle Instruction Compression), was compared with the traditional Huffman method, also implemented in hardware, and was shown to be more efficient than the Huffman method in a comparison using the MiBench benchmark [7].

The rest of the paper is organized as follows: Section 2 presents the related work; Section 3 explains the PDCCM architecture developed for the MIC method; Section 4 details the MIC method; Section 5 shows the simulations with the MiBench benchmark using the MIC and Huffman methods; finally, Section 6 presents conclusions and ideas for future work.

2. RELATED WORK

This section lists some research found in the literature related to compressed instruction codes.

WOLFE & CHANIN [17] developed the CCRP (Compressed Code RISC Processor), which was the first hardware decompressor implemented in a RISC processor (MIPS R2000) and was also the first technique to use cache access misses to trigger the decompression.

The CCRP has an architecture similar to a standard RISC processor and thus the program models are unchanged. This implies that all existing development tools for the RISC architecture, including optimizing compilers, functional simulators, graphics libraries and others, also serve the CCRP architecture. The unit of compression used is the instruction cache line. On every cache access miss, the instructions are fetched from main memory, decompressed, and fill the cache line where the miss occurred [17]. The fact that the CCRP performs decompression of the instructions before storing them in the cache is advantageous in that the jump addresses contained in the cache are the same as in the original code. This solves most addressing problems; there is no need to resort to workarounds such as (I) putting extra hardware in the processor for different treatment of jumps, and (II) patching jump addresses.

The CCRP technique used Huffman coding [8] generated from a histogram of occurrences of program bytes and showed a compression ratio of 73% on average for the tested package (consisting of the programs nasa1, nasa7, tomcatv, matrix25A, espresso, fpppp and others). For slower memory models, DRAM (Dynamic Random Access Memory), processor performance was mostly mildly improved. For faster memory models, EPROM (Erasable Programmable Read Only Memory), performance suffered a slight degradation.

AZEVEDO [2] proposed a method called IBC (Instruction Based Compression), which divides the processor instruction set into classes, taking into account the number of occurrences along with the number of elements in each class. The research by AZEVEDO [2] showed the best compression results with 4 classes of instructions. The compression technique developed groups pairs in the format [prefix, codeword] that replace the original code. In the pairs formed, the prefix indicates the instruction class and the codeword serves as an index into the instruction table.

The decompression process is performed in 4 pipeline stages. The first stage is called INPUT, where the processor address (uncompressed code) is converted into the main memory address. The second stage is called FETCH, which is responsible for fetching the compressed word from main memory. The third stage is known as DECODE, where the decoding of the codewords is actually performed. And finally, in the fourth stage, called OUTPUT, the instruction dictionary is queried to provide the instruction to the processor. In tests, AZEVEDO [2] obtained a compression ratio of 53.6% for the MIPS processor and 61.4% for the SPARC (Scalable Processor Architecture). Regarding performance, there was a loss of 5.89% using the IBC method.

BENINI et al. [4] developed a compression algorithm that is suitable for an efficient hardware implementation of the decompressor. The instructions are packaged in groups that are the size of a cache line and their decompression occurs at the moment they are extracted from the cache. The experiments were performed with the DLX processor, because it has a simple 32-bit architecture and is also a RISC architecture. In addition, the DLX processor is similar to several commercial processors of the ARM [1] and MIPS families. A table of 256 positions was used to store the most frequently executed instructions. Each cache line consists of 4 original instructions, or a set of compressed instructions possibly interspersed with non-compressed ones, prefixed by a 32-bit word. That word is not compressed, occupies a fixed position in the cache line, and serves to differentiate a cache line with compressed instructions from the other lines with the original instructions. Indeed, a compressed cache line does not necessarily contain only compressed instructions, but there should always be between 5 and 12 instructions in the compressed cache line for the use of compression to be advantageous [4].

To avoid the use of address translation tables, BENINI et al. require that the destination addresses are always aligned to 32 bits (one word). The first word (32 bits) of the cache line contains an L mark and a set of flag bits. The mark is an opcode not used by any instruction, i.e., an opcode that signals a compressed line (in the DLX processor opcodes are 6 bits). The compression algorithm developed by BENINI et al. [4] analyzes the code sequentially from the first instruction (assuming that each cache line is already aligned) and tries to pack instructions into adjacent compressed lines. The experiments carried out on several packages of benchmark C code provided by the Ptolemy project [6] showed an average reduction in code size of 28% and an average savings in energy consumption of 30%.

LEKATSAS et al. [10, 11] developed a single-cycle decompression unit. Decompression can be applied to instructions of any size of a RISC processor (16, 24 or 32 bits). The only application-specific part is the interfacing between the processor and memory (main or cache). The decompression mechanism is capable of decompressing one or two instructions per cycle to meet the demand of the CPU without increasing the runtime. They developed a technique to create a dictionary that contains the instructions that appear most frequently. The dictionary belongs to a class of compression methods that replace sequences of symbols with the contents of a table. This table is called a "dictionary" and its contents are the "codewords" in the compressed program [11]. The main advantage of this technique is that the codes are usually of fixed length, which simplifies the decompression logic that accesses the dictionary and also reduces the latency of the decompression. The results obtained in the tests carried out showed an average gain of 25% in application execution time using code compression and an average reduction of 35% in code size. The technology developed is not limited to only one processor, but can be applied to other processors with similar results.

LEFURGY et al. [9] proposed a compression technique based on the program code using a code dictionary. Compression is performed after compiling the source code: the object code is analyzed and common sequences of instructions are replaced by a coded word (codeword), as in text compression. Only the most frequent instructions are compressed. An escape bit is used to distinguish a compressed (encoded) word from an uncompressed instruction. The instructions corresponding to the compressed instructions are stored in a dictionary in the decompression hardware. The compressed instructions are used to index the dictionary entries. The final code consists of codewords mixed with uncompressed instructions.

It is observed that one of the most common problems found in code compression concerns the determination of the target addresses of jump instructions. Usually this type of instruction (direct branches) is not encoded, to avoid the need to rewrite the codewords that represent these instructions [8]. Indirect branches, on the other hand, can be encoded normally because their target addresses are stored in registers and only the codewords need to be rewritten. In this case, only one table is needed to map the addresses stored in the original registers to the new compressed addresses.

This method differs from other methods seen in the literature in that the target addresses are always aligned to 4 bits (the size of a codeword), not to the processor word size (32 bits). The advantage appears to be better compression, but the disadvantage is the need for changes in the processor core (extra hardware) to handle branches to addresses aligned to 4 bits. However, details about the interaction of the decompression hardware with the evaluated processors (PowerPC, ARM and i386) are unclear. The operation of the decompression hardware is basically as follows: the instruction is fetched from memory; if it is a codeword, the codeword-specific decoding logic obtains the offset, which serves as an index to access the uncompressed instruction in the dictionary and pass it to the processor. If instructions are not compressed, they are passed directly to the processor. With the method proposed in [9], compression ratios of 61% for the PowerPC processor, 66% for the ARM processor and 75% for the i386 processor were obtained. Performance and power consumption metrics were not reported.

3. PDCCM ARCHITECTURE AND MIC METHOD

In the literature we have found two basic types of code compression architectures, CDM and PDC, which indicate the position of the decompressor relative to the processor and the memory subsystem, as shown in Figure 1. The CDM (Cache Memory Decompressor) architecture places the decompressor between the cache and main memory, while the PDC (Processor Cache Decompressor) architecture places the decompressor between the processor and the cache.

Fig. 1. Code decompression architectures: (a) CDM and (b) PDC [12].

As previously mentioned (Section 2), the development of

architectures for compression or decompression code instruction is done separately, in most of the work of the treaty only because the decompressor hardware compression of the instructions is usually done through changes in the compiler. Thus, the compression is performed at compile time and decompression is done at run time using a specific hardware decompression.

To operate the MIC method proposed in this work, it was necessary to develop a new architecture, hardware, to carry out the compression and decompression of the instruction code at runtime. The architecture was created titled PDCCM Processor (Compressor Decompressor Cache Memory) in which it is shown that hardware compression was inserted between the cache and main memory and hardware decompression was inserted between the processor and memory cache. PDCCM The architecture was implemented in VHDL and prototyped on an FPGA manufacturer ALTERA® [18].

The PDCCM architecture works with 32-bit instructions, that is, each instruction cache line consists of 4 bytes. The architecture is therefore compatible with systems that use an ARM processor as the core of the embedded system, since this processor has a 32-bit instruction set. In the PDCCM architecture, using the MIC compression/decompression method, every instruction recorded in the instruction cache is compressed to 50% of its original size.

Figure 2 shows the PDCCM architecture developed to implement the new method of compressing and decompressing instructions in hardware (MIC). It consists of four basic components:


• LAT (Line Address Table): a table whose function is to map the original addresses of the instructions to their new addresses in the instruction cache;

• ST (Sign Table): a table of flag bits that indicate to the decompressor how each pair of bits should be reconstituted when decompressing;

• Compressor: performs the compression of all instruction codes that will be saved in the instruction cache. The compressor is started every time RAM is accessed and a new instruction is passed on to be saved in the instruction cache;

• Decompressor: performs the decompression of all instructions that are stored in the instruction cache and will be passed to the processor. The decompressor is triggered every time a lookup in the LAT returns a hit.

Fig. 2. PDCCM Architecture.

4. A COMPRESSION ALGORITHM: MIC METHOD

The MIC (Middle Instruction Compression) method is a compression method that reduces by 50% the size of the instruction codes stored in the instruction cache, taking the instructions from their original length of 32 bits to a compressed length of 16 bits.

The MIC method requires additional memory components, the ST and the LAT, which store respectively the set of flags of each compressed instruction and the mapping of the new addresses of the compressed instructions in the cache.

For compression, each instruction read from memory to be saved in the instruction cache is split into pairs of bits, each pair being one of: 00, 01, 10 or 11. The MIC compressor applies the following logic: pairs with equal bits are replaced by bit 0 (zero) and pairs with different bits are replaced by bit 1 (one). That is, the pairs 00 and 11 are replaced by bit 0 and the pairs 01 and 10 are replaced by bit 1. Each pair of bits is thus reduced to a single bit.

An auxiliary table (ST) is used to store the set of flags of the compressed pairs of bits. The pairs 00 and 10 record a 0 in the ST, and the pairs 11 and 01 record a 1 in the ST. It is noteworthy that the addressing mode of the instruction lines in this architecture is Big-Endian. A software sketch of this pairing logic is given below.
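As an illustration only (this is not the authors' VHDL; the function name and the test value are assumptions), the pairing logic just described can be sketched in C as follows: each of the 16 bit pairs of a 32-bit instruction, read from the most significant pair, produces one compressed bit (0 for 00/11, 1 for 01/10) and one flag bit for the ST (0 for 00/10, 1 for 11/01).

#include <stdint.h>
#include <stdio.h>

/* Illustrative software sketch of the MIC compression logic. */
static void mic_compress(uint32_t instr, uint16_t *compressed, uint16_t *flags)
{
    *compressed = 0;
    *flags = 0;
    for (int i = 15; i >= 0; i--) {                /* i = 15 is the most significant pair */
        uint32_t pair = (instr >> (2 * i)) & 0x3;
        uint16_t c = (pair == 0x1 || pair == 0x2); /* 01 or 10 -> 1; 00 or 11 -> 0 */
        uint16_t f = (uint16_t)(pair & 0x1);       /* 00/10 -> 0; 11/01 -> 1 */
        *compressed = (uint16_t)((*compressed << 1) | c);
        *flags      = (uint16_t)((*flags << 1) | f);
    }
}

int main(void)
{
    uint16_t c, f;
    mic_compress(0xE3A01005u, &c, &f);             /* an arbitrary 32-bit word */
    printf("compressed = 0x%04X, flags = 0x%04X\n", (unsigned)c, (unsigned)f);
    return 0;
}

The 32-bit instruction is thus reduced to a 16-bit compressed word plus a 16-bit word of flags, which are exactly the two values the architecture stores in the instruction cache and in the ST.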

For further clarity, wherever possible the names of components, variables, and input and output pins are similar to those used in the code implemented in VHDL.

4.1. Compression/Decompression Process

The processor requests an instruction from the instruction cache through a pin which, in this implementation, was called end_inst_proc (current PC). The LAT is checked to determine whether it contains the address provided by the processor. If the instruction is found in the instruction cache, the LAT signals a HIT and provides the new address of the instruction in the instruction cache, the address of its set of flags in the ST, and the half (first or second) of the line where the instruction and its flags are located in the instruction cache and in the ST, respectively. All this information is passed to the decompressor, which reconstructs the instruction and returns the uncompressed form to the processor through the variable returnD_inst_proc.

The decompression of instruction codes is performed as follows:

• The new address of the instruction, passed by the LAT, is used to locate the entry in the instruction cache and in the ST;

• The instruction cache and ST return to the decompressor the 16-bit compressed instruction and 16-bit set of flags;

• If the bit read from the compressed instruction in the instruction cache is 0 (zero), the pair of bits to be reconstructed is 00 or 11. What defines how the pair of bits is the bit flag, that is, if the flag bit is 0 the pair of bits to be reconstructed is 00 and if the flag bit is 1 the pair of bits to be reconstructed is 11;

• If the bit read from the compressed instruction in the instruction cache is 1 (one), the pair of bits to be reconstructed is 10 or 01. Again the flag bit defines the pair: if the flag bit is 0 the pair to be reconstructed is 10, and if the flag bit is 1 the pair to be reconstructed is 01;

• For each instruction to be decompressed, the 16-bit word saved in the instruction cache is analyzed in this way, transforming the 16-bit compressed instruction back into the 32-bit uncompressed instruction (a software sketch of this reconstruction is given below).
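A matching C sketch of the reconstruction (again an illustration under the same assumptions, not the authors' VHDL) reads the compressed bit and the flag bit of each position and expands them back into the original pair:

#include <stdint.h>
#include <stdio.h>

/* Illustrative software sketch of the MIC decompression logic. */
static uint32_t mic_decompress(uint16_t compressed, uint16_t flags)
{
    uint32_t instr = 0;
    for (int i = 15; i >= 0; i--) {       /* bit 15 corresponds to the most significant pair */
        uint32_t c = (compressed >> i) & 0x1;
        uint32_t f = (flags >> i) & 0x1;
        uint32_t pair;
        if (c == 0)
            pair = f ? 0x3 : 0x0;         /* flag 1 -> 11, flag 0 -> 00 */
        else
            pair = f ? 0x1 : 0x2;         /* flag 1 -> 01, flag 0 -> 10 */
        instr = (instr << 2) | pair;
    }
    return instr;
}

int main(void)
{
    printf("instr = 0x%08X\n", (unsigned)mic_decompress(0x0000, 0xFFFF)); /* expands to 0xFFFFFFFF */
    return 0;
}

Applying mic_decompress to the pair of words produced by the compression sketch shown earlier returns the original 32-bit instruction, which is the round trip the Compressor and Decompressor components perform in hardware.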

If the address provided by the processor is not in the LAT, the instruction is not in the instruction cache and the LAT signals a FAILURE (miss) of the instruction cache. The address provided by the processor is then passed to the RAM (Random Access Memory), where it is checked whether the instruction is present. If the search in RAM also results in a FAILURE, the instruction is fetched from the HD (hard disk). If the fetch indicates a hit, the instruction is in RAM; the RAM then returns a copy of the instruction in its original (uncompressed) format to the processor through the variable returnC_inst_proc and another copy to the compressor, which carries out the whole compression process.

The compression of instruction codes is performed as follows:

• The instruction is located in RAM and one copy of it is passed to the processor and another to the compressor;


• In the compressor, the instruction is split into 16 pairs of bits; each pair is formed at the moment it is read by the compression function. The instruction coming from RAM is read starting from the MSB (most significant bit);

• The compressor always keeps track of which half (first or second) of the line must receive the compressed instruction in the instruction cache and its set of flags in the ST;

• If the pair of bits read for compression is 00 or 11, it is replaced by bit 0 and saved in the instruction cache. If the pair of bits read is 10 or 01, it is replaced by bit 1 and saved in the instruction cache;

• The set of flags in the ST is formed by the following logic: if the pair of bits being compressed is 00 or 10, the flag bit saved is 0; if the pair is 11 or 01, the flag bit saved is 1;

• After the compressor has replaced the entire original 32-bit instruction by its 16-bit compressed form and its flag bits, it saves the compressed half-line (first or second) in the instruction cache and the instruction's set of flags in the ST;

• The LAT table is updated with the new instruction address saved in the instruction cache and ST;

• This compression process is repeated for each instruction that is fetched from memory.

It is important to highlight that in our approach these compression and decompression mechanisms are performed at runtime by specialized hardware, which was prototyped on an FPGA. The PDCCM architecture has a small performance loss due to the additional cycle in the pipeline. In our hardware implementation we found a result similar to that obtained by LEKATSAS in [10, 11], since our component needs only a single cycle for compression or decompression; the benefits are shown in the next section.

5. SIMULATIONS WITH MIBENCH

The benchmarks used in the simulations of the MIC and Huffman compression/decompression methods come from the MiBench package [7], which targets embedded systems and covers different categories; they are in ARM9 assembly code, as found in [19, 20, 21, 22]. The MiBench programs used in the simulations are: CRC32, JPEG, QuickSort and SHA.

We used the instruction set of an embedded ARM processor (ARM9 family, version ARM922T, ARMv4T ISA) to simulate the operation of the MIC and Huffman compressors and decompressors in the PDCCM architecture. The chosen processor (ARM) is a RISC processor with a 32-bit instruction set, which makes it a good platform on which to simulate the PDCCM architecture.

For the simulations of compression and decompression with the MIC and Huffman methods, the first 256 instructions of each MiBench program were selected (due to physical limitations of the FPGA used for prototyping), obtained from the code compiled (assembly) for the embedded ARM processor. These form the set of instruction sequences used to load a segment of RAM and the instruction cache. For more details, see [23].

The segment of RAM described in VHDL was used in all simulations with the MiBench benchmarks and had a fixed size of 256 lines of 4 bytes each (modeling a 1 Kbyte memory, i.e. 8,192 bits), while the instruction cache has 32 lines of 32 bits each (a 1 Kbit instruction cache). There is therefore an 8:1 ratio between the sizes of the RAM and its instruction cache.

Table 1 shows the average timing of the PDCCM architecture using both the MIC and the Huffman methods for compression and decompression of the instructions of some MiBench programs.

Table 1. Delay in FPGA.

                           MIC          Huffman
Compression
  Time in the worst case   9.314 ns     9.849 ns
  Clock in MHz             33.52 MHz    13.16 MHz
  Clock time               30.398 ns    76.020 ns
Decompression
  Time in the worst case   9.234 ns     11.006 ns
  Clock in MHz             30.92 MHz    5.52 MHz
  Clock time               32.554 ns    184.606 ns

As Table 1 shows, the MIC method achieved better timing in the FPGA for all MiBench benchmarks analyzed. In compression, there is a difference of more than 60% in clock frequency (in MHz), while the worst-case times of the two methods are very similar. In decompression the difference is even greater: more than 82% in clock frequency (in MHz).

Based on the first 256 instructions of the MiBench benchmarks obtained from the assembly code compiled for the ARM platform, Table 2 shows that the MIC method compressed the instructions by 50%, i.e. the 256 lines of the RAM segment used in the simulation occupied only 128 lines of the instruction cache after compression. The instructions compressed with the Huffman method achieved an overall average compression of 32% relative to the size of the RAM used in the simulation.

Table 2. Comparison of the rate of compression.

MiBench (256 instructions)   MIC          Huffman
CRC32                        128 (50%)    159 (38%)
JPEG                         128 (50%)    181 (29%)
QuickSort                    128 (50%)    192 (25%)
SHA                          128 (50%)    164 (36%)
Averages                     128 (50%)    174 (32%)

Based on these results, we find that for the PDCCM architecture, using the first 256 instructions of the MiBench benchmarks (CRC32, JPEG, QuickSort and SHA), the MIC method was more efficient in the compression phase, with a compression rate 36% higher than that of the Huffman method.

6. CONCLUSIONS AND FUTURE WORKS

This paper presented a survey of research on instruction code compression/decompression and the two basic architectures: CDM (Cache Decompressor Memory), which places the decompressor between the cache and main memory, and PDC (Processor Decompressor Cache), which places the decompressor between the processor and the cache.

The article described a new compression method, called MIC, which was prototyped on an FPGA and proved to be feasible for embedded systems that use a RISC architecture. In the future this technique may become a necessary component in embedded system projects. With the use of code compression techniques, RISC architectures can mitigate one of their biggest problems, which is the amount of memory needed to store programs.

The simulations carried out with some MiBench benchmark programs showed that, on average, the MIC method operated at a frequency (in MHz) approximately 3 times higher for the compression/decompression of instruction codes and was 36% more efficient in compression rate for the MiBench programs analyzed, in relation to the Huffman method, which was also prototyped in hardware.

Therefore, analyzing the data obtained through the simulations, it is concluded that the method developed and presented in this paper, called MIC, was more computationally efficient than the Huffman method implemented in hardware. The simulations used the programs CRC32, JPEG, QuickSort and SHA of the MiBench benchmark for the performance measurements.

As future work we intend to: design and implement a RISC processor that has the compressor and decompressor hardware built into its core; test the MIC and Huffman compression and decompression methods with more MiBench benchmark programs; and reach an ASIC implementation, so that this project goes beyond the academic realm and also serves as a contribution to the industrial sector.

7. REFERENCES

[1] ARM. An Introduction to Thumb. Advanced RISC Machines Ltd., March 1995.

[2] AZEVEDO, R. An Architecture for Code Compression in Dedicated Systems. PhD thesis, IC, UNICAMP, Brazil, June 2002.

[3] COSTA, C. da. Designing Digital Controllers with FPGA. – São Paulo: Novatec Publisher, 2006, 159p.

[4] BENINI, L.; MACII, A.; NANNARELLI, A. Cached-Code Compression for Energy Minimization in Embedded Processor. Proc. of ISPLED'01, pages 322-327, August 2001.

[5] D'AMORE, R. VHDL - Description and Synthesis of Digital Circuits. – Rio de Janeiro: LTC, 2005, 259p.

[6] DAVIS II, J.; GOEL, M.; HYLANDS, C.; KIENHUIS, B.; LEE, E. A.; LIU, J.; LIU, X.; MULIADI, L.;

NEUENDORFFER, S.; REEKIE, J.; SMYTH, N.; TSAY, J.; XIONG, Y. Overview of the Ptolemy Project, ERL Technical Memorandum UCB/ERL Tech. Report Nº M-99/37, Dept. EECS, University of California, Berkeley, July 1999.

[7] GUTHAUS, M.; RINGENBERG, J.; ERNST, D.; AUSTIN, T.; MUDGE, T.; BROWN, R. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. of the IEEE 4th Annual Workshop on Workload Characterization, pages 3-14, December 2001.

[8] HUFFMAN, D. A. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40(9):1098-1101, September 1952.

[9] LEFURGY, C.; BIRD, P.; CHEN, I-C.; MUDGE, T. Improving Code Density Using Compression Techniques. In Proc. Int'l Symposium on Microarchitecture, pages 194-203, December 1997.

[10] LEKATSAS, H.; HENKEL, J.; JAKKULA, V. Design of One-Cycle Decompression Hardware for Performance Increase in Embedded Systems. In Proc. ACM/IEEE Design Automation Conference, pages 34-39, June 2002.

[11] LEKATSAS, H.; WOLF, W. Code Compression for Embedded Systems. In Proc. ACM/IEEE Design Automation Conference, pages 516-521, June 1998.

[12] NETTO, E. B. W. Code Compression Based on Multi-Profile. PhD thesis, IC, UNICAMP, Brazil, May 2004.

[13] NETTO, E. B. W.; AZEVEDO, R.; CENTODUCATTE, P.; ARAÚJO, G. Mixed Static/Dynamic Profiling for Dictionary Based Code Compression. The Proc. of the International System-on-Chip Symposium, Finland, pages 159-163, November 2003.

[14] NETTO, E. B. W.; AZEVEDO, R.; CENTODUCATTE, P.; ARAUJO, G. Multi-Profile Based Code Compression. In Proc. ACM/IEEE Design Automation Conference, pages 244-249, June 2004.

[15] NETTO, E. B. W.; OLIVEIRA, R. S. de; AZEVEDO, R.; CENTODUCATTE, P. Code Compression in Embedded Systems. HOLOS CEFET-RN. Natal, Year 19, pages 23-28, December, 2003. 94p.

[16] OLIVEIRA, A. S. de; ANDRADE, F. S. de. Embedded Systems - Hardware and Firmware in Practice. – São Paulo: Publisher Érica, 2006, 316p.

[17] WOLFE, A.; CHANIN, A. Executing Compressed Programs on an Embedded RISC Architecture. Proc. of Int. Symposium on Microarchitecture, pages 81-91, December 1992.

[18] ALTERA® Corporation. Available at: www.altera.com. Accessed on 9 July 2008.

[19] Assembly code compiled for the ARM9's MiBench CRC32. Available at: www.efn.org/~rick/work/. Accessed February 17, 2009.

[20] Assembly code compiled for the ARM9's MiBench JPEG. Available at: www.zophar.net/roms/files/gba/supersnake.zip. Accessed February 17, 2009.

[21] Assembly code compiled for the ARM9's MiBench QuickSort. Available at: www.shruta.net/download/archives/project/report/5/5.2/ARM9. Accessed February 17, 2009.

[22] Assembly code compiled for the ARM9's MiBench SHA1. Available at: www.openssl.org/. Accessed February 17, 2009.

[23] DIAS, W. R. A. PDCCM Architecture in Hardware for Compression/Decompression of Instructions in Embedded Systems. M.Sc. Dissertation, DCC, UFAM, Brazil, April 2009.


EMBEDDED SYSTEM THAT SIMULATES ECG WAVEFORMS

Thyago Maia Tavares de Farias

Programa de Pós-Graduação em Informática Universidade Federal da Paraíba

Cidade Universitária - João Pessoa - PB – Brasil – CEP: 58059-900

email: [email protected]

José Antônio Gomes de Lima

Programa de Pós-Graduação em Informática Universidade Federal da Paraíba

Cidade Universitária - João Pessoa - PB – Brasil – CEP: 58059-900 email: [email protected]

Fig. 1. Typical ECG signal.

ABSTRACT

This paper describes an embedded system developed for the simulation of electrocardiographic (ECG) signals. The objective of the system is to generate several examples of ECG waveforms for analysis and review in short time periods, eliminating the difficulties of obtaining real ECG signals through invasive and noninvasive methods. Any given ECG waveform can be simulated using this embedded system. The simulator was developed with Altera's Nios® II development kits and Altera's CAD software for the definition of the hardware layer, and with Fourier series and Karthik's algorithm implemented in the C language through the Altera Nios® II IDE for the implementation of the software layer.

1. INTRODUCTION

According to Dirichlet [1], any periodic function which satisfies the Dirichlet conditions can be expressed as a series of scaled sine and cosine terms whose frequencies are multiples of a fundamental frequency. Karthik [2] states that ECG signals are periodic, with a frequency determined by the heart beat rate, and satisfy Dirichlet's condition. Therefore, a Fourier series [3] can represent an ECG signal. The Fourier series is described in (1a).

f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n \cos\left(\frac{n\pi x}{l}\right) + \sum_{n=1}^{\infty} b_n \sin\left(\frac{n\pi x}{l}\right)   (1a)

a_0 = \frac{1}{l} \int_T f(x)\,dx, \quad T = 2l   (1b)

a_n = \frac{1}{l} \int_T f(x) \cos\left(\frac{n\pi x}{l}\right) dx, \quad n = 1, 2, 3, \ldots   (1c)

b_n = \frac{1}{l} \int_T f(x) \sin\left(\frac{n\pi x}{l}\right) dx, \quad n = 1, 2, 3, \ldots   (1d)

This work is inspired by the algorithm implemented in the MATLAB® script language by Karthik [2]. From the definition of signal parameters such as heart beat rate, amplitude and duration, the algorithm calculates separately the portions P, T, U, Q, S and QRS of a typical ECG signal. These portions are illustrated in Fig. 1. The calculation of each portion is based on the Fourier series described in (1a), and every significant feature of the ECG signal is generated from the sum of these waveforms. Developing an embedded system based on Karthik's algorithm [2] enables the prototyping of hardware that helps researchers in the analysis and review of electrocardiographic signals. The prototype can be applied in the corrective and preventive maintenance of various types of electrocardiography equipment, can periodically check the operating limits of heart monitors and similar equipment, and can evaluate and compare the performance of equipment from different manufacturers.

2. METHODOLOGY

The work involves the description of the Nios® II platform, including internal peripherals and access to external devices, software development with GNU C/C++ in the Eclipse® IDE, and hardware-aided debugging. Nios® II is treated as a reconfigurable soft-core processor. A set of standard peripherals accompanies the platform and personalized peripherals can also be developed. The development kit used to design the holter monitor was Altera's® Nios Development Board, Stratix Edition.


Fig. 2. Hardware layer.

procedure qrs_q_s(amp, dur, hbr) {
  x = 0.01:0.01:600;
  li = 30/hbr;
  b = (2*li)/dur;
  wave_1 = (amp/(2*b))*(2-b);
  wave_2 = 0;
  n = <TOTAL_NUMBER_OF_SAMPLES>;
  for i = 1 to n
    harm = (((2*b*amp)/((i*i)*(PI*PI)))*(1 - cos((i*PI)/b)))*cos((i*PI*x)/li);
    wave_2 = wave_2 + harm;
  end
  final_wave = wave_1 + wave_2;
}

Fig. 4. Algorithm for the calculation of QRS, Q and S portions.

Fig. 3. Software layer.

This board provides a hardware platform for developing embedded systems based on Altera® Stratix devices.

3. HARDWARE LAYER

Fig. 2 shows the hardware structure of the simulator. The Nios II core executes the software-layer module, previously stored in SDRAM memory. Avalon® is a special bus that prioritizes data-communication speed and allows parallel connections. The PIO module offers input and output paths, establishing communication between the Nios II platform and the blocks used. The Flash memory device is an 8 Mbyte AMD AM29LV065D, used as general-purpose readable memory and non-volatile storage. The JTAG UART core uses the JTAG circuitry built into Altera® FPGAs and provides host access via the JTAG pins of the FPGA. The software tools used to define and generate the system were Quartus® II and SOPC Builder.

4. SOFTWARE LAYER

Fig. 3 shows the software structure of the simulator. The parameters of amplitude, duration, heart beat rate and intervals (P-R and S-T) are used to calculate the P, Q, QRS, S, T and U portions. Each portion is generated by one of two procedures: one is responsible for calculating the samples of the QRS, Q and S portions, since these parts can be represented by triangular waveforms [2], and the other is responsible for calculating the samples of the P, T and U portions, since these parts can be represented by sinusoidal waveforms [2]. Fig. 4 shows the algorithm for the calculation of the QRS, Q and S portions, and Fig. 5 shows the algorithm for the calculation of the P, T and U portions (a C rendering of the first of these is sketched below). The main procedure is responsible for passing the necessary parameters for the calculation of the wave portions to the auxiliary procedures and for merging the calculated portions into a single wave, the resulting ECG signal. Samples of the resulting signal are written to a text file, which can be opened in any software that generates graphics, such as MATLAB® or Excel®.
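Since the software layer is written in C for the Nios® II, the procedure of Fig. 4 can be rendered roughly as the following C sketch; the sample grid, the number of harmonics and the array handling are assumptions made here for illustration, not values taken from the paper.

#include <math.h>
#include <stdio.h>

#define N_SAMPLES   600     /* assumed number of output samples */
#define N_HARMONICS 100     /* assumed number of Fourier harmonics */

static const double PI = 3.14159265358979323846;

/* Rough C rendering of the Fig. 4 procedure (triangular QRS, Q and S portions). */
static void qrs_q_s(double amp, double dur, double hbr, double wave[N_SAMPLES])
{
    double li = 30.0 / hbr;
    double b = (2.0 * li) / dur;
    double wave_1 = (amp / (2.0 * b)) * (2.0 - b);

    for (int k = 0; k < N_SAMPLES; k++) {
        double x = 0.01 * (k + 1);                 /* x = 0.01, 0.02, ... grid */
        double wave_2 = 0.0;
        for (int i = 1; i <= N_HARMONICS; i++) {   /* sum of the Fourier harmonics */
            wave_2 += ((2.0 * b * amp) / (i * i * PI * PI))
                      * (1.0 - cos(i * PI / b))
                      * cos(i * PI * x / li);
        }
        wave[k] = wave_1 + wave_2;
    }
}

int main(void)
{
    static double qrs[N_SAMPLES];
    qrs_q_s(1.60, 0.11, 72.0, qrs);   /* R amplitude, QRS duration and heart rate of Table 1 */
    printf("first samples: %f %f %f\n", qrs[0], qrs[1], qrs[2]);
    return 0;
}

The samples could then be added to the other portions by the main procedure and written to the output text file, as described above.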


procedure p_t_u(amp, dur, hbr, int) {
  x = 0.01:0.01:600;
  x = x - int;
  li = 30/hbr;
  b = (2*li)/dur;
  u1 = 1/li;
  u2 = 0;
  n = <TOTAL_NUMBER_OF_SAMPLES>;
  for i = 1 to n
    harm = (((sin((PI/(2*b))*(b-(2*i))))/(b-(2*i)) + (sin((PI/(2*b))*(b+(2*i))))/(b+(2*i)))*(2/PI))*cos((i*PI*x)/li);
    u2 = u2 + harm;
  end
  wave_1 = u1 + u2;
  final_wave = wave_1 * amp;
}

Fig. 5. Algorithm for the calculation of P, T and U portions.

Table 1. Input data used in tests.

Heart beat rate            72
Amplitude - P wave         25 mV
Amplitude - R wave         1.60 mV
Amplitude - Q wave         0.025 mV
Amplitude - T wave         0.35 mV
Duration - P-R interval    0.16 s
Duration - S-T interval    0.18 s
Duration - P interval      0.09 s
Duration - QRS interval    0.11 s

Fig. 6. Obtained ECG signal after tests.

The software used for the development of this layer was the Nios® II Embedded Design Suite, and the C language was used in the implementation. This IDE loads the developed software into the SDRAM memory of the Altera development kit, where it is executed by the Nios® II core.

5. RESULTS

Table 1 shows the input data used in the tests for the generation of an ECG signal by the developed embedded system. These values are used as default values by the simulator; other values can be specified to generate ECG signals with distinct features. Fig. 6 shows the resulting ECG signal.

6. CONCLUSION

The results obtained show that the developed embedded system succeeded in simulating an ECG signal through Fourier series. The simulator can generate any given ECG waveform without using an ECG machine, removing the difficulties of acquiring real ECG signals with invasive and noninvasive methods.

7. REFERENCES

[1] G. L. Dirichlet. (1829, Jan.). Sur la convergence des séries trigonométriques qui servent à représenter une fonction arbitraire entre des limites données. Journal für die reine und angewandte Mathematik. [Online]. 1829(4), pp. 157–169. Available: http://www.reference-global.com/doi/abs/10.1515/crll.1829.4.157

[2] R. Karthik. (2009, Aug. 13). ECG Simulation Using MATLAB – Principle of Fourier Series. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/10858

[3] J. Fourier. (1826). Théorie du mouvement de la chaleur dans les corps solides (suite). Mémoires de l’Académie royale des sciences de l’Institut de France. [Online]. pp. 153–246. Available: http://gallica.bnf.fr/ark:/12148/bpt6k33707/f6n94.capture


An FPGA BASED CONVERTER FROM FIXED POINT TO LOGARITHMIC NUMBER SYSTEM FOR REAL TIME APPLICATIONS

Elio A. A. De María, Carlos E. Maidana, Fernando I. Szklanny

Grupo de Investigación en Lógica Programable Universidad Nacional de La Matanza

Florencio Varela 1903, San Justo Prov. Buenos Aires - Argentina

email: [email protected]

ABSTRACT

This paper presents a high speed conversion system, based on programmable logic arrays, to convert fixed point values, such as those obtained at the output of a high speed analog to digital converter, into the logarithmic number system. A basic objective of this research is to obtain real time conversion of the data generated by such converters, so that the data can be handled in a format which allows arithmetical operations to be carried out simply and efficiently. A conversion algorithm is therefore proposed which avoids the use of tables and interpolation methods. Another feature of this algorithm is that it is entirely implemented in one FPGA, without the need for external hardware (such as external RAM memories) and with minimum use of internal resources.

1. OBJECTIVES

The need for real time or near real time numerical operations, with a high level of accuracy, appears in different areas of current technology. In the particular area of analog to digital conversion, state of the art converters are available working at sampling rates of around 1 Gsamples/sec.

In such cases, real time operation requires the use of a representation system that allows these calculations to be made properly, offering good precision in the obtained results. This demands a very good response time from the logic circuits involved in such calculations.

For these application areas, exponential formats offer important advantages over other representation systems, because a wide range of values can be represented with adequate precision, suitable for most real world applications.

The use of a floating point number system such as the traditional IEEE 754 standard [1], or of a logarithmic number system, is therefore suitable for this objective.

This paper responds to the need to solve arithmetical operations in real time or near real time systems, in order to use the results in digital signal processing applications.

The first objective of our project, described in this paper, is to develop an algorithm able to convert integer numbers, such as those obtained at the output of an analog to digital converter, to the Logarithmic Number System representation, using numerical procedures that do not require a large amount of hardware resources or time consuming calculation procedures.

Another objective of this project is to show that the conversion can be made with a minimum conversion error, comparable to or better than the AD conversion error, and better than the approximation errors associated with the logarithmic number system itself.

A third objective of this project is to implement the complete fixed point to LNS converter in one single field programmable gate array, using the least possible amount of hardware resources, especially sequential logic elements, and with no need for external devices.

2. INTRODUCTION AND BACKGROUND.

In recent years, many researchers have analyzed the characteristics of LNS representations. Many published papers are based on a comparison with a conventional floating point representation system, referring basically to the way arithmetical operations are solved and to the precision and errors related to each representation system.

Among these papers, those written by Matousek et al [2], Haselman et al [3], and Detrey and de Dinechin [4] can be mentioned.

Matousek et al [2] analyze the configuration of the logarithmic number system, including its ranges and precision, and conclude that this representation system is adequate for use in FPGA devices.

Haselman et al [3] compare the logarithmic number system and the IEEE floating point representation system, and suggest an interesting conversion method between both systems. Their paper includes a deep analysis of the hardware requirements needed for this conversion.

Detrey and de Dinechin [4], in turn, propose a VHDL library of LNS operators to be used when LNS is employed in signal processing applications.

These and other papers discuss different ways of solving arithmetical operations, especially addition and subtraction, when working in logarithmic number systems.

In some of these papers, and in order to obtain minimum errors, the abovementioned operations are based on tables and interpolation systems.

When the bit length of the numbers to be converted increases, these tables can grow large enough to require memory chips external to the FPGA used to implement such arithmetical operations.

The LNS representation system shows as a major advantage the fact that the relative error in a numerical representation is constant and only depends on the number of bits included in the fractional part of the exponent. Thus, what has to be decided is how many bits are needed, in the LNS exponent, to convert a fixed point number with a reasonable error, compatible with the standard representation error.

On the other hand, considering that the number to be converted is a fixed point integer value obtained as the output of an analog to digital converter, it is clear that this number, acting as an input to the LNS converter, may carry an input error, namely the output error of the AD conversion, not higher than 0.5 bit. This reference must be taken into account in order to limit the LNS representation to a number of bits consistent with the required precision. This means that, even if the usual LNS format uses exponents with an integer part of eight bits and a fractional part of 23 bits, there may be no need for that many bits given the fixed point output error.
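As a rough back-of-the-envelope check (our own estimate, not a figure taken from the original analysis): for a full-scale 16-bit sample, a ±0.5 bit conversion error corresponds to a relative error of about

\frac{0.5}{2^{16}} \approx 7.6\ \mathrm{ppm},

while k fractional exponent bits give an LNS value resolution of roughly

\frac{\Delta N}{N} \approx \ln 2 \cdot 2^{-k},

which for k = 18 is about 2.6 ppm. A fractional part of around 18 bits is therefore already finer than the input error of a full-scale sample, which is consistent with the 18 fractional bits adopted in Section 4.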

3. THE CONVERSION ALGORITHM.

Let N be a number to be represented in LNS format. It is expressed as written in equation (1), where m and 0.f are the integer and fractional parts of an exponent of base 2.

N = 2^{m.f}   (1)

If a representation range similar to the one given by the IEEE 754 32-bit floating point standard is needed, it can be shown that the required sizes of m and f are 8 and 23 bits, respectively.

From equation (1), it can be stated that:

N = 2^{m.f} = 2^{m} \cdot 2^{0.f}   (2)

The value of the integer part of the exponent, m, will be defined as a function of the number of bits of the integer number to be converted. This number can then be written as a normalized expression 1.mmm...m, through equation (3).

N = 1.mmm\ldots m \times 2^{m} = 2^{0.f} \times 2^{m}   (3)

In this equation, the exponent 0.f can be expressed as a sum of negative powers of base 2, as shown in equation (4).

0.f = f_1 \cdot 2^{-1} + f_2 \cdot 2^{-2} + f_3 \cdot 2^{-3} + \cdots   (4)

The conversion algorithm developed in this paper is based on obtaining the fractional bits of the exponent through successive squaring of the normalized expression of the number N.

As the normalized number to be multiplied by itself has a range of values going from 1 to almost 2, its square power will adopt values from 1 to about 4, which, expressed in binary numbers, would be represented by 01.xxxxxx...x to 11.xxxxxx...x, being xxxxxx...x the fractional part of the mantissa, in both cases.

If the obtained square has an integer part formed by two bits, the value has to be normalized again. This means that a 1 is added in the corresponding position of the fractional part of the exponent, and the value is normalized back to the 1.xxxxxx...x format, truncating the fractional part to the original quantity of bits.

If the result obtained by multiplying the normalized number by itself has an integer part of 1, the next bit in the fractional part of the exponent is a zero.

This procedure will be repeated until the required precision for number N representation is obtained.

Therefore, the number of successive squaring steps depends on the requested precision. If the number N, in its normalized expression, has a fractional part with a predefined quantity of bits, its square has twice that quantity of bits. The developed algorithm therefore analyzes the need for truncation or rounding of the obtained square, in order to keep the number of bits constant over further iterations.

Truncation of the obtained result introduces an error, which must be measured to be sure that the error produced by the multiplication and normalization process does not exceed the intrinsic error of the LNS representation system.
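A minimal software sketch of this successive-squaring procedure is given below in C (an illustration of the idea only, not the pipelined hardware; the 16-bit input width, the 1.16 fixed-point mantissa format and the 18 fractional bits are assumptions chosen to mirror the text).

#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 18   /* assumed number of fractional exponent bits */

/* Fixed point to LNS conversion by successive squaring of the normalized mantissa. */
static void fix_to_lns(uint16_t n, unsigned *m, uint32_t *f)
{
    /* integer part of the exponent: position of the most significant one */
    *m = 0;
    for (uint32_t t = n; t > 1; t >>= 1)
        (*m)++;

    /* normalized mantissa 1.xxxx... kept in 1.16 fixed-point format, value in [1, 2) */
    uint32_t mant = ((uint32_t)n << 16) >> *m;

    *f = 0;
    for (int i = 0; i < FRAC_BITS; i++) {
        uint64_t sq = ((uint64_t)mant * mant) >> 16;  /* square, truncated back to x.16 */
        *f <<= 1;
        if (sq >= (2u << 16)) {          /* integer part needs two bits: next bit is 1 */
            *f |= 1;
            mant = (uint32_t)(sq >> 1);  /* renormalize to [1, 2) */
        } else {
            mant = (uint32_t)sq;         /* next fractional exponent bit is 0 */
        }
    }
}

int main(void)
{
    unsigned m; uint32_t f;
    fix_to_lns(55185, &m, &f);
    printf("m = %u, fractional bits = 0x%05X\n", m, (unsigned)f);  /* log2(55185) is about 15.752 */
    return 0;
}

Each pass of the loop plays the role of one pipeline stage: one squaring, one test of the integer part and one conditional shift, producing one more bit of the fractional exponent.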

4. ALGORITHM IMPLEMENTATION

The proposed conversion algorithm was implemented using a Virtex II Pro device, considering as main objectives a design with minimum space requirements and maximum speed. This led to a pipelined structure which allows, once all the stages of the pipeline are full, a new conversion to be obtained with every clock pulse of the device.

Figure 2 depicts the first stage of the pipeline, in charge of the normalization of the incoming number to be converted. This stage includes a priority encoder, which is responsible for obtaining the integer part of the exponent of the converted number N by finding the position of the first significant one in the incoming number. It also includes a barrel shifter used to "normalize" the incoming number, normalization here meaning obtaining a number within the range 1.mmmm...m. For simplicity, this normalized number, which is not represented in LNS, will hereinafter be called the numerical mantissa of the number N.

The barrel shifter will simply shift the number from right to left as many times as indicated by the priority encoder.

Furthermore, as the LNS has no representation for the number 0, a zero flag is included, which is activated directly from the incoming number.

The results coming out of this stage are latched into the block that calculates the fractional part of the exponent.

Stage 1 of this block, as shown in Fig. 2, multiplies this mantissa by itself, and the obtained result is truncated to 17 bits. As shown before, after the square of the incoming normalized "mantissa" is calculated, the most significant bit of the result is included as a new bit in the fractional part of the LNS exponent.

Each of the successive stages of this pipeline adds one new bit to the EXPONENT register. The number of stages equals the number of bits in the fractional part of the exponent. In our case, the final register includes 4 bits for the integer part of the LNS exponent and 18 bits for its fractional part.

The included multiplexer has the function of shifting the normalized mantissa depending on the value of product bit 17.

Figure 1, for simplicity reasons, only shows the first stage of the pipeline.

After the last stage of the pipeline, the numerical mantissa is discarded, as only the values held in the ZERO and EXPONENT registers are required as the result.

It should be mentioned that, with a Virtex II Pro device, the results obtained must be considered very good, as the final design uses only 1% of the chip resources. As regards speed, the conversion rate is very high, because the pipelined architecture allows one conversion per clock pulse.

In the near future, it is planned to implement this same design in a Spartan III family device, which is much cheaper than the Virtex II Pro considered here. The Spartan family has a limited quantity of multipliers, which can limit the results of the proposed algorithm.

Fig. 2. Converter block diagram (one stage).

In any case, considering this situation, and considering also that the successive conversion stages produce fractional exponent bits with less and less weight in the final result, a future objective is to design this conversion system with variable length multipliers, in order to reduce the need for large calculating units.

Preliminary tests, not yet ready to be included in this paper, have shown a space saving of around 30% in the multiplying pipeline stages alone.

5. ACKNOWLEDGEMENT

The authors wish to thank the members of the Signal and Image Processing Research Group of our University, Mr. Roberto De Paoli and Mr. Luis Fernández, for all the knowledge and support received during our research work. Many of the results obtained in this project would have been impossible to reach without their help.

6. CONCLUSION AND RESULTS

This paper described a proposed algorithm for converting fixed point numbers into the logarithmic number system (LNS) and the way the developed algorithm was implemented on a Virtex II Pro FPGA.

Fig. 1. Converter block diagram.


The developed project uses an iterative algorithm as the base of the conversion process, allowing a simple solution based on a pipelined multiplication system, with no requirements for external memories, huge tables or interpolation methods, which are usual in other conversion approaches.

As can be seen in simulation results, shown in table 1, the algorithm uses a very small part of the hardware resources included in the FPGA.

The pipelined design will allow for one new conversion completed at every clock cycle of the circuit.

This performance allows considering the use of the developed system for high speed applications, such as digital signal processing, audio and video applications, among others.

Data inputs have been considered to be no larger than 16 bits, for compatibility with current AD converters that could be used together with this application. The use of the FPGA internal multipliers, which are 18 bit x 18 bit multipliers, allows the results to be very accurate.

In fact, the testing process included converting every fixed point number in the 16-bit range into LNS, using the 18-bit multipliers in two different ways.

In the first test, the internal multipliers of the FPGA were used as 16 bit x 16 bit multipliers. In the second test, they were used over their complete 18-bit range, even though the input data was represented in 16 bits.

The results obtained from the fixed point to LNS converter were converted back into fixed point numbers, for the entire 16-bit number range. The results are shown in Table 1.

This table shows that the absolute error obtained in the conversion is never higher than 1.355, obtained when converting the integer number 55185. This implies a relative error of around 24.5 parts per million. When the 18-bit multipliers are used, the maximum absolute error falls to 0.428857, obtained when converting the integer number 55516; in this case, the relative error is less than 8 ppm.

Considering also the fact, already mentioned, that the converter allows one conversion per clock cycle after an initial delay of 18 clocks, it is clear that the developed converter is a high speed conversion system, suitable for real time applications, and with no need for RAM blocks or huge stored tables, as in other converters.

This will allow the converter to be developed on one FPGA device, using only the hardware resources included in the device.

The lack of need for external elements gives, as a result, a compact design, available on well known commercial FPGA devices.

Table 2 shows the hardware resources used from the Virtex Pro FPGA. The abovementioned results show that most of the hardware resources included in the FPGA have not been used and are, therefore, still available.

Table 1. Conversion error results

Simulation results        Using 16 bit multipliers    Using 18 bit multipliers
Maximum absolute error    1.355                       0.428857
Obtained at number        55185                       55516
Maximum relative error    28.5 ppm                    8.77 ppm
Obtained at number        8237                        46425

Table 2. Device utilization Summary

Logic utilization    Used    Available    Utilization
Slices               223     13696        1%
Multipliers          18      136          13%
Multiplexers         1       16           6%
4 Input LUTs         354     27392        1%
Slice flip flops     329     27392        1%

In this connection, a further stage of this research will port the converter to a Spartan III device. Even though the Spartan III does not include as many multipliers, the results can still be useful for simpler applications that do not require very high speed and precision.

7. REFERENCES

[1] IEEE 754-2008 IEEE Standard for floating point arithmetic. IEEE Computer Society. Jun. 2008

[2] R. Matousek, M. Tichy, Z. Pohl, J. Kadlec, C. Softley and N. Coleman, “Logarithmic Number System and Floating Point Arithmetic in FPGA”, Lecture Notes in Computer Science, Springer Berlin, ISSN 0302-9743, Vol 2438/2002.

[3] Michael Haselman, Michael Beauchamp, Aaron Wood, Scott Hauck, Keith Underwood, K. Scott Hemmert, "A Comparison of Floating Point and Logarithmic Number Systems for FPGAs,", 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'05), pp.181-190, 2005.

[4] Detrey, J.; de Dinechin, F., "A VHDL library of LNS operators," Signals, Systems and Computers, 2003. Conference Record of the Thirty-Seventh Asilomar Conference on , vol.2, no., pp. 2227-2231 Vol.2, 9-12 Nov. 2003

[5] Virtex II Platform FPGA Handbook. Xilinx Inc., 2000.



HARDWARE CO-PROCESSING UNIT FOR REAL-TIME SCHEDULING ANALYSIS

José Urriza, Ricardo Cayssials1, Edgardo Ferro

Universidad Nacional del Sur – CONICET1

Department of Electrical Engineering Bahía Blanca – Argentina

email: [email protected]

ABSTRACT

In this paper we describe the design and implementation of a co-processing unit for real-time scheduling analysis. The unit implements an arithmetic architecture that determines the schedulability of a fixed-priority discipline without requiring processing time from the system processor. The fixed-priority discipline is one of the most important disciplines in real-time scheduling: a priority is assigned to each task and remains fixed during runtime. Exact schedulability conditions are useful to determine whether the real-time requirements can be met. However, when schedulability has to be determined during runtime, the complexity of the calculation requires so much processing time that it makes the system unfeasible. The proposed processing unit implements efficiently the real-time analysis of a set of real-time tasks scheduled under a fixed-priority discipline and can be used in different real-time areas.

1. INTRODUCTION

In the classical definition ([1]), Real-Time Systems (RTS) are those in which results must be not only correct from an arithmetic-logical point of view but also produced before a certain instant, called deadline.

Hard real-time systems are those in which no deadline can be missed; missing the deadline of a task may have severe consequences and catastrophic results. Schedulability analyses were proposed to determine whether all deadlines will be met during runtime. If the schedulability analysis is successful, the system is said to be schedulable; otherwise the system is non-schedulable and consequently some deadlines may be missed.

This work was supported by the Technological Modernization Program under Grant BID1728/OC-AR-PICT2005 Number 38162 and by the project "Digital processing platform for Active Power Line Filters" granted by Fundación Hermanos Agustín y Enrique Rocca.

In [1], the schedulability of single-processor, multitasking systems is considered. A priority discipline establishes a linear order on the set of tasks, allowing the scheduler to define, at each activation instant, which task will use the shared processor.

Usually, the tasks are considered to be periodic, independent and preemptable. A periodic task is one that requests execution again after a certain time. A task is said to be independent if it does not need the result of the execution of any other task for its own execution. Finally, a task is said to be preemptable when the scheduler can suspend its execution and withdraw it from the processor at any time.

Generally, the parameters of each task under this framework are: its execution time, denoted Ci; its period, denoted Ti; and its deadline, Di. Thus, a real-time system is specified by a set of n tasks, S(n), such that S(n) = {(C1, T1, D1), (C2, T2, D2), ..., (Cn, Tn, Dn)}.

Numerous schedulability tests have been proposed in the real-time literature ([4, 5, 6, 7]). In 1986, Joseph and Pandya ([8]) presented an iterative fixed point method to evaluate a necessary and sufficient condition for the feasibility of an RTS using a fixed-priority scheduler. Several works with equivalent solutions have been published since. In 1998, Sjödin ([3]) improved Joseph's test by starting the iteration for task i+1 at the worst case response time of task i plus the execution time of task i+1. In 2004, Bini ([2]) proposed a new method called the Hyperplanes Exact Test (HET) to determine the schedulability of an RTS.

All these proposals try to improve the efficiency of the schedulability analysis, since it is the base of several real-time areas: aperiodic task servers, fault tolerant computing, slack stealing techniques, the multitask-multiprocessor assignment problem, among others. Several of the proposed mechanisms are unfeasible to implement during runtime because of the processing time that the schedulability analysis demands.

In this paper, a co-processing unit for the real-time scheduling analysis of real-time systems under a fixed-priority discipline is proposed. This co-processor unit can be included in different architectures since it was designed with a general memory interface. The high performance of the unit makes it useful for implementing on-line real-time strategies.

This paper is organized as follows: Section 2 describes the main concepts in real-time scheduling analysis. Section 3 explains the arithmetic architecture proposed to solve the fixed point schedulability function. In Section 4, we describe the data structure that interfaces the arithmetic architecture with the processor of the system, using the main memory as interface. Section 5 shows the results obtained. In Section 6, we describe the target applications in which this architecture may be applied. Conclusions are drawn in Section 7.

2. REAL-TIME SCHEDULING ANALYSIS

Real-time scheduling analyses are proposed in order to determine the schedulability of real-time systems. The scheduling analysis depends on the priority discipline considered. Two of the most important priority disciplines in real-time are: Earliest Deadline First and Fixed Priority. Almost all practical real-time operating systems implement fixed priority schedulers and consequently it is important to get efficient scheduling analysis strategies for this discipline.

Since 1986, most of the schedulability tests developed for fixed-priority disciplines are based on applying fixed point methods to guarantee the schedulability of the real-time system.

By definition, a fixed point of a function f is a number t such that t = f(t). In our case, the fixed point function is a function of time, and consequently the point t and the instant t are equivalent expressions.

The first fixed point method to determine the schedulability of a real-time system under a fixed-priority discipline was developed by Joseph and Pandya ([8]). In [8], it is proved that there is no analytical construction to solve this kind of problem and that it can only be solved by iterative calculations.

Joseph’s method is initialized in the critical instant in which all the tasks are simultaneously invoked. As shown in [8], the result is the Worst Case Response Time of task i, denoted Wi, of a subset of tasks S(i). The fixed point equation proposed in [8] is:

t^{q+1} = C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t^{q}}{T_j} \right\rceil C_j \qquad (1)

The fixed point given by this equation, if it exists, is the Worst Case Response Time of task i. Consequently, task i is schedulable if, starting with t^0 = C_i + W_{i-1}, there exists a fixed point (t^{q+1} = t^q) and the Worst Case Response Time of task i is lower than or equal to its deadline (t^q ≤ D_i). Otherwise, task i is non-schedulable and consequently the real-time system is non-schedulable as well.

The Utilization Factor of a real-time task is defined as Ci/Ti. The total utilization factor of the real-time system is therefore defined as the summation of the utilization factor of the tasks of the real-time system. It is a necessary condition for schedulability that the total utilization factor be less than or equal to 1.

Several methods have been proposed to solve this fixed point function. All of them start with an initial value of t and iterate until the fixed point is reached. The complexity of these methods for a real-time system with n tasks is proportional to n²·max(T). This complexity may make the scheduling analysis unfeasible during runtime because of the processing time required to find the fixed point.

In this paper we propose a hardware processing unit that solves this fixed point function without requiring processing time from the processor of the system and consequently without perturbing the execution of the real-time tasks.

3. ARITHMETIC ARCHITECTURE

The proposed arithmetic architecture solves the schedulability condition given in Eq. 1. A fixed point function requires an iterative method to find a solution. Several methods have been proposed to find the smallest number of iterations needed to converge to the final solution. Finding a solution by trying each value of t from 0 to Ti may lead to a great number of iterations and consequently to a very time consuming method.

The simplest and most trivial iteration method begins with t^0 = 0, evaluates the function to get the next value of t, and ends when the value of t obtained is equal to the value of t used in the calculation (a software sketch of this iteration is given below). This iteration mechanism is valid because the fixed point function is proven to be monotonic. Figure 1 shows the arithmetic architecture proposed to perform the calculation of Eq. 1.
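As a point of reference for the hardware, the following C sketch shows the same trivial iteration of Eq. (1) in software (the task set, the names and the use of T = D are illustrative assumptions; tasks are assumed ordered by decreasing priority).

#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t C, T, D; } task_t;   /* execution time, period, deadline */

/* Fixed point iteration of Eq. (1); returns 1 if task i meets its deadline
 * and writes its worst case response time to *wcrt. */
static int task_schedulable(const task_t *s, int i, uint32_t *wcrt)
{
    uint32_t t = 0, next = s[i].C;              /* first value computed from t^0 = 0 */
    while (next != t && next <= s[i].D) {
        t = next;
        next = s[i].C;
        for (int j = 0; j < i; j++)             /* interference of higher priority tasks */
            next += ((t + s[j].T - 1) / s[j].T) * s[j].C;   /* ceil(t / T_j) * C_j */
    }
    *wcrt = next;
    return next == t && next <= s[i].D;
}

int main(void)
{
    task_t s[] = { {1, 4, 4}, {2, 6, 6}, {3, 13, 13} };   /* hypothetical task set */
    for (int i = 0; i < 3; i++) {
        uint32_t w;
        int ok = task_schedulable(s, i, &w);
        printf("task %d: WCRT = %u -> %s\n", i, (unsigned)w, ok ? "schedulable" : "not schedulable");
    }
    return 0;
}

Every pass of the inner loop corresponds to one divide, multiply and accumulate step of the arithmetic architecture of Fig. 1, which is exactly the work the co-processing unit removes from the system processor.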

All the architecture requires is synchronization to produce the values Cj and Tj each time they are needed. This synchronization is in charge of a sequential machine that picks up these values from the memory of the system and transfers them to the respective registers. The sequential machine also initializes the accumulator to zero and determines the end of the iterations.

From Fig. 1, it can be noted that the time required for the calculation depends on the time required first by the integer division and then by the integer multiplication. The throughput can easily be increased with a pipeline that contains the integer divider in the first stage and the integer multiplier in the second stage.


4. DATA STRUCTURE

There is a great deal of information involved in the schedulability analysis. This information contains all the parameters of the real-time tasks and may be modified or accessed by the processor of the system. The arithmetic architecture requires access to the real-time parameters efficiently and without perturbing the execution of the real-time tasks of the system. For this reason we implemented a memory accessing unit with indirection capabilities to access the different parameters of the real-time tasks.

Fig. 1. Arithmetic architecture proposed: an integer divider, an integer multiplier, a comparator and an accumulator operating on Ci, Cj, Tj and tq to produce tq+1.

The information was structured in order to deal with all the different target applications that may require a schedulability analysis. The data structure uses indexes to access the real-time parameters of the tasks as well as to store the results of the analysis. The data structure of the real-time system (Table 1) stores the number of tasks, the index to the data structure of the highest priority task and the results of the scheduling analysis. There is a memory address used to command the beginning of the scheduling analysis and another memory address in which the scheduling unit indicates that the analysis has ended.

The data structure of each real-time task (Table 2) stores the real-time parameters, the index to the next highest priority task and the result of the scheduling analysis for the task.

Table 1. Data Structure of the Real-Time System

n: number of tasks
Index to highest priority task
System Schedulability: True or False
Start analysis: True or False
End of Analysis: True or False

Table 2. Data Structure of Each Real-Time Task

C: Worst Case Execution Time
T: Period
D: Deadline
W: Worst Case Response Time
Schedulability: True or False
Index to the next highest priority task

This data structure allows easy communication with the different target applications that require an efficient on-line real-time scheduling analysis. Changes to the real-time parameters of the tasks, to the number of real-time tasks in the system and to the priorities of the tasks can easily be made during runtime without any change to the proposed arithmetic architecture. Moreover, this data structure can be shared among the arithmetic architecture, the real-time processor and other specific hardware units. A possible processor-side view of this layout is sketched below.

5. EXPERIMENTAL RESULTS

The unit was synthesized for an APEX device. It required 276 LC for a 16 bit implementation with the divider and the multiplier parameterised for 16 clock cycles.

The proposed architecture was tested using several randomly generated real-time systems. From the experiments it could be noted that the number of iterations, and consequently the time required to find the fixed point, depends on the magnitude of the different parameters of the real-time tasks. However, the number of iterations always remains below the theoretical complexity of schedulability equation (1), n²·max(T).

Of course, this complexity analysis was already performed by the authors of the iteration method implemented in the proposed architecture. However, the time required to find a solution using the proposed arithmetic architecture is measured in clock periods, whereas an algorithm implemented on a processor is measured in numbers of arithmetic instructions, each of which requires several periods of the system clock. Moreover, the scheduling analysis may be improved by choosing faster divider and multiplier units. The difference in time required between both measurements is several orders of magnitude, which can be the difference between the feasibility of implementing the schedulability analysis on-line or not.

6. TARGET APPLICATIONS

Several applications in real-time systems are based on an efficient implementation of a scheduling analysis technique. The scheduling analysis is used to decide what actions should be taken in the future. However, if the time required to produce a result is long enough to turn that future into the past, the analysis makes no sense. The lack of an efficient on-line scheduling analysis is the main reason why most of the different real-time mechanisms cannot be implemented during runtime. Some of the target applications for this architecture are:

Dual Scheduling: the execution time of a processor is shared among two or more schedulers running real-time and non-real-time tasks. When the deadline of a real-time task cannot be met, the task has to be assigned to another scheduler or the processor time assigned to its scheduler has to be increased.

Slack Stealing methods: utilise the idle processing time left by the real-time tasks to execute non-real-time tasks. To improve the response time of the non-real-time tasks, it is worth postponing the execution of the real-time tasks as much as possible.

Adaptive Scheduling: is used when the period of a real-time task can be modified in order to change the total utilization factor of the system. In this way, the utilization factor of each real-time task may be adapted to the current load of the system in order to make it schedulable.

Dynamic task assignment: allows real-time tasks to be assigned during runtime. The real-time system has to guarantee that schedulability will not be affected.

Fault Tolerance mechanisms: based on the execution of different tasks that produce the same results in order to compare them. The scheduler has to guarantee that there will be enough time to execute all the versions of the real-time tasks without jeopardising the schedulability of the real-time system.

Flexible real-time systems: are those in which the missed deadlines are bounded and restricted to a certain pattern. The schedulability analysis has to be done in order to guarantee that the temporal constraints of the real-time tasks will be satisfied.

Dynamic Voltage Scheduling: is a strategy that modifies the voltage/frequency of the processor in order to reduce power consumption. The techniques applied have to guarantee that the temporal constraints will be satisfied with the minimum possible power consumption.

These are some of the applications that become feasible if an efficient scheduling analysis is implemented. The proposed architecture improves by several orders of magnitude the performance of a scheduling analysis implemented as a software algorithm executed on the same processor that also runs the real-time tasks.

7. CONCLUSIONS

In this paper we presented a hardware architecture to solve a fixed-point function. This fixed-point function is an exact schedulability condition used to guarantee the schedulability of a real-time system.

Several applications were detailed; most of them were previously unsuitable for runtime implementation because of the complexity of solving the fixed-point schedulability function. The proposed hardware architecture is intended to be used in real-time systems at runtime.

In this paper we proposed an adequate data structure to share the real-time information with the system processor. This memory-based interface makes the scheduling analysis architecture adaptable to different processors without perturbing the execution of the real-time tasks.


HARDWARE IMPLEMENTATION OF THE MINKOWSKI METHOD FOR FRACTAL DIMENSION COMPUTATION

Maximiliam Luppe†

Departamento de Engenharia Elétrica / Escola de Engenharia de São Carlos

Universidade de São Paulo

Av. Trabalhador são-carlense, 400 – São Carlos – SP – Brazil – 13566-590

email: [email protected]

ABSTRACT

The fractal dimension is an extremely important tool for the characterization and analysis of shapes, with applications ranging from signal processing to optics. One of the reasons for such a strong interest is its power to correctly express the complexity and self-similarity of signals. Moreover, the fractal dimension can also be understood as an indication of the spatial coverage of a given shape. Several numerical approaches have been developed to compute the fractal dimension, such as the widely used Box Counting method. However, when applied to real data it does not yield results as good as the Minkowski method. The Minkowski method for computing the fractal dimension involves a series of dilations of the original shape with respect to several radii, which define the spatial scale. The area of the dilations is plotted against the radii in a log-log graph, and the fractal dimension is taken as 2 minus the slope of the fitted line. The dilations, normally obtained through morphological operations, can also be obtained by means of the Euclidean Distance Transform (EDT). The EDT computes the minimum distance between a background pixel and the shape. The dilations are obtained by thresholding the image generated by the EDT at every distance that can be represented in the image. This work presents a proposal for a dedicated hardware implementation, based on the EDT, of the Minkowski method for computing the fractal dimension, suitable for implementation in reconfigurable devices.

1. INTRODUCTION

The main characteristic of a fractal [1] is its dimension, which makes it possible to determine the degree of complexity of a line or the roughness of a surface or, according to Russ [2], the fractal dimension is the rate at which the perimeter (or surface area) of an object increases as the measurement scale is reduced. Applications of the fractal dimension are related to shape analysis, such as the analysis of neuron shapes [3], the study of infiltration in soils [4] and roughness analysis [5], among others.

According to Allen et al. [6], the analysis strategies for measuring the fractal dimension can be divided into two groups: vector-based methods and matrix-based methods. Among the vector-based methods is the Structured Walking method. Among the matrix-based methods are the Box Counting method and the Distance Map method, also known as the "Minkowski sausage" method. Besides these, other methods for computing the fractal dimension have been proposed [7] and evaluated [3], [4], [8]. Although the Minkowski sausage method is considered the most accurate and the least sensitive to noise and rotations, according to Bérubé and Jébrak [8] and Allen et al. [6], it has seen little use. Preference has been given to the Box Counting method, mainly because of its ease of implementation.

One way of implementing the Minkowski sausage method is through the use of the Distance Transform [9], in particular the Euclidean Distance Transform (EDT). In this work we use an architecture for computing the EDT in real time, proposed in [10], to compute the fractal dimension by the Minkowski sausage method.

Section 2 gives a brief description of the methodology for computing the fractal dimension based on the Minkowski sausage method using a Distance Map. Section 3 details the implementation of the architecture, and Section 4 presents the conclusions.

2. FRACTAL DIMENSION COMPUTATION

The Distance Map method for computing the fractal dimension is based on a process known as the "Minkowski sausage". In this process, each point belonging to the contour of the object is dilated, or covered, by circles of radius r (Figure 1), forming strips, or bands, better known as "sausages", whose area A(r) is proportional to r^(2-D) [8], where D is the fractal dimension. The fractal dimension obtained by this method is better known as the Minkowski-Bouligand dimension.

†Supported by FAPESP (2007/04657-3)

Figure 1 – Example of the Minkowski method

By thresholding a Distance Map (created by applying the EDT to the contour points of the object) at different grey levels, that is, at different values of r, we create bands similar to those obtained by the Minkowski sausage process. The fractal dimension can then be obtained from a graph of the logarithm of the area A(r) against the logarithm of the radius r, which yields a straight line whose slope equals 2 - D.

In this way, the fractal dimension is obtained by computing the slope of the line in a graph of the logarithm of the area A(r) against the logarithm of the radius r. One way of obtaining the area as a function of the radius is through a cumulative histogram of the image. For each grey level, i.e. each value of r, this histogram gives the number of pixels with values smaller than or equal to it. Thus, the module for computing the fractal dimension requires a structure to obtain the histogram of the distance map, a structure to compute logarithms and another to compute the slope. The area-versus-radius graph is obtained from the cumulative histogram, where for each radius r all the points of the distance map with radius smaller than or equal to r are counted.
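As a software reference for the computation just described, the following C sketch estimates the dimension from an integer (rounded) distance map using the cumulative histogram and a least-squares fit in the log-log domain. The maximum-radius bound, the rounding of distances to integers and the exclusion of r = 0 from the fit are illustrative choices, not part of the hardware design.

#include <math.h>
#include <stddef.h>

/* Estimate the Minkowski-Bouligand dimension from an integer distance map.
 * dist[i] holds the rounded EDT value of pixel i (0 for contour pixels).
 * max_r is the largest radius considered (assumed >= 2 and < 4096). */
double fractal_dimension(const unsigned *dist, size_t npixels, unsigned max_r)
{
    unsigned long hist[4096] = {0};           /* histogram of distances       */
    for (size_t i = 0; i < npixels; i++)
        if (dist[i] <= max_r)
            hist[dist[i]]++;

    double Sx = 0, Sy = 0, Sxx = 0, Sxy = 0;
    unsigned n = 0;
    unsigned long area = 0;                   /* cumulative histogram = A(r)  */
    for (unsigned r = 0; r <= max_r; r++) {
        area += hist[r];
        if (r == 0 || area == 0)
            continue;                         /* log undefined at r = 0       */
        double x = log((double)r);            /* log radius                   */
        double y = log((double)area);         /* log area of the band         */
        Sx += x; Sy += y; Sxx += x * x; Sxy += x * y; n++;
    }
    /* Least-squares slope of log A(r) versus log r. */
    double slope = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx);
    return 2.0 - slope;                       /* D = 2 - slope                */
}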

3. IMPLEMENTATION

The histogram of an image is obtained by counting the number of pixels having a given distance (or radius) value, for each distance value present in the distance map. There are two ways to implement this structure: with individual counters (one per distance value) or with memory. Using registers, although simpler, implies a large consumption of logic cells, whereas the memory-based implementation uses the memory blocks already available in the FPGA and only a few logic cells. The memory-based approach was therefore chosen.

To implement pixel counting using memory, a read-modify-write scheme is used. In this scheme, a memory position is addressed with the grey-level value of the pixel; the value stored at that position is read, incremented and written back to the same position. The best way to implement this scheme is with a dual-port synchronous memory and a technique known as Clock-2x: the clock signal used to access the memory is doubled, while the original clock signal is used to enable writing to the memory. The data available on one of the ports is incremented by an adder and the result of the sum is fed to the other port. Figure 2 shows the implementation scheme of this technique.

Figure 2 – Histogram module

During the histogram generation process, the pixel value is used as the address of the data to be read, incremented and stored. To unload the histogram, an external counter generates the addresses for reading the data and subsequently clearing the memory (writing the value 0). Multiplexers controlled by the 'sel' signal are used for this operation.

To obtain the cumulative histogram, it suffices to unload the memory contents into an accumulator circuit. This accumulator stores, and passes on, the sum of the previous result with the new data. Each value of the cumulative histogram represents the area A(r) covered by the pixels for each distance r representable in the image by the distance map.

Once the cumulative histogram has been computed, the logarithm of its data must be calculated to obtain the points of the log A(r) vs. log r graph and, from it, the slope of the line that best fits the data. To compute the logarithm of the area, the data must be converted from integer to floating-point format. This conversion is performed by the ALTFP_CONVERT megafunction of the Quartus II tool, which can convert integers (32 or 64 bits) to floating point (single or double precision) and vice versa. For this purpose, the components i2048log_i2f and i2048log_f2i were created for integer-to-floating-point and floating-point-to-integer conversion, respectively.

After the integer-to-floating-point conversion, the logarithm is computed (using the i2048log_log component, created from the ALTFP_LOG megafunction) and the result is multiplied by 2048 (using the i2048log_mul component, created from the ALTFP_MULT megafunction) before being converted back to integer. The multiplication by 2048 makes it possible to keep working with integers instead of floating point, which reduces logic consumption. Figure 3 shows the conversion-logarithm-multiplication circuit.

Figure 3 – Logarithm module

In the same way, the logarithms of the distance values r were computed, multiplied by 2048 and stored in a ROM (using the dimfrac_rom component, created from the ROM: 1-PORT megafunction) to be used, together with the values of log A(r), in the slope computation. Thus, after the histogram has been generated and stored in a RAM, the fractal dimension can be computed by sending the area and distance data to a slope-computation module. Both the RAM (with the histogram data) and the ROM (with the distance data) are accessed simultaneously by means of a counter that unloads the two memories. To synchronize the signals Yi (log A(r)) and Xi (log r), since the path traversed by the data coming from the histogram is longer, due to the processing applied to it, it was also necessary to delay the Xi signal using the dimfrac_shr component, created from the Shift register (RAM-based) megafunction.

The slope is computed using the Least Squares Method (MMQ). Considering a set of ordered pairs (xi, yi) that describe a line y = a + bx, the coefficients a and b can be found through the following operations:

a = (Σyi − b·Σxi) / n                                  (1a)

b = (n·Σxiyi − Σxi·Σyi) / (n·Σxi² − (Σxi)²)            (1b)

Since we are only interested in the slope b, given that D = 2 − (slope of the log A(r) vs. log r graph), the linear coefficient a was not computed. Since we work with a fixed set of distances (few distances are representable and they are limited by the image size), some parameters of the slope equation, besides n itself, are constants that depend only on the variable xi:

Sx = Σxi                                               (2a)

Kx = n·Σxi² − Sx²                                      (2b)

Thus, the equation for the slope becomes:

b = (n·Σxiyi − Sx·Σyi) / Kx                            (3)

This reduces the slope computation from four multiplications, two subtractions and one division to only two multiplications, one subtraction and one division. All operations are performed with integer numbers. Both the sum of yi and the sum of the products xi·yi are obtained in the same way as the cumulative histogram, by means of an accumulating component: an accumulator (ACC) for the sum of yi and a multiplier-accumulator (MAC) for the sum of the products xi·yi. The results of these summations are sent to the MMQ_cte component, shown in Figure 4.

Figure 4 – MMQ module
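A small integer model of this step is sketched below in C. The names Sx and Kx follow equations (2a) and (2b); the 64-bit widths and the function name are illustrative assumptions. The runtime sums Sy and Sxy correspond to the ACC and MAC outputs described above.

#include <stdint.h>

typedef struct {
    int64_t n;    /* number of (xi, yi) points                    */
    int64_t Sx;   /* precomputed sum of xi              (eq. 2a)  */
    int64_t Kx;   /* precomputed n*sum(xi^2) - Sx*Sx    (eq. 2b)  */
} mmq_cte_t;

/* Slope numerator n*Sxy - Sx*Sy: two multiplications and one subtraction,
 * followed by a single division, exactly as in eq. (3). Because xi and yi
 * carry the same 2048 scale factor, the factor cancels in the slope; for
 * 1 < D < 2 the quotient is therefore 0 and the fractional information is
 * left in the remainder, matching the Quocient/Remain signals of Figure 10. */
static void mmq_slope(const mmq_cte_t *c, int64_t Sy, int64_t Sxy,
                      int64_t *quotient, int64_t *remainder)
{
    int64_t num = c->n * Sxy - c->Sx * Sy;
    *quotient  = num / c->Kx;
    *remainder = num % c->Kx;
}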

Figure 5 presents the general scheme of the module for real-time fractal dimension computation. The width of the data buses is shown in blue and, in parentheses, the delay introduced by each module, in clock cycles.

Figure 5 – General scheme of the Fractal Dimension module


Figure 6 shows the general scheme of the fractal dimension module, now including the control system.

Figure 6 – Fractal Dimension module

Figure 7 shows two fractals, known as Koch curves, used as test cases. In a) and c) we have the original images, and in b) and d) the computed EDT. The fractal of Figure 7a) has a fractal dimension of 1.5000, and the one of Figure 7c), 1.4626.


Figure 7 – Examples of Distance Transform processing

The following figures show the simulation of the fractal-dimension architecture for the image of Figure 7a. Figure 8 shows the beginning of the fractal dimension computation, after the distance transform has been computed. The generation of the memory access address (signal counter) for reading the RAM_Yi and ROM_Xi data can be observed, as well as the generation of the cumulative histogram through the ACC_Yi signal.

Figure 9 shows the beginning of the signal generation for the MMQ computation. The signals Yi (after the logarithm computation) and Xi (after passing through the delay line) can be observed, together with the summations Sx and Sxy. Figure 10 shows the end of the processing, with the final result of the MMQ given by the Quocient and Remain signals. Since the expected fractal dimension for lines lies between 1.0000 and 2.0000, and the computation through the distance transform returns 2 − D, the final result of the division performed in the MMQ module has a quotient equal to zero, with 2 − D carried in the remainder.

Figure 8 – Start of processing – histogram readout

Figure 9 – Start of processing – generation of the summations

Figure 10 – End of processing – slope generation

Table 1 below shows the average values obtained from the processing of some fractals together with the corresponding theoretical values. Values lower than the theoretical ones were expected, since the fractal dimension values obtained by the Minkowski sausage method are lower than the theoretical values.

Curve                              Theoretical   Obtained
Line                               1.0000        0.974
Contour of the Gosper island       1.0686        1.040
Vicsek fractal                     1.4650        1.260
Quadratic von Koch curve type 1    1.4650        1.180
Peano curve                        2.0000        1.754

Table 1 – Results obtained

4. CONCLUSION

The modules for fractal dimension computation and for real-time processing were fully implemented and allow the visualization of both the distance transform and the fractal dimension value of the fractal. Some improvements are still needed, such as a binarization module for better generation of the data used in the distance transform computation (which would make it possible to use the architecture for texture analysis), and an edge-detection module, which would allow the architecture to be used to determine the roughness of object contours.

In the final results, it was possible to compute the distance transform with a radius of up to 35, which corresponds to 397 distinct distances. The whole system consumed 56,279 logic elements of an Altera Cyclone II 2C70 FPGA.

5. REFERENCES

[1] Mandelbrot, B. B., The Fractal Geometry of Nature, W. H. Freeman, New York, 1983.

[2] Russ, J. C., The Image Processing Handbook 2nd ed., CRC Press, New York, 1995.

[3] Jelinek, H. F., Fernandez, E., “Neurons and fractals: how reliable and useful are calculations of fractal dimensions?”, Journal of Neuroscience Methods, v. 81, pp. 9-18, 1998.

[4] Ogawa, S., Baveye, P., Boast, C. W., Parlange, J. Y., Steenhuis, T., “Surface fractal characteristics of preferential flow pattern in field soils: evaluation and effect of image processing”, Geoderma, v. 88, pp. 109-136, 1999.

[5] Hyslip, J. P., Vallejo, L. E., “Fractal analysis of the roughness and size distribution of granular materials”, Engineering Geology, v. 48, pp. 231-244, 1997.

[6] Allen, M., et al., “Measurement of boundary fractal dimensions: review of current techniques”, Powder Technology, v. 84, n. 1, pp. 1-14, 1995.

[7] Asvestas, P. et al, “Estimation of fractal dimension of images using a fixed mass approach”, Pattern Recognition Letters, v. 20, pp. 347-354, 1999.

[8] Bérubé, D., Jébrak, M., “High precision boundary fractal analysis for shape characterization”, Computers & Geosciences, v. 25, pp. 1059-1071, 1999.

[9] Rosenfeld, A.; Pfaltz, J. L., “Distance Functions on Digital Pictures”, Pattern Recognition, v.1, pp. 33-61, 1968

[10] Luppe, M., Colombini, A. C., Roda, V. O., “Arquitetura para Transformada de Distância e sua Aplicação para o Cálculo da Dimensão Fractal”, Terceira Jornadas de Engenharia de Electrónica e Telecomunicações e de Computadores JETC’05, Lisboa, Portugal, November 17-18, 2005.


AN ENTRY-LEVEL PLATFORM FOR TEACHING HIGH-PERFORMANCE RECONFIGURABLE COMPUTING

Pablo Viana, Dario Soares, Lucas Torquato

LCCV - Campus Arapiraca, Federal University of Alagoas

ABSTRACT

Among the primary difficulties in integrating digital design prototypes into larger computing systems are the issues of hardware and software interfaces. Complete operating systems and their high-level software applications and utilities differ considerably from the low-level perspective of hardware implementations on reconfigurable platforms. Although hardware and software development tools increasingly make use of similar and integrated environments, there is still a considerable gap between programming languages running on regular high-end computers and the wire-up code needed to configure a hardware platform. Such a contrast makes digital design too hard to integrate into software running on regular computers. Additional issues include programming skills at different abstraction levels, costly platforms for reconfigurable computing, and the long learning curve for using special devices and design tools. Hence, coming up with an innovative high-performance reconfigurable solution, despite its attractiveness, becomes a difficult task for students and non-hardware engineers. We therefore propose a low-cost platform for attaching an FPGA device to a personal computer, enabling its user to easily learn to develop integrated hardware/software designs to accelerate algorithms for high-performance reconfigurable computing.

1. INTRODUCTION

The recent discovery of huge oil and gas volumes in the pre-salt reservoirs of Brazil's Santos and Carioca basins has fueled concern about the new challenges and dangers involved in off-shore exploration. During the last two decades, the high cost and risk involved in the activity have pushed the research community to come up with innovative solutions for building and simulating virtual prototypes of structures, under the most realistic conditions, to computationally evaluate the performance of anchors, risers (oil pipes) and underwater wells before they face the open sea.

Nowadays, the highly detailed numerical models that help engineers develop and improve new techniques for oil and gas exploration demand considerable computing throughput, pushing research labs in this field to invest in state-of-the-art solutions for high-performance computing. The term High-Performance Computing (HPC) is usually associated with scientific research and engineering applications, such as discrete numerical models and computational fluid dynamics. HPC refers to the use of parallel supercomputers and computer clusters, that is, computing systems comprising multiple processors linked together in a single system with commercially available interconnects (Gigabit Ethernet, Infiniband, etc.). While a high level of technical skill is undeniably needed to assemble and use such systems, they have to be operated and programmed daily by non-computer engineers and students, who are intrinsically involved in their specific research fields.

Specialist engineers in offshore oil and gas production have developed over the last 15-20 years the programming skills to implement their own programs and build their prototypes using state-of-the-art programming techniques and following up-to-date rules of software engineering. They had to learn about design patterns, multi-threaded programming and good documentation practices, and started to develop many other skills for adopting open-source trends in modern cluster programming. In order to exploit the processing resources even more effectively, HPC engineers are also expected to be able to take advantage of the parallelism of graphical processing units (GPUs), which boost the processing power of most supercomputers in the Top 500 List [3].

Researchers in HPC seem to be aware of the need to continuously pursue new alternatives to overcome some of the main computing challenges. Power consumption, temperature control and the space occupied by large supercomputers are the main concerns for next-generation systems [2]. In this context, the promise of reconfigurable computing seems well matched to the demand for innovative solutions for high-performance computing. Beyond the basic advantages in size and energy consumption of popular reconfigurable platforms, such as Field Programmable Gate Arrays (FPGAs), it is expected that reconfigurable computing can deliver unprecedented performance increases compared to current approaches based on commercial, off-the-shelf (COTS) processors, because FPGAs are essentially parallel and might enable engineers to freely construct, modify, and propose new computer architectures. It is expected that the parallel programming paradigm shall become much more than just parallel threads running on multiple regular processors. Instead, we may expect that innovative designs will also involve the development of Processing Units (PUs) specifically designed to deal with parts of an algorithm, in order to reach the highest performance.

Hardware specialists from computer engineering schools have been extensively trained to develop high-performance designs on state-of-the-art reconfigurable devices, such as encoders, filters, converters, etc. But, paradoxically, most of the people interested in innovative high-performance computing are not necessarily hardware engineers. On the contrary, these users are non-computer engineers and other scientists with real demands for high-performance computing. These professionals know their HPC needs deeply and would probably be empowered if they could wire up innovative solutions by themselves from their own desktops. This class of users needs to be capable of rapidly building and evaluating prototypes even without the intrusive interference of a hardware designer.

Since there is a considerable learning curve to master hardware design techniques and tools, it is straightforward to point out the problem of training people from other knowledge areas in specific hardware design skills in a short time. In order to tackle the problem, we involved Computer Science undergraduate students in assisting non-computer engineers in improving the performance of a given existing system, by mixing the legacy software code with hardware prototypes of Intellectual Property cores (IP cores). We then proposed to develop modules specifically designed to be easily attached to a regular desktop machine, through a common USB (Universal Serial Bus) interface. Such an approach enables non-computer engineers to experience the benefits of reconfigurable computing in their native code, by inserting calls to the remote procedures implemented in the FPGA device across the USB interface. On the other hand, the proposed platform allows computer science and engineering students to develop reconfigurable computing solutions for real-world problems. As a result, we propose an integrated platform for introducing students and engineers to high-performance reconfigurable computing.

This paper is organized as follows. Section 2 discusses the hardware issues that motivated us to propose a simplified platform for teaching reconfigurable computing, by defining templates and a protocol to integrate logic designs into a general-purpose computer system. Section 3 illustrates the use of the reconfigurable computing platform in engineering applications, and finally in Section 4 we discuss the achieved results, future improvements to the integrated platform and the next applications in high-performance computing.

2. LOW-COST VERSATILE RECONFIGURABLE COMPUTING PLATFORM

2.1. Hardware Issues

FPGA-based reconfigurable computing platforms, from distinct logic manufacturers and third parties, are widely available on the market. Most of them support a varied number of interfaces to connect the board to other external devices, such as network, VGA monitors and PS/2 keyboards, as well as high-performance interconnect standards such as PCI Express, Gigabit Ethernet, etc.

Basically, state-of-the-art platforms offer high-density programmable logic devices with millions of equivalent gates and support high-speed interconnects, among other facilities, giving the user of such platforms all the versatility needed to develop complex designs such as video processors, transceivers, and many other relevant projects. These platforms are suitable for small or complex prototypes, and their prices range from $500 to $5000, not including all the necessary software design tools. Although this category of platforms offers attractive support for a great variety of experiments and prototype designs, the total cost of acquiring a number of boards makes their adoption in classes prohibitive. On the other hand, there are low-cost platforms on the market, equipped with medium-density devices (around 500k equivalent gates) and priced under $200, which most schools and training centers can afford. There are, however, restrictions on the interface support offered by these platforms that may restrict their use to stand-alone devices.

As a participant in the Xilinx University Program (XUP), our platform (Figure 1) is based on the low-cost donated Xilinx Spartan-3E FPGA board, available at our laboratory for teaching purposes. Although this specific board contains several devices and connectors around the FPGA chip that allow students to experiment with projects integrated with network, video (VGA), keyboard, serial RS-232 and some other interfaces, the USB connector present on the board can only be used for programming the logic devices (FPGA and CPLD).

Initially, we tried to access the logic resources on the board from a personal computer over the network interface. The board has an RJ-45 Ethernet connector as well as a physical-layer chip, but the user needs to implement, inside the FPGA, the Data Link layer in order to provide a MAC (Media Access Control) to the network, besides the Network Layer that implements the basic communication functionality across the network. This first attempt rapidly turned out to be a hard solution to implement, since most of our students were not yet familiar with digital design.

The second alternative tried to take advantage of the 100-pin Hirose connector available on the board to implement a wide interface to an external device.


Fig. 1. FPGA platform: Xilinx Spartan-3E Starter Kit (USB, RJ-45, Hirose 100-pin and 6-pin Pmod connectors, plus VGA, PS2, RS-232 and other connectors around the FPGA)

The external device should have a friendly interface to a personal computer through a standard USB 2.0 interface. Due to the specifics of the 100-pin connector, this hard-to-find adapter would demand a hard-to-wire device with one hundred pins to connect. Again, in order to keep our approach as simple as possible, we decided to adopt the three 6-pin Pmod connectors instead. The 6-pin connectors can be found on the local market and are easy to connect to a few ports of a small microcontroller with USB capabilities. Two out of the six pins are dedicated to power (+5V and GND) and the other four pins can be used for general purposes. Thus, only 12 pins are actually available for interconnecting the board through the Pmod connectors.

We then proposed a 4-bit duplex interface based on a Microchip PIC18F4550 microcontroller, capable of transferring bytes to and from the FPGA board by dividing each byte word (8 bits) into two 4-bit nibbles: four pins for writing to the FPGA, another four pins to read from the FPGA and another four pins to control the write/read operation. Since the whole procedure of the microcontroller to read the USB port, send to the FPGA, read from the board and finally send back over the USB takes just a little less than 1 microsecond, transfers between the FPGA and the microcontroller can reach the maximum theoretical throughput of 1 MByte/s.

The microcontroller board used as interface was rapidly built because we used a pre-defined platform, namely the USB Interface Stargate [1]. The Stargate platform is intended to help designers propose new HID (Human-Interface Device) products. The board offers analog and digital input/output pins, and can easily communicate with a personal computer over USB 2.0 using the native device drivers of the most popular operating systems. We simply made some minor adjustments to the Stargate's firmware to implement the defined protocol to communicate with the FPGA through the 12-pin interface. In order to directly connect the Stargate to the FPGA board,

Fig. 2. Scorpion Interface Board: modified USB Interface Stargate interconnecting the USB (2.0 Type-B connector) to the FPGA (2 x 6-pin Pmod, 4-pin protocol control, OutPort and InPort)

some minor changes to its layout had to be made, eliminating the analog pins and placing the input and output pins on the same corner. The updated firmware and the proposed layout matching the physical requirements of this project motivated a new name for this specific platform: the Scorpion Interface board (Figure 2).

2.2. Defining a Communication Protocol

We proposed a simple protocol to enable communication between the FPGA board and the USB interface built with the microcontroller (PIC18F4550). The PIC firmware is intended to send and receive bytes to and from the FPGA board, according to an established protocol defined exclusively for the purposes of this integration.

Basically, the firmware is a loop waiting for data coming from the USB interface with the PC. As soon as a byte arrives, the word is split into its low-half and high-half nibbles (Figure 3). The less significant portion is made available at the OutPort to the FPGA and the Write-enable flag bit goes high. The data stays available until the FPGA's Write-done flag is raised. Then, the Scorpion interface board makes the second portion of bits available and resets Write-enable. This half-byte is read by the FPGA board, which clears Write-done, ending the USB-to-FPGA write operation.

If there is any data made available by the FPGA, the Read-enable flag will be set to 1. Then, Scorpion reads the lower bits into low_half and sets Read-done high. The FPGA then makes the higher portion of bits available at the InPort and resets Read-enable. The microcontroller on the Scorpion reads the data and concatenates both parts to recover the whole byte before sending it over the USB to the PC.

2.3. Integration Wrapper

In order to help the logic designer focus on the design of the processing unit itself, we proposed a parameterizable wrapper template, which hides the communication protocol between the FPGA and the Scorpion boards.


While(true){
    Wait (until receive from USB);
    OutPort = low_half;
    Write_enable = 1;
    Wait (until Write_done == 1)
    OutPort = high_half;
    Write_enable = 0;
    Wait (until Write_done == 0)
    if(Read_enable == 1){
        low_half = Read(InPort);
        Read_done = 1;
        Wait (until Read_enable == 0)
        high_half = Read(InPort);
        Read_done = 0;
    }
    SendToUSB(high_half+low_half);
}

Fig. 3. Protocol defined in the Scorpion interface firmware

Fig. 4. Wrapper template around the PU: input operand registers (Op InReg1..N), output register (Op OutReg) and Protocol Control Unit, mapped to the FPGA pads connected to the Pmod pins

The data input coming from the Scorpion board is shifted along the input operand registers (Op InReg) to become readily available to the Processing Unit (PU). The PUs must be defined as combinational logic, that is, the output data becomes available once all the inputs have been loaded into the input registers. The output register keeps the data available as long as the input data has not changed (Figure 4).

This restriction simplifies the rapid development of simple Processing Units and their integration into the platform. The integrated PU implemented on the FPGA becomes available to the end user, who can immediately evaluate the hardware/software implementation and explore innovative design options.

3. INTEGRATING RECONFIGURABLE LOGIC INTO A USER APPLICATION

On the proposed platform, integrating traditional development on a general-purpose computer with library modules implemented in reconfigurable hardware (FPGAs) is essentially a matter of including the libhid library in the C/C++ code and making use of the Application Program Interface (API) designed to send and receive bytes over the USB port (Figure 5).

sendFPGA(buffer, size);
size = receiveFPGA(buffer);

Fig. 5. Basic functions to send and receive data to and from the FPGA

The API enables the use of the proposed platform, offering an abstraction layer that hides from the user the enumeration steps of the Scorpion interface as a device connected to a USB port.

Our illustrative example and first exercise for getting started on the proposed platform requires the student to design an adder/subtractor in the FPGA. After writing the HDL code and synthesizing the project into the FPGA, the user must send 3 bytes over the interface to the FPGA: the operation value (ADD=0x00 or SUB=0x01), the first operand and the second operand. Such an operation is carried out by enclosing all the values in a buffer and sending it to the FPGA. Next, the size returned by receiveFPGA determines when the result is available in the buffer.
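A host-side sketch of that exercise in C, using the two API calls of Figure 5, could look as follows. The opcodes come from the text; the exact argument types, the polling loop (which assumes receiveFPGA returns 0 until data is available) and the error handling are illustrative assumptions.

#include <stdio.h>
#include <stdint.h>

/* API provided by the platform (see Figure 5); exact types assumed here. */
extern void sendFPGA(const uint8_t *buffer, int size);
extern int  receiveFPGA(uint8_t *buffer);

#define OP_ADD 0x00
#define OP_SUB 0x01

int main(void)
{
    uint8_t request[3] = { OP_ADD, 40, 2 };   /* opcode, operand A, operand B */
    uint8_t result[8];

    sendFPGA(request, sizeof request);        /* 3 bytes travel over USB/Pmod */

    int size = 0;
    while (size == 0)                         /* poll until the PU answers    */
        size = receiveFPGA(result);

    printf("FPGA returned %u (%d byte(s))\n", result[0], size);
    return 0;
}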

4. CONCLUSION

Computer Science students were able to design functional modules of processing units in VHDL, synthesize their codes and configure the FPGA device using the tools from the University Program donation in the lab classes. The proposed processing units typically included functional modules such as arithmetic operators and statistical estimators. At present, collaborators from a partner research laboratory involved in non-computer engineering projects, such as oil and gas production, can make use of the proposed modules by attaching the proposed platform with a pre-configured FPGA containing the functional implementation of a given operator, and comparing the results obtained from the hardware implementations. Although performance issues are not the major contribution of this paper, thanks to the easy-to-use proposed platform, students and engineers are increasingly considering reconfigurable computing in their academic and research projects. Our next steps include the use of high-end platforms with high-density FPGAs and high-speed interconnects to propose alternative solvers for existing scientific computing libraries.

5. REFERENCES

[1] USB Interface: http://www.usbinterface.com.br

[2] Experimental Green IT Initiative Launches on Recycled HPC System, Scientific Computing, 2009.

[3] TOP500 Supercomputing list, available at: http://www.top500.org/


DERIVATION OF PBKDF2 KEYS USING FPGA

Sol Pedre, Andres Stoliar and Patricia Borensztejn

Depto. de Computacion, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires
email: spedre, astoliar, [email protected]

ABSTRACT

In this paper we analyze the key derivation algorithm used to start a WPA-PSK (Wi-Fi Protected Access with Pre-Shared Key) session in order to perform a brute-force attack and thus obtain the session key. We analyze its computational cost and propose improvements to the algorithm that reduce its complexity by nearly half. We also analyze which section of the algorithm would be fruitful to implement in an FPGA, and we implement it. Finally, we compare the performance of our solution on several FPGAs and against optimized implementations for current CPUs. We show that our solution is a good engineering solution in terms of cost per key processed per second when compared with current CPU performance. We also present ideas for a design that we hope could improve the performance in keys processed per second.

1. INTRODUCTION

Communication network security is a widely studied field in constant advance. As new security protocols are developed, new methods to breach that security are developed as well. A common, but computationally costly, attack is the brute-force attack on the key derivation algorithms used in those protocols. Given a word dictionary and the resulting derived key, this attack consists of running the derivation algorithm on every word in the dictionary to verify its correspondence with the derived key. To reduce the time required for this attack, at least two steps are taken. On the one hand, the algorithm is studied to find a computationally cheaper one that produces the same result. On the other hand, efficient implementations are built for the hardware at hand. In our case, we use FPGAs.

In this paper, we analyze the WPA-PSK (Wi-Fi Protected Access with Pre-Shared Key) protocol in order to perform a brute-force attack. We exploit a weakness in the application of the key derivation function PBKDF2 in this protocol that allows a significant reduction in the algorithm's complexity, and we implement part of the algorithm in an FPGA.

The rest of this paper is organized as follows: in Section 2 we explain the WPA-PSK protocol and the algorithms needed to derive the session keys. In Section 3 we calculate the cost of the algorithm and propose improvements. In Sections 4 and 5 we explain what was implemented in the FPGA and why, and describe this implementation. Finally, in Section 6 we present results on different FPGAs and a CPU, and in Section 7 we draw some conclusions and present ideas for future work in order to enhance the obtained performance.

2. WPA-PSK

WPA (Wi-Fi Protected Access) is a protocol created by the Wi-Fi Alliance to enhance the security of wireless networks within the 802.11 standard.

Its PSK (Pre-Shared Key) operation mode was designed to provide authentication in small networks, as an alternative to the installation of an authentication server. It simply assumes that every node knows a secret passphrase of 8 to 63 printable ASCII characters (as required by the IEEE 802.11i-2004 standard). The authentication key of the network is derived from the secret passphrase and the public network SSID (Service Set Identifier) using the PBKDF2 (Password-Based Key Derivation Function) [1] key derivation function.

The parameters of the PBKDF2 function are a password P and a salt S (both byte arrays), an integer C that defines the number of recursive applications of its underlying pseudo-random function, and an integer dkLen that indicates the length of DK, the key that will be derived. Usually, the output of the underlying pseudo-random function is shorter than dkLen. In those cases, DK is defined as the concatenation of partial applications of that underlying pseudo-random function over I different blocks. Each of these applications has the same inputs, but the salt S is redefined as (S||I) in each block.

PBKDF2 is configured in WPA-PSK to use as the underlying pseudo-random function the keyed hash function HMAC-SHA1 (Keyed-Hashing for Message Authentication with Secure Hash Algorithm 1) [2]. The passphrase is used as the password P and the SSID as the salt S. The number of recursions C is fixed at 4096 and dkLen at 256 bits, to derive a key DK of that length. As the output of HMAC-SHA1 is 160 bits long, the number of blocks I is set to 2 (160 + 160 = 320 > 256, which is dkLen). In this manner, PBKDF2 in WPA-PSK is defined as:


DK = T_1 || T_2

T_1 = U_1_1 xor U_2_1 xor ... xor U_4096_1
T_2 = U_1_2 xor U_2_2 xor ... xor U_4096_2

U_1_I    = HMAC-SHA1( P, S || I)
U_2_I    = HMAC-SHA1( P, U_1_I)
...
U_4096_I = HMAC-SHA1( P, U_4095_I)

The previous equations show the key derivation from the private passphrase and the public SSID in wireless networks with WPA-PSK authentication, and constitute the complete algorithm that must be run, given a word from the dictionary and the SSID, in order to perform the brute-force attack.
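Written out in C, the same derivation looks roughly as follows, treating HMAC-SHA1 as a black box. The hmac_sha1 helper and the buffer sizes are illustrative assumptions; the 4096 iterations, the two 160-bit blocks and the (S||I) salt follow the definition above, with the block index encoded as a 32-bit big-endian integer as in the PBKDF2 specification.

#include <stdint.h>
#include <string.h>

/* Assumed helper: HMAC-SHA1(key, msg) producing a 20-byte digest. */
void hmac_sha1(const uint8_t *key, size_t keylen,
               const uint8_t *msg, size_t msglen, uint8_t out[20]);

/* Derive the 256-bit WPA-PSK key DK = T_1 || T_2 (only 32 bytes kept). */
void wpa_psk(const uint8_t *pass, size_t plen,
             const uint8_t *ssid, size_t slen, uint8_t dk[32])
{
    uint8_t t[2][20];
    for (int blk = 1; blk <= 2; blk++) {
        uint8_t u[20], salt[32 + 4];          /* SSID is 1..32 bytes long     */
        memcpy(salt, ssid, slen);
        salt[slen]     = 0; salt[slen + 1] = 0;
        salt[slen + 2] = 0; salt[slen + 3] = (uint8_t)blk;  /* (S || I)       */

        hmac_sha1(pass, plen, salt, slen + 4, u);           /* U_1_I          */
        memcpy(t[blk - 1], u, 20);
        for (int i = 2; i <= 4096; i++) {                   /* U_2_I..U_4096_I*/
            uint8_t next[20];
            hmac_sha1(pass, plen, u, 20, next);
            memcpy(u, next, 20);
            for (int k = 0; k < 20; k++)
                t[blk - 1][k] ^= u[k];                      /* T_I ^= U_i_I   */
        }
    }
    memcpy(dk,      t[0], 20);   /* T_1                                      */
    memcpy(dk + 20, t[1], 12);   /* first 96 bits of T_2: 256 bits in total  */
}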

2.1. HMAC-SHA1

HMAC (Keyed-Hashing for Message Authentication) is a mechanism to verify the authenticity of a message using a cryptographic hash function [2] [3]. This function may be SHA1 (Secure Hash Algorithm 1) [4] [5], defining the HMAC-SHA1 variant, which is given by:

SHA1(K xor opad || SHA1(K xor ipad || text))

where ipad is the byte 36h repeated 64 times and opad is the byte 5Ch repeated 64 times (i.e., both are 512 bits long). K is the key used to authenticate the message; it should be 64 bytes (512 bits) long, and if it is shorter it is extended with zeros. Finally, text is the text whose authenticity is being verified (of any length).
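As a C sketch, the construction above can be written as follows, assuming a one-shot sha1() helper and a text no longer than 64 bytes (which covers the WPA-PSK use, where text is at most 36 bytes); the buffer bounds are illustrative.

#include <stdint.h>
#include <string.h>

/* Assumed helper: SHA1 over msg, producing a 20-byte digest. */
void sha1(const uint8_t *msg, size_t len, uint8_t out[20]);

/* HMAC-SHA1(K, text) = SHA1(K xor opad || SHA1(K xor ipad || text)) */
void hmac_sha1(const uint8_t *key, size_t keylen,
               const uint8_t *text, size_t textlen, uint8_t out[20])
{
    uint8_t ipad_block[64 + 64];   /* K xor ipad || text (text <= 64 bytes) */
    uint8_t opad_block[64 + 20];   /* K xor opad || inner digest            */
    uint8_t k[64] = {0};           /* key zero-extended to 64 bytes         */
    memcpy(k, key, keylen);        /* keylen assumed <= 64 (8..63 chars)    */

    for (int i = 0; i < 64; i++) {
        ipad_block[i] = k[i] ^ 0x36;
        opad_block[i] = k[i] ^ 0x5C;
    }
    memcpy(ipad_block + 64, text, textlen);
    sha1(ipad_block, 64 + textlen, opad_block + 64);  /* inner SHA1 */
    sha1(opad_block, 64 + 20, out);                   /* outer SHA1 */
}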

2.2. HMAC-SHA1 in WPA-PSK

In the application of HMAC-SHA1 in WPA-PSK, K is the passphrase P. The message text is the SSID S concatenated with the block number I (1 or 2) in the base case of the recursion. In the subsequent recursive cases, text is the result of the previous execution. The application is thus defined by the following equations:

U_1_I    = SHA1( P xor opad || SHA1( P xor ipad || (S || I)))
U_2_I    = SHA1( P xor opad || SHA1( P xor ipad || U_1_I))
...
U_4096_I = SHA1( P xor opad || SHA1( P xor ipad || U_4095_I))

2.3. SHA1

SHA1 is a cryptographic hash function that takes an input of any length, divides and processes it in 512-bit chunks, and generates a 160-bit hash [4] [5]. Figure 1 shows one iteration of the main loop of SHA1. This iteration repeats 80 times for each 512-bit chunk. A, B, C, D and E are 32-bit words of the state, F is a non-linear function that varies according to the iteration number t, W is an 80-word array constructed from the original message, and K is a constant that depends on the iteration t. The boxed sum represents addition modulo 2^32 and <<<n is an n-bit left rotation.

Fig. 1. An iteration of the main loop of SHA1

As SHA1 is the core of the FPGA implementation, we show its pseudo-code:

1 Initialize variables h0, h1, h2, h3, h4
2 Extend the message to a multiple of 512 bits and divide it into 512-bit chunks
3 For each chunk
  - Divide it into 16 32-bit words W[t] (0 <= t <= 15)
  - Extend those 16 words to 80:
    for (t = 16; t <= 79; t++)
      W[t] = (W[t-3] xor W[t-8] xor W[t-14] xor W[t-16]) rol 1
  - Initialize the hash for this chunk
    A = h0; B = h1; C = h2; D = h3; E = h4
  - Main loop
    for (t = 0; t <= 79; t++)
      if 0 <= t <= 19 then
        F = (B and C) or ((not B) and D)
        K = 0x5A827999
      else if 20 <= t <= 39
        F = B xor C xor D
        K = 0x6ED9EBA1
      else if 40 <= t <= 59
        F = (B and C) or (B and D) or (C and D)
        K = 0x8F1BBCDC
      else if 60 <= t <= 79
        F = B xor C xor D
        K = 0xCA62C1D6
      TEMP = (A rol 5) + F + E + K + W[t]
      E = D; D = C; C = B rol 30; B = A; A = TEMP
  - Add this chunk's hash to the total
    h0 += A; h1 += B; h2 += C; h3 += D; h4 += E
4 Produce the final hash
  hash = h0 || h1 || h2 || h3 || h4

SHA1 has no configuration parameters. Its application in WPA-PSK is no different from its execution in any other context.


3. ALGORITHM’S COST REDUCTIONS

In this section we describe the improvements made to the algorithm prior to its hardware implementation.

3.1. Metrics

In order to quantify the improvements, we need a metric for the computational cost. The primitive functions involved in deriving the key with the PBKDF2 function are three: concatenation (||), logical xor and SHA1. Of these, the most costly is SHA1, and we will use it as the metric, in the form "number of 512-bit chunks it processes".

3.2. Cost

We first analyze the cost of one HMAC-SHA1 iteration, which contains two applications of SHA1 (see Section 2.2). In the inner application, SHA1 is applied to (P xor ipad || (S||I)) in the base case. About this application we can state:

• (S||I) is of variable length and shorter than 512 bits: the length of the SSID S is between 1 and 32 bytes and the block number I is always expressed as a 32-bit integer, so the length of (S||I) is between 40 and 288 bits.

• the length of (P xor ipad) is 512 bits by definition of ipad in HMAC (see 2.1).

• Therefore, (P xor ipad || (S||I)) is always shorter than 1024 bits and longer than 512 bits, which means that in this application SHA1 runs over two 512-bit chunks.

The same analysis holds for the inner applications in the recursive cases, (P xor ipad || U_n_I), because U_n_I is 160 bits long (it is the output of a previous SHA1) and therefore the whole chain is 512 + 160 = 672 bits long. This means that in all the inner applications, SHA1 runs over two 512-bit chunks.

In the outer application of SHA1 in the algorithm, since SHA1(x) is always 160 bits long, the length of (P xor opad || SHA1(x)) is always 672 bits. Therefore, in SHA1(P xor opad || SHA1(x)) the function SHA1 also always runs over two 512-bit chunks.

In conclusion, in each step of the calculation described in Section 2.2 there are 4 applications of SHA1 over 512-bit chunks. As there are 4096 recursions of HMAC-SHA1 for each of the two blocks, we have a total cost of 4096 × 4 × 2 = 32768 SHA1 runs over 512-bit chunks.

3.3. Reduction

The chain (P xor ipad) is the first 512-bit chunk that SHA1 processes in the inner call of HMAC-SHA1. As P and ipad are constant during the whole process, (P xor ipad) remains constant during the 4096 inner calls of SHA1 in HMAC-SHA1. Therefore, it may be preprocessed to create an intermediate result of 160 bits (a state of SHA1 between the 512-bit chunks), which becomes a parameter of a new SHA1p function that continues the execution of SHA1 with the following 512-bit chunk.

Exactly the same happens with the outer call of SHA1 in HMAC-SHA1: the chain (P xor opad) is also 512 bits long, constant across all the intermediate applications of SHA1, and may be pre-calculated.

In this manner, the four applications of SHA1 over 512-bit chunks needed to perform one step of the algorithm described in Section 2.2 are reduced to only two applications. There are two additional SHA1 runs to pre-calculate SHA1(P xor ipad) and SHA1(P xor opad). The new total is then 4096 × 2 × 2 + 2 = 16386 runs, reducing the complexity to nearly half of the original 32768 SHA1 runs.
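A software sketch of this reduction is given below, assuming a sha1_compress() primitive that applies the 80-round compression of one 512-bit chunk to a running 160-bit state (the function name and the fixed 64-byte passphrase buffer are illustrative assumptions). It covers the recursive case only, where the second chunk carries a 160-bit message plus SHA1 padding for a 672-bit total length.

#include <stdint.h>
#include <string.h>

/* Assumed primitive: run the 80 SHA1 rounds of one 512-bit chunk,
 * updating the 160-bit running state in place. */
void sha1_compress(uint32_t state[5], const uint8_t chunk[64]);

static const uint32_t SHA1_IV[5] = {
    0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0
};

/* Precompute the two constant first chunks once per passphrase
 * (pass is assumed zero-padded to 64 bytes). */
void precompute_pads(const uint8_t pass[64], uint32_t st_i[5], uint32_t st_o[5])
{
    uint8_t chunk[64];
    memcpy(st_i, SHA1_IV, sizeof SHA1_IV);
    memcpy(st_o, SHA1_IV, sizeof SHA1_IV);
    for (int i = 0; i < 64; i++) chunk[i] = pass[i] ^ 0x36;  /* P xor ipad */
    sha1_compress(st_i, chunk);
    for (int i = 0; i < 64; i++) chunk[i] = pass[i] ^ 0x5C;  /* P xor opad */
    sha1_compress(st_o, chunk);
}

/* One HMAC-SHA1 step over a 160-bit message now costs two compressions:
 * the message plus SHA1 padding fits in a single 512-bit chunk, and the
 * total length of each hashed chain is 64 + 20 bytes = 672 bits = 0x2A0. */
void hmac_sha1_fast(const uint32_t st_i[5], const uint32_t st_o[5],
                    const uint8_t msg[20], uint8_t out[20])
{
    uint8_t chunk[64] = {0}, inner[20];
    uint32_t st[5];

    /* Inner SHA1: continue from the (P xor ipad) state. */
    memcpy(st, st_i, sizeof st);
    memcpy(chunk, msg, 20);
    chunk[20] = 0x80;                     /* padding: single 1 bit, zeros...  */
    chunk[62] = 0x02; chunk[63] = 0xA0;   /* ...and 64-bit length = 672 bits  */
    sha1_compress(st, chunk);
    for (int i = 0; i < 5; i++) {         /* serialize the inner digest       */
        inner[4*i]   = (uint8_t)(st[i] >> 24);
        inner[4*i+1] = (uint8_t)(st[i] >> 16);
        inner[4*i+2] = (uint8_t)(st[i] >> 8);
        inner[4*i+3] = (uint8_t)(st[i]);
    }

    /* Outer SHA1: continue from the (P xor opad) state, same layout. */
    memset(chunk, 0, sizeof chunk);
    memcpy(chunk, inner, 20);
    chunk[20] = 0x80;
    chunk[62] = 0x02; chunk[63] = 0xA0;
    memcpy(st, st_o, sizeof st);
    sha1_compress(st, chunk);
    for (int i = 0; i < 5; i++) {         /* final 160-bit digest             */
        out[4*i]   = (uint8_t)(st[i] >> 24);
        out[4*i+1] = (uint8_t)(st[i] >> 16);
        out[4*i+2] = (uint8_t)(st[i] >> 8);
        out[4*i+3] = (uint8_t)(st[i]);
    }
}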

4. HARDWARE MIGRATION

With the intention of replacing the CPU processing, we designed an electronic circuit to be instantiated in a reconfigurable hardware device such as an FPGA.

The first step of this design consists of selecting exactly which part of the original algorithm will be implemented, given the finite resources of the FPGA and taking into account the complexity of its programming and verification.

As already analyzed, the principal component of the code is the SHA1 algorithm. We showed in Section 3.3 that 16384 times out of the 16384 + 2 runs we execute a partial implementation of SHA1 that takes as input 160 bits of preprocessed state (that is, SHA1(P xor ipad) or SHA1(P xor opad)) and another new 160 bits that must be processed. Therefore, we chose this part of the problem to be implemented in hardware.

Another important consideration is the bandwidth necessary to transfer the partial results between the CPU and the circuit instantiated in the FPGA. In our case we use a 100 Mb/s Ethernet connection. If we chose to implement only the partial SHA1, any gain in processing time that the circuit could provide would be too small compared to the time needed to transmit the 160 + 160 bits required for each execution.

As PBKDF2 executes 4095 recursive calls in each block, if that control is also implemented in the FPGA the traffic is considerably reduced:

1. CPU → FPGA:

(a) preprocessed SHA1(P xor ipad) (160 bits)

(b) preprocessed SHA1(P xor opad) (160 bits)


(c) result of the base case of the first block: U_1_1 (160 bits)

(d) result of the base case of the second block: U_1_2 (160 bits)

(e) Total: 160 × 4 = 640 bits for the whole process execution.

2. FPGA → CPU:

(a) result T_1 (160 bits)

(b) result T_2 (160 bits)

(c) Total: 160 × 2 = 320 bits for the whole process execution.

The time needed to transfer these 960 bits over 100 Mb/s Ethernet is negligible compared with the gain in processing time. In conclusion, we implement in hardware the 4095 × 2 recursive calls to the partial SHA1 that take as input the preprocessed 160 bits and process the remaining 160 bits.

5. HARDWARE IMPLEMENTATION

Figure 2 shows the state machine implemented in the FPGA.

Fig. 2. FSM implemented in the FPGA

The Load and Prefetch states are in charge of loading the operands necessary for the FPGA processing. The following states implement both recursive blocks of 4095 SHA1 runs. The Unload state transmits the final results back to the CPU.

As shown in Section 2.2, the two blocks of 4095 applications of HMAC-SHA1 have no data dependency between them and thus may be executed in parallel. However, within each block I, every recursive application U_n_I depends on the result of the previous application, and therefore can only be executed sequentially. That is why the Process1 state corresponds to the inner SHA1 application for both blocks simultaneously, and the Process2 state corresponds to the outer application for both blocks as well. The preprocessing states initialize the variables needed for the application of SHA1 (W, A, B, C, D and E).

5.1. Implementation of the Process states

The core of these states is the implementation of the main loop of SHA1 shown in the pseudo-code of Section 2.3. All the operations of this loop are performed in parallel in one clock cycle, so the state is executed 80 times. Herein lies the advantage of the hardware implementation.

Figure 3 shows the registers and logic implemented to solve all the operations in one clock cycle. The rectangles represent 32-bit-wide registers and the circles represent combinational logic sections. In each clock cycle, the data flows from the registers, feeds the combinational logic, and the results are stored back in the registers, following the arrows. All the arrows represent 32-bit-wide data paths except those of the control registers t_80 and t_16, as explained in the following sections.

Fig. 3. Implementation diagram for states Process 1 and Process 2

The circles calc_f and calc_k correspond to the combinational calculation of F and K. To calculate TEMP we simply add up the results of the previous calculations, as shown in Figure 3. On the left side of this figure, the connections between the state registers A, B, C, D and E are shown, which update those registers as in the pseudo-code of Section 2.3.

5.1.1. Implementation of array W

The implementation of the array W deserves special mention. It is shown on the right side of Figure 3. In the original algorithm, W has 80 positions, each 32 bits wide. In the first 16 iterations the values are loaded with the message being processed, and then in each iteration t the corresponding position W(t) is calculated using four values of W that depend on t.

As each iteration uses only the previous 16 positions of W, a first improvement is to store only 16 positions and calculate W(t%16) in each iteration. Such an implementation in an FPGA requires four 16-to-1 32-bit multiplexers to select the four positions needed to calculate the current W(t%16), and one extra multiplexer to select the position in W where the result must be stored. These are 5 very large multiplexers, with a very large area cost and an increase in the delay of the calculations.

Our implementation takes advantage of the fact that thesame positions relative to t are always used. We imple-mented array W as a stack, where position W(j+1) is movedto position W(j) in each clock. This has a minimum costin hardware and completely eliminates the multiplexers, be-cause the same positions are always used in the calculations(0,2,8 and 13) and the result is always pushed onto the stack.

The combinational logic needed for the calculation of W(t%16) is in the circle calc w(t), which feeds the stack W and is also used in the calculation of TEMP.
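A short behavioral sketch of this stack, assuming the FIPS 180-1 [5] message schedule (W(t) is the 1-bit left rotation of the XOR of W(t−3), W(t−8), W(t−14) and W(t−16)), which maps onto the fixed stack positions 13, 8, 2 and 0 when position 0 holds the oldest word:

rotl = lambda x, n: ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def calc_w(w):
    # w[0] is the oldest of the last 16 words (W(t-16)), w[15] the newest (W(t-1));
    # the taps 0, 2, 8 and 13 are therefore fixed for every iteration.
    return rotl(w[13] ^ w[8] ^ w[2] ^ w[0], 1)

def push_w(w, new_word):
    # behaves like the hardware stack: W(j+1) moves to W(j) and the new word is pushed on top
    return w[1:] + [new_word]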

5.1.2. Calculation of k, f and W(t%16)

As shown in the pseudo-code described in section 2.3, the calculation of k and f depends on whether t is smaller than 20, 40, 60 or 80. The implementation inferred directly from the algorithm would use a counter and several comparators to select the proper function. Our implementation is better, since it uses less area and eliminates the comparison time, thus reducing the delay of each clock cycle.

We use two shift registers: one 80 bits wide, initialized with 40 set bits followed by 40 cleared bits, and one 40 bits wide, initialized with 20 set bits and 20 cleared bits. In each clock cycle we take the first bit of both registers, resulting in the pair 00 during the first 20 clock cycles (corresponding to t < 20), the pair 01 during the next 20 clock cycles (corresponding to 20 ≤ t < 40), then the pair 10 and finally the pair 11. These bits select the correct output of a multiplexer in the logic of calc k and calc f, as shown in figures 4 and 5.

Fig. 4. Implementation of k calculation

Fig. 5. Implementation of F calculation

In the calculation of W(t) a comparison is also needed: whether t is smaller than 16. In this case we use the same idea as in the previous comparisons. We initialize a shift register with 16 set bits, and in each clock cycle we feed it a cleared bit. During the first 16 clock cycles the first bit of the shift register is set, and for the remaining clock cycles of the 80 it is cleared. This bit selects the correct value through a multiplexer, as shown in figure 6.

Fig. 6. Implementation of W(t%16) calculation
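A behavioral sketch of the selector bits produced by the three shift registers follows. The exact initialization polarity and shift direction are implementation details of the registers described above; the point illustrated is that a distinct 2-bit code per 20-iteration range, and a 1-bit flag for t < 16, are obtained without counters or comparators.

def selector_stream():
    # front bits of the two shift registers form a 2-bit code per 20-iteration range,
    # and the front bit of the 16-bit register is set only during the first 16 iterations
    for t in range(80):
        pair = (1 if t >= 40 else 0, 1 if (t % 40) >= 20 else 0)   # 00, 01, 10, 11
        w_from_message = t < 16                                     # message word vs. calc_w output
        yield pair, w_from_message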

6. RESULTS

To compare the performance of the solution, we synthesized it for several FPGAs and compared the times with the ones obtained with aircrack-ng [6] on a Core 2 Duo. It is important to notice that the aircrack-ng code is highly optimized for the Core 2 Duo processor: the SHA1 kernel is programmed in assembler and takes full advantage of the SIMD (Single Instruction Multiple Data) instructions available in the IA-32 and IA-64 architectures. In this way it processes 4 keys in parallel in each core, taking full advantage of the hardware capabilities of the processor.

To run the tests we used several FPGAs from Xilinx [7]. We conducted simulation tests with several Spartan 3A and Virtex 4 devices. The results are shown in table 1, comparing the device clock frequency, the number of keys per second processed, how many keys are processed in parallel, the cost in dollars, and finally the cost in dollars per key per second processed.

device          clk (MHz)   K/s   K/dev   U$D     $/(K/s)
Spartan 3A -4   81          120   4       60      0.12
Spartan 3A -5   95          141   4       111     0.2
Virtex 4 -10    134         200   33      8076    1.23
Virtex 4 -11    155         231   33      10340   1.36
Virtex 4 -12    180         268   25      7737    1.15
Core 2 Duo      1500        150   8       210     0.7

Table 1. Performance comparison

As we can see in table 1, the results achieved in terms of keys per second are similar between the different Spartan 3A devices and the Core 2 Duo, while the Virtex devices achieved better results. We did not obtain a significant improvement in keys per second processed, probably because these algorithms are designed so that the data dependency is high and parallel implementations are hard to obtain, making the brute-force attack more costly.

On the other hand, when comparing the price per key, significant improvements are obtained with the Spartan 3A family. We conclude that this is a good solution from the engineering point of view, even more so if we take into account not only the prices of the chips (as in table 1) but also the price of the additional electronics needed for those chips to work. Given that the electronics needed for a processor to work is far greater than that needed for an FPGA, the gap widens significantly.

7. CONCLUSIONS AND WORK IN PROGRESS

In this paper we presented an FPGA hardware implementation of the brute-force attack on the key derivation algorithm used in WPA-PSK. We performed an analysis of the algorithm and found an improvement that reduces its computational cost by half. We then analyzed which part of the algorithm should be implemented in the FPGA. We implemented the algorithm and optimized parts of it specifically for the target FPGA architecture. Finally, we compared the performance achieved by the implementation on several FPGAs and on a Core 2 Duo, using code specially optimized for that processor.

The results indicate that the FPGA implementation of the optimized version of the algorithm is a good engineering solution, given that the achieved price per key per second is much smaller than on a CPU.

7.1. Work in progress

Seeking to improve the number of keys per second, we are developing an alternative implementation.

In the implementation explained in this work, we focused on performing the greatest possible number of operations in parallel. That is why all the operations within the main loop are done in parallel. As a result, the achievable clock frequency is low, given the large number of chained operations, and the area occupied by the processing of one key is large, reducing the number of keys that can be processed in one chip.

The idea of this new implementation is the opposite: sequentially execute simple instructions to achieve a high clock frequency and reduce the logic needed to process one key, so that many keys can be processed in parallel in one chip. The idea is to implement a simple ALU, specific to this problem, that contains only the operations needed to execute SHA1. In this way an assembly program for this ALU can be kept in a block RAM that controls all the implemented ALUs at once, and the state of the key derivation can be kept in a block RAM for each available ALU. This idea is still in the design stage, so we do not yet have an estimate of its possible performance.

8. REFERENCES

[1] B. Kaliski “RFC 2898,” pp. 1–34, Sept. 2000.

[2] M. Bellare, R. Canetti, and H. Krawczyk, “Keyed hash functions and message authentication,” Lecture Notes in Computer Science, vol. 1109, pp. 1–15, 1996.

[3] M. Bellare, R. Canetti, and H. Krawczyk, “RFC 2104,” pp.1–11, Feb. 1997.

[4] D. Eastlake and P. Jones “RFC 3174,” pp. 1–22, Sept. 2001.

[5] National Institute of Standards and Technology, USA, “Secure Hash Standard,” Federal Information Processing Standard (FIPS) 180-1, April 1993.

[6] www.aircrack-ng.org

[7] www.xilinx.com


AUTOMATIC SYNTHESIS OF SYNCHRONOUS CONTROLLERS WITH LOW ACTIVITY OF THE CLOCK

Jozias Del Rios*, Duarte L. Oliveira

Divisão de Engenharia Eletrônica Instituto Tecnológico de Aeronáutica – ITA São José dos Campos – São Paulo – Brazil

email: [email protected], [email protected]

Leonardo Romano†

Departamento de Engenharia Elétrica Centro Universitário da FEI

São Bernardo do Campo – São Paulo – Brazil email: [email protected]

ABSTRACT

In a digital system the activity of the clock signal is a major consumer of energy, accounting for 15% to 45% of the energy consumed. In this article we propose a method for the automatic synthesis of synchronous controllers with low activity of the clock signal. The reduction of clock activity is obtained by applying two strategies. In the first strategy, our controllers operate on both edges of the clock signal; this allows a 50% reduction in the clock frequency while keeping the same processing time. An important feature is that our controllers use only flip-flops that are sensitive to a single edge of the clock signal (single-edge triggered, SET-FF). In the second strategy, the clock signal is inhibited in our controllers when they encounter a state with a self-loop.

1. INTRODUCTION

With the evolution of microelectronics, more and more high-complexity digital systems are being designed. A common characteristic of these systems is that they are battery-powered and conceived for different applications such as wireless communication, portable computers, aerospace (satellites, missiles, etc.), aviation, automotive and medical applications. Since they are battery-powered, it is desirable that the batteries have a long life span, and therefore power dissipation is a very important parameter in the design of such systems [1]. These systems may be implemented in VLSI technology and/or FPGAs (Field Programmable Gate Arrays). FPGAs have become a popular means of implementing digital circuits; FPGA technology has grown considerably in the past few years, producing FPGAs with up to 50 million gates and allowing complex digital systems to be programmed into such devices [2]. Traditionally, digital circuits are implemented with components built in CMOS technology. The power dissipated in CMOS components is given by the following expression [3]:

P_TA = (1/2)·C·V_DD²·f·N + Q_SC·V_DD·f·N + I_leakage·V_DD   (1a)

where P_TA is the total average dissipated power, V_DD is the supply voltage, f is the operating frequency, the factor N is the switching activity, that is, the number of transitions at a gate output, and Q_SC and C are, respectively, the short-circuit charge quantity and the capacitance [3]. In equation (1a), the first term represents the dynamic dissipated power, the second term represents the power dissipated due to the short-circuit current, and the third term represents the static power dissipated due to the leakage current. In static CMOS technology, the largest fraction of the dissipated power occurs during switching events (dynamic power) [3]. The average power dissipated at a gate g may therefore be simplified to the first term of (1a):

P_AVERAGE-g = (1/2)·C_g·V_DD²·f·N_AVERAGE-g   (1b)
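A purely illustrative numeric evaluation of (1b) with assumed values (the capacitance, voltage, frequency and activity below are not taken from the paper):

C_g = 20e-15          # gate load capacitance [F] (assumed)
V_DD = 1.2            # supply voltage [V] (assumed)
f = 100e6             # clock frequency [Hz] (assumed)
N_avg = 0.25          # average switching activity at the gate output (assumed)

P_avg_g = 0.5 * C_g * V_DD**2 * f * N_avg
print(P_avg_g)        # ~3.6e-7 W, i.e. about 0.36 uW for this single gate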

The synthesis of finite state machines (FSMs) plays an important role in the design of battery-powered digital circuits. Many digital circuits are described by an architecture consisting of a network of controllers plus datapaths and/or processors [4]. The synchronous controllers of such circuits are often specified as an FSM consisting of several states and transitions between states, i.e., they are specified by a state transition graph (STG). Techniques for reducing dynamic power are applied at the different levels of digital design [1]. In the synthesis of synchronous controllers, solutions for reducing power have been proposed at the logic level: 1) clock control logic (gated clock) [5],[6]; 2) flip-flops triggered on both edges of the clock [7],[8],[9]; 3) decomposition [10]; 4) state assignment [11]; 5) logic minimization [12]. In a digital system, the sequential part is the main contributor to power dissipation. Recent studies have shown that in such systems the clock consumes a large percentage (15% to 45%) of the system's power [5]. Hence, the dissipated power of a circuit may be considerably reduced if the clock activity is reduced. Among the solutions proposed for reducing the dynamic power dissipated by synchronous controllers, the first two are particularly interesting.

* Student of Electronic Engineering at ITA. † Candidate for the Master of Science degree at ITA.

1.1. Reduction in the activity of the clock

The first strategy (gated clock) uses additional logic to inhibit (stop) the clock signal in states with a self-loop. For some FSMs, most clock cycles are spent in states with self-loops. In these states there is still dynamic power dissipation, because the flip-flops (FFs) switch internally even though no changes occur at their outputs. Benini et al. [5] proposed a method and a target architecture for synchronous controllers with clock inhibition; these controllers operate on a single edge of the clock signal. The second strategy is to use, in the controllers, FFs that are sensitive to both the rising and the falling edges of the clock signal (double-edge triggered, DET) [9]. For a given data throughput, DET-FFs require only 50% of the clock frequency and therefore reduce the activity of the clock. Comparing DET-FFs with FFs that operate on a single clock edge (single-edge triggered, SET), DET-FFs show an increase in power consumption and area (number of transistors) [8]. A promising approach to reducing the activity of the clock signal is the union of the two strategies: gated clock + FSM using DET-FFs. This approach has two problems: 1) most VLSI standard-cell libraries do not include this kind of FF (DET), and the macro-cells in FPGAs use SET D-FFs [13]; 2) for the architecture proposed in [5], clock inhibition in states with self-loops is sub-optimal when applied to Moore FSMs that use DET-FFs. In this paper we propose a tool for the automatic synthesis of Moore-model synchronous controllers. Our method drastically reduces the activity of the clock signal. This reduction is achieved through two strategies. In the first strategy, our method synthesizes synchronous controllers that operate on both edges of the clock signal but use only SET D-FFs. In the second strategy, we propose a new architecture that inhibits the clock signal, on both edges, in states with self-loops. Figure 1 shows the Moore-model target architecture used to implement our synchronous controllers.

Fig. 1. Target architecture proposal: Moore model (inputs, excitation logic, clock inhibition logic, latches, two banks of flip-flops clocked by Gclk1 and Gclk2, state variables and output logic).

The remainder of this paper is organized as follows. Section 2 presents some concepts needed to understand our method; section 3 introduces our method; section 4 illustrates our method with an example from the literature; section 5 discusses the advantages and limitations of our method and some results; and finally, section 6 presents our conclusions and future work.

2. PRELIMINARY

A synchronous controller is a deterministic FSM of Moore or Mealy type whose behavior is described by an STG. The vertices represent states and the arcs represent state transitions. The main idea of our method is to partition the Moore STG into two subsets of state transitions such that each subset is associated with one bank of D flip-flops. One bank of SET D-FFs operates on the rising edge (CLK+) of the clock signal and the other bank on the falling edge (CLK−). The partitioning is achieved by constructing a graph called the clock transition graph (CTG), proposed in [14].

2.1. Clock transition graph

In this section we present the CTG and the concepts needed to manipulate it. In the CTG, a state transition between any two states of the STG is defined as a bridge, because the transition is not directional. Definition 2.1. A Clock Transition Graph (CTG) is an undirected graph <V, A, S>, where V is the set of vertices that describe states, A is the set of edges that describe bridges, labeled with the clock polarity in {+, −}, and S is the initial state. Figures 2a and 2b show, respectively, an STG with inputs and outputs omitted and its CTG without labels (clock signal). Figure 2a shows six state transitions; figure 2b shows four bridges. The self-loop in state B is not a bridge, and the two state transitions between states C and D form a single bridge (C-D). The interconnection of a set of bridges forms a path; if the connection of bridges begins and finishes in the same state, it defines a cycle. Figure 2b shows the cycle {(A-B), (B-C), (C-D), (D-A)}. A bridge is defined as positive if it is associated with the rising edge of the clock (CLK+), and as negative if it is associated with the falling edge (CLK−). A cycle is called degenerate if all its bridges are of the same type, either positive or negative. A cycle is even if it has an even number of bridges; otherwise the cycle is odd. The CTG is called degenerate if all its bridges are of the same type, and the synchronous FSM is degenerate if its CTG is degenerate. Figures 3a and 3b show, respectively, an even degenerate cycle and an even non-degenerate cycle; in figure 3a the CTG is degenerate.

Fig. 2. Specification: a) STG omitting inputs and outputs; b) CTG omitting the clock signal.

Fig. 3. CTG: a) even degenerate cycle; b) even non-degenerate cycle.
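A small Python sketch of the concepts above, with hypothetical helper names (not the authors' implementation): bridges are undirected, self-loops are dropped, parallel transitions collapse into one bridge, and a cycle can be checked for parity and degeneracy once its bridges are labeled.

def build_bridges(transitions):
    """transitions: iterable of (src, dst) state pairs from the STG."""
    bridges = set()
    for src, dst in transitions:
        if src != dst:                          # a self-loop is not a bridge
            bridges.add(frozenset((src, dst)))  # C->D and D->C form a single bridge
    return bridges

def cycle_is_even(cycle_bridges):
    return len(cycle_bridges) % 2 == 0

def cycle_is_degenerate(labels, cycle_bridges):
    """labels: dict bridge -> '+' or '-' (CLK+ / CLK-)."""
    return len({labels[b] for b in cycle_bridges}) == 1   # all bridges of the same type

# Example with the STG of figure 2: six transitions, a self-loop on B, two arcs between C and D
stg = [("A", "B"), ("B", "B"), ("B", "C"), ("C", "D"), ("D", "C"), ("D", "A")]
print(len(build_bridges(stg)))   # 4 bridges, as in figure 2b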

3. AUTOMATIC SYNTHESIS: METHOD

Our controller is an incompletely specified FSM of Moore type. Our tool synthesizes the controller using D flip-flops. It follows the traditional procedure and has five steps:
1. Capture the description of the controller in the Moore STG. The tool accepts the description of the STG either in the kiss2 format [15] or in the proposed format called EMS (Explicit Machine Specification) [16].
2. Perform state minimization of the STG using the partitioning algorithm, obtaining the STGMIN [15]. In this step the partitioning algorithm, which is suited to completely specified machines, is modified to accept incomplete specifications; the modification uses a heuristic to assign the don't-care outputs and states.
3. From the STGMIN, build the CTG (section 3.1).
4. Using the CTG, encode the STGMIN with reduced switching, obtaining the STGMIN-COD (section 3.2).
5. From the STGMIN-COD, using the Espresso algorithm [15], extract the excitation, output and inhibition equations in sum-of-products form.

3.1. Generation of CTG

The CTG is obtained from the STGMIN. Our algorithm is composed of three steps [16]:
1. Generate the CTG and extract the set of bridges and cycles.
2. Adjust the parity of each cycle.
3. Define the label (type) of each bridge.
The algorithm seeks to define an alternation of bridge types in the CTG, and the definition of the bridge types in a cycle must satisfy lemma 3.1.
Lemma 3.1 (without proof). Let E = {A, B, C, ..., Z} be the set of states of the CTG, let Cy be any cycle of the CTG, where Cy = {(A-B), (B-C), ..., (X-A)}, and let (D-K) ∈ Cy be a bridge of the CTG. The CTG is not encoded in the non-degenerate form if and only if Cy contains a single path of positive or of negative bridges.
In the first step the algorithm extracts the ordered bridges, without labels, and all the cycles of the CTG; this task is performed by a depth-first traversal of the STGMIN. The second step verifies the parity of each cycle. For cycles with an odd number of bridges, our method optionally adjusts the cycle to an even number of bridges; the adjustment is accomplished by introducing a NOP state. The advantage of balanced partitioning is that it allows a greater reduction in the activity of the clock signal and in the number of state variables. Figure 4a shows a CTG with a cycle with an odd number of bridges; figure 4b shows the CTG after the parity adjustment obtained by introducing the NOP state. The third step of the algorithm labels the type of each bridge: if the bridge is negative the label is CLK−, and if it is positive the label is CLK+. Figure 3a does not satisfy lemma 3.1, because there is only a single path of positive bridges. Figures 5a and 5b show, respectively, an unbalanced cycle and a balanced cycle; both satisfy lemma 3.1 because there are at least two paths of positive and negative bridges. The introduction of NOP states in a CTG is an alternative way to satisfy lemma 3.1.

Fig. 4. CTG: a) cycle with an odd number of bridges; b) parity adjustment of the cycle (NOP state inserted).

Fig. 5. CTG: a) unbalanced cycle; b) balanced cycle.

3.2. State assignment

In this section we describe the proposed algorithm for encoding the CTG. The encoding is performed in three steps [16]: 1. symbolic coding; 2. code reduction; 3. binary encoding.


3.2.1. Symbolic coding
The target architecture shown in Fig. 1 is characterized by the partitioning of the FFs into two banks. One bank of FFs operates on the rising edge of the clock (B_CLK+) and the other on the falling edge (B_CLK−). In the first step the proposed algorithm symbolically encodes the CTG. Each state of the CTG is symbolically coded, and each symbolic code is formed by two semi-codes, one related to each bank of FFs; each semi-code is an integer value. This code must satisfy the code-partition rule (rule_pc) below, where each state is encoded symbolically by two concatenated semi-codes.
Rule_pc: Let Pj be the bridge between states A and B of the CTG, and let the semi-codes Sci, Sck ∈ B_CLK+ and Scx, Scy ∈ B_CLK−, where each Sc is an integer value. If Pj is negative, then states A and B must have the same semi-code in B_CLK+, for example A = Sci & Scx and B = Sci & Scy. If Pj is positive, then states A and B must have the same semi-code in B_CLK−, for example A = Sci & Scx and B = Sck & Scx.
Figures 6a and 6b show the CTG encoded symbolically; in Fig. 6a the encoding satisfies rule_pc, while in Fig. 6b it does not, because the transition B→C violates rule_pc. In Fig. 6a the positive and negative symbolic codes are, respectively, [0, 1, 2] and [0, 1, 2].

Fig. 6. CTG encoded symbolically: a) satisfies rule_pc; b) does not satisfy rule_pc.
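A minimal sketch of a rule_pc check in Python, assuming each state code is stored as a pair (positive-bank semi-code, negative-bank semi-code); the function and variable names are illustrative, not the tool's implementation.

def satisfies_rule_pc(codes, labeled_bridges):
    """codes: state -> (semi_pos, semi_neg); labeled_bridges: (stateA, stateB, '+' or '-')."""
    for a, b, kind in labeled_bridges:
        pos_a, neg_a = codes[a]
        pos_b, neg_b = codes[b]
        if kind == '-' and pos_a != pos_b:   # negative bridge: the CLK+ half must not change
            return False
        if kind == '+' and neg_a != neg_b:   # positive bridge: the CLK- half must not change
            return False
    return True

# hypothetical two-state example: a positive bridge A-B sharing the negative-bank semi-code
print(satisfies_rule_pc({"A": (0, 0), "B": (1, 0)}, [("A", "B", "+")]))   # True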

3.2.2. Reduction of code
The state-variable reduction algorithm merges semi-codes [16]. The algorithm generates all combinations of semi-code merges (brute force). Figures 7a, 7b and 7c show the maps of semi-codes.

Fig. 7. Maps of semi-codes: a) initial; b) merge of positive bridges; c) merge of negative bridges.

1 The symbol & means concatenation.

3.2.3. Binary encoding
The proposed algorithm for the binary encoding of the CTG uses the concepts of inversion masks and snake-sequence [16]. The encoding is performed in two steps: the first step handles the bank B_CLK+ and the second the bank B_CLK−. Figures 8a and 8b show the encoded CTG.

Fig. 8. CTG encoded: a) three variables; b) four variables.

4. CASE STUDY

In this section we illustrate our method with the seven.ems benchmark (Fig. 9). The Syntool_DET tool reads the specification in the kiss2 or EMS format. The second step is state minimization; states D and E were merged. The third step generates the CTG; the option to adjust the parity of the cycle was used, the NOP state was inserted between states H and A (Fig. 10), and the CTG was labeled with the clock signal (CLK) (Fig. 11). In the fourth step the symbolic encoding of the CTG is performed (Fig. 12). The resulting symbolic encoding needs three codes [0,1,2] in B_CLK+ and four codes [0,1,2,3] in B_CLK−. The next step is code reduction: code 2 in B_CLK+ was eliminated, the states being covered by the two codes [0,1] (see Figs. 13 and 14). In the last stage of the fourth step the binary encoding is performed; Figure 15 shows the encoded CTG. The encoding needs three state variables: two for the bank B_CLK− and one for the bank B_CLK+. The fifth and last step is logic minimization; Figures 16, 17 and 18 show the resulting logic.

Fig. 9. STG: benchmark seven.ems (states A–H with their outputs x y z and the input conditions on the transitions).

Fig. 10. CTG-MIN-CP: benchmark seven.ems with parity adjustment.


Fig. 11. CTG labeled.

Fig. 12. CTG symbolically encoded.

Fig. 13. Map of symbolic codes: a) initial; b) reduced.

Fig. 14. CTG with reduced symbolic codes.

Fig. 15. CTG encoded.

Fig. 16. Logic circuit: excitation logic.

Fig. 17. Logic circuit: clock inhibition logic.

Fig. 18. Logic circuit: output logic.

5. DISCUSSION & RESULTS

Our tool has two options for automatic synthesis: conventional synthesis and clock-activity reduction. The conventional synthesis follows the traditional procedure [15]. We applied our tool to ten examples (benchmarks) from the literature. Table 1 shows the data of the specifications and of the conventional synthesis obtained by our tool: the processing time and the excitation logic, expressed as the number of literals and the number of gates. Table 2 shows the data obtained by the clock-reduction synthesis, including the excitation logic and the inhibition function h. The bank-of-flip-flops column shows the number of FFs that operate on the rising edge (+) and on the falling edge (−) of the clock signal. Compared with the conventional synthesis, our method achieved a 14% reduction in the number of literals with a penalty of 5% in the number of gates.

Table 1. Results of conventional synthesis

Example     Inputs/Outputs   States   Literals / Gates / Time of Proc. (s)
Alarm       5/3              3        1 7 7 2
Complex     3/3              11       1 0 1 2 9 1 2 0
Dumbbell    2/2              8        2 3 1 2 2
Mark1       5/16             14       1 2 8 3 8 2
Pma         8/8              24       4 1 8 1 1 2 1 2 0
Seven       3/3              8        6 6 2 8 1
Six         1/1              8        5 1 2 1 2
Shiftreg    3/2              6        1 5 8 1
Three       2/1              3        1 5 8 1
Tma         7/6              20       2 6 7 8 0 1 2 0

Table 2. Results of the synthesis with reduced clock activity (R_A_clock)

Example     Inputs/Outputs   States   Literals / Gates / Time of Proc. (s)   Bank of flip-flops
Alarm       5/3              3        2 8 8 3                                1+ / 1-
Complex     3/3              11       6 9 3 5 1 2 7                          2+ / 2-
Dumbbell    2/2              8        2 3 1 1 4                              2+ / 1-
Mark1       5/16             14       1 3 0 4 9 3                            2+ / 3-
Pma         8/8              24       3 4 9 1 0 7 2 8                        4+ / 4-
Seven       3/3              8        7 0 3 3 6                              2+ / 2-
Six         1/1              8        4 3 2 0 1                              1+ / 3-
Shiftreg    3/2              6        1 7 1 0 2                              2+ / 1-
Three       2/1              3        2 4 1 2 3                              1+ / 1-
Tma         7/6              20       2 1 8 7 4 1 8 5                        2+ / 4-

6. CONCLUSION

This article presented the Syntool_DET tool, which automatically synthesizes synchronous controllers with clock inhibition that operate on both edges (rising/falling) of the clock signal. The controllers use only FFs that are sensitive to a single edge of the clock signal. This feature allows a large reduction in the activity of the clock and, therefore, in power consumption. The Syntool_DET tool was implemented in C++ with about 10,000 lines of code. As future work we intend to test the tool thoroughly, estimate the power consumption for a large set of benchmarks and compare it with conventional synthesis [17].

7. REFERENCES

[1] Li-Chuan Weng, X. J. Wang, and Bin Liu, “A Survey of Dynamic Power Optimization Techniques,” Proc. Of the 3rd IEEE Int. Workshop on System-on-Chip for Real-Time Applications, pp. 48-52, 2003.

[2] J. J. Rodriguez, et al., “Features, Design Tools, and Application Domains of FPGAs,” IEEE Trans. on Industrial Electronics, vol. 54, no. 4, pp. 1810-1823, August 2007.

[3] F. Najm, “A Survey of Power Estimation Techniques in VLSI Circuits,” IEEE Trans. On VLSI Systems, vol. 2, no. 4, pp.446-455, December 1994.

[4] L. Jozwiak, et al., “Multi-objective Optimal Controller Synthesis for Heterogeneous embedded Systems,” Int. Conf. on Embedded Computer Systems: Architectures, Modeling and Simulation, pp. 177-184, 2006.

[5] Luca Benini and G. De Micheli, “Automatic Synthesis of Low-Power Gated-Clock Finite-State Machines,” IEEE Trans. on CAD of Integrated Circuits and Systems, Vol.15, No.6, pp.630-643, June 1996.

[6] Q. Wu, M. Pedram, and X. Wu, “Clock-Gating and Its Application to Low Power Design of Sequential Circuits,” IEEE Trans. On Circuits and Systems-I: Fundamental Theory and Applications, vol. 47, no.103, pp.415-420, March 2001.

[7] G. M. Strollo et al., “Power Dissipation in One-Latch and Two-Latch Double Edge Triggered Flip-Flops,” Proc. 6th IEEE Int. Conf. on Electronic, Circuits and Systems, pp.1419-1422, 1999.

[8] S. H. Rasouli, A. Kahademzadeh and et al. “Low-power single- and double-edge-triggered flip-flops for high-speed applications,” IEE Proc. Circuits Devices Syst., vol. 152, no. 2, pp.118-122, April 2005.

[9] P. Zhao, J. McNeely, et al., “Low-Power Clock Branch Sharing Double-Edge Triggered Flip-Flops,” IEEE Trans. On VLSI Systems, vol. 15, no.3, pp.338-345, March 2007.

[10] B. Liu, et al., “FSM Decomposition for Power Gating Design Automation in Sequential Circuits,” 76th Int. Conf. on ASIC, ASICON, pp.944-947, 2005.

[11] S. Chattopadhyay, et al. “State Assignment and Selection of Types and Polarities of Flip-Flops, for Finite State Machine Synthesis,” IEEE India Conf. (INDICON), pp.27-30, 2004.

[12] J.-Mou Tseng and J.-Yang Jou, “A Power-Driven Two-Level Logic Optimizer,” Proc. Of the ASP-DAC, pp.113-116, 1997.

[13] Internet: www.altera.com, 2009.

[14] D. L. Oliveira, et al., “Synthesis of Low-Power Synchronous Controllers using FPGA Implementation,” IEEE IV Southern Conference on Programmable Logic, pp.221-224, 2008.

[15] R. H. Katz, Contemporary Logic Design, The Benjamin/Cummings Publishing Company, Inc., 2nd edition, 2003.

[16] Jozias Del Rios, et al., “Automação do Projeto de Circuitos Controladores Síncronos de Baixa Potência,” Relatório Técnico – ITA Junior, 2008.

[17] J. H. Anderson and F. N. Najm, “Power Estimation Techniques for FPGAs,” IEEE Trans. On VLSI Systems, vol. 12, no. 10, pp.1015-1027, October, 2004.


AJUSTE DE HIERARQUIA DE MEMÓRIA PARA REDUÇÃO DE CONSUMO DE ENERGIA COM BASE EM OTIMIZAÇÃO POR ENXAME DE PARTÍCULAS (PSO)

Cordeiro, F.R.; Caraciolo, M.P.; Ferreira, L.P. and Silva-Filho, A.G.

Informatics Center (CIn) Federal University of Pernambuco (UFPE)

Av. Prof. Luiz Freire s/n – Cidade Universitária – Recife/PE - Brasil email: { frc, mpc, lpf, agsf }@cin.ufpe.br

ABSTRACT

Tuning the memory hierarchy parameters of embedded-platform applications can dramatically reduce energy consumption. This article presents an optimization mechanism aimed at tuning the parameters of a two-level memory hierarchy with separate instruction and data caches at both levels. The proposed strategy uses particle swarm optimization (PSO) to provide decision support to the designer. The mechanism aims at reducing energy consumption and improving the performance of embedded applications. It finds a set of memory hierarchy configurations (a Pareto front) and supports the architecture designer by providing a set of non-dominated solutions for decision making. Results for 4 applications of the Mibench benchmark suite were compared with another evolutionary technique, and better results were observed in all the analyzed cases.

1. INTRODUCTION

Nowadays, integrated circuits for the development of embedded systems are increasingly present in many areas, such as robotics, automotive systems, home appliances and portable devices. About 80% of the circuits developed are intended for embedded-system applications, whose demand keeps growing due to the rapid development of mobile technologies and the increase in their efficiency. The complexity of designing integrated-circuit systems grows every day, with transistor density doubling every 18 months according to Moore's law. This increase in complexity translates into ever more functionality aggregated into smaller devices, associated with lower cost, lower power consumption and better performance. With the expansion and development of embedded-system applications, the market has been demanding fast and efficient solutions with respect to parameters such as performance, area and the energy that an application may consume. The analysis of these parameters must be done quickly in order to meet market demand.

A large portion of the integrated circuits developed for embedded applications contains heterogeneous processors and, frequently, cache memories. It is known that the energy consumption of the memory hierarchy can reach up to 50% of the energy consumed by a microprocessor [1][2]. Therefore, by optimizing the memory architecture it is possible to reduce the energy consumption of the processor and, consequently, of the embedded system. Many efforts have been made to reduce energy consumption by tuning the cache parameters according to the needs of a specific application. However, since the fundamental purpose of the cache subsystem is to provide high-performance memory access, cache optimization techniques must not only save energy but also prevent degradation of the application's performance. Tuning the cache memory parameters for a specific application can save on average 60% of the energy consumption [3]. However, finding an adequate cache configuration (combination of total size, line size and associativity) for a specific application can be a complex task and may require a long period of analysis and simulation. Additionally, tools that collect data directly from the chip can be slow, especially for testing. Thus, the use of simulators that analyze the components at a more abstract level has become more viable to meet the speed and demand of the market. Another problem is that exploring all possible configurations may require a large amount of time, making an exhaustive search for the best solution infeasible. For memory hierarchies with one cache level, varying the total cache size, line size and associativity, it is possible to obtain dozens of configurations with specific characteristics [4]. In memory hierarchies with a second cache level, where both levels are split into instruction and data caches, hundreds of configurations are possible [5]. Additionally, for memory hierarchies with a unified second cache level, the possibilities involve thousands of configurations that could be tested, due to the interdependence between instructions and data [6].

In order to reduce the set of simulations needed to find a configuration that is among the best possible ones, some search mechanisms have been proposed in the literature. The statistical analysis of these mechanisms is important to detect in which situations each approach brings the greatest benefits to cache memory optimization. Thus, when a new technique is proposed, a study of the performance obtained and a comparative analysis with existing techniques should be carried out. In this work, an implementation of the particle swarm optimization (PSO) algorithm is used to optimize the memory architecture, and a statistical analysis of its performance is carried out. The results obtained are compared with the results of another technique proposed in the literature, TEMGA, which is based on genetic algorithms, analyzing the efficiency of PSO with respect to this exploration mechanism [8].

2. RELATED WORK

Considering the impact of the energy consumption reduction obtained by tuning cache parameters, many studies have been carried out to help the designer choose these parameters. However, the contributions regarding the tuning of two-level cache hierarchies with separate instruction and data caches have been fewer, due to their higher complexity.

Gordon-Ross et al. [6] extended Zhang's heuristic, aimed at single-level caches, and proposed the TCaT heuristic, aimed at two-level hierarchies. The use of the TCaT heuristic allowed an energy reduction of 53% when compared to Zhang's heuristic.

Silva-Filho et al., based on the tuning of cache parameters, proposed the TECH-CYCLES [7] and TEMGA [8] heuristics, the latter based on genetic algorithms. Using TECH-CYCLES it was possible to observe an energy consumption reduction of 41% for the instruction cache, while TEMGA obtained a 15% reduction for the data cache.

To the best of our knowledge, no work has yet been done involving the use of PSO to reduce the energy consumption and the number of cycles of an application. Since PSO is an efficient algorithm for optimization problems, it is interesting to observe its performance compared with another technique proposed in the literature, TEMGA, which is based on genetic algorithms, evolutionary search mechanisms from the same area as PSO.

3. PARTICLE SWARM OPTIMIZATION

In computational terms, the particle swarm optimization (PSO) algorithm proposed by Kennedy [9] is a problem-solving technique based on swarm intelligence, inspired by the social behavior of a flock of birds. The PSO technique simulates the movement of birds searching for an optimal solution in the search space of a given problem. This concept is linked to the simplified model of swarm theory in which the birds (particles) make use of their own experience and of the experience of the flock to find the best search region. In this scenario, the particles use specific communication processes in order to reach a common, adequate solution, that is, one of good quality for the problem.

The PSO algorithm may converge to sub-optimal solutions, but its use ensures that no point of the search space has zero probability of being examined. Every search and optimization task has several components, among them the search space, where all possible solutions of a problem are considered, and the evaluation function (reward and cost), a way of evaluating the members of the search space. The PSO algorithm operates on a swarm of particles, each having a velocity vector and a position vector; the position of each particle is updated according to its current velocity, the knowledge acquired by the particle and the knowledge acquired by the flock. Thus, the particles can search different areas of the solution space, so that when one particle discovers a possibly better solution, all the other particles move closer to it, exploring this search region more deeply during the process.

At each iteration t, the velocity of particle i is updated according to the equation:

v_i(t+1) = ω·v_i(t) + c1·r1·(p_i(t) − x_i(t)) + c2·r2·(p_g(t) − x_i(t)),   (1)

where ω is an inertia weight that controls the exploration capability of the algorithm, the two confidence parameters c1 and c2 indicate how much a particle trusts itself and the flock, respectively, and r1 and r2 are random numbers generated uniformly between 0 and 1. Updating the velocity in this way allows particle i to move according to the best position it has found individually, p_i, and the best position found by the whole swarm, p_g.

Based on the velocity equation (1), the new position of the particle is calculated according to the equation:

x_i(t+1) = x_i(t) + v_i(t+1).   (2)

The new position is therefore the combination of the previous position and the new velocity. Based on equations (1) and (2), the swarm of particles tends to cluster together while, at the same time, each particle moves randomly in several directions.

When the PSO algorithm is used to solve real problems, an initial swarm of particles is randomly generated, where each particle corresponds to a possible solution of the problem.

During the evolutionary process of the algorithm, the particles move through the search space as the position and velocity vectors are updated, and are then evaluated, measuring their degree of fitness. Fitness, in this context, reflects how close the particle is to the region that contains the optimal solution.

Once the particles have been evaluated, pbest and gbest are extracted, that is, the best position found by each particle and by the whole swarm, respectively. After updating the velocities and positions of each particle of the swarm, if the stopping criterion has been reached, the solution found for the problem is presented. Otherwise, the fitness evaluation is applied again to the swarm, the values of pbest and gbest are updated whenever a better solution appears, and the velocity and position of each particle of the swarm are updated. The loop continues until the stopping criterion is reached.

The flowchart in Figure 1 outlines the algorithm described above [10].

Fig. 1. Flowchart of the PSO algorithm
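A compact, generic sketch of equations (1) and (2) and the pbest/gbest loop summarized in the flowchart of Fig. 1; the function and parameter names are illustrative, not the implementation used in this work.

import random

def pso(fitness, init_positions, w=0.7, c1=1.5, c2=1.5, iterations=100):
    # minimal continuous PSO; 'fitness' is the cost function to be minimized
    x = [list(p) for p in init_positions]
    v = [[0.0] * len(p) for p in init_positions]
    pbest = [list(p) for p in x]
    pbest_f = [fitness(p) for p in x]
    g = min(range(len(x)), key=lambda i: pbest_f[i])
    gbest, gbest_f = list(pbest[g]), pbest_f[g]
    for _ in range(iterations):
        for i, xi in enumerate(x):
            for d in range(len(xi)):
                r1, r2 = random.random(), random.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - xi[d])   # own experience
                           + c2 * r2 * (gbest[d] - xi[d]))     # swarm experience
                xi[d] += v[i][d]                               # equation (2)
            fi = fitness(xi)
            if fi < pbest_f[i]:
                pbest[i], pbest_f[i] = list(xi), fi
                if fi < gbest_f:
                    gbest, gbest_f = list(xi), fi
    return gbest, gbest_f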

4. PROPOSED OPTIMIZATION WITH PSO

The model proposed in this work is a variation of the particle swarm optimization algorithm for cache memory hierarchy optimization, named TEMPSO (Two-Level Exploration Mechanism based on Particle Swarm Optimization). For the cache memory optimization process, TEMPSO maps a candidate cache architecture to a particle, where each particle is a possible solution in the search space.

The first parameter to be defined is the search space in which the best solution to the problem is sought. In the target scenario of this problem, the search space is defined by discretized attributes, as presented in section 3. The particle velocity and position update operators must be adapted so that they model valid solutions to the problem. For this purpose, a domain of possible velocity and position values was defined for each particle.

Based on the pre-defined search space, the possible values for the velocity and the position of the particles were generated, as presented in Tables 2 and 3. Note that the velocity values were defined so that they generate valid positions representing feasible architectures.

Table 2. Possible velocities for the particles.

Parameter     Values
Velocities    {'/4', '/2', '*1', '*2', '*4'}

Table 3. Position space of the particles.

Parameter       Level-1 cache      Level-2 cache
Cache size      2KB, 4KB, 8KB      15KB, 32KB, 64KB
Line size       8B, 16B, 32B       8B, 16B, 32B
Associativity   1, 2, 4            1, 2, 4
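A purely illustrative sketch, with hypothetical helper names, of how the multiplicative velocities of Table 2 can move one parameter through the discrete value sets of Table 3; the snapping strategy shown here (nearest allowed value) is an assumption for illustration, not necessarily the one adopted by TEMPSO.

VELOCITIES = {'/4': 0.25, '/2': 0.5, '*1': 1, '*2': 2, '*4': 4}
L1_CACHE_SIZES = [2048, 4096, 8192]          # 2KB, 4KB, 8KB (level-1 sizes of Table 3)

def apply_velocity(value, vel, allowed):
    raw = value * VELOCITIES[vel]
    return min(allowed, key=lambda a: abs(a - raw))   # snap back into the valid set

print(apply_velocity(2048, '*4', L1_CACHE_SIZES))     # 8192: 2KB moved up to 8KB
print(apply_velocity(8192, '/2', L1_CACHE_SIZES))     # 4096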

At the beginning of the execution of the TEMPSO algorithm, an initial set of particles representing possible solutions to the problem is defined. Initially, the particle with the lowest cost function is assumed to be the best architecture. At each iteration, all the architectures are evaluated based on a cost function calculated from the number of cycles and the energy consumed by the application. Once the particles have been evaluated, pbest and gbest are extracted, i.e., the best position found by each particle and by the whole swarm, respectively. The velocities and positions of the particles are then updated. Because the search space is restricted and discrete, variations of the velocity and position equations were proposed in order to obtain valid configurations as the particles move through the search regions. The equations are presented below. The velocity equation:

vid = C1*vid • C2*(pid − xid) • C3*(pgd − xid) • C4*vrnd()   (3)


Here • represents a logical 'OR' operator. That is, weights represented by the confidence parameters C1, C2, C3 and C4 are defined; they determine the probability that the new velocity of the particle is the same as the previous one it was moving with, or a velocity influenced by the best position found by the particle itself, or a velocity based on the best position found by the whole swarm, or, finally, a random velocity. The choice is made by randomly selecting one of the possible velocities, weighted by the confidence parameters. This velocity-selection mechanism is inspired by one of the mechanisms used to select the best individual in the evolutionary process of genetic algorithms: the roulette wheel [11]. Figure 2 illustrates an example of the roulette-based selection process adapted for TEMPSO.

Fig. 2. Velocity update mechanism

The position update equation is defined by the formula:

xid = xid ⊕ vid   (4)

where ⊕ represents the application of the velocity parameters, which are the arithmetic operators multiplication '*' and division '/'. Based on equations (3) and (4), the particles move through the search regions looking for a good common solution and, most importantly, remain at valid positions with respect to the search space defined for the problem.

After the velocities and positions of each particle are updated, if the stopping criterion is satisfied, the best architecture found for the problem is presented. Otherwise, the cost evaluation is applied again to the swarm, the pbest and gbest values are updated whenever a better solution appears, and the velocity and position of each particle of the swarm are updated again. The loop continues until the stopping criterion is reached. An analysis varying the number of particles of the swarm is presented in the next section in order to validate the convergence of the swarm towards good solutions.

5. EXPERIMENTAL ENVIRONMENT

The approach proposed in this work uses the PSO algorithm to optimize a two-level cache memory architecture with separate instruction and data caches. This type of memory architecture is widely used in the market and can be configured by adjusting six parameters: cache size, line size and associativity, for each of the two levels. Figure 3 shows the architecture considered, composed of a MIPS processor, a first level with independent instruction (IC1) and data (DM1) caches, a second level with instruction (IC2) and data (DM2) caches, and a main memory (MEM). A supply voltage of 1.7 V is used, with a write-through policy and 0.08 µm 6-T transistor technology.

Fig. 3. Two-level memory hierarchy, with separate instruction and data caches.

Cache configurations that are common in commercial applications were adopted, with the parameters varying according to Table 1. Adjusting the parameters of this cache hierarchy yields a set of 458 combinations, which constitute the exploration space of all possible configurations.

Table 1. Exploration space of the configurations.

Parameter       Level-1 cache      Level-2 cache
Cache size      2KB, 4KB, 8KB      15KB, 32KB, 64KB
Line size       8B, 16B, 32B       8B, 16B, 32B
Associativity   1, 2, 4            1, 2, 4

The PSO-based algorithm proposed in this work is responsible for finding the best cache configuration by adjusting each parameter, so as to reduce the cost of searching through the different configurations.

A set of four applications of the Mibench benchmark suite [12] was used to run the simulations. The applications used are Bitcount_small, Dijkstra_small, Susan_small and Patricia_small. Mibench is a free, commercially representative benchmark suite for embedded-system applications.

The energy consumption and the time needed to execute each application for a given configuration were obtained using the SimpleScalar [13] and eCACTI [14] tools.

6. RESULTS

For the experiments used in the simulations below, the number of particles was varied

between 5, 10, 20 and 40. The velocity weights (Ci) were set to {2, 2, 2, 0}. The particles were initialized randomly over the search space, and the adopted stopping criterion was the number of iterations. The fitness function adopted to evaluate the particles was the cost function given by the product of the energy consumption and the number of cycles of each candidate particle. The goal of the problem is to minimize this cost function, that is, to reach the Pareto optimum.

The analysis of the proposed model was divided into two stages: a performance analysis of TEMPSO and a comparative analysis with TEMGA. For the initial performance analysis of TEMPSO, the Bitcount_small application of the Mibench suite was used. In this stage the behavior of the algorithm was studied varying the number of particles over 5, 10, 20 and 40. For each TEMPSO configuration, 30 simulations were run in order to obtain the average of the results.

Table 4 presents the mean and the standard deviation of the Energy (Joules) × Cycles values of the best configurations found in the simulations with the different numbers of particles. It can be observed that the larger the number of particles, the better the solutions found.

Table 4. Results of varying the number of particles.

TEMPSO configuration   Joules x Cycles (T = 10)       Joules x Cycles (T = 5)
No. of particles       Mean         Std. dev.         Mean         Std. dev.
5                      1521.724     298.96            1612.922     427.0936
10                     1367.032     163.28            1380.402     160.4924
20                     1305.7163    28.3961           1328.27      38.1659
40                     1296.02887   15.367            1300.06      20.9466

Another study was carried out observing the convergence speed of the algorithm for each variation of the number of particles. Convergence was observed with respect to the fitness function, defined in the proposed algorithm as the product of the consumed energy and the number of cycles. The plot of this analysis is presented in Figure 4.

According to Figure 4, the variation with 40 particles converges faster. This happens because a larger number of particles searches for the best solution simultaneously, which increases the communication of the best solution across the whole swarm and therefore its convergence and exploration capability. The variation with 20 particles obtained results close to the one with 40, showing that the largest effect lies between the variation from 5 to 10 particles.

Additionally, a study was made of the number of simulations needed per iteration when different numbers of particles are used in the TEMPSO configuration. The number of simulations needed when 40 particles are used grows considerably, becoming infeasible in some cases. However, with 5 or 10 particles the number of simulations is acceptable.

Fig. 4. Convergence evolution analysis of the 4 variations.

When the configurations found by the TEMPSO algorithm with 5 particles are compared with the whole space of existing configurations, good results are obtained. Figure 5 shows the configurations obtained in relation to the search space explored by the exhaustive method, for the Dijkstra_small application.

Fig. 5. Configurations found in relation to the search space.

As can be observed in Figure 5, the configurations found are among those with the lowest energy consumption, and some of them are also among those with the lowest number of cycles. This demonstrates the efficiency of the proposed mechanism in finding the best configurations of the exploration space.

In the second stage of the analysis, a comparative study was carried out between TEMPSO and TEMGA, an algorithm based on genetic algorithms with good performance in the area of memory architecture exploration [8]. As observed in the analysis of the previous section, the configuration of the proposed algorithm that obtained the best results used 40 particles, but the number of simulations needed to obtain those results makes it infeasible for embedded-system applications.

However, it was also observed that with the 5-particle configuration the number of simulations needed drops considerably, being compatible with other techniques used in the area. Therefore, the comparison with TEMGA was performed using TEMPSO with 5 particles. A comparative study of the best energy and cycle values found by each technique was carried out. Table 5 shows this comparison, where the values 1, 2, 3 and 4 in the AP column represent the applications Bitcount_small, Dijkstra_small, Patricia_small and Susan_small, respectively.

Table 5. Comparative analysis of the cost function (energy x cycles) between TEMPSO and TEMGA.

AP   TEMPSO (Joules)   TEMPSO (Cycles)   TEMGA (Joules)   TEMGA (Cycles)   Optimum (Joules)   Optimum (Cycles)
1    2.925E-4          5.1664E+6         3.04E-4          5.166E+6         2.502E-4           5.163E+6
2    74.79E-4          22.625E+6         75.17E-4         22.58E+6         42.699E-4          21.287E+6
3    147E-4            43.973E+6         151.3E-4         43.17E+6         127.5E-4           42.844E+6
4    6.690E-4          5.2406E+6         6.82E-4          5.196E+6         6.367E-4           5.1745E+6

As shown in Table 5, the best configuration obtained by TEMPSO reached lower energy values than TEMGA in all the simulated applications. Regarding the number of cycles, the results were equivalent. In terms of optimal values, the solutions obtained were close to the optimal ones.

7. CONCLUSION

In this work a new model for cache memory architecture optimization, named TEMPSO, was proposed; a statistical analysis of its performance was carried out and a comparative study with another technique in the area, TEMGA, was performed.

Four applications of the Mibench suite were used to validate the performance of TEMPSO, considering a two-level cache memory architecture with separate data and instruction caches. In the analysis performed, the proposed algorithm obtained better results in terms of energy consumption in all the applications observed when compared with TEMGA. Regarding the number of simulations, the proposed algorithm also performed better: TEMPSO converged as fast as or faster than TEMGA, requiring fewer simulations to reach the best configuration found by the algorithm. As future work we intend to optimize TEMPSO and to extend the set of analyzed applications with other benchmarks.

8. REFERENCES

[1] M. Verma, and P. Marwedel, Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, Netherlands, 2007.

[2] M. Kandemir and A. Choudhary. Compiler-Directed Scratch Pad Memory Hierarchy Design and Management, In Proceedings of Design Automation Conference (DAC’02), New Orleans, USA, Jun. 2002.

[3] C. Zhang, F. Vahid, Cache configuration exploration on prototyping platforms. 14th IEEE Interational Workshop on Rapid System Prototyping (June 2003), vol 00, p.164.

[4] TRIOLI, M. F.(2004). Introdução à Estatística. São Paulo: LTC.

[5] Gordon-Ross, Ann, Vahid, F., Dutt, Nikil, Automatic Tuning of Two-Level Caches to Embedded Applications, DATE, pp.208-213 (Feb 2004).

[6] Gordon-Ross, Ann, Vahid, F., Dutt, Nikil, Fast Configurable-Cache Tuning with a Unified Second-Level Cache, ISLPED05, (Aug 2005).

[7] A.G. Silva-Filho, F.R. Cordeiro, R.E. Sant’Anna, and M.E. Lima, “Heuristic for Two-Level Cache Hierarchy Exploration Considering Energy Consumption and Performance”, In: (PATMOS’06), pp. 75-83, Sep 2006.

[8] A.G. Silva-Filho, C.J.A. Bastos-Filho, R.M.F. Lima, D.M.A Falcão, F.R. Cordeiro and M.P. Lima. “An Intelligent Mechanism to Explore a Two-Level Cache Hierarchy Considering Energy Consumption and Time Performance”, SBAC-PAD, pp 177-184, 2007.

[9] J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in Proc. of the IEEE Int. Conf. on Neural Networks. Piscataway, NJ: IEEE Service Center, 1995, pp. 1942–1948.

[10] D. Bratton and J. Kennedy, “Defining a standard for particle swarm optimization,” in Swarm Intelligence Symposium, 2007. SIS 2007. IEEE, Honolulu, HI, Apr. 2007, pp. 120–127.

[11] Mitchell, M.; “An Introduction to Genetic Algorithms”, MIT Press, 1998.

[12] Guthaus, M. R.; Ringenberg, J.S.; Ernst, D.; Austin, T.M.; Mudge, T.; Brown, R.B.; MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, pp.1-12, December 2001.

[13] Dutt, Nikil; Mamidipaka, Mahesh; "eCACTI: An Enhanced Power Estimation Model for On-chip Caches", TR04-28, Sep. 2004.

[14] Burger, D.; Austin, T.M.; “The SimpleScalar Tool Set, Version 2.0”; Computer Architecture News; Vol 25(3), pp.13-25; June 1997.


IP-CORE OF A RECONFIGURABLE CACHE MEMORY

Gazineu, G.M.; Silva-Filho, A.G.; Prado, R.G.; Carvalho, G.R.; Araújo, A.H.C.B.S. and Lima, M.E.

Centro de Informática (CIn) - Universidade Federal de Pernambuco (UFPE)

Av. Prof. Luiz Freire s/n – Cidade Universitária – Recife/PE - Brasil email: { gmg2, agsf, grc, ahcbsa, mel}@cin.ufpe.br

ABSTRACT

This work addresses the development of an IP-core of a reconfigurable cache memory. The architecture was developed so as to allow its extension and connection to soft-core processors. In this work, the memory architecture was partitioned so that only the cache size can be reconfigured, based on the number of lines of the memory. The IP-core was validated through simulation, and an area analysis of the memory architecture was performed in order to allow an early choice of the FPGA device as a function of the cache size. An equation was obtained as the result of this relation and validated with two case studies.

1. INTRODUCTION

Reconfigurable computing aims to narrow the gap between the hardware and software paradigms, giving computer designers a new development perspective [1]. Although reconfigurable computing is fairly recent and its concepts are not yet consolidated, this technology has enabled hardware-level implementations that keep the high performance of hardware while adding a flexibility that did not exist before.

Several FPGA technologies are available on the world market; however, few provide enough support for partial reconfigurability. Among the major manufacturers (Xilinx and Altera) [4][5], Xilinx FPGAs allow part of the reconfigurable logic to be reconfigured, thereby making it possible to implement applications such as a reconfigurable cache.

Tuning a memory hierarchy at run time can be very useful, since not every application is well suited to a given memory architecture and ASIC-based commercial solutions cannot be reconfigured at run time. Reconfigurable caches, on the other hand, allow memory architectures to be adjusted to meet design constraints such as performance, area and energy consumption.

The cache memory model implemented was based on studies of existing cache architectures [2]. The goal of this work was not to develop new mapping schemes or cache replacement algorithms; the idea was to code an existing cache model in the VHDL description language in order to provide a reconfigurable cache memory architecture.

This article is an initial work directed at obtaining a real RTL-level model that can be connected to a soft-core processor. Additionally, an evaluation of FPGA occupation as a function of the device is also carried out.

2. CACHE DESCRIPTION AND MODELING

A fully parameterizable cache is a rather complex design and may require a long implementation time. We initially focus this work on the evaluation of the cache size, leaving the parameterization of other parameters, such as line size and cache associativity, for future work. As the basis for our analyses, a basic cache memory structure was implemented according to Figure 1. At this stage of our studies we are concerned with analyzing the effects of varying the number of cache lines in terms of the area occupied in a reconfigurable device (FPGA), for several devices of the Xilinx Virtex-II family. Among the advantages of choosing this FPGA family are prices that are still competitive when compared with the Virtex-6 family, as well as the partial reconfiguration capability supported by Xilinx FPGAs, which is essential for designs involving reconfigurable caches. Additionally, it is important to make clear that the work developed here can also be applied to more recent FPGA families. Finding the right FPGA for a given SoC is often attractive, considering that costs vary significantly among the different devices of the same family. The work of Mamidipaka and Dutt [3] describes in detail the main components of a cache architecture. That work was fundamental to the understanding of this design, to clarify some ideas and to adjust the final description of the work before starting the implementation.

Fig. 1. Cache memory architecture developed.

The final design of the cache architecture was defined as follows (Figure 1): a decoder module, a bit comparator, two temporary storage buffers (one buffer to store data and another to store an address), a storage bank to hold data and instructions, another bank to store the line-identifier tags, and the brain of the cache, the controller. The decoder is responsible for receiving the address that comes from the processor and defining which cache line will be addressed in the cache write/read operation. When the processor wants to execute an operation on the cache, it sends an address that references a single cache line and also sends control signals to the cache controller. The comparator is a module that compares two words (bit strings) bit by bit to check whether their values are equal. Inside the cache, the comparator has the function of signaling to the controller whether the instruction (or data) requested by the processor is in the cache or not. The buffer plays an important role in the cache architecture. It was inserted to mitigate the traffic on the system bus caused by cache requests. The number of buffers varies depending on the implementation and on the cache update algorithm (we applied the write-back algorithm). The schematic drawing of the cache in Figure 1 contains only two buffers, one to store data and another for the address. When the processor performs a read on the cache, two operations are executed in parallel so that the read result is returned faster. One of the operations is to select the cache line (the information) that will be returned, and the other is to find out (by comparing tags) whether this selected line is the one expected by the processor. The idea behind this approach is very simple: the selected data is stored in the buffer regardless of the result of the comparison (in the comparator). If there is a hit, that is, the selected data is the one expected by the processor, the controller releases the information stored in the data buffer; if the result of the comparison is negative (cache miss), the controller simply discards the information stored in the buffer and proceeds to update the cache line. In the coded architecture, two storage banks were used. One of the arrays was implemented to hold the data and instructions needed for processor execution, and the other to contain only the line-identifier tags. One of the most important components of the cache architecture is undoubtedly the controller, which manages all communication among the other modules, receiving signals, handling them and deciding what to do. At each new machine cycle, the cache controller adopts a different behavior, directing the flow inside the cache architecture through control signals.
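For orientation only, a minimal VHDL sketch of the interface implied by this description is given below. The entity name, bus widths and most port names are illustrative assumptions; only rws, reset, ready_mem and ready_proc are signal names taken from the paper.

-- Hypothetical top-level interface of the cache IP-core described above.
library ieee;
use ieee.std_logic_1164.all;

entity cache_ip is
  generic (
    ADDR_WIDTH : natural := 32;   -- processor address width
    WORD_WIDTH : natural := 32;   -- word size fixed at 32 bits (see Section 4)
    INDEX_BITS : natural := 9     -- log2(number of lines); the reconfigurable parameter
  );
  port (
    clk        : in  std_logic;
    reset      : in  std_logic;   -- clears the memories (data, tags) and the buffers
    enable     : in  std_logic;   -- processor requests an operation
    rws        : in  std_logic;   -- '1' = write, '0' = read
    addr_proc  : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    wdata_proc : in  std_logic_vector(WORD_WIDTH-1 downto 0);
    rdata_proc : out std_logic_vector(WORD_WIDTH-1 downto 0);
    ready_proc : out std_logic;   -- end of operation, signaled to the processor
    addr_mem   : out std_logic_vector(ADDR_WIDTH-1 downto 0);
    wdata_mem  : out std_logic_vector(WORD_WIDTH-1 downto 0);
    rdata_mem  : in  std_logic_vector(WORD_WIDTH-1 downto 0);
    mem_enable : out std_logic;   -- enables main memory
    mem_write  : out std_logic;   -- signals a write to main memory
    ready_mem  : in  std_logic    -- main memory handshake
  );
end entity cache_ip;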

3. CACHE STATES

The cache controller developed uses a single state machine to model the four possible information flows inside the cache: a read hit, a read miss, a write hit and a write miss. Figure 2 shows the diagram of the state machine used. The 'reset' control signal initializes all components of the system: when set to the high logic level (one), the controller goes to the initial state and the memories (data and tags) and the buffers (data and address) are cleared; when it is at the low logic level (zero), the system remains unchanged.

Fig. 2. State machine diagram of the controller.

In the first stage of the state machine, the variables and control signals are initialized and the controller waits for the control signal that enables the operation. This signal indicates whether the processor wants to execute a write or a read on the cache. If the control signal 'rws' is at the high logic level, the controller executes a write operation; otherwise (low logic level) it performs a read operation. Figure 3 is a simplified diagram of the state machine for the read operation. In the initial state the controller decides whether it will execute a write or a read, based on a control signal called 'rws' (read-write signal).

Fig. 3. State machine of the read operation.

The controller's next step is to find out whether the desired information is in the cache or not. The 'compare' state of the read diagram is responsible for the tag comparison; in this stage the controller waits for the result of the comparator module. If the data is in the cache (hit), the state machine jumps to the next step, which simply releases the data to the processor. If there is a read miss, the controller needs to update the cache line with a new block from main memory before releasing the information to the processor. In the cache-update state (see Figure 3), the controller sends the processor address to main memory, requesting the new block to be swapped into the cache. In this state the controller waits for main memory to provide the new block and signal (ready_mem = '1') that the new data is available for the update; as long as this does not happen, the controller remains in the same state until the new input is received. The controller's last state returns the information requested by the processor. In this stage the controller simply releases the information stored in the data buffer to the processor and signals the end of the operation (ready_proc = '1').

Fig. 4. State machine of the write operation.

The state diagram of the write operation has three states in addition to the initial state described earlier: cache update, memory update and finalization (see Figure 4). In the cache-update state, the controller overwrites the line referenced by the address (decoded in the initial state) with the data sent by the processor. When the whole write operation on the cache is finished, the controller is in charge of updating main memory, which is the next state (memory update) after the cache update. In the main-memory-update state, the controller sends the address to the memory (the address previously stored in the buffer) and sends the control signals that enable main memory and signal the write. In this state the controller waits for a response from main memory signaling that the write was performed successfully (ready_mem = '1'); if this does not happen, the controller remains in this state during the next cycle. In the finalization state, the controller clears all internal signals of the cache, signals (ready_proc = '1') the end of the write operation to the processor and returns to the initial state of the state machine. Thus, whenever a data item is written into the cache, it is shortly afterwards updated in main memory.
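The sketch below shows, for illustration only, how such a single state machine could be organized in VHDL around the handshake signals named above (reset, rws, ready_mem, ready_proc). The state names, the added enable and hit inputs and the simplified transitions are assumptions and do not reproduce the authors' controller; datapath control signals are omitted.

-- Illustrative controller FSM skeleton (not the authors' code).
library ieee;
use ieee.std_logic_1164.all;

entity cache_ctrl_sketch is
  port (clk, reset, enable, rws, hit, ready_mem : in std_logic;
        ready_proc : out std_logic);
end entity;

architecture rtl of cache_ctrl_sketch is
  type state_t is (initial, compare, update_cache, update_memory, finish);
  signal state : state_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= initial;                                 -- clear the machine
      else
        case state is
          when initial =>                                 -- wait for an operation request
            if enable = '1' then
              if rws = '1' then state <= update_cache;    -- write: overwrite the addressed line
              else              state <= compare;         -- read: check the tag first
              end if;
            end if;
          when compare =>                                 -- read: hit releases data, miss fetches a block
            if hit = '1' then state <= finish;
            else              state <= update_cache;
            end if;
          when update_cache =>
            if rws = '1' then
              state <= update_memory;                     -- write: propagate to main memory
            elsif ready_mem = '1' then
              state <= finish;                            -- read miss: block received
            end if;
          when update_memory =>                           -- wait for main memory to acknowledge the write
            if ready_mem = '1' then state <= finish; end if;
          when finish =>                                  -- signal completion, then restart
            state <= initial;
        end case;
      end if;
    end if;
  end process;

  ready_proc <= '1' when state = finish else '0';
end architecture;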


4. COMPONENTS INSTANTIATED IN THE FPGA

This work develops a basic cache memory structure with direct-mapped organization, write-through write policy and word size fixed at 32 bits. The proposed approach divides the FPGA area into two parts: (i) one that will be dynamically reconfigurable, composed of the Data and Tag arrays, and (ii) a fixed part composed of the controller, comparator, decoder and buffers.

The elements that belong to the reconfigurable set are the components that vary considerably in their physical structure. Figure 7 shows an overview of the cache implementation inside an FPGA, indicating the fixed and reconfigurable parts. On the left side are the data and tag arrays, which grow in size as the number of cache lines varies and are what increases the occupied region inside the FPGA. On the right side of the figure are the components with a fixed implementation. Among the components of the fixed part there are actually two that undergo small changes in the code: the decoder and the comparator.

The decoder always receives the same 32-bit address, but the number of bits (module outputs) assigned to address the line and the tag (for comparison) differs from implementation to implementation. The comparator module, in turn, always has the same output, a single bit that signals the result of the comparison, but its inputs differ between designs, since the sizes of the words (tags) being compared depend on which cache organization is used. The variations in these components are so small that they do not change the number of logic blocks that implement them in the FPGA.
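A minimal sketch of how this width parameterization might look in VHDL is given below; the entity, generic names and the simple address split (no block offset) are illustrative assumptions, not the actual code of the IP-core.

-- Illustrative address split and tag comparison with configurable widths.
library ieee;
use ieee.std_logic_1164.all;

entity addr_split_compare is
  generic (
    ADDR_WIDTH : natural := 32;  -- full processor address
    INDEX_BITS : natural := 9    -- selects one of 2**INDEX_BITS cache lines
  );
  port (
    address    : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    stored_tag : in  std_logic_vector(ADDR_WIDTH-INDEX_BITS-1 downto 0); -- tag read from the tag array
    line_index : out std_logic_vector(INDEX_BITS-1 downto 0);            -- drives the line decoder
    hit        : out std_logic                                           -- '1' when the tags match
  );
end entity;

architecture rtl of addr_split_compare is
begin
  -- The low-order field addresses the line; the remaining bits form the tag,
  -- which is compared bit by bit against the stored tag.
  line_index <= address(INDEX_BITS-1 downto 0);
  hit        <= '1' when address(ADDR_WIDTH-1 downto INDEX_BITS) = stored_tag else '0';
end architecture;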

Fig. 7. Elements of the cache architecture.

5. AREA AND DEVICE ANALYSIS

At this stage of the study, the concern was to analyze the effects of varying the number of cache lines in terms of the area occupied in a reconfigurable device (FPGA), for several devices of the Xilinx Virtex-II family. Five cache implementations were generated, varying only the storage arrays and keeping the other components fixed. After collecting the space variation inside the FPGA for each of the five cache sizes, a table was built gathering all the relevant information. Table 1 shows the relation between the cache size and the number of lookup tables (LUTs) needed to implement it, taking into account the final cache architecture with its components, the routing matrices and the interfacing circuits. The table also shows the occupied area (in percentage values) for each of the cache implementations. It can be observed that the 128 and 256 kbyte caches exceeded 100% of the FPGA (overmap) on some devices of the Virtex-II family (the entries above 100% in the table), meaning those devices are not large enough to accommodate such implementations.

Table 1. Number of LUTs per cache size and resulting occupation of Virtex-II family FPGAs.

CACHE  | LUTs  | xc2v500 | xc2v1000 | xc2v1500 | xc2v2000 | xc2v3000 | xc2v4000 | xc2v6000
16 kb  | 741   | 12%     | 7%       | 5%       | 3%       | 2%       | 1%       | 1%
32 kb  | 1545  | 25%     | 15%      | 10%      | 7%       | 5%       | 3%       | 2%
64 kb  | 3084  | 50%     | 30%      | 21%      | 14%      | 10%      | 6%       | 4%
128 kb | 6184  | 101%    | 60%      | 40%      | 28%      | 21%      | 13%      | 9%
256 kb | 12404 | 201%    | 121%     | 81%      | 57%      | 43%      | 26%      | 18%

The xc2v4000 and xc2v6000 devices have a very large number of logic blocks, so all the different cache implementations occupied less than 30% of these devices. If the scope of this work included an architecture with other components, for example a microprocessor, main memory and displays, the best (ideal) devices of the Virtex-II family would certainly be the xc2v1500, xc2v2000 and xc2v3000.

From Table 1 it was possible to represent graphically the relation between the cache size and the number of LUTs used in each implementation. The plot is simple because, for a given cache size, the number of LUTs needed to implement it is the same regardless of the device within the same FPGA family. This plot is important for visualizing the points that will be used to find the equation that determines the ideal FPGA for a hardware design. Figure 8 shows, on a real scale, the linear variation of the number of LUTs for cache sizes of 16k, 32k, 64k, 128k and 256 kbytes (this plot was drawn using the values in the first two columns of Table 1).

Fig. 8. Cache size versus LUTs.

In addition to this plot, another piece of information is needed to decide which FPGA device should be used for a given cache configuration: the maximum number of LUTs available in each device (Table 2). It is from this number of LUTs that the ideal device for the implementation is determined, with the aid of Table 2.

Table 2. Number of LUTs per device of the Xilinx Virtex-II family.

Device   | LUTs
xc2v80   | 1024
xc2v250  | 3072
xc2v500  | 6144
xc2v1000 | 10240
xc2v1500 | 15360
xc2v2000 | 21504
xc2v3000 | 28672
xc2v4000 | 46080
xc2v6000 | 67584
xc2v8000 | 93184

In order to obtain an equation representing the FPGA area in terms of LUTs, a linear regression was performed over the points in Figure 8, resulting in the following linear equation:

QLUT = 48.59 * SIZE - 36.53 (1)

where SIZE is the cache size in kbytes and QLUT is the number of LUTs for a given cache size.

6. EXAMPLES

In a real implementation, other components must be considered, such as the processor and main memory. Thus, to illustrate the proposed approach, consider a system composed of a Leon2 processor, cache memory and main memory.

In this approach, we consider that the FPGA is divided into two parts: a fixed (non-reconfigurable) part and a reconfigurable part, as illustrated in Figure 9. The fixed part is composed of some fixed components of the cache memory (controller, decoder, comparator and buffers), main memory, processor and the ICAP module, totaling approximately 5000 LUTs. The reconfigurable part contains only the data and tag arrays of the cache, which vary according to its configuration. We consider the partial reconfiguration method using the ICAP (Internal Configuration Access Port), which can be instantiated and is available as an internal logic resource of the FPGA. The main advantage of using the ICAP in applications involving reconfigurable caches is the FPGA's self-reconfiguration capability, allowing, for example, the application itself to switch to a new cache configuration.
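As an aside, a self-reconfiguration setup of this kind typically instantiates the ICAP primitive directly in the design. The sketch below assumes the 8-bit ICAP_VIRTEX2 primitive of the Xilinx unisim library (ports CLK, CE, WRITE, I, O and BUSY, with CE and WRITE taken here as active low); it is not part of the authors' implementation, and the wrapper entity and signal names are hypothetical.

-- Hypothetical wrapper around the Virtex-II ICAP primitive used for self-reconfiguration.
library ieee;
use ieee.std_logic_1164.all;
library unisim;
use unisim.vcomponents.all;

entity icap_wrapper_sketch is
  port (
    clk        : in  std_logic;
    cfg_enable : in  std_logic;                     -- assert to push configuration data
    cfg_data   : in  std_logic_vector(7 downto 0);  -- one byte of the partial bitstream
    cfg_busy   : out std_logic
  );
end entity;

architecture rtl of icap_wrapper_sketch is
  signal icap_out : std_logic_vector(7 downto 0);
  signal ce_n     : std_logic;
begin
  ce_n <= not cfg_enable;        -- clock enable assumed active low

  u_icap : ICAP_VIRTEX2
    port map (
      O     => icap_out,         -- read-back data (unused in this sketch)
      BUSY  => cfg_busy,         -- high while the port cannot accept data
      CE    => ce_n,
      CLK   => clk,
      I     => cfg_data,         -- configuration byte written into the device
      WRITE => '0'               -- keep the port in write mode (assumed active low)
    );
end architecture;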

Fig. 9. Elements of the cache architecture (fixed and reconfigurable parts).

Example 1: Consider a 64-kbyte cache. Which FPGA should be used in this design?

QLUT = 48.59 * (64) – 36.53

QLUT = 3073.23 LUTs


In this case, we obtain a total of 3073.23 LUTs. Adding the 5000 LUTs of the fixed part results in a total of 8073.23 LUTs. With the aid of Table 2, we verify that the xc2v1000 device meets the conditions of this application. Devices with maximum LUT counts above the chosen one would likely imply a relevant increase in design cost. Considering that few modifications are needed when a cache memory hierarchy is considered, this approach can be extended. With that in mind, consider the following example.

Example 2: We have a cache hierarchy composed of a 32-kbyte instruction cache (CI) and a 128-kbyte data cache (CD). Which FPGA should be used in this design?

QLUT(CI) = 48.59 * (32) – 36.53 = 1518.35

QLUT(CD) = 48.59 * (128) – 36.53 = 6182.99

TOTAL = 5000 LUTs (fixed part) + 1518.35 + 6182.99

TOTAL = 12701.34 LUTs

Device = xc2v1500

In summary, equation (1) helps the designer decide, before the implementation stage, which FPGA device of the Virtex-II family should be used in the development of a design containing a cache memory hierarchy and a processor, based on the cache size in kbytes.

7. CONCLUSION

A cache memory was implemented in VHDL and its operation validated through simulation. A new equation was presented that allows the area occupied by a cache memory to be estimated in terms of LUTs. This work demonstrates that the Xilinx Virtex-II family can satisfactorily accommodate, in terms of remaining available space, real cache systems. Considering a SoC composed of a processor and a memory hierarchy, and knowing the area occupied by the processor, it is possible with the proposed approach to choose the Virtex-II device suited to a given application. Other relations can be obtained for other FPGA families that support partial reconfiguration.

8. REFERENCES

[1] MARTINS, C.A.P.S., ORDONEZ, E.D.M. and CARVALHO, M.B., Computação Reconfigurável: Conceitos, Tendências e Aplicações. Available at: http://ftp.inf.pucpcaldas.br/CDs/SBC2003/pdf/arq0251.pdf ; accessed: 25/11/2009.

[2] STALLINGS, W., Arquitetura e Organização de Computadores: Projeto para o Desempenho, 5th ed., translated by Carlos Camarão de Figueiredo and revised by Edson Toshimi Midorikawa. São Paulo: Prentice Hall, 2002.

[3] MAMIDIPAKA, M. and DUTT, N., eCACTI: An Enhanced Power Estimation Model for On-chip Caches. Available at: <http://www.cecs.uci.edu/technical_report/TR04-28.pdf>. Accessed: February 15, 2005.

[4] Xilinx: The Programmable Logic Company. Available at: <http://www.xilinx.com/ise/logic_design_prod/foundation.htm>. Accessed: March 12, 2005.

[5] Altera FPGA. Available at: www.altera.com


A NOTE ON MODELING PULSED SEQUENTIAL CIRCUITS WITH VHDL

Alberto C. Mesquita Júnior*

Departamento de Eletrônica e Sistemas, Universidade Federal de Pernambuco R. Acad. Hélio Ramos, s/n, 50740-530 Recife-PE-Brazil

e-mail: [email protected]

ABSTRACT

This paper discusses how to use VHDL to describe pulsed sequential circuits. It emphasizes the difficulty of elaborating such descriptions, since VHDL does not seem to be well provided with attributes or resources for detecting the occurrence of pulses. Examples are presented.

1. INTRODUCTION

In a level synchronous sequential circuit, the state changes on the rising or falling edge transition of the clock signal, and the next state depends on the logic levels present at the inputs and on the past logic levels of inputs and outputs. In a pulsed synchronous sequential circuit, the state changes only after the occurrence of a pulse (positive or negative) at one of the inputs, and the next state depends on the past combinations of such pulses. This paper is divided into three sections; this introduction is the first. Next, the hardware model is explained and examples are synthesized and discussed. Finally, the conclusions are presented.

2. THE HARDWARE MODEL

The hardware model of a pulsed synchronous sequential circuit is shown in figure 1. A pulse signal is characterized by the successive occurrence of two opposite transitions. If a signal is at rest at the low logic level, a positive pulse occurs after a rising transition followed by a falling transition on this signal. The schematic and the VHDL description of an S-C pulsed master-slave flip-flop are shown, respectively, in figures 2 and 3 as an example of a pulsed memory cell. Note that simultaneous or overlapping pulses on S and C are forbidden. With the behavioral VHDL description of this type of memory cell, one is able to elaborate structural VHDL descriptions of general pulsed sequential circuits. For the purpose of this paper, only behavioral descriptions are developed, and one has to develop models for the next-state logic of this kind of sequential machine. The VHDL codes shown were compiled and simulated using Quartus II 9.0sp1 Web Edition with the standard options. For all models and examples, three devices were used for the evaluation; that is, each VHDL description of the hardware models and examples presented in this paper was compiled and simulated for three cases: the first case used the EPM7032SLC44-5 MAX 7000S device; the second, the EP1S10F484C5 Stratix; and the third, the EP2S15F484C3 Stratix II.

Fig.2. Schematic of the S-C Pulsed Master-Slave Flip-Flop.


Fig.1. Pulsed synchronous sequential circuit model.

*The author would like to express his gratitude to Dr. Edval J. P. Santos for reading this paper and making suggestions for improvements.


2.1. A simple model for the next state logic

In figure 4, a simple model for the next-state logic is presented. It is supposed that timing restrictions are satisfied. As an example, the description of a pulsed binary up/down counter is analyzed. This counter has two pulsed input lines, Xup and Xdw. When a pulse occurs on Xup, the counter's content is incremented, and if a pulse occurs on Xdw, it is decremented. Only the third case produced a good simulation result. The VHDL code and the simulation timing diagrams follow in figures 5, 6, 7 and 8, respectively.

2.2. A second model

This second model works in cases where the first model, in subsection 2.1, failed to work. This new hardware model is presented in figure 9. The same example was simulated using this model. The new VHDL code is almost the same as the last one, except for the three lines that begin at the line labeled "next_state_logic". These lines were rewritten as:

next_state_logic: dcount<=sumup when sel="10" else sumdw when sel="01" else null;

The new VHDL code was compiled and simulated considering the three cases above, and all simulations presented good results. The simulation timing diagrams are shown below only for the two cases that had not worked in subsection 2.1. See figures 10 and 11.

-- S-C Pulsed Master-Slave Flip-Flop
ENTITY scpmsff IS
  PORT (S, C  : IN BIT;
        Q, NQ : OUT BIT);
END scpmsff;

ARCHITECTURE behaviour OF scpmsff IS
  signal qm, qmb : BIT;
BEGIN
  masterslave: PROCESS (S, C)
  BEGIN
    IF (S = '1') THEN
      qm  <= '1';
      qmb <= '0';
    ELSIF C = '1' THEN
      qm  <= '0';
      qmb <= '1';
    ELSE
      Q  <= qm;
      NQ <= qmb;
    END IF;
  END PROCESS masterslave;
END behaviour;

Fig.3. VHDL description of an S-C Master-Slave F-Flop.
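A small testbench such as the sketch below can be used to exercise the description in Fig. 3 with non-overlapping pulses; it is not part of the original paper, and the pulse widths and spacing are arbitrary assumptions.

-- Illustrative testbench for the S-C pulsed master-slave flip-flop of Fig. 3.
ENTITY tb_scpmsff IS
END tb_scpmsff;

ARCHITECTURE sim OF tb_scpmsff IS
  COMPONENT scpmsff
    PORT (S, C : IN BIT; Q, NQ : OUT BIT);
  END COMPONENT;
  SIGNAL S, C, Q, NQ : BIT := '0';
BEGIN
  dut: scpmsff PORT MAP (S => S, C => C, Q => Q, NQ => NQ);

  stimulus: PROCESS
  BEGIN
    WAIT FOR 20 ns;
    S <= '1'; WAIT FOR 10 ns; S <= '0';  -- set pulse: Q should go to '1' after the falling edge
    WAIT FOR 40 ns;
    C <= '1'; WAIT FOR 10 ns; C <= '0';  -- clear pulse: Q should return to '0'
    WAIT;
  END PROCESS;
END sim;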

Fig.4. A simple next-state hardware structure.


library ieee;
use ieee.std_logic_1164.all;

ENTITY dpulsedcounterupdw IS
  Generic (max : integer := 15);
  PORT (Xup, Xdw : in std_logic;
        counting : out integer range 0 to max);
END dpulsedcounterupdw;

ARCHITECTURE behavior OF dpulsedcounterupdw IS
  signal dcount, countsl : integer range 0 to max;
  signal sumup, sumdw    : integer range 0 to max;
  signal sel             : std_logic_vector (1 to 2);
  signal ck              : std_logic;
BEGIN
  ck  <= Xup OR Xdw;
  sel <= Xup & Xdw;
  counting <= countsl;
  sumup <= 0   when countsl = max else countsl + 1;
  sumdw <= max when countsl = 0   else countsl - 1;

  next_state_logic: dcount <= sumup when sel = "10" else
                              sumdw when sel = "01" else
                              countsl;

  reg: PROCESS (ck)
  BEGIN
    IF falling_edge(ck) THEN
      countsl <= dcount;
    END IF;
  END PROCESS reg;
END behavior;

Fig.5. Pulsed up/down counter VHDL code.

Fig.6. Simulation result: pulsed up/down counter timing diagram – 1st case: EPM7032SLC44-5 Max7000S.


2.3. A third example

This third example is more general: a pulsed sequential machine with three pulsed input lines, Reset, X1 and X2, and two Mealy output lines, Z1 and Z2. The output Z1 will be equal to X1 (Z1=X1) whenever an odd number of pulses on X2 occurs between two consecutive pulses of X1. The output Z2 will be equal to X2 (Z2=X2) whenever an even number of pulses on X1 occurs between two consecutive pulses of X2. Overlapping sequences are considered. When a pulse occurs on the Reset line, the initial state is restored. The VHDL description is shown in figure 12. The simulation results are presented in figures 13, 14 and 15. In the second case (EP1S10F484C5 Stratix), although some messages about timing-restriction violations were issued, such as "not operational: Clock Skew > Data Delay" and "Warning: Circuit may not operate. Detected 1 non-operational path(s) clocked by clock "X1" with clock skew larger than data delay", the simulation presents a good result. In the third case (EP2S15F484C3 Stratix II), the same messages are displayed by Quartus II and, as can be seen in figure 15, a spurious pulse is produced. An alternative way to avoid spurious pulses, and a better synthesis approach, is to rewrite the next-state logic using concurrent statements. The new VHDL code is equal to the last code, except for the description of the next-state logic. This VHDL code produces a good synthesis without the timing-violation messages in all cases. Figure 16 presents this part of the new code and figure 17 the simulation results for the third case (EP2S15F484C3 Stratix II).

3. CONCLUSION

The synthesis of pulsed sequential circuits is similar to that of classical clocked sequential circuits, but VHDL and the synthesis tools do not provide resources to generate hardware with pulsed master-slave flip-flops. The designer has to create hardware artifices within the context of the available tools. In the models given in this paper, the input signals are used at the same time to calculate the next state and to store it in the memory cells, so the correct operation of the synthesized circuits is strongly dependent on the propagation time associated with the next-state logic and on the setup and hold times of the memories. Since Quartus II does not allow control of the timing characteristics of the devices, as shown in this paper, the only possibility is to carefully select the devices and the VHDL description style.

Fig.8. Simulation result: pulsed up/down counter Timing diagram – 3rd case: EP2S15F484C3 StratixII.

Fig.7. Simulation result: pulsed up/down counter Timing diagram – 2nd case: EP1S10F484C5 Stratix.

Fig.9. Next state hardware structure with a latch.


Fig.10. Simulation result timing diagram – first case with a master latch: EPM7032SLC44-5 Max7000S.

Fig.11. Simulation result timing diagram – second case with a master latch: EP1S10F484C5 Stratix.


4. REFERENCES

[1] Fredrick J. Hill and Gerard R. Peterson, “Introduction to Switching Theory and Logical Design”, Chapter 11: Pulse – Mode Circuits, Wiley 1974.

[2] Victor P. Nelson, H. Troy Nagle, Bill D. Carroll and J. David Irwin, “Digital Logic Circuit Analysis & Design”, Chapter 10: Asynchronous sequential Circuits, Prentice Hall 1995.

[3] Roberto d’Amore, “VHDL: Descrição e Síntese de Circuitos Digitais”, LTC, 2005.

[4] Volnei A. Pedroni, “Circuit Design with VHDL”, MIT Press 2004.

[5] Altera Publishing, “Quartus® II Introduction for VHDL Users”, PDF Tutorial – Quartus II 9.0 web edition software 2007.

Library ieee;
use ieee.std_logic_1164.all;

Entity dpulse_machinewcntZ1Z2 is
  Generic (inp: integer := 3; estX1: integer := 3; estX2: integer := 2);
  Port (X2, X1, Reset : in std_logic;
        Z2, Z1        : out std_logic);
end entity dpulse_machinewcntZ1Z2;

Architecture func of dpulse_machinewcntZ1Z2 is
  signal stX1m, stX1e : integer range 0 to estX1;
  signal stX2m, stX2e : integer range 0 to estX2;
  signal ck : std_logic;
Begin
  ck <= X2 or X1 or Reset;
  Z2 <= X2 when stX1e = 3 else '0';  -- counting X1
  Z1 <= X1 when stX2e = 2 else '0';  -- counting X2

  countingX1: Process (Reset, X2, X1)
  Begin
    If Reset = '1' then
      stX1m <= 0;
    elsif X2 = '1' then
      stX1m <= 1;
    elsif X1 = '1' then
      case stX1e is
        when 0 => stX1m <= 0;
        when 1 => stX1m <= 2;
        when 2 => stX1m <= 3;
        when 3 => stX1m <= 2;
      end case;
    end if;
  end process countingX1;

  countingX2: Process (Reset, X2, X1)
  Begin
    If Reset = '1' then
      stX2m <= 0;
    elsif X1 = '1' then
      stX2m <= 1;
    elsif X2 = '1' then
      case stX2e is
        when 0 => stX2m <= 0;
        when 1 => stX2m <= 2;
        when 2 => stX2m <= 1;
      end case;
    end if;
  end process countingX2;

  update: Process (ck)  -- registers
  Begin
    If falling_edge(ck) then
      stX2e <= stX2m;
      stX1e <= stX1m;
    end if;
  end process update;
end func;

Fig.12. VHDL description.

Fig.13. Simulation result - Timing diagram – 1st case EPM7032SLC44-5 Max7000S.

Fig.14. Simulation result - Timing diagram – 2nd case: EP1S10F484C5 Stratix.

Fig.15. Simulation result - Timing diagram – 3rd case: EP2S15F484C3 Stratix II.

-- next state logic
stX1m <= 0 when reset='1' else
         1 when X2='1' else
         0 when (X1='1' and stX1e=0) else
         2 when (X1='1' and stX1e=1) else
         3 when (X1='1' and stX1e=2) else
         2 when (X1='1' and stX1e=3) else
         null;

stX2m <= 0 when reset='1' else
         1 when X1='1' else
         0 when (X2='1' and stX2e=0) else
         2 when (X2='1' and stX2e=1) else
         1 when (X2='1' and stX2e=2) else
         null;

Fig.16. New code for the next-state logic.

Fig.17. Timing diagram – 3rd case, next-state logic with concurrent statements: EP2S15F484C3 Stratix II.


COMPARATIVE STUDY BETWEEN THE IMPLEMENTATIONS OF DIGITAL WAVEFORMS FREE OF THIRD HARMONIC ON FPGA AND MICROCONTROLLER

Diogo R. R. Freitas, Member IEEE, Edval J. P. Santos, Senior Member IEEE

Laboratório de Dispositivos e Nanoestruturas, Departamento de Eletrônica e Sistemas Universidade Federal de Pernambuco

Av. Acadêmico Hélio Ramos, s/n, Cidade Universitária – Recife – PE – Brasil 50.740-530 email: [email protected], [email protected]

ABSTRACT

Third harmonic measurements are used to determine the linearity of passive components, such as resistors, capacitors and inductors, as recommended by the IEC/TR 60440 standard. Signal generators with very low third harmonic content have to be developed for such an application. Although a high-purity analog sine wave generator is the natural option, it has been demonstrated that one can generate digital waveforms free of third harmonic. This paper presents a comparison between free-of-third-harmonic digital waveform generators built using an FPGA (Field Programmable Gate Array) and a microcontroller.

1. INTRODUCTION

During the fabrication process of passive components, such as resistors, capacitors and inductors, it is required to assess the linearity of the fabricated component to determine whether it has passed the quality test, as recommended by IEC/TR 60440. The CLT10/CLT20 by Danbridge A/S is an instrument for linearity testing. This instrument generates a pure sine waveform of 10 kHz and measures the third harmonic level at 30 kHz [1]. In a linear resistor the relationship between voltage and current is constant and its value is equal to the resistance. For a nonlinear resistor, the relationship between voltage and current is a nonlinear function $i = f(v)$. The transfer function shown in Fig. 1 can be defined as presented in Equation (1) [2].

$V_O + v_o = f(V_I + v_i)$    (1)

where $V_O$ and $V_I$ are the DC components and $v_o$ and $v_i$ the AC components of the output and input voltages. Expanding the output voltage in a power series, one obtains Equation (2).

$V_O + v_o = c_0 + c_1 v_i + c_2 v_i^2 + c_3 v_i^3 + \cdots$    (2)

Fig. 1. (a) Reference points for plotting the transfer function of resistors. (b) Example of a nonlinear transfer function.

Using that $c_0 = V_O$, this equation can be simplified as in Equation (3).

$v_o = c_1 v_i + c_2 v_i^2 + c_3 v_i^3 + \cdots$    (3)

The term $c_1$ in Equation (3) is the linear gain. If the components $c_n$ for $n \geq 2$ are not zero, the circuit generates harmonics at the output voltage. In real circuits these components are rarely zero. Assuming that the input voltage is a cosine, $v_i = V_1 \cos(\omega_1 t)$, $v_o$ is an even function of time. Therefore the Fourier series sine coefficients are all null.

$v_o = c_1 V_1 \cos(\omega_1 t) + c_2 V_1^2 \cos^2(\omega_1 t) + \cdots$    (4)

Fig. 2. Block diagram of the waveform generator circuit.

The cosine coefficients $a_n$ of the Fourier series are related to the terms $c_n$ of the power series as follows:

$a_0 = \frac{c_2 V_1^2}{2} + \frac{3 c_4 V_1^4}{8} + \cdots$    (5)

$a_1 = c_1 V_1 + \frac{3 c_3 V_1^3}{4} + \cdots$    (6)

$a_2 = \frac{c_2 V_1^2}{2} + \frac{c_4 V_1^4}{2} + \cdots$    (7)

$a_3 = \frac{c_3 V_1^3}{4} + \frac{5 c_5 V_1^5}{16} + \cdots$    (8)
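These relations follow from substituting Equation (4) term by term and applying the standard power-reduction identities for cosines, added here only as a reading aid:

$\cos^2\theta = \frac{1}{2}\left(1 + \cos 2\theta\right)$, $\cos^3\theta = \frac{1}{4}\left(3\cos\theta + \cos 3\theta\right)$, $\cos^4\theta = \frac{1}{8}\left(3 + 4\cos 2\theta + \cos 4\theta\right)$, $\cos^5\theta = \frac{1}{16}\left(10\cos\theta + 5\cos 3\theta + \cos 5\theta\right)$.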

Considering Fig. 1, if two reference points are marked at the terminals of a generic resistor, say point 1 and point 2 (see Fig. 1(a)), one can plot the transfer curve for this resistor. When current flows through the resistor from point 1 to point 2, the current is taken as positive and the terminal 1 voltage is higher than the terminal 2 voltage. When current flows in the opposite sense, the voltage reverses its polarity. One can see from this analysis that the voltage versus current characteristic of a resistor should be an odd function, $v(i) = -v(-i)$. Real resistors are nonlinear, as shown in Fig. 1(b). One can inject a voltage and measure the current. If a periodic current waveform free of third harmonic is injected, odd harmonics arise due to the nonlinearity of the resistance. While the pure sine wave has only the fundamental frequency, with higher-order harmonics equal to zero, there exist special waveforms free of third harmonic but not necessarily free of higher-order harmonics [3], [4]. The objective is to generate the proposed waveforms, called Type I and Type II, using digital circuits, and to measure their frequency spectrum for comparison with the theoretical results. As calculated by Santos and Barybin [4], the Type I waveform is expected to have a 2.23% third harmonic level when the slew rate is 10 V/µs. For the Type II waveform, the third harmonic content is expected to be zero in all evaluated cases. This paper is divided into five sections; this introduction is the first. Next, materials and methods are discussed. Third, the generated waveforms are evaluated. Fourth, the discussion, and finally the conclusions.

Fig. 3. (above) Type I waveform generated by the FPGA. (below) Frequency spectrum.

2. MATERIALS AND METHODS

To build the waveform generator, two different programmable devices are used: an FPGA and a microcontroller. The first generator was based on the FPGA. In Fig. 2, the block diagram of the proposed circuit is presented. The waveform generator is described in VHDL. For the implementation of this first generator, the development board "UP2 Education Kit" from Altera [5] was used. This board has a MAX7000 device, model EPM7128S. The software used for circuit description in VHDL was Quartus II 9.0 from Altera. ModelSim was used to simulate the VHDL code. The harmonic analysis using the Fast Fourier Transform (FFT) was performed with the Agilent oscilloscope model DSO3062A. The VHDL code has the function of generating the digital words which are responsible for producing the desired waveform. These words are sent to a digital-to-analog converter (DAC) external to the FPGA, via SPI. The generator used a 12-bit DAC with a reference voltage of 5 volts, giving $2^{12} = 4096$ voltage levels, that is, $5\text{ V}/4096 = 1.22\text{ mV}$ of resolution. The second generator is built with a microcontroller. The selected microcontroller has an integrated SPI communication circuit. The microcontroller code was written in the C language to generate the words, manipulate the SPI interface and send them to the external DAC for conversion. For this second generator, the tool "eZ430-F2013" from Texas Instruments [6] was used. This board uses the MSP430 microcontroller, model MSP430F2013. The C code was generated and compiled using the software IAR


Fig. 4. (above) Type II waveform generated by the FPGA. (below) Frequency spectrum.

Embedded Workbench Kickstart for MSP430 from IAR Systems. In addition to the development boards, a 12-bit DAC from Microchip, model MCP4921 [7], was also used. This converter has an integrated SPI communication interface. The block diagram for this is illustrated in Fig. 2. The SPI block gets the words and sends them to the shift register. The register routes the words to the DAC for conversion.
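For illustration, the sketch below shows one possible shape of the FPGA-side word generator: a small lookup table indexed by a phase counter, with each 12-bit sample wrapped into the MCP4921's 16-bit frame (four configuration bits followed by the 12-bit data word). The entity name, clocking scheme and table contents are placeholders; the actual Type I/II step levels of the paper are not reproduced here.

-- Illustrative waveform word generator feeding an MCP4921 over SPI.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity waveform_word_gen is
  port (
    clk      : in  std_logic;
    load     : in  std_logic;                      -- request the next 16-bit SPI frame
    spi_word : out std_logic_vector(15 downto 0)   -- frame shifted out to the DAC
  );
end entity;

architecture rtl of waveform_word_gen is
  type sample_rom is array (0 to 7) of unsigned(11 downto 0);
  -- Placeholder 8-step pattern around mid-scale (x"800" = half of the 5 V reference).
  constant STEPS : sample_rom := (x"800", x"B00", x"D00", x"B00",
                                  x"800", x"500", x"300", x"500");
  signal phase : unsigned(2 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if load = '1' then
        -- MCP4921 write command: '0' (DAC A), '0' (unbuffered), '1' (1x gain),
        -- '1' (active), followed by the 12-bit sample.
        spi_word <= "0011" & std_logic_vector(STEPS(to_integer(phase)));
        phase    <= phase + 1;                     -- wraps naturally after 8 samples
      end if;
    end if;
  end process;
end architecture;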

3. EVALUATION OF GENERATED WAVEFORMS

Waveforms Type I and Type II generated from the development board were analyzed by an oscilloscope with Fast Fourier Transform (FFT), for evaluation of harmonics. The results are presented next.

3.1. Evaluation of the Type I and Type II waveform generated by the FPGA

For this waveform, the FFT analysis is based on the curve shown in Fig. 3. For the FFT analysis, the horizontal resolution is 10 kHz per division (10 kHz/div) and the vertical resolution is 100 mV rms per division (100 mVrms/div). The analysis of the harmonics shows the absence of even harmonics, as expected, and the absence of the third harmonic. Other harmonics are present: the fifth, the seventh and, subtly, the ninth. For the Type II waveform, the FFT harmonic analysis is shown in Fig. 4. This second waveform displays almost identical harmonic content to that of Fig. 3. The third harmonic is not present.

Fig. 5. (above) Type I waveform generated by the microcontroller. (below) Frequency spectrum.

3.2. Evaluation of the Type I and Type II waveform generated by the microcontroller

The waveform generated by the microcontroller is shown in Fig. 5, together with the FFT analysis. One notes the great similarity to that of Fig. 3. The Type II waveform generated by the microcontroller is shown in Fig. 6, along with the FFT analysis. Here one sees a difference from the Type II waveform generated by the FPGA: the second, third and sixth harmonics are clearly present. The fourth harmonic is absent.

4. DISCUSSION

After carefully analyzing the generated waveforms, one observes that the Type II waveform generated by the microcontroller (Fig. 6) is not symmetric with respect to the positive and negative voltage peaks. The C code used defines exactly half of the reference voltage of the 12-bit DAC (7FF). This lack of symmetry is also observed in the Type I waveform, Fig. 5, but no significant changes in the harmonics are observed when Figs. 3 and 5 are compared. In the waveforms generated by the FPGA, the symmetry between the positive and negative voltage peaks is observed. This symmetry ensures that the spectrum has only the fifth and seventh harmonics. The Type II waveform was simulated taking into account the asymmetry of the positive peak. Another simulation was performed with the symmetric Type II waveform. The results of the spectral analysis using FFT are shown in Fig. 7. The software used was Matlab 7.


Fig. 6. (above) Type II waveform generated by the microcontroller. (below) Frequency spectrum.

The simulated symmetric Type II waveform presents the fifth and seventh harmonics. This result matches the measurement shown in Fig. 4. The simulated non-symmetric Type II waveform features even harmonics (second and fourth). The measurement in Fig. 6 subtly shows the third harmonic, which is not present in the simulated waveform in Fig. 7. This difference is possibly caused by noise at the positive peak of the waveform observed in Fig. 6. For the real Type II waveform, no variation in the frequency spectrum was observed when a 30° lag is included in the traditional waveform (Fig. 2). This delay was simulated in Matlab and no variation was observed in the frequency spectrum.

5. CONCLUSION

As expected from the analysis by Santos and Barybin, the real Type II waveform did not present third harmonic content. These measurements confirm the theoretical results obtained, so this waveform can be used to assess the nonlinearity of passive components. The proposal to generate a signal free of third harmonic was more successful on the FPGA. Symmetry defects were observed in the signal generated by the microcontroller used, compromising the final result. The next step is to apply the generated waveforms to real resistors and evaluate their linearity using spectral analysis.

Fig. 7. Spectral analysis performed in Matlab 7 for the non-symmetric Type II waveform (above) and the symmetric Type II waveform (below).

6. REFERENCES

[1] Danbridge A/S, CLT 10 Component Linearity Test Equipment Application Note. 2002. <http:// danbridge.dk.web13.123test.dk/Files/filelement30.pdf>

[2] D. Pederson, K. Mayaram, Analog Integrated Circuits for Communication. New York: Springer, 2008.

[3] P. Corey, “Methods for optimizing the waveform of stepped-wave static inverters,” AIEE Summer General Meeting, Jun. 1962.

[4] E. Santos, A. Barybin, “Stepped-waveform synthesis for reducing third harmonic content,” IEEE Transactions on instrumentation and measurement, vol. 54, no. 3, pp. 1296-1302, Jun. 2005.

[5] Altera Corporation, University Program UP2 Education Kit User Guide. Dec. 2004. <http://www.altera.com/ literature/univ/upds.pdf>

[6] Texas Instruments, eZ430-F2013 Development Tool User’s Guide. 2006. <http://focus.ti.com/lit/ug/slau176b/ slau176b.pdf>

[7] Microchip Technology Inc., 12-Bit DAC with SPI Interface. 2004. <http://ww1.microchip.com/downloads/en/DeviceDoc/21897a.pdf>


AUTHORS INDEX

Araújo, A. H. C. B. S. .......... 75
Barreto, R. S. .......... 29
Belmonte, J. .......... 7
Boemo, E. .......... 19
Borensztejn, P. .......... 13, 57
Caraciolo, M. P. .......... 69
Caruso, D. M. .......... 1
Carvalho, G. R. .......... 75
Cayssials, R. .......... 43
Cordeiro, F. R. .......... 69
Corti, R. .......... 7
Crepaldo, D. A. .......... 25
D'Agostino, E. .......... 7
De Farias, T. M. T. .......... 35
De Lima, J. A. G. .......... 35
De Lima, M. E. .......... 75
De Maria, E. A. A. .......... 39
Del Rios, J. .......... 63
Dias, W. R. A. .......... 29
Ferreira, L. P. .......... 69
Ferro, E. .......... 43
Freitas, D. R. R. .......... 85
Gazineu, G. M. .......... 75
Giandoménico, E. .......... 7
Luppe, M. .......... 47
Maidana, C. E. .......... 39
Martin, R. L. .......... 25
Martínez, R. .......... 7
Mesquita Júnior, A. C. .......... 81
Moreno, E. D. .......... 29
Mosquera, J. .......... 13
Oliveira, D. L. .......... 63
Ortega-Ruiz, J. .......... 19
Pedre, S. .......... 13, 57
Prado, R. G. .......... 75
Romano, L. .......... 63
Sacco, M. .......... 13
Schiavon, M. I. .......... 25
Santos, E. J. P. .......... 85
Silva-Filho, A. G. .......... 69, 75
Soares, D. .......... 53
Stoliar, A. .......... 13, 57
Szklanny, F. I. .......... 39
Torquato, L. .......... 53
Tropea, S. E. .......... 1
Urriza, J. .......... 43
Viana, P. .......... 53


ISBN: