

STIGLitz: Efficiently Generating Video with an 8 bit MCU using Software Thread Integration

Alexander G. Dean, Benjamin Welch, Shobhit Kanaujia

Center for Efficient, Scalable and Reliable Computing

Dept. of Electrical and Computer Engineering

North Carolina State University

Raleigh, NC 27695-7256

[email protected]

1. INTRODUCTION

1.1 Background

Generating video signals for CRTs or LCDs in software on simple processors with minimal hardware support has been a popular project among microcontroller enthusiasts. Pioneers in the 1970s and early 1980s include Don Lancaster [18] and the Timex/Sinclair 1000 (Sinclair ZX81). More recently Ricard Gunee [17], Robert Lacoste [15] [16], Bruce Land [20], Alberto Riccibitti [22], Eric Smith [23] and David Thomas [21] have demonstrated systems. There are various useful websites with surveys and tutorials [17][19][24]. The recent systems use PIC, Ubicom (formerly Scenix) and AVR processors due to their predictable performance and sufficiently fast clock rates. These developers performed manual timing analysis of code and scheduled video operations with nop (“no operation”) instructions to fill in idle time within a video line. These nops account for a large fraction of the CPU’s time (59% for STIGLitz), wasting processing capacity and leaving little time for other code to execute (0.5 MIPS for Land [20]).

Some people have recovered portions of this idle time for other work by manually moving instructions in an ad hoc (or “add hack”) manner. This reclamation effort is not limited to display applications, but shows up in other applications which need fine-grain concurrency, such as communications systems like modems [25] and network interfaces. These individuals usually describe their work with discouraging terms such as “huge effort” [16] and “difficult to write and debug” [25]. Our approach brings a method to this madness by using modern compiler technology to create code which executes useful system work in place of these nops.

In order to use a single processor effectively, it must be shared, for example with a scheduler, a real-time operating system, and/or interrupt service routines. Most processors are poor at simulating concurrency, as each context switch requires time- or event-based triggering as well as the actual switching. This overhead time rises as switches become more frequent, reducing processor MIPS available for application code.

We have developed methods to merge functions from multiple threads into one function, giving multithreaded concurrency on a simple uniprocessor. This software thread integration (STI) lets us use cheap, common microcontrollers instead of faster, more expensive ones. We have been building these methods into our optimizing compiler back-end (called Thrint) to help squeeze performance out of generic processors for applications with fine-grain concurrency such as controllers for video refresh and embedded networks. This article describes our STIGLitz project, in which an Atmel Atmega 128 8-bit microcontroller running at 20 MHz generates a monochrome NTSC video signal with STI and simple hardware. It provides a 256 by 240 pixel frame-buffer-based display with two bits per pixel and rendering of lines, circles, sprites and text. It reclaims the time between pixel-output operations for graphics primitive rendering and serial communication use, increasing the system’s graphics performance by a factor of 4x to 13x.


Figure 1. Processor utilization for video signal generation by STIGLitz. Without integration, 12 MIPS of idle time is trapped in fragments only 9 cycles long and is unusable for other processing. Software thread integration reclaims that idle time to improve rendering performance and service high-speed serial communication.

Figure 1 shows the processor utilization for our system under various conditions. The leftmost bar (No Integration) shows that before applying STI, the MCU spends most of its time refreshing the display or executing the nops between video output instructions. This leaves only 0.36 MIPS for foreground processing. The four bars to the right demonstrate processor utilization when rendering various types of lines. STI reclaims large amounts of idle time, providing 1.3 to 4.5 MIPS of line rendering and 2.1 MIPS of serial communication processing instead. Some time is wasted in the dispatcher or context switching, while some is lost because STI integration is not completely efficient when dealing with unpredictable loops. This article describes how to integrate code using STI, how the STIGLitz video generation platform works, and how to use it.

1.2 NTSC Video

Monochrome NTSC (RS170) video generation requires the generation of periodic synchronization (sync) pulses. Generating these sync pulses in software is a real-time problem; a late sync pulse deteriorates the picture quality. In this section we briefly review the NTSC monochrome signal to provide the reader with a good understanding of the real-time requirements in video generation.

An NTSC frame contains 525 scan lines and is composed of interlaced raster scan lines. Two sets of raster scans (called fields) are drawn per frame; we refer to these fields as even and odd fields. The fields are made up of 262.5 lines and consist of alternate lines on the screen (hence the term interlaced). Each scan line lasts for 63.5 microseconds, of which only 52.6 microseconds is the visible section and the remaining time is occupied by synchronization.

There are two types of synchronization signals, vertical and horizontal. The vertical sync signals reset the scan to the beginning of the next field. As shown in Figure 2, the vertical synchronization sequence is composed of three pre-equalization pulses, three vertical synchronization pulses and three post-equalization pulses. The vertical synchronization pulses occur at the field rate of 60 Hz.


Figure 2. Vertical Synchronization Pulses in an NTSC monochrome signal.

The horizontal sync signals reset the scan to the next line in the field and occur at the line rate of 15.75 kHz. Figure 3 shows an entire scan line of video, including the three parts of the horizontal synchronization (front and back porch and the horizontal sync) and their respective durations.

Figure 3. Horizontal Synchronization Pulses in an NTSC monochrome signal.
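The line and field rates quoted in this section follow from the line period and the number of lines per field. The following is a minimal sketch of that arithmetic in C, assuming only the figures given above (63.5 microsecond lines, 262.5 lines per field) and the 20 MHz clock used by STIGLitz. The function names are illustrative and are not part of STIGLitz.

```c
/* NTSC monochrome timing figures quoted in the text. */
#define NTSC_LINE_PERIOD_NS  63500LL   /* one scan line lasts 63.5 us */
#define NTSC_LINES_PER_FIELD 262.5     /* half of the 525-line frame  */

/* Horizontal (line) rate in Hz, the reciprocal of the line period. */
double ntsc_line_rate_hz(void)
{
    return 1e9 / (double)NTSC_LINE_PERIOD_NS;
}

/* Vertical (field) rate in Hz: one field is drawn every 262.5 lines. */
double ntsc_field_rate_hz(void)
{
    return ntsc_line_rate_hz() / NTSC_LINES_PER_FIELD;
}

/* CPU cycles available in one scan line for a clock given in Hz. */
long long cpu_cycles_per_line(long long cpu_hz)
{
    return cpu_hz * NTSC_LINE_PERIOD_NS / 1000000000LL;
}
```

With these figures the line rate comes out close to 15.75 kHz and the field rate close to 60 Hz, and a 20 MHz processor has 1270 cycles to spend on each scan line.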

1. STI OVERVIEW

1.1 Software Thread Integration

Figure 4. Overview of hardware to software migration with STI. Idle time is statically filled at compile time with other useful work from the system.

STI works by merging two functions into one implicitly multithreaded function, as shown in Figure 4 [10][4][5][6][7][8]. A software implementation of hardware typically has some idle time between time-critical groups of instructions. When used for real-time software, STI enables the placement of time-critical instructions from one function so they execute at a given time relative to the beginning of the integrated function, regardless of the control or data flow characteristics of either thread. The function with internal idle time is called the primary or guest function, while the function with which it is integrated is called the secondary or host function.

We place various restrictions on the functions to be integrated in order to simplify integration and minimize the additional processing needed. No subroutine calls are allowed in either the primary or secondary functions. The primary function can have no local variables or arguments, as these would require merging the stack frames of both functions and modifying secondary function call sites. Interrupts within the system are disabled for the duration of the integrated threads, leading to increased latency. We recently have developed methods [13] to support interrupts through polling and cut latency to acceptable levels. In general, loops must have a known number of iterations. However, in many integration cases this restriction is lifted; this is described in more detail below. The integrated line-drawing functions all contain loops with unknown iteration counts.

Several steps are required before thread integration can begin. The first involves structuring the program so that work can be accumulated for an integrated function to perform later. In STIGLitz we use various queues to accumulate graphics primitive rendering work for later processing by secondary functions. The functions to be integrated will dequeue an item of work and process it; one example in STIGLitz is STIGLitz_Service_DrawDiagonalLine.

The second step is to write the functions which will be integrated. We used C for the graphics rendering functions (as well as most of the other code), and assembly language for the video refresh function. The functions should be debugged, optimized and tested as much as practical at this point. Functions which will be integrated must share the register file, so we partition the register file to restrict each function to using particular registers (see Table 1). This simplifies later steps in integration. There are more efficient methods for allocating registers, but we choose partitioning for practicality. C code which must be integrated is compiled to assembly code using a convenient command-line switch for GCC (e.g. -ffixed-r9) which prevents the use of specific registers.
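The work queues used in the first step can be pictured as small ring buffers of rendering parameters. The sketch below is an illustration in C under our own assumptions: the type and function names (LineJob, LineQueue, line_queue_put, line_queue_get) and the queue depth are hypothetical and do not come from the STIGLitz source. In a real system the producer and consumer run in different contexts, so the updates would also need protection against interruption.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_QUEUE_SIZE 8u   /* hypothetical queue depth */

/* Parameters of one deferred line-drawing request (hypothetical layout). */
typedef struct {
    uint8_t x0, y0, x1, y1;
} LineJob;

/* Fixed-size ring buffer holding deferred rendering work. */
typedef struct {
    LineJob jobs[LINE_QUEUE_SIZE];
    uint8_t head;    /* index of the next job to dequeue */
    uint8_t tail;    /* index of the next free slot      */
    uint8_t count;   /* number of jobs currently queued  */
} LineQueue;

/* Application side: defer a line for rendering during video refresh.
 * Returns false if the queue is full. */
bool line_queue_put(LineQueue *q, LineJob job)
{
    if (q->count == LINE_QUEUE_SIZE) {
        return false;
    }
    q->jobs[q->tail] = job;
    q->tail = (uint8_t)((q->tail + 1u) % LINE_QUEUE_SIZE);
    q->count++;
    return true;
}

/* Integrated-function side: fetch one item of work to process.
 * Returns false if no work is pending. */
bool line_queue_get(LineQueue *q, LineJob *out)
{
    if (q->count == 0u) {
        return false;
    }
    *out = q->jobs[q->head];
    q->head = (uint8_t)((q->head + 1u) % LINE_QUEUE_SIZE);
    q->count--;
    return true;
}
```

A service function in the style of STIGLitz_Service_DrawDiagonalLine would call the dequeue routine once and then render the returned primitive.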

Table 1. Register Use for Integrated Functions

Function                                                          Registers Used
                                                                  Pointer        Immediate        Other
PumpPixel                                                         r26-27 (X)     r19
UART_Tx, UART_Rx                                                  r30-31 (Z)     r20, r24-25
Non-sprite graphics rendering functions (partial context switch)  r28-31 (Y,Z)   r16-18, r20-25   r0-7
Sprite graphics rendering functions (full context switch)         r28-31 (Y,Z)   r16-18, r20-25   r0-15

The third step is to perform static timing analysis on the primary and secondary functions to be integrated. This identifies the best-case and worst-case start times for each instruction and region of code. This is a tedious and error-prone chore, which is why we use our optimizing compiler back-end Thrint to perform the analysis on the assembly code (e.g. thrint -Gc -hp -hC file.s). This generates a graph description file which can be processed by the programs aiSee (Windows) or XVCG (Unix) [1] to create a diagram which includes instructions and cycle counts, as shown in Figure 5 and Figure 6.

Figure 5. Thrint automatically forms a control dependence graph of STIGLitz_Service_DrawDiagonalLine, performs timing analysis and then pads away timing variations in conditionals with nop instructions (indicated with grey nodes). Thrint creates a graph description file which is rendered by other tools (aiSee or XVCG) to create this image.


Figure 6. Detail of CDG shows best- and worst-case starting times for each instruction and region as well as code sizes.

Thrint uses a tree-based form called a control dependence graph (CDG) to represent the control-flow of code, rather than the traditional control-flow graph. This hierarchical form makes the code much easier to comprehend, analyze and modify for STI. Control dependence regions such as conditionals and loops are represented as non-leaf nodes, and assembly language instructions are stored in leaf nodes. Conditional nesting is represented vertically while execution order is horizontal (left to right). The type of edge connecting a node to its parent determines the condition under which it will be executed (always, when the condition is true, or when it is false). Program regions such as loops and conditionals as well as single basic blocks are moved efficiently in a coarse-grain fashion, yet instructions can be scheduled on a fine-grain basis (within nodes) as needed.

The fourth preparatory step is to pad uneven duration conditionals in both the primary and secondary functions with nop instructions so the program takes the same amount of time regardless of control flow path. Blocks of nops or loops can be used for padding. Thrint can perform this automatically as well (e.g. thrint -Gc -hp -hC -i file.s file_pad.id) but now requires an integration directives file (e.g. file_pad.id, Listing 1). Thrint creates an output file file.int.s with the padded function in assembly language.

TOLERANCE 0 CY
PROCEDURE STIGLitz_Service_DrawDiagonalLine DISCRETE_PADDED
END

Listing 1. Integration directives file for padding a function.
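The padding step can be illustrated on a small tree shaped like the CDG described earlier. The sketch below is written in C under simplifying assumptions of our own: each node carries a fixed cycle cost, a conditional has one test cost regardless of the outcome (real AVR branches cost a different number of cycles when taken), and loops are left out. The Node type and pad_region function are hypothetical and are not Thrint's data structures.

```c
#include <stddef.h>

typedef enum { NODE_CODE, NODE_COND } NodeKind;

/* One node of a simplified control dependence tree. */
typedef struct Node {
    NodeKind kind;
    unsigned cycles;        /* CODE: duration; COND: cost of the test   */
    struct Node *next;      /* next sibling in execution order          */
    struct Node *if_true;   /* COND: first node of the true branch      */
    struct Node *if_false;  /* COND: first node of the false branch     */
    unsigned pad_true;      /* nops to add to the true branch (output)  */
    unsigned pad_false;     /* nops to add to the false branch (output) */
} Node;

/* Walk a sibling list, record how many nop cycles each conditional
 * branch needs so both paths take equal time, and return the fixed
 * duration of the padded list in cycles. */
unsigned pad_region(Node *n)
{
    unsigned total = 0;

    for (; n != NULL; n = n->next) {
        total += n->cycles;
        if (n->kind == NODE_COND) {
            unsigned t = pad_region(n->if_true);
            unsigned f = pad_region(n->if_false);
            unsigned longest = (t > f) ? t : f;

            n->pad_true = longest - t;    /* shorter path gets the nops */
            n->pad_false = longest - f;
            total += longest;
        }
    }
    return total;
}
```

After the call, every path through the region has the same duration, which is the property the later integration steps rely on.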

The fifth step in preparation is to evaluate the duration of the primary and secondary functions after padding, consider the idle time available in the primary function, periods of tasks, allowable interrupt latencies, and then determine an approach for integration. In some cases a primary function may need to run so often that there is not enough time to execute all of the secondary function before the next instance of the primary. In this case we partition the secondary function into segments slightly shorter than the available time and integrate the primary function into each segment. Figure 13 shows various long functions broken into segments before integration together in STIGLitz. On the first call, the dispatcher (Figure 12) invokes the function. At the end of each segment but the last a context switch with a coroutine call saves the function’s registers. As time becomes available for executing the integrated code, the function resumes execution with another context switch and coroutine call. In this manner the integrated function will execute segment by segment. Short-latency tasks such as ISRs are handled with polling servers. These polling servers are integrated one or more times into the primary function. More details on the segmentation and polling are available [13].

CLOCK_FREQUENCY 20 MHz
PROCEDURE main INTO main ENDS
TOLERANCE 50 ns
BLOCK EnableCounters AT 0.8 us
TOLERANCE 0 CY
LOOP PixelLoop PERIOD 800 ns ITERATION_COUNT 64
BLOCK PixelOut INTO_LOOP PixelLoop FIRST_AT 1.2 us
END

Listing 2. Integration directives file with target times for primary function.

We next define target time ranges (earliest and latest) for each region of primary code to begin execution. In the case of the video generation code, the earliest and latest times for a given region are equal, as any timing variation will lead to the shifting of pixels on the display and degrade image quality. An example of an integration directives file with timing targets appears in Listing 2.
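The times in Listing 2 correspond to whole numbers of processor cycles at the stated 20 MHz clock. The helper below is a small sketch in C of that conversion. It is our own illustration, the name ns_to_cycles is hypothetical, and how Thrint itself converts or rounds directive times is not described here.

```c
/* Convert a time in nanoseconds to CPU cycles for a clock given in MHz.
 * Returns -1 if the time is not a whole number of cycles, since a target
 * that falls between cycle boundaries cannot be hit exactly. */
long ns_to_cycles(long ns, long clock_mhz)
{
    long scaled = ns * clock_mhz;   /* cycles multiplied by 1000 */

    if (scaled % 1000L != 0L) {
        return -1L;
    }
    return scaled / 1000L;
}
```

Under this conversion the 800 ns loop period is 16 cycles, the 1.2 us first pixel output falls at cycle 24, and the 50 ns tolerance is one cycle.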

At this stage we finally perform actual integration. First we present the basic integration methods which handle primary regions not contained in loops (called single-event code). Then we introduce methods for handling primary regions within loops (called looping-event code). These techniques allow integration of arbitrary code.

STI consists of moving regions of code (which could be single instructions) from the primary thread into the secondary thread to execute at the correct times, as presented in Figure 7a. Using the time-annotated CDG of the secondary code as a map, we recursively identify the proper location to place each primary region based upon its target time range. If this range overlaps with a gap between secondary nodes, we can place the primary region between those nodes. Otherwise the target time range is contained completely within a region, and the secondary code must be modified to open up a gap which overlaps with the target time range. If the secondary region is a code node, it can simply be split at the appropriate time. If the region is a conditional, the search for the proper gap is repeated on both the true and false conditional cases.

a) General transformations for integrating code b) Transformations for integrating loops together

Figure 7. Code transformations for software thread integration allow functions with different control flows to be integrated.
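For straight-line secondary code, the search for a gap reduces to walking the nodes in execution order while accumulating their durations. The following C sketch shows that case only, under assumptions of our own: node durations are fixed cycle counts, a code node may be split at any cycle, and the delay added by the inserted primary region is ignored. Placement and find_placement are hypothetical names, not part of Thrint.

```c
/* Where a primary region lands in a list of secondary code nodes. */
typedef struct {
    int node;          /* index of the secondary node containing the target;
                          equal to the node count if the target lies past
                          the end of the secondary code                     */
    unsigned offset;   /* cycles of that node to run before the primary
                          region; 0 means the existing gap in front of the
                          node is used and no split is needed               */
} Placement;

/* durations[i] is the fixed duration in cycles of the i-th secondary node.
 * target is the cycle, measured from the start of the function, at which
 * the primary region must begin. */
Placement find_placement(const unsigned *durations, int count, unsigned target)
{
    unsigned start = 0;   /* start cycle of the node being examined */

    for (int i = 0; i < count; i++) {
        if (target < start + durations[i]) {
            Placement inside = { i, target - start };
            return inside;
        }
        start += durations[i];
    }

    /* Past the end: offset is the idle padding needed before the region. */
    Placement past_end = { count, target - start };
    return past_end;
}
```

A full implementation would recurse into conditionals and loops as the text describes; this sketch covers only the code-node case.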

If the region is a loop, integration is handled as follows, shown in Figure 7b. A loop with a known number of iterations can either be split and peeled or left intact. Consider the case of a primary region which needs to begin execution in the middle of iteration 5 of a secondary loop. Splitting and peeling the loop involve duplication; the first copy of the loop executes the first four iterations, the fifth iteration is executed by a peeled-off copy of the loop body, and the remaining iterations are executed by the second copy of the loop. The search for the proper gap is now repeated on the peeled iteration of the secondary loop. The alternative to splitting and peeling a loop is to insert the primary region under a guard region (a conditional branch) which only executes the primary region on a specific iteration. Inserting this guard region requires search for a gap slightly before the target time range. The timing analysis for this approach is more involved, as inserting the guard code disrupts the previous timing analysis. For STIGLitz we use loop splitting and peeling.
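The splitting and peeling described above can be shown at source level. The C sketch below is illustrative only: the actual transformation is applied to assembly code, and the loop body, the PeelResult bookkeeping and the function names are hypothetical. It places a primary region in the middle of the fifth iteration of an eight-iteration secondary loop and records when it ran.

```c
#include <stdint.h>

#define N_ITER     8   /* known iteration count of the secondary loop */
#define EVENT_ITER 4   /* zero-based index of the fifth iteration     */

typedef struct {
    uint16_t sum;             /* result of the secondary work            */
    int event_count;          /* times the primary region executed       */
    int halves_before_event;  /* half-iterations completed before it ran */
} PeelResult;

/* The secondary loop body, written as two halves so that a primary
 * region can be placed between them. */
static void first_half(const uint8_t *d, int i, uint16_t *sum)
{
    *sum = (uint16_t)(*sum + d[i]);
}

static void second_half(const uint8_t *d, int i, uint16_t *sum)
{
    *sum = (uint16_t)(*sum + 2u * d[i]);
}

/* Secondary loop before integration. */
PeelResult run_original(const uint8_t d[N_ITER])
{
    PeelResult r = { 0, 0, 0 };

    for (int i = 0; i < N_ITER; i++) {
        first_half(d, i, &r.sum);
        second_half(d, i, &r.sum);
    }
    return r;
}

/* Same loop after splitting and peeling around the fifth iteration. */
PeelResult run_split_and_peeled(const uint8_t d[N_ITER])
{
    PeelResult r = { 0, 0, 0 };
    int halves = 0;
    int i;

    for (i = 0; i < EVENT_ITER; i++) {      /* first copy: iterations 1-4 */
        first_half(d, i, &r.sum);
        second_half(d, i, &r.sum);
        halves += 2;
    }

    first_half(d, i, &r.sum);               /* peeled fifth iteration */
    halves++;
    r.event_count++;                        /* primary region goes here */
    r.halves_before_event = halves;
    second_half(d, i, &r.sum);
    i++;

    for (; i < N_ITER; i++) {               /* second copy: the rest */
        first_half(d, i, &r.sum);
        second_half(d, i, &r.sum);
    }
    return r;
}
```

Both versions compute the same result; the peeled version executes the primary region exactly once, without a per-iteration guard test, at the cost of a second copy of the loop body and loop.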

If the primary code contains events within loops, the following approach is used. First, the execution schedules of the primary and secondary functions are compared, accounting for the fact that the secondary function will only execute in the idle time of the primary function. Portions of primary loops containing events that do not overlap with secondary are unrolled and integrated as single events. If a primary loop containing events overlaps with a secondary loop, the overlapping portions of the loops will be fused. Loop fusion consists of merging the two loop bodies and modifying the loop control test to repeat only when both loop-controlling conditions are true. Remaining primary or secondary loop iterations are processed by dedicated “clean-up” loops following the fused loop. An additional aspect to loop fusion involves matching the secondary thread loop iteration time to the available idle time in the primary thread loop body by unrolling the shorter loop. The resulting loop body is padded as needed to meet the timing requirements of the primary looping events.
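Loop fusion with clean-up loops has the following general shape at source level. This C sketch is an illustration under our own assumptions: the primary loop copies bytes of a video line to an array standing in for the output port, the secondary loop fills a buffer, and the function name and parameters are hypothetical. Cycle-accurate padding and unrolling, which the real transformation needs, are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Fuse a primary loop (emit n_bytes of video data) with a secondary
 * loop (fill n_fill bytes of dst with value). port_log stands in for
 * the hardware output port so the result can be inspected. */
void fused_refresh_and_fill(const uint8_t *line, uint8_t *port_log,
                            size_t n_bytes,
                            uint8_t *dst, size_t n_fill, uint8_t value)
{
    size_t p = 0;   /* primary loop counter   */
    size_t s = 0;   /* secondary loop counter */

    /* Fused loop: repeats only while both loop conditions are true. */
    while (p < n_bytes && s < n_fill) {
        port_log[p] = line[p];   /* primary body: output one byte    */
        p++;
        dst[s] = value;          /* secondary body runs in idle time */
        s++;
    }

    /* Clean-up loop for remaining primary iterations. On the target
     * this body would be padded to keep the pixel timing. */
    while (p < n_bytes) {
        port_log[p] = line[p];
        p++;
    }

    /* Clean-up loop for remaining secondary iterations. */
    while (s < n_fill) {
        dst[s] = value;
        s++;
    }
}
```

Whichever loop is longer, both complete all of their iterations, and only one of the two clean-up loops does any work.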

STI can be performed on code with no more than one loop per thread with an unknown iteration count. This restriction is eliminated in the case of long secondary functions and short primaries. Here the secondary is broken into segments which are shorter than the idle time and have at most one loop of unknown iteration count. Although this may seem to be a significant restriction, the line rendering functions integrated in STIGLitz have from one to four such loops yet benefited extensively from STI.

1.2 Memory Expansion

STI produces code which is more efficient than context-switching or busy-waiting. The processor spends fewer cycles performing overhead work. The price is expanded code memory. STI may duplicate code, unroll and split loops and add guard instructions. It may also duplicate both host and guest threads. The memory expansion can be reduced by trading off execution speed or timing accuracy. This flexibility allows the tailoring of STI transformations to a particular embedded system’s constraints. For more details please see [5] and related work.

1.3 Tools for Automating STI

We have developed our optimizing post-pass compiler Thrint in C++ over the past five years. It totals over 20,000 lines of code. In past work [5] we have used Thrint to automatically integrate video refresh code with rendering primitives for the 64-bit Alpha processor. We are in the process of retargeting the appropriate portions to the AVR architecture. Thrint parses AVR and Alpha assembly code, builds control flow and dependence graphs (for internal program representation and user visualization), predicts best- and worst-case code execution schedules, measures idle time and timing jitter, evaluates register data flow, attempts to predict loop iteration counts, plans integration, pads timing variations in conditionals, moves and replicates code regions, unrolls, splits and peels loops, verifies timing correctness of integrated code and finally regenerates a file with flat assembly code. Currently we use Thrint for timing analysis of AVR code, as the full sequence of integration steps is not yet supported for the AVR processor. Table 2 presents useful Thrint command-line switches.

Table 2. Common Thrint Command-Line Switches

-Gi   Create program dependence graph for aiSee including instructions
-Gc   Create program dependence graph for aiSee including instructions and start cycle for each
-i    Follow directives in specified .id file
-hC   Form CDGs with improved methods
-S    Regenerate assembly code
-Tg   Form histogram of idle time
-hn   Do not use NOP loops when padding timing variations
-ha   Skip data-flow analysis
-hp   Select options for execution in PC environment
-he   Assume all SRAM accesses take one extra cycle

Thrint operates in a Unix environment; we use Cygwin on Windows XP. We encourage users to download the Thrint executable and experiment with it for their timing analysis needs. Other software we use includes avr-gcc 3.2 for compiling, linking and assembling code, gnu binutils for handling binary files, AVR Studio 3.56 for downloading new code to the MCU, and aiSee for visualizing the program control dependence graphs.

2. STIGLitz

2.1 System Overview

STIGLitz generates an RS170 (NTSC-compatible monochrome) video signal, provides a library of graphics primitive rendering functions integrated with video refresh code and high-speed serial communications but uses very simple hardware. We use the Atmega 128 [1] from the Atmel AVR architecture, which features 8 bit native word size, 32 general-purpose registers, and limited support for 16 bit operations. The processor includes 128 kilobytes of Flash program memory, 4 kilobytes of on-board data SRAM and numerous peripherals. The CPU core features a two-stage pipeline; most instructions take one cycle, but some take up to five. An Atmel STK500 evaluation board and STK501 processor expansion card are used to execute the integrated code. These are overclocked at 20 MHz. 64 kilobytes of external SRAM (IDT71124) are used as well, with a one cycle performance penalty, so loads and stores take three cycles. The C compiler used is GCC 3.2 [3]. No operating system is used, although STIGLitz does not preclude the use of one.

The video data portion of the NTSC signal is the most demanding part, as a pixel of video data must be generated every 200 ns (for 256 pixels per row). We use an external shift register to serialize a byte packed with data, reducing the processor loading. On a 20 MHz CPU this corresponds to 16 clock cycles per byte, which is too frequent for context switching or dynamic scheduling. With a 256 pixel wide, two-bit-per-pixel display, sending out a byte takes four cycles, so the idle time remaining is 9 cycles per byte. This comes to 756 cycles (37.8 us) for 64 bytes of video data, and 11.9 million cycles per second, or nearly 60% of the MCU’s time.
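Part of this budget can be checked with a few lines of arithmetic. The C sketch below uses only figures stated in the text (20 MHz clock, 200 ns per pixel, two bits per pixel, 756 idle cycles per line, 15.75 kHz line rate). It assumes, as our own reading, that the 756 idle cycles recur on every scan line; that assumption reproduces the quoted 11.9 million cycles per second. The function names are illustrative.

```c
/* Figures quoted in the text. */
#define CPU_HZ               20000000L
#define PIXEL_PERIOD_NS      200L
#define BITS_PER_PIXEL       2L
#define LINE_RATE_HZ         15750L   /* 15.75 kHz line rate            */
#define IDLE_CYCLES_PER_LINE 756L     /* idle cycles within one line    */

/* CPU cycles between successive bytes sent to the shift register. */
long cycles_per_video_byte(void)
{
    long pixels_per_byte = 8L / BITS_PER_PIXEL;               /* 4 pixels */
    long byte_period_ns = pixels_per_byte * PIXEL_PERIOD_NS;  /* 800 ns   */

    return (CPU_HZ / 1000000L) * byte_period_ns / 1000L;
}

/* Idle cycles per second, assuming every scan line has the same idle time. */
long idle_cycles_per_second(void)
{
    return IDLE_CYCLES_PER_LINE * LINE_RATE_HZ;
}

/* Fraction of all CPU cycles that are idle during video generation. */
double idle_fraction(void)
{
    return (double)idle_cycles_per_second() / (double)CPU_HZ;
}
```

This gives 16 cycles per byte and an idle fraction of about 0.595, in line with the "nearly 60%" figure above.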

A digital-to-analog converter (DAC) converts the serialized pixels from the data byte to an analog voltage for the NTSC output. There are additional features in a video signal (vertical sync and equalization pulses); these are generated by our software as well. Our system generates a monochrome 256 x 254 pixel image with two bits per pixel, although resolutions of up to 512 x 525 with 1 bit per pixel are possible with minor modifications.

2.2 Hardware

Figure 8. The video serializer board is used in conjunction with Atmel’s STK500 AVR Starter Kit and STK501 Atmega processor daughtercard.


Figure 10. Overview of software architecture. Deferrable graphics work is enqueued for later rendering during video refresh.

A large structure called STIGLitz holds the data needed for video generation and graphics rendering. It includes the frame-buffer, pen and background colors, queues to hold deferred work, and flags to control work deferral. The graphics rendering and serial communication functions are presented in Table 3. Serial communication at up to 115.2 kbaud is handled by a USART, two circular queues, and routines for enqueueing and dequeuing data. Note that at this rate there is significant timing error due to mismatches between the microcontroller’s baud rate generator and the 20 MHz system clock. Because of this, 57.6 kbaud is the maximum practical baud rate for a 20 MHz clock.

Table 3. STIGLitz Functions

Litz_Init              STIGLitz_SetForeground    STIGLitz_EraseSprite      STIGLitz_GIFDecodeInit
Litz_Destroy           STIGLitz_SetBackground    STIGLitz_Sprite_MaskGen   STIGLitz_UART_Init
Litz_DumpFrameBuffer   STIGLitz_FillRectangle    STIGLitz_Sprite_TextOut   STIGLitz__UART_Dequeue_Rx
Litz_DrawLine          STIGLitz_DrawSprite       STIGLitz_Sprite_intOut    STIGLitz_UART_Enqueue_Tx
Litz_DrawCircle        STIGLitz_DrawSprite_OVR   STIGLitz_GIFDecode        STIGLitz_UART_Enqueue_String_Tx

Figure 10 shows the software structure which allows the application to gather work (graphics primitives to render) to perform during video refresh. The application program specifies if rendering work is to be performed immediately or can be deferred by setting a flag (e.g. DrawLineDefFlag) in STIGLitz. For deferral, parameters for each deferred primitive are saved in the appropriate queue. DrawLine is split into five sub-functions based on line type in order to simplify integration.
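The deferral mechanism can be pictured as a flag that routes each request either to the renderer or to a parameter queue. The sketch below is an illustration under assumptions: only the flag name DrawLineDefFlag comes from the text, while the types, the queue depth, and the function names are hypothetical, and immediate rendering is represented by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINE_Q_SIZE 8u   /* hypothetical queue depth */

typedef struct { uint8_t x0, y0, x1, y1; } LineParams;

/* Fixed-size circular queue of deferred line requests. */
typedef struct {
    LineParams items[LINE_Q_SIZE];
    uint8_t head, tail, count;
} LineQueue;

typedef struct {
    bool      DrawLineDefFlag;   /* flag name from the text; true = defer */
    LineQueue line_q;
    unsigned  immediate_draws;   /* stands in for real rendering here */
} Gfx;

/* Append a request; returns false if the queue is full. */
bool line_q_put(LineQueue *q, LineParams p)
{
    if (q->count == LINE_Q_SIZE)
        return false;
    q->items[q->tail] = p;
    q->tail = (uint8_t)((q->tail + 1u) % LINE_Q_SIZE);
    q->count++;
    return true;
}

/* Remove the oldest request; returns false if the queue is empty. */
bool line_q_get(LineQueue *q, LineParams *out)
{
    if (q->count == 0)
        return false;
    *out = q->items[q->head];
    q->head = (uint8_t)((q->head + 1u) % LINE_Q_SIZE);
    q->count--;
    return true;
}

/* Render now, or save the parameters for rendering during video refresh. */
bool draw_line(Gfx *g, LineParams p)
{
    if (g->DrawLineDefFlag)
        return line_q_put(&g->line_q, p);
    g->immediate_draws++;        /* placeholder for the immediate renderer */
    return true;
}
```

In this sketch a full queue simply rejects the request; how STIGLitz handles that case is not described in the text.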

Figure 10. Timeline shows which function is active when. ISR calls dispatcher, which then calls or resumes an integrated function which refreshes the display and services the UART.

Figure 11. The dispatcher function is called once per scan line to process work in the graphics queues or to resume execution of a previously started integrated function. If there is no work, a default function (PumpPixel_UART) is called.

A periodic timer-based interrupt triggers an ISR (T_OVF1 in ntsc20MHz.s) which generates the video signal. Its two responsibilities are to draw a full field (which takes 16.17 milliseconds and occurs 60 times per second) and to generate the equalization pulses of the vertical synchronization signal, as seen in Figure 2. To generate a scan line the ISR calls the subroutine Dispatch (in dispatcher.s). As shown in Figure 12, this function first determines if it is in the middle of executing an integrated thread. If so, it resumes that thread using a coroutine call. Two types of cocall are used; one swaps all 32 registers while the other swaps only 24 (for speed reasons). If no integrated thread is executing, the dispatcher examines the queues and selects one of the integrated functions (if data is present in the queue) or else a dedicated busy-wait refresh function. Note that these functions have also been integrated with polling server code which services USART1 using serial communication queues UART_tx_q and UART_rx_q. The chosen thread then reads video data from the frame buffer in memory and sends it out to the CRT through the DAC.
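The dispatcher's decision can be summarized as a three-way choice. The sketch below models only that choice in portable C; the real dispatcher is AVR assembly that performs coroutine calls, which this sketch does not attempt. All names, the number of queues, and the rule of scanning queues in index order are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_WORK_QUEUES 3   /* hypothetical number of graphics work queues */

typedef enum {
    ACT_RESUME_THREAD,      /* cocall back into a partly executed integrated function */
    ACT_START_INTEGRATED,   /* start an integrated function for queued work */
    ACT_DEFAULT_REFRESH     /* no work: busy-wait refresh (PumpPixel_UART) */
} DispatchAction;

typedef struct {
    bool     thread_in_progress;         /* an integrated thread was suspended */
    unsigned pending[NUM_WORK_QUEUES];   /* items waiting in each queue */
} DispatchState;

/* Decide what to run for the next scan line.
   *queue_index receives the selected queue, or -1 if none was selected. */
DispatchAction dispatch_choose(const DispatchState *s, int *queue_index)
{
    assert(s != NULL && queue_index != NULL);
    *queue_index = -1;
    if (s->thread_in_progress)
        return ACT_RESUME_THREAD;
    for (int i = 0; i < NUM_WORK_QUEUES; i++) {
        if (s->pending[i] > 0) {
            *queue_index = i;            /* first non-empty queue wins (assumed) */
            return ACT_START_INTEGRATED;
        }
    }
    return ACT_DEFAULT_REFRESH;
}
```

Whichever branch is taken, the selected function still outputs one row of video, so the refresh timing does not depend on the amount of queued work.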

2.3.2 Integrated Functions

Figure 12. Overview of which functions are integrated together to create segmented integrated functions which service the USART, refresh the display and perform graphics rendering.

Figure 13 shows how PumpPixel, the USART polling servers, and several graphics rendering functions are integrated together to create the functions called by the dispatcher. These functions service the UART and generate video as needed. An integrated version of the function PumpPixel is called once per scan line (262 times per video refresh interrupt) to send out a row of video data from the frame buffer. At 620 cycles, the idle time within a single call to PumpPixel_UART is too short for most graphics rendering primitives. For example, rendering an 80-pixel long x-major line takes 435 us. As a result, we partition the graphics primitive functions during STI to allow partial progress.
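A rough calculation shows the scale of the mismatch. The sketch below assumes that the 435 us figure is pure execution time at the 20 MHz clock and that every idle cycle in a scan line could be used for rendering; neither assumption is confirmed by the text, so the result is only a lower bound. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CLOCK_MHZ            20u    /* system clock (from the text) */
#define IDLE_CYCLES_PER_LINE 620u   /* idle time per PumpPixel_UART call */

/* Convert a duration in microseconds to clock cycles. */
uint32_t us_to_cycles(uint32_t us)
{
    return us * CLOCK_MHZ;
}

/* Fewest scan lines whose combined idle time covers the given work. */
uint32_t min_scan_lines(uint32_t work_cycles)
{
    return (work_cycles + IDLE_CYCLES_PER_LINE - 1u) / IDLE_CYCLES_PER_LINE;
}
```

For the 80-pixel x-major line, 435 us corresponds to 8700 cycles, which is about 14 times the 620-cycle idle budget, so the work must be spread over at least 15 scan lines.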

2.4 Debugging Support

STIGLitz generates various debugging signals on Port D to help determine processor activity. A digital oscilloscope is invaluable for monitoring these signals. Bit 4 is a one when the video ISR or a function it calls is active. Bit 5 is a one when the dispatcher is active. Bit 6 is a one when PumpPixel_UART is active; this indicates there is no work in the rendering queues. Bit 7 is used for various general debugging purposes. It is defined in stiglitz_defs.h and can be configured to be active when specific segments of integrated functions execute.
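The bit assignments above map naturally onto a set of masks. The sketch below uses a plain variable in place of the Port D output register so that it runs on a host; the mask and function names are hypothetical and are not the definitions in stiglitz_defs.h.

```c
#include <assert.h>
#include <stdint.h>

/* Debug signal masks, following the bit assignments in the text. */
enum {
    DBG_ISR_ACTIVE        = 1u << 4,   /* video ISR or a function it calls */
    DBG_DISPATCHER_ACTIVE = 1u << 5,   /* dispatcher */
    DBG_PUMPPIXEL_UART    = 1u << 6,   /* default refresh: queues are empty */
    DBG_GENERAL           = 1u << 7    /* general-purpose debugging */
};

/* Stand-in for the Port D output register. */
uint8_t debug_port;

/* Raise one or more debug signals. Only bits 4..7 are debug signals. */
void debug_set(uint8_t mask)
{
    assert((mask & 0x0Fu) == 0);
    debug_port = (uint8_t)(debug_port | mask);
}

/* Lower one or more debug signals. */
void debug_clear(uint8_t mask)
{
    assert((mask & 0x0Fu) == 0);
    debug_port = (uint8_t)(debug_port & ~mask);
}
```

Bracketing a routine with debug_set and debug_clear turns its execution time into a pulse width that can be read directly on the oscilloscope.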

3. PERFORMANCE ANALYSIS

We evaluated the timing accuracy of the integrated code through oscilloscope-based timing measurements and empirical testing; the video signal successfully drives all the television sets tested.

Figure 13. Performance of graphics rendering code shows integration increases performance by 4x to 13x.

We measured the line-drawing performance of STIGLitz using the benchmark function SpeedTest in main.c. This function responds to key presses on the STK500 board and draws a set of 128 lines with pixel lengths evenly distributed from 1 to 240 (horizontal), 130 (vertical), 170 (diagonal) and 134 (x-major). Figure 14 shows the rendering performance for two design alternatives: the original discrete rendering and display refresh, and integrated rendering and refresh. In the first case, the graphics primitives are rendered with discrete (non-integrated) code, which can run only when the video refresh ISR is not active, that is, during only 1.8% of the total time. The second case uses integrated code when possible to render graphics primitives. Integration speeds up rendering by 3.99x to 13.54x over the discrete case.

The variation in speed-up comes from the amount of rendering work performed per segment and the number of segments needed per secondary thread. Each loop with an unknown iteration count requires at least one segment; this is very inefficient if the loop’s execution time is much less than the idle time of the segment. The horizontal, vertical and diagonal functions all contain a single such loop with a single level of conditionals, allowing efficient integration and only three segments. The x-major function has a much more complex CDG and contains four such loops, one doubly nested. These loops require the formation of nine segments, wasting much of the available idle time. Also, horizontal rendering seems slow because it draws more long lines than the other functions.
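The four line types benchmarked here differ in how the endpoints relate to each other. The sketch below shows one way a line could be assigned to a type from its endpoint deltas. It is an assumption for illustration only: the source does not show its classification code, and the y-major case is a guess at a remaining line type, not something named in this section.

```c
#include <stdlib.h>

typedef enum {
    LINE_HORIZONTAL,
    LINE_VERTICAL,
    LINE_DIAGONAL,
    LINE_X_MAJOR,
    LINE_Y_MAJOR    /* assumed; not named in the text */
} LineType;

/* Classify a line by comparing the magnitudes of its x and y extents. */
LineType classify_line(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0);
    int dy = abs(y1 - y0);

    if (dy == 0)
        return LINE_HORIZONTAL;   /* also covers a single point */
    if (dx == 0)
        return LINE_VERTICAL;
    if (dx == dy)
        return LINE_DIAGONAL;     /* 45-degree line */
    return (dx > dy) ? LINE_X_MAJOR : LINE_Y_MAJOR;
}
```

The first three types advance one or both coordinates on every step and need no error term, which is consistent with those functions having a simpler loop structure than the x-major case.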

Table 4. Sizes of Original and Integrated Functions

Function             Original Size (bytes)   Padded Size   Integrated Size   Code Expansion Ratio
DrawHorizontalLine   212                     260           3120              14.72
DrawVerticalLine     182                     246           2718              14.93
DrawDiagonalLine     214                     262           3058              14.29
DrawXMajorLine       758                     1010          9810              12.94

Table 4 shows how code sizes of functions increase by a factor of 13x to 15x after integration. Various factors contribute to the increase, including padding, loop unrolling and splitting, and code replication into conditionals. Although these code size increases are significant, they apply only to integrated functions, and are an acceptable price to pay given the dramatic performance improvement. Overall, STIGLitz uses 73 kilobytes of ROM and 40 kilobytes of SRAM. About 16 kilobytes of SRAM are used for the GIF decoder, which could be deleted if not needed.
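The ratios in Table 4 are consistent with dividing the integrated size by the original (unpadded) size. A small check of that reading, using integer arithmetic and a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Code expansion ratio in hundredths, rounded to nearest
   (for example, 1472 means 14.72x). */
uint32_t expansion_ratio_x100(uint32_t original_bytes, uint32_t integrated_bytes)
{
    assert(original_bytes > 0);
    return (integrated_bytes * 100u + original_bytes / 2u) / original_bytes;
}
```

For DrawHorizontalLine, 3120 / 212 gives 14.72, matching the table; the other three rows agree in the same way.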

[9] A. Dean. “STIGLitz Project Manual,” CESR Technical Report, www.cesr.ncsu.edu, June 2003.

[10] Barret Krupnow, Jimmy Hill, Craig Nowell, Paul Lee. “Video Software Thread Integration,” CESR Technical Report, www.cesr.ncsu.edu, December 2002.

[11] Nagendra J. Kumar, Siddhartha Shivshankar and Alexander G. Dean. “Asynchronous Software Thread Integration for Efficient Software Implementations of Embedded Communication Protocol Controllers,” Center for Efficient, Scalable and Reliable Computing Technical Report, NC State University ECE Department, May 2003.

[12] Ed Nisley, “Rising Tides”, Dr. Dobb’s Journal, #346, March 2003

[13] B. Welch, S. Kanaujia, A. Seetharam, D. Thirumalai, A. Dean, “Extending STI for Demanding Hard-Real-Time Systems,” International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2003), November 2003

[14] J.E. Bresenham, “Algorithm for Computer Control of a Digital Plotter,” IBM Systems Journal, 4(1), 1965, pp. 25-30

[15] Robert Lacoste, “PIC’Spectrum Audio Spectrum Analyzer,” Circuit Cellar, September 1998, #98, pp. 24-31

[16] Robert Lacoste, “The XY-Plotter: Drive High-Resolution LCDs for Less,” Circuit Cellar, September 2003, #133, pp. 42-51

[17] Ricard Gunee, “Software Generated Video,” http://www.rickard.gunee.com/projects/

[18] Don Lancaster, Cheap Video Cookbook, Howard W. Sams & Co. Inc., 1978

[19] Don Lancaster, Don Lancaster’s Tech Musings, #134, 1999, www.tinaja.com

[20] Bruce Land, “AVR Video Generator: Teaching Programming and Graphics,” Circuit Cellar, January 2003, #150, pp. 40-43

[21] David Thomas, http://dt.prohosting.com/pic/pong.html and http://dt.prohosting.com/pic/vidclock.html

[22] Alberto Riccibitti, http://www.geocities.com/CapeCanaveral/Launchpad/3632/dvm.htm

[23] Eric Smith, PIC-Tock and PIC-Pong, http://www.brouhaha.com/~eric/pic/

[24] Ubicom Video Virtual Peripheral Design Challenge and Contest, http://www.sxlist.com/techref/ubicom/contests.htm

[25] Tom Napier, “Use Frequency Modulation to Send ASCII Data,” Circuit Cellar, January 2003, #150, pp. 12-16
