
Interactive Parallelization of Embedded Real-Time Applications Starting from Open-Source Scilab & Xcos

Oliver Oey, Michael Rückauer, Timo Stripf, Jürgen Becker
emmtrix Technologies GmbH, Karlsruhe, Germany
{ oey, rueckauer, stripf, juergen.becker }@emmtrix.com

Clément David, Yann Debray
ESI Group, Paris, France
{ clement.david, yann.debray }@esi-group.com

David Müller, Umut Durak
German Aerospace Center (DLR), Braunschweig, Germany
{ david.mueller, umut.durak }@dlr.de

Emin Koray Kasnakli, Marcus Bednara, Michael Schöberl
Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany
{ koray.kasnakli, marcus.bednara, michael.schoeberl }@iis.fraunhofer.de

Abstract—In this paper, we introduce the workflow of interactive parallelization for optimizing embedded real-time applications for multicore architectures. In our approach, the real-time applications are written in the Scilab high-level mathematical & scientific programming language or with a Scilab Xcos block-diagram approach. By using code generation and code parallelization technology combined with an interactive GUI, the end user can map applications to the multicore processor iteratively. The approach is evaluated on two use cases: (1) an image processing application written in Scilab and (2) an avionic system modeled in Xcos. Using the workflow, an end-to-end model-based approach targeting multicore processors is enabled, resulting in a significant reduction in development effort and high application speedup. The workflow described in this paper is developed and tested within the EU-funded ARGO project focused on WCET-Aware Parallelization of Model-Based Applications for Heterogeneous Parallel Systems.

Keywords—multicore processors; embedded systems; automatic parallelization; code generation; Scilab; Xcos; model-based design

I. INTRODUCTION

Developing embedded parallel real-time software for multicore processors is a time-consuming and error-prone task. Some of the main reasons are:

1. It is hard to predict the performance of a parallel program and therefore hard to determine whether real-time timing constraints are met.

2. New potential errors like race conditions and deadlocks are introduced. These errors are often hard to reproduce and therefore hard to test for.

3. The parallelization approach of an application is optimized for a specific number of cores, resulting in a high porting effort whenever the number of cores needs to change.

In this paper, we want to demonstrate parts of the ARGO approach to show a simple way of developing applications with real-time constraints. The major benefits of this flow are the more abstract modeling of the application (compared to plain C), which makes programming easier; automatic algorithms that handle many error-prone tasks; and great flexibility regarding the target platforms.

A. State of the Art: Model-Based Design

Model-Based Design refers to the development of embedded systems starting from a high-level mathematical system model. It is a subset of a larger concept called Model-Based Systems Engineering. Model-Based Design has seen rising interest from industry in the last couple of decades, especially in the aeronautics, automotive and process industries, which use more and more electronics and software.

The main reason for this trend is the possibility to manage the development process from a higher point of view, abstracting away the low-level design of systems. This saves time and costs, but has disadvantages in terms of control over the know-how. With the rising complexity of the systems integrated in today's and tomorrow's products, this abstraction layer shifts the design challenges to the tool vendors and technology providers, and the real-time requirements need to be addressed in a collaborative manner at both the hardware and the software level.

We have recently observed a consolidation of the tool vendor market in favour of Product Lifecycle Management players such as Dassault Systèmes and Siemens PLM (the latter acquiring Mentor Graphics in 2017 for $4.5 billion). Two specialists in the simulation and analysis segment remain independent and provide the most appealing solutions for Model-Based Design for both aeronautics and automotive, namely Ansys SCADE (from the acquisition of Esterel Technologies) and MATLAB Simulink.

Scilab Xcos represents an open-source alternative to those dynamic system modeling & simulation solutions (for both time-continuous and discrete systems). It is also packaged with domain-specific libraries for signal processing and control systems. It builds on the same numerical kernels as MATLAB for matrix computation and linear algebra, LAPACK & BLAS [1]. Xcos provides a graphical block diagram editor for modeling systems. The blocks contain functional descriptions of the components of the system, in the form of transfer functions, and the blue links between the blocks convey signal data at every step of the clock synchronizing the simulation. Time synchronization is propagated by red links from the clock (a special block) to the blocks that require this information in their behaviour. The particularity of Xcos in comparison with Simulink is its asynchronous behaviour: it is possible in Xcos to represent different time-sampling clocks and thus to model the asynchronism of embedded systems.

B. State of the Art: Parallel Programming with Real-Time Constraints

In practice, real-time embedded implementations of imaging applications are achieved in the following way: starting from a high-level model in MATLAB or Scilab, the algorithms are modified for constant runtime. Especially with complex algorithms, data-dependent computation is present. These data-dependent processing elements need to be identified, and conditional execution needs to be rewritten; for example, execution of both branches and mask-based combination of the results must be implemented manually. This code is then ported to embedded C/C++ code and further optimized for the target platform. The parallelization is carried out manually by distributing the work across the target architecture. This manual process is time-consuming and error-prone, and the result is fixed to a single architecture.

Parallelizing applications for embedded systems with real-time constraints is a broad topic with several different approaches. The parMERASA project uses well-analyzable parallel design patterns [2] to parallelize industrial applications [3]. The patterns cover different kinds of parallelism (e.g. pipeline, task or data parallelism) as well as synchronization idioms like ticket locks or barriers. In doing so, existing legacy code can be parallelized and executed on timing-predictable hardware under real-time constraints. Using these well-known parallel design patterns eases the calculation of the worst-case execution time (WCET).

The work of [4] proposes compiler transformations to partition the original program into several time-predictable intervals. Each such interval is further partitioned into a memory phase (where memory blocks are prefetched into the cache) and an execution phase (where the task does not suffer any last-level cache miss and does not generate any traffic on the shared bus). As a result, a bus transaction scheduled during the execution phases of all other tasks does not suffer any additional delay due to bus contention.

The work of [5] generates a bus schedule to improve both the average-case execution time (ACET) and the worst-case execution time (WCET) of an application. This technique improves the ACET while keeping the WCET as small as possible.

Other approaches define extensions for programming languages in order to describe different kinds of parallelism within the program. In [6], an OpenMP-inspired infrastructure is introduced that allows annotating parallelism in the source code in order to automatically extract data dependencies and insert synchronization.

In this paper, we introduce a semi-automatic, interactive parallelization approach for applications written in an abstract programming language or model. It covers a subset of the ARGO toolchain; although complete WCET analysis of the sequential and parallel program is still missing, transformations that optimize the WCET and WCET-aware scheduling can already be used for applications with real-time requirements.

II. APPLICATION USE CASES

A. Polarization Image Processing

This application is a specialized image processing system for image data originating from a novel polarization image sensor (POLKA) developed at Fraunhofer IIS [7]. This camera is used in industrial inspection, for example in in-line glass [8] and carbon fiber [9] quality monitoring. Polarization image data is significantly different from 'traditional' (i.e. color) image data and requires widely different, and significantly more computation-intensive, processing operations, as shown in Fig. 1.

A gain/offset correction is performed on each pixel to equalize sensitivity and linearity inhomogeneities. For this purpose, additional calibration data is required (G/O input in Fig. 1). Since each pixel only provides a part of the polarization information of the incoming photons, the unavailable information is interpolated from the surrounding pixels (similar to Bayer pattern interpolation on color image sensors). From the interpolated pixel values we then compute the Stokes vector, which provides the complete polarization information of each pixel. By appropriate transformations, the Stokes vectors are converted into the degree of linear polarization (DOLP) and the angle of linear polarization (AOLP). These parameters are usually the starting point for any further application-dependent processing (not shown here). For demonstration purposes, we convert AOLP and DOLP into an RGB color image that can be used for visualizing polarization effects.

Fig. 1 Exemplary polarization image processing pipeline: raw data and G/O data → pixel preprocessing → gain/offset correction → interpolation → Stokes vectors → AOLP and DOLP → RGB conversion → RGB image
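The per-pixel math behind the Stokes, DOLP and AOLP steps is standard and is sketched below in C, assuming a four-orientation (0°/45°/90°/135°) polarizer pattern and the textbook Stokes formulas; the actual POLKA pixel layout, calibration and interpolation are sensor-specific and not reproduced here. The atan2 call is one of the trigonometric operations whose cost is discussed below.

#include <math.h>

/* Per-pixel polarization math: a minimal sketch, assuming a
 * four-orientation (0/45/90/135 degree) polarizer pattern; the actual
 * POLKA layout and calibration are not part of this illustration. */
typedef struct { double dolp; double aolp; } pol_t;

pol_t stokes_to_pol(double i0, double i45, double i90, double i135)
{
    /* Linear Stokes components from the four intensity samples. */
    double s0 = 0.5 * (i0 + i45 + i90 + i135); /* total intensity         */
    double s1 = i0 - i90;                      /* horizontal vs. vertical */
    double s2 = i45 - i135;                    /* diagonal components     */

    pol_t p = { 0.0, 0.0 };
    if (s0 > 0.0) {
        p.dolp = sqrt(s1 * s1 + s2 * s2) / s0; /* degree of linear polarization */
        p.aolp = 0.5 * atan2(s2, s1);          /* angle of linear polarization  */
    }
    return p;
}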

Polarization image processing is currently used in industrial inspection; for example, in-line glass inspection is depicted in Fig. 2 and Fig. 3. Glass products are transported at up to 10 items per second while images are captured. Typically, a single inspection PC handles multiple cameras and requires processing capabilities of at least 20 fps. Currently, this rate can be achieved for one camera, but when multiple camera outputs are processed by one PC, or in use cases where the number of output measurement frames increases, it can drop to 6-10 fps. This forces us to reconsider and investigate further optimization possibilities for each use case. Our aim is to achieve a minimum of 25 fps as a hard constraint, independent of the use case and the processing elements in the algorithm chain. This is a hard constraint, knowing that without any optimizations and parallelization we can only achieve around 6 fps.

Fig. 2 shows the POLKA polarization camera, with glass measurements performed in a single shot per item. Since this is a measurement device, the precision of the measured data is of utmost importance. Therefore, the standard algorithm is further adapted for each sensor, and the polarization data is further processed for different use cases. Especially the trigonometric computation leads to a large computation overhead.

An alternative based on a number of COTS cameras is shown in Fig. 3. This system complements the POLKA capabilities with increased spatial resolution and lower system cost. This construction, however, requires additional image fusion; the required registration and alignment further increase the computational complexity of the measurement operation [10]. In both cases, the underlying algorithms need to be adapted to each use case, starting from the Scilab high-level algorithmic description all the way down to the embedded C/VHDL implementation.

B. Enhanced Ground Proximity Warning System

An Enhanced Ground Proximity Warning System (EGPWS) is one of various Terrain Awareness and Warning Systems (TAWS) and defines a set of features which aim to prevent Controlled Flight Into Terrain (CFIT). This type of accident was responsible for many fatalities in civil aviation until the FAA made it mandatory for all turbine-powered passenger aircraft registered in the U.S. to have TAWS equipment installed [11]. There are various TAWS options available on the market, for various platforms and in various configurations. The core feature set of an EGPWS is to create visual and aural warnings between 30 ft and 2450 ft Above Ground Level (AGL) in order to avoid controlled flight into terrain. These warnings are categorized into five modes:

1. Excessive Descent Rate: warnings for excessive descent rates in all phases of flight.

2. Excessive Terrain Closure Rate: warnings to protect the aircraft from impacting the ground when terrain is rising rapidly with respect to the aircraft.

3. Altitude Loss After Take-off: warnings when a significant altitude loss is detected after take-off or during a low-altitude go around.

4. Unsafe Terrain Clearance: warnings when there is insufficient terrain clearance regarding the phase of flight, aircraft configuration and speed.

5. Excessive Deviation Below Glideslope: warnings when the aircraft descends below the glideslope.

Fig. 2 Inline glass inspection with POLKA

Fig. 3 Inline glass inspection with COTS cameras

Fig. 4 Reduced ARGO EGPWS Scilab Xcos block diagram

Additionally, an EGPWS provides some enhanced functions, like the Terrain Awareness Display and Terrain Look Ahead Alerting based on a terrain database.

Fig. 4 shows a reduced Scilab Xcos model as it was used for debugging during the development of the ARGO EGPWS, in this case for the Mode 1 block. Fig. 5 gives an understanding of the corresponding algorithm. The three aircraft have the same altitude of about 2000 ft, but different Rates of Descent, which is demonstrated by their position in the graph. While the green aircraft is in a safe flight state, the orange one’s Rate of Descent causes a warning. The red aircraft, however, is sinking much too fast considering its low altitude, requiring immediate action by the pilot.

Most important among the Terrain Awareness features is the Terrain Awareness Display. It is not a separate device, but an enhancement to the Navigation Display (ND) that already exists in a conventional airliner cockpit. As a background to the displayed information, an abstracted image of the terrain ahead can be turned on by the push of one button. The range of the ND can be as little as 10 nm and as much as 160 nm (18.5 km or 296 km, respectively), which then also applies to the radius of the semicircular terrain image.

The first step of the terrain visualization is the extraction of an area of interest (AOI) from the database, based on the position and orientation of the aircraft. The range set on the ND is also important, as it determines the size of the AOI. Given the level of detail of the database, which is just above 90 m between data points, the AOI's size can range between 200 by 400 and 6400 by 12800 points. The elevation data of each point in the AOI is compared to the aircraft's altitude to create a color map, which has to be converted to an image with a much lower resolution in order to be displayed on the ND. The conversion keeps the highest elevation point in a given part of the AOI to make sure that no critical elevation information is lost.
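This conversion is a per-cell maximum reduction; a minimal C sketch follows, assuming a row-major float elevation grid and output dimensions that evenly divide the AOI (types and grid layout are illustrative assumptions, not the ARGO implementation):

/* Downsample an AOI elevation grid to display resolution by keeping
 * the maximum elevation per output cell, so that no critical peak is
 * lost. The subsequent color-mapping step is omitted. */
void downsample_max(const float *aoi, int rows, int cols,
                    float *out, int out_rows, int out_cols)
{
    int br = rows / out_rows;   /* block height per output cell */
    int bc = cols / out_cols;   /* block width per output cell  */
    for (int r = 0; r < out_rows; r++) {
        for (int c = 0; c < out_cols; c++) {
            float max = aoi[(r * br) * cols + (c * bc)];
            for (int i = 0; i < br; i++) {
                for (int j = 0; j < bc; j++) {
                    float e = aoi[(r * br + i) * cols + (c * bc + j)];
                    if (e > max) max = e;
                }
            }
            out[r * out_cols + c] = max;
        }
    }
}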

Another feature is the Terrain Look Ahead Alerting. A virtual box predicting various possible flight paths for the next 60 seconds flies ahead of the aircraft. By checking the box for collisions with the covered terrain points in the AOI, the system is able to alert the pilot early enough before a terrain collision occurs. The principle is shown in Fig. 6.

III. INTERACTIVE PARALLELIZATION WORKFLOW

The interactive parallelization workflow, as shown in Fig. 7, is designed to assist the user with the parallelization process through abstraction and automation. Algorithm development can be performed using abstract, mathematical programming languages like Scilab or MATLAB or their respective model-based extensions Xcos and Simulink. This allows focusing on the functionality, while timing and hardware-specific optimizations are handled later in the tool chain.

Fig. 5 Graph depicting the foundation for the implementation of Mode 1, Excessive Descent Rate

Fig. 6 Collision detection based on comparison of the terrain database with a box-shaped flight path prediction

Fig. 7 Overview of the interactive parallelization flow

A. Front end

The front-end tools parse Scilab and Xcos files in order to transform them into a functionally equivalent sequential C code representation. Constraints from the end user are taken into account for front-end transformations, and potential additional information from the Scilab source code is preserved as pragma-based source code annotations. The generated C representation uses a subset of the C99 standard, excluding constructs like function pointers and pointer arithmetic, which can dramatically reduce compile-time predictability.

The Xcos-to-Scilab code generation is a Scilab toolbox reusing the Xcos model transformation. It takes an Xcos diagram, a sub-system name and a configuration Scilab script as input, and outputs Scilab code for both the scheduling and the block implementations of the selected sub-system. The generated Scilab code is later used as input for generating C code with the Scilab-to-C front end.

The Scilab-to-C code generation produces efficient, comprehensible and compact embedded C code from Scilab code. It supports a wide range of Scilab language features and extensions as well as embedded processor architectures. Developers can easily integrate the C code into existing projects for embedded systems or test it as a standalone application on the PC, as the code does not yet contain optimizations for any specific target platform.

The C code generator can analyze the worst-case execution count of each block of the generated C code. The analysis uses value range information from sparse conditional constant propagation (SCCP). The value range information contains the maximum values of variables that affect, e.g., the maximum or worst-case execution count of for loops. If no worst-case information can be derived automatically, special functions can be used to manually specify worst-case information within the Scilab code. The result of the analysis is emitted as pragmas into the generated C code (illustrated below). Furthermore, all data accesses are taken into account in order to generate code with static memory allocation.

B. Parallelization

The parallelization tool generates statically scheduled parallel C code for a specific target platform. A user can control the process through a graphical representation of the program, as can be seen in Fig. 8. The width of the blocks represents the duration of the sequential program as calculated using a performance model of the hardware platform. Hierarchies on the y-axis show different control structures like function calls, loops or if blocks. We use the term task to describe a unit of work. During the later code generation, tasks are clustered for the individual cores and, depending on the configuration or the target's operating system, form threads or processes.

A user can interact with the parallelization process in several ways:

• Assigning core constraints to tasks in order to enforce or forbid the execution of a task on a specific core.

• Setting cluster constraints in order to limit the granularity on which the automatic parallelization algorithm works.

• Applying code transformations to specific code blocks of the program. More details about this concept are described later in this section.

Fig. 8 Hierarchical representation of a sample program

Fig. 9 Example of a scheduling view

As the basis for the graphical representation and for the automatic scheduling, the well-known hierarchical task graphs (HTG) [12] are used. Their main concept is hiding the complexity caused by cyclic dependencies through the introduction of hierarchies. For each loop, a new hierarchy level is created, and the loop is placed inside. Task dependencies can only connect tasks on the same hierarchy level. By introducing these new hierarchies for loops, cycles on the same level are avoided. This representation eases the analyzability of the whole program and enables more accurate performance predictions, which are necessary to meet the real-time constraints.

We handle the scheduling with a modified version of the established Heterogeneous Earliest Finish Time (HEFT) algorithm [13]. It prioritizes the execution of tasks with a high rank, which is defined by the task's computing cost, its number of succeeding tasks and the overall communication costs for the necessary variables (sketched below). Being a greedy algorithm, it can fail to find the optimal solution, but it has the advantage of a fast execution time, which is key for the interactivity with the user. The modified HEFT algorithm is able to handle hierarchical structures and to take into account core and cluster constraints assigned by the user. An example of a resulting mapping and schedule can be seen in Fig. 9. For each core of the target platform, the mapping of tasks over time is shown. Arrows represent data and control dependencies to guide the user through the parallelization process. As a reference, the sequential execution time of the program is shown on the right-hand side of the figure. All user interaction described for the HTG view is also applicable to the scheduling view. By generating a static schedule of the whole program, our flow does not rely on the scheduler of the operating system.
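For reference, the priority used by classical HEFT is the task's upward rank, sketched below in C; the hierarchy handling and the core/cluster constraints of our modified variant are omitted, and the graph representation is an assumption for illustration.

#define MAX_SUCC 8

typedef struct task {
    double cost;                  /* computation cost of the task       */
    int n_succ;                   /* number of successor tasks          */
    struct task *succ[MAX_SUCC];  /* successor tasks in the DAG         */
    double comm[MAX_SUCC];        /* communication cost per edge        */
    double rank;                  /* initialize to -1.0 before calling  */
} task_t;

/* rank_u(t) = cost(t) + max over successors s of (comm(t,s) + rank_u(s));
 * tasks with higher rank are scheduled first, each on the core that
 * gives the earliest finish time. */
double upward_rank(task_t *t)
{
    if (t->rank >= 0.0) return t->rank;   /* already computed (memoized) */
    double max = 0.0;
    for (int i = 0; i < t->n_succ; i++) {
        double r = t->comm[i] + upward_rank(t->succ[i]);
        if (r > max) max = r;
    }
    t->rank = t->cost + max;
    return t->rank;
}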

The performance estimation used for the parallelization is based on the worst-case execution counts determined by the front-end tools. The data is acquired by a combination of static analysis of the source code and profiled execution on the host platform. In doing so, the number of iterations of each loop can be determined. Additionally, a performance model of the execution times of instructions on the platform is used to perform a static analysis of the complete sequential program in order to determine the runtime of the program on the target platform. The execution times of instructions were measured directly on the target platforms. These measurements also take into account different types of cores and memory configurations.

In compiler design, a code transformation is typically applied to the whole program. In the context of ARGO and parallelization, this behavior is problematic. A transformation exposing coarse-grain parallelism only makes sense on code regions that require more coarse-grain parallelism. In all other locations, it would have a negative effect, i.e. performance overhead, larger code size or memory footprint, or incompatibility with other optimizing transformations. An example of this is splitting a for loop into several independent for loops (illustrated in the sketch below). The potential for parallelism is increased, as these new loops can be executed in parallel. However, this usually comes at the cost of additional temporary or duplicated variables, which have a negative impact on the performance and/or the memory footprint of the application when all loops are executed sequentially. Therefore, we need a concept for selectively applying code transformations only to code regions where it makes sense from the global schedule point of view. Thereby, we must solve the phase-ordering problem, since the code transformations are applied before scheduling and mapping. Furthermore, the order of applying code transformations on a specific code region or code position must be controllable.
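A hand-written illustration of such loop splitting (loop fission with scalar expansion); in the toolchain the transformation is applied automatically:

/* Before: one loop in which both statements share a scalar value. */
void before(const double *in, double *a, double *b, int n)
{
    for (int i = 0; i < n; i++) {
        double t = in[i] * in[i];  /* value used by both statements */
        a[i] = t + 1.0;
        b[i] = t - 1.0;
    }
}

/* After: three loops; once the first has produced t, the loops for a
 * and b are independent and can run on different cores. The scalar t
 * is expanded into a temporary array, which costs memory and hurts
 * performance if all loops stay on one core. */
void after(const double *in, double *a, double *b, double *t, int n)
{
    for (int i = 0; i < n; i++)
        t[i] = in[i] * in[i];      /* expanded temporary variable */
    for (int i = 0; i < n; i++)
        a[i] = t[i] + 1.0;
    for (int i = 0; i < n; i++)
        b[i] = t[i] - 1.0;
}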

We solve the problem by using a code transformation framework that applies all potential transformations in a single pass. In a top-down approach, the pass visits each task, where first all potential transformations are analyzed for applicability. For example, a simple loop unrolling transformation can only be applied to for-loop blocks matching a specific init, step and condition template. If a transformation candidate is found, the task is marked as having a potential transformation, which can then be set in the HTG view. In parallel, it is checked whether a decision value was set for the transformation in the GUI during a previous iteration. Based on this value, the corresponding transformation is applied to the task. Afterwards, all children of the block are visited. This approach opens up a large design space in which several transformations can be applied to different tasks of the program. In this first iteration of the flow, the user can dynamically select transformations for tasks and gets feedback about the performance impact through the scheduling view. All available transformations, like loop splitting, tiling, fission or unrolling, preserve the predictability of the program, as the new execution times can be calculated from the existing data.

Parallelization can be categorized into different levels, as shown in Fig. 10. We already covered the code transformation level and the task level as well as their impact on the predictability of the program. Above these two, there is the algorithmic level. Many problems can be solved by different algorithms, which may have different performance or memory requirements or can be parallelized differently. A common example is the Fast Fourier Transform (FFT), where a 1024-point FFT can also be calculated with two 512-point FFTs. Both can be calculated in parallel, and we therefore get a different behavior regarding the parallelization. Within our tool flow, the user can choose the algorithm in the interactive view. However, the selection presented to the user is already prepared by the front end, which recognizes functions/algorithms with different implementations and provides them to the interactive GUI. When the user selects a different implementation, the flow starts from the beginning with the new selection, thus recalculating all necessary performance information.

Fig. 10 Parallelization levels

Changes in the communication level of the application are taken into account in the back end of the flow. During the scheduling, only a rough estimation of the chosen communication model is used for the performance prediction.

With all these different parallelization methods, the user is able to iteratively optimize the performance of the application.

C. Back end

The back end of the tool flow covers the communication/synchronization and the generation of parallel C code. Currently, we use a distributed memory model for all target platforms. This means that each core that needs access to a variable has its own copy of it. Data dependencies are analyzed using a static single assignment representation [14], so that all edges that have different cores for the definition and the usage of a variable can be used to insert communication into the program. As both cores have their own version of the variable, this explicit communication is necessary, and it has the benefit of avoiding accesses to the same memory areas from different cores. This greatly enhances the predictability of the resulting parallel program. The timing estimation of the communication overhead is closely coupled with the target platform and its capabilities. Important factors are the operating system, how the data is transferred, and whether the system load affects the timing or not. Two different access patterns can be differentiated (a code sketch follows the list):

• Multiple cores read the same data: in this case, one core is the owner of the data, either by calculating it or by acquiring it from an external interface. The core will then send the data to all other cores that need access to it. The order of the communication is determined by the static schedule to minimize the waiting times of the receiving cores. When the data is initialized at the beginning of the program, this procedure is performed on all cores, and further communication is not necessary.

• Multiple cores modify the same data: as each core has its own copy of a variable, the modifications do not directly affect each other. When data needs to be modified in a specific order, the values are synchronized between the corresponding cores before the modifications. In the case that the variable is an array or a matrix of values, it is split into several independent variables and joined back into one after the processing.
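The sketch below illustrates how such explicit communication looks in principle; the queue primitives ch_send/ch_recv are hypothetical stand-ins, since the actual runtime API is not given here.

/* Explicit point-to-point communication inserted where an SSA def-use
 * edge crosses cores: a minimal sketch with a hypothetical queue API. */
extern void ch_send(int to_core, const void *data, unsigned size);
extern void ch_recv(int from_core, void *data, unsigned size);

/* Core 0 owns x: it defines the value and sends it to the reader. */
void core0(void)
{
    double x = 42.0;            /* definition of x on core 0        */
    ch_send(1, &x, sizeof x);   /* def-use edge crosses to core 1   */
}

/* Core 1 works on its own private copy, so no shared-memory access
 * between the cores ever occurs. */
void core1(void)
{
    double x;                   /* private copy of x on core 1      */
    ch_recv(0, &x, sizeof x);
    /* ... use x ... */
}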

To improve the predictability of parallel programs that contain control structures like loops partially executed by multiple cores, the back end duplicates the control flow on all involved cores. This means rebuilding control structures like loops or if blocks as well as statements like break or continue. When necessary, each iteration of the loop is synchronized by evaluating the condition on one core and sending the result to the other cores. In doing so, each core executes the same number of iterations, which eases predicting the performance of the loop. The generated C code is compiled into a single binary that is executed on each core.
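The following sketch illustrates the duplicated control flow for a data-dependent loop, reusing the hypothetical ch_send/ch_recv primitives from the previous sketch; compute_condition is likewise a placeholder.

extern void ch_send(int to_core, const void *data, unsigned size);
extern void ch_recv(int from_core, void *data, unsigned size);
extern int compute_condition(const double *state);  /* placeholder */

/* Core 0 owns the loop condition and broadcasts each evaluation, so
 * every core executes the same number of iterations. */
void loop_core0(double *local_a)
{
    for (;;) {
        int go = compute_condition(local_a);
        ch_send(1, &go, sizeof go);   /* broadcast condition result */
        if (!go) break;
        /* ... core 0's share of the loop body ... */
    }
}

/* Core 1 rebuilds the same loop structure but receives the condition. */
void loop_core1(double *local_b)
{
    for (;;) {
        int go;
        ch_recv(0, &go, sizeof go);   /* receive condition result   */
        if (!go) break;
        /* ... core 1's share of the loop body, using local_b ...   */
    }
}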

IV. PARALLELIZATION OF APPLICATIONS

A. Polarization Imaging

The algorithm is fairly simple, yet computationally expensive in parts such as the 2D convolutions and the intensity mappings in demosaicing. Nevertheless, the uniformity of the computation over the data array offers a high potential for data parallelization.

The graphical representation of the parallelization is shown in Fig. 11. As can be seen in the hierarchical view, the main processing is a chain of several consecutive steps. Most of them can be parallelized using loop transformations on the task parallelization layer. The resulting schedule can be seen in Fig. 12. All four cores of the target platform are occupied through most of the program, resulting in a speedup of up to around 3 compared to the sequential execution.

In order to achieve such a tight schedule on the core mapping, different tiles of the input image should be processable independently of each other. The toolchain allows us to exploit this without changing the original Scilab code overall, but by adding the necessary functions, provided by the front end, at user-defined positions in the algorithm. Thus, the end user gains access, through the Scilab code, to scheduling and mapping at the hardware level for any desired parts of the code, without requiring knowledge of the underlying specifics of parallelization.

Fig. 11 Graphical representation of polarization imaging

Fig. 12 Schedule of the polarization imaging

In our processing pipeline, we divide the image data after the Pixel Preprocessing step of Fig. 1 into 4 tiles with a variable splitting transformation, each mapped to one core, and use loop splitting to run the tiles in separate loops (see the sketch below). The interpolation step introduces 4 additional image arrays, each of the size of the input image array. These are again divided into 4 tiles and mapped to the corresponding cores. The Stokes calculation step is a linear combination of its predecessor step and reduces the array count to 3. These arrays are again mapped to 4 cores, and so on, until the output arrays AOLP and DOLP are calculated. Thus, there is a high degree of data locality, since the same tile stays on the same core across processing steps.
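The following hand-written C sketch illustrates what the variable and loop splitting produce for one pipeline step; the image size and the per-pixel operation are placeholder assumptions, and the tools generate such code automatically.

#define ROWS 512
#define COLS 512

/* Before: one loop over the whole image. */
void process_all(float img[ROWS][COLS])
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            img[r][c] = 2.0f * img[r][c];  /* stand-in pipeline step */
}

/* After splitting: each row band is an independent task that the
 * scheduler can map to its own core; the same band stays on the same
 * core across pipeline steps, preserving data locality. */
void process_tile(float img[ROWS][COLS], int r0, int r1)
{
    for (int r = r0; r < r1; r++)
        for (int c = 0; c < COLS; c++)
            img[r][c] = 2.0f * img[r][c];
}

void process_split(float img[ROWS][COLS])
{
    process_tile(img, 0, 128);     /* tile 0 -> core 0 */
    process_tile(img, 128, 256);   /* tile 1 -> core 1 */
    process_tile(img, 256, 384);   /* tile 2 -> core 2 */
    process_tile(img, 384, 512);   /* tile 3 -> core 3 */
}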

Once a hierarchical view of the algorithm is built, different constellations of computation blocks on cores can be experimented with from within the GUI to achieve speedup. By clicking on a desired computation block and updating the scheduling view, the corresponding scheduled timing for this block can be zoomed into, and the tasks on the other processing cores can be observed for that time instant.

For example, it is important to control the 2D convolution part of our algorithm such that it is distributed over the 4 cores in a tight schedule. This can only be achieved by trial and error, simulating the different constellations mentioned previously. Fortunately, the GUI of the toolchain also provides an assessment of the expected speedup by showing the estimated runtime ratio between the sequential and the parallel program on the right side, giving a feeling for which variants are worth testing for performance on the target hardware.

B. Enhanced Ground Proximity Warning System (EGPWS)

The EGPWS algorithm is fundamentally different from the image processing algorithm of the first use case. It is implemented in Xcos, and from its dataflow-oriented description one would assume a lot of task-level parallelism.

However, the performance estimation of the algorithm revealed that the Mode 1 to 5 calculations are not computationally intensive. The hot spots are the Terrain Awareness Display and the Terrain Look Ahead Alerting calculation, which require many checks against the terrain database. Both hotspots are implemented as Scilab functions. To gain sufficient task parallelism, we further parallelized the hotspots by loop transformations and partitioning of variables. This results in the hierarchical view of the application shown in Fig. 13, where one can see a broad structure of the whole program. The transformations created many small tasks, with two blocks that stand out because of their length. One of them is located in Mode 2 and the other in Mode 4. Each of them performs a bilinear interpolation, which means that for a given x- and y-value an appropriate z-value is calculated (see the sketch below). For this, they use many mathematical operations in double precision, which takes more time than most of the other tasks in the application, like the linear interpolations done in Mode 1 and 5. As the two interpolations do not have a direct dependency between them, they can be assigned to two different cores. All other, shorter tasks are then scheduled automatically to reduce the amount of communication between them. Fig. 14 shows an example schedule for the EGPWS on the Aurix TriCore TC297 processor with three cores. The speedup of the whole application is around 1.8.

Fig. 13 Hierarchical view of the EGPWS

Fig. 14 Schedule of the EGPWS
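For reference, the bilinear interpolation computed by these two blocks follows the textbook formula, sketched below in C; the lookup of the four neighboring grid points from the terrain data is assumed and not shown.

/* Bilinear interpolation: given a point (x, y) inside the grid cell
 * spanned by (x0, y0) and (x1, y1) with corner values z00..z11,
 * interpolate the z-value. */
double bilerp(double x, double y,
              double x0, double x1, double y0, double y1,
              double z00, double z10, double z01, double z11)
{
    double tx = (x - x0) / (x1 - x0);   /* fractional position in x */
    double ty = (y - y0) / (y1 - y0);   /* fractional position in y */

    /* Interpolate along x on both grid rows, then along y. */
    double z0 = z00 + tx * (z10 - z00);
    double z1 = z01 + tx * (z11 - z01);
    return z0 + ty * (z1 - z0);
}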

Even though the ARGO EGPWS is intended for use only within DLR's Air Vehicle Simulator (AVES), this use case can additionally serve to examine the ARGO toolchain with respect to software and software tool development regulations like DO-178C [15] and its supplements DO-330 [16] and DO-331 [17]. This, however, is currently out of scope and regarded as future work.

V. EVALUATION

To evaluate the performance and correctness of the parallel applications, we first used a Raspberry Pi 2 running a Linux operating system and equipped with a 4-core ARM Cortex-A7 as the target architecture. We used lock-free queues for message-based communication between the four processor cores. In the next step, we used an Aurix TriCore TC297 running the FreeRTOS operating system. Similar lock-free queues were used for the message-based communication between the three cores. These two platforms are not the target architectures intended to run the applications, but they serve as examples to evaluate whether the workflow is feasible.
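A lock-free single-producer/single-consumer queue of this kind can be sketched in C11 as follows; the actual runtime queues are platform-specific (FreeRTOS targets may not provide C11 atomics), so the capacity and payload type here are illustrative assumptions.

#include <stdatomic.h>
#include <stdbool.h>

#define QCAP 16                    /* capacity, power of two (assumed) */

typedef struct {
    double buf[QCAP];
    _Atomic unsigned head;         /* advanced only by the consumer */
    _Atomic unsigned tail;         /* advanced only by the producer */
} spsc_t;

bool spsc_push(spsc_t *q, double v)        /* producer core */
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP) return false;       /* queue full */
    q->buf[t % QCAP] = v;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

bool spsc_pop(spsc_t *q, double *v)        /* consumer core */
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h) return false;              /* queue empty */
    *v = q->buf[h % QCAP];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}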

During evaluation, we could show three major results:

1. Using the interactive parallelization workflow, we could increase the performance of applications from two different domains by factors between 1.8 (on the Aurix TriCore) and 3 (on the Raspberry Pi 2) compared to sequential execution. The three and four cores, respectively, of the target platforms could all be utilized to improve the performance of the application.

2. We could enable an iterative parallelization approach for optimizing the application. Going from Scilab or Xcos to a parallel application running on the target architectures was possible in less than one minute. In each iteration, the end users could explore an alternative parallelization or further optimize the input code.

3. The implementation effort for parallelizing applications was reduced by over 50% compared to a manual parallelization effort. Furthermore, the workflow enables a model-based development approach for multicore processors that was not possible before. Analyses from the literature show that a model-based development approach can reduce the overall development effort by 50% for single-core processors. This results in a reduction of the overall effort by 60-80% for multicore processors. By using two different target architectures, we could also show that this approach allows fast switching of platforms, as many decisions made for the parallelization can easily be ported to other platforms. Oftentimes, a simple change of a parameter, e.g. splitting a loop into 3 instead of 4 independent loops, is enough to get better performance on another platform.

VI. CONCLUSION

In this paper, we have shown that applications with real-time constraints, modelled in open-source programs like Scilab and Xcos, can be used as the basis of a tool flow for the generation of efficient parallel C code for embedded target platforms. By starting at a higher abstraction level, the intermediate sequential C code can be generated such that it is optimized for automatic analysis passes. The flow also allows selecting different implementations of algorithms depending on the target platform.

The iterative flow enables fast and simple optimization of the program until the desired timing is achieved. This can be verified directly by testing on the actual hardware. Switching to a different platform can also be done without changes to the original model or source code and is mostly handled by the framework.

During the ARGO project, the concepts shown in this paper will be enhanced by an integration of WCET estimates into the flow. They are computed for the sequential C code and are directly taken into account by the scheduling algorithms. Additional transformations allow further optimization of the WCET of the program.

ARGO (http://www.argo-project.eu/) is funded by the European Commission under the Horizon 2020 Research and Innovation Action, Grant Agreement Number 688131.

REFERENCES

[1] Univ. of Tennessee, Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd., "LAPACK — Linear Algebra PACKage," http://www.netlib.org/lapack, 2017.

[2] M. Gerdes, R. Jahr and T. Ungerer, "parMERASA Pattern Catalogue: Timing Predictable Parallel Design Patterns," Augsburg, 2013.

[3] T. Ungerer, C. Bradatsch and M. Frieb, "Parallelizing Industrial Hard Real-Time Applications for the parMERASA Multicore," ACM Transactions on Embedded Computing Systems (TECS), July 2016.

[4] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo and R. Kegley, "A predictable execution model for COTS-based embedded systems," in IEEE Real-Time and Embedded Technology and Applications Symposium, 2011.

[5] J. Rosén, C. Neikter, P. Eles, Z. Peng, P. Burgio and L. Benini, "Bus access design for combined worst and average case execution time optimization of predictable real-time applications on multiprocessor systems-on-chip," in IEEE Real-Time and Embedded Technology and Applications Symposium, 2011.

[6] O. Neugebauer, M. Engel and P. Marwedel, "A parallelization approach for resource-restricted embedded heterogeneous MPSoCs inspired by OpenMP," Journal of Systems and Software, pp. 439-448, March 2017.

[7] S. Junger, W. Tschekalinkij, N. Verwaal and N. Weber, "On-chip nanostructures for polarization imaging and multispectral sensing using dedicated layers of modified CMOS processes," in Proceedings of the SPIE Conference on Photonic and Phononic Properties of Engineered Nanostructures, 2011.

[8] A. Nowak, "In-line residual stress measurement," Glass Worldwide, pp. 72-73, 2015.

[9] M. Schöberl, K. Kasnakli and A. Nowak, "Measuring Strand Orientation in Carbon Fiber Reinforced Plastics (CFRP) with Polarization," in 19th World Conference on Non-Destructive Testing, Munich, 2016.

[10] T. Richter, C. Saloman, N. Genser, K. Kasnakli, J. Seiler, A. Nowak and A. S. M. Kaup, "Sequential Polarization Imaging Using Multi-View Fusion," in IEEE International Workshop on Signal Processing Systems, Lorient, France, 2017.

[11] Federal Aviation Administration, Advisory Circular 23-18, "Installation of Terrain Awareness and Warning Systems (TAWS) Approved for Part 23 Airplanes," Washington, D.C.: U.S. G.P.O., 2000.

[12] M. Girkar and C. D. Polychronopoulos, "Automatic extraction of functional parallelism from ordinary programs," IEEE Transactions on Parallel and Distributed Systems, pp. 166-178, 1992.

[13] H. Topcuoglu, S. Hariri and M. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," IEEE Transactions on Parallel and Distributed Systems, pp. 260-274, 2002.

[14] B. Rosen, M. N. Wegman and F. K. Zadeck, "Global value numbers and redundant computations," in Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Diego, CA, USA, 1988.

[15] RTCA, Inc., DO-178C, "Software Considerations in Airborne Systems and Equipment Certification," 2012.

[16] RTCA, Inc., DO-330, "Software Tool Qualification Considerations," 2012.

[17] RTCA, Inc., DO-331, "Model-Based Development and Verification Supplement to DO-178C and DO-278A," 2012.