
[IEEE 2010 Second International Conference on Computer Research and Development - Kuala Lumpur, Malaysia (2010.05.7-2010.05.10)] 2010 Second International Conference on Computer Research


ARM7TDMI Optimization Based on GCC

Den Wenjian

Center of Information, Guangdong Textile College, Foshan City, Guangdong Province, China (528300)

[email protected]

Abstract-This paper discusses the optimization of a hardware architecture from the compiler's point of view, that is, with the compilation technology taken as given. It analyzes how the instruction set, register allocation and pipelining of the architecture can be optimized in light of three GCC compilation techniques: peephole optimization, graph coloring and instruction scheduling.

Keywords-ARM7TDMI; GCC; Optimization

I. INTRODUCTION

In a computer system, the operating system and application software are often integrated with the computer hardware. The compiler reduces the amount of low-level work for the programmer and lets programmers work in the same high-level language.

Software has become bound up with hardware, so we should not design software or optimize hardware without regard to the performance of the software.

It has already been proposed that a system should be designed with hardware and software considered together, so that the performance of the whole system improves. The VAX, for example, was designed with compilation in mind: its instruction set is highly orthogonal and allows one statement in a high-level language to be mapped to one instruction. Nevertheless, the VAX design is considered a failure. The reason is that its hardware architecture was designed without taking a given compiler structure and compiler optimization technology into account.

What this paper discusses, therefore, is how to optimize a specific machine platform when the modern compiler structure and compilation technology are taken as given.

II. GCC RELATED TO SPECIFIC MACHINE PLATFORM

GCC (the GNU Compiler Collection, originally the GNU C Compiler) is a typical C compiler published by GNU. Its structure is shown in Figure 1:

Figure 1. GCC Compiler Structure

The GCC compiling structure consists of three parts: the front end, the intermediate code and the back end. Through lexical, syntax and semantic analysis, the GCC front end generates parse trees, which are then translated into intermediate code. The back end then optimizes the intermediate code and translates it into assembly. During this translation, optimization is performed on RTL (Register Transfer Language) instructions, called INSNs. The result of this optimization affects both the execution efficiency and the size of the assembly code.

INSN optimization includes machine-independent and machine-dependent optimization. Machine-independent optimization is not discussed in this paper. The machine-dependent optimizations that matter here are instruction combination, register allocation and instruction scheduling, all of which affect the resulting assembly code. Instruction combination merges several instructions into one, which directly reduces code size and can also improve code efficiency; in GCC, a technique named peephole optimization is used for instruction combination. Instruction scheduling reorders the instruction sequence to improve execution efficiency; GCC uses basic-block instruction scheduling for this. Register allocation is a necessary step in the GCC back end, and a better register allocation algorithm helps the compiler generate better assembly code; GCC uses graph coloring for register allocation.

In conclusion, to improve both the efficiency of the generated code and the speed of compilation, we must start from these three compiler optimization techniques.

III. CHARACTERISTIC OF ARM7TDMI

ARM7TDMI is a typical processor core in the ARM7 series. It is a 32-bit RISC processor that supports the 16-bit compressed Thumb instruction set, on-chip debug and an embedded hardware multiplier.

The ARM7TDMI instruction set is divided into branch instructions, data processing instructions, PSR (Program Status Register) processing instructions, load/store instructions and exception generation instructions.

These instructions have the characteristics of a RISC instruction set, namely a typical load-store structure, fixed instruction length and simple addressing. They also have some non-RISC characteristics: certain instructions, such as multiply, multiply-accumulate and some conditional instructions, take a variable number of cycles and need more than one cycle to execute.

978-0-7695-4043-6/10 $26.00 © 2010 IEEE

DOI 10.1109/ICCRD.2010.139

Listing for Figure 2:

    ;; Combine two adjacent word stores into one store-multiple (STM).
    (define_peephole
      [(set (match_operand:SI 2 "memory_operand" "=m")
            (match_operand:SI 0 "s_register_operand" "r"))
       (set (match_operand:SI 3 "memory_operand" "=m")
            (match_operand:SI 1 "s_register_operand" "r"))]
      "TARGET_ARM && store_multiple_sequence (operands, 2, NULL, NULL, NULL)"
      "* return emit_stm_seq (operands, 2);"
    )

ARM7TDMI has 16 user-visible registers. In assembly, R0 to R13 are general-purpose registers that hold data or address values; they have no special hardware function. R14 is the link register and serves two special purposes: it holds the subroutine return address in normal modes and the exception return address in exception modes. R15 is the program counter, which points to the instruction currently being fetched. In addition, the CPSR (Current Program Status Register) holds the current status flags after operations.

ARM7TDMI uses a three-stage pipeline consisting of instruction fetch, decode and execute. The three stages are completed independently in different functional units. In general, register-operation instructions execute in one cycle. However, some branch instructions break the pipeline flow, which lowers the efficiency of the system.
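As a rough illustration of this cost, the sketch below counts cycles for an idealized three-stage pipeline. It assumes, for illustration only, that every instruction spends one cycle in each stage and that each taken branch discards the instructions in the stages behind it; the function name pipeline_cycles is hypothetical and the model ignores multi-cycle instructions.

```python
def pipeline_cycles(n_instructions: int, taken_branches: int = 0,
                    stages: int = 3) -> int:
    """Cycle count for an idealized in-order pipeline (illustrative model)."""
    if n_instructions == 0:
        return 0
    # Cycles needed to fill the pipeline before the first result appears.
    fill = stages - 1
    # Each taken branch throws away the instructions fetched behind it,
    # so the pipeline has to be refilled once per taken branch.
    return n_instructions + fill + taken_branches * fill


if __name__ == "__main__":
    print(pipeline_cycles(10))      # straight-line code
    print(pipeline_cycles(10, 2))   # same code with two taken branches
```

Under these assumptions, each taken branch adds two cycles on a three-stage pipeline, which is why branch-heavy code benefits from the control-hazard handling discussed later.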

On the whole, it is important to optimize the instruction set, register allocation and pipelining from the point of view of GCC, because these three aspects affect both compiling efficiency and code efficiency.

IV. GCC OPTIMIZING ANALYSIS FOR ARM7TDMI

This section analyzes how to optimize the instruction set, register allocation and pipelining from the point of view of three GCC techniques: peephole optimization, graph coloring and instruction scheduling.

A. ARM7TDMI instruction optimizing based on peephole optimizing

The principle of peephole optimization is very simple: the compiler can find local improvements by examining short sequences of adjacent operations. Peephole optimizers were originally applied to the generated code after all other optimization steps, and in the course of this examination they can produce new assembly code. The optimizer has a moving window, the 'peephole', that slides over the code sequence. At every step, the optimizer examines the operations in the window, looks for specified patterns and replaces them with better ones.

Because the ARM7TDMI instruction set has non-RISC features such as batch load/store (LDM/STM) instructions and multiply and multiply-accumulate instructions, the compiler uses specific patterns to combine instructions. For batch load/store, GCC for ARM combines two to four load/store instructions into one batch load/store instruction. The machine description in Figure 2 shows how two instructions are combined into one:

Figure 2. Machine description for combining two instructions into one

The combination of three or more load/store instructions is defined by further peephole patterns. However, the peephole window is limited to at most four instructions, so at most four load/store instructions can be combined into one batch instruction. This clearly limits what compiler optimization can achieve. We therefore propose setting the window size to 2 when combining, so that the batch load/store instruction is reduced to a double load/store instruction.
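The effect of the window size can be illustrated with a small sketch. This is not GCC's implementation: the tuple encoding of instructions, the function name combine_stores and the rule that registers must appear in ascending order are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical instruction encoding used only in this sketch:
#   ("str", reg, base, offset)    store one register to [base + offset]
#   ("stm", base, offset, regs)   store several registers to consecutive words
Insn = Tuple


def combine_stores(code: List[Insn], window: int = 2) -> List[Insn]:
    """Merge runs of adjacent word stores, at most `window` per run."""
    out: List[Insn] = []
    i = 0
    while i < len(code):
        insn = code[i]
        if insn[0] != "str":
            out.append(insn)
            i += 1
            continue
        _, reg, base, off = insn
        regs = [reg]
        j = i + 1
        # Extend the run while the next store writes the next word from the
        # same base register and uses a higher-numbered source register.
        while (j < len(code) and len(regs) < window
               and code[j][0] == "str"
               and code[j][2] == base
               and code[j][3] == off + 4 * len(regs)
               and code[j][1] > regs[-1]):
            regs.append(code[j][1])
            j += 1
        if len(regs) > 1:
            out.append(("stm", base, off, tuple(regs)))
        else:
            out.append(insn)
        i = j
    return out


if __name__ == "__main__":
    stores = [("str", 0, 4, 0), ("str", 1, 4, 4),
              ("str", 2, 4, 8), ("str", 3, 4, 12), ("mov", 5, 6)]
    print(combine_stores(stores, window=2))
    print(combine_stores(stores, window=4))
```

With a window of 4 the four stores collapse into one batch instruction; with a window of 2 they become two double stores, which is the restricted form argued for above.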

Next, consider the multiply and multiply-accumulate instructions. In the ARM instruction set, these instructions complicate the pipeline design of the processor. From the compiler's point of view, however, a multiplication in a high-level language can be mapped to shift and add instructions in assembly. As a result, the multiply and multiply-accumulate instructions are unnecessary as far as compilation is concerned.
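The mapping from multiplication to shift and add can be sketched for the simplest case, a multiplication by a non-negative constant. The function names below are hypothetical, and the sketch only shows the arithmetic identity; it does not show how GCC selects the instructions.

```python
def shift_add_terms(constant: int) -> list:
    """Return shift amounts s such that x * constant == sum(x << s)."""
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    shifts = []
    bit = 0
    while constant:
        if constant & 1:          # this bit contributes x << bit
            shifts.append(bit)
        constant >>= 1
        bit += 1
    return shifts


def multiply_by_constant(x: int, constant: int) -> int:
    """Multiply using only shifts and additions."""
    return sum(x << s for s in shift_add_terms(constant))


if __name__ == "__main__":
    # 10 = 0b1010, so x * 10 == (x << 1) + (x << 3)
    print(shift_add_terms(10), multiply_by_constant(7, 10))
```

The number of additions grows with the number of set bits in the constant, so the substitution is cheapest for constants with few set bits.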

B. ARM7TDMI register optimizing based on graph coloring

Graph coloring treats register allocation as a graph-coloring problem. The registers are regarded as the colors, and what is colored are the values to be allocated to registers.

ARM7 has a 16-register structure, of which fifteen are general registers. Graph coloring therefore allocates the different values using fifteen colors. In practice, a general (non-kernel) GCC test program can be fully allocated this way. However, when values do not all fit, they must be spilled, and storing and reloading spilled data cannot be handled by graph coloring itself. GCC has an optimization scheme in which designated registers, called cache registers here, are set aside for this purpose; how many cache registers are defined is determined by rules. Generally, for ARM7, one of the fifteen registers can be defined as a cache register for storing spilled data.
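A minimal sketch of coloring with a fixed number of registers follows. It uses a simplified simplify-and-select scheme with pessimistic spilling, which is an assumption of this example and not a description of GCC's allocator; the function name color_registers and the dictionary representation of the interference graph are hypothetical.

```python
def color_registers(interference, k):
    """Assign one of k colors (registers) to each value.

    interference maps each value to the values live at the same time.
    Returns (colors, spilled): a value -> color map and the spilled values.
    """
    # Build a symmetric copy of the interference graph, ignoring self-loops.
    full = {}
    for value, neighbours in interference.items():
        full.setdefault(value, set())
        for other in neighbours:
            if other != value:
                full[value].add(other)
                full.setdefault(other, set()).add(value)

    work = {v: set(ns) for v, ns in full.items()}
    stack, spilled = [], []
    while work:
        # A node with fewer than k neighbours can always be colored later.
        low = sorted(v for v in work if len(work[v]) < k)
        if low:
            node = low[0]
            stack.append(node)
        else:
            # No such node: spill the value with the most conflicts.
            node = max(sorted(work), key=lambda v: len(work[v]))
            spilled.append(node)
        for other in work.pop(node):
            work[other].discard(node)

    colors = {}
    for node in reversed(stack):
        used = {colors[n] for n in full[node] if n in colors}
        colors[node] = next(c for c in range(k) if c not in used)
    return colors, spilled


if __name__ == "__main__":
    graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
    print(color_registers(graph, 3))
    print(color_registers(graph, 2))
```

If one of the fifteen general registers is reserved for spilled data as described above, the allocator would be called with k = 14 rather than 15.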

C. ARM7TDMI pipelining optimizing based on instruction scheduling

In general, the compiler needs to reorder operations to suit the performance requirements of a specific platform in order to generate faster code. This is what is called instruction scheduling. The instruction scheduler is shown in the following figure:

Figure 3. Instruction scheduler

ARM7TDMI has only a three-stage pipeline, consisting of instruction fetch, instruction decode and instruction execute. ARM7 has a three-bus structure with sixteen registers, so register-memory store/load operations generally do not run into structural hazards, that is, conflicts caused by insufficient resources. As a result, the compiler does not need to resolve structural hazards.

For data hazards in the ARM7TDMI pipeline, the compiler relies on the instruction scheduler. There is one exception, however. A batch load instruction, which loads several values, interacts with memory, so it executes over three instruction cycles and delays the next instruction. It is up to the compiler to cover this delay with instruction scheduling, but the gap to be filled is quite large.

For control hazards, the correct direction is known only after the branch has executed, and only then can the following instructions be fetched without waste. Flushing the pipeline, after all, is costly.

In fact, once delay slots are inserted, a three-stage pipeline is less effective than a five-stage pipeline for sequential execution, because the third stage of ARM7TDMI first reads the registers, then performs the logic or arithmetic computation and finally writes back the result. This sequence of actions is complex, so scheduling may not succeed and execution time may be prolonged. The GCC scheduler is based on local list scheduling, so the span of scheduling does not extend beyond a basic block. For this reason it is not good for the compiler to load too many values at one time with batch instructions.
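Local list scheduling within one basic block can be sketched as follows. The sketch assumes a single-issue machine, an acyclic dependence relation given explicitly, and a priority that simply prefers the longest-latency ready instruction; these choices and the function name list_schedule are illustrative assumptions, not GCC's actual heuristics.

```python
def list_schedule(insns, deps, latency):
    """Schedule one basic block on a single-issue machine.

    insns:   instruction names in program order
    deps:    name -> set of names whose results it needs (must be acyclic)
    latency: name -> cycles until the result is available
    Returns a list of (issue_cycle, name).
    """
    ready_at = {}            # name -> first cycle its result can be used
    remaining = list(insns)
    scheduled = []
    cycle = 0
    while remaining:
        # An instruction is ready once all its inputs have been produced.
        ready = [i for i in remaining
                 if all(p in ready_at and ready_at[p] <= cycle
                        for p in deps.get(i, ()))]
        if ready:
            # Longest latency first; ties keep program order.
            pick = max(ready, key=lambda i: latency[i])
            scheduled.append((cycle, pick))
            ready_at[pick] = cycle + latency[pick]
            remaining.remove(pick)
        cycle += 1           # an empty ready list means a stall cycle
    return scheduled


if __name__ == "__main__":
    block = ["ldm", "add", "mov", "sub"]
    deps = {"add": {"ldm"}}
    latency = {"ldm": 3, "add": 1, "mov": 1, "sub": 1}
    print(list_schedule(block, deps, latency))
```

In this example the two independent instructions fill the three-cycle delay of the batch load. When a block has no independent instructions to move, the scheduler cannot hide the delay, which is the limitation noted above.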

For control hazards in the ARM7TDMI pipeline, the pipeline stall can be reduced to one cycle by adding hardware, as mentioned in the last section. Even so, scheduling cannot keep branches from stalling during that one cycle. Because ARM7TDMI has no delayed-branch instructions, delay slots have to be defined by customization. In general, inserting delay slots keeps the pipeline working effectively. The customization is written in arm.md in GCC, as shown in Figure 4:

Figure 4. Delay slot

Three delay slots are needed after a branch executes, and all three slots are set to true. As a result, the delay slots eliminate the pipeline stall.

V. OPTIMIZING DESIGN

Optimization mainly starts from instruction optimization. We use the emulator SkyEye to reinterpret the instructions so that the modified instruction set can be emulated for a prototype design.

As analyzed in the last section, two groups of instructions should be optimized. First, the batch load/store instructions should be reduced to double load/store instructions. Second, the multiply and multiply-accumulate instructions should be deleted from the instruction set.

The optimization proceeds as follows. First, we delete from the GCC file arm.md the mappings that are no longer needed. For example, the mapping shown in Figure 5 processes four registers at once, and the optimization deletes it.

Figure 5. Load instruction template

The template is named “ldmsi4”, which indicates a register-memory instruction template for four registers; the mode of the template is “SI”, the mode of a four-byte (32-bit) integer.

The instruction templates named *ldmsi3, *stmsi3, *stmsi4, *arm_mulsi3, *thumb_mulsi3 and *mulsi3addsi, together with the related comparison templates such as *mulsi3addsi_compare0, should be deleted, while stmsi2 and ldmsi2 should be kept. For the multiply instruction templates, the direction of the mapping should be changed so that they finally generate shift-add instructions, as shown in Figure 6.

Listing for Figure 4:

    ;; Delay-slot definition for branch instructions.
    (define_delay (eq_attr "type" "branch")
      [(eq_attr "in_delay_slot" "true")
       (eq_attr "in_delay_slot" "true")
       (eq_attr "in_delay_slot" "true")])

Listing for Figure 5:

    ;; Load four consecutive words from the address in operand 1 (LDMIA).
    (define_insn "*ldmsi4"
      [(match_parallel 0 "load_multiple_operation"
        [(set (match_operand:SI 2 "arm_hard_register_operand" "")
              (mem:SI (match_operand:SI 1 "s_register_operand" "r")))
         (set (match_operand:SI 3 "arm_hard_register_operand" "")
              (mem:SI (plus:SI (match_dup 1) (const_int 4))))
         (set (match_operand:SI 4 "arm_hard_register_operand" "")
              (mem:SI (plus:SI (match_dup 1) (const_int 8))))
         (set (match_operand:SI 5 "arm_hard_register_operand" "")
              (mem:SI (plus:SI (match_dup 1) (const_int 12))))])]
      "TARGET_ARM && XVECLEN (operands[0], 0) == 4"
      "ldm%?ia\\t%1, {%2, %3, %4, %5}"
      [(set_attr "type" "load4")
       (set_attr "predicable" "yes")]
    )


Figure 6. Multiply instruction template

In this way the mapping of multiplication is shifted from the multiply instruction to shift-add instructions, and shift-add is a good substitute.

Note that this design is an optimization for a fixed platform; its aim is to confirm that hardware optimizations designed in this way improve the effectiveness of compilation.

As for pipeline optimization, we can compare compilation for ARM7 with that for ARM9. ARM7 uses a three-stage pipeline, and ARM9 uses a five-stage pipeline.

VI. CONCLUSION

The design of compiler optimization techniques, such as peephole optimization, graph coloring and instruction scheduling, is tightly related to the design of the specific platform. In conclusion, we obtain the following principles of compiler-oriented design for an embedded platform:

Provide an instruction set that is as atomic as possible.

Provide more than 16 general registers.

Use a five-stage pipeline, which is the best choice for a compiler targeting a specific embedded platform.

We verified these principles on an ARM7TDMI emulator, and the effectiveness of compilation improved. Whether the principles hold on other platforms remains to be verified. This will be the topic of our future work.

ACKNOWLEDGMENT

Thanks to everybody who helped me. Thanks to my instructors and my colleagues. Thanks to my parents, and especially to my wife.


Listing for Figure 6:

    (define_insn "*arm_mulsi3"
      [(set (match_operand:SI 0 "s_register_operand" "=&r,&r")
            (mult:SI (match_operand:SI 2 "s_register_operand" "r,r")
                     (match_operand:SI 1 "s_register_operand" "%?r,0")))]
      "TARGET_ARM; operands[1] =get_lsl_bits(operands[1] SImode) operand"[0]=operands[1]%2
      "add%?\\t%0, %0, lsl%?\\t %2,%1"
      [(set_attr "insn" "mul")
       (set_attr "predicable" "yes")]
    )
