
robo.fe.uni-lj.sirobo.fe.uni-lj.si/~marko/ur/literature from www/embedded... · Web viewCMP AX, 0 ; did it return zero? JE failed ; if so, it failed, jump to fail code ... 64 for



Processor Types

 

Where did it come from?

Anybody reading this in the UK will no doubt be familiar with Acorn's BBC micro. Some, sadly, seem to feel that the company never made it beyond that "odd thing with the black and red keys", while others can cast their minds back to the moment they booted RISC OS 4 on their new Kinetic, and gloat.

Either way, Acorn made use of the 6502 processor in the Atom, some kits, and some rackmount machines in the late seventies. As 1980 rolled in, the BBC went looking for a computer to fit a series of programmes they wanted to produce. Unlike these days, when the programmes are much more likely to be made to fit the computer, the BBC had in mind the sort of specification it was looking for. A number of companies well known at the time tendered their designs. Acorn revamped the Atom design, throwing into it as much as possible, and building an entire working machine from the ground up in a matter of days. That's the stuff legends are made of, and that seems to be the stuff Acorn was good at, like "Hey, guys, let's pull an all-nighter and write an operating system".

The BBC loved the machine, and the rather naffly named "The Computer Programme" was released in 1982 alongside the BBC microcomputer. It filled school computer rooms. Many were sold. Not many in American terms, but staggering in European terms.

The BBC micro, like earlier Acorn machines, was based around the 6502 processor - as were other popular computers such as the Apple II.

From the outset, you could have colour graphics and text on-screen. Not to be outdone, the BBC micro offered seven screen 'modes' of varying types - ranging from high resolution monochrome to eight colour (plus eight flashing colours) and an eight colour 'teletext' mode that only required 1K of memory per screen; a cassette interface for cheap and cheerful use, on-board provision for a floppy disc interface (you needed to add a couple of ICs like the 1772 disc controller, that's all), serial, four channel analogue, eight channel digital I/O, tube for co-processors, a 1MHz system bus for serious fiddling and for harddiscs... and by adding a couple of extra components, you had built-in networking.

Econet might have been slow and simple, but it was a revolution in those days, when it was said that Bill Gates, among other notable gaffes, asked "what's a network?" - though this may well be urban legend. In any case, running multiple processor systems, and networking all sorts of machines, was something that Acorn users were au fait with long before the PC marketplace kicked off, never mind implemented such things for itself.

However, Acorn had their sights set on the future, and between 1983 and 1985 the ARM processor was designed by Steve Furber and Sophie Wilson (or Roger Wilson, back then). This was a leap of faith and optimism: only a year after releasing a 32K 8 bit machine, they were designing a 32 bit machine that could cope with up to 16MB of RAM, and some ROM as well.

Why?

Acorn continued to produce the BBC micro and variants. Indeed, the production of their most successful version of the BBC micro - the Master - only finished in May 1993. However, a decade earlier, in 1983, it was quite clear to the innovators inside Acorn that the next generation of machine should provide something far better than rehashing old ideas over and over. Therein lay the problem: which processor to use? There was nothing that stood out from the crowd. Acorn had produced a machine with the 16 bit 6502-alike, the 65C816, but this wasn't up to the vision that Acorn had. They tried all of the 16 and 32 bit processors available, building second processor units for the BBC micro to aid in their evaluation.

So there was one idea left. To make the processor that they were looking for. Something that kept the ideals of the 6502, but provided raw power. Something small, cheap - both to produce and to power, and something fairly simple both internally and to program. The important early design decisions were to use a fixed instruction length (which makes it possible to accurately disassemble any random memory address simply by looking to see what is there - every instruction is word aligned), and to use a load/store model.
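The benefit of a fixed instruction length can be sketched in a few lines of Python. This is an illustration added here, not part of the original design work: the function names are hypothetical, little-endian byte order is assumed, and only the four-bit condition field at the top of each word is decoded.

```python
import struct

# ARM condition codes, indexed by the top four bits of every instruction word.
CONDITIONS = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
              "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"]

def split_words(memory):
    """Split a little-endian byte string into 32-bit instruction words."""
    if len(memory) % 4:
        raise ValueError("ARM instructions are word aligned")
    return [word for (word,) in struct.iter_unpack("<I", memory)]

def condition_of(word):
    """Return the condition mnemonic held in bits 31-28 of a word."""
    return CONDITIONS[(word >> 28) & 0xF]

# Three instruction words: MVNVS R0,#0 / MOVVC R0,#0 / MOV R0,R0
code = struct.pack("<3I", 0x63E00000, 0x73A00000, 0xE1A00000)
conditions = [condition_of(word) for word in split_words(code)]
```

Because every instruction occupies exactly one aligned word, the decoder never has to guess where an instruction starts, which is the property the paragraph above describes.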

In that day, companies were talking about extending their CISC processors. The 8088 became the 80186 (briefly), the 80286, and so on to the processor it is today. RISC processors existed, but the majority of them were designed in-house as embedded controllers. Acorn took their ideas and requirements and wrote a BASIC program that emulated the ARM 1 instruction set. The designers of the processor were new to processor design, and some of the tools used were not exactly cutting edge. This prevented the processor design from being large and complex, which in its way was the best thing, and is now being spun as a 'plus' for the ARM processor, as indeed it is.

While Acorn had very clear ideas of what they wanted the processor to do, they also wanted good all-round performance, rather than something so tailored to the end design that it obsoletes itself.

So. For the processor, Acorn rolled their own.

Please, take a moment to consider this.

Not only did Acorn create an entire powerful and innovative operating system with a tiny crew (Microsoft probably employs more people to clean their toilets than Acorn employed in total), but they also designed their own chipset.

So basically these guys designed an entire computer from the ground up, on a tiny budget and with a tiny workforce.

You can fault Acorn for many things - lack of development, lack of advertising - but you can never fault them for having the sheer balls to pull it off in the first place.

At the time the "Archimedes" was released, it was widely touted as the world's fastest desktop machine. It also boasted a display system that could spit out loads of different resolutions. My A5000 (same video hardware) can output 640x480 at 256 colours, or 800x600 at 16 colours. It doesn't sound impressive, but this was using hardware developed in the mid '80s. The rest of the world (save Apple Macs) was using CGA and the like; or Hercules for the truly deranged!

Not a lot was made of the fact that the machines were RISC. Maybe Acorn figured the name of the operating system (RISC OS) was a big hint. Maybe they figured they had enough going for the machine without getting all geeky.

So when, in the early '90s, Apple announced the world's first RISC desktop machine, we laughed. And Acorn ran a good-humoured advert in the Times welcoming Apple to RISC.

The chipset was:

ARM2

This is the central processor, and the name originally stood for "Acorn RISC Machine" (rather than "ARM RISC machine", or whatever they've called it today (may I suggest "Advanced RISC Microprocessor"?)).

MEMC1 (Anna)

This was the MEMory Controller. It was very soon replaced by the MEMC1a, which I do not think had a name. The RiscPC generation of machines use an MMU (Memory Management Unit).

VIDC1 (Arabella)

This was the VIDeo Controller, though due to all it was capable of doing to pixels and sound, many knew it as the Very Ingenious Display Contraption. Certainly, the monitors that cannot be supported under RISC OS are few and far between. It is a trivial matter to switch from a modern 21" SVGA monitor to a television monitor. The RiscPC generation of machines use the VIDC20, which takes it the logical step further. Unfortunately, the VIDC is no longer able to keep up with the latest advances in display driver technology. Enter J. Kortink with his ViewFinder.

IOC (Albion)

This was the Input/Output Controller, and it looked after podules and keyboards and basically anything that did I/O. In a flash of inspiration, it offered an IIC interface which is available on the expansion bus. My teletext receiver is hooked into this. RiscPC generation machines use the IOMD, which is like a souped up IOC.

The ARM250 (mezzanine / macrocell) offered the ARM chipset on one piece of silicon. It was used in the A3010 and A3020 machines. It may have also been used in the A4000, but I've not seen inside such a machine.

 

The original operating system of the ARM-based machine was to be ARX, but it was taking too long and was running over budget. So Arthur was designed. It has been said that Arthur's name derives from the porting of the BBC MOS: "A RISC operating system by Thursday". Sadly, it has a lot of the hang-ups of the BBC micro, such as a lack of memory protection (like 'modules' running in SVC mode, when really only the kernel should run in SVC mode), a plethora of unrelated things done with the OS_Byte SWI, the service call mechanism...

From Arthur came RISC OS, which improved certain aspects of the system, but perhaps the most significant improvement was the Desktop. Instead of a bizarre looking (and horribly coloured) thing that could only run one task at a time, it introduced proper co-operative multitasking.

The debate between pre-emptive and co-operative multitasking is legion, but I feel that Acorn wanted co-operative; that it was a design decision instead of a cop-out. Because, while it makes it slightly harder to program and more liable to problems with errant tasks, it fits so beautifully into Acorn's ethos. There's no process 'protection' like on Unix. You can drop to a privileged processor mode with little more than a SWI call, and a lot of stuff (that probably shouldn't) runs in SVC mode. Because, at its heart, RISC OS is a hacker's operating system. Not the same type of 'hacking' that Linux and NetBSD come from - such things were not known in the home/office computer sector in those days - but in its way, RISC OS is practically begging for you to whip out the disassembler and start poking around its internals. The original Arthur PRMs said that any serious application would be written in assembler (a view they later changed, to suggesting serious applications would be written in C).

 

When the ARM processor team split off into ARM Ltd, they adopted a new numbering system for the processors. Originally, the numerical suffix reflected the revision of the device, the ARM 1, the ARM 2, the ARM 3 ... followed by the ARM two-and-a-half, which is 250 in the tradition of multiplying version numbers by a hundred.

Now, the single number reflects the macrocell, as always - ARM6, ARM7...

A processor with a twin number denotes a self-contained processor and basic interface circuitry, like the ARM60 and the VIDC20 (VIDC not strictly a processor, but part of the ARM chipset).

A processor with a triple number denotes the processor macrocell combined with other macrocells, or custom logic, like the ARM610 and the ARM710. Because of the simplicity of the designs, and the predefined parts, the ARM610 went from specification to silicon in under four months. Short development times are invaluable for custom devices where every development day matters... It also matters that ARM's designs will arrive on time, so you don't end up with your computer or PDA (or whatever) sitting there awaiting the processor. Within ARM's converted barn, a line of opened champagne bottles lines the staircase - a testament to how many of their designs worked from the very first silicon implementation - which is virtually every single one of them.

 


 

So there you have it.

From an idea to a global leader in microprocessors (Intel has said recently it is making more ARM silicon than x86 silicon), the ARM processor's birth is wrapped in spectacular innovation.

While it is not entirely certain where RISC OS is heading, one thing is for sure. The beautiful processor in our RISC OS machines is going from strength to strength.

We at Heyrick wish ARM Ltd all the best...


Copyright © 2004 Richard Murray


Where might you find an ARM?

 

The ARM processor is a powerful, low-cost, efficient, low-power (consumption, that is) RISC processor. Its design was originally for the Archimedes desktop computer, but somewhat ironically numerous factors about its design make it unsuitable for use in a desktop machine (for example, the MMU and cache are the wrong way around). However, many factors about its design make it an exceptional choice for embedded applications.

So while many PCs can scream "Intel(R) inside", there are a steadily increasing number of devices that could scream "ARM inside", only ARM doesn't have an ego anywhere near as large as Intel. Oh, and yes I am aware that Intel are fabricating ARM processors. Oh what a tangled web we weave...

 

[last updated January 2002]

Gameboy Advance games console
Daewoo inet.top.box
Bush Internet TV / box
Datcom 2000 digital satellite receiver
Pace digital satellite receiver (supplied as part of the Sky package)
Numerous other digital cable / satellite receivers
Hauppauge WinTV DVB-S PC TV card
Oracle NC
LG Java computer
Millipede Apex Imager video board
Paradise AiTV set top box
Sony MZ-R90 minidisc
Win-Jam
JVC's digital camera 'Pixstar'
Lexmark Z12/22/32/42/52 colour Jetprinter
Samsung office laser printer
Samsung SmartJet MFP (printer/scanner/copier/fax)
Xerox colour inkjet printer
Digital logic analyzers from Controlware
IHU-2 Experimental Space Flight Computer
Siemens video phone
Wizcom's Quicktionary
Various GSM handsets, from the likes of Alcatel, AEG, Ericsson, Kenwood, NEC, Nokia...
Cable/ADSL modems, by manufacturers such as Cayman Systems, D-Link, and Zoom.
3Com 3CD990-TX-97 10/100 PCI NIC with 3XP processor
Routers, bus adaptors, servers, crypto, gateways...
POS systems
Smart cards
Adaptec PCI to Ultra2 SCSI 64 bit RAID controller
ATA drive electronics controller systems (bare)
Iomega HipZip digital audio player
C pen, with OCR and IrDA
HP/Ericsson/Compaq pocket PCs
Psion series 5 hand-held PC (5mx used 36MHz ARM710T)
Various PDAs

And, of course, all of us using Archimedes / BBC (A30x0) / NetStation / RiscPC / A7000 / Mico / RiscStation computers!!!

This is not a complete list. Visit http://www.arm.com/ for a full list, with links to each item.

The above may not use ARM processors, but other hardware produced by ARM. It is rather difficult to discover what is actually inside half of these things without owning one and taking it apart!

 

One site that gives images and interesting background information is the one detailing the IHU-2 experimental space flight computer.


RISC vs CISC

You can read a reply to this text by going here.

 

In the early days of computing, you had a lump of silicon which performed a number of instructions. As time progressed, more and more facilities were required, so more and more instructions were added. However, according to the 20-80 rule, 20% of the available instructions are likely to be used 80% of the time, with some instructions only used very rarely. Some of these instructions are very complex, so creating them in silicon is a very arduous task. Instead, the processor designer uses microcode. To illustrate this, we shall consider a modern CISC processor (such as a Pentium or 68000 series processor). The core, the base level, is a fast RISC processor. On top of that is an interpreter which 'sees' the CISC instructions, and breaks them down into simpler RISC instructions.

Already, we can see a pretty clear picture emerging. Why, if the processor is a simple RISC unit, don't we use that? Well, the answer lies more in politics than design. However, Acorn saw this and, not being constrained by the need to remain totally compatible with earlier technologies, decided to implement their own RISC processor.

Up until now, we've not really considered the real differences between RISC and CISC, so...

A Complex Instruction Set Computer (CISC) provides a large and powerful range of instructions, which is less flexible to implement. For example, the 8086 microprocessor family has these instructions:

  JA   Jump if Above
  JAE  Jump if Above or Equal
  JB   Jump if Below
  ...
  JPO  Jump if Parity Odd
  JS   Jump if Sign
  JZ   Jump if Zero

There are 32 jump instructions in the 8086, and the 80386 adds more. I've not read a spec sheet for the Pentium-class processors, but I suspect it (and MMX) would give me a heart attack!


By contrast, the Reduced Instruction Set Computer (RISC) concept is to identify the sub-components and use those. As these are much simpler, they can be implemented directly in silicon, so will run at the maximum possible speed. Nothing is 'translated'. There are only two Jump instructions in the ARM processor - Branch and Branch with Link. The "if equal, if carry set, if zero" type of selection is handled by condition options, so for example:

  BLNV  Branch with Link NeVer (useful!)
  BLEQ  Branch with Link if EQual

and so on. The BL part is the instruction, and the following part is the condition. This is made more powerful by the fact that conditional execution can be applied to most instructions! This has the benefit that you can test something, then only do the next few commands if the criteria of the test matched. No branching off, you simply add conditional flags to the instructions you require to be conditional:

  SWI "OS_DoSomethingOrOther"   ; call the SWI
  MVNVS R0, #0                  ; If failed, set R0 to -1
  MOVVC R0, #0                  ; Else set R0 to 0

Or, for the 80486:

  INT $...whatever...           ; call the interrupt
  CMP AX, 0                     ; did it return zero?
  JE failed                     ; if so, it failed, jump to fail code
  MOV DX, 0                     ; else set DX to 0
return
  RET                           ; and return
failed
  MOV DX, 0FFFFH                ; failed - set DX to -1
  JMP return

The odd flow in that example is designed to allow the fastest non-branching throughput in the 'did not fail' case. This is at the expense of two branches in the 'failed' case.

I am not, however, an x86 coder, so that can possibly be optimised - mail me if you have any suggestions...
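The effect of conditional execution in the ARM example can be modelled with a short Python sketch. It is a simplification added for illustration: the names are hypothetical, only a handful of conditions are covered, and it assumes (as the VS/VC suffixes imply) that the SWI reports failure by setting the overflow (V) flag.

```python
# Condition tests on the N, Z, C, V flags (a subset, enough for this sketch).
CONDITION_TESTS = {
    "AL": lambda f: True,
    "NV": lambda f: False,
    "EQ": lambda f: f["Z"],
    "NE": lambda f: not f["Z"],
    "VS": lambda f: f["V"],
    "VC": lambda f: not f["V"],
}

def run(program, flags):
    """Execute (condition, register, value) moves; skip those whose condition fails."""
    regs = {}
    for cond, reg, value in program:
        if CONDITION_TESTS[cond](flags):
            regs[reg] = value & 0xFFFFFFFF  # registers are 32 bits wide
    return regs

# Models:  MVNVS R0, #0   followed by   MOVVC R0, #0
program = [("VS", "R0", ~0), ("VC", "R0", 0)]

failed = run(program, {"N": False, "Z": False, "C": False, "V": True})
succeeded = run(program, {"N": False, "Z": False, "C": False, "V": False})
```

Both instructions are always fetched, but only one of them takes effect, so no branch is needed in either case.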

 

Most modern CISC processors, such as the Pentium, use a fast RISC core with an interpreter sitting between the core and the instruction. So when you are running Windows 95 on a PC, it is not that much different to trying to get Windows 95 running on a software PC emulator. Just imagine the power hidden inside the Pentium...

Another benefit of RISC is that it contains a large number of registers, most of which can be used as general purpose registers.

This is not to say that CISC processors cannot have a large number of registers; some do. However, for its use, a typical RISC processor requires more registers to give it additional flexibility. Gone are the days when you had two general purpose registers and an 'accumulator'.

One thing RISC does offer, though, is register independence. As you have seen above, the ARM register set defines at minimum R15 as the program counter, and R14 as the link register (although, after saving the contents of R14, you can use this register as you wish). R0 to R13 can be used in any way you choose, although the Operating System defines R13 as the stack pointer. You can, if you don't require a stack, use R13 for your own purposes. APCS applies firmer rules and assigns more functions to registers (such as Stack Limit). However, none of these - with the exception of R15 and sometimes R14 - is a constraint applied by the processor. You do not need to worry about saving your accumulator in long instructions; you simply make good use of the available registers.

The 8086 offers you fourteen registers, but with caveats:

The first four (A, B, C, and D) are Data registers (a.k.a. scratch-pad registers). They are 16 bit and can each be accessed as two 8 bit registers, thus register A is really AH (A, high-order byte) and AL (A, low-order byte). These can be used as general purpose registers, but they can also have dedicated functions - Accumulator, Base, Count, and Data.

The next four registers are Segment registers for Code, Data, Extra, and Stack.

Then come the five Offset registers: Instruction Pointer (IP), SP and BP for the stack, then SI and DI for indexing data.

Finally, the flags register holds the processor state.

As you can see, most of the registers are tied up with the bizarre memory addressing scheme used by the 8086. So only four general purpose registers are available, and even they are not as flexible as ARM registers.
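Two of the 8086 quirks mentioned above can be shown in a few lines of Python. This is a sketch with hypothetical function names; the segment arithmetic is the standard real-mode rule (segment shifted left four bits plus offset), which is background knowledge rather than something stated in the text above.

```python
def split_register(value16):
    """Split a 16-bit register such as AX into its (high, low) bytes, e.g. AH and AL."""
    return (value16 >> 8) & 0xFF, value16 & 0xFF

def physical_address(segment, offset):
    """Real-mode 8086 address: segment shifted left four bits, plus the offset."""
    return ((segment << 4) + offset) & 0xFFFFF  # the 8086 has a 20-bit address bus

ah, al = split_register(0x1234)
address = physical_address(0xB800, 0x0010)
# A different segment:offset pair can name the very same byte of memory.
alias = physical_address(0xB801, 0x0000)
```

That two different register pairs can point at the same location is part of what makes the scheme feel awkward next to a flat 32 bit register holding an address.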

The ARM processor differs again in that it has a reduced number of instruction classes (Data Processing, Branching, Multiplying, Data Transfer, Software Interrupts).

A final example of minimal registers is the 6502 processor, which offers you:

  Accumulator - for results of arithmetic instructions
  X register  - First general purpose register
  Y register  - Second general purpose register
  PC          - Program Counter
  SP          - Stack Pointer, offset into page one (at &01xx).
  PSR         - Processor Status Register - the flags.

While it might seem like utter madness to only have two general purpose registers, the 6502 was a very popular processor in the '80s. Many famous computers have been built around it.

For the Europeans: consider the Acorn BBC Micro, Master, Electron...
For the Americans: consider the Apple II and the Commodore PET.
The ORIC uses a 6502, and the C64 uses a variant of the 6502.
(In case you were wondering, the Speccy uses the other popular processor - the ever bizarre and freaky Z80.)

So if entire systems could be created with a 6502, imagine the flexibility of the ARM processor.

It has been said that the 6502 is the bridge between CISC design and RISC. Acorn chose the 6502 for their original machines such as the Atom and the System# units. They went from there to design their own processor - the ARM.

 

To summarise the above, the advantages of a RISC processor are:


Quicker time-to-market. A smaller processor will have fewer instructions, and the design will be less complicated, so it may be produced more rapidly.  

Smaller 'die size' - the RISC processor requires fewer transistors than comparable CISC processors... This in turn leads to a smaller silicon size (I once asked Russell King of ARMLinux fame where the StrongARM processor was - and I was looking right at it, it is that small!)... which, in turn again, leads to less heat dissipation. Most of the heat of my ARM710 is actually generated by the 80486 in the slot beside it (and that's when it is supposed to be in 'standby').

Related to all of the above, it is a much lower power chip. ARM design processors in static form so that the processor clock can be stopped completely, rather than simply slowed down. The Solo computer (designed for use in third world countries) is a system that will run from a 12V battery, charging from a solar panel.  

Internally, a RISC processor has a number of hardwired instructions. This was also true of the early CISC processors, but these days a typical CISC processor has a heart which executes microcode instructions which correlate to the instructions passed into the processor. Ironically, this 'heart' tends to be RISC. :-)

As touched on by Matthias below, a RISC processor's simplicity does not necessarily refer to a simple instruction set.

He quotes LDREQ R0,[R1,R2,LSR #16]!, though I would prefer to quote the 26 bit instruction LDMEQFD R13!, {R0,R2-R4,PC}^ which restores R0, R2, R3, R4, and R15 from the fully descending stack pointed to by R13. The stack is adjusted accordingly. The '^' restores the processor flags into R15 as well as the return address. And it is conditionally executed. This allows a tidy 'exit from routine' to be performed in a single instruction.

Powerful, isn't it?

The RISC concept, however, does not state that all the instructions are simple. If that were true, the ARM would not have a MUL, as you can do the exact same thing with looping ADDing. No, the RISC concept means the silicon is simple. It is a simple processor to implement.

I'll leave it as an exercise for the reader to figure out the power of Matthias' example instruction. It is exactly on par with my example, if not slightly more so!
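To make the register and stack movement of the LDMEQFD example concrete, here is a rough Python model. It is only a sketch under stated assumptions: the function name and the stack contents are hypothetical, memory is a plain dictionary of word addresses, and neither the EQ condition nor the flag restore performed by '^' is modelled.

```python
def ldmfd(regs, memory, base, reglist, writeback=True):
    """Pop the registers in reglist from a full descending stack.

    The lowest-numbered register is loaded from the lowest address, and the
    base register moves up by four bytes per register when writeback is set.
    """
    address = regs[base]
    for reg in sorted(reglist):
        regs[reg] = memory[address]
        address += 4
    if writeback:
        regs[base] = address
    return regs

# Hypothetical stack contents: five words starting at &8000.
memory = {0x8000: 10, 0x8004: 20, 0x8008: 30, 0x800C: 40, 0x8010: 0x9000}
regs = {13: 0x8000}
ldmfd(regs, memory, base=13, reglist=[0, 2, 3, 4, 15])
```

After the call, R0, R2, R3 and R4 hold the four saved values, R15 holds the return address, and R13 has moved past the five words that were popped.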

For a completion of this summary, and some very good points regarding the ARM processor, keep reading...

 


In response to the original version of this text, Matthias Seifert replied with a more specific and detailed analysis. He has kindly allowed me to reproduce his message here...

 

RISC vs ARM

You shouldn't call it "RISC vs CISC" but "ARM vs CISC". For example conditional execution of (almost) any instruction isn't a typical feature of RISC processors but can only(?) be found on ARMs. Furthermore there are quite some people claiming that an ARM isn't really a RISC processor as it doesn't provide only a simple instruction set, i.e. you'll hardly find any CISC processor which provides a single instruction as powerful as a

  LDREQ R0,[R1,R2,LSR #16]!

Today it is wrong to claim that CISC processors execute the complex instructions more slowly; modern processors can execute most complex instructions in one cycle. They may need very long pipelines to do so (up to 25 stages or so with a Pentium III), but nonetheless they can. And complex instructions provide a big potential for optimisation, i.e. if you have an instruction which took 10 cycles with the old model and get the new model to execute it in 5 cycles you end up with a speed increase of 100% (without a higher clock frequency). On the other hand ARM processors executed most instructions in a single cycle right from the start and thus don't have this optimisation potential (except the MUL instruction).

The argument that RISC processors provide more registers than CISC processors isn't right. Just take a look at the (good old) 68000, it has about the same number of registers as the ARM has. And that 80x86 compatible processors don't provide more registers is just a matter of compatibility (I guess). But this argument isn't completely wrong: RISC processors are much simpler than CISC processors and thus take up much less space, thus leaving space for additional functionality like more registers. On the other hand, a RISC processor with only three or so registers would be a pain to program, i.e. RISC processors simply need more registers than CISC processors for the same job.

And the argument that RISC processors have pipelining whereas CISCs don't is plainly wrong. I.e. the ARM2 hadn't whereas the Pentium has...

The advantages of RISC against CISC today are these:

RISC processors are much simpler to build, which in turn results in the following advantages:

o easier to build, i.e. you can use already existing production facilities
o much less expensive, just compare the price of an XScale with that of a Pentium III at 1 GHz...
o less power consumption, which again gives two advantages:
    o much longer use of battery driven devices
    o no need for cooling of the device, which again gives two advantages:
        o smaller design of the whole device
        o no noise

 

RISC processors are much simpler to program which doesn't only help the assembler programmer, but the compiler designer, too. You'll hardly find any compiler which uses all the functions of a Pentium III optimally...

And then there are the benefits of the ARM processors:

Conditional execution of most instructions, which is a very powerful thing especially with large pipelines as you have to fill the whole pipeline every time a branch is taken, that's why CISC processors make a huge effort for branch prediction  

The shifting of registers while other instructions are executed, which means that shifts take up no time at all (the 68000 took one cycle per bit to shift)

The conditional setting of flags, i.e. ADD and ADDS, which becomes extremely powerful together with the conditional execution of instructions  

The free use of offsets when accessing memory, i.e.

  LDR R0,[R1,#16]
  LDR R0,[R1,#16]!
  LDR R0,[R1],#16
  LDR R0,[R1,R2]
  LDR R0,[R1,R2]!
  LDR R0,[R1],R2
  ...

The 68000 could only increase the address register by the size of the data read (i.e. by 1, 2 or 4). Just imagine how much better an ARM processor can be programmed to draw (not only) a vertical line on the screen.
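The LDR forms listed for this point differ only in when the offset is applied and whether the base register is updated afterwards. The following is a minimal Python sketch of that behaviour, not real ARM tooling: the function name `ldr`, the dictionary-based register and memory model, and the 320-byte screen width are all assumptions made for illustration.

```python
# Toy model of the ARM LDR addressing forms (illustrative only).
# regs maps register names to values; mem maps addresses to stored words.

def ldr(regs, mem, rd, rn, offset, pre_indexed=True, writeback=False):
    """Load a word into rd using base register rn and an offset.

    pre_indexed=True,  writeback=False : LDR Rd,[Rn,off]
    pre_indexed=True,  writeback=True  : LDR Rd,[Rn,off]!
    pre_indexed=False                  : LDR Rd,[Rn],off  (always updates Rn)
    offset is either an int (immediate) or a register name.
    """
    off = regs[offset] if isinstance(offset, str) else offset
    base = regs[rn]
    address = base + off if pre_indexed else base
    regs[rd] = mem[address]
    if writeback or not pre_indexed:
        regs[rn] = base + off


# Walk down one column of a hypothetical screen that is 320 bytes wide,
# using the post-indexed form LDR R0,[R1],R2 so R1 steps a whole row each time.
mem = {0x1000 + row * 320: row for row in range(4)}
regs = {"R0": 0, "R1": 0x1000, "R2": 320}
column = []
for _ in range(4):
    ldr(regs, mem, "R0", "R1", "R2", pre_indexed=False)
    column.append(regs["R0"])
```

Because the step in R2 is arbitrary, the same loop walks a vertical line, a diagonal, or any other fixed stride without extra address arithmetic.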

The (almost) free use of all registers with all instructions (which may well be an advantage of any RISC processor). It simply is great to be able to use

  ADD PC,PC,R0,LSL #2
  MOV R0,R0
  B R0is0
  B R0is1
  B R0is2
  B R0is3
  ...

or even

  ADD PC,PC,R0,LSL #3
  MOV R0,R0
  MOV R1,#1
  B Continue
  MOV R2,#2
  B Continue
  MOV R2,#4
  B Continue
  MOV R2,#8
  B Continue
  ...
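The MOV R0,R0 after each ADD PC,PC,... is there because, when PC is read as an operand, it holds the address of the current instruction plus 8. A small Python sketch of the address arithmetic may make this clearer; the load address &8000 and the function name `jump_target` are assumptions for illustration only.

```python
# Address arithmetic behind ADD PC,PC,Rn,LSL #shift (illustrative model).

PIPELINE_OFFSET = 8   # PC reads as the address of the current instruction + 8

def jump_target(add_address, index, shift):
    """Address reached by an ADD PC,PC,Rn,LSL #shift placed at add_address."""
    return add_address + PIPELINE_OFFSET + (index << shift)

# Hypothetical layout of the first example:
#   &8000  ADD PC,PC,R0,LSL #2
#   &8004  MOV R0,R0            (filler, skipped over)
#   &8008  B R0is0
#   &800C  B R0is1 ...
add_address = 0x8000
table = {add_address + 8 + 4 * i: "B R0is%d" % i for i in range(4)}
chosen = table[jump_target(add_address, 2, 2)]
```

With a shift of 3, each table entry is two instructions (eight bytes) long, which is what the second example relies on.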

I used this technique even more extensively when programming my C64 emulator, to emulate the 6510. There the shift is 8, which gives 256 bytes for each instruction to emulate. Within those 256 bytes there is not only the code for the emulation of the instruction but also the code to react to interrupts, the fetching of the next instruction and the jump to the emulation code of that instruction, e.g. the code to emulate the CLC (clear C flag) looks like this:

  ADD  R10,R10,#1      ; increment PC of 6510 to point to next
                       ; instruction
  BIC  R6,R6,#1        ; clear C flag of 6510 status register
  LDR  R0,[R12,#64]    ; read 6510 interrupt state
  CMP  R0,#0           ; interrupt occurred?
  BNE  &00018040       ; yes -> jump to interrupt handler
  LDRB R1,[R4,#1]!     ; read next instruction
  ADD  PC,R5,R1,LSL #8 ; jump to emulation code
  MOV  R0,R0           ; lots of these to fill up the 256 bytes

This means that there is only one single jump for each instruction emulated. By this (and a bit more) the emulator is able to reach 76% of the speed of the original C64 with an A3000, 116% with an A4000, 300% with an A5000 and 3441% with my RiscPC (SA at 287 MHz). The code may look hard to handle, but the source of it looks much better:

  ;-----------;
  ; $18 - CLC ;
  ;-----------;
  ADD R10,R10,#1        ; increment PC of 6510
  BIC R6,R6,#%00000001  ; clear C flag of 6510 status register
  FNNextCommand         ; do next command
  FNFillFree            ; fill remaining space
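The slot arithmetic behind ADD PC,R5,R1,LSL #8 can be sketched in a few lines of Python. This is only an illustration of the sizes involved: the table base &20000 is made up, and the filler count is an assumption about what a macro such as FNFillFree would have to emit, not a description of the author's actual source.

```python
# Slot arithmetic for a 256-bytes-per-opcode dispatch table (illustrative).

SLOT_SHIFT = 8                 # the LSL #8 in ADD PC,R5,R1,LSL #8
SLOT_SIZE = 1 << SLOT_SHIFT    # 256 bytes per emulated 6510 opcode
WORD = 4                       # every ARM instruction is 4 bytes

def handler_address(table_base, opcode):
    """Start of the emulation code for a given 6510 opcode."""
    return table_base + (opcode << SLOT_SHIFT)

def filler_count(instructions_used):
    """Number of MOV R0,R0 fillers needed to pad a slot to 256 bytes."""
    used = instructions_used * WORD
    if used > SLOT_SIZE:
        raise ValueError("handler does not fit in one slot")
    return (SLOT_SIZE - used) // WORD

table_base = 0x20000                            # hypothetical value of R5
clc_address = handler_address(table_base, 0x18) # CLC is opcode $18
clc_fillers = filler_count(7)                   # the CLC listing uses 7 instructions
```

The whole table occupies 256 slots of 256 bytes, i.e. 64K, in exchange for needing only one jump per emulated instruction.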

 

My reply to his reply (!)

The RISC/CISC debate continues. Looking in a few books, it would seem to come down to whether or not microcode is used - thus RISC or CISC is determined more by the actual physical design of the processor than by what instructions or how many registers it offers. This would support the view that some maintain that the 6502 was an early RISC processor. But I'm not going there...


My other comment... 3441%. Wow.

 


Copyright © 2002 Richard Murray


Processor Types

 

ARM 1 (v1)

This was the very first ARM processor. Actually, when it was first manufactured in April 1985, it was the very first commercial RISC processor. Ever.

As a testament to the design team, it was "working silicon" in its first incarnation, it exceeded its design goals, and it used fewer than 25,000 transistors.

The ARM 1 was used in a few evaluation systems on the BBC micro (Brazil - BBC interfaced ARM), and a PC machine (Springboard - PC interfaced ARM).

It is believed a large proportion of Arthur was developed on the Brazil hardware.

In essence, it is very similar to an ARM 2 - the differences being that R8 and R9 are not banked in IRQ mode, there's no multiply instruction, no LDR/STR with register-specified shifts, and no co-processor gubbins.

ARM evaluation system for BBC Master (original picture source not known - downloaded from a website full of BBC-related images; this version created by Rick Murray to include zoomed-up ARM down the bottom...)

 

ARM 2 (v2)


Experience with the ARM 1 suggested improvements that could be made. Such additions as the MUL and MLA instructions allowed for real-time digital signal processing. Back then, it was to aid in generating sounds. Who could have predicted exactly how suitable to DSP the ARM would be, some fifteen years later?

In 1985, Acorn hit hard times which led to it being taken over by Olivetti. It took two years from the arrival of the ARM to the launch of a computer based upon it...

...those were the days my friend, we thought they'd never end.

When the first ARM-based machines rolled out, Acorn could gladly announce to the world that they offered the fastest RISC processor around. Indeed, the ARM processor kicked ass across the computing league tables, and for a long time was right up there in the 'fastest processors' listings. But Acorn faced numerous challenges. The computer market was in disarray, with some people backing IBM's PC, some the Amiga, and all sorts of little itty-bitty things. Then Acorn go and launch a machine offering Arthur (which was about as nice as the first release of Windows) which had no user base, precious little software, and not much third party support. But they succeeded.

The ARM 2 processor was the first to be used within the RISC OS platform, in the A305, A310, and A4x0 range. It was used on all of the early machines, including the A3000. Clocked at 8MHz, it delivers approximately four and a half million instructions per second (0.56 MIPS/MHz).


 

ARM 3 (v2as)

Launched in 1989, this processor built on the ARM 2 by offering 4K of cache memory and the SWP instruction. The desktop computers based upon it were launched in 1990.

Internally, via the dedicated co-processor interface, CP15 was 'created' to provide processor control and identification.

Several speeds of ARM 3 were produced. The A540 runs a 26MHz version, and the A4 laptop runs a 24MHz version. By far the most common is the 25MHz version used in the A5000, though those with the 'alpha variant' have a 33MHz version.

At 25MHz, with 12MHz memory (a la A5000), you can expect around 14 MIPS (0.56 MIPS/MHz).

It is interesting to notice that the ARM3 doesn't 'perform' faster - both the ARM2 and the ARM3 average 0.56 MIPS/MHz. The speed boost comes from the higher clock speed, and the cache.

Oh, and just to correct a common misunderstanding, the A4 is not a squashed down version of the A5000. The A4 actually came first, and some of the design choices were reflected in the later A5000 design.


ARM3 with FPU (original picture downloaded from Arcade BBS, archive had no attribution)

 

ARM 250 (v2as)

The 'Electron' of ARM processors, this is basically a second level revision of the ARM 3 design which removes the cache, and combines the primary chipset (VIDC, IOC, and MEMC) into the one piece of silicon, making the creation of a cheap'n'cheerful RISC OS computer a simple thing indeed. This was clocked at 12MHz (the same as the main memory), and offers approximately 7 MIPS (0.58 MIPS/MHz).

This processor isn't as terrible as it might seem. That the A30x0 range was built with the ARM250 was probably more a cost-cutting exercise than intention. The ARM250 was designed for low power consumption and low cost, both important factors in devices such as portables, PDAs, and organisers - several of which were developed and, sadly, none of which actually made it to a release.


 

ARM 250 mezzanine

This is not actually a processor. It is included here for historical interest. It seems the machines that would use the ARM250 were ready before the processor, so early releases of the machine contained a 'mezzanine' board which held the ARM 2, IOC, MEMC, and VIDC.

 

ARM 4 and ARM 5

These processors do not exist.

More and more people began to be interested in the RISC concept, as at the same sort of time common Intel (and clone) processors showed a definite trend towards higher power consumption and greater need for heat dissipation, neither of which is friendly to devices that are supposed to be running off batteries.

The ARM design was seen by several important players as being the epitome of sleek, powerful RISC design.

It was at this time a deal was struck between Acorn, VLSI (long-time manufacturers of the ARM chipset), and Apple. This led to the death of the Acorn RISC Microprocessor, as Advanced RISC Machines Ltd was born. This new company was committed to design and support specifically for the processor, without the hassle and baggage of RISC OS (the main operating system for the processor and the desktop machines). Both of those would be left to Acorn.

In the change from being a part of Acorn to being ARM Ltd in its own right, the whole numbering scheme for the processors was altered.

 

ARM 610 (v3)

This processor brought with it two important 'firsts'. The first 'first' was full 32 bit addressing, and the second 'first' was the opening for a new generation of ARM based hardware. Acorn responded by making the RiscPC. In the past, critics were none too keen on the idea of slot-in cards for things like processors and memory (as used in the A540), and by this time many people were getting extremely annoyed with the inherent memory limitations in the older hardware: the MEMC can only address 4Mb of memory, and you can add more by daisy-chaining MEMCs - an idea that not only sounds hairy, it is hairy!

The RiscPC brought back the slot-in processor with a vengeance. Future 'better' processors were promised, and a second slot was provided for alien processors such as the 80486 to be plugged in. As for memory, two SIMM slots were provided, and the memory was expandable to 256Mb. This does not sound much as modern PCs come with half that as standard. However you can get a lot of mileage from a RiscPC fitted with a puny 16Mb of RAM.

But, always, we come back to the 32 bit. It has been with us and known about ever since the first RiscPC rolled out, but few people noticed, or cared. Now, as the new generation of ARM processors drops the 26 bit 'emulation' modes, we RISC OS users are faced with the option of getting ourselves sorted, or dying.

Ironically, the other mainstream operating systems for the RiscPC hardware - namely ARMLinux and NetBSD/arm32 - are already fully 32 bit.

Several speeds were produced: 20MHz, 30MHz, and the 33MHz part used in the RiscPC.

The ARM610 processor features an on-board MMU to handle memory, a 4K cache, and it can even switch itself from little-endian operation to big-endian operation. The 33MHz version offers around 28MIPS (0.84 MIPS/MHz).


The RiscPC ARM610 processor card (original picture by Rick Murray, © 2002)

 

ARM 710 (v3)

As an enhancement of the ARM610, the ARM 710 offers an increased cache size (8K rather than 4K), clock frequency increased to 40MHz, improved write buffer and larger TLB in the MMU.

Additionally, it supports CMOS/TTL inputs, Fastbus, and 3.3V power but these features are not used in the RiscPC.

Clocked at 40MHz, it offers about 36MIPS (0.9 MIPS/MHz); combined with the higher clock speed, this makes it run appreciably faster than the ARM 610.

ARM710 side by side with an 80486, the coin is a British 10 pence coin. (original picture by Rick Murray, © 2001)

 

ARM 7500

The ARM7500 is a RISC based single-chip computer with memory and I/O control on-chip to minimise external components. The ARM7500 can drive LCD panels/VDUs if required, and it features power management. The video controller can output up to a 120MHz pixel rate, sound is 32 bit, and there are four A/D convertors on-chip for connection of joysticks etc.

The processor core is basically an ARM710 with a smaller (4K) cache.
The video core is a VIDC2.
The IO core is based upon the IOMD.

The memory/clock system is very flexible, designed for maximum uses with minimum fuss. Setting up a system based upon the ARM7500 should be fairly simple.

 

ARM 7500FE

A version of the ARM 7500 with hardware floating point support.

ARM7500FE, as used in the Bush Internet box.

(original picture by Rick Murray, © 2002)

 

StrongARM / SA110 (v4)

The StrongARM took the RiscPC from around 40MHz to 200-300MHz and showed a speed boost that was more than the hardware should have been able to support. Still severely bottlenecked by the memory and I/O, the StrongARM made the RiscPC fly. The processor was the first to feature separate instruction and data caches, and this caused quite a lot of self-modifying code to fail including, amusingly, Acorn's own runtime compression system. But on the whole, the incompatibilities were not more painful than an OS upgrade (anybody remember the RISC OS 2 to RISC OS 3 upgrade, when all the programs that used SYS OS_UpdateMEMC, 64, 64 for a speed boost froze the machine solid?).

In instruction terms, the StrongARM can offer half-word loads and stores, and signed half-word and byte loads and stores. Also provided are instructions for multiplying two 32 bit values (signed or unsigned) and returning a 64 bit result. These are documented in the ARM assembler user guide as only working in 32-bit mode; however, experimentation will show you that they work in 26-bit mode as well. Later documentation confirms this.

The cache has been split into separate instruction and data caches (Harvard architecture), with both of these caches being 16K, and the pipeline is now five stages instead of three.


In terms of performance... at 100MHz, it offers 114MIPS which doubles to 228MIPS at 200MHz (1.14 MIPS/MHz).
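The MIPS/MHz figures quoted for each processor are just the MIPS rating divided by the clock speed. As a quick check, here is a small Python sketch using only the numbers given in the text; the dictionary layout and the function name `mips_per_mhz` are assumptions for illustration.

```python
# MIPS per MHz from the figures quoted in the text (illustrative check).

def mips_per_mhz(mips, mhz):
    """Instructions per clock, expressed as MIPS divided by MHz."""
    return mips / mhz

# (approximate MIPS, clock in MHz), as quoted for each processor
quoted = {
    "ARM 2": (4.5, 8),
    "ARM 3": (14, 25),
    "ARM 250": (7, 12),
    "ARM 710": (36, 40),
    "StrongARM": (114, 100),
}

ratios = {name: round(mips_per_mhz(mips, mhz), 2)
          for name, (mips, mhz) in quoted.items()}
```

Comparing the ratios rather than the raw MIPS shows how much of each speed-up came from the design itself and how much came simply from a faster clock.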

A StrongARM mounted on a LART board.

In order to squeeze the maximum from a RiscPC, the Kinetic includes fast RAM on the processor card itself, as well as a version of RISC OS that installs itself on the card. Apparently it flies due to removing the memory bottleneck, though this does cause 'issues' with DMA expansion cards.

A Kinetic processor card.

 

SA1100 variant

This is a version of the SA110 designed primarily for portable applications. I mention it here as I am reliably informed that the SA1100 is the processor inside the 'faster' Panasonic satellite digibox. It contains the StrongARM core, MMU, cache, PCMCIA, general I/O controller (including two serial ports), and a colour/greyscale LCD controller. It runs at 133MHz or 200MHz and it consumes less than half a watt of power.

 

 

Thumb

The Thumb instruction set is a reworking of the ARM set, with a few things omitted. Thumb instructions are 16 bits (instead of the usual 32 bits). This allows for greater code density in places where memory is restricted. The Thumb set can only address the first eight registers, and there are no conditional execution instructions. Also, the Thumb cannot do a number of things required for low-level processor exceptions, so the Thumb instruction set will always come alongside the full ARM instruction set. Exceptions and the like can be handled in ARM code, with Thumb used for the more regular code.

 

 

Other versions

These versions are afforded less coverage due, mainly, to my not owning nor having access to any of these versions.

While my site started as a way to learn to program the ARM under RISC OS, the future is in embedded devices using these new systems, rather than the old 26 bit mode required by RISC OS... and so, these processors are something I would like to detail, in time.

M variants

This is an extension of the version three design (ARM 6 and ARM 7) that provides the extended 64 bit multiply instructions.

These instructions became a main part of the instruction set in the ARM version 4 (StrongARM, etc).

 

T variants

These processors include the Thumb instruction set (and, hence, no 26 bit mode).

 

E variants


These processors include a number of additional instructions which provide improved performance in typical DSP applications. The 'E' stands for "Enhanced DSP".

 

 

The future

The future is here. Newer ARM processors exist, but they are 32 bit devices.

This means, basically, that RISC OS won't run on them until all of RISC OS is modified to be 32 bit safe. As long as BASIC is patched, a reasonable software base will exist. However all C programs will need to be recompiled. All relocatable modules will need to be altered. And pretty much all assembler code will need to be repaired. In cases where source isn't available (ie, anything written by Computer Concepts), it will be a tedious slog.

It is truly one of the situations that could make or break the platform.

I feel, as long as a basic C compiler/linker is made FREELY available, then we should go for it. It need not be a 'good' compiler, as long as it will be a drop-in replacement for Norcroft CC version 4 or 5. Why this? Because RISC OS depends upon enthusiasts to create software, instead of big corporations. And without inexpensive reasonable tools, they might decide it is too much to bother with converting their software, so may decide to leave RISC OS and code for another platform.

I, personally, would happily download a freebie compiler/linker and convert much of my own code. It isn't plain sailing for us - think of all of the library code that needs to be checked. It will be difficult enough to obtain a 32 bit machine to check the code works correctly, never mind all the other pitfalls. Asking us for a grand to support the platform is only going to turn us away in droves. Heck, I'm still using ARM 2 and ARM 3 systems. Some of us smaller coders won't be able to afford such a radical upgrade. And that will be VERY BAD for the platform. Look how many people use the FREE user-created Internet suite in preference to commercial alternatives. Look at all of the support code available on Arcade BBS. Much of that will probably go, yes. But would a platform trying to re-establish itself really want to say goodbye to the rest?

I don't claim my code is wonderful, but if only one person besides myself makes good use of it - then it has been worth it.

 

Click here to learn more on 32 bit operation

 


Copyright © 2004 Richard Murray


The Stack

 

The 6502 microprocessor features support for a stack, located at &1xx in memory, and extending for 256 bytes. It also featured instructions which operated more quickly on page zero (&0xx).

Both of these are inflexible, and not in keeping with the RISC concept.

The ARM processor provides instructions for manipulating the stack (LDM and STM). The actual location where your stack lays its hat is entirely up to you and the rules of good programming.

For example:

  MOV   R13, #&8000
  STMFD R13!, {R0-R12, R14}

would work, but is likely to scribble your registers over something important. So typically you would set R13 to the end of your workspace, and stack backwards from there.

These are conventions used in RISC OS. You can replace R13 with any register except R14 (if you need it) and R15. As R14 and R15 have a defined purpose, the next register down is R13, so that is used as the stack pointer.

Likewise, in RISC OS, the stacks are fully descending (FD; STMFD is the same instruction as STMDB, and LDMFD the same as LDMIA), which means the stack grows downwards in memory, and the updated stack pointer points to the last item stored rather than to the next free location.

You can, quite easily, shirk convention and stack using whatever register you like (R0-R13 and R14 if you don't need it) and also you can set up any kind of stack you like, growing up, growing down, pointer to next free or last used... But be aware that when RISC OS provides you with stack information (if you are writing a module, APCS assembler, BASIC assembler, or being a transient utility, for example) it will pass the address in R13 and expect you to be using a fully descending stack. So while you can use whatever type of stack/location that suits you, it is suggested you follow the OS style. It makes life easier.
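In a full descending stack the pointer is decremented before each store and incremented after each load, so it always holds the address of the most recently stored word. A minimal Python sketch of that behaviour follows; the class name, the dictionary used as memory, and the workspace end address &10000 are assumptions for illustration only.

```python
# Toy model of a full descending (FD) stack (illustrative only).

class FullDescendingStack:
    def __init__(self, top):
        self.sp = top          # plays the part of R13
        self.memory = {}       # address -> stored word

    def push(self, value):
        """Like STMFD R13!, {Rx}: decrement first, then store."""
        self.sp -= 4
        self.memory[self.sp] = value

    def pop(self):
        """Like LDMFD R13!, {Rx}: load first, then increment."""
        value = self.memory.pop(self.sp)
        self.sp += 4
        return value


stack = FullDescendingStack(0x10000)   # hypothetical end of workspace
stack.push(0x8123)                     # e.g. a return address from R14
sp_after_push = stack.sp               # points AT the stored word
restored = stack.pop()
```

Starting R13 at the end of the workspace means the first push lands on the last word of that workspace, and nothing above it is touched.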

If you are not sure what a stack is, exactly, then consider it a temporary dumping area. When you start your program, you will want to put R14 somewhere so you know where to branch to in order to exit. Likewise, every time you BL, you will want to put R14 someplace if you plan to call another BL.

To make this clearer:

  ; ...entry, R14 points to exit location

    BL      one
    BL      two
    MOV     PC, R14    ; exit

  .one    ; R14 points to instruction after 'BL one'
    ...do stuff...
    MOV     PC, R14    ; return

  .two    ; R14 points to instruction after 'BL two'
    ...do stuff...
    BL      three
    MOV     PC, R14    ; return

  .three  ; R14 points to instruction after 'BL three'
    B       four       ; no return

  .four   ; Not a BL, so R14 unchanged
    MOV     PC, R14    ; returns from .three because R14 not changed.

Take a moment to work through that code. It is fairly simple. And fairly obvious is that something needs to be done with R14, otherwise you won't be able to exit. Now, a viable answer is to shift R14 into some other register. So now consider that the "...do stuff..." parts use ALL of the remaining registers.

Now what? Well, what we need is a controlled way to dump R14 into memory until we come to need it.

That's what a stack is.

That code again:

  ; ...entry, R14 points to exit location, we assume R13 is set up

    STMFD   R13!, {R14}
    BL      one
    BL      two
    LDMFD   R13!, {PC} ; exit

  .one    ; R14 points to instruction after 'BL one'
    STMFD   R13!, {R14}
    ...do stuff...
    LDMFD   R13!, {PC} ; return

  .two    ; R14 points to instruction after 'BL two'
    STMFD   R13!, {R14}
    ...do stuff...
    BL      three
    LDMFD   R13!, {PC} ; return

  .three  ; R14 points to instruction after 'BL three'
    B       four       ; no return

  .four   ; Not a BL, so R14 unchanged
    MOV     PC, R14    ; returns from .three because R14 not changed
                       ; (nothing was stacked here, so nothing is unstacked).


A quick note, you can write:

  STMFD R13!, {R14}
  ...do stuff...
  LDMFD R13!, {R14}
  MOV   PC, R14

but the STM/LDM does NOT keep track of which stored values belong in which registers, so you can store R14, and reload it directly into PC thus disposing of the need to do a MOV afterwards.

The caveat is that the registers are saved in ascending order...

  STMFD R13!, {R7, R0, R2, R1, R9, R3, R14}

will save R0, R1, R2, R3, R7, R9, and R14 (in that order). So code like:

  STMFD R13!, {R0, R1}
  LDMFD R13!, {R1, R0}

to swap two registers will not work.
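The reason the swap fails is that the register list is really a set: whatever order you write it in, the lowest-numbered register always goes to (and comes from) the lowest address. A small Python sketch of this follows; the function names `stm_fd` and `ldm_fd` and the dictionary-based registers and memory are assumptions for illustration, not a real API.

```python
# Toy model of register-list ordering in STMFD/LDMFD (illustrative only).
# regs maps register numbers to values; memory maps addresses to words.

def stm_fd(regs, memory, sp, reg_list):
    """STMFD sp!, {reg_list}: the written order of the list is ignored."""
    ordered = sorted(set(reg_list))
    sp -= 4 * len(ordered)
    for i, r in enumerate(ordered):
        memory[sp + 4 * i] = regs[r]   # lowest register at lowest address
    return sp

def ldm_fd(regs, memory, sp, reg_list):
    """LDMFD sp!, {reg_list}: again lowest register from lowest address."""
    ordered = sorted(set(reg_list))
    for i, r in enumerate(ordered):
        regs[r] = memory[sp + 4 * i]
    return sp + 4 * len(ordered)


regs = {0: "old R0", 1: "old R1"}
memory = {}
sp = stm_fd(regs, memory, 0x1000, [0, 1])
sp = ldm_fd(regs, memory, sp, [1, 0])      # the attempted "swap"
after_attempt = dict(regs)                 # unchanged

# Unstacking one register at a time does swap them.
sp = stm_fd(regs, memory, sp, [0, 1])
sp = ldm_fd(regs, memory, sp, [1])         # R1 takes the old R0
sp = ldm_fd(regs, memory, sp, [0])         # R0 takes the old R1
```

The second half shows that two separate single-register loads are one way to get the swap, at the cost of an extra instruction.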

 

Please refer to this document for details on STM/LDM and how to use a stack.


Copyright © 2004 Richard Murray


Memory Management

 

Introduction

The RISC OS machines work with two different types of memory - logical and physical.

The logical memory is the memory as seen by the OS, and the programmer. Your application begins at &8000 and continues until &xxxxx.

The physical memory is the actual memory in the machine.

Under RISC OS, memory is broken into pages. Older machines have a page of 8/16/32K (depending on installed memory), and newer machines have a fixed 4K page. If you were to examine the pages in your application workspace, you would most likely see that the pages were seemingly random, not in order. The pages relate to physical memory, combined to provide you with xxxx bytes of logical memory. The memory controller is constantly shuffling memory around so that each task that comes into operation 'believes' it is loaded at &8000. Write a little application to count how many wimp polls occur every second, and you'll begin to appreciate how much is going on in the background.

 

MEMC : Older systems

In ARM 2, 250, and 3 machines, the memory is controlled by the MEMC (MEMory Controller). This unit can cope with an address space of 64Mb, but in reality can only access 4Mb of physical memory. The 64Mb space is split into three sections:

   0Mb - 32Mb : Logical RAM
  32Mb - 48Mb : Physical RAM
  48Mb - 64Mb : System ROMs and I/O

Parts of the system ROMs and I/O are mapped over each other, so reading from it gives you code from ROM, and writing to it updates things like the VIDC (video/sound).
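The three-way split of the 64Mb MEMC address space can be expressed as a simple range lookup. The Python sketch below uses only the boundaries given in the text; the function name `memc_region` and the table layout are assumptions for illustration.

```python
# Which part of the 64Mb MEMC address map does an address fall in?
# (Illustrative lookup based on the boundaries quoted in the text.)

MB = 1024 * 1024

REGIONS = [
    (0 * MB, 32 * MB, "Logical RAM"),
    (32 * MB, 48 * MB, "Physical RAM"),
    (48 * MB, 64 * MB, "System ROMs and I/O"),
]

def memc_region(address):
    """Return the name of the region containing address."""
    for start, end, name in REGIONS:
        if start <= address < end:
            return name
    raise ValueError("address is outside the 64Mb MEMC address space")

app_start = memc_region(0x8000)       # where applications begin
high_area = memc_region(0x3800000)    # 56Mb, up in the ROM/I/O area
```

The 64Mb total is exactly what a 26 bit address can reach, which is why the map stops there.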

It is possible to fit up to 16Mb of memory to an older machine, but you will need a matched MEMC for each 4Mb. People have reported that simply fitting two MEMCs (to give 8Mb) is either hairy or unreliable, or both. In practice, the hardware to do this properly only really existed for the A540 machine, where each 4Mb was a slot-in memory card with an on-board MEMC. Other solutions for, say, the A5000 and the A410, are elaborate bodges. Look at http://www.castle.org.uk/castle/upg25.htm for an example of what is required to fit 8Mb into an A5000!


The MEMC is capable of restricting access to pages of memory in certain ways, either complete access, no access, no access in USR mode, or read-only access. Older versions of RISC OS only implemented this loosely, so you need to be in SVC mode to access hardware directly but you could quite easily trample over memory used by other applications.

 

MMU : Newer systems

The newer systems, with ARM6 or later processor, have an MMU built into the processor. This consists of the translation look-aside buffer (TLB), access control logic, and translation table walk logic. The MMU supports memory accesses based upon 1Mb sections or 4K pages. The MMU also provides support for up to 16 'domains', areas of memory with specific access rights.

The TLB caches 64 translated entries. If the TLB holds an entry for the virtual address, the control logic determines if access is permitted. If it is, the MMU outputs the appropriate physical address, otherwise it signals the processor to abort.

If the TLB misses (it doesn't contain an entry for the virtual address), the walk logic will retrieve the translation information from the (full) translation table in physical memory.

If the MMU is disabled, the virtual address is output directly as the physical address.
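The lookup sequence described here (TLB first, table walk on a miss, access check, abort on failure, pass-through when disabled) can be sketched as a toy model in Python. This is a simplification under stated assumptions: it handles 4K pages only, ignores sections and domains, reduces access rights to a single allowed/denied flag, and uses a made-up oldest-first eviction; the class and method names are hypothetical.

```python
# Toy model of MMU address translation with a 64-entry TLB (illustrative).

PAGE_SIZE = 4096
TLB_ENTRIES = 64

class ToyMMU:
    def __init__(self, page_table, enabled=True):
        # page_table: virtual page number -> (physical page number, allowed)
        self.page_table = page_table
        self.tlb = {}
        self.enabled = enabled
        self.walks = 0            # how many table walks were needed

    def translate(self, vaddr):
        if not self.enabled:
            return vaddr          # MMU off: virtual address passes through
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:   # TLB miss: walk the full table
            self.walks += 1
            if vpn not in self.page_table:
                raise MemoryError("abort: no translation for this page")
            if len(self.tlb) >= TLB_ENTRIES:
                # evict the oldest entry (assumed policy, for the sketch only)
                self.tlb.pop(next(iter(self.tlb)))
            self.tlb[vpn] = self.page_table[vpn]
        ppn, allowed = self.tlb[vpn]
        if not allowed:
            raise PermissionError("abort: access not permitted")
        return ppn * PAGE_SIZE + offset


mmu = ToyMMU({8: (0x123, True), 9: (0x45, False)})
first = mmu.translate(0x8004)     # miss, then walk
second = mmu.translate(0x8FFC)    # same page, served from the TLB
```

Only the first access to a page pays for the table walk; later accesses to the same page are answered from the TLB.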

It gets a lot more complicated, suffice to say that more access rights are possible and you can specify memory to be bufferable and/or cacheable (or not), and the page size is fixed to 4K. A normal RiscPC offers two banks of RAM, and is capable of addressing up to 256Mb of RAM in fairly standard PC-style SIMMs, plus up to 2Mb of VRAM double-ported with the VIDC, plus hardware/ROM addressing.

On the RiscPC, the maximum address space of an application is 28Mb. This is not a restriction of the MMU but a restriction in the 26-bit processor mode used by RISC OS. A 32-bit processor mode could, in theory, allocate the entire 256Mb to a single task.

All current versions of RISC OS are 26-bit.

 

System limitations

Consider a RiscPC with an ARM610 processor.
The cache is 4K.
The bus speed is 16MHz (note, only slightly faster than the A5000!), and the hardware does not support burst-mode for memory accesses.
Upon a context switch (ie, making an application 'active') you need to remap its memory to begin at &8000 and flush the cache.
I'll leave you to do the maths. :-)


Memory schemes and multitasking

Introduction

This is a reference, designed to help you understand the various types of memory handling and multitasking that exist.

 

Memory is a resource that needs careful management. It is expensive (£/Mb is much higher for memory than for conventional harddisc storage). A good system will offer flexible facilities trading off speed for functionality.

You need memory because it is fast. It is rarely as fast as the processor, these days, but it is faster than harddiscs. So we need fast, and we need big, so that we can hold these large programs and large amounts of data that seem to be around. It boggles the mind that a commercial mainframe did accounts and stuff with a mere 4K of memory.

Typically, there will be three or four, possibly five, kinds of storage in the computer.

1. Level 1 cache
   This is inside the processor, usually operating at the core speed of the processor. It is between 4K and 32K usually.

2. Level 2 cache
   If the difference between the processor speed and system memory is quite large, you will often have a level 2 cache. This is mounted on the motherboard, and typically runs at a speed roughly halfway between the processor speed and the speed of the system memory.
   It is usually between 64K and 512K. RISC OS machines do not have Level 2 cache.

3. Level 3 cache
   If your processor is running at some silly speed (such as 1GHz) and your system memory is running at a tenth of that, you might like a chunk (say a Mb or two) of cache between level 2 and system memory, so that you can further improve speed.
   Each layer of cache is getting slower, until we reach...

4. System memory
   Your DRAM, SRAM, SIMMs, DIMMs, or whatever you have fitted. Speeds range from 2MHz in the old home computers, to around 133MHz in a typical PC compatible. Older PCs use 33MHz or 66MHz buses.
   The ARM2/250 machines have an 8MHz bus, the ARM3 machines (A5000,...) have a 12MHz bus, the RiscPC has a 16MHz bus. In these cases, only the ARM2 is clocked at the same speed as the bus. The ARM3 is clocked at 25 or 30MHz, the ARM610 at 33MHz, the ARM710 at 40MHz and the StrongARM at a variety of speeds up to 280-ish MHz.

5. Harddisc
   Slow, huge, cheap.

 

Basic monoprogramming

This is where all of the memory is just available, and you run one application at a time. The kernel/OS/BIOS (whatever) sits in one place, either in RAM or ROM and it is mapped into the address map.

Consider:

  .----------------.    .----------------.
  |   OS in ROM    |    | Device drivers |
  |                |    |     in ROM     |
  |----------------|    |----------------|
  |                |    |                |
  |      Your      |    |      Your      |
  |  application   |    |  application   |
  |                |    |----------------|
  |----------------|    |                |
  |System workspace|    |   OS in RAM    |
  '----------------'    '----------------'

The first example is similar to the layout of the BBC microcomputer. The second is not that different to a basic MS-DOS system: the OS is loaded low in memory, the BIOS is mapped in at the top, and the application sits in the middle.

To be honest, the first example is used a lot under RISC OS as well. It is exactly what a standard application is supposed to believe. The OS uses page zero (&0000 - &7FFF) for internal housekeeping, it (your app) begins at &8000, and the hardware/OS sit way up in the ether at &3800000.

Memory management under RISC OS is more complex, but this is how a typical application will see things.

When the memory is organised in this way, only one application can be running. When the user enters a command, if it is an application then that application is copied from disc into memory, then it is executed. When the application is done with, the operating system reappears, waiting for you to give it something else to do.

 

Basic multiprogramming


Here, we are running several applications. While they are not running concurrently (to do so would be impossible, a processor can only do one thing at a time), the amount of time given to an application is tiny, so the system is spending a lot of time faffing around hopping from one application to the next, all giving you the illusion that n applications are all happily running together on your computer.

Memory is typically handled as non-contiguous blocks. On an ARM machine, pages are brought together to fake a chunk of memory beginning at &8000. Anybody who has tried an address translation in their allocated memory will know two things. Firstly, it is near impossible to get an actual physical memory address out of the OS.

The following program demonstrates this:

  END = &10000 : REM Constrain slot to 32K

  DIM willow% 16
  SYS "Wimp_SlotSize", -1, -1 TO slot%
  SYS "OS_ReadMemMapInfo" TO page%

  PRINT "Using "+STR$(slot% / page%)+" pages, each page being "+STR$(page%)+" bytes."
  PRINT "Pages used: ";

  more% = slot% / page%
  FOR loop% = 0 TO (more% - 1)
    willow%!0 = 0
    willow%!4 = &8000 + (loop% * page%)
    willow%!8 = 0
    willow%!12= -1
    SYS "OS_FindMemMapEntries", willow%
    IF loop% > 0 THEN PRINT ", ";
    PRINT STR$(willow%!0);
  NEXT
  PRINT
  END

This outputs something similar to:

  Using 8 pages, each page being 4096 bytes.
  Pages used: 2555, 2340, 2683, 2682, 2681, 2680, 2679, 2678

 

RISC OS handles memory by loading everything into memory. These applications are then 'paged in' by remapping the memory pointers in the page tables; consequently, other tasks are mapped out.

Windows/Unix systems load applications into memory, supported by a system called 'virtual memory' which dumps unused pages to disc in order to free system memory for applications that need it. I am not sure how Windows organises its memory, if it does it in a style similar to RISC OS (ie, remap to start from a specific address) or if each application is just told 'you are here'. Virtual memory is useful, as you can fit a 32Mb program into 16Mb of memory if you are careful how you load it, and swap out old parts for new parts as necessary.

Some systems use a lazy-paging form of memory. In this case, only the first page of memory is filled by the application when execution starts. As more of the application is executed, the operating system fills in the parts as required.

By contrast, under RISC OS an application needs to load. Consider loading, well, practically anything, off of floppy disc. It takes time.

 

Virtual memory

When you no longer have actual physical memory, you may have virtual memory. A set of memory locations that don't exist, but the operating system tries real hard to convince you they do. And in the centre of the ring is the MMU (Memory Management Unit, inspired name, no?) keeping control.

[note: you need an MMU anyway when your memory is broken into remappable pages, this just seemed like a good time to introduce it!]

When the processor is instructed to jump to &8000 to begin executing an application, it passes the address &8000 to the MMU. This translates the address into the correct real address and outputs this on the address lines, say &12FC00. The processor is not aware of this, the application is not aware of this, the computer user is not aware of this.

So we can take this one stage further by mapping onwards into memory that does not exist at all. In this case, the MMU will hiccup and say "Oi! You! No!" and the operating system will be called in a panic (correctly known as a "page fault"). The operating system will be calm and collected and think, "Ah, virtual memory". A little-used page of real memory will be shoved out to disc, then the page that the MMU was trying to find will be reloaded in place of the page we just got rid of. The memory map will be updated accordingly, then control will be handed back to the user application at the exact point the page fault occurred. The application would, unknowing of all of this palaver, perform that instruction again, only this time the MMU will (happily?) output the correct address to the memory system, and all will continue.
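The translate-then-fault-then-retry sequence can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the page table is a plain dictionary, the frame numbers are made up, and the hypothetical `access` helper stands in for the operating system by simply taking a free frame rather than evicting a page to disc.

```python
PAGE_SIZE = 4096  # 4K pages, as on the RiscPC


class PageFault(Exception):
    """Raised when a virtual page has no physical page behind it."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical one, or raise PageFault."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table.get(page)
    if frame is None:
        raise PageFault(page)
    return frame * PAGE_SIZE + offset


def access(page_table, virtual_address, free_frames):
    """Retry the access once the 'OS' has mapped the missing page."""
    try:
        return translate(page_table, virtual_address)
    except PageFault as fault:
        # Assumption: a real OS would choose a victim page, write it to
        # disc and load the wanted page. Here we just take a free frame.
        page_table[fault.args[0]] = free_frames.pop()
        return translate(page_table, virtual_address)


table = {0x8: 0x12F}                    # virtual page 8 -> physical frame &12F
print(hex(translate(table, 0x8004)))    # address &8004 lands in frame &12F
print(hex(access(table, 0x9010, [0x200])))
```

The application only ever sees the value returned by `access`; whether a fault happened on the way is invisible to it.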

 

Page tables and the MMU

The page table exists to map each page into an address. This allows the operating system to keep track of which memory is pretending to be which. However it is more complex. Some pages cannot be remapped, some pages are doubly mapped, some are not to be touched in user mode code, some aren't to be touched at all. Some are read only. Some just don't exist. All of this must be kept track of.

So the MMU takes an address, looks it up in the page table, and spits out the correct address.

Let's do some maths. We'll assume a 4K page size (a la RISC OS in a RiscPC). A 32bit address space has a million pages. With one million pages, you'll need one million entries. In the ARM MMU, each entry takes 7 words. So we are looking at seven million words, or 28 megabytes, just to index our memory.


It gets better. Every single memory reference will be passed through the MMU. So we'll want it to operate in nanoseconds. Faster, if possible.

In reality, it is somewhat easier as most typical machines don't have enough memory to fill the entire addressing space, indeed many are unlikely to get close for technical reasons (the RiscPC can have 258Mb maximum RAM, or 514Mb with Kinetic - the extra 2Mb is the VRAM). Even so, the page tables will get large.

So there are three options:

1. Have a huge array of fast registers in the MMU. Costly. Very.
2. Hold the page tables in main memory. Slow. Very.
3. Compromise. Cache the active pages in the MMU, and store the rest on disc.

An example. A RiscPC, 64Mb of RAM, 2Mb of VRAM, 4Mb of ROM and hardware I/O (double mapped). That's 73400320 bytes, or 17920 pages. It would take 71680 bytes just to store one address word for each page. But an address on its own isn't much use. Seven words comprise an entry in the ARM's MMU. So our 17920 pages would require 501760 bytes in order to fully index the memory.

You just can't store that lot in the MMU. So you'll store a snippet, say 16K worth?, and keep the rest in RAM.
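The arithmetic is easy to check. The sketch below takes the figures from the text as given (4K pages, seven words per entry) and assumes a word is four bytes; the helper name `page_table_cost` is made up for the illustration.

```python
PAGE_SIZE = 4 * 1024   # 4K pages
WORD_BYTES = 4         # assumption: one word is four bytes
ENTRY_WORDS = 7        # entry size quoted in the text
MB = 1024 * 1024


def page_table_cost(mapped_bytes):
    """Return (pages, bytes for bare addresses, bytes for full entries)."""
    pages = mapped_bytes // PAGE_SIZE
    return pages, pages * WORD_BYTES, pages * ENTRY_WORDS * WORD_BYTES


riscpc = (64 + 2 + 4) * MB             # RAM + VRAM + ROM and I/O
print(riscpc, page_table_cost(riscpc))

full = page_table_cost(2 ** 32)        # the whole 32 bit address space
print(full, full[2] // MB, "Mb")
```

For the 70Mb machine this gives 17920 pages and 501760 bytes of entries; for a fully populated 32 bit address space it gives 1048576 pages and 28Mb of entries.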

 

The TLB

The Translation Lookaside Buffer is a way to make paging even more responsive. Typically, a program will make heavy use of a few pages and barely touch the rest. Even if you plan to byte read the entire memory map, you will be making four thousand hits in one page before going to the next.

A solution to this is to fit a little bit in the MMU that can map virtual addresses to their physical counterparts without traversing the page table. This is the TLB. It lives within the MMU and contains details of a small number of pages (usually between four and sixty four - the ARM610 MMU TLB has thirty two entries).

Now, when we have a page lookup, we first pass our virtual address to the TLB which will check all of the addresses stored, and the protection level. If a match is found, the TLB will spit out the physical address and the page table isn't touched.

If a miss is encountered, then the TLB will evict one of its entries and load in the page information looked up in the page table. The TLB then knows about the newly requested page and can quickly satisfy the next memory access, as chances are the next access will be in the page just requested.
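As a rough illustration of the hit and miss paths, here is a small Python model of a TLB sitting in front of a page table. It is a sketch, not a description of the ARM610 hardware: the text does not say which entry gets evicted, so this version assumes the oldest-loaded entry goes first, and the `TLB` class and its counters are invented for the example.

```python
from collections import OrderedDict


class TLB:
    """A tiny cache of page -> frame translations."""

    def __init__(self, page_table, entries=32):
        self.page_table = page_table
        self.entries = entries
        self.cache = OrderedDict()      # oldest-loaded entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.cache:
            self.hits += 1              # fast path: page table untouched
            return self.cache[page]
        self.misses += 1
        frame = self.page_table[page]   # slow path: consult the page table
        if len(self.cache) >= self.entries:
            # Assumption: evict the entry that was loaded longest ago.
            self.cache.popitem(last=False)
        self.cache[page] = frame
        return frame
```

Byte reading three consecutive 4K pages through such a TLB costs three misses (one per page) and 12285 hits, which is the "four thousand hits in one page" effect described above.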

So far we have figured on the hardware doing all of this, as in the ARM processor. Some RISC processors (such as the Alpha and the MIPS) will pass the TLB miss problem to the operating system. This may allow the OS to use some intelligence to pre-load certain pages into the TLB.

 

Page size


Users of a RISC OS 3.5 system running on an ARM610 with two or more large (say, 20Mb) applications running will know the value of a 4K page. Because it's bloody slow. To be fair, this isn't the fault of the hardware, but more the WIMP doing stuff the kernel should do (as happens in RISC OS 3.7) and doing it slower!

Like with harddisc LFAUs, what you need is a sensible trade-off between page granularity and page size. You could reduce the wastage in memory by making pages small, say 256 bytes. But then you would need a lot of memory to store the page table, and a bigger page table is slower to scan through. Or you could have 64K pages, which make the page table small, but can waste huge amounts of memory.

To consider, a 32K program would require eight 4K pages, or sixty four 512 byte pages. If your system remaps memory when shuffling pages around, it is quicker to move a smaller number of large pages than a larger number of small pages.
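The trade-off can be put into numbers with a short Python sketch. The function name `pages_needed` and the 33K example program are made up for the illustration; the 32K figures are the ones from the text.

```python
def pages_needed(program_bytes, page_size):
    """Return (pages used, bytes wasted in the last page)."""
    pages = -(-program_bytes // page_size)      # ceiling division
    wasted = pages * page_size - program_bytes
    return pages, wasted


K = 1024
for page_size in (256, 512, 4 * K, 64 * K):
    print(page_size, pages_needed(32 * K, page_size), pages_needed(33 * K, page_size))
```

A 32K program fits exactly into eight 4K pages or sixty four 512 byte pages, but on 64K pages it wastes half a page, and a 33K program on 64K pages wastes 31K.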

The MEMC in older RISC OS machines had a fixed page table. So the size of page depended upon how much memory was utilised.

  MEMORY    PAGE SIZE
  0.5Mb     8K
  1Mb       8K
  2Mb       16K
  4Mb       32K

3Mb wasn't a valid option, and 4Mb is the limit. You can increase this by fitting a slave MEMC, in which case you are looking at 2 lots of 4Mb (invisible to the OS/user).

In a RiscPC, the MMU accesses a number of 4K pages. The limits are due, I suspect, to the system bus or memory system, not the MMU itself.

Most commercial systems use page sizes in the order of 512 bytes to 64K. The later ARM processors (ARM6 onwards) and the Intel Pentium both use page sizes of 4K.

 

Page replacement algorithms

When a page fault occurs, the operating system has to pick a page to dump, to allow the required page to be loaded. There are several ways that this may be achieved. None of these is perfect; each is a compromise of efficiency.

Not Recently Used

This requires two bits to be reserved in the page table, a bit for read/write and a bit for page reference. Upon each access, the paging hardware (and it must be done in hardware for speed) will set the bits as necessary. Then on a fixed interval the operating system will clear these bits - either when idling or upon clock interrupt? This then allows you to track the recent page accesses, so when flushing out a page you can spot those that have not recently been read/written or referenced. NRU would remove one of these pages at random. While it is not the best way of sorting out which pages to remove, it is simple and gives reasonably good results.
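A minimal Python sketch of the selection step follows. It makes two assumptions that the text does not spell out: the two bits are treated as the usual "referenced" and "modified" bits, and pages are ranked into four classes (the common textbook ordering) with the victim picked at random from the lowest non-empty class. The function name `nru_victim` is invented here.

```python
import random


def nru_victim(pages, rng=random):
    """pages maps a page name to its (referenced, modified) bits."""
    classes = {0: [], 1: [], 2: [], 3: []}
    for name, (referenced, modified) in pages.items():
        # Assumed ranking: class 0 = untouched ... class 3 = referenced and modified.
        classes[2 * int(referenced) + int(modified)].append(name)
    for number in range(4):
        if classes[number]:
            return rng.choice(classes[number])   # random pick within the class
    raise ValueError("no pages to evict")


print(nru_victim({"a": (True, True), "b": (False, True), "c": (True, False)}))
```

Here only page "b" has gone unreferenced since the bits were last cleared, so it is the one chosen.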

First-In First-Out

It is hoped you are familiar with the concept of FIFO, from buffering and the like. If you are not, consider the lame analogy of the hose pipe in which the first water in will be the first water to come out the other end. It is rarely used, I'll leave the whys and where-fores as an exercise for the bemused reader. :-)

Second Chance

A simple modification to the FIFO arrangement is to look at the access bit, and if it is zero then we know the page is not in current use and can be thrown. If the bit is set, then the page is shifted to the end of the page list as if it was a new access, and the page search continues. What we are doing here is looking for a page unused since the last period (clock tick?). If by some miracle ALL the pages are current and active, then Second Chance will revert to FIFO.
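Second Chance is short enough to sketch directly. In this Python version the page list is a `deque` with the oldest page at the left, and the access bits live in a dictionary; both structures, and the name `second_chance_victim`, are assumptions made for the illustration. Clearing the bit when a page is moved to the back is what makes the fall-back to plain FIFO work.

```python
from collections import deque


def second_chance_victim(queue, referenced):
    """Pop and return the first page (oldest first) whose access bit is clear."""
    while True:
        page = queue.popleft()
        if referenced[page]:
            referenced[page] = False    # its second chance is now used up
            queue.append(page)          # treat it like a newly loaded page
        else:
            return page


pages = deque(["a", "b", "c"])
bits = {"a": True, "b": False, "c": True}
print(second_chance_victim(pages, bits), list(pages))
```

If every page has its bit set, the loop clears them all on the first pass and then evicts the oldest page, which is exactly what FIFO would have done.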

Clock

Although Second Chance is good, all that page shuffling is inefficient so the pages are instead referenced in a circular list (ie, clock). If the page being examined is in use, we move on and look at the next page. With no concept of the start and end of the list, we just keep going until we come to a usable page.
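The circular version can be sketched with a list of access bits and a "hand" index. As with Second Chance, this sketch assumes the bit is cleared as the hand passes an in-use page (otherwise the hand could circle forever); the function name and return convention are made up for the example.

```python
def clock_victim(referenced, hand):
    """Advance the hand until a page with a clear access bit is found.

    referenced is a list of access bits, one per page in the ring.
    Returns (index of the victim, new hand position).
    """
    size = len(referenced)
    while referenced[hand]:
        referenced[hand] = False        # assumption: clear the bit in passing
        hand = (hand + 1) % size
    return hand, (hand + 1) % size


bits = [True, True, False, True]
print(clock_victim(bits, 0), bits)
```

Nothing is moved around in memory; only the hand and the bits change, which is the saving over Second Chance.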

Least Recently Used

LRU is possible, but it isn't cheap. You maintain a list of all the pages, sorted by the most recently used at the front of the list, to the least recently used at the back. When you need a page, you pull the last entry and use it. Because of speed, this is only really possible in hardware as the list should be updated each memory access.
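For illustration only, here is what the bookkeeping looks like in Python. The `LRUFrames` class is hypothetical, and it keeps the least recently used page at the front of an `OrderedDict` (the reverse of the ordering described above, purely because that is the cheap end to pop from). The point to notice is that every single access has to update the list.

```python
from collections import OrderedDict


class LRUFrames:
    """Tracks which pages are resident, least recently used first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, page):
        """Record an access; return the page evicted to make room, if any."""
        evicted = None
        if page in self.pages:
            self.pages.move_to_end(page)        # now the most recently used
        else:
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=False)
            self.pages[page] = True
        return evicted


frames = LRUFrames(2)
print([frames.touch(p) for p in ["a", "b", "a", "c"]], list(frames.pages))
```

Loading "c" evicts "b" rather than "a", because "a" was touched again after "b" was loaded.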

Not Frequently Used

In an attempt to simulate LRU in software, the OS scans the available pages on each clock tick and increments a counter (held in memory, one for each page) depending on the read/written bit.

Unfortunately, it doesn't forget. So code heavily used then no longer necessary (such as a rendering core) will have a high count for quite a while. Then, code that is not called often but should be all the more responsive, such as redraw code, will have a lower count and thus stand the possibility of being kicked out, even though the higher-rated renderer is no longer needed but not kicked out as its count is higher.

But this can be fixed, and the fix emulates LRU quite well. It is called aging. Just before the count is incremented, it is shifted one bit to the right. So after a number of shifts the count will be zero unless the bit is added. Here you might be wondering how adding a bit can work, if you've just shifted a bit off. The answer is simple. The added bit is added to the leftmost position, ie most significant.

To make this clearer...

  Once upon a time : 0 0 1 0 1 1
  Clock tick       : 0 0 0 1 0 1
  Clock tick       : 0 0 0 0 1 0
  Memory accessed  : 1 0 0 0 0 1
  Clock tick       : 0 1 0 0 0 0
  Memory accessed  : 1 0 1 0 0 0
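The table above can be reproduced with a few lines of Python. The sketch assumes a six bit counter (to match the table) and reads each "Memory accessed" row as a clock tick on which the page's access bit was found set; the names `age` and `coldest` are invented for the example.

```python
COUNTER_BITS = 6   # width used in the worked example above


def age(counter, accessed, bits=COUNTER_BITS):
    """One clock tick: shift right, then add the access bit at the top."""
    counter >>= 1
    if accessed:
        counter |= 1 << (bits - 1)
    return counter


def coldest(counters):
    """The page with the smallest counter is the best eviction candidate."""
    return min(counters, key=counters.get)


history = [0b001011]
for accessed in (False, False, True, False, True):
    history.append(age(history[-1], accessed))
print([format(value, "06b") for value in history])
# ['001011', '000101', '000010', '100001', '010000', '101000']
```

Because a recent access lands in the most significant bit, a page touched on the last tick always outranks one that was busy long ago, which is what lets the old renderer finally be thrown out.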

 

Multitasking

There is no such thing as true multitasking (despite what they may claim in the advocacy newsgroups). To multitask properly, you need a processor per process, with all the relevant bits so processes are not kept waiting. Effectively, a separate computer for each task.

However, it is possible to provide the illusion of running several things at once. In the old days, things happened in the background under interrupt control. Keyboards were scanned, clocks were updated. As computers became more powerful, more stuff happened in the background. Hugo Fiennes wrote a soundtracker player that runs on interrupts, so works in the background. You set it going, it carries on independent of your code.

So people began to think of the ability to apply this to applications. After all, most of an application's time is spent waiting for user input. In fact, the application may easily do sweet sod all for almost 100% of the time - measured by an event counter in Emily's polling loop, I type ~1 character a second, the RiscPC polls a few hundred times a second. That was measured in a multitasking application, using polling speed as a yardstick. Imagine if we were to record loops in a single-tasking program.

So the idea was arrived at. We can load several programs into memory, provide them some standard facilities and messaging systems, and then let them run for a predefined duration. When the duration is up, we pass control to the next program. When that has used its time, we go to the next program, and so on.

As a brief aside, I wish to point out Schrödinger's cat. A rather cute little moggy, but an extremely important one. It is physically impossible to measure system polling speed in software, and pretty difficult to measure it in hardware. You see, the very act of performing your measurement will affect the results. And you cannot easily 'account' for the time taken to make your measurements because measuring yourself is subject to the same artefacts as when measuring other things. You can only say 'to hell with it', and have your program report your polling rate as being 379 polls/sec, knowing that your measuring code may be eating around 20% of the available time, and use the figures in a relative form rather than trying to state "My computer achieves 379 polls every second". While there is no untruth in that, your computer might do 450 if you weren't so busy watching! You simply can't be JAFO.

...and you need to go to school/college and get bored rigid to find out what relevance any of this has to your cat. Mine is sitting on my monitor, asleep, blissfully unaware of all these heavy scientific concepts. She's probably got the right idea...

 

Co-operative multitasking

One such way of multitasking is relatively clean and simple. The application, once control has passed to it, has full control for as long as it needs. When it has finished, control is explicitly passed back to the operating system.

This is the multitasking scheme used in RISC OS.
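The hand-back can be modelled with Python generators, where `yield` plays the part of the call that returns control to the system (on RISC OS that call is Wimp_Poll). This is a toy scheduler written for this page, not how the Wimp is implemented; `task` and `run` are made-up names.

```python
def task(name, steps, log):
    """A well-behaved application: do a little work, then give way."""
    for step in range(steps):
        log.append((name, step))
        yield                       # explicitly pass control back


def run(tasks):
    """Keep handing control to each task in turn until all have finished."""
    queue = list(tasks)
    while queue:
        current = queue.pop(0)
        try:
            next(current)           # runs until the task chooses to yield
            queue.append(current)
        except StopIteration:
            pass                    # task has ended, drop it


log = []
run([task("A", 2, log), task("B", 1, log)])
print(log)
```

Note that `run` has no way of taking control back: a task that loops without ever reaching its `yield` would hold the whole system up.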

 

Pre-emptive multitasking

Seen as the cure to all the world's ills by many advocates who have seen Linux (not Windows!), this works differently. Your application is given a timeslice. You can process whatever you want in your timeslice. When your timeslice is up, control is wrested away and given to another process. You have no say in the matter, peon.
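A simple round-robin simulation shows the difference. This Python sketch only models the bookkeeping (real pre-emption needs a timer interrupt to wrest control away); the work units, the `round_robin` name and the fixed timeslice are all assumptions made for the illustration.

```python
def round_robin(work, timeslice):
    """work maps a task name to the units of work it still needs.

    Returns the order in which tasks ran, as (name, units used) pairs.
    """
    remaining = dict(work)
    queue = list(remaining)
    order = []
    while queue:
        name = queue.pop(0)
        used = min(timeslice, remaining[name])
        remaining[name] -= used
        order.append((name, used))
        if remaining[name] > 0:
            queue.append(name)      # timeslice expired: back of the queue
    return order


print(round_robin({"A": 5, "B": 2}, timeslice=2))
```

Task A wants five units but is stopped after two, so B gets to run (and finish) before A is allowed to continue.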

 

 

I don't wish to get into an advocacy war here. My personal preference is co-operative, however I don't feel that either is the answer. Rather, a hybrid using both technologies could make for a clean system. The major drawback of CMT is that if an application dies and goes into a never-ending loop, control won't come back. The application needs to be forcibly killed off.

Niall Douglas wrote a pre-emption system for RISC OS applications. Surprisingly, you didn't really notice anything much until an application entered some heavy processing (say, ChangeFSI) at which point life carried right on as normal while the task which would have stalled the machine for a while chugged away in the background.


Copyright © 2004 Richard Murray


32 bit operation

 

A lot of this information is taken from the ARM assembler manual. I didn't have a 32 bit processor at the time, so trusted the documentation...

As it happens, the documentation erroneously stated that UMULL and UMLAL could only be performed in 32bit mode. Well, that is incorrect, if your processor can do it (ie: StrongARM), it will work in 32bit OR 26bit...

 

The ARM2 and ARM3 have a 32 bit data bus and a 26 bit address bus. On later versions of the ARM, both the data bus and the address bus are a full 32 bits wide.

This explains how a "32 bit processor" can be referred to as 26 bit. The data width and instruction/word size is 32 bit, and always has been, but the address bus is only 24 bit.

Oh, whoops, I said 26 bit, didn't I? :-) Well, as PC is always word aligned, the lower two bits will always be zero in an address, so on the ARM2/ARM3 processor these bits hold the processor mode setting. The width of PC is, effectively, 26 bit even though only 24 bits are actually used.

This was not a problem on the older machines. 4Mb memory was the norm. Some people upgraded to 8Mb, and 16Mb was the theoretical limit.

However a RiscPC with a 26 bit program counter would not have been possible, as 26 bits only allows you to address %11111111111111111111111100 (or 67108860 bytes, or 64Mb). The RiscPC allows for 258Mb of memory to be installed.

This, incidentally, explains the 28Mb size limit for application tasks; the system is expected to be compatible with the older RISC OS API.
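The 64Mb figure falls straight out of the bit count, as this small Python check shows. The helper names are made up; the only assumption is that addresses are word aligned, so the bottom two bits are always zero.

```python
MB = 1024 * 1024


def highest_word_address(pc_bits):
    """Largest word-aligned address that fits in pc_bits bits."""
    return (1 << pc_bits) - 4


def address_space_mb(pc_bits):
    """Size of the address space, in megabytes."""
    return (1 << pc_bits) // MB


top = highest_word_address(26)
print(bin(top), top, address_space_mb(26), address_space_mb(32))
```

The highest address is 24 ones followed by two zeros (67108860), giving a 64Mb space with a 26 bit program counter against 4096Mb with a full 32 bits.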

The majority of the assembler site has been written regarding 26 bit mode of operation, which is compatible with the versions of RISC OS currently available (ie, RISC OS 2 to RISC OS 4); though some parts cover 32 bit modes (one example briefly runs in SVC32!), and I have noted parts of the examples that are 32 bit unfriendly.

Those with a RiscPC, Mico, RiscStation, A7000 etc have the ability to run a fully 32 bit operating system; indeed ARMLinux is such an operating system. RISC OS is not, because RISC OS needs, for the moment, to remain compatible with existing versions. It is the old dichotomy. It is wonderful to have a nice shiny new fully 32 bit version of RISC OS, but not so good when you realise a lot of your must-have software won't so much as load!

RISC OS isn't totally 26 bit. Some of the handlers need to work in 32 bit mode; however it is limited by money (ie, who's going to pay for RISC OS to be fully converted; and who's going to pay for new development tools to rebuild their code (PD software is strong on RISC OS)) and also by necessity (ie, lots of people use Impression but CC is no longer with us; it is quite likely Impression won't work on an updated RISC OS, so people will not see a necessity to upgrade if their desired software won't work).

 

Why is this even an issue?

Newer ARM processors will not support 26 bit operation. Several hybrids were made (ARM6, ARM7, StrongARM), but the time has come to draw the line. You can either add the complexity of a 26/32 bit system, or you can go 32 bit only and have a simpler, smaller processor.

Either we go with the flow, or get left behind... So really, this is an issue, and we don't have a choice.

 

32 bit architecture

The ARM architecture changed significantly with the introduction of the ARM6 series. Below, we shall describe the differences in behaviour between 26 bit and 32 bit operation.

In the ARM 6, the program counter was extended to a full 32 bits. As a result:

The PSR had to be separated from the PC into its own register, the CPSR (Current Program Status Register).  

The PSR can no longer be saved with the PC when changing processor modes; instead, each privileged mode now has an extra register - the SPSR (Saved Program Status Register) - to hold the previous mode's PSR.

Instructions have been added to use these new status registers.

A further change was the addition of extra privileged processor modes, allowed by the PSR now having a full 32 bits to use. These modes are used to handle Undefined instruction and Abort exceptions. Consequently:

Undefined instructions, aborts, and supervisor code no longer have to share the same mode. This has removed restrictions on Supervisor mode programs which existed on earlier ARMs.

The availability of these features in the ARM6 series (and other later compatible chips) is set by one of several on-chip control registers. One of three processor configurations can be selected:

o 26 bit program and data space. This configuration forces ARM to operate with a 26 bit address space. In this configuration only the four 26 bit modes are available (refer to the Processor modes description); it is impossible to select a 32 bit mode. This configuration is set at reset on all current ARM6 and 7 series processors.


o 26 bit program space and 32 bit data space. This is the same as the 26 bit program and data space configuration, except that address exceptions are disabled to allow data transfer operations to access the full 32 bit address space.  

o 32 bit program and data space. This configuration extends the address space to 32 bits, and introduces major changes to the programmer's model. In this configuration you can select any of the 26 bit and the 32 bit processor modes (see Processor modes below).

 

When configured for a 32 bit program and data space, the ARM6 and ARM7 series support ten overlapping processor modes of operation:

User mode: the normal program execution state
or User26 mode: a 26 bit version

FIQ mode: designed to support a data transfer or channel processor
or FIQ26 mode: a 26 bit version

IRQ mode: used for general purpose interrupt handling
or IRQ26 mode: a 26 bit version

SVC mode: a protected mode for the operating system
or SVC26 mode: a 26 bit version

Abort mode (abbreviated to ABT mode): entered after a data or instruction prefetch abort  

Undefined mode (abbreviated to UND mode): entered when an undefined instruction is executed.

When in a 26 bit processor mode, the programmer's model reverts to that of earlier 26 bit ARM processors. The behaviour is the same as that of the ARM2aS macrocell with the following alterations:

Address exceptions are only generated by ARM when it is configured for 26 bit program and data space. In other configurations the OS may still simulate the behaviour of address exception, using external logic such as a memory management unit to generate an abort if the 64Mbyte range is exceeded, and converting that abort into an `address exception trap' for the application.

The new instructions to transfer data between general registers and the program status registers remain operative. The new instructions can be used by the operating system to return to a 32 bit mode after calling a binary containing code written for a 26 bit ARM.  

When in a 32 bit program and data space configuration, all exceptions (including Undefined Instruction and Software Interrupt) return the processor to a 32 bit mode, so the operating system must be modified to handle them.

If the processor attempts to write to a location between &0 and &1F inclusive (i.e. the exception vectors), hardware prevents the write operation and generates a data abort. This allows the operating system to intercept all changes to the exception vectors and redirect the vector to some veneer code. The veneer code should place the processor in a 26 bit mode before calling the 26 bit exception handler.

In all other respects, when operating in a 26 bit mode the ARM behaves like a 26 bit ARM. The relevant bits of the CPSR appear to be incorporated back into R15 to form the PC/PSR with the I and F bits in bits 27 and 26. The instruction set behaves like that of the ARM2aS macrocell, with the addition of the MRS and MSR instructions.
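To picture the combined PC/PSR, here is a Python sketch that pulls a 26 bit mode R15 value apart. The I and F positions and the use of the bottom two bits for the mode come from the text; the N, Z, C and V flags are assumed to sit in bits 31 to 28, as they do in the CPSR. The function name `split_r15` and the sample value are invented for the example.

```python
def split_r15(r15):
    """Separate the fields packed into R15 in a 26 bit mode."""
    return {
        "flags": (r15 >> 28) & 0xF,             # N Z C V (assumed bits 31-28)
        "irq_disabled": bool(r15 & (1 << 27)),  # I bit
        "fiq_disabled": bool(r15 & (1 << 26)),  # F bit
        "pc": r15 & 0x03FFFFFC,                 # bits 25-2, word aligned
        "mode": r15 & 0x3,                      # 0 User, 1 FIQ, 2 IRQ, 3 SVC
    }


fields = split_r15(0x0C008003)
print(hex(fields["pc"]), fields["mode"], fields["irq_disabled"], fields["fiq_disabled"])
```

The sample value decodes as a program counter of &8000 in SVC mode with both interrupts disabled, all from a single register.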

 

The registers available on the ARM 6 (and later) in 32 bit mode are:

User26   SVC26    IRQ26    FIQ26         User     SVC      IRQ      ABT      UND      FIQ

R0 ----- R0 ----- R0 ----- R0    -- --   R0 ----- R0 ----- R0 ----- R0 ----- R0 ----- R0
R1 ----- R1 ----- R1 ----- R1    -- --   R1 ----- R1 ----- R1 ----- R1 ----- R1 ----- R1
R2 ----- R2 ----- R2 ----- R2    -- --   R2 ----- R2 ----- R2 ----- R2 ----- R2 ----- R2
R3 ----- R3 ----- R3 ----- R3    -- --   R3 ----- R3 ----- R3 ----- R3 ----- R3 ----- R3
R4 ----- R4 ----- R4 ----- R4    -- --   R4 ----- R4 ----- R4 ----- R4 ----- R4 ----- R4
R5 ----- R5 ----- R5 ----- R5    -- --   R5 ----- R5 ----- R5 ----- R5 ----- R5 ----- R5
R6 ----- R6 ----- R6 ----- R6    -- --   R6 ----- R6 ----- R6 ----- R6 ----- R6 ----- R6
R7 ----- R7 ----- R7 ----- R7    -- --   R7 ----- R7 ----- R7 ----- R7 ----- R7 ----- R7
R8 ----- R8 ----- R8       R8_fiq        R8 ----- R8 ----- R8 ----- R8 ----- R8       R8_fiq
R9 ----- R9 ----- R9       R9_fiq        R9 ----- R9 ----- R9 ----- R9 ----- R9       R9_fiq
R10 ---- R10 ---- R10      R10_fiq       R10 ---- R10 ---- R10 ---- R10 ---- R10      R10_fiq
R11 ---- R11 ---- R11      R11_fiq       R11 ---- R11 ---- R11 ---- R11 ---- R11      R11_fiq
R12 ---- R12 ---- R12      R12_fiq       R12 ---- R12 ---- R12 ---- R12 ---- R12      R12_fiq
R13      R13_svc  R13_irq  R13_fiq       R13      R13_svc  R13_irq  R13_abt  R13_und  R13_fiq
R14      R14_svc  R14_irq  R14_fiq       R14      R14_svc  R14_irq  R14_abt  R14_und  R14_fiq
--------- R15 (PC / PSR) ---------       --------------------- R15 (PC) ---------------------
                                         ----------------------- CPSR -----------------------
                                                  SPSR_svc SPSR_irq SPSR_abt SPSR_und SPSR_fiq

In short, the 32 bit differences are:

The PC is a full 32 bits wide, and used singularly as a Program Counter.  

The PSR is contained within its own register, the CPSR.  

Each privileged mode has a private SPSR register in which to save the CPSR.  

There are two new privileged modes, each of which has private copies of R13 and R14.

 

The CPSR and SPSR registers

The allocation of the bits within the CPSR (and the SPSR registers to which it is saved) is:

31 30 29 28  ---  7  6  -  4  3  2  1  0
N  Z  C  V        I  F     M4 M3 M2 M1 M0

                           0  0  0  0  0   User26 mode
                           0  0  0  0  1   FIQ26 mode
                           0  0  0  1  0   IRQ26 mode
                           0  0  0  1  1   SVC26 mode
                           1  0  0  0  0   User mode
                           1  0  0  0  1   FIQ mode
                           1  0  0  1  0   IRQ mode
                           1  0  0  1  1   SVC mode
                           1  0  1  1  1   ABT mode
                           1  1  0  1  1   UND mode

Please refer to the (26 bit) PSR for information on the N, Z, C, V flags and the I and F interrupt flags.
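The layout above translates directly into a decoder. This Python sketch uses only the bit positions and mode numbers given in the table; the function name `decode_cpsr` and the sample values are made up for the illustration.

```python
MODES = {
    0b00000: "User26", 0b00001: "FIQ26", 0b00010: "IRQ26", 0b00011: "SVC26",
    0b10000: "User",   0b10001: "FIQ",   0b10010: "IRQ",   0b10011: "SVC",
    0b10111: "ABT",    0b11011: "UND",
}


def decode_cpsr(cpsr):
    """Break a CPSR (or SPSR) value into flags, interrupt masks and mode."""
    flag_bits = ((31, "N"), (30, "Z"), (29, "C"), (28, "V"))
    flags = "".join(name for bit, name in flag_bits if (cpsr >> bit) & 1)
    return {
        "flags": flags,
        "irq_disabled": bool((cpsr >> 7) & 1),   # I bit
        "fiq_disabled": bool((cpsr >> 6) & 1),   # F bit
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }


print(decode_cpsr(0x600000D3))
```

The sample value has Z and C set, both interrupts disabled, and mode bits %10011, so it decodes as SVC mode.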

 

So what does it mean in practice?

Most ARM code will work correctly. The only things that will not work are any operations which fiddle with R15 to set the processor status. Unfortunately, this isn't as easy to fix as it seems.

I examined a 9K program (a MODE 7 teletext frame viewer, written in C) for potential problems, basically looking for:

A MOVS with R15 as the destination.
Any LDMFD suffixed with the '^' character and loading R15.

About 64 instructions fell into one of these categories.
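A search of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the scanner I actually used. The function names are made up, and the bit masks come from the standard ARM instruction encoding (S bit is bit 20 for data processing, the '^' bit is bit 22 for LDM). It treats every word as code, so data that happens to look like an instruction will be reported too.

```python
import struct

def is_psr_unsafe(word):
    """Heuristic: does this word look like an instruction that writes the PSR via R15?"""
    # Data processing with the S bit set and Rd = R15 (for example MOVS pc, lr).
    # This also matches compares with Rd = R15 (the TEQP family).
    if (word & 0x0C10F000) == 0x0010F000:
        # Register form with bits 7 and 4 both set is the multiply space, not data processing.
        if (word & 0x02000090) != 0x00000090:
            return True
    # LDM with the '^' bit set and R15 in the register list.
    if (word & 0x0E508000) == 0x08508000:
        return True
    return False

def scan(code, base=0):
    """Return (address, word) for every suspicious little-endian word in a bytes object."""
    hits = []
    for offset in range(0, len(code) - 3, 4):
        (word,) = struct.unpack_from("<I", code, offset)
        if is_psr_unsafe(word):
            hits.append((base + offset, word))
    return hits

sample = struct.pack("<4I",
                     0xE1A0F00E,   # MOV   pc, lr
                     0xE1B0F00E,   # MOVS  pc, lr
                     0xE8BD8010,   # LDMFD sp!, {r4, pc}
                     0xE8FD8010)   # LDMFD sp!, {r4, pc}^
for address, word in scan(sample, base=0x8000):
    print(f"{address:08X}  {word:08X}")
```

Only the second and fourth words of the sample are reported.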

There are likely to be few ways to make the conversion process automatic. Basically...


How will the system know what is data, and what is code? Actually, a clever rules-based program should be able to make a fairly good guess, but is a "fairly good guess" good enough?

There is NO simple instruction replacement. An automatic system probably could patch in the required instructions and jiggle the code around, but this could cause unexpected side effects, like an ADR directive no longer being in range.

It is incredibly hacky. Surely, much better to recompile, or to repair the source code.

 

It is NOT easy. Such a small change, but with such far-reaching consequences.

 

In comp.sys.acorn.programmer, Stewart Brodie answered my query with a hint that may be useful to people intending to work with 32 bit code:

> How is it possible, if 32 bit code uses MSR/MRS to transfer status and
> register, and older ARMs don't have those instructions?
> Are we into "black magic" code for this?

You take advantage of the fact that the encodings for MSR and MRS act as NOPs
on ARM2 and ARM3 ;-) With some careful arrangement, you can write fairly
tight code.

To refer back to earlier postings, an example of when MOVS pc, lr in a
32-bit mode is useful (entered in SVC or IRQ mode, IRQs disabled):

        ADR     r14, CallBackRegs
        TEQ     PC,PC
        LDREQ   r0, [r14, #16*4]        ; The CPSR
        MSREQ   SPSR_cxsf, r0           ; put into SPSR_svc/SPSR_irq ready for MOVS
        LDMIA   r14, {r0-r14}^          ; Restore user registers
        NOP
        LDR     r14, [r14, #15*4]       ; The pc
        MOVS    pc, r14                 ; Back we go (32-bit safe - SPSR set up)

(CallBackRegs contains user mode registers: R0-R15, plus the CPSR if in a
32-bit mode)
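The TEQ PC,PC is what lets the one routine serve both worlds. As I understand it, in a 26 bit mode R15 read as the first operand gives only the PC bits, while R15 as the second operand carries the PSR bits too, so the two differ whenever any PSR bit is set. In a 32 bit mode both reads are the same, the result is EQ, and the LDREQ/MSREQ pair executes. Here is a small Python model of that comparison; the helper name and the sample R15 values are assumptions for illustration only.

```python
# In a 26 bit R15: N Z C V I F sit in bits 31-26 and the mode in bits 1-0.
PSR_MASK_26 = 0xFC000003

def teq_pc_pc_is_eq(r15, is_26bit):
    """Model the Z flag after TEQ PC,PC (hypothetical helper, ignores pipeline offset)."""
    # First operand: PC bits only in a 26 bit mode, the whole register otherwise.
    first = (r15 & ~PSR_MASK_26 & 0xFFFFFFFF) if is_26bit else r15
    second = r15                      # second operand always sees all 32 bits
    return (first ^ second) == 0      # TEQ sets Z when the exclusive-OR is zero

print(teq_pc_pc_is_eq(0x08008003, True))    # SVC26, IRQs disabled: NE
print(teq_pc_pc_is_eq(0x00008008, False))   # 32 bit mode: EQ
```

Note that a 26 bit user mode R15 with every flag clear would also compare equal in this model, which is presumably why the posting stresses that the code is entered in SVC or IRQ mode.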

 

 

Download a 32 bit code scanner (12K)


 

 

Where is the example?

In the logical place, in the document describing the processor status register...

 

What about old stuff for which we don't have sources?

There are two options...

The first option is a one-time conversion. We can use an intelligent disassembler (such as D.Ruck's !ARMalyser) to provide us with a source of the software, with the 32bit unsafe parts identified. I used this method to cobble together a 32bit version of one of my modules.

For fairly short things, this will be okay. For large projects... I shudder to think! One thing to be especially aware of is that some older software uses tricks like popping flags into 'unused' bits of addresses. A good example here is software that uses bits 0-27 as an address and bits 28-31 as flags...

1 << 28 = 268435456

What this means, in essence, is that the software will work fine on all older machines - including the majority of RiscPCs for which 256Mb was the limit of installable memory.

If, though, we run this on a 512Mb Iyonix (which is no longer out of the realms of possibility), as soon as it is loaded to an address over 256Mb ... bit 28 will be set!

The code will need to be examined to ensure such things don't occur, and if they do, it'll need to be worked around.

As far as I'm aware, while APCS-R requires flags to be saved, I've yet to see my C compiler generate code that depends upon the saving of flags across function calls. The typical example is:

Note that the N, Z, C and V flags from lr at the instant of entry must be reinstated; it is not sufficient merely to preserve the PSR across the call. Consider a function ProcA which tail continues to ProcB as follows:

        CMPS    a1, #0
        MOVLT   a2, #255
        MOVGE   a2, #0
        B       ProcB

If ProcB merely preserves the flags it sees on entry, rather than restoring those from lr, the wrong flags may be set when ProcB returns direct to ProcA's caller.

While it has not been my experience that the C compiler generates such code, humans can. And much worse. This, too, must be taken into account. And all those ORRing values into R14 to directly twiddle the processor flags (on return)...
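The address-plus-flags trick mentioned earlier is easy to demonstrate. The following Python sketch is only an illustration; `pack`, `unpack` and the sample addresses are invented for the example. It shows the packing behaving itself below 256Mb and going wrong as soon as the address reaches bit 28.

```python
ADDRESS_MASK = (1 << 28) - 1     # bits 0-27 hold the address
FLAG_SHIFT = 28                  # bits 28-31 hold the flags

def pack(address, flags):
    """The old trick: OR the flags into the top four bits of the address."""
    return ((flags & 0xF) << FLAG_SHIFT) | address

def unpack(word):
    """Split a packed word back into (address, flags)."""
    return word & ADDRESS_MASK, (word >> FLAG_SHIFT) & 0xF

print(1 << 28)                              # 268435456, i.e. 256Mb

low = pack(0x0FFF0000, 0b0010)              # address below 256Mb
print([hex(v) for v in unpack(low)])        # address and flags both survive

high = pack(0x10008000, 0b0000)             # address just over 256Mb, no flags
print([hex(v) for v in unpack(high)])       # address truncated, a flag appears from nowhere
```

In the second case the program gets back address 0x8000 and flag bit 0 set, neither of which is what was put in.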


The other method is to make a new computer. All we need to do is load up a few old modules, poke our application at 'troublesome' points, and force everything to be in an area of memory that we may consider 'safe'. Then we let our program loose with the same sort of critical care that you'd attend to a hungry cat in a room full of budgies... This, more or less, is what Aemulor does.

But, at a cost.

From the "Inside Aemulor" article on the Foundation RISC User (issue 11; January 2003) CD-ROM, we encounter a very important point:

From RISC OS's perspective, the Aemulor RMA is a normal dynamic area, but Aemulor remaps the memory at an address below 64Mb so that it becomes addressable within the 26-bit environment. Because this emulated RMA is visible to all applications, native 32-bit applications are also restricted to a maximum size of 28Mb each (as per RISC OS 4) whilst Aemulor is running. It is hoped that this limitation can be removed with a later version.

Or, as they say: There's no such thing as a free lunch.

Having said that, the use of Aemulor is essential for all those must-have programs that either cannot sensibly be modernised, or are unlikely to be modernised.

I have heard that somebody is 32bitting Impression Publisher. Well, you know, I heard once that somebody was porting Mozilla to RISC OS. Who knows, maybe I'm wrong... :-)

 

 

What API changes have there been?

The "Technical information on 26/32-bit RISC OS binary interfaces" (v0.2) states:

Many existing APIs do not actually require flag preservation, such as service call entries. In this case, simply changing MOVS PC... to MOV PC... and LDM {}^ to LDM {} is sufficient to achieve 32-bit compatibility.

This is possibly worse than useless as it doesn't specify exactly which APIs need it and which don't. Is it safe to assume that everything not otherwise described is safe?

The best thing to do is get hold of that document and browse through it. Please do not simply 'assume' that things will work if you simply don't save flags.

Generally, this is the case, but unless you have a RISC OS 3.10 machine to test it on...
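For the cases where flag preservation really isn't needed, the change the document describes amounts to clearing one bit in the instruction word. Here is a hedged Python sketch of that, with a made-up function name and masks taken from the standard ARM encoding. It deliberately leaves the compare instructions alone, because a TEQP with its S bit cleared is a different instruction altogether, and it knows nothing about whether a given word is code or data.

```python
def drop_flag_restore(word):
    """Turn MOVS PC,... into MOV PC,... and LDM {..,pc}^ into LDM {..,pc} (hypothetical helper)."""
    is_dp_to_pc = ((word & 0x0C10F000) == 0x0010F000       # data processing, S set, Rd = R15
                   and (word & 0x02000090) != 0x00000090)  # but not the multiply space
    opcode = (word >> 21) & 0xF
    if is_dp_to_pc and not 8 <= opcode <= 11:   # skip TST/TEQ/CMP/CMN (the 'P' forms)
        return word & ~(1 << 20) & 0xFFFFFFFF   # clear the S bit
    if (word & 0x0E508000) == 0x08508000:       # LDM with '^' and R15 in the list
        return word & ~(1 << 22) & 0xFFFFFFFF   # clear the '^' bit
    return word                                 # anything else is left untouched

print(f"{drop_flag_restore(0xE1B0F00E):08X}")   # MOVS pc, lr becomes MOV pc, lr
print(f"{drop_flag_restore(0xE8FD8010):08X}")   # LDMFD sp!, {r4, pc}^ loses the '^'
```

This is a sketch of the substitution only; whether the substitution is safe for a particular entry point is exactly the question the document leaves open.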

 

 



Copyright © 2004 Richard Murray
