
Introduction

● Why and what for 2253

● Levels of abstraction

● Instruction Set Architectures

● Major parts of any computer

● von Neumann architecture

● Flow of control

CS2253

● Goal: write a simple C program and understand how the computer actually executes it.

● This year, we study the ARM7TDMI processor.

● Last year, we used the fictional LC-3.

● The Church-Turing thesis essentially states that all systems with a certain minimal computational capability are able to compute the same things as each other. So LC3 vs ARM vs Intel x86 does not matter, at least theoretically.

● And in actuality, LC3 and ARM etc. are fundamentally similar.

● Easy to pick up a second machine....

Levels of Abstraction

● From atoms, we build transistors. [Lowest level]

● From transistors, we build gates (ECE courses).

● From gates, we build microarchitectures (CS3813) that implement machine instructions.

● From machine instructions, we can make simple programs directly (first part of this course).

● Or a compiler can string together machine instructions corresponding to our C code (second part).

● From simple pieces of C code, we can build OSes, databases, other complex software systems. [Highest level]

Textbook Fig 1.9

Why Should I Care About the Lower Levels?

● If you want to work in the computer hardware field, it's obvious.

● If you want to work in software:

– it's sad if you aren't intellectually a little bit curious about how things really happen.

– when things fail, you tend to need to “see behind the veil” to understand what is going wrong. Debugging often requires a lower-level view.

– it helps you understand why some operations would be fast, while others would be slow. (Performance debugging.)

Microarchitecture example (block diagram, book Figure 1.4)

Instruction Set Architecture

● ISA is an important concept. In Java terms, it is like an Interface to the microarchitecture (hardware).

● It specifies all the things that you would need to know, to write a machine-instruction program:

– what are the basic instructions?

– how does the machine find the data for instructions?

– what are the basic data types supported (bit lengths)?

– how is memory organized?

– how and where are the instructions stored?

Multiple Implementations of an ISA

● In Java, several classes can implement the same Interface.

● So, several microarchitectures can have the same ISA. They can run the same machine instructions as each other.

● IBM mainframes in the 1960s: fast implementations if you're rich, slow ones if not.

● Today: AMD and Intel processors can run the same code.

● But computer designers like to extend ISAs over time. Backward compatibility is the goal. No bad idea ever dropped. Ugly messes like Intel x64 architecture.

Textbook Fig 1.5

Major Parts of Any Computer

● An input and an output facility (from/to people or devices)

● Memory/Memories to store data and programs

● A “datapath” with the logic needed to add, multiply, store ... binary values, etc.

● A “controller” that goes through the program and makes the datapath do the required operations, one at a time.

● Controller + Datapath = Processor (aka CPU)

● Controller is the “puppeteer” and the datapath is the “puppet”.

Textbook Figure 1.2: Block diagram of a SOC (system on chip)

John von Neumann's architecture

● John von Neumann was a brilliant mathematician (and statistician and nuclear weapons guy and father of game theory and inventor of MergeSort and ...) who wrote an influential report in 1946 with a computer design.

● A von Neumann architecture stores the program in (a different part of) the same memory that stores the data.

● A Harvard architecture uses separate memories.

● Modern computers appear to be von Neumann, but behind the scenes are a bit Harvard-ish.

● Except for some special-purpose machines.

Textbook Figure 1.8

von Neumann and the IAS, 1952

How a von Neumann Architecture Works

● The program is a list of instructions located in memory. Part of the control unit is the Program Counter (PC), which points to the exact place for the current instruction.

● while (true) {
      Fetch current instruction from memory
      Increment PC
      Inspect current instruction and do what it says
  }

● This is the Fetch-Execute cycle.

Be a Human CPU

● “Hokey Pokey” program

Memory Address   Contents
1000             right hand in
1001             right hand out
1002             right hand in
1003             shake it all about
1004             turn yourself around
1005             go back to address 1000

● PC starts at 1000

● Fetch and do “right hand in”; PC ← 1001

● Fetch and do “right hand out”; PC ← 1002

● ….

Control Flow

● Normally, when you finish one instruction you advance to the (sequentially) next one.

● This is called straight line flow of control. PC just increments.

● But there are instructions like “go back to the instruction at address 1000” that disrupt this.

● Good thing, since that's how we can do IF and WHILE in a high-level language.

● Control flow is often disrupted conditionally, based on status flags that record what happened earlier (eg, did last addition give -ve result?)

More Realistic Instructions

● Instead of “right hand in”, a CPU instruction will be something simple like a request to take two integers stored locally in the CPU, add them, and store the result in the CPU

● Or “go to memory location 2000 and load the 8-bit integer there into the CPU”

● Or “go back to memory address 1000”, as in the Hokey Pokey program. A jump or branch instr.

● A compiler or a programmer needs a lot of simple instructions to do something more complicated.

Data

● (Some repeating CS1083, ECE course)

● bits and bit sequences

● integers (signed and unsigned)

● bit vectors

● strings and characters

● floating point numbers

● hexadecimal and octal notations

Bits and Bit Sequences

● Fundamentally, we have the binary digit, 0 or 1.

● More interesting forms of data can be encoded into a bit sequence.

● 00100 = “drop the secret package by the park entrance”
00111 = “Keel Meester Bond”

● A given bit sequence has no meaning unless you know how it has been encoded.

● Common things to encode: integers, doubles, chars. And machine instructions.

Encoding things in bit sequences (From textbook)

● Floats

● Machine Instructions

How Many Bit Patterns?

● With k bits, you can have 2^k different patterns

● 00..00, 00..01, 00..10, … , 11..10, 11..11

● Remember this! It explains much...

● E.g., if you represent numbers with 8 bits, you can represent only 256 different numbers.

Names for Groups of Bits

● nibble or nybble: 4 bits

● octet: 8 bits, always. Seems pedantic.

● byte: 8 bits except with some legacy systems. In this course, byte == octet.

● after that, it gets fuzzy (platform dependent). For 32-bit ARM,

– halfword: 16 bits

– word: 32 bits

Unsigned Binary (review)

● We can encode non-negative integers in unsigned binary. (base 2)

● 10110 = 1*2^4 + 0*2^3 + 1*2^2 + 1*2^1 + 1*2^0 represents the mathematical concept of “twenty-two”. In decimal, this same concept is written as 22 = 2*10^1 + 2*10^0.

● Converting binary to decimal is just a matter of adding up powers of 2, and writing the result in decimal.

● Going from decimal to binary is trickier.

Division Method (decimal → binary)

● Repeatedly divide by 2. Record remainders as you do this.

● Stop when you hit zero.

● Write down the remainders (left to right), starting with the most recent remainder.
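The division method can be sketched in Java roughly as follows. This is a minimal illustration, not part of the original slides; the class and method names are made up for the example, and it assumes a non-negative input.

```java
class DivisionMethod {
    // Converts a non-negative int to a binary string by repeated division by 2.
    static String toBinary(int n) {
        if (n == 0) return "0";
        StringBuilder bits = new StringBuilder();
        while (n > 0) {
            bits.append(n % 2);  // record the remainder
            n /= 2;              // divide by 2, discarding the remainder
        }
        // The most recent remainder is the leftmost bit, so reverse the order.
        return bits.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(22)); // prints 10110
    }
}
```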

Subtract-powers method

● Find the largest power of 2, say 2^p, that is not larger than N (your number). The binary number has a 1 in the 2^p's position.

● Then similarly encode N - 2^p.

● Eg, 22 has a 1 in the 16's position
22-16 = 6, which has a 1 in the 4's position
6-4 = 2, which has a 1 in the 2's position
2-2 = 0, so we can stop....

Adding in Unsigned Binary

● Just like grade school, except your addition table is really easy:

● No carry in:
0+0 = 0 (no carry out)
0+1 = 1+0 = 1 (no carry out)
1+1 = 0 (with carry out)

● Have carry in:
0+0 = 1 (no carry out)
0+1 = 1+0 = 0 (with carry out)
1+1 = 1 (with carry out)

Fixed-Width Binary Integers

● Inside the computer, we work with fixed-width values.

● Eg, an instruction might add together two 16-bit unsigned binary values and compute a 16-bit result.

● Hope your result doesn't exceed 65535 = 2^16 - 1. Otherwise, you have overflow. Can be detected by a carry from the leftmost stage.

● If a value doesn't need all 16 bits, a process called zero-extension just prepends the required number of zeros.

● 10111 becomes 0000000000010111.

● Mathematically, these bit strings both represent the same number.

Signed Numbers

● A signed number can be positive, negative or zero.

● An unsigned number can be positive or zero.

● Note: “signed number” does NOT necessarily mean “negative number”.

In Java, ints are signed numbers. Can they be positive?? Can they be negative??

Some Ways to Encode Signed Numbers

● All assume fixed width; examples below for 4 bits

● Sign/magnitude: first bit = 1 iff number -ve
Remaining bits are the magnitude, unsigned binary
Ex: 1010 means -2

● Biased: Store X+bias in unsigned binary
Ex: 0110 means -2 if the bias is 8. (8+(-2) = 6)

● Two's complement: Sign bit's weight is the negative of what it'd be for an unsigned number
Ex: 1110 means -2: -8+4+2 = -2

● You can generally assume 2's complement...

Why 2's Complement?

● There is only one representation of 0. (Sign/magnitude has both -0 and +0.)

● To add 2's complement numbers, you use exactly the same steps as unsigned binary.

● There is still a “sign bit” - easy to spot negative numbers

● You get one more number (but it's -ve)
Range of N bits 2's complement: -2^(N-1) to +2^(N-1) - 1

2's Complement Tricks

● +ve numbers are exactly as in unsigned binary

● Given a 2's complement number X (where X may be -ve, +ve or zero), compute -X using the twos complementation algorithm (“flip and increment”)

● Flip all bits (0s become 1s, 1s become zeros)

● Add 1, using the unsigned binary algorithm

● Ex: 00101 = +5 in 5 bit 2's complement
Flip: 11010. 11010 + 1 → 11011 is -5 in 2's complement

● And Flip(-5) = 00100. 00100 + 1 = 00101, back to +5
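As a rough sketch, flip-and-increment for a fixed width can be written with Java's `~` operator and a mask that throws away everything above the chosen width. The class name, method name and `width` parameter are illustrative assumptions, and the sketch assumes `width` is less than 32.

```java
class FlipAndIncrement {
    // Negates a two's complement bit pattern of the given width:
    // flip all bits, add 1, then keep only the low `width` bits.
    static int negate(int bits, int width) {
        int mask = (1 << width) - 1;   // e.g. 0b11111 for width 5
        return (~bits + 1) & mask;
    }

    public static void main(String[] args) {
        int minusFive = negate(0b00101, 5);
        System.out.println(Integer.toBinaryString(minusFive)); // prints 11011
    }
}
```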

Converting a 2's complement number X to decimal

● Determine whether X is -ve (inspect sign bit)

● If so, use the flip-and-increment to compute -X. Pretend you have unsigned binary. Slap a negative sign in front.

● If number is +ve, just treat it like unsigned binary.

Sign extension

● Recall zero-extension is to slap extra leading zeros onto a number.

● Eg: 5 bit 2's compl. to 7 bit: 10101 → 0010101
Oops: -11 turns into +21. Zero extension didn't preserve numeric value.

● The sign-extension operation is to slap extra copies of the leading bit onto a number

● +ve numbers are just zero extended

● But for -11: 10101 → 1110101 (stays -11)
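One way to see the difference in Java is to push the narrow value to the top of a 32-bit int and shift it back down: `>>` copies the sign bit (sign extension) while `>>>` inserts zeros (zero extension). This is only a sketch; the class and method names are hypothetical.

```java
class Extension {
    // Treats the low `width` bits as two's complement and sign-extends to 32 bits.
    static int signExtend(int bits, int width) {
        int shift = 32 - width;
        return (bits << shift) >> shift;    // >> replicates the sign bit
    }

    // Treats the low `width` bits as unsigned and zero-extends to 32 bits.
    static int zeroExtend(int bits, int width) {
        int shift = 32 - width;
        return (bits << shift) >>> shift;   // >>> inserts zeros
    }

    public static void main(String[] args) {
        System.out.println(signExtend(0b10101, 5)); // prints -11
        System.out.println(zeroExtend(0b10101, 5)); // prints 21
    }
}
```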

Overflow for 2's complement

● Although addition algorithm is same for fixed-width unsigned, the conditions under which overflow occurs are different.

● If A and B are both same sign (eg, both +ve), then if A+B is the opposite sign, something bad happened (overflow)

● Overflow always causes this. And if this does not happen, there is no overflow.

● Eg, 1001 + 1001 → 0010 but -7 + -7 isn't +2.
Note that -14 cannot be encoded in 4-bit 2's complement.
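The sign rule above can be checked mechanically. The following sketch works on 4-bit patterns held in Java ints; the names are invented for the example. Overflow is reported when both inputs disagree with the sum in the sign bit, which can only happen when the inputs share a sign and the sum has the other one.

```java
class Overflow4 {
    static final int SIGN_BIT = 0b1000;

    // Adds two 4-bit patterns and keeps only the low 4 bits of the result.
    static int add4(int a, int b) {
        return (a + b) & 0b1111;
    }

    // True iff a and b have the same sign bit and the 4-bit sum has the opposite one.
    static boolean overflows(int a, int b) {
        int sum = add4(a, b);
        return ((a ^ sum) & (b ^ sum) & SIGN_BIT) != 0;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(add4(0b1001, 0b1001))); // prints 10
        System.out.println(overflows(0b1001, 0b1001));                    // prints true
    }
}
```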

Numbering Bits

● On paper, we often write bit positions above the actual data bits.

● 543210 ← normally in a smaller font than this
001010 bits 3 and 1 are ones.

● Sometimes we like to write bits left to right, and other times, right to left (which is more number-ish). We usually start numbering at zero.

● Inside computer, how we choose to draw on paper is irrelevant.

● Computer architecture defines the word size (usually 32 or 64). Usually viewed as the largest fixed-width size that the computer can handle, at maximum speed, with most math operations.

● So bit positions would be numbered 0 to 31 for a 32-bit architecture.

More Arithmetic in 2's complement

● Subtract: To calculate A-B, you can use A + (-B). Most CPUs have a subtract operation to do this for you.

● Multiplication: easiest in unsigned. (Most CPUs have instr.)

● D.I.Y. unsigned multiplication is like Grade 3: But your times table is the Boolean AND !! The product of 2 N-bit numbers may need 2N bits

● For 2's complement, the 2 inputs' signs determine the product's sign. eg, -ve * -ve → +ve

● And you can multiply the positive versions of the two numbers. Finally, correct for sign.

Bit Vectors (aka Bitvectors)

● Sometimes we like to view a sequence of bits as an array (of Booleans)

● Eg hasOfficeHours[x] for 1 <= x <= 31 says whether I hold office hours on the xth of this month.

● And isTuesday[x] says whether the xth is a Tuesday.

● So what if you want to find a Tuesday when I hold office hours?

Bitwise Operations for Bit Vectors

● Bitwise AND of B1 and B2:

Bit k of the result is 1 iff bit k of both B1 and B2 is 1.

● Java supports bitwise operations on longs, ints

int b1 = 6, b2 = 12; // 0b110, 0b1100
int result = b1 & b2; // = 4 or 0b100

● Bitwise NOT (~ in Java)

● Bitwise OR ( | in Java)

● Bitwise Exclusive Or ( ^ in Java) Also write “XOR” or “EOR”.

● Pretty well every ISA will support these operations directly.

Find First Set

● Some ISAs have a Find First Set instruction. (You've got a bitvector marking the Tuesdays when I have office hours – but now you want to find the first such day.)

● Integer.numberOfTrailingZeros() in Java achieves this.

● So use Integer.numberOfTrailingZeros(hasOfficeHours & isTuesday)
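A small sketch of this query follows. It assumes that bit k of each bitvector stands for day k+1 of the month (so bit 0 is the 1st), which is why 1 is added to the bit position; the helper `day` and the sample calendars are made up for illustration.

```java
class FirstTuesdayOfficeHours {
    // Hypothetical helper: a bitvector with only the bit for day d (1..31) set.
    static int day(int d) {
        return 1 << (d - 1);
    }

    // Returns the first day of the month set in both bitvectors, or -1 if none.
    static int firstCommonDay(int hasOfficeHours, int isTuesday) {
        int both = hasOfficeHours & isTuesday;   // Tuesdays with office hours
        if (both == 0) return -1;
        return Integer.numberOfTrailingZeros(both) + 1;
    }

    public static void main(String[] args) {
        int isTuesday = day(3) | day(10) | day(17) | day(24) | day(31);
        int hasOfficeHours = day(5) | day(10) | day(24);
        System.out.println(firstCommonDay(hasOfficeHours, isTuesday)); // prints 10
    }
}
```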

Bit Masking

● Think about painting and masking tape. You can put a piece of tape on an object, paint it, then peel off the tape. Area under the tape has been protected from painting.

● We can do the same when we want to “paint” a bit vector with zeros, except in certain positions.

● Eg, I decide to cancel my office hours except for the first 10 days of the month.

● Or we can protect positions against painting with ones.● Details next...

Bit Masking with AND

● AND(x,0) = 0 for both Boolean values of x● AND(x,1) = x for both Boolean values of x

● bitwise AND wants to paint bits 0, except where the mask protects (1 protects)

● hasOfficeHours & 0b1111111111 is a bitvector that keeps my office hours for the first 10 days (only). Later in month, all days are painted false.

● hasOfficeHours &= 0b1111111111 modifies hasOfficeHours. By analogy to the += operator you may already love.

● The value 0b1111111111 is being used as a mask.

● Quiz: what does hasOfficeHours & ~0b1111111111 do?

Bit Masking with OR

● OR(x,1) = 1 for both Boolean values x

● OR(x,0) = x for both Boolean values x

● bitwise OR wants to paint bits with 1s, except where the mask prevents it. A 0 prevents painting.

● hasOfficeHours | 0b1111111111 is a bitvector where I have made sure to hold office hours on each of the first 10 days (and left things alone for the rest of the month)

● hasOfficeHours |= 0b1111111111 makes it permanent.

● Quiz: what would hasOfficeHours |= 0b101 do?

Bit Masking with EOR (aka XOR)

● EOR(0,x) = x for both Boolean values x

● EOR(1,x) = NOT(x) for both Boolean values x

● bitwise EOR wants to flip bits in positions that are not protected with a 0 in the mask.

● hasOfficeHours ^= 0b111100 inverts my office hour situation for Jan 3-6.

● Bit masking with EOR is less common than OR and AND.
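The three kinds of masking can be put side by side in a short sketch. The method names and the sample bitvector are assumptions made for the example; only the operators `&`, `|` and `^` come from the slides.

```java
class MaskDemo {
    static final int FIRST_TEN_DAYS = 0b1111111111;

    // AND: bits survive only where the mask has a 1.
    static int keepOnly(int vector, int mask) { return vector & mask; }

    // OR: bits are forced to 1 wherever the mask has a 1.
    static int forceOn(int vector, int mask) { return vector | mask; }

    // EOR: bits are flipped wherever the mask has a 1.
    static int toggle(int vector, int mask) { return vector ^ mask; }

    public static void main(String[] args) {
        int hasOfficeHours = 0b110000000101; // hypothetical month
        System.out.println(Integer.toBinaryString(keepOnly(hasOfficeHours, FIRST_TEN_DAYS))); // prints 101
        System.out.println(Integer.toBinaryString(toggle(hasOfficeHours, 0b111100)));         // prints 110000111001
    }
}
```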

Example: Is a Number Odd?

● Fact: A number is odd iff its least significant (i.e., rightmost) bit is 1.

● Java: if ( (myNum & 0b1) == 0b1) System.out.println("Very odd number");

● Note: in order of increasing precedence: ||, &&, |, ^, &, ==. Since == binds tighter than &, the parentheses around myNum & 0b1 are needed here.

Even if you don't have to, maybe parenthesizing is a good idea. It's hard to remember weird operators' precedence levels.

Example: Multiple of 8?

● A binary value is a multiple of 8 (= 2^3) iff it ends with 000.

● Related to the fact that a decimal number is a multiple of 1000 (= 10^3) iff it ends with 000.

● Java: if ( (myVal & 0b111) == 0) System.out.println("multiple of 8");

● Fact: a more general rule is that the rightmost k bits of X are (X mod 2^k) (not certain about -ve numbers)

Bit Shifting

● A bunch of operations let the elements of a bitvector play “musical chairs”.

● logical left shift: every bit slides one position left. The old leftmost bit is lost. The new rightmost bit is 0. Java << operator repeats this to shift the value several positions.

● Eg, 0b11 << 4 is same as 0b110000.

● logical right shift: similar, Java >>> operator.

Dynamically Generating Masks

● Shifts are useful for dynamically generating masks for use with bitwise AND, OR, EOR.

● The Hamming weight of a bunch of bits is the number of bits that are 1. (After Richard Hamming, 1915-1998.) Many modern CPUs have a special “population count” instruction to compute Hamming weight. Except for speed, it is not needed:

int hWeight = 0; // Hamming weight of int value x
for (int bitPos = 0; bitPos < 32; ++bitPos) {
    int myMask = 0b1 << bitPos; // mask with a single 1 at position bitPos
    if ((x & myMask) != 0) ++hWeight; // parentheses needed: != binds tighter than &
}

Poor Man's Multiplication

● What happens if you take a decimal number and slap 3 zeros on the right? It's same as multiplying by 10^3.

● Similarly, X << 3 is same as X * 8. Even works if X is -ve. (Unless X*8 overflows or underflows)

● Poor man's X*10 is (X<<3)+(X<<1) since it equals (8*X + 2*X)

● Compilers routinely optimize multiplications by some constants like this, since multiplication is often a harder operation than shifting and adding. Called strength reduction.
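A quick sketch of the X*10 trick, assuming the product fits in an int; the class and method names are illustrative.

```java
class TimesTen {
    // Multiplies by 10 using two shifts and an add: 8*x + 2*x.
    static int timesTen(int x) {
        return (x << 3) + (x << 1);
    }

    public static void main(String[] args) {
        System.out.println(timesTen(7));  // prints 70
        System.out.println(timesTen(-3)); // prints -30
    }
}
```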

Poor Man's Division

● So then, does shifting bits to the right then correspond to division by powers of 2?

● For unsigned, yes. (Throwing away remainders).

● For 2's complement +ve numbers, yes.

● -ve numbers: no. Regular right shift inserts zeros at the leftmost position (the sign bit).

● 11111000 → 01111100 means -8 → +124

● A modified form, arithmetic right shift inserts copies of the sign bit at the leftmost. Java operator >> vs >>>

● 11111000 → 11111100 means -8 → -4 as desired
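The 8-bit example can be reproduced in Java, with the caveat that Java shifts operate on 32-bit ints, so the sketch below masks results back down to 8 bits. The method names are hypothetical.

```java
class RightShifts {
    // Logical right shift by 1 of an 8-bit pattern: a zero enters at the left.
    static int logical8(int bits) {
        return (bits & 0xFF) >>> 1;
    }

    // Arithmetic right shift by 1 of an 8-bit pattern: the sign bit is copied.
    static int arithmetic8(int bits) {
        return ((byte) bits >> 1) & 0xFF; // the cast to byte sign-extends bit 7
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(logical8(0b11111000)));    // prints 1111100
        System.out.println(Integer.toBinaryString(arithmetic8(0b11111000))); // prints 11111100
        System.out.println(-8 >> 1);                                         // prints -4
    }
}
```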

Division by Constant, via Multiplication and Right Shifting

● Low-end CPUs may not have an integer divide instruction but may have a multiply. Want to divide by a constant y that is not a power of 2.

● Mathematically, x/y = x * 1/y

● Multiply 1/y by p/p, for p being some power of 2. Say p = 2^k. So x/y = ( x * (p/y)) / p.

● p/y is a constant that you can compute. Division by p is a right shift.

● Considerations: effect of truncations and whether the multiplication overflows.

Example: Divide x by 17

● We get to choose p. Say p = 2^8.

● 256/17 = 15.06 is close to 15.

● Compute (x*15) >> 8 to approximate x/17.

● Test run for x=175. (175*15)/256 = 10 (throwing away remainder). Good.

● Test run for x=170. (170*15)/256 = 9 (because we throw away a fractional part of .96). Oops!

● Can be improved, but will never be perfect. Still, maybe an approximate answer is okay.

● Closer approximations by using bigger values of p.

● Using 32-bit integers, what is the biggest number we can divide by 17 this way, without getting overflow?
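The two test runs can be reproduced with a small sketch. It takes the exponent k as a parameter (p = 2^k) and truncates p/17 as the slide does; the names are invented for the example, and the caller is assumed to keep x small enough that x * (p/17) does not overflow an int.

```java
class DivideBy17 {
    // Approximates x / 17 as (x * (p / 17)) >> k, where p = 2^k.
    static int approxDiv17(int x, int k) {
        int p = 1 << k;
        int multiplier = p / 17;      // truncated constant, 15 when k = 8
        return (x * multiplier) >> k; // division by p is a right shift
    }

    public static void main(String[] args) {
        System.out.println(approxDiv17(175, 8)); // prints 10
        System.out.println(approxDiv17(170, 8)); // prints 9 (the exact answer is 10)
    }
}
```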

General-Purpose Mult. and Div.

● What if you want to multiply and divide by a variable?

● Today, most CPUs come with instructions to do this, except maybe the kind in your digital toaster.

● But you can always implement * by the Grade-3 shift-and-add algorithm. Or repeated addition.

● Division: see how many times you can subtract y from x (in a loop). Or (harder), implement the algorithm you learned in Grade 3.

More Bit Shuffling

● Most CPUs support more exotic ways for bits to play musical chairs. No operator like >> in Java or C for this, though:

● Left rotation by 1 position:

– every bit but leftmost moves left 1. The leftmost bit circles around and becomes the new rightmost bit.

● Left rotation by >1 positions is same result as doing multiple left rotations by 1.

● A right rotation by 1, or by >1 positions, also exists.

● Example: 1010011 right rotated by 1 is 1101001
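Although Java has no rotate operator, a rotation can be built from two shifts and an OR, and the standard library offers `Integer.rotateLeft` and `Integer.rotateRight` for full 32-bit values. The sketch below handles a narrower width so that it can reproduce the 7-bit example; the class and method names are assumptions, and `width` is assumed to be less than 32.

```java
class Rotate {
    // Rotates the low `width` bits of x right by one position.
    static int rotateRight1(int x, int width) {
        int mask = (1 << width) - 1;
        x &= mask;
        // The old rightmost bit moves to the leftmost position of the field.
        return ((x >>> 1) | (x << (width - 1))) & mask;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(rotateRight1(0b1010011, 7))); // prints 1101001
        System.out.println(Integer.rotateRight(1, 1) == Integer.MIN_VALUE);     // prints true
    }
}
```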

Hacker's Delight

● Henry Warren's book, Hacker's Delight, belongs on the bookshelves of serious low-level programmers.

● It is a collection of neat bit tricks collected over the years. It is the source of much of the implementation of Java's class Integer.

● Course website has a link to a web page with a similar collection of “bit hacks”.

● Despite the title, this book is not about breaking security...it's the older, honourable use of “hacker”.

Confessions

● A simple arithmetic right shift of a negative number is not quite the same as division by a power of 2. (Sometimes you can be off by one; two extra instructions can adjust for this.)

● A detailed analysis of the divide-by-a-constant approach (eg the divide-by-17 example) can avoid the small errors. People have worked out approaches for dividing exactly by 3, 5, 7 using multiplications and shifts….

● Chapter 10 of Hacker’s Delight is “Integer Division by Constants” and is 72 pages that are quite mathematical. Also the word “magic” appears many times.

Character Data

● Characters are encoded into binary. One historical method that is still used is a 7-bit code, American Standard Code for Information Interchange.

● ASCII contains upper and lower-case letters, punctuation marks and digits that would commonly have been needed for US English data processing needs.

● ASCII also encodes other things that control the assumed teletypewriter machine: these control codes include carriage return, line feed, tab, ring the bell, end of file, …

Control Codes

● Control codes are often invisible when printed, and some text editors won't show them. But software (e.g. compilers) can be thrown off by them. Leads to puzzled students sometimes.

● A common convention is to discuss control codes by using a letter preceded by ^. Eg, ^C. On many keyboards, pressing the Ctrl key at the same time as the letter can generate a control code.

Backslash Escapes

● Many programming languages have backslash escapes to represent some popular control codes.

● Eg, '\t', '\n', '\r' in Java and C. You type \t as two characters, but it represents a single character (a tab, ASCII code 9, ^I)

● '\123' represents a single byte whose ASCII code is 123 in octal (base 8 – more on this later) (In many programming languages)

Unicode

● Many (non-US) people found ASCII too limiting, so attempts were first made to extend it to other Western European character sets.

● Unicode seeks to represent all current and historical symbols in all cultures and languages. Original idea was that 16 bits was enough. Java uses this early Unicode idea, so char in Java is 16 bits.

● Unicode version 9.0 (2016) has >100k characters plus many more symbols. 16 bits is not enough.

● Each Unicode character is represented by a numeric code point. First 128 of them correspond to ASCII, for backwards compatibility.

First 2^16 are the Basic Multilingual Plane.

● There are several ways of encoding code points into bytes.

UTF-8, UTF-16 etc

● UTF-32 uses 32 bits to store a code point. It is a fixed-width encoding: if I know how many characters I need to store, I know precisely how many bytes it will cost me.

● But UTF-32 wastes bytes. Characters outside the Basic Multilingual Plane are rare. Codepoints from 0-127 (“ASCII”) are very common.

● UTF-16 represents codes in the BMP with 2 bytes. Weird codes outside need 2 more. Not a fixed-width encoding.

● UTF-8 represents ASCII codes in 1 byte, other BMP codepoints with 2 or 3 bytes, and weird codes with 4.

● UTF-8 is fully backward compatible with old ASCII files.

● In Java, the constructor for OutputStreamWriter (which can wrap a FileOutputStream) has an optional parameter that can be “UTF-8” or “UTF-16” etc. Otherwise, it uses the platform's default charset.
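The byte counts quoted above can be checked with `String.getBytes`. This is a small sketch; the class and method names are made up, and UTF-16BE is used for the 16-bit case only so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));             // prints 1 (ASCII)
        System.out.println(utf8Length("\u00E9"));        // prints 2 (e with acute accent)
        System.out.println(utf8Length("\u20AC"));        // prints 3 (euro sign)
        System.out.println(utf8Length("\uD83D\uDE00"));  // prints 4 (outside the BMP)
        System.out.println(utf16Length("\uD83D\uDE00")); // prints 4
    }
}
```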

Strings

● A string is a sequence of characters. In Java it's represented by a String object, as you know.

● In lower-level programming (C, assembler), a string is more likely viewed as a sequence of consecutive memory locations that store the successive characters in the string.

● Q: how do you know when the next memory location doesn't store the next character in a string (how to know a string is over)?

● Common convention: null-terminated string. A string always ends with a character whose ASCII code is 0. “C-style string”

● ARM assembly language: if you want a C-style string, you have to put the null at the end. Fun bugs if you forget.

Representing Fractional Values

● You can represent fractional values using a fixed-point convention. In decimal and for money (unit dollars), an example would be agreeing to store each value as a whole number of cents.

● So 2.35 is stored as 235. We have shifted the decimal point by a fixed amount, two positions.

● In unsigned binary 101.011 means 101_2 and a fraction of 0*2^(-1) + 1*2^(-2) + 1*2^(-3). I.e., 5.375.

Fractional Values, 2.

● We can store all numbers by shifting the binary point right 3 (for example). So we are measuring everything in eighths. 5.375_10 is then stored as 101011, instead of 101.011.

● Can add and subtract fixed-point numbers successfully, as long as each is, for instance, measured in eighths.

● But multiplying two numbers given in eighths results in a product that is measured in 64ths. So have to divide by 8 (just shift right 3 positions...)

● Advantage: fractions handled using only integer arithmetic.
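A minimal sketch of arithmetic in eighths, using only int operations once the values are converted. All names here are hypothetical, and the sketch assumes non-negative values small enough not to overflow.

```java
class FixedPointEighths {
    static final int FRACTION_BITS = 3; // everything is measured in eighths

    static int toFixed(double v) {
        return (int) Math.round(v * (1 << FRACTION_BITS));
    }

    static double toDouble(int fixed) {
        return fixed / (double) (1 << FRACTION_BITS);
    }

    // Sums of eighths are still eighths.
    static int add(int a, int b) {
        return a + b;
    }

    // The raw product is in 64ths, so shift right 3 to get back to eighths.
    static int multiply(int a, int b) {
        return (a * b) >> FRACTION_BITS;
    }

    public static void main(String[] args) {
        int x = toFixed(5.375);
        System.out.println(Integer.toBinaryString(x));            // prints 101011
        System.out.println(toDouble(multiply(x, toFixed(2.0))));  // prints 10.75
    }
}
```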

Floats, Doubles etc.

● Scientific processes generate huge and tiny numbers. No single fixed-point shift will suit everything.

● Measured values have limited precision - no sense to store the number of meters to Alpha Centauri as an integer.

● Floating-point representation is a computer version of the “scientific notation” you learned in school, eg: 3.456 x 10^(-5)

● +3.456 is the significand or mantissa. We've 4 sig. digits

● Number is normalized to 1 significant digit before decimal point.

● The exponent is -5, and the sign is positive.

IEEE-754 Standard

● IEEE-754 is the standard way to represent a binary floating point value in 16 (half-precision), 32 (single precision), 64 (double precision), 128 (quad precision) or 256 (octuple precision) bits.

● 32-bit form available in C & Java as float; 64-bit form as double.

● Overall, it's a sign-magnitude scheme.

● But exponent is signed quantity using the biased approach.

IEEE-754 Floats

● 1 sign bit, S. (Bit position 31)

● next, 8 exponent bits with binary value E.

● Exponent bias b of 127.

● 23 fraction bits fff..fff to represent the significand of 1.ffff..fff. Note “hidden” or “implicit” leading 1.

● Formula for “normal” floats:

Value = (-1)^S * 1.ffff...fff * 2^(E-b)

Example

● Find the numerical value of a float with bits
0 00111100 00110000000000000000000

● Use formula (-1)^S * 1.ffff...fff * 2^(E-b)

● S=0. E = 00111100_2 = 60_10. b=127 (always)
ffff...fff = 00110000000000000000000

● So: (-1)^0 * 1.0011_2 * 2^(-67)
= +1 * (1 + 3/16) * 2^(-67)

● It's a small positive number. Calculator for details.
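The formula for normal floats can be tried out in Java and compared with the library's own decoding, `Float.intBitsToFloat`. The sketch below assumes the example's 32 bits are written as the hex constant 0x1E180000 (sign 0, exponent 00111100, fraction 0011 followed by zeros); the class and method names are illustrative, and subnormals, infinities and NaNs are deliberately ignored.

```java
class DecodeFloat {
    // Applies (-1)^S * 1.fff...f * 2^(E-127) to a 32-bit pattern (normal floats only).
    static double decodeNormal(int bits) {
        int s = bits >>> 31;            // sign bit
        int e = (bits >>> 23) & 0xFF;   // 8 exponent bits
        int f = bits & 0x7FFFFF;        // 23 fraction bits
        double significand = 1.0 + f / (double) (1 << 23); // hidden leading 1
        double magnitude = Math.scalb(significand, e - 127); // significand * 2^(E-b)
        return (s == 0) ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        int bits = 0x1E180000;
        System.out.println(decodeNormal(bits) == Float.intBitsToFloat(bits)); // prints true
    }
}
```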

Example 2: Determine the bits

● Determine how to represent -2253.2017.

● Helpful facts: 2253 = 2048 + 128 + 64 + 8 + 4 + 1.

● 0.2017 * 2^24 = 3383964 + fraction

● 3383964_10 = 1100111010001010011100_2

● Now let's put the pieces together.

Representable Values

● There is an uncountable infinity of real numbers.

● There are at most 2^32 different bit patterns for a float. No bit pattern represents more than one real number.

● Therefore, there are real numbers that cannot be represented. (Overwhelming majority.)

● For any given exponent value, there are only 2^23 different mantissas, 1.000... to 1.111...

● No number whose exponent exceeds 255-127

● No number whose exponent < (0 – 127).

● (Though in CS3813 you'll learn about subnormal numbers, so I've lied; IEEE-754 is a fair bit hairier than I've presented.)

Example

● What is the next representable value, after 5/16?

● 5/16 = 0.0101_2 * 2^0 = 1.0100...0_2 * 2^(-2)

● Now let's reason.
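One way to check the reasoning afterwards is Java's `Math.nextUp`, which returns the next representable float. With 23 fraction bits and an exponent of -2, one would expect the gap to be 2^(-2-23) = 2^(-25); the sketch below (with a made-up class name) tests exactly that.

```java
class NextFloat {
    // Gap between x and the next larger representable float.
    static float gapAbove(float x) {
        return Math.nextUp(x) - x;
    }

    public static void main(String[] args) {
        float x = 5.0f / 16.0f;
        System.out.println(gapAbove(x) == Math.scalb(1.0f, -25)); // prints true
        // The next value differs from x only in the last fraction bit.
        System.out.println(Float.floatToIntBits(Math.nextUp(x)) - Float.floatToIntBits(x)); // prints 1
    }
}
```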

IEEE-754 Doubles

● 64 bits, divided up into

– 1 sign bit

– 11 exponent bits, bias 1023

– 52 fraction bits (53 significant bits, counting the hidden leading 1)

● More exponent bits: better range of numbers

● More fraction bits: smaller gaps between representable numbers (higher precision).

● Otherwise, like Float.

Machine Instructions

● Another thing that becomes binary: machine instructions.

● Typical m/c instruction has

– an operation code (opcode) that indicates which of the supported operations is desired

– codes indicating addressing modes that provide the input data (“operands”)

– code indicating where the result should be put

– code indicating the conditions under which the instruction should be ignored

● An instruction-format specification helps you determine how to assemble these codes into a machine-code instruction.

● Chapter 1: To store the constant 8 into a register variable: 0b11100011101000001101000000001000 in ARM m/c code.

● We'll study ARM instruction formats later.

Hexadecimal

● A decimal number has about 1/3 as many digits as the corresponding binary number: small base, lots of digits.

● Humans do poorly with many-digit numbers.

● So for humans, it is handy to work in larger bases. But it's hard to convert base 10 numbers into base 2 numbers.

● Base 16, or hexadecimal (hex), is the go-to base for machine-level human programmers.

– numbers have few digits

– it's easy to convert to/from binary

Hexadecimal digits

● Whereas decimal uses digits 0 to 9, hexadecimal uses 0-9,A,B,C,D,E,F.

– Digit 7 has value seven, just like decimal

– Digit A has value ten, B has value eleven, … F has value fifteen

● Numbers have a ones place, a sixteens place, a 16^2s place, a 16^3s place, etc.

● 2F32 means 2*16^3 + 15*16^2 + 3*16^1 + 2

● In many languages, you prefix hex constants with 0x

so int fred = 0x2f32; // works fine in Java.

int george = 0x100; // equivalent to george = 256;

Converting Hex to Binary

● Because 16 = 2^4, each hex digit expands to 4 binary digits.

● For 0x9A4

– the 9 expands to 1001

– the A expands to 1010

– the 4 expands to 0100

● So 0x9A4 expands to 0b100110100100

Converting Binary to Hex

● Binary → Hex is the reverse process.

● Only trick: you want the binary number to be the correct length (a multiple of 4 in length)

● So zero extend it, if necessary

● Then each group of 4 bits collapses to a hex digit.

● 101010 → 0010 1010 → 2A

● Rather than count bits and zero extend first, just circle groups of 4 bits starting from the right. If the last group has fewer than 4 bits, it's okay.

Small Negative Numbers

● We usually use unsigned hex to reflect bit patterns, even if they are meant to be 2's complement numbers.

● So what does a 32-bit negative number look like, if it is pretty close to 0?

● The corresponding bit pattern has a lot of leading ones. When converted to hex, each group of 4 ones turns into an F digit.

● So your not-very-negative number has lots of leading F's.

● 0xFFFF FFFB is the bit pattern for -5_10 = 0b111...111011

Hex Arithmetic

● It's sometimes handy to do addition and subtraction of hex numbers without converting to decimal.

● (typically, subtraction when you want to figure out the size of something in memory, and you've got the starting and ending positions)

● Like Grade 3, except your addition/subtraction table is bigger.

– don't memorize: just use the values of digits

– you carry and borrow 16, not 10

Example: Hex Addition

● A debugger reports that an item begins in memory at address 0x1234. You know its size is 0x7D. What is the first address after the item?

● 1234 + 7D

● 4 is worth 4, D is worth 13. Sum to 17, or 0x11

● Keep the 1, carry the 16 to the next stage

● 3 and 7 sum to 10. But there is a carry, bumping you to 11, or 0xB. No carry to next stage.

● So 1234 + 7D = 12B1.

Example: Hex Subtraction

● 1203 - 0F15

● Since 3 < 5, borrow from 0x20 (making it 0x1F).

● You borrowed 16, so 3 is now worth 16+3=19.

● Take away 5, get 14. Hex digit for 14 is E.

● F-1 is E (no borrow needed).

● 1-F needs a borrow, makes 1 worth 16+1=17.

● Take away F (value 15), get 2. (hex digit is 2).

● You borrowed from the 1, so it's 0. 0-0=0.

● You could write down this leading zero, if you wanted....

● 1203-0F15 = 02EE
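Hand calculations like the hex addition and subtraction examples can be double-checked with a few lines of Java, using `Integer.parseInt` with radix 16 and `Integer.toHexString`. The class and method names are illustrative only.

```java
class HexArithmetic {
    // Adds two hex strings and returns the sum as an upper-case hex string.
    static String addHex(String a, String b) {
        int sum = Integer.parseInt(a, 16) + Integer.parseInt(b, 16);
        return Integer.toHexString(sum).toUpperCase();
    }

    // Subtracts hex string b from hex string a (assumes a >= b).
    static String subHex(String a, String b) {
        int difference = Integer.parseInt(a, 16) - Integer.parseInt(b, 16);
        return Integer.toHexString(difference).toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(addHex("1234", "7D"));   // prints 12B1
        System.out.println(subHex("1203", "0F15")); // prints 2EE
    }
}
```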

Octal (base 8)

● In bygone days, octal (base 8) was an alternative to hexadecimal.

● Conversion to/from binary is by grouping bits into groups of size 3, but otherwise same as hex.

● Octal survives in some niches. In a string or a character, a backslash can be followed by 3 octal digits (typically the ASCII code of some otherwise unprintable character).

● In Java and C, any digit string that starts with a leading zero is assumed to be octal. Remaining digits must be 0 to 7.

● So: int fred = 09; // mysterious compile error

ARM v4T

CS2253

Owen Kaser, UNBSJ

ARM v4T

● History of ARM processors

● R is for RISC

● Registers

● Status flags and conditional execution

● Memory

● Example program

History of ARM v4T

● Acorn Computers in the UK, early 1980s

● Designed own CPU for a line of PCs, based on cutting-edge design trends then.

● Cutting edge was RISC: Reduced Instruction Set Computers.

● ARM was the Acorn RISC Machine

● Circa 1990, retitled Advanced RISC Machine and the design was licensed to other companies to manufacture or add extra components, as part of a System-on-a-Chip.

● Like the extra stuff to make an Apple Newton, an iPod, a Nokia phone...

History of ARM v4T, cont.

● The ISA has been added to over the years. ARM v4T dates from early 1990s.

● Actually, v4T has the regular 32-bit ARM ISA and a simpler Thumb ISA, where instructions can be 16 bits long. We ignore Thumb in CS2253.

● New versions of the ISA have come out in the meantime (though old are still being produced).

● ISAs that evolve tend to get ugly, preserving backwards compatibility. There is now a 64-bit ISA that apparently is once again clean. Maybe we can shift 2253 to it in future.

ARM is Popular

● ARM variations are the champion in popularity for mobile devices.

● By 2002, there were 1.3 billion manufactured.

● In just 2012, 8.7 billion were manufactured.

What's RISC?

● The R in ARM stands for RISC: Reduced Instruction Set Computer.

– in contrast to the extremely complicated CPUs of the late 1970s (VAX had an “evaluate polynomial” instruction, for instance). A “CISC” machine has some advantages, in “code density”.

– complex means expensive to make, and hard to make run fast.

– RISC tried to simplify ISAs, so implementation can be simple and fast.

RISC Principles

● There should be a small number of instructions.

● Every instruction should do something very simple, so it can run in 1 clock cycle.

● All machine codes should be the same length (32 bits).

● There should be relatively few different machine code formats.

● Should be a fair number of storage registers, and most operations should involve only them.

● Values should be transferred between RAM and registers by explicit Load and Store instructions.

ARM v4T Components

● There are 15 main registers, R0 to R14. Each can store any 32-bit value. R13 and R14 are a tad special.

● As a first approximation, a HLL programmer can view them as the only real “variables” you have.

● R15 is also called PC (Program Counter) and keeps track of where to fetch instructions.

● Due to “pipelining”, when an instruction executes, PC actually stores the address of the instruction that is 8 bytes ahead. Pipelining is an advanced CS3813 topic.

Example Instructions

● Add two register values, result in 3rd register.

● Exclusive-OR two register values, result in 3rd.

● Change the program counter (subtract 16 from it)

● Get a halfword from memory, at an address that is 10 more than the current value of R1. Sign extend it and put it in R2. Modify R1 to be increased by 10.

● Store the first byte in R1 into memory, address obtained by taking R2 and shifting it left 2, then adding that value to R3.

In each case, the technical ARM documentation can tell you how the instruction would be encoded into bits.

ARM Components: CPSR

● The Current Program Status Register is a collection of 12 miscellaneous bits.

– 4 keep track of how recent instructions went (“status flags”)

– 8 allow you to see and control the processor configuration (“control bits”). We don't need them initially.

● Chapter 2 of the textbook tells you about other advanced concepts that aren't needed until the hardest parts of the course, much later.

● Please ignore anything about “processor modes” other than User, for now.

Status Flags

● Most ISAs (except the MIPS ISAs we often study in CS3813) use status flags.

● They help record the outcome of an earlier instruction, so that your program can do different things, depending on what happened earlier.

● Flags are N (bit 31 was 1), Z (all bits were 0), V (result oVerflowed), and C (there was a Carry out)

● Many instructions have a version that updates the flags and another that doesn't. But some instructions always update the flags.

Conditional Execution

● Most ARM instructions can be made conditional, so they do nothing unless the status flags satisfy the specified condition.

● Example: 64-bit counter.

– First instruction sets flags while incrementing the low-order 32 bits

– Second instruction runs conditionally and only increments the high-order 32 bits if the Z flag is set

– Maybe low-order bits in R1 and high-order in R2

● Non-ARM ISAs generally have only a few conditional instructions (the ones that implement IF)
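The 64-bit counter idea can be sketched in Python. This is an illustration only: the function name increment64 is made up, and the mnemonics ADDS and ADDEQ in the comments are my guess at how the two instructions might be written.

```python
MASK32 = 0xFFFFFFFF

def increment64(lo, hi):
    """Mimic something like ADDS lo, lo, #1 followed by ADDEQ hi, hi, #1."""
    lo = (lo + 1) & MASK32     # first instruction: 32-bit wraparound add
    z_flag = (lo == 0)         # Z is set when the result is all zeros
    if z_flag:                 # second instruction runs only if Z is set
        hi = (hi + 1) & MASK32
    return lo, hi

print(increment64(5, 7))           # (6, 7)
print(increment64(0xFFFFFFFF, 0))  # (0, 1)
```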

Constants

● Many ARM operations can use constants (just like you can add two registers together, you can add a register to a constant, etc.)

● ARM constants are weird. Numbers -128 to 255 are okay, as are a few larger numbers

● Allowable larger numbers are those obtained by rotations of 0-255 by an even number of positions, etc. More later.

Memory

● The ARM processor is byte addressed, in that every byte of memory has its own address, starting from address 0.

● Addresses are 32 bits long, leading to a maximum of 4GB of memory (at least for a given running program). [But note that some addresses are typically carved out for non-memory.]

● Special Load and Store instructions are used to access memory. You can transfer 1, 2 or 4 bytes in one operation.

● In ARMv4, 4-byte transfer must begin at a memory address that is a multiple of 4: the alignment rule. Similarly, 2-byte transfers must begin at an even address.

Big Endian vs Little Endian

● When a 4-byte word is laid out in memory, does the most-significant byte (big end) come first, or the least-significant byte (little end)?

● A religious war arose between the two camps.

● ARM7TDMI processor can do either, but the default for ARM is usually little-endian.

● The issue is only visible if you write a word/halfword into memory and then try to read it back in smaller pieces (eg bytes).
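To see the two byte orders side by side, here is a small Python sketch using the standard struct module. The word value 0x12345678 is just an arbitrary example.

```python
import struct

word = 0x12345678
little = struct.pack("<I", word)  # least-significant byte at the lowest address
big = struct.pack(">I", word)     # most-significant byte at the lowest address

print(little.hex())  # 78563412
print(big.hex())     # 12345678
print(little[0] == 0x78, big[0] == 0x12)  # True True
```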

Example Program

● Compute 10+9+8+7+6+5+4+3+2+1

– Put the constant 0 into R1

– Put the constant 10 into R2

– Add R1 and R2, put the result into R1

– Subtract the constant 1 from R2 and set the status flags

– If the Z flag is not set, reset the PC to contain the address of the 3rd instruction above.

● Each of these instructions can be encoded into machine code, if you are willing to slog through the reference manuals enough.
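Before worrying about machine code, it may help to trace the five steps in a high-level language. The Python below is a sketch that follows the steps one for one; the function name sum_down_from and the variables r1, r2 and z_flag are stand-ins for the registers and the Z flag.

```python
def sum_down_from(n):
    r1 = 0                  # put the constant 0 into R1
    r2 = n                  # put the constant (10 on the slide) into R2
    while True:
        r1 = r1 + r2        # add R1 and R2, put the result into R1
        r2 = r2 - 1         # subtract 1 from R2 and set the status flags
        z_flag = (r2 == 0)
        if z_flag:          # the branch back is taken only if Z is not set
            break
    return r1

print(sum_down_from(10))  # 55
```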

Assembly Language

CS2253

Owen Kaser, UNBSJ

Assembly Language

● Some insane machine-code programming

● Assembly language as an alternative

● Assembler directives

● Mnemonics for instructions

Machine-Code Programming (or, Why Assemblers Keep Us Sane)

● Compute 10+9+8+7+6+5+4+3+2+1

– Put the constant 0 into R1

– Put the constant 10 into R2

– Add R1 and R2, put the result into R1

– Subtract the constant 1 from R2 and set the status flags

– If the Z flag is not set, reset the PC to contain the address of the 3rd instruction above.

● Let's try to make some machine code.

Put 0 into R1

● There's a Move instruction, or you could subtract a register from itself, or EOR a register with itself, or... let's use Move.

● Book Fig 1.12

● cond = 1110 means unconditional

● S=0 means don't affect status flags

● I=1 means constant; opcode = 1101 for Move

● Rn = ???? say 0000; Rd = 0001 for R1

● bits 8-11: 0000 Rotate RIGHT by 0*2

● bits 0-7: 0x00

● So machine code is

1110 00 1 1101 0 0000 0001 0000 00000000 = 0xE3A01000

Put 10 into R2

● cond = 1110 means unconditional

● S=0 means don't affect status flags

● I=1 means constant; opcode = 1101 for Move

● Rn = ???? say 0000; Rd = 0010 for R2

● bits 8-11: 0000 (rotate right by 2*0)

● bits 0-7: 0x0A

● So machine code is

1110 00 1 1101 0 0000 0010 0000 00001010 = 0xE3A0200A

Add R1 and R2, put result into R1

● Same basic machine code format as Move

● cond = 1110 for “always” ; I=0 (not constant)

● opcode = 0100 for ADD; S=0 (no flag update)

● Rn = R1, Rd = R1

● shifter_operand = 0x002 for R2 unmolested

● Having fun yet??

● 1110 00 0 0100 0 0001 0001 0000 0000 0010 = 0xE0811002

Subtract 1 from R2, result into R2

● Same basic machine code format as Move

● cond = 1110 for “always” ; I=1 (constant)

● opcode = 0010 for Subtract; S=1 (yes flag update)

● Rn = R2, Rd = R2

● shifter_operand = 0x001 for 1 rotated right 0 positions

● 1110 00 1 0010 1 0010 0010 0000 0000 0001 = 0xE2522001

Maybe Rinse and Repeat

● If the Z flag is not set, we want to go back 2 instructions before this one.

● book Fig 3.2

● cond = 0001 means “when Z flag is not set”

● L=0 means “don't Link” (Link changes R14)

● signed offset should be -4. The PC is already 2 instructions ahead of this one, and we want to go back 2 more than that.

● 0001 101 0 111111111111111111111100 = 0x1AFFFFFC

● Are you REALLY having fun yet ??
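The bit-field packing done by hand on the last few slides can be double-checked with a short Python sketch. The function and constant names here are invented for the illustration; the field positions are the ones used in the hand encodings above (cond in bits 31-28, I in bit 25, opcode in bits 24-21, S in bit 20, Rn in bits 19-16, Rd in bits 15-12).

```python
def encode_data_processing(cond, i_bit, opcode, s_bit, rn, rd, shifter_operand):
    """Pack the fields of a data-processing instruction into 32 bits."""
    return ((cond << 28) | (i_bit << 25) | (opcode << 21) | (s_bit << 20)
            | (rn << 16) | (rd << 12) | shifter_operand)

def encode_branch(cond, link, word_offset):
    """Pack a branch: bits 27-25 are 101, offset is a signed 24-bit word count."""
    return (cond << 28) | (0b101 << 25) | (link << 24) | (word_offset & 0xFFFFFF)

ALWAYS, NOT_Z = 0b1110, 0b0001
MOV, ADD, SUB = 0b1101, 0b0100, 0b0010

program = [
    encode_data_processing(ALWAYS, 1, MOV, 0, 0, 1, 0x000),  # put 0 into R1
    encode_data_processing(ALWAYS, 1, MOV, 0, 0, 2, 0x00A),  # put 10 into R2
    encode_data_processing(ALWAYS, 0, ADD, 0, 1, 1, 0x002),  # R1 = R1 + R2
    encode_data_processing(ALWAYS, 1, SUB, 1, 2, 2, 0x001),  # R2 = R2 - 1, set flags
    encode_branch(NOT_Z, 0, -4),                             # back to the add
]
for word in program:
    print(format(word, "08X"))
# E3A01000, E3A0200A, E0811002, E2522001, 1AFFFFFC
```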

How'd you know the cond codes?

How'd You Know the Shifter Magic?

An Assembler

● Rather than making you assemble together all the various bit fields that make up a machine instruction, let's make a program do that.

● You are responsible for breaking the problem down into individual instructions, which will be given human friendly names (mnemonics).

● You give these instruction names to the assembler, along with various other directives (aka pseudo-ops) that control how the assembler does its job.

● It is responsible for producing the binary machine code.● It also produces symbol table information needed by a

subsequent linker program, if you write a multi-module program.

Assembly Language

● You communicate with the assembler via assembly language (mix of mnemonics, directives, etc.)

● Assembly language is line-oriented.

● A line consists of

– an optional label in column 1

– an optional instruction or directive (and any arguments)

– an optional comment (after a ; )

● Example:

here b here ; create infinite loop.

● “here” is a label that marks a place

● b is a branch instruction, forces the PC to a new location (here).

The Bad News

● Anyone who creates an assembler gets to define their own assembly language (ignoring manufacturer's suggestions). Dialects?

● Textbook shows code for Keil and Code Composer Studio. But we use Crossware's assembler, which is yet another dialect and it's hard to find documentation on it.

● Textbook talks about “Old ARM format” and “UAL format”. Crossware is a mixture (more old).

Our Program in Assembly

mymain mov r1,#0 ← mymain is the label

mov is the instruction

# precedes the constant

; nice comment, eh?

mov r2,#10 ; put 10 into r2 (bad comment)

myloop add r1, R1, r2 ← case insensitive for reg names

subs r2, r2, #1 ← final s means to affect flags

bne myloop ← condition is “ne” (z flag false)

sticky b sticky ← so we don't fall out of pgm

end ← directive to assembler: you're done

;don't use “end”; it seems to be buggy in Crossware

Register Names

● r0 to r15 (alias R0 to R15)

● SP or sp, aliases for R13

● LR or lr, aliases for R14

● PC or pc, aliases for R15

● cpsr or CPSR (the status registers etc)

● spsr or SPSR, apsr or APSR (later)

● not s0-s3 or a1-a4 (unlike book page 63)

Popular Assembler Directives

● Textbook Section 4.4 describes the set of directives supported by the Keil assembler and the TI assembler.

● Our Crossware assembler is different than both (but closer to Keil).

● Let's look at directives to

– set aside memory space for variables/arrays

– define a block of code or data

– give a symbolic name to a value

Directive to Set Aside Memory

● The SPACE directive tells the assembler to set aside a specified number of bytes of memory. These locations will be initialized to 0.

● Usually have a label, since you need a name to refer to the allocated memory.

● Example

– myarray SPACE 100

– myarr2 SPACE 100*4 ←constant expression's ok

● Later, instructions can load and store things into the chunks of memory by referring to the names used.

● If myarray starts at address 1234, myarr2 starts at 1234+100

Use of SPACE

● An assembly language programmer uses SPACE for the same reasons that a Java programmer uses an array.

Directives for Memory Variables

● Use DCB to declare an initialized byte variable.

● DCW for initialized halfword, DCD for word.

● Example

myvar1 DCB 50 ← decimal constant

myvar2 DCB 'x' ← ASCII code of 'x'

myvar3 DCB 0x55 + 3 ← constant expression

● If myvar1 ends up being at address 1234, then myvar2 will be at 1235 and myvar3 at 1236

Alignment

● DCW assumes you want the memory variable to start at a multiple of 2 (“halfword aligned”)

● DCD assumes you want alignment to a multiple of 4.

● To achieve this, assembler will insert padding.

● If you really want to set aside a word without padding, use DCDU. The “U” is for unaligned.

● There's also DCWU.

Alignment Example

v1 DCB 10

v2 DCW 20

v3 DCB 30

v4 DCD 40

If v1 is at address 3000, then

v2 starts at 3002 (1 byte of padding)

v3 is at 3004

v4 starts at 3008 (3 bytes padding)

v1 DCB 10

v2 DCWU 20

v3 DCB 30

v4 DCDU 40

If v1 is at 3000, then

v2 starts at 3001

v3 is at 3003

v4 starts at 3004 (aligned by luck)
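The padding rule can be sketched as a tiny layout calculator in Python. The tables SIZES and ALIGNMENT and the function layout are made up for this illustration; they encode only the rules stated on the Alignment slide, not the behaviour of any particular assembler.

```python
SIZES = {"DCB": 1, "DCW": 2, "DCD": 4, "DCWU": 2, "DCDU": 4}
ALIGNMENT = {"DCB": 1, "DCW": 2, "DCD": 4, "DCWU": 1, "DCDU": 1}

def layout(start, declarations):
    """Return {label: address} for a list of (label, directive) pairs."""
    address, result = start, {}
    for label, directive in declarations:
        align = ALIGNMENT[directive]
        address += -address % align   # padding needed to reach the boundary
        result[label] = address
        address += SIZES[directive]
    return result

aligned = layout(3000, [("v1", "DCB"), ("v2", "DCW"), ("v3", "DCB"), ("v4", "DCD")])
unaligned = layout(3000, [("v1", "DCB"), ("v2", "DCWU"), ("v3", "DCB"), ("v4", "DCDU")])
print(aligned)    # {'v1': 3000, 'v2': 3002, 'v3': 3004, 'v4': 3008}
print(unaligned)  # {'v1': 3000, 'v2': 3001, 'v3': 3003, 'v4': 3004}
```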

More Alignment Control

● Keil assembler has an ALIGN directive that can force alignment to the next word boundary (inserting 0-3 bytes of padding).

● In Crossware, the directive takes a numeric argument. So ALIGN 4 (or ALIGN 8)

DCB with Several Values

● You can use DCB with several comma-separated values

● Several consecutive memory locations are set aside. A label names the first of them.

● Example: foo DCB 1,2,3,4

● We can access the location initialized to 3 as “foo+2”

● A quoted string is equivalent to a comma separated list of ASCII values.

DCB “XY” is same as DCB 'X','Y' or DCB 88,89

● DCW and DCD can also take a comma-separated list.

● Common use: make a small initialized table.

DCB: Signed or Unsigned?

● DCB's argument must be in the range -128 to +255.

● -ve values are 2's complement

● +ve values are treated as unsigned

● So DCB -1, 255 is same as DCB 255, 255

● Similarly DCW's arguments in range -32768 to +65535.

● DCD from -2^31 to +2^32-1
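A sketch of the DCB range rule in Python; the function name dcb_byte is hypothetical. Masking with 0xFF gives the 2's complement pattern for negative arguments and leaves non-negative ones alone.

```python
def dcb_byte(value):
    """Return the byte an argument of DCB would be stored as."""
    if not -128 <= value <= 255:
        raise ValueError("out of range for DCB")
    return value & 0xFF

print(dcb_byte(-1), dcb_byte(255))  # 255 255
print(dcb_byte(-128))               # 128
```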

AREA directive

● In general, an assembly language program can have several blocks of data and several blocks of code. And it can be written in several different source-code files.

● The AREA directive marks the beginning of a new block. You give it a new name and specify its type.

– eg AREA fred,code

– You can go back to a previous area by using an old name

● A tool called a linker runs after the assembler to put your various sections (and any library routines you need) into a single program.

● Much more on linkers later in the course

AREA Example

AREA mycode,code

foo add R1, R2, R3

add R4, R5, #10

AREA mydata, data

var1 dcb “cs2253”

AREA mycode ← continues mycode where it left off

add R6, R7, R8

This feature allows for us to show our data declarations near the code that uses them (maybe good software engineering), even if the different sections end up being far apart in memory.

Memory picture on board...

Code in Data, Data in Code

● Q: Is this allowed; if so, what does it do?

AREA mycode, CODE

starthere add R1, R2, R3

DCD 0x1234567 ; this line is fishy

add R2, R3, R4

AREA mydata, DATA

var1 DCD 1234

var2 add R2, R3, R4 ; this line is also fishy

var3 DCB “hello world”,0

Operators in expressions

add R4, R5, #10 ↔ add R4, R5, #3+3+3*1+1

● Both of the above generate the same single machine-code instruction.

● The + and * operators are just requests to the assembler to do a little bit of math when it processes the line. No runtime effect.

● Other operators supported by Crossware are | and & (bitwise OR and AND). Also >> and <<.

● I can't find XOR, mod (unlike Keil and CCS on page 75)

EQU: Give a Symbolic Name

● The EQU directive is used to give a symbolic name to an expression. Use it to make code easier for humans.

● Example

fred DCB 20, 200, “Frederick Wu”

fred_age EQU fred+0

fred_height EQU fred+1

fred_name EQU fred+2

Subsequent instructions can load data from fred_height rather than the more cryptic fred+1.

But to the assembler, both loads will be equivalent.

Directives Crossware May Lack

● Compared to Keil and CCS, our Crossware assembler does not appear to support some directives. I can't find good documentation, so maybe they exist under a different name :(

– ENTRY

– RN

– LTORG, though we do have the “LDR rx,=” construct (eg textbook page 72)

– SETS

● Also, the SECTION directive only takes attributes CODE and DATA. Not the others in textbook Table 4.3.

● Crossware does support macros and conditional assembly, advanced topics for later in the course.

A Few Instructions

● Assembler directives are great, but the main thing in assembly language is to specify instructions (and then get the assembler to generate the associated machine codes)

● So far (from the loop example) we know

– add

– sub

– b

– mov

A Few More Instructions (Table 4.1)

● These are math-ish instructions:

– RSB – reverse subtract

– ADC, SBC – add/subtract with carry

– RSC – reverse subtract with carry

– MVN – move “negative” (a bitwise NOT)

– AND, ORR, EOR, BIC – bitwise logical operations

– MUL, SMULL, UMULL – various * ops

– MLA, SMLAL, UMLAL – multiply/accumulate.

Mnemonics

● A mnemonic is “a memory aid”.

● It’s hard to remember the bit pattern associated with a machine operation.

● As a memory aid, we have human-friendly names like ADD, SUB etc.

● They are our mnemonics.

From Reference

Example: Swapping

● Java swap of v1 and v2:

temp = v1; v1 = v2; v2 = temp;

● Naive ARM swap of r1 and r2

mov r3, r1

mov r1, r2

mov r2, r3

● Clever swap avoids trashing r3 (book p 53):

eor r1, r1, r2

eor r2, r1, r2

eor r1, r1, r2

● Book “Hacker's Delight” is full of this kind of trick.
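To convince yourself that the three-EOR swap really works, here is a Python sketch that mirrors the three instructions line for line. The function name eor_swap is made up; ^ is Python's bitwise XOR.

```python
def eor_swap(r1, r2):
    r1 = r1 ^ r2   # eor r1, r1, r2
    r2 = r1 ^ r2   # eor r2, r1, r2  (r2 now holds the original r1)
    r1 = r1 ^ r2   # eor r1, r1, r2  (r1 now holds the original r2)
    return r1, r2

print(eor_swap(0x1234, 0xABCD) == (0xABCD, 0x1234))  # True
```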

Example: 64-Bit Addition

● Assume r1 contains the high 32 bits of value X and r2 contains the low 32 bits

● Assume r3 contains the high 32 bits of Y and r4 contains the low 32 bits.

● Want result in r5 (high bits) and r6 (low bits)

ADDS r6, r2, r4 ; add low words [affect flags]

ADC r5, r1, r3 ; add high words
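The ADDS/ADC pair can be traced in Python. This is a sketch: the function name add64 is invented, and the carry variable stands in for the C flag set by ADDS.

```python
MASK32 = 0xFFFFFFFF

def add64(x_hi, x_lo, y_hi, y_lo):
    """Return (high, low) words of X + Y, mimicking ADDS then ADC."""
    low_sum = x_lo + y_lo
    r6 = low_sum & MASK32                 # ADDS r6, r2, r4
    carry = low_sum >> 32                 # C flag: carry out of bit 31
    r5 = (x_hi + y_hi + carry) & MASK32   # ADC r5, r1, r3
    return r5, r6

print(add64(0, 0xFFFFFFFF, 0, 1))  # (1, 0)
```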

Computing Your Grade

● Test was out of 80. Prof told you how many points you lost (put the number into R1). Figure out what your grade out of 80 was:

RSB R2, R1, #80

● Now your grade is in R2.

Constant Operands

● Most instructions have register values or constants as the operands

● (Exception: Load and store instructions – later)

● All 8-bit constants are okay

● As are all constants of the form

RotateRight( v, 2*amt)

where v is an 8-bit value and amt from 0 to 15.

● So 0xAB is ok

– so is 0xAB0 ( 0xAB with a 28 bit rotate right)

– so is 0xB000000A (0xAB with a 4-bit rotate right)
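The rule can be turned into a brute-force checker. The Python below is only a sketch (the names rotate_right and find_encoding are made up); it tries all 16 rotation amounts and all 256 values of v.

```python
def rotate_right(value, amount):
    """Rotate a 32-bit value right by amount positions."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def find_encoding(constant):
    """Return (v, amt) with RotateRight(v, 2*amt) == constant, or None."""
    for amt in range(16):
        for v in range(256):
            if rotate_right(v, 2 * amt) == constant:
                return v, amt
    return None

print(find_encoding(0xAB))        # (171, 0)
print(find_encoding(0xAB0))       # (171, 14), i.e. a 28-bit rotate right
print(find_encoding(0xB000000A))  # (171, 2), i.e. a 4-bit rotate right
print(find_encoding(0x12345678))  # None
```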

Why This Weirdness

● Studies show that most constants are small.

● Among larger constants, bit-masks containing a small chunk of mixed bits are common (surrounded by zeros)

● Similar bitmasks that are mostly 1s can be handled by using the MVN instruction

● In a RISC architecture with 32-bit instructions, an instruction isn't long enough to encode an arbitrary 32-bit constant along with everything else. So just allow the most common ones.

● Assembler complains if you use a constant that cannot fit this weirdness.

Machine Instruction With Constant

The Barrel Shifter's Place

Shifted Register Operands

● If the second operand is a register value, the barrel shifter can modify it as it travels down the B bus.

● Barrel shifter is capable of

– LSL (logical shift left)

– LSR (logical shift right)

– ASR (arithmetic shift right)

– ROR (rotate right)

– RRX (33 bit ROR using carry between MSB and LSB)

● No modification desired? Shift by 0 positions!

● Carry flag is involved (but the new carry value is not necessarily written into the status register)

ARM Shifts and Rotates

How Much Shifting

● With RRX, it appears the register can only be shifted by one position.

● With others, you can shift 0 to 31 positions

– Either as a constant (“immediate”)

– Or by the least significant 5 bits of a register

● There are separate machine code formats for these cases.

– Bit 4 distinguishes the cases

– Bits 5 & 6 say what kind of shift/rotate

– Bits 11 to 7 involve which register, or the constant

Machine Encoding (from Ref Man)

● Below, shift field is 00 for LSL, 01 for LSR, 10 for ASR, 11 for ROR. RRX also 11 with count of 0 (and rotates only one position).

Example

● Machine code to take R1, logical left shift it by 3 positions, result in R2

● Assembly language: MOV R2, R1, LSL #3

● It’s the “immediate shift” format:

– Bits 27, 26, 25 and 4 are all 0

● Bits 11 to 7 are 00011 (for the #3)

● Bits 3 to 0 are 0001 (since R1 is being shifted)

● Bits 5 & 6 are 00 to select the LSL kind of shift

● Unconditional, bits 31 to 28 are 1110; MOV opcode 1101

● So: 1110 00 0 1101 0 ???? 0010 00011 00 0 0001 = 0xE1A02181
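Here is a Python sketch of what the four basic shift kinds do to a 32-bit value (for shift amounts 0 to 31), plus a check of the hand encoding above. All function names are invented for the illustration, and the ???? field (Rn) is assumed to be 0000.

```python
MASK32 = 0xFFFFFFFF

def lsl(value, n):
    return (value << n) & MASK32

def lsr(value, n):
    return (value & MASK32) >> n

def asr(value, n):
    value &= MASK32
    if value & 0x80000000:       # negative: copies of the sign bit shift in
        value -= 1 << 32
    return (value >> n) & MASK32

def ror(value, n):
    value &= MASK32
    return ((value >> n) | (value << (32 - n))) & MASK32

def encode_mov_shift_imm(rd, rm, shift_type, shift_imm):
    """MOV rd, rm, <shift> #shift_imm; unconditional, S=0, Rn field 0000."""
    return ((0b1110 << 28) | (0b1101 << 21) | (rd << 12)
            | (shift_imm << 7) | (shift_type << 5) | rm)

print(format(lsl(0x0000000F, 3), "08X"))                   # 00000078
print(format(asr(0x80000000, 4), "08X"))                   # F8000000
print(format(encode_mov_shift_imm(2, 1, 0b00, 3), "08X"))  # E1A02181
```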

Setting Conditions

● Any of the data-processing instructions so far can optionally affect the flags.

● At the machine-code level, bit 20 (called S) controls this: S=1 means to set the flags

● In assembly language, you append an S on the mnemonic. ADDS instead of ADD

● Also, there are some instructions whose sole purpose is to set flags: they don’t change any of R0 to R15.

● Compare (CMP, CMN) and Test (TST, TEQ) instructions.

Sum to a Limit

● Let’s add 1+2+3+… until sum exceeds R4 (unsigned)

MOV R1,#0 ; The sum

MOV R2,#1

LP ADD R1, R1, R2

ADD R2, R2, #1

CMP R1, R4 ; computes R1 – R4, sets flags

BLS LP ; LS = unsigned Lower or Same (CF=0 or Z=1)

; use LE for signed Lesser or Equal
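The sum-to-a-limit loop (MOV, MOV, then ADD, ADD, CMP, BLS) can be traced in Python. The function name sum_until_exceeds is hypothetical, and the sketch assumes the limit is small enough that the sum never wraps around 32 bits.

```python
MASK32 = 0xFFFFFFFF

def sum_until_exceeds(r4):
    r1, r2 = 0, 1                 # MOV R1,#0 and MOV R2,#1
    while True:
        r1 = (r1 + r2) & MASK32   # ADD R1, R1, R2
        r2 = (r2 + 1) & MASK32    # ADD R2, R2, #1
        if not r1 <= r4:          # CMP R1, R4; BLS loops while R1 <= R4
            break
    return r1

print(sum_until_exceeds(10))  # 15  (1, 3, 6, 10, then 15 exceeds 10)
```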

Multiplication

● The ARM v4 ISA has 6 multiplication instructions.

● Does not include “multiply by a constant”

● Why several?

– Should product be 32 bits or 64 bits?

– Are the input values considered signed?

32-Bit Products

● Fact: Since the product stored is the low-order 32 bits of the true product, signed and unsigned variations would give same result. So not separate instructions.

● MUL instruction: Two registers' values multiplied, low-order 32 bits stored in destination register.

● MLA (multiply and accumulate). The low order 32-bits of the product are added to a 3rd register and stored in a 4th register.

● Eg: MLA R4, R1, R2, R3 ; R4 = R1*R2 + R3
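The "signed and unsigned give the same low 32 bits" fact is easy to test. In this Python sketch (function names mul and mla are stand-ins for the instructions), -3 is represented by its 32-bit pattern 0xFFFFFFFD.

```python
MASK32 = 0xFFFFFFFF

def mul(rm, rs):
    """Mimic MUL: keep only the low-order 32 bits of the product."""
    return (rm * rs) & MASK32

def mla(rm, rs, rn):
    """Mimic MLA: low-order 32 bits of rm*rs + rn."""
    return (rm * rs + rn) & MASK32

minus_three = -3 & MASK32                  # bit pattern 0xFFFFFFFD
print(format(mul(minus_three, 5), "08X"))  # FFFFFFF1, the pattern for -15
print(mla(2, 3, 4))                        # 10
```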

64-Bit Product (Long Multiply)

● Results are stored in a pair of registers.

● The “accumulate” version has the product added onto the 64-bit value in a pair of registers.

● SMULL – signed long multiply

● UMULL – unsigned long multiply

● UMLAL – unsigned long multiply accumulate

● SMLAL – signed long multiply accumulate

● Ex: UMLAL R1, R2, R3, R4 means

(R1, R2) ← (R1, R2) + R3*R4 with unsigned math

– Above, R1 is the least significant 32 bits
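A Python sketch of the UMLAL example; the function name and parameter names are made up, with rd_lo playing the role of R1 and rd_hi the role of R2.

```python
def umlal(rd_lo, rd_hi, rm, rs):
    """Return the new (rd_lo, rd_hi) after (rd_hi:rd_lo) += rm * rs, unsigned."""
    acc = (rd_hi << 32) | rd_lo
    acc = (acc + rm * rs) & 0xFFFFFFFFFFFFFFFF   # keep 64 bits
    return acc & 0xFFFFFFFF, acc >> 32

print(umlal(0xFFFFFFFF, 0, 2, 3))  # (5, 1)
```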

ARM Memory

Owen Kaser, CS2253

Mostly corresponds to book Chapter 5.

Overview

● Loads and Stores

● Memory Maps

● Register-Indirect Addressing

● Post- and Pre-indexed Addressing

16 Registers is Not Enough

● So far, the only places discussed for data are the ARM's CPU registers

● Most interesting programs need more data.

● We need memory outside the CPU for our bulk data storage.

● Also, memory can contain pre-computed tables (eg, of trig functions) that are never altered

● For your toaster's software, the machine code can be set at the factory. Fancy toaster: you can “flash” your toaster with improved software.

Loads and Stores

● Recall that ARM is a “load/store” architecture. Cannot directly do calculations on values in memory. Have to load them into a CPU register to use them as inputs.

● Similarly, calculations put results into registers. Then you can use a store instruction to put them into memory.

● Loads and stores need to specify where in memory things should go. This will be a numeric “memory address”.

● (Memory) addressing modes are small built-in calculations the CPU can do, to compute the memory address.

● Simple case: value in, say, R3 is to be used as the address.

System Memory Maps

● A system built around an ARM7TDMI processor uses 32-bit values as memory addresses. Each address would correspond to a byte (oops, octet).

● The overall “memory address space” ranges from 0 to 0xFFFFFFFF.

● But the overall memory address space is further subdivided (boundaries are often small multiples of powers of 2)

● RAM, ROM, flash, and I/O devices can be given their own subdivisions.

● More on I/O devices later in the course. For now, just realize that some memory addresses accept stores, and some ignore them.

Ex. Memory Map (extracts from book Table 5.1)

Start End Description

0x00000000 0x0003FFFF On-chip flash

0x00040000 0x00FFFFFF reserved

0x01000000 0x1FFFFFFF ROM

0x20000000 0x20007FFF (Static) RAM

…..

0x4000C000 0x4000CFFF UART 0 (a “serial port”) device

…..

0xE0001000 0xE0001FFF “data watchpoint and trace” (DWT) facility

….

0xE0004000 0xFFFFFFFF reserved

For Simplicity....

● Let's only mess with addresses in a range that corresponds to RAM memory.

● Then, loads and stores both make sense.

Register-Indirect Addressing Mode

● Let's suppose you want to load the byte at address 0x00005000 into register R3.

● 8 bit value into a 32-bit container. If we want the 8-bit value to be zero-extended, use LDRB instruction.

● If you want it sign-extended, use LDRSB.

● Simplest case: a register stores the address of some data you care about. Let's go for R1.

● Assembler:

MOV R1, #0x00005000 ; address to R1

LDRB R3, [R1] ; memory value to R3
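The difference between zero extension and sign extension can be sketched in Python. Memory is modelled here as a dictionary from address to byte, and the function names ldrb and ldrsb simply borrow the instruction names; the stored byte 0xF3 is an arbitrary example.

```python
def ldrb(memory, address):
    """Zero-extend: the upper 24 bits of the result are 0."""
    return memory[address]

def ldrsb(memory, address):
    """Sign-extend: copy bit 7 of the byte into the upper 24 bits."""
    byte = memory[address]
    return byte | 0xFFFFFF00 if byte & 0x80 else byte

memory = {0x00005000: 0xF3}
r1 = 0x00005000
print(format(ldrb(memory, r1), "08X"))   # 000000F3
print(format(ldrsb(memory, r1), "08X"))  # FFFFFFF3
```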

Looping Through Memory

● Let's suppose you want to wipe clear (to 0) the contents of all memory locations from 0x00005000 to 0x00005FFF.

● A loop will work nicely.

MOV R1, #0x00005000 ; starting location

MOV R2, #0x00006000; when to stop

MOV R3, #0

LP STRB R3, [R1] ; wipe clear current location's value

ADD R1, R1, #1 ; advance to next location

TEQ R1, R2 ; has R1 hit the stopping location?

BNE LP

….

Speeding It Up

● If the area to be cleared is properly aligned (starts on a multiple of 4) and is the right size (a multiple of 4) we can clear out 4 consecutive addresses with one STR (store word) instruction.

● Recall that a 32-bit word is stored across 4 addresses: A, A+1, A+2, A+3.

Faster Code



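MOV R1, #0x00005000 ; starting location

MOV R2, #0x00006000 ; when to stop

MOV R3, #0 ; 4 bytes of zeros

LP STR R3, [R1] ; wipe clear current location's value AND the next 3 locations' values

ADD R1, R1, #4 ; advance to location of next group of 4 bytes

TEQ R1, R2 ; has R1 hit the stopping location?

BNE LP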

● Loop runs only ¼ as many times now.

Even Faster

● The pattern of “use a register to provide a memory address, then update the register in preparation for the next loop” is extremely common.

● ARM designers created an addressing mode that does BOTH of these operations in a single instruction. “post-indexed”

● STR R3, [R1], #4 is equivalent to

STR R3, [R1]

ADD R1, R1, #4

Textbook Figure 5.2

Even Faster Code



MOV R1, #0x00005000 ; starting location

MOV R2, #0x00006000 ; when to stop

MOV R3, #0 ; 4 bytes of zeros

LP STR R3, [R1], #4 ; wipe 4, then advance “pointer” R1 (no separate ADD needed)

TEQ R1, R2 ; has R1 hit the stopping location?

BNE LP
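To check the loop logic, here is a Python sketch of a word-at-a-time clear built around a post-indexed store. Memory is a small bytearray rather than the real 0x5000 to 0x5FFF range, and the name clear_words is invented; it also counts iterations so the ¼ claim can be seen.

```python
def clear_words(memory, start, stop):
    """Mimic the loop built around STR R3, [R1], #4; return iteration count."""
    r1, r2, r3 = start, stop, 0
    iterations = 0
    while True:
        memory[r1:r1 + 4] = r3.to_bytes(4, "little")  # store the word at [R1] ...
        r1 += 4                                        # ... then R1 = R1 + 4
        iterations += 1
        if r1 == r2:                                   # TEQ R1, R2 then BNE LP
            break
    return iterations

memory = bytearray(b"\xFF" * 0x40)
count = clear_words(memory, 0x10, 0x20)
print(count)                      # 4 iterations for 16 bytes
print(memory[0x10:0x20].hex())    # all zeros
```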

Java Pre- vs Post-Increment

● Can draw a parallel to Java's ++ operators.

● Recall, v = M[p++] in Java

– it uses the current version of p to index M

– then it increments p. post-increment.

● Versus v = M[++p] in Java

– it first increments p. pre-increment

– then the new value of p is used to index into M

Post-Indexed Addressing

● In ARM, post-indexed indexing takes a base register. (Should not be R15.)

● Uses that base register's value to go to memory

● Then updates the base register's value by a little computation

– adding/subtracting a constant (earlier example)

– adding/subtracting a register

  ● which is allowed to be modified by the barrel shifter

  ● can be shifted/rotated by a constant amount

  ● can be shifted/rotated by a register amount

● Usefulness of fanciest of these seems doubtful

● LDR R1, [R2], ROR R3 ; is this useful???

Useful? Example

● Java, for an int array M, variable x:

j = 0;

while (….) {

sum += M[j];

j += x;}

● ARM: suppose x in R2, start of M in R1

● In loop body: LDR R3, [R1], R2, LSL #2

Pre-Indexed Addressing

● There are two flavours of pre-indexed addressing. Both do a little computation and use the computed effective address to go to memory. In one, the base register is updated. Other flavour does not update.

● In assembly language, the ! symbol means to update the base register. Don't use R15 as the base register with !

● Ok to use R15, without ! The value of R15 is 8 bytes beyond the start of the current machine code. [Details of why are a bit advanced.]

Rationale for the “little computations”

● PC-relative addressing for constants

● Getting a field of an object, given the start of the object.

● Indexing into array of objects, selecting a field (if the object size is a power of two)

● (Selected largely by analyzing what compilers for HLLs would find useful, I think...rather than focussing on assembly language programmers)

Pre-indexed Figure (Textbook)

● Instruction is STR r0, [r1, #12]

● Add ! to update r1 when finished:

STR r0, [r1, #12]! ; r1 ← 0x20c

Some Pre-indexed Examples

● MOV R1, #0x12345678 fails. Constant is not a rotation of an 8 bit value.

● Instead, initialize a memory location with your constant. Then use PC-relative addressing to load it.

● LDR R1, myConst ; pseudo-op

… 1000 bytes later...

myConst DCD 0x12345678

● The LDR instruction is actually something like

LDR R1, [PC, #996] ; PC was already 8 ahead

● 996 is close enough to PC. Must be within 4 kiB.
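The offset arithmetic can be sketched in Python. The function name pc_relative_offset and the example address 0x2000 are made up; the sketch assumes the constant sits 1004 bytes after the start of the LDR (the 4-byte instruction itself, then 1000 more bytes), which is one reading of the 996 above.

```python
def pc_relative_offset(instruction_address, target_address):
    """Offset to put in LDR Rx, [PC, #offset]; PC reads as instruction + 8."""
    pc = instruction_address + 8
    offset = target_address - pc
    if abs(offset) >= 4096:       # must be within 4 kiB of the PC value
        raise ValueError("constant is too far away for PC-relative LDR")
    return offset

ldr_address = 0x2000
print(pc_relative_offset(ldr_address, ldr_address + 1004))  # 996
```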

Ex: Field Access for an Object

● In HLLs, the fields of an object occupy consecutive memory addresses (possibly with padding)

● Let's suppose that an object starts at 1000. There are two 32-bit fields, then a 16-bit halfword field that we want to load into R2.

● Let's suppose that R1 contains the starting address of the object.

● Use LDRH R2, [R1, #8] ; immediate offset is 8

(Desired field starts 8 bytes later: gotta skip over first two words.)

● (Minor point: LDRH requires the immediate offset to be within ±255)

Ex: Array Access

● Suppose R1 contains the starting address of an array.

● Suppose the array's elements are 4 bytes each

● To load the wth array element, we want address R1 + 4*w

● Suppose value w is in R2

● LDR R5, [R1, R2, LSL #2] loads desired value.
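The scaled-index address calculation can be tried out in Python. This is a sketch under assumptions: memory is a little-endian bytearray, the array is placed at the made-up address 0x100, and ldr_scaled is an invented name for the load with a register offset shifted left by 2.

```python
import struct

values = [34, 23, 56, 78]
memory = bytearray(0x100) + struct.pack("<4I", *values)  # array starts at 0x100

def ldr_scaled(memory, r1, r2):
    """Load the word at address R1 + (R2 << 2)."""
    address = r1 + (r2 << 2)
    return int.from_bytes(memory[address:address + 4], "little")

r1 = 0x100                        # starting address of the array
print(ldr_scaled(memory, r1, 2))  # 56
```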

No ADR Pseudo-op

● The Crossware assembler does not seem to support ADR, which is used to put an address into a register (that you will then use as a base register). For instance, summing values in array…

MOV R0, #0 ; accumulate answer

ADR R1, MyArr ; Keil pseudo-op

ADR R2, AfterMyArr ; past last valid address

LP LDR R3, [R1], #4

ADD R0, R0, R3

TEQ R1, R2

BNE LP

…..

MyArr DCD 34, 23, 56, 78, 12345566, ……...

AfterMyArr DCB 0

Instead of ADR

● Instead of ADR, you should be able to do the following:

MOV R0, #0 ; accumulate answer

LDR R1, =MyArr

LDR R2, =AfterMyArr ; past last valid address

LP LDR R3, [R1], #4

ADD R0, R0, R3

TEQ R1, R2

BNE LP

…..

MyArr DCD 34, 23, 56, 78, 12345566, ……...

AfterMyArr DCD 0 ; wasted word, could avoid...

LDR As Pseudoinstruction

● LDR Rx, =value works for any 32-bit value (address or constant).

● It sets aside space in a “constant pool” , preinitialized to value. This constant pool is (by default) at the end of the current AREA.

● Then it generates machine code for a PC-relative LDR into Rx from this preinitialized location.

● Like a convenient DCD and LDR Rx, [PC, #something]

● See textbook Chapter 6.

Machine-Code Formats: LDR/STR/LDRB/STRB

● From reference manual:

Meaning of Some Bits (Ref Man)

Exercise/Example

● Determine machine code for

LDR R3, [R1], #4

and also

STRB R3, [R1, R2, LSR #5]!

Load and Store Multiple

● There are instructions LDM and STM that load or store a number of registers.

● With LDM, a bit vector in the machine code indicates which registers to load. They are loaded from consecutive addresses.

● STM works similarly

● They are especially useful in storing things on the runtime stack, and will be looked at when we cover that topic.

Control Structures

CS2253, Owen Kaser

Control Structures

● Implementing familiar HLL control structures:

– if-then

– if-then-else

– while

– do..while

● Omit: switch● See textbook Chapter 8

Basic Mechanism

● Essentially, to disrupt the flow of control you need to set PC (alias R15) to a new value.

● The b command does this.

● But so does any other allowable instruction that writes to R15!

● Consider this instruction:

add R15, R15, R3, LSL #2

Number of instructions skipped ahead depends on R3.

Nesting

● A typical HLL program has nested control structures: if inside of an if, inside of while...

● We'll look at how to replace a HLL control structure (that might have another control structure within it) by corresponding assembly language.

● The inner control structure can be replaced similarly.

● In the following templates, the first use of newlabel1 … newlabel9 means to generate and use a label that was not already in use. Any subsequent occurrence of, say, newlabel1 means to use that same label.

If Without Else

● Replace if (<condition>) { <body> } by

code to test the condition (often using CMP)

b<opposite of condition> newlabel1

code for body

newlabel1

Example

● a1 is in R1, a2 is in R2

● Translate if (a1 >= a2) { a1++;}

cmp R1, R2

blt xyz0001 ; lt is opposite to >=

add R1, R1, #1 ; translation of a1++

xyz0001 ; my new label

ARM Optimization

● If the body doesn't have nested control statements or other statements that set the flags, can have the following

code to test the condition

code for the body, with every instruction conditional.

Eg

cmp R1, R2

addge R1, R1, #1 ; add made conditional on >=

If With Else

Replace if (<condition>) { <body1>} else {<body2>} with

code to test condition

b<opposite of condition> newlabel1

code for body1

b newlabel2

newlabel1

code for body2

newlabel2

Example

● if (a1 >= a2) a1++; else a2++;

● Following the template:

cmp R1, R2 ; a1 >= a2 ??

blt xyz001

add R1, R1, #1

b xyz002 ; don't fall into else code

xyz001

add R2, R2, #1 ; the else's body

xyz002

ARM Optimization

● Since the bodies are simple, can use predicated [i.e., conditional] instructions:

cmp R1, R2

addge R1, R1, #1 ; the “then” body

addlt R2, R2, #1 ; the “else” body

● Look Ma, no labels and no branching. No “branch penalty”.

While Statement

● Recall that a while statement checks the condition before every iteration, including the first.

● while (<cond>) {<body>} can turn into

b newlabel1

newlabel2

code for <body>

newlabel1

code for <cond>

b<the condition> newlabel2

Other translations are possible, but this is the book's

Example

● for (i=0; i<j; i+=2) ++k; ← for is just while disguised.

mov R1, #0 ; say R1 stores i

b xyz001

xyz002

add R3, R3, #1 ; body: say R3 has k

add R1, R1, #2 ; code for i+=2

xyz001

cmp R1, R2 ; say R2 has j

blt xyz002

Counting Down To Zero

● If you can arrange for your for loops to count down from N to zero AND if it is guaranteed to do at least one iteration, better to use code like

mov R1, #N ; counting down with R1

newlabel1

code for the body of the loop

subs R1, R1, #1 ; set the flags

bne newlabel1

Do...While Statement

Translate do { <body> } while (<cond>); as

newlabel1

code for <body>

code to check condition

b<cond> newlabel1

● Slightly simpler than the while loop

Nesting

● Let's do Euclid's algo together:

while (a != b)

if (a>b) a=a-b;

else b=b-a;

Conditional Execution

● Using conditional execution, we can reduce Euclid's code to

GCD CMP R0, R1

SUBGT R0, R0, R1

SUBLT R1, R1, R0

BNE GCD

● Book also shows how to use conditional execution to handle something like

if (x==1 || x==5) ++x
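To see why the four-instruction GCD loop works, note that only CMP sets the flags; the two conditional subtracts and the BNE all test that same comparison. Below is a Python sketch that steps through the loop under that assumption (the function name is hypothetical, and inputs are assumed to be positive integers).

```python
def gcd_conditional(r0, r1):
    """Step through CMP / SUBGT / SUBLT / BNE until R0 equals R1."""
    while True:
        # CMP R0, R1: capture the outcome once; the subtracts do not update it.
        gt, lt, ne = r0 > r1, r0 < r1, r0 != r1
        if gt:              # SUBGT R0, R0, R1
            r0 -= r1
        if lt:              # SUBLT R1, R1, R0
            r1 -= r0
        if not ne:          # BNE GCD falls through once the values were equal
            return r0

result = gcd_conditional(48, 18)
```

At most one of the two subtracts runs per pass, so each trip around the loop is one step of Euclid's subtraction method.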

Assemblers and Linkers

CS 2253

Owen Kaser, UNBSJ

Contents

● Review of assembler tasks

● A look at linker tasks

● Assembler implementation

● The location counter and symbol table

● Two-pass assembler

● Macros and conditional compilation

Review of Assemblers

● An assembler takes commands and translates them into what will be the contents of some areas.

● Assembler commands can be

– directives, such as

● AREA foo, data [change the area being generated]

● DCB “hello” [generate some byte contents in current area]

– instructions

● ADD R1,R2,R3 [generate machine code bits in current area]

– labels

● blah …. [record the current position in the current area as “blah”]

Linkers

● The assembler typically generates one “object code” (.OBJ) file, containing the contents of the various areas.

● One source code file → one object code file.

● Libraries are also object code files.

● Linker's overall job is to put together the various areas in all the object files, getting an executable file that is ready to load into memory and run.

Relocation

● Consider the following situation in area AAA

foo DCD foo ; say this is 40 bytes into AAA

… 100 instructions later

LDR R1,=foo

● The machine code for the LDR is really for

LDR R1, [PC,#-408] “relocatable”

● But the content of the variable foo is supposed to depend on where foo ends up in memory. The assembler does not know this. It just knows that foo will be 40 bytes into its area.

● Only the linker knows where foo will be located. At link time, say that AAA starts at 3000. The linker will fill in 3040 as part of its relocation.

● The assembler had generated a “fix me” note in the .OBJ file, recording something like “fixme: start of AAA + 40” for the linker.
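The linker's side of this bargain can be modelled in a few lines. This is a rough Python sketch, not how any real object-file format works: a "fixme" note is assumed to be just a pair (offset of the word to patch, offset of the target within the same area), and `link_area` is a made-up name.

```python
def link_area(area_bytes, fixups, area_start):
    """Return a copy of the area with each 32-bit fixme slot patched."""
    image = bytearray(area_bytes)
    for slot_offset, target_offset in fixups:
        address = area_start + target_offset          # known only at link time
        image[slot_offset:slot_offset + 4] = address.to_bytes(4, "little")
    return bytes(image)

# Area AAA: 40 bytes of other data, then the word labelled foo (placeholder zero).
aaa = bytes(40) + bytes(4)
fixups = [(40, 40)]   # "the word at offset 40 must hold start of AAA + 40"
linked = link_area(aaa, fixups, area_start=3000)
foo_value = int.from_bytes(linked[40:44], "little")
```

With AAA placed at 3000, the patched word holds 3040, matching the slide's numbers.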

Externals

● A related job for the assembler is to handle cases where one source-code file referred to something that is defined in another source-code file.

● At the assembly-language level, when you intend to use something defined in a different source code file, you declare that thing to be external.

● When you define something that you want to be used in another file, you declare it global.

Crossware's Example

include xstdsys.h

extern _main,__initiostreams,__init_cvars,__HP

global __cstart,_exit

;*****************************************************************************************

; These sections are required by the Crossware C Compiler:

area __STACK,4,data,high ; Linker places this at highest available ram location

space 1

org __LowestRomLocation ← new directive for you

global __START

__START * give the linker the start address

dcdu __STACK ; Initialise supervisor stack for C compiler

dcdu __cstart+1 ; Jump to __cstart on power up

…...

Assembler Implementation

● Key data structures:

– a “location counter”. (Some assemblers let you use its value as a constant, and they call it $)

– an array of area information

● a saved location counter

● a buffer of all code/data generated so far into that area

– a symbol table, mapping labels to their addresses.

● address: probably an offset within a specified area

● symbol table may also record a type, or whether the entry is global or external, etc.

Assembler: Rough Sketch

Area ← default; $ ← 0; buffer ← empty buffer

Repeat

get a line of text, parse it and discard any comments

if line has label L, then SymTab.put(L, Area, $)

if line has directive AREA nm, type ( where nm is new)

Areas.put(Area, $, buffer); Area ← nm; $ ← 0;

else if line has directive DCB <someexpression>

x = evaluate_constant_expression( <someexpression>, SymTab)

buffer.add_byte(x); $ ← $+1;

else if line has instruction ADD rX, rY, rZ

x = figure out machine code(“ADD”, rX, rY, rZ);

buffer.add_word( x); $ ← $+4;

else if line has instruction B <someexpression>

if <someexpression> is a label in SymTab whose area matches Area

distance = (SymTab.getValue( <someexpression>) – ($+8))/4;

x = figure out machine code(“B”, distance)

buffer.add_word(x); $ ← $+4

else

???

else if line has …..

Until line has directive END

Problem 1: External References

● What if the assembler processes a line like

B foo

where foo is a label in a different source code file?

● Solution:

buffer.add(figure_out_machine_code(B, 0))

buffer.add(<fixme note for linker: value foo>)

$ ← $ + 4 // even if fixme note is large

● The final object code file will have a linked list of the various “fixme” places.

● At link time, the linker will know where “foo” will really be and it will replace the offset of 0 by the actual distance.

Problem 2: Forward References

● Contrast this:

foo add r1, r2, r3

bne foo (foo is in the SymTab already)

● to this:

add r1, r2, r3

bne foo (foo may not be in the SymTab yet)

add r4, r5, r6

foo add r7, r7, r7

2-Pass Assemblers

● Fairly easy solution to the forward-reference problem: Process the source-code twice.

● First pass: pretend to generate code for the areas, but when something (ie, a forward reference) is unknown, just stick some padding (of the appropriate size) into the buffer. But create the symbol table.

● Second pass: run through the code again, but using the symbol table made in the first pass to generate the correct code for forward references.
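The two-pass idea fits in a short Python sketch. It is a toy under stated assumptions: every instruction is 4 bytes, only a handful of branch mnemonics are recognised, and the names `two_pass` and `BRANCHES` are invented here. The branch distance is measured from the branch's address plus 8, in words.

```python
BRANCHES = {"b", "beq", "bne", "blt", "bge"}   # toy subset of branch mnemonics

def two_pass(lines):
    """lines: list of (label_or_None, mnemonic, operand) tuples."""
    # Pass 1: only advance the location counter and record labels.
    symtab, loc = {}, 0
    for label, _mnemonic, _operand in lines:
        if label is not None:
            symtab[label] = loc
        loc += 4
    # Pass 2: every label is now known, so forward references resolve.
    offsets, loc = {}, 0
    for index, (_label, mnemonic, operand) in enumerate(lines):
        if mnemonic.lower() in BRANCHES and operand in symtab:
            offsets[index] = (symtab[operand] - (loc + 8)) // 4
        loc += 4
    return symtab, offsets

program = [
    (None, "add", "r1, r2, r3"),    # address 0
    (None, "bne", "foo"),           # address 4, forward reference
    (None, "add", "r4, r5, r6"),    # address 8
    ("foo", "add", "r7, r7, r7"),   # address 12
]
symtab, offsets = two_pass(program)
```

Here the forward branch gets a word offset of 0, because the target is exactly 8 bytes past the branch.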

Conditional Assembly

● At assembly time, you can test a condition (usually based on textual or constant equality) and keep the assembler from seeing a block of code if the condition fails.

● Like an if-statement at assembly time. It affects what code is actually placed into your object file.

● One set of source code can generate machine code for slightly different platforms.

● C and C++ have this feature too.

Example Crossware Code

ifeq __NoRom

...

dcdu HardFault+1

….

elsec

….

dcdu HardFault+27

endc

Other Conditional Directives

● ifeq <expr> checks whether <expr> is zero

● ifge <expr> checks whether it’s >= 0

● iflt <expr> checks whether it’s < 0

● ifc <string1>,<string2> checks whether the two strings are equal to each other

Macros (textbook p 73)

● Macros allow a programmer to assign a mnemonic name to a bunch of assembler lines.

● When the mnemonic is then used, associated assembler lines are “copy pasted” into the source code (from the viewpoint of the assembler: actually, your actual source code is unchanged).

● Macros can have parameters that are substituted in the copy-paste process.

● Macros can function like assembly time methods.

● Not all assemblers support macros. Certain HLLs, notably C and C++, also have macros.

Macro in Crossware

● Macros and conditional compilation in Crossware’s ARM assembler and their 8051 assembler appear similar. The 8051 is documented...(see course website for link)

● foobar macr

….. some lines of assembly (body of macro)…

endm

● To invoke: foobar R0, hello, 35 ← comma sep. args

● First param is \0 in body of macro (here, it is R0)

● Also, \1, \2, …

● Labels in the macro body should be \.0 to \.9 (On each invocation of the macro, a different label will be used)

Crossware Macro

silly macr

ifc \0,R0 ← no space

mov \1, \0

elsec

cmp R7, #\0

beq \.0

mov \1, #\0

\.0 add R0, R0, R0

endc

endm

● silly R0, R5 expands to

mov R5,R0

● silly 18, R5 expands to

cmp R7, #18

beq temp000

mov R5, #18

temp000 add R0,R0,R0
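The copy-paste step of macro expansion can be sketched in Python. This covers only parameter and label substitution (not the ifc/elsec test), and the generated label format `tempNNN_k` is an assumption made for this example rather than what Crossware actually emits.

```python
import re

def expand(body, args, invocation):
    r"""Substitute \0..\9 with arguments and \.0..\.9 with per-invocation labels."""
    out = []
    for line in body:
        # \.k becomes a label unique to this invocation of the macro.
        line = re.sub(r"\\\.(\d)",
                      lambda m: f"temp{invocation:03d}_{m.group(1)}", line)
        # \k becomes the k-th comma-separated argument.
        line = re.sub(r"\\(\d)", lambda m: args[int(m.group(1))], line)
        out.append(line)
    return out

# The "elsec" half of the silly macro from the slide.
else_part = [r"cmp R7, #\0", r"beq \.0", r"mov \1, #\0", r"\.0 add R0, R0, R0"]
expanded = expand(else_part, ["18", "R5"], invocation=0)
second_use = expand(else_part, ["7", "R6"], invocation=1)
```

Because the invocation number is baked into the label, using the macro twice does not define the same label twice.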

Use of Macros

● Skilled assembler programmers (there are a few left…) often develop a library of macros that generate code for a variety of fiddly tasks.

● The include assembler directive requests that the assembler read a named file (perhaps with lots of juicy macro definitions) and act as if the contents had been pasted into this source code file.

● C language uses a similar mechanism. Java has a higher-level import idea.

Repetition at Assembly Time

● Sometimes, you want the assembler to process the same block of code a bunch of times.

● But you don't want to type it yourself.

● Some assemblers (dunno about Crossware's) allow you to put a REPT <n> at the start of a block of lines, and ENDREPT at the end.

● Like an assembly-time FOR loop running n times.

REPT 5

ADD R1, R2, R3 ← assembler sees this 5 times

ENDREPT

“Macro Language”

● Together, macros, conditional assembly and maybe repetition essentially form a little programming language that runs at assembly time.

● If unlimited repetition (or recursive macros) is allowed, the macro language can be “Turing Complete” - wait till you finish CS2333.

● In the 1990s, Shaw/McNally challenged me to implement numerical integration in the TASM macro language.

● Usefulness: compute a table of values to use in the “real” program.

● N.B. The macro language for C is not Turing Complete.

Calling Conventions and the Stack

CS2253

Owen Kaser, UNBSJ

Overview

● Stacks and the Load/Store Multiple Instructions

● Subroutines

● ARM Application Procedure Call Standard

– Code Linkage Mechanism

– Parameter Passing

– Caller- and Callee-Save Registers

● See Chapter 13 of textbook

Stack in Memory

● Recall from CS1083 (or CS2383) that one can use an array to store a stack's data. Also need an integer array index, called TOP.

● Push(value) → Data[++TOP] = value;

● Pop → return Data[TOP--];

● TOP was initialized to -1 and always points to the value to be popped. Stack grows up.

● ARM folk call this a “full ascending” stack

Full-Descending Stack

● In low-level programming, a Full-Descending stack is more common. ARM ABI requires it.

● TOP is initialized to max_valid_index +1

● Push(value) → Data[ --TOP] = value

● Pop → return Data[ TOP++]

● We decrement before on push and increment after on pop.

● DB and IA.

Empty Stacks

● To ARM, an empty stack is where the top-of-stack pointer indicates where the next push will go.

● Push(value) → Data[TOP--] = value

● Pop → return Data[++TOP]

● Decrement After (DA) and Increment Before (IB) for an “empty descending” stack.

● Empty ascending stacks are also possible.
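The difference between "full" and "empty" descending stacks is easiest to see side by side. Here is a small Python model using a list for memory; the class names are invented for illustration and there is no overflow checking.

```python
class FullDescendingStack:
    """TOP points at the most recently pushed value."""
    def __init__(self, size):
        self.data = [0] * size
        self.top = size                # max_valid_index + 1

    def push(self, value):
        self.top -= 1                  # decrement before (DB)
        self.data[self.top] = value

    def pop(self):
        value = self.data[self.top]
        self.top += 1                  # increment after (IA)
        return value


class EmptyDescendingStack:
    """TOP points at the slot the next push will use."""
    def __init__(self, size):
        self.data = [0] * size
        self.top = size - 1

    def push(self, value):
        self.data[self.top] = value
        self.top -= 1                  # decrement after (DA)

    def pop(self):
        self.top += 1                  # increment before (IB)
        return self.data[self.top]
```

Both behave identically from the outside; they differ only in where TOP sits relative to the data.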

ARM Assembly Push

● Use a form of the store multiple (STM) instruction for push, and a form of load multiple (LDM) for pop.

● Top-of-Stack is usually the R13 register (alias SP)

● Push(value in R5) → STMDB SP!, {R5}

● Pop (from stack to R5) → LDMIA SP!, {R5}

● This is for a Full-Descending stack, so you can use STMFD and LDMFD if desired (maybe).

● Crossware does not support PUSH and POP (textbook 13.2.2)

An Un-RISCy Quirk

● We'll soon see that programmers often want to push (or pop) a bunch of registers to the stack.

● RISC approach: need 6 instructions for 6 pushes

● ARM: have a single complex instruction STM to push several registers. Requires multiple clock cycles.

● STMDB SP!, {R12, R3-R5, R7, R8}

● Pushed from smallest register to largest, and the ! ensures that SP gets changed.

● LDMIA SP!, {R8, R7, R12, R3-R5} will restore them.

LDM/STM encoding

● P=1 means Before

● U=1 means Upward

● W means a ! was used

● L=1 means LDM instead of STM

● S=1 means some weird behaviour that depends on processor mode; see technical docs

● The register list is a bit vector; bit k means Rk is included in the set of registers loaded or stored.
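The P, U, S, W and L bits and the register bit vector can be packed mechanically. The sketch below is in Python and assumes the standard block-transfer layout (cond, 100, P, U, S, W, L, Rn, register list); the function and variable names are made up for this example.

```python
def encode_ldm_stm(load, rn, registers, *, before, up, writeback,
                   s=False, cond=0xE):
    """Pack one LDM/STM instruction into a 32-bit word."""
    word = cond << 28
    word |= 0b100 << 25               # bits 27:25 mark block data transfer
    word |= int(before) << 24         # P
    word |= int(up) << 23             # U
    word |= int(s) << 22              # S
    word |= int(writeback) << 21      # W: a ! was used
    word |= int(load) << 20           # L
    word |= rn << 16                  # base register
    for k in registers:
        word |= 1 << k                # bit k set means Rk is in the list
    return word

SP = 13
# STMDB SP!, {R12, R3-R5, R7, R8}
push = encode_ldm_stm(False, SP, [12, 3, 4, 5, 7, 8],
                      before=True, up=False, writeback=True)
# LDMIA SP!, {R8, R7, R12, R3-R5}
pop = encode_ldm_stm(True, SP, [8, 7, 12, 3, 4, 5],
                     before=False, up=True, writeback=True)
```

Since the list is a bit vector, the order in which registers are written in the source makes no difference to the machine code.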

Full Descending Stack

StackBottom SPACE 1000

StackTop ; label to mark address beyond top

…..

LDR SP, =StackTop ; initialized (an address rarely fits a MOV immediate)

● Now we can push and pop, up to a limit of 250 32-bit values.

Why Push and Pop?

● The stack is convenient for saving registers (push them to save; later pop them to restore).

● Let's suppose you have some valuable info in R3, R4. You are about to enter into some code that trashes R3 and R4. And after you exit from that code, you have some use for the old R3 and R4.

● STMDB SP!,{R3,R4} ; save R3 and R4

… code that trashes R3 and R4...

LDMIA SP!,{R3,R4} ; restore saved registers

… code that uses R3 and R4...

Subroutines

● A subroutine is essentially a method (that doesn't belong to an object).

● You call a subroutine and return from it.

● A reentrant subroutine can be paused while running, and while paused, another copy of it can start running. When finished, the paused subroutine is allowed to continue...and no problems arise.

● Your most familiar reentrant situation is with recursion. (Later: we will see “interrupts”)

● Naive coding of subroutines leads to non-reentrancy. Fancy coding with stack frames will give reentrancy.

Getting Into/ Out of Subroutines

● If you have some code you wish to call, use a BL (branch-and-link) instruction.

● Like B, it changes the program counter so you jump somewhere (the first instruction of the subroutine.)

● But R14 (alias LR, the link register) is automatically set to the return address, PC-4 (the address of the instruction immediately after the BL instruction). The PC reads as the BL's own address plus 8, which is why PC-4 names the next instruction.

● At the end of the subroutine, arrange for the return address to go into PC.

Example Subroutine

; this subroutine multiplies the value in r0

; by 10 and returns result in r1. Trashes r2.

times10 mov r2, r0, lsl #3 ; r2 = 8*r0

add r1, r2, r0, lsl #1 ; r1 = r2 + 2*r0

mov pc, lr ; return

…...

mov r0, #5

bl times10 ; invoke subroutine

….

Passing Parameters

● Our example passed an input parameter by using a register.

● And it used a register to pass back a return value.

● This is a common approach.

● Another alternative is to pass parameters on the stack (caller pushes them).

● The callee then accesses them in memory.

● When subroutine is finished, either the caller or callee must pop off the parameters so stack doesn't overflow.

Subroutines Calling Subroutines

● Commonly, one method calls another.

● Say main() calls func1(), which then calls func2(). Don't lose the original return address!

Saving/Restoring Registers: 1

● Suppose you are going to call a routine that's documented to trash R4 and R5. Currently, you have something valuable in R4 that you will need later. But the value in R5 doesn't matter anymore to you.

● Caller-save scheme:

push just the value in R4

call the routine, using BL

pop into R4

Saving/Restoring Registers: 2

● Suppose you're writing a subroutine. You've been told you're not allowed to trash R4, but you are allowed to trash R5.

● You really need to use both R4 and R5

● Callee-save scheme:

mysub STMDB SP!,{R4} ; save R4

...code trashing R4 and R5

LDMIA SP!,{R4} ; restore R4

MOV PC, LR ; return

Interoperability

● I want to be able to call your subroutines, and you want to be able to call mine.

● So I need to know how you expect parameters passed in, and how return values should be handled.

● I'd like to know whether I can trust you not to trash my registers' values.

● I'd like to know if there are any registers that I can trash, without having to save/restore them.

● We need some rules...

AAPCS (Textbook 13.5)

● The ARM Application Procedure Call Standard is our binding contract. If we both follow it, our code will interoperate smoothly.

● It's an example of a calling convention.

● 3 parts to the contract

– obligations of caller to set things up for callee

– obligations of callee not to (permanently) trash parts of the state of the caller

– rights of the callee to trash other parts of the caller's state.

AAPCS: Parameter Passing

● R0 to R3 have the parameter values. Caller places them there.

● Callee places return value(s) in R0 and R1

● Callee is otherwise free to trash R0 to R3 (so they are essentially “caller-save registers”).

● Subroutines with more than 4 parameters will have the caller push the extra parameter values on the stack. [not sure who is responsible for their cleanup: guess it is the caller, who made the mess.]

AAPCS:Callee-save

● The callee must preserve R4-R11,SP (so they are “callee-save registers”)

● R4 to R11 typically contain local variables of the caller. And the caller expects the SP to come back unchanged.

● It is very normal to do all the callee-save pushing as a single STMDB instruction at the very start of the subroutine. This is said to create a stack frame.

● And the corresponding popping is a single LDMIA instruction that is at the end of the subroutine.

AAPCS: Status flags

● You cannot expect the status flags to be preserved during a call.

● So essentially, they would be caller-save.

● Except that it's rare to need an earlier-computed status flag after the subroutine returns.

AAPCS: Link register

● The caller will have given us an R14 value that tells us where to return to when finished.

● Any BL that we do will destroy it. Very bad.

– Case 1: we are a “leaf procedure” (ie, make no calls). Just don't use R14 for anything. Finish with

MOV PC, R14

– Case 2: we are a non-leaf procedure (and can potentially make a call). Push R14 with the callee-save registers at the start of the subroutine.

And pop it (into PC) with the LDMIA instruction when finishing.

– This works because of the order in which LDM/STM stores regs.

AAPCS: R12

● Textbook indicates that R12 is somehow used by the linker at the point when the caller invokes the callee. Mysterious.

● You are allowed to trash it, but you must expect the calling process to potentially trash it.

● I.e., R12 is a caller-save register.

Warning

● Deviate from AAPCS at your own risk.

● First, your code won't otherwise interoperate.

● Second, DIY approaches to these things tend to fail in horrible and utterly puzzling ways, especially if recursion is involved. I don't want to be responsible for your soul-destroying all-night debugging session that makes you switch to a BBA degree. (The world needs you in CS: follow the AAPCS...)

Let's Code the Recursive Fibonacci

● int fib(int n) {

int temp, temp2;

if (n < 2) return n;

else {

temp = fib(n-1);

temp2 = fib(n-2);

return temp+temp2;

}

}

ARM Exception Handling

CS2253

Owen Kaser, UNBSJ

Overview

● Warning: hardest parts of CS2253.

● Back to Chapter 1: Processor Modes & Vector Table

● Concept of Exceptions

– Interrupt Handlers

– Priority Levels

● Software Interrupts

● Memory-Mapped Input/Output

● See Textbook Chapter 14, 16

Exceptions

● Sometimes, the normal flow of control needs to be unexpectedly modified

– A character is received on the keyboard

– An access is made to a memory location with no memory/device there

– The processor is asked to execute a bit pattern that is not a valid machine code

● Two main cases: interrupts and errors

Interrupts

● An interrupt can occur because a hardware device wants service...now.

● Device: “I have just received a character from the keyboard, and my buffer is only 1 character deep. Please stop whatever you are doing and remove/process this character ASAP, so that I will be able to accept the next character the keyboard sends”.

● CPU: “ok, I see your interrupt request. I am interrupting my normal execution to switch to handler code for you. When I finish, I will return to my normal execution, where I left off.”

Why Interrupts?

● There can be a lot of asynchronous things happening in a computer system. Having one program keeping track of them would be hard.

● (Eg, every loop that ran more than a few microseconds would have to contain code to check for input/output)

● System running one dedicated program can use this approach. Maybe it even waits, looping, while I/O happens.

● This is called polling I/O. It is generally viewed as unwieldy.

● Better to just have the I/O device interrupt.

IRQ and FIQ interrupts

● The ARM ISA defines regular (IRQ) interrupts and fast (FIQ) interrupts.

● Stay tuned.

Software Interrupts

● Some ISAs, including ARMv4, have a special SWI instruction that, when executed, causes the system to act like a hardware device requested an interrupt.

● A hardware interrupt is like an unscheduled subroutine call that also puts the processor into a more privileged mode. Handler code is trusted and part of the operating system.

● So an SWI instruction is often used to invoke an operating system service subroutine.

● Book refers to the SWI instruction under its new name, SVC, but Crossware still uses SWI.

Error Exceptions

● Undefined instruction

– Can be intentional (emulate a “missing” instruction)

● Prefetch abort

– An attempt to fetch instruction fails (eg, PC is not a valid memory location)

● Data abort

– A LD or ST with an illegal address

– A store to a read-only address

● Sometimes, the response should be to die gracefully. But other times, we may be able to recover and continue.

Overall Approach

● For an exception, we need to

– save the current state (including CPSR)

– Reset the PC to the handler code [ & change mode]

– Execute the handler

– Restore the saved state, including the PC & mode

● State saving and PC resetting are done by hardware. Handler and restoring done by software.

Modes

● See textbook Sections 2.3.1, 2.3.2.

● Normal code executes in User mode. In processing exceptions, modes are System, Undef, Abort, IRQ, FIQ, Supervisor.

● In some of these modes, some of the registers are banked out. The User version of R11 is hidden from use, replaced by another R11 when in FIQ mode. (So the User version of R11 is safe from modifications.)

Processor Modes (Book Fig 2.1)

Registers in Different Modes

Recognizing Your Mood Mode

● The CPSR stores more than your status flags. A 5-bit field M4:M0 stores a flag indicating mode (Table 2.1) – eg 10000 for User, 11011 for Undefined, etc.

● I bit enables or disables IRQ interrupts

● F bit enables or disables FIQ (fast) interrupts

● T is status: are you in Thumb mode?
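Pulling the fields out of a CPSR value is just shifting and masking. The following Python sketch assumes the classic ARM layout (flags N, Z, C, V in bits 31 to 28, I in bit 7, F in bit 6, T in bit 5, mode in bits 4 to 0, with a set I or F bit meaning "disabled"); `decode_cpsr` is a hypothetical helper name.

```python
MODES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ", 0b10011: "Supervisor",
    0b10111: "Abort", 0b11011: "Undefined", 0b11111: "System",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its named fields."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "irq_disabled": bool(cpsr & (1 << 7)),   # I bit
        "fiq_disabled": bool(cpsr & (1 << 6)),   # F bit
        "thumb": bool(cpsr & (1 << 5)),          # T bit
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }

fields = decode_cpsr(0x600000D3)
```

For this sample value the result is Supervisor mode with both interrupt kinds masked and the Z and C flags set.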

Details Invoking Exception

● See textbook 14.4.

● First, CPSR copied to SPSR_<mode>

● Adjust CPSR (mode bits, disable IRQ, maybe disable FIQ)

● Store return address to LR_<mode>

● Set PC to start of relevant handler

Finding the Right Handler

● For the different kinds of exceptions, there are different handlers. When an exception occurs, the hardware determines the source of the exception as a 3-bit number, which it uses to index the vector table (which starts in memory at address 0).
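Since each vector table entry is one 4-byte word, the 3-bit exception number turns into an address by multiplying by 4. A Python sketch of the lookup follows, assuming the classic ARM vector order with the table at address 0; the list and function names are invented for illustration.

```python
VECTORS = [
    "Reset",                  # index 0
    "Undefined instruction",  # index 1
    "Software interrupt",     # index 2
    "Prefetch abort",         # index 3
    "Data abort",             # index 4
    "Reserved",               # index 5
    "IRQ",                    # index 6
    "FIQ",                    # index 7
]

def vector_address(index, base=0x00000000):
    """Each vector table entry is one 4-byte word."""
    return base + 4 * index

def vector_for(name):
    return vector_address(VECTORS.index(name))

irq_vector = vector_for("IRQ")
```

FIQ comes last in the table, so its handler code can start right at its vector without a further branch.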

Textbook Figure 14.3

Returning After Exception

● When the handler has finished its task, it returns to the caller (in software)

● The mode needs to be put back to its pre-interrupt value. And the PC needs to be put back to the correct instruction.

– Either to the instruction that had the exception (and did not successfully finish) or to the next instruction. Case depends on the kind of exception

● SUBS PC, LR, #4 or SUBS PC, LR, #8 ← magic CPSR restore when PC is the destination and the S flag set

● Or on entry to handler

– adjust LR (eg subtract 4)

– STMDB sp!,{some regs, lr}

● then use a

LDMIA sp!, {some regs, pc}^ to return.

● ^ means to restore CPSR also.

Multiple Stacks

● Interrupt code typically uses stacks. And there is a separate R13 for each mode (except one). So there is a separate stack per mode...and at machine startup, it needs to be initialized.

● Initialization via a MSR (move into status register) instruction to change mode.

● Then store a value to (that) SP.

● Then use a MSR to put mode back

Mrs. MRS

● MSR moves to a status register.

– Status registers are CPSR, SPSR

– Underscores after (eg CPSR_cf) indicate which sub-parts of the status register are affected. Used in book code but not described… _cxsf is all of it?.

● MRS moves a status register into a regular register.

Implementing ADDSHIFT

● See Example 14.1 for the hairiest program of CS2253.

Priorities

● In order of decreasing priority, we have

– Reset

– Data abort

– FIQ

– IRQ

– Prefetch abort

– SVC and Undefined Instruction

● A higher priority exception can interrupt the handling of a lower priority exception, but usually not the other way.

Priorities in IRQ

● Even within a given exception (eg IRQ), some hardware units (eg disk) are more urgent than others (eg keyboard).

● To prioritize, could OR together all interrupt request inputs. Then software can check each possible device to see who’s knocking...starting with the most urgent.

● Or a special priority device, a VIC, can take care of this.

– Devices' IRQ lines go to VIC

– Only VIC actually interrupts CPU

– CPU can ask VIC for the handler address of the highest priority active interrupt request. [talking to devices: stay tuned!]

Vectoring IRQ Interrupts

Textbook Figure 14.5

Software Interrupts

● Software can generate an exception. Use SWI to request an operating-system service.

● SWI handler has to use the value in R14 to find the actual instruction, in order to extract the “SVC number” field and thus know which OS service was requested.

● Assembler example: SWI 234
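How the handler recovers the service number can be modelled as follows. This Python sketch assumes ARM (not Thumb) state, where R14 on entry holds the address just after the SWI and the number sits in the low 24 bits of the instruction word; memory is faked with a dictionary and the function name is hypothetical.

```python
def swi_number(memory, lr):
    """Fetch the SWI instruction that caused the exception and extract its number."""
    instruction = memory[lr - 4]             # the SWI sits just before the return address
    if (instruction >> 24) & 0xF != 0xF:     # bits 27:24 are 1111 for SWI
        raise ValueError("not an SWI encoding")
    return instruction & 0x00FFFFFF          # low 24 bits: the service number

# SWI 234 assembled (condition "always") at address 0x8000.
memory = {0x8000: 0xEF0000EA}
requested = swi_number(memory, lr=0x8004)
```

The processor itself ignores the 24-bit field; only the handler software gives it meaning.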

Talking to Devices (Ch 16)

● Not dealing with co-processors in CS2253.

● Instead, we are dealing with attached devices such as

– UART (“serial port”)

– Timer

– Analogue to Digital (A-D) and Digital to Analogue converters

– Disk controllers

– General Purpose I/O (GPIO) connections that can control electronic devices (LEDs, motors, ...)

Special I/O Instructions

● Some ISAs (not ARM) have special instructions to access devices

● Intel x86: IN and OUT instructions.

● Devices are assigned 1+ numeric “port addresses”.

● CPU does an OUT instruction to the port address of the UART to give it a command

● CPU does an IN instruction to the port address to get a byte of data from the UART, or check its status.

● Frequently, a device has several (usually consecutive) port addresses:

– 1+ status port, 1+ control port, 1+ data output port, 1+ data input port.

Memory-Mapped I/O

● Alternative: memory-mapped I/O, where certain “memory” addresses have no RAM.

● Instead, they are assigned to devices. No distinction between “port addresses” and “memory addresses”

● Now, ordinary LD and ST instructions (to the right “memory” addresses) can talk to devices.

● No IN or OUT instructions: RISC-ish.

Book Example

● LPC2104 System-on-Chip has an ARM processor core and a bunch of memory-mapped devices (peripherals)

● Some are internally connected via an “AHB” bus and some via a “VPB” bus.

● All VPB bus peripherals are in one range of addresses (0xE0000000 to 0xEFFFFFFF)

● UART0 device occupies 0xE000C000 to 0xE000C01C

● Every 4 addresses, we have 8 bits of data. Use LDRB or STRB to access, avoid endian-ness issue.

SoC's Memory Map

Zoom Into 0xE000C000 area

UARTs are Fiddley

● Serial communication is quite hairy. Lots of communications parameters to be set. You're lucky this era is largely past us.

● After setup, sending characters is pretty easy.

● Bit 5 of the Line Status Register (memory mapped to 0xE000C014) tells us if the transmitter buffer can accept another character now. (0 means “yes”)

● Transmit Holding Register (0xE000C000) is where we can put the next character.

Polling I/O Example (p 347, mod)

; ASCII code to send is in R0

LDR R5, =0xE000C000

wait LDRB R6, [R5, #0x14] ; Line Status Reg

CMP R6, #0x20 ; Risky check of Bit 5

BEQ wait ; spin until ready

; if we get here, bit 5 = 0

STRB R0,[R5] ; write to transmitter buff.

Polling vs Interrupt Driven I/O

● Note how the “wait” loop repeatedly asks (polls) the device to see if it is ready.

● If not, it just tries again...and again...and....

● No other work in the system can be done...tight polling loops waste system resources.

● Better to flip some bits in 0xE000C004 and get the transmitter to give us an interrupt when its buffer gets some space. In the meantime, we can be doing other things (eg, getting input, doing calculations).

● This “Interrupt-driven I/O” is generally better, though more complicated.

C-Style Strings

CS2253

Owen Kaser, UNBSJ

Strings

● In C and some other low-level languages, strings are just consecutive memory locations that contain characters. A special “null character” (ASCII code 0) terminates the string.

● Common string-processing library routines are good source of assembly-language examples.

Making a Constant String

● (Review) Use DCB and don't forget the null character terminator

● mystring dcb “hello”,0

A String Local Variable

● Suppose you know you need a string local variable. If you know the maximum length you could possibly need (say 50 characters), proceed as follows....

● mySubroutine

STMFD SP!, {some regs, LR}

SUB SP, SP, #52 ; maintain SP alignment

MOV R0, #0 ; null character

STRB R0, [SP] ; terminate string (show picture)

… use space from SP to SP+51 for your string..

ADD SP, SP, #52 ; pop off space used by string

LDMFD SP!, {some regs, PC}

Stack Smashing

● Q: What if someone is allowed to put a 56-byte string into your 52-byte area?

● A: You affect the things in the memory addresses above your string.

● The STMFD put the saved registers, including the return address, directly above your string. Overwrite far enough and you have a wrong return address.

● A cracker can write some nasty machine code program as the 56-byte “string” and arrange for you to return to her program.

● Moral: String locals need to be very carefully checked to see that they are not too long.

● Some modern CPUs will mark the stack region of memory as “nonexecutable” to help. You can still be forced to return to an arbitrary location in the existing program, which may be good enough for a cracker.
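One way to do the length check from the moral above, sketched in C. Both `copy_checked` and `store_name` are hypothetical names, not library routines, and refusing the copy is only one possible policy (truncating is another).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies src into dst only if it fits, terminator included.
   Returns 0 on success, or -1 (leaving dst an empty string)
   if src is too long. */
int copy_checked(char *dst, size_t dst_size, const char *src) {
    assert(dst != NULL && src != NULL && dst_size > 0);
    size_t needed = strlen(src) + 1;    /* +1 for the null character */
    if (needed > dst_size) {
        dst[0] = '\0';
        return -1;                      /* refuse rather than overflow */
    }
    memcpy(dst, src, needed);
    return 0;
}

/* Hypothetical caller with a 52-byte string local. */
int store_name(const char *input) {
    char local[52];
    return copy_checked(local, sizeof local, input);
}
```

With this check, an over-long input is rejected before a single byte is written past the end of the local.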

Returning a String

● Suppose your subroutine is supposed to return a string.

● You can just return the memory address of somewhere in memory that holds the characters of your string. (In C terminology, you return a pointer to your characters.)

● But that somewhere needs to be “safe” - not subject to arbitrary destruction.

● Any stack location below the top of the stack is not safe.

Bad Scenario

● main subroutine calls foo

● foo has a local string variable, v, that it puts some lovely string into.

● foo returns the address of v to main

● main turns around and calls bar

● bar returns. main tries to use the lovely string. Unhappiness results.

Bad Scenario, picture 1

Bad Scenario, picture 2

● Because the string address sent by 'foo' to main was in the danger zone, 'bar' trashed it. Not bar's fault.

● Solution: Never return the address of a local variable.

Non-Reentrant Solution

● If a subroutine S needs to return a string (whose maximum length is known), then it can put the string in a “buffer” memory location set aside just for S. And it can return the address of that buffer to its caller.

● S's buffer is safe enough...except from S itself. This approach means S won't be reentrant. In particular, S cannot be recursive.

● And callers to S should copy out the answer, in case anyone they invoke also calls S.
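A minimal C sketch of this approach, with a hypothetical S that formats a number into its private buffer. The helper `S_copy` is also made up, to show a caller copying the answer out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subroutine S: formats n into its own 32-byte buffer. */
const char *S(int n) {
    static char S_buffer[32];            /* set aside just for S */
    snprintf(S_buffer, sizeof S_buffer, "value %d", n);
    return S_buffer;                     /* address of the buffer */
}

/* A careful caller copies the answer out before S can be called again. */
void S_copy(int n, char *out, size_t out_size) {
    assert(out != NULL && out_size > 0);
    snprintf(out, out_size, "%s", S(n));
}
```

Every call to S returns the same address, so a second call silently replaces the string a caller got from the first. That is why the copy matters.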

Example

S_buffer DCB   0
         SPACE 31                ; total length 32

S        STMFD SP!, {...,LR}
         … put some string into S_buffer …
         LDR   R0, =S_buffer     ; return value in R0
         LDMFD SP!, {...,PC}     ; return to caller

Length of a String (in R0)

strlen  mov   R1, #0           ; length counter
loop    ldrsb R2, [R0], #1     ; get current character, advance pointer
        cmp   R2, #0           ; a load does not set the flags, so test for null
        addne R1, R1, #1
        bne   loop
        mov   R0, R1           ; return value in R0
        mov   PC, LR           ; return

● Since this is a leaf subroutine, we didn't need STM and LDM.
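The same loop written in C, as a sketch for comparison. The name `my_strlen` is chosen only to avoid clashing with the library's `strlen`.

```c
#include <assert.h>
#include <stddef.h>

/* Counts characters up to, but not including, the null terminator. */
size_t my_strlen(const char *s) {
    assert(s != NULL);
    size_t length = 0;          /* plays the role of the counter in R1 */
    while (*s++ != '\0') {      /* load a byte, advance the pointer, test for zero */
        length++;
    }
    return length;
}
```

The expression `*s++` does in one step what the post-indexed load does: fetch the byte, then move the pointer along by one.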

Reverse (buffer version, untested)

rev_buffer SPACE 32

reverse  mov   R1, R0            ; R1 is caller save
         stmfd SP!, {R1,LR}
         bl    strlen            ; length in R0
         mov   R1, #0
         ldr   R2, =rev_buffer
         strb  R1, [R2,R0]       ; mark end
         sub   R0, R0, #1
         ldr   R1, [SP]          ; recover start of input (LR is at [SP,#4])
rev_loop ldrsb R3, [R1], #1      ; the copying loop
         cmp   R3, #0            ; a load does not set the flags
         beq   rev_done
         strb  R3, [R2,R0]
         sub   R0, R0, #1
         b     rev_loop
rev_done ldmfd SP!, {R1,LR}
         ldr   R0, =rev_buffer   ; return value
         mov   PC, LR

Or, Use a Stack

● Can push a bunch of characters to stack from input. (And count them).

● Pop them off, one at a time, and append to buffer

● Then return address of buffer.
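The push-then-pop idea can be sketched in C with an array standing in for the stack. The name `reverse_with_stack`, the 32-byte size (chosen to match the earlier buffer), and returning NULL for over-long input are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { REV_BYTES = 32 };   /* same size as the earlier 32-byte buffer */

/* Reverses input into a private buffer using an explicit stack.
   Returns NULL if the input plus terminator would not fit. */
const char *reverse_with_stack(const char *input) {
    static char rev_buffer[REV_BYTES];
    char stack[REV_BYTES];
    size_t count = 0;

    assert(input != NULL);
    /* Push each character, counting as we go. */
    for (; *input != '\0'; input++) {
        if (count == REV_BYTES - 1) {
            return NULL;                 /* no room left for the terminator */
        }
        stack[count++] = *input;
    }
    /* Pop them off, one at a time, and append to the buffer. */
    size_t out = 0;
    while (count > 0) {
        rev_buffer[out++] = stack[--count];
    }
    rev_buffer[out] = '\0';
    return rev_buffer;
}
```

Because a stack is last-in, first-out, the characters come off in reverse order with no index arithmetic needed.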

Alternative Approach

● We can make the caller responsible for finding space for us to store the returned string.

● The address of the space for the returned string (probably in the caller's activation record) is passed as a parameter.

● This is a little better than the buffer approach.

Reverse (param 2 has address)

reverse  stmfd SP!, {R0,R1,LR}   ; save both params: strlen changes R1 and R2
         bl    strlen            ; length in R0
         ldr   R1, [SP,#4]       ; recover output address (param 2)
         mov   R2, #0
         strb  R2, [R1,R0]       ; mark end
         sub   R0, R0, #1
         ldr   R2, [SP]          ; recover start of input
rev_loop ldrsb R3, [R2], #1      ; the copying loop
         cmp   R3, #0            ; a load does not set the flags
         beq   rev_done
         strb  R3, [R1,R0]
         sub   R0, R0, #1
         b     rev_loop
rev_done ldmfd SP!, {R0,R1,PC}   ; no return value

Making It Robust

● When the address of an output buffer is passed in, you should usually pass along another parameter to indicate how long the buffer is.

● And the string routine should be coded to avoid overflowing the buffer.

● Without the “how long” parameter, the string routine would have no way of knowing when overflow might occur.

● Early design of the C string library didn't really seem to appreciate this enough. Later additions did, but by then, programmers had developed sloppy habits.
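A sketch in C of a reverse routine that takes the “how long” parameter and refuses to overflow. The name `reverse_into` and the 0 / -1 return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes the reverse of src into dst, which holds dst_size bytes.
   Returns 0 on success, -1 if the result with its terminator
   would not fit (dst is then left untouched). */
int reverse_into(const char *src, char *dst, size_t dst_size) {
    assert(src != NULL && dst != NULL);
    size_t len = strlen(src);
    if (len + 1 > dst_size) {
        return -1;                      /* would overflow: refuse */
    }
    dst[len] = '\0';                    /* mark end first */
    for (size_t i = 0; i < len; i++) {
        dst[len - 1 - i] = src[i];      /* fill from the back */
    }
    return 0;
}
```

A caller would typically pass `sizeof` its own array as the third argument, so the limit always matches the space actually reserved.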
