The Cortex M0 LPC111x Overview and Benchmarks
ARM TechCon3 October 2009 Kenneth Dwyer
Contents
NXP Introduction
Cortex M0 Introduction
LPC111x Product Details
Performance Benchmarks
Development Tools
2
NXP Introduction
3
Family Overview
4
All MCUs are not created equalARM vendors start with the same ARM IP
– Cores, Internal Bus, Interrupt Controllers, etc– But the end result is not the same!
Architectural choices, implementation, processing optimization, and power management make a big difference
– MCU supplier implementation impacts performance, power consumption and ease of use
Examples:– Flash memory performance– Peripheral consistency– Integration– Debugging capabilities
5
Changing the MCU Landscape
LPC213x (2003) LPC210x (2005)LPC24xx (2007) LPC17xx (2009)
6
Cortex M0 Introduction
7
ARM Cortex-M0 Processor
32-bit ARM RISC processor– Thumb 16-bit instruction set
Very power and area optimized– Designed for low cost, low power
Automatic state saving on interrupts and exceptions– Low software overhead on exception entry and exit
Deterministic instruction execution timing– Instructions always takes the same time to execute*
*Assumes deterministic memory system
8
Thumb instruction set
Thumb®
ARM7 ARM9 Cortex-A9Cortex-R4Cortex-M3Cortex-M0
Thumb instruction set upwards compatibility
32-bit operations, 16-bit instructions– Introduced in ARM7TDMI (‘T’ stands for Thumb)– Supported in every ARM processor developed since– Smaller code footprint
Thumb-2– All processor operations can all be handled in ‘Thumb’ state– Enables a performance optimised blend of 16/32-bit instructions– Supported in all Cortex processors
9
Instruction set architectureBased on 16-bit Thumb ISA from ARM7TDMI
– Just 56 instructions, all with guaranteed execution time– 8, 16 or 32-bit data transfers possible in one instruction
Dhrystone - 0.9DMIPS/MHz
10
Program registers r0r1r2r3r4r5r6r7r8r9
r10r11r12
r15 (PC)r14 (LR)
All registers are 32-bit wide– Instructions exist to support 8/16/32-bit data
13 general purpose registers– Registers r0 – r7 (Low registers)– Registers r8 – r12 (High registers)
3 registers with special meaning/usage– Stack Pointer (SP) – r13– Link Register (LR) – r14– Program Counter (PC) – r15
Special-purpose registers - xPSR
r13 (SP)
xPSR
11
Instruction behaviour
Most instructions occupy 2 bytes of memory
When executed, complete in a fixed time– Data processing (e.g. add, shift, logical OR) take 1 cycle– Data transfers (e.g. load, store) take 2 cycles– Branches, when taken, take 3 cycles
The instructions operate on 32-bit data values– Processor registers and ALU are 32-bit wide!
MULS15 0
MULS r0, r1; Assemblera = a * b; C code
12
16-bit multiply example
Time: 1 clock cycleCode size: 2 bytes
Time: 8 clock cyclesCode size: 8 bytes
Time: 48 clock cycles*Code size: 48 bytes
MULS r0,r1,r0 MOV R1,&MulOp1MOV R2,&MulOp2MOV SumLo,R3 MOV SumHi,R4
MOV A, XL ; 2 bytesMOV B, YL ; 3 bytesMUL AB; 1 byteMOV R0, A; 1 byteMOV R1, B; 3 bytesMOV A, XL ; 2 bytesMOV B, YH ; 3 bytesMUL AB; 1 byteADD A, R1; 1 byteMOV R1, A; 1 byteMOV A, B ; 2 bytesADDC A, #0 ; 2 bytesMOV R2, A; 1 byteMOV A, XH ; 2 bytes
MOV B, YL ; 3 bytes
ARM Cortex-M016-bit example8-bit example
MUL AB; 1 byte
ADD A, R1; 1 byte
MOV R1, A; 1 byte
MOV A, B ; 2 bytes
ADDC A, R2 ; 1 bytes
MOV R2, A; 1 byte
MOV A, XH ; 2 bytes
MOV B, YH ; 3 bytes
MUL AB; 1 byte
ADD A, R2; 1 byte
MOV R2, A; 1 byte
MOV A, B ; 2 bytes
ADDC A, #0 ; 2 bytes
MOV R3, A; 1 byte
Consider an device with a 10-bit ADC– Basic filtering of data requires a 16-bit multiply operation– 16-bit multiply operation is compared below
* 8051 need at least one cycle per instruction byte fetch as they only have an 8-bit interface
13
Nested Vectored Interrupt Controller
NVIC enables efficient exception handling– Integrated within the processor - closely coupled with the core– Handles system exceptions & interrupts
The NVIC includes support for– Prioritization of exceptions– Tail-chaining & Late arriving interrupts
Fully deterministic exception handling timing behavior– Always takes the same number of cycles to handle an exception– Fixed at 16 clocks for no jitter– Register to trade off latency versus jitter
Everything can be written in C
14
Interrupt behaviour
On interrupt, hardware automatically stacks corruptible state
Interrupt handlers can be written fully in C– Stack content supports C/C++ ARM Architecture Procedure Calling Standard
Processor fetches initial stack pointer from 0x0 on reset
r0r1r2r3r12
r15 (PC)r14 (LR)
xPSR Memory
r13 (SP)
Stack
Growth
Push
15
Traditional approach
Exception table– Fetch instruction to branch
Top-level handler – Routine handles re-entrancy
IRQVECTOR
LDR PC, IRQHandler
. .
IRQHandler PROC
STMFD sp!,{r0-r4,r12,lr}
MOV r4,#0x80000000
LDR r0,[r4,#0]
SUB sp,sp,#4
CMP r0,#1
BLEQ C_int_handler
MOV r0,#0
STR r0,[r4,#4]
ADD sp,sp,#4
LDMFD sp!,{r0-r4,r12,lr}
SUBS pc,lr,#4
ENDP
Writing interrupt handlers
ARM Cortex-M family
NVIC automatically handles– Saving corruptible registers– Exception prioritization– Exception nesting
ISR can be written directly in C– Pointer to C routine at vector– ISR is a C function
Faster interrupt response– With less software effort
WFI, sleep on exit
16
Software support for sleep modes
ARM Cortex-M family has architected support for sleep states – Enables ultra low-power standby operation– Critical for extended life battery based applications– Includes very low gate count Wake-Up Interrupt Controller (WIC)
NVIC
Cortex-M0
WIC
Wake-up
External interrupts
Wake-up sensitive Interrupts
Power Management Unit
DeepSleep
Sleep– CPU can be clock gated – NVIC remains sensitive to interrupts
Deep sleep– WIC remains sensitive to selected interrupts– Cortex-M0 can be put into state retention
WIC signals wake-up to PMU– Core can be woken almost instantaneously– React to critical external events
17
Instruction Set Comparison
CORTEX-M3
CORTEX-M0
ADC
BL
ADD ADR
B
BIC
CMN CMP EOR
LDMLDR LDRB
LDRH LDRSH
LSL LSR MOV
MUL MVN
POP PUSH ROR
SBC STM
STR STRB STRH
SUB
LDRSB
SVC TST
AND ASR
ORR
RSB
IT
ADC ADD ADR AND ASR B
BFC BFI BIC CBNZ CBZCLREX
CLZ
CMN
CMP DBG
CDP
EOR LDC
LDMIA LDMDB LDR LDRB
LDRBT LDRD LDREX LDREXB
LDREXH LDRH LDRHT LDRSB
LDRSBT LDRSHLDRSHT LDRT
LSL LSRMCR
MCRR MLA
MLS
MOV MOVT
MRC MRRC MUL MVN
NOP ORN ORR PLD
PLDW PLI POP PUSH
RBIT REV REV16 REVSH
ROR RRX RSB SBC
SBFX SDIV SEV SMLAL
SMULL SSAT STC STMIA
STMDB STR STRB STRBT
STRD STREX STREXB STREXH STRH STRHT STRT
SUB SXTB SXTH TBB TBH TEQ TST
UBFX UDIV UMLAL UMULL USAT UXTB UXTH
WFE WFI YIELD
WFI YIELD
BKPT BLX
BX CPS
DMB
ISB
MRS
MSR
NOP REV
REV16 REVSH
SEV SXTB
SXTH UXTB
UXTH WFE
DSB
Present in ARM7TDMI
18
LPC111x Product Details
19
LPC111x benefits
Energy efficiency
Performance efficiency
Ease of use
20Sponsored Session
LPC111x Cortex-M0 Configuration
Single cycle multiplier
Little Endian
Operating system extensions– 24 Bit Sys Tick
– Process stack pointer register
– Supervisor call instruction
Debug– Serial wire debug only – no SWO
– 4 breakpoints,2 watchpoints
21
LPC111x Family
22
LPC111x Block diagram
23
LPC111x Memory subsystem
M0 has ZERO latency 32 bit Interface.
Thumb instruction set (16 bits)
Match Flash width to bus width– Use self timed reads
– Current is lower by allowing one 32 bit read
versus two 16 bit reads
– Increased processor performance
32 bit Flash32 bit RAM
M0 Core
32
24
LPC111x Serial Interfaces
UARTS (Universal Asynchronous Receiver/Transmitter)– Fractional divider for baud rate control, auto-baud capabilities, and
implementation of software or hardware flow control.
– 16 Byte Receive and Transmit FIFO with adjustable trigger levels
– EIA-485/RS-485 and 9-bit mode support• Allows both software address detection and automatic address
detection using 9-bit mode• Auto Direction Control.
– Modem control support
25
LPC111x Serial Interfaces
I2C-bus interface (Inter-Integrated Circuit) – Can be configured as Master, Slave, or Master/Slave.– Bi-directional data transfer between masters and slaves– Fast mode plus compatible I2C pads (1Mbit/sec)– Programmable clock to allow adjustment of I2C transfer rates.– Monitor Mode– Can respond to multiple I2C addresses
26
LPC111x I2C output pad characteristics
High Current I2C Fast Plus Pad Input Current IIN versus Input Voltage VIN under Nominal Conditions (3.3V VDDE, 1.8V VDD, 27oC) - October 23, 2008
0
200
400
600
800
1000
1200
1400
1600
1800
0 20 40 60 80 100 120
IIN ( mA )
VIN
( m
V )
27
LPC111x Serial Interfaces
SSP controller (Synchronous Serial Communication)– 8 Frame Transmit and Receive FIFO ‘s– Flexible Frame size - 4 to 16bits– Multi-protocol capabilities
• SPI (Motorola)
• SSI (TI)
• Microwire (National)– Maximum speeds
• 25 Mbits/s (Master Mode)
• 6 Mbits/s (Slave Mode
28
LPC111x Timers
Two general purpose 32-bit timers/counters– Capture inputs, and 13 match outputs – Timer as counter or timer mode. – Match output can toggle, go high, go low or do nothing– PWM mode
Two general purpose 16 bit timers/counters
SysTick Timer (Core)
WDT 32bits– Resets chip if not periodically reloaded – Debug mode– Multiple clock sources (Watchdog Osc, IRC, Main clock)
29
LPC111x Clock Generation Block Diagram
osc_clk
irc_osc
system clock select
(CLKSRCSEL)
watchdog pclk
Peripheral Clock
Divider
wd_clk
pclk1
pclk8pclk4pclk2
SYS PLL/DLL
main PLL settings(PLL0...)
CPU Clock
Divider
CPU PLL select
(PLL0CON)
cclk
watchdog clock select
(WDCLKSEL)
CPU clock divider setting(CCLKCFG)
sysclkIRC
WDOSC
30
LPC111x Clocking Features (cont.)
Clock Output function– Output any clock regardless of CPU clock– Clock divider
clock out select(CLKOUTSEL)
irc_oscosc_clkcclk
WD OSC
Clock outClock Divider
CLKOUT DIV(CLKOUT_DIV)
CLKOUT
0
31
LPC111x General Purpose I/O (GPIO)
42 High Speed GPIOs (LQFP48)
14 pins have their own interrupt vector
All pins configurable to pull-ups/pull-downs enabled or disabled
Bus keeper mode
High Current output on some pins
5V tolerant
32
LPC111x ADC
10-bit successive approximation ADC.
Input multiplexing among 8 pins.
Power-down mode.
Measurement range 0 V to VDD(3V3).
10-bit conversion time ≥ 2.44 µs.
Burst conversion mode for single or multiple inputs.
Optional conversion on transition of input pin or Timer Match signal.
Individual result registers for each channel reducing interrupt overhead.
33
LPC111x Reset Sources
RESETN pin
Watchdog (WDT) Reset
Power-On Reset (POR)
BrownOut Detection (BOD)– 4 stage monitoring of the voltage– BOD asserts interrupt signal to NVIC– BOD asserts Reset
34
LPC111x Power modes
Sleep mode– The clock to the core is stopped.– Reset or interrupt resumes program execution. – Peripheral functions continue operation during
Deep-sleep mode– All clocks are stopped (state is retained)– Various power down option trades off for wake up time.
Deep power down mode– All chip power turn off (state is lost)– Always on domain registers maintain state. Can be used to information of
where the processor was before deep power down.– Wakeup through a wakeup pin.
35
LPC111x Serial Wire Debug (SWD)
Target debug with no debug agent– Target auto-detection– Run, stop and single-step– Flash download and verify– Read/write registers and memory– Interleaved source - disassembly
Only 2 pins required!– Same 20-pin connector as JTAG
with optional SWV connection
High performance– 400 Kbytes/sec @ 5 MHz
Functionality and performance equivalent to JTAG with only 2 pins
36
Performance Benchmarks
37
Code size of 32 bits versus 16/8bit MCU’s
The instruction size of 8 bit MCU’s is not 8 bits– 8051 is 8 to 24 bits– PIC18 is 18 bits– PIC16 is 16 bits
The instruction size of 16 bit MCU’s is not 16 bits– MSP430 can be up to 32bits and the extended version can be up to 64 bits– PIC24 is 24 bits
The instruction size for M0 is mostly 16 bits
38
Code size of 32 bits versus 16/8bit MCU’s
39
What about Data ?
8 bit microcontrollers do not just process 8 bit data– Integers are 16 bits– 8 bit microcontroller needs multiple instructions integers– C libraries are inefficient – Stack size increases– Interrupt latency is affected
Pointers take multiple Bytes.
M0 can handle Integers in one instruction
M0 can efficiently process 8 and 16 bit data– Supports byte lanes– Instructions support half words and half bytes.
LDR, LDRH, LDRB
M0 has efficient Library support– Optimized for M0
40
What about Data ?For 16 bit processors have issues with
– Long integers– Floating point types– Data transfers between processor registers and memory
16 bit processors have 16 bit registers– Two registers required for 32 bit transfers– Increased stack requirements
M0 has 32 bit registers and 32 bit memories– Less cycles for long integers– Good floating point performance– Less cycles for data transfers
41
Code Size Performance
0.00
0.50
1.00
1.50
2.00
2.50a2
time
aifir
f
aiiff
t
bitm
np
canr
dr
iirflt
pntrc
h
puw
mod
rspe
ed
HC08 M0 using microlib
42
Code Size PerformanceM0 code size is on average 10% smaller than best MSP430 average
Code size for basic functions
0
50
100
150
200
250
300
350
Math8bit
Math16bit
Math32bit
Matrix2dim
8bit
Matrix2dim
16
Matrixm
ult
Sw
itch8bit
Sw
itch16bit
Cod
e Si
ze (B
ytes
)
MSP430
MSP430F5438
MSP430F5438 Large model
Cortex M0
43
Code Size PerformanceM0 code size is 42% and 36% smaller than best MSP430 generic
Floating Point and Fir Filter Code Size
0200400600800
100012001400
Gen
eric
MS
P43
0
MS
P43
0F54
38
MS
P43
0F54
38la
rge
data
mod
el
Cor
tex-
M0
Cod
e Si
ze(b
ytes
)
MathFloatFirfilter
44
Code Size PerformanceM0 code size is 30% smaller than MSP430F5438
Whet
0
1000
2000
3000
4000
5000
6000
7000
Generic
MS
P430
MS
P430F5438
MS
P430F5438
large datam
odel
Cortex-M
0
Cod
e Si
ze (B
ytes
)
45
Code Size Performance (Coremark)M0 code size is 16% smaller than generic MSP430
Coremark
6500
7000
7500
8000
8500
9000
9500
Generic MSP430 Cortex-M0
Cod
e Si
ze (B
ytes
)
46
Code Size Performance (Coremark)M0 code size is 49% smaller than Atmel AVR8
Coremark
0
2000
4000
6000
8000
10000
12000
14000
16000
Atmel AVR8 Mega644 Cortex-M0
Cod
e Si
ze (B
ytes
)
47
Computation Performance
48
Computation PerformanceuS
ec
16 bit FIIR filter performance at 1MHz
49
Computation Performance
Coremark Score
0
0.5
1
1.5
2
2.5
3
3.5
AVR8 ATMega644 MSP430 M0
Cor
emar
k (M
ark/
sec)
50
Power performance
Current Contribution by Function (1MHz)
CoreFlashRAMPeripheralsAnalog support
Current Contribution by Function (20MHz)
CoreFlashRAMPeripheralsAnalog support
Core size is 3 times smaller than the M3. In the case of smallerFlash footprint this is a significant power saving
51
Power performance
Slope is 150uA/MHz
While (1); loop
0
1
2
3
4
5
6
12 16 20 24 28 32 36
Cur
rent
(mA)
Frequency(MHz)
LPC111x current versus Frequency
52
Development Tools
53
MCU Tool Solutions
NXP’s Low cost Development Tool Chain
Rapid Prototyping Online Tool
Traditional Feature Rich Tools (third party)
54
NXP’s FIRST Low Cost Toolchain
Eclipse-based IDE LPCXpressoStarter Board
Evaluation Product Development
55
LPCXpresso
LPCXpresso will provide end-to-end solution from evaluation all the way to product development
Attractive upgrade options to full blown suites and development boards
LPCXpresso will change the perception about NXP’s solution for tools
Key competition:– Microchip MPLAB– Atmel AVR Studio
“LPCXpresso will change the Tool Landscape for NXP”
56
LPCXpresso Components
NXP has created the first single perspective Eclipse IDE
This offers the power and flexibility of Eclipse in combination with a simple and easy to learn user interface
Supports all NXP products (currently up to 128k)
LPC3154 HS USB download and debug engine
LPC134x Target board
LPC3154
57
58
M0 Core Main Features
Tight integration of peripherals reduces area and development costs
Thumb instruction set => high code density with 32-bit performance
Power control optimization of system components
Integrated sleep modes for low power consumption
Fast code execution permits slower processor clock or increases sleep mode time
Hardware multiplier
Deterministic, high-performance interrupt handling for time-critical applications
Serial Wire Debug reduces the number of pins required for debugging.
59
LPC111x Memory subsystem
Cortex-M0 bus interleaves instruction and data– Does not cause a significant performance penalty
versus separate I and D bus
– Less than a 5% performance change based on broad
benchmark testing
60
LPC111x Memory system bus connections
Simple deterministic flash memory interface
Zero latency bus
All IAP and ISP code is in ROM– Flash programming– Plus Driver code!
61
SWD interface
ARM Cortex-M0
TEST/DEBUG INTERFACE
FLASH SRAM ROM
LPC111x Simple AHB-lite interface
M0 supports a simple AHB lite interface– 0 wait states from the core– No AHB buffering– No Burst transaction– No merging or re-ordering of accesses
Slave
M0 Core
Slave
Slave
Master Master
62