1 Tema IV – Sistemas Embebidos – MicroBlaze y ARM ZYNQumh1759.edu.umh.es/.../uploads/sites/783/2013/02/Tema-4.pdf · 2014-11-11 · MicroBlaze y ARM ZYNQ 1 Roberto Gutiérrez

Unit IV – Embedded Systems – MicroBlaze and ARM ZYNQ

Roberto Gutiérrez Mazón

- Introduction
- MicroBlaze and ARM ZYNQ Features
- Random Access Memory (RAM):
  - SRAM, DRAM, SDRAM, etc.
  - Programmable ROM (PROM, EPROM, EEPROM, FLASH)
  - Content-Addressable Memory (CAM)
- Buses – Architectures
  - PCI, PCI Express
  - Processor Local Bus (PLB) & On-chip Peripheral Bus (OPB)
  - Advanced eXtensible Interface (AXI)
- Hardware/Software co-design. Address Management
- Embedded Development Kit (EDK)
- OS vs. RTOS (Real-Time OS)

- An embedded system is nearly any computing system (other than a general-purpose computer) with the following characteristics:
  - Single function
    - Typically designed to perform a predefined function
  - Tightly constrained
    - Tuned for low cost
    - Built from a single component or only a few
    - Performs functions fast enough
    - Consumes minimum power
  - Reactive and real-time
    - Must continually monitor its environment and react to changes
  - Hardware and software coexistence

- Embedded design in an FPGA consists of the following:
  - Develop the processor system in the FPGA
    - MicroBlaze processor (soft core) or ARM processor (hard core)
    - Peripherals
    - PLBv46 (XPS)
    - AXI interconnect
    - Reset, clocking, debug ports
  - Use an Operating System (OS) or Real-Time Operating System (RTOS) (optional)
  - Generate drivers and libraries
  - Create the software application
    - Software routines
    - Interrupt service routines

[Figure: block diagram of an embedded system. Blocks: power supply, clocks (CLK), custom IF-logic, SDRAM and SRAM devices, memory controller, UART, LC, display controller, interrupt controller, timer, audio codec, CPU (uP / DSP), co-processor, GP I/O, address decode unit, Ethernet MAC.]

[Figure: the same system with an FPGA. Blocks: FPGA, clocks (CLK), custom IF-logic, memory controller, UART, display controller, timer, interrupt controller, CPU (uP / DSP), co-processor, GP I/O, address decode unit, Ethernet MAC, power supply, LC, audio codec.]

[Figure: power supply, LC, audio codec, EPROM.]

- Example:
  - Hummingbird processor from Samsung
    - Used in Galaxy phones and tablets, and the basis of Apple's A4 processor for the iPad and iPhone 4
    - An ARM Cortex-A8 processor core with a PowerVR SGX 535 graphics chip
  - NVIDIA's Tegra 2 is similar
    - Pairs two ARM Cortex processor cores with an NVIDIA GPU

- Bus – Concepts and Architectures
  - PCI, PCI Express
  - Processor Local Bus (PLB) & On-chip Peripheral Bus (OPB)
  - Advanced eXtensible Interface (AXI)


MicroBlaze Architecture

[Figure: MicroBlaze block diagram. Callouts: optional MMU for Linux 2.6 and MPU block for ease of software use; PLB-based system; enhanced FSL for CPU to hw/sw accelerator.]

- Scalable 32-bit core
  - Single-issue pipeline
    - Supports either a 3-stage (resource-focused) or a 5-stage (performance-focused) pipeline
  - Configurable instruction and data caches
    - Direct mapped (1-way associative)
  - Optional Memory Management Unit or Memory Protection Unit
    - Required for Linux OS (Linux 2.6 is currently supported)
  - Floating-point unit (FPU)
    - Based on the IEEE 754 format
  - Barrel shifter
  - Hardware multiplier
    - 32x32 multiplication to generate a 64-bit result
  - Hardware divider
  - Fast Simplex Link FIFO channels for easy, direct access to fabric and hardware acceleration
  - Hardware debug and trace module

- New features and improvements
  - High-performance AXI4 interface and AXI4 peripherals
  - Memory Management Unit (MMU) implements virtual memory management
    - PPC405 processor MMU compatible
    - Virtual memory management provides greater control over memory protection, which is especially useful with applications that can use an RTOS
  - Processing improvements
    - New float-integer conversion and float-square root instructions
    - Speeds up
      - FP → Int conversion
      - Int → FP conversion
      - FP square root
  - Enhanced XMD support
  - AXI4 streaming interface


- All instructions take one clock cycle, except the following:
  - Load and store (two clock cycles)
  - Multiply (two clock cycles)
  - Branches (three clock cycles, can be one clock cycle)
- Operating frequency (fast speed grade, 5-stage pipeline)
  - 307 MHz on the Virtex-6 (-3) FPGA
  - 245 MHz on the Virtex-5 (-3) FPGA
  - 154 MHz on the Spartan®-6 (-3) FPGA
  - 119 MHz on the Spartan-3 (-5) FPGA
- Performance of 1.15 DMIPS/MHz
- Fabric utilization in LUTs (size optimized / speed optimized)
  - 779/1,134 LUTs in the Virtex-6 FPGA
  - 240/330 LUTs in the Virtex-5 FPGA
  - 770/1,154 LUTs in the Spartan-6 FPGA
  - 1,258/1,821 LUTs in the Spartan-3 FPGA
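As a rough worked example, the peak Dhrystone throughput implied by these figures is the DMIPS/MHz rating multiplied by the clock frequency. The sketch below is illustrative arithmetic only; the function name `dmips` and the dictionary layout are assumptions of this example, not part of any Xilinx tool.

```python
# Illustrative arithmetic only: peak DMIPS = (DMIPS per MHz) x (clock in MHz).
DMIPS_PER_MHZ = 1.15  # MicroBlaze figure quoted in the slide

# Maximum clock frequencies (MHz) quoted in the slide for the 5-stage pipeline.
FMAX_MHZ = {
    "Virtex-6 (-3)": 307,
    "Virtex-5 (-3)": 245,
    "Spartan-6 (-3)": 154,
    "Spartan-3 (-5)": 119,
}


def dmips(freq_mhz, dmips_per_mhz=DMIPS_PER_MHZ):
    """Return the peak Dhrystone MIPS at the given clock frequency."""
    return freq_mhz * dmips_per_mhz


if __name__ == "__main__":
    for device, freq in FMAX_MHZ.items():
        print(f"{device}: {dmips(freq):.1f} DMIPS")
```

These are upper bounds: real programs also pay the extra cycles listed for loads, stores, multiplies and branches.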


ZYNQ Features (ARM)

- Complete ARM®-based processing system
  - Application Processor Unit (APU)
    - Dual ARM Cortex™-A9 processors
    - Caches and support blocks
  - Fully integrated memory controllers
  - I/O peripherals
- Tightly integrated programmable logic
  - Used to extend the processing system
  - Scalable density and performance
- Flexible array of I/O
  - Wide range of external multi-standard I/O
  - High-performance integrated serial transceivers
  - Analog-to-digital converter inputs


- Application processing unit (APU)
- I/O peripherals (IOP)
  - Multiplexed I/O (MIO), extended multiplexed I/O (EMIO)
- Memory interfaces
- PS interconnect
- DMA
- Timers
  - Public and private
- General interrupt controller (GIC)
- On-chip memory (OCM): RAM
- Debug controller: CoreSight


- Legacy ARM processors
  - ARM7, ARM9 (not the Cortex-A9 processor), ARM11
- Cortex family of processors
  - Cortex-A#: "A" application
    - The products support a memory management unit (MMU)
    - Excellent for operating systems
  - Cortex-R#: "R" real time
    - The products support a memory protection unit (MPU)
    - Better determinism than an MMU
  - Cortex-M#: "M" embedded microcontroller
- Some products are implemented differently but use the same ARM architecture
  - Cortex-A8 and Cortex-A9 processors

- The ARM Cortex-A9 processor implements the ARMv7-A architecture
  - ARMv7 is the ARM Instruction Set Architecture (ISA)
    - Thumb instructions: 16 bits; Thumb-2 instructions: 32 bits
    - NEON: ARM's Single Instruction Multiple Data (SIMD) instructions
  - ARMv7-A: Application set that includes support for a Memory Management Unit (MMU)
  - ARMv7-R: Real-time set that includes support for a Memory Protection Unit (MPU)
  - ARMv7-M: Microcontroller set that is the smallest set
- ARM Advanced Microcontroller Bus Architecture (AMBA®) protocol
  - AXI3: Third-generation ARM interface
  - AXI4: Adding to the existing AXI definition (extended bursts, subsets)
- Cortex is the new family of processors
  - ARM family is older generation; Cortex is current; MMUs in Cortex processors and MPUs in ARM

Application Processing Unit (APU)

- Heart of the PS
- Tightly coupled processors and sub-components for maximum performance
- Tied to other PS components and PL via the PS interconnect


- Dual ARM® Cortex™-A9 MPCore with NEON extensions
  - Up to 800-MHz operation
  - 2.5 DMIPS/MHz per core
  - Separate 32 KB instruction and data caches
- Snoop control unit
  - L1 cache snoop control
    - Accelerator coherency port
- Level 2 cache and controller
  - Shared 512 KB cache with parity


- Introduction to NEON
  - NEON is the ARM codename for the vector processing unit
    - Provides multimedia and signal processing support
  - FPU is the floating-point unit extension to NEON
    - Both NEON and FPU share a single set of registers
  - NEON technology is a wide single instruction, multiple data (SIMD) parallel and co-processing architecture
    - 32 registers, 64 bits wide (dual view as 16 registers, 128 bits wide)
    - Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, or 32-bit float
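To make the register views concrete, the following sketch counts how many elements ("lanes") of a given width fit in a 64-bit or 128-bit register and models one lane-wise operation. It is a plain Python model written for illustration; `lanes` and `simd_add_u8` are hypothetical helper names, not NEON intrinsics.

```python
def lanes(register_bits, element_bits):
    """Number of SIMD lanes when a register is viewed as equal-width elements."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


def simd_add_u8(a, b):
    """Lane-wise unsigned 8-bit add with wrap-around (modulo 256)."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [(x + y) & 0xFF for x, y in zip(a, b)]


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(f"{width}-bit elements: {lanes(64, width)} per 64-bit register, "
              f"{lanes(128, width)} per 128-bit register")
```

A single SIMD instruction applies the same operation to every lane, which is where the multimedia and signal-processing speed-up comes from.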


- L1 Cache Features
  - Separate instruction and data caches for each processor
  - Caches are four-way set associative and are write-back
  - Non-lockable
  - Cache line length of eight words
  - On a cache miss, critical-word-first filling of the cache is performed, followed by the next word in sequence
- L2 Cache Features
  - 512 KB of RAM built into the SCU
    - Latency of 25 CPU cycles
    - Unified instruction and data cache
  - Fixed 256-bit (32-byte) cache line size
  - Support for per-master way lockdown between multiple CPUs
  - Eight-way set associative
  - Two AXI interfaces
    - One to DDR controller
    - One to programmable logic master (to peripherals)
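The cache parameters fix how an address is split into tag, index and offset. The sketch below works this out, assuming 4-byte words (so an eight-word L1 line is 32 bytes), power-of-two sizes, and 32-bit addresses; the function names are illustrative only.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes the line size and the number of sets are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line size in bytes)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    # L1: 32 KB, four-way, 32-byte lines. L2: 512 KB, eight-way, 32-byte lines.
    print("L1:", cache_geometry(32 * 1024, 4, 32))
    print("L2:", cache_geometry(512 * 1024, 8, 32))
```

Under these assumptions the L1 caches have 256 sets (8 index bits) and the L2 cache has 2048 sets (11 index bits), both with a 5-bit byte offset.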


- APU Sub-components
  - General interrupt controller (GIC)
  - On-chip memory (OCM): RAM and boot ROM
  - Central DMA (eight channels)
  - Device configuration (DEVCFG)
  - Private watchdog timer and timer for each CPU
  - System watchdog and triple timer counters shared between CPUs
  - ARM CoreSight debug technology


- Snoop Control Unit (SCU)
  - Shares and arbitrates functions between the two processor cores
    - Data cache coherency between the processors
    - Initiates L2 AXI memory access
    - Arbitrates between the processors requesting L2 accesses
    - Manages ACP accesses
    - A second master port with programmable address filtering between OCM and L2 memory support






Introduction - RAM

[Figure: memory hierarchy. On-chip components: control, datapath, register file, ITLB and DTLB, instruction cache and data cache; followed by second-level cache (SRAM), eDRAM, main memory (DRAM), and secondary memory (disk).]

Moving away from the processor:
  Speed (ns):    0.1's   1's   10's    100's   1,000's
  Size (bytes):  100's   K's   10K's   M's     T's
  Cost:          highest ...................... lowest

- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
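One common way to quantify this effect is the average memory access time (AMAT): hit time plus miss rate times miss penalty. The formula is standard textbook material rather than something stated in these slides, and the numbers below are made-up values chosen only to show the arithmetic.

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time = hit time + miss rate x miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


if __name__ == "__main__":
    # Hypothetical numbers: 1 ns cache hit, 5% miss rate, 100 ns DRAM access.
    print(f"AMAT = {amat(1.0, 0.05, 100.0):.1f} ns")
```

With good locality the miss rate is small, so the average access time stays close to that of the fast memory even though most of the capacity is in the slow one.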

Memory classification:
  Read-Write Memory, Random Access:      SRAM, DRAM
  Read-Write Memory, Non-Random Access:  FIFO, LIFO, Shift Register, CAM
  Non-Volatile Read-Write Memory:        EPROM, E2PROM, FLASH
  Read-Only Memory:                      Mask-Programmed, Programmable (PROM)

[Figure: growth in DRAM chip capacity.]

- Random access:
  - "Random" is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    - High density, low power, cheap, slow
    - Dynamic: needs to be "refreshed" regularly
  - SRAM: Static Random Access Memory
    - Low density, high power, expensive, fast
    - Static: content lasts "forever" (until power is lost)
- "Not-so-random" access technology:
  - Access time varies from location to location and from time to time
  - Examples: disk, CD-ROM
- Sequential access technology: access time linear in location (e.g., tape)

- Performance of main memory:
  - Latency: cache miss penalty
    - Access time: time between the request and the arrival of the word
    - Cycle time: time between requests
  - Bandwidth: I/O and large-block miss penalty (L2)
- Main memory is DRAM: Dynamic Random Access Memory
  - Dynamic since it needs to be refreshed periodically (8 ms)
  - Addresses divided into 2 halves (memory as a 2D matrix):
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
- Cache uses SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor)
  - Size: DRAM/SRAM ratio of 4-8
  - Cost and cycle time: SRAM/DRAM ratio of 8-16
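The two-halves addressing can be sketched as a simple bit split: the row half is presented with RAS and the column half with CAS on the same address pins. The example assumes the upper bits form the row address, which is a common convention but an assumption here; `split_dram_address` is a hypothetical helper.

```python
def split_dram_address(addr, row_bits, col_bits):
    """Split a linear cell address into (row, column) for RAS/CAS multiplexing.

    Assumes the upper bits select the row and the lower bits the column.
    """
    if addr < 0 or addr >> (row_bits + col_bits):
        raise ValueError("address does not fit in row_bits + col_bits")
    row = addr >> col_bits
    col = addr & ((1 << col_bits) - 1)
    return row, col


if __name__ == "__main__":
    # An 18-bit address sent over 9 shared pins: 9 row bits, then 9 column bits.
    row, col = split_dram_address(0x2ABCD, row_bits=9, col_bits=9)
    print(f"row = {row:#05x} (with RAS), column = {col:#05x} (with CAS)")
```

Multiplexing halves the number of address pins at the cost of sending the address in two steps.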

1D Memory Architecture

[Figure: an array of n words (Word 0 ... Word n-1) of m bits each, built from storage cells, with select signals S0 ... Sn-1 and a shared input/output.]

- Without a decoder: n words → n select signals
- With a decoder on address inputs A0 ... Ak-1: the decoder reduces the number of inputs, k = log2 n

2D Memory Architecture

[Figure: 2D array of storage (RAM) cells on word lines and bit lines. The k address bits A0 ... Ak-1 are split into a row address, which drives the row decoder, and a column address, which drives the column decoder. The array has 2^(k-j) rows of m·2^j bits.]

- Sense amplifiers / read-write circuits: amplify the bit line swing
- Column decoder: selects the appropriate word from the memory row
- Input/output: m bits

3D Memory Architecture

[Figure: memory divided into blocks; the address is split into row address, column address, and block address fields; input/output of m bits.]

Random Access Memory (RAM). SRAM cell.

- Basic building block: SRAM cell
  - Holds one bit of information, like a latch
  - These cross-coupled inverters are often referred to as a latch
  - The circuit uses positive feedback
- 12-transistor (12T) SRAM cell
  - Uses a simple latch connected to the bitline
- 6T SRAM cell
  - Used in most commercial chips
  - Data stored in cross-coupled inverters
- Read:
  - Precharge bit, bit_b
  - Raise wordline
- Write:
  - Drive data onto bit, bit_b
  - Raise wordline

[Figures: 12T cell with bit, write, write_b, read, read_b signals; 6T cell with bit, bit_b, word signals.]

SRAM Read

- Precharge both bitlines high
- Then turn on wordline
- One of the two bitlines will be pulled down by the cell
- Ex: A = 0, A_b = 1
  - bit discharges, bit_b stays high
  - But A bumps up slightly
- Read stability
  - A must not flip
  - N1 >> N2

[Figure: 6T cell schematic (transistors N1-N4, P1, P2; nodes A, A_b; lines bit, bit_b, word) and read waveforms of word, bit, bit_b, A, A_b; vertical axis 0.0 to 1.5, time axis 0 to 600 ps.]

SRAM Write

- Drive one bitline high, the other low
- Then turn on wordline
- Bitlines overpower cell with new value
- Ex: A = 0, A_b = 1, bit = 1, bit_b = 0
  - Force A_b low, then A rises high
- Writability
  - Must overpower feedback inverter
  - N2 >> P1

[Figure: 6T cell schematic (transistors N1-N4, P1, P2; nodes A, A_b; lines bit, bit_b, word) and write waveforms of word, bit_b, A, A_b; vertical axis 0.0 to 1.5, time axis 0 to 700 ps.]

Decoders

- An n:2^n decoder consists of 2^n n-input AND gates
  - One needed for each row of memory
  - Build AND from NAND or NOR gates

[Figure: 2:4 decoder (inputs A0, A1; outputs word0-word3) and 4:16 decoder (inputs A0-A3; outputs word0-word15).]

Pre-decoding

[Figure: 4:16 decoder built with predecoders. Address bits A0-A3 feed the predecoders, which produce 1-of-4 hot predecoded lines; these are combined to drive word0-word15.]
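The decoder and its predecoded variant can be modelled in a few lines. This is a behavioural sketch only (lists of 0/1 rather than gates), and it assumes the predecoders group the address bits in pairs (A1..A0 and A3..A2); the function names are illustrative.

```python
def decode(addr, n_bits):
    """n:2^n decoder: return a one-hot list with exactly one wordline high."""
    return [1 if i == addr else 0 for i in range(1 << n_bits)]


def predecoded_decode_4to16(addr):
    """4:16 decoder built from two 2:4 predecoders and 2-input AND gates."""
    low = decode(addr & 0b11, 2)          # 1-of-4 hot lines from A1..A0
    high = decode((addr >> 2) & 0b11, 2)  # 1-of-4 hot lines from A3..A2
    # Wordline i is the AND of one line from each predecoded group.
    return [high[i >> 2] & low[i & 0b11] for i in range(16)]


if __name__ == "__main__":
    print(decode(2, 2))                 # word2 selected out of word0..word3
    print(predecoded_decode_4to16(9))   # word9 selected out of word0..word15
```

Both versions select the same wordline; predecoding shares the first stage of gates among many rows, so each final gate needs fewer inputs.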


- Column circuitry
  - Bitline conditioning
  - Sense amplifiers
  - Column multiplexing
- Precharge bitlines high before reads
- Equalize bitlines to minimize voltage difference when using sense amplifiers

[Figures: bitline conditioning (precharge and equalization of bit and bit_b on clock φ); sense amplifier (bit, bit_b, sense, sense_b, sense_clk, isolation transistors, regenerative feedback); column multiplexing (select inputs A0, A1; data inputs B0-B3; output Y).]

- Ex: UltraSparc 512 KB cache
  - 4 subarrays of 128 KB
  - Each has 16 banks of 8 KB
  - 256 rows x 256 columns per bank
  - 60% subarray area efficiency
  - Also space for tags and control
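The organisation figures are consistent with one another, as the following arithmetic check shows. It assumes each row/column intersection stores one bit; the constant and function names are illustrative.

```python
ROWS, COLS = 256, 256          # cells per bank (one bit per cell assumed)
BANKS_PER_SUBARRAY = 16
SUBARRAYS = 4


def bank_bytes(rows=ROWS, cols=COLS):
    """Capacity of one bank in bytes."""
    return rows * cols // 8


def subarray_bytes():
    """Capacity of one subarray in bytes."""
    return BANKS_PER_SUBARRAY * bank_bytes()


def cache_bytes():
    """Total data capacity of the cache in bytes."""
    return SUBARRAYS * subarray_bytes()


if __name__ == "__main__":
    print(bank_bytes() // 1024, "KB per bank")
    print(subarray_bytes() // 1024, "KB per subarray")
    print(cache_bytes() // 1024, "KB total")
```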

Random Access Memory (RAM). DRAM cell.

- The 1-T DRAM cell uses a capacitor (Cc) to store data temporarily. The data must be refreshed periodically to prevent information loss, and in most DRAMs the data is lost during the read cycle.
- Due to leakage currents of MA, the data will eventually be corrupted; hence it needs to be refreshed.

[Figures: 1-T DRAM cell storing a '0' and storing a '1'.]

- DRAM subarray (256 words x 512 bits).

[Figures: 1-T DRAM cell read operation; column circuitry with bitline conditioning and sense amplifier.]

[Figures: SRAM timing; DRAM timing with multiplexed addressing; DRAM timing with multiplexed addressing, detailed.]

- DRAM read timing. Every DRAM access begins with:
  - The assertion of RAS_L
  - 2 ways to read: early or late relative to CAS
    - Early read cycle: OE_L asserted before CAS_L
    - Late read cycle: OE_L asserted after CAS_L

[Figure: 256K x 8 DRAM with 9 address pins (A), 8 data pins (D), and control inputs RAS_L, CAS_L, WE_L, OE_L. Read waveforms: A carries the row address, then the column address (junk otherwise); D is high-Z until data out; the read access time, output enable delay, and DRAM read cycle time are marked.]

- DRAM write timing. Every DRAM access begins with:
  - The assertion of RAS_L
  - 2 ways to write: early or late relative to CAS
    - Early write cycle: WE_L asserted before CAS_L
    - Late write cycle: WE_L asserted after CAS_L

[Figure: 256K x 8 DRAM with 9 address pins (A), 8 data pins (D), and control inputs RAS_L, CAS_L, WE_L, OE_L. Write waveforms: A carries the row address, then the column address (junk otherwise); D carries data in (junk otherwise); the write access time and DRAM write cycle time are marked.]

DOUBLE DATA RATE (DDR)

[Figure: functional block diagram of an 8M x 16b SDRAM.]

[Figure: prefetch and burst length.]

- Double-Data Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock
- Command frequency does not change
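A quick way to see the benefit is to compare peak transfer rates at the same clock. The sketch below uses a hypothetical 100 MHz clock with a 16-bit data bus; `peak_bandwidth_mb_s` is an illustrative helper, and real devices deliver less than this peak because of command, refresh and turnaround overhead.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_bits, transfers_per_clock):
    """Peak data rate in MB/s = clock x transfers per clock x bus width in bytes."""
    return clock_mhz * transfers_per_clock * bus_bits / 8


if __name__ == "__main__":
    clock_mhz, bus_bits = 100, 16  # hypothetical clock, 16-bit data bus
    sdr = peak_bandwidth_mb_s(clock_mhz, bus_bits, transfers_per_clock=1)
    ddr = peak_bandwidth_mb_s(clock_mhz, bus_bits, transfers_per_clock=2)
    print(f"SDR: {sdr:.0f} MB/s, DDR: {ddr:.0f} MB/s")
```

The clock and command rate are identical in both cases; only the number of data transfers per clock changes.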

Read Only Memory (ROM).

- ROM is a nonvolatile structure: its state is retained indefinitely, even without power.
- Mask-programmed ROM can be configured by the presence ('1') or absence ('0') of a transistor or contact.

[Figure: ROM array with a 2:4 decoder (inputs A0, A1), bitlines Y0-Y5, and weak pseudo-nMOS pullups.]

ROM array contents:
  Word 0: 010101
  Word 1: 011001
  Word 2: 100101
  Word 3: 101010

Read Only Memory. Programmable ROM.

- Programmable ROMs
  - PROM uses fuses to store the information. One-time programmable memory.
  - The user typically configures the ROM in a specialized PROM programmer before putting it in the system.
- Erasable Programmable ROMs (EPROM)
  - Use a floating gate between the control gate and the channel.
  - EPROM, EEPROM, Flash

[Figures: EPROM floating-gate transistor cross-section (source, drain, gate, floating gate, oxide thickness tox, n+ regions in a p substrate) with its symbol (G, S, D); EEPROM floating-gate transistor cross-section with 20-30 nm and 10 nm oxide thicknesses.]

[Figure: floating-gate transistor programming. Avalanche injection under a high programming voltage; removing the programming voltage leaves charge trapped; programming results in a higher Vt.]

[Figure: EEPROM 2T cell (WL, BL, VDD).]

[Figure: ETOX 1T cell (Flash).]

NAND FLASH

- 64K cells per page; 64 cells per line
  - 256 pages per block
- 4 bits per cell (multilevel Vt)
- 2K blocks per plane; 2 planes
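Multiplying these organisation figures gives the raw capacity of the device. The sketch assumes K = 1024 and that every cell stores user data (no spare area, ECC, or bad blocks), which real parts do not guarantee; the names are illustrative.

```python
K = 1024
CELLS_PER_PAGE = 64 * K
BITS_PER_CELL = 4          # multilevel cell
PAGES_PER_BLOCK = 256
BLOCKS_PER_PLANE = 2 * K
PLANES = 2


def page_bytes():
    """Raw bytes stored in one page."""
    return CELLS_PER_PAGE * BITS_PER_CELL // 8


def block_bytes():
    """Raw bytes stored in one block."""
    return page_bytes() * PAGES_PER_BLOCK


def device_bytes():
    """Raw bytes stored in the whole device."""
    return block_bytes() * BLOCKS_PER_PLANE * PLANES


if __name__ == "__main__":
    print(page_bytes() // K, "KB per page")
    print(block_bytes() // K**2, "MB per block")
    print(device_bytes() // K**3, "GB per device")
```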

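Content Addressable Memory (CAM)

- Extension of ordinary memory (e.g. SRAM)
  - Read and write memory as usual.
  - Also match to see which words contain a key.

[Figure: CAM block with adr, data/key, read, and write inputs and a match output.]

[Figure: CAM array of CAM cells with row decoder (address), column circuitry (data, read/write), clk, weak pullups on the match lines, and outputs match0-match3; a row that differs from the key reports a miss.]

[Figure: CAM cell with bit, bit_b, word, match, cell, cell_b.]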
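Behaviourally, a CAM is an ordinary memory plus a parallel compare of every stored word against a key. The class below is a minimal software model of that behaviour; the class and method names are assumptions of this sketch, and the hardware performs all comparisons simultaneously rather than in a loop.

```python
class CAM:
    """Behavioural model of a small content-addressable memory."""

    def __init__(self, n_words):
        self.words = [None] * n_words  # None marks a word never written

    def write(self, addr, data):
        """Ordinary write by address."""
        self.words[addr] = data

    def read(self, addr):
        """Ordinary read by address."""
        return self.words[addr]

    def match(self, key):
        """Return one match bit per word: 1 where the stored word equals key."""
        return [1 if w is not None and w == key else 0 for w in self.words]


if __name__ == "__main__":
    cam = CAM(4)
    cam.write(0, 0xA)
    cam.write(2, 0xA)
    cam.write(3, 0x5)
    print(cam.match(0xA))  # match lines for words 0..3
```

This search-by-content is what lets a cache compare an address tag against all stored tags at once.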

- CAM in a memory cache

[Figure: cache built from a CAM array (tag input through input drivers, address decoder, hit logic producing Hit) and an SRAM array (address, sense amps / input drivers, data, R/W).]






Introduction to Buses

- What is a bus?
- It is a simplified way for many devices to communicate with each other.
- Looks like a "highway" for information.
- Actually, more like a "basket" that they all share.

[Figure: CPU, keyboard, and display connected to a shared bus.]

- Suppose the CPU needs to check whether the user typed anything.
- The CPU puts "Keyboard, did the user type anything?" (represented in some way) on the bus.
- Each device (except the CPU) is a state machine that constantly checks what is on the bus.
- The keyboard notices that its name is on the bus and reads the info. Other devices ignore the info.
- The keyboard then places its response on the bus.
- At some point, the CPU reads the bus and gets the keyboard's response: "CPU: Yes, user typed 'a'."

Buses 101

- A bus is a multiwire path on which related information is delivered
  - Address, data, and control buses
- Processor and peripherals communicate through buses
- Peripherals may be classified as:
  - Arbiter, master, slave, or master/slave (bridge)

[Figure: two bus segments, each with an arbiter, a master, and slaves, joined by a master/slave bridge.]

- Address bus:
  - The CPU reads or writes data in memory by addressing a unique location. It outputs the location of the data (the address) on the address bus; memory uses this address to access the proper data
  - Each I/O device (such as a monitor, keypad, etc.) has a unique address as well (or a range of addresses). When accessing an I/O device, the CPU places its address on the address bus. Each device detects whether the address is its own and acts accordingly
  - The address bus is driven only by the CPU: devices always receive the address, and the CPU never reads the address bus (it is never addressed)
- Data bus:
  - When the CPU fetches data from memory, it first outputs the address on the address bus, then the memory outputs the data onto the data bus; the CPU reads the data from the data bus
  - When writing data to memory, the CPU first outputs the address on the address bus, then outputs the data onto the data bus; memory then reads and stores the data at the proper location
- Control bus:
  - Address and data buses consist of n lines, which combine to transmit one n-bit value; the control bus is a collection of individual control signals, most of them unidirectional
  - These signals indicate whether data is to be read into or written out of the CPU, whether the CPU is accessing memory or an I/O device, and whether the I/O device or memory is ready for the data transfer
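The interplay of the three buses can be mimicked with a small software model: the address selects a device, a read/write operation plays the role of the control signals, and the value is what travels on the data bus. All names and the address ranges below are hypothetical and chosen only for illustration.

```python
class Device:
    """A memory or I/O device that responds to a range of addresses."""

    def __init__(self, name, base, size):
        self.name, self.base, self.size = name, base, size
        self.storage = {}

    def selected(self, addr):
        """Address decode: is this address inside the device's own range?"""
        return self.base <= addr < self.base + self.size


class Bus:
    """Shared bus: every device sees the address; only the owner responds."""

    def __init__(self, devices):
        self.devices = devices

    def target(self, addr):
        for dev in self.devices:
            if dev.selected(addr):
                return dev
        raise LookupError(f"no device at address {addr:#x}")

    def write(self, addr, data):
        """CPU drives the address, the data, and a 'write' control signal."""
        dev = self.target(addr)
        dev.storage[addr - dev.base] = data

    def read(self, addr):
        """CPU drives the address and 'read'; the device drives the data."""
        dev = self.target(addr)
        return dev.storage.get(addr - dev.base, 0)


if __name__ == "__main__":
    bus = Bus([Device("ram", 0x0000, 0x1000), Device("uart", 0x8000, 0x10)])
    bus.write(0x8004, ord("a"))
    print(bus.target(0x8004).name, hex(bus.read(0x8004)))
```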

- Bus masters have the ability to initiate a bus transaction
- Bus slaves can only respond to a request
- Bus arbitration is a three-step process:
  - A device requesting to become a bus master asserts a bus request signal
  - The arbiter continuously monitors the requests and outputs an individual grant signal to each master according to the priority scheme and the state of the other master requests at that time
  - The requesting device samples its grant signal until it is granted access. It then initiates a data transfer with a slave once the current bus master releases the bus
- Arbitration mechanisms
  - Fixed priority, round-robin, hybrid
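Two of the arbitration mechanisms can be sketched as pure functions from the request lines to a grant. This is a simplified model, assuming master 0 has the highest fixed priority and that round-robin starts searching just after the most recently granted master; neither detail comes from a specific bus standard.

```python
def fixed_priority(requests):
    """Grant the lowest-numbered requesting master; None if nobody requests."""
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


def round_robin(requests, last_grant):
    """Grant the next requesting master after last_grant, wrapping around."""
    n = len(requests)
    for step in range(1, n + 1):
        master = (last_grant + step) % n
        if requests[master]:
            return master
    return None


if __name__ == "__main__":
    requests = [1, 1, 1]
    print("fixed priority:", fixed_priority(requests))
    grant = 0
    for _ in range(4):
        grant = round_robin(requests, grant)
        print("round-robin grant:", grant)
```

Fixed priority can starve low-priority masters under constant load, while round-robin gives each requester a turn; a hybrid scheme mixes the two.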

- The IBM CoreConnect bus architecture standard provides three buses for interconnecting cores, library macros, and custom logic:
  - Processor Local Bus (PLB)
  - On-Chip Peripheral Bus (OPB)
  - Device Control Register (DCR) bus
- IBM offers a no-fee, royalty-free CoreConnect bus architecture license
  - Licensees receive the PLB arbiter, OPB arbiter, and PLB/OPB bridge designs along with bus-model toolkits and bus-functional compilers for the PLB, OPB, and DCR buses
  - Required only if you create your own CoreConnect bus architecture peripheral or you are using the Bus Functional Model (BFM)

- The MicroBlaze processor core is organized as a Harvard architecture
  - Separate buses for data and instructions

[Figure: MicroBlaze bus interfaces. LMB buses (ILMB, DLMB) to local memory (BRAM); PLB (IPLB, DPLB) with PLB arbiter to IIC, GPIO, UART, Ethernet, Timer/PWM, and interrupt controller; CacheLinks (IXCL, DXCL) to the multi-port memory controller; FSL to a co-processor.]

PCI Bus

- Processor independence
- Low power consumption
- Burst used for all read and write transfers
- Bus speed up to 66 MHz
- 64-bit bus width
- Low pin count (PCI target: 47 pins; PCI initiator: 49 pins)
- Concurrent bus operation
- Bus master support
- Hidden bus arbitration
- Auto configuration

Key Terms

- Initiator
  - Or master
  - Owns the bus and initiates the data transfer
  - Every initiator must also be a target
- Target
  - Or slave
  - Target of the data transfer (read or write)
- Agent
  - Any initiator/target or target on the PCI bus

Intro. to PCI Bus Operation.

PCI Bus Clock
- All actions are synchronized to the PCI clock
- The clock may be anywhere from 0 MHz to 33 MHz, and all PCI devices must support this range
- The revision 2.1 specification defines speeds up to 66 MHz

Address Phase
- The initiator asserts the FRAME# signal
- At the same time, the initiator identifies the target device and the type of transaction
- Every PCI target device latches the address and decodes it

Data Phase
- The number of data bytes to be transferred is determined by the number of Command/Byte Enable signals asserted by the initiator
- Both the initiator and the target must be ready to complete a data phase
- IRDY# and TRDY# are used

Transaction Duration
- Indicated by asserting FRAME# at the start of the address phase and keeping it asserted until the final data phase

Transaction Completion and Return of Bus to Idle State
- Signaled by deasserting FRAME# while IRDY# is asserted
- When the last data transfer has completed, the initiator returns the PCI bus to the idle state by deasserting IRDY#
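The ready handshake in the data phase can be illustrated with a tiny cycle-by-cycle model: a data item is transferred only on a clock where both IRDY# and TRDY# are asserted, and any other clock is a wait state. The function below is a simplified sketch (it ignores FRAME#, DEVSEL# and the address phase), and `True` stands for "asserted" even though the real pins are active low.

```python
def run_data_phases(irdy, trdy):
    """Count completed data transfers and wait states.

    irdy, trdy: per-clock booleans, True meaning the signal is asserted.
    """
    transfers = 0
    waits = 0
    for initiator_ready, target_ready in zip(irdy, trdy):
        if initiator_ready and target_ready:
            transfers += 1   # both sides ready: data moves on this clock
        else:
            waits += 1       # one side not ready: wait state
    return transfers, waits


if __name__ == "__main__":
    # The target inserts a wait state before each of its two data items.
    irdy = [True, True, True, True]
    trdy = [False, True, False, True]
    print(run_data_phases(irdy, trdy))
```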

[Figure: PCI transfer modes.]

- PCI bus lines (required)
  - System lines
    - Including clock and reset
  - Address and data
    - 32 time-multiplexed lines for address/data
    - Interrupt and validate lines
  - Interface control
  - Arbitration
    - Not shared
    - Direct connection to PCI bus arbiter
  - Error lines
- PCI bus lines (optional)
  - Interrupt lines
    - Not shared
  - Cache support
  - 64-bit bus extension
    - Additional 32 lines
    - Time multiplexed
    - 2 lines to enable devices to agree to use 64-bit transfer
  - JTAG/boundary scan
    - For testing procedures

- PCI commands
  - Transaction between initiator (master) and target
  - Master claims bus
  - Determine type of transaction
    - e.g. I/O read/write
  - Address phase
  - One or more data phases

[Figure: PCI bus transaction start. Waveforms over clocks 1-9 of CLK, GNT#, FRAME#, AD (address), C/BE# (command), and IRDY#, starting from bus idle.]

[Figure: PCI bus read. Waveforms of CLK, FRAME#, AD (ADDRESS, DATA-1, DATA-2, DATA-3), C/BE# (BUS CMD, then BE#'s), IRDY#, TRDY#, DEVSEL#.]

[Figure: PCI bus target read transaction.]

[Figure: PCI bus write.]

[Figure: PCI bus target write transaction.]

- PCI bus arbitration

[Figure: PCI bus arbitration timing.]

PCI Address Space

- A PCI target can implement up to three different types of address spaces
  - Configuration space
    - Stores basic information about the device
    - Allows the central resource or O/S to program a device with operational settings
  - I/O space
    - Used mainly with PC peripherals and not much else
  - Memory space
    - Used for just about everything else
- Configuration space
  - Contains basic device information, e.g., vendor or class of device
  - Also permits Plug-N-Play
    - Base address registers allow an agent to be mapped dynamically into memory or I/O space
    - A programmable interrupt-line setting allows a software driver to program a PC card with an IRQ upon power-up (without jumpers!)
- I/O space
  - This space is where basic PC peripherals (keyboard, serial port, etc.) are mapped
  - The PCI spec allows an agent to request 4 bytes to 2 GB of I/O space
- Memory space
  - This space is used by almost everything else; it is the general-purpose address space
    - The PCI spec recommends that a device use memory space, even if it is a peripheral
  - An agent can request between 16 bytes and 2 GB of memory space
    - The PCI spec recommends that an agent use at least 4 kB of memory space, to reduce the width of the agent's address decoder

PCI Express Bus

- PCI Express introduction
  - PCI Express architecture is a high-performance I/O interconnect for peripherals in computing and communication platforms
  - Evolved from the PCI and PCI-X architectures
  - PCI Express is a serial point-to-point interconnect between two devices
  - Scalable performance based on the number of signal lanes implemented on the PCI Express interconnect
  - Implements a packet-based protocol for information transfer
- PCI Express features
  - Point-to-point connection
  - Serial bus means fewer pins
  - Scalable: x1, x2, x4, x8, x12, x16, x32 (2.5 Gb/s)
  - Dual simplex connection
  - 2.5 GT/s per direction
  - Packet-based transaction protocol

[Figure: x1 lane between Device A and Device B, with clock and data in each direction. Packets are framed as Frame, Sequence Number, Request, CRC, Frame.]
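The lane scaling can be turned into a rough bandwidth estimate. The sketch assumes 8b/10b line encoding (8 payload bits for every 10 bits on the wire), which is how first-generation PCI Express links are encoded but is not stated in these slides, and it ignores packet overhead; `pcie_bandwidth_mb_s` is an illustrative helper.

```python
def pcie_bandwidth_mb_s(lanes, mega_transfers_per_s=2500,
                        payload_bits=8, line_bits=10):
    """Approximate payload bandwidth per direction, in MB/s.

    Assumes 8b/10b encoding by default and ignores packet overhead.
    """
    data_mbit_s = lanes * mega_transfers_per_s * payload_bits / line_bits
    return data_mbit_s / 8


if __name__ == "__main__":
    for lanes in (1, 2, 4, 8, 12, 16, 32):
        print(f"x{lanes}: {pcie_bandwidth_mb_s(lanes):.0f} MB/s per direction")
```

Because the link is dual simplex, the same figure applies independently in each direction.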

Transaction Types, Address Spaces

- Requests are translated to one of four transaction types by the Transaction Layer:
  - Memory Read or Memory Write. Used to transfer data from or to a memory-mapped location
  - I/O Read or I/O Write. Used to transfer data from or to an I/O location
  - Configuration Read/Write. Used to discover device capabilities, program features, and check status in the 4 KB PCI Express configuration space
  - Messages. Handled like posted writes. Used for event signaling and general-purpose messaging

[Figures: programmed I/O transaction; DMA transaction; peer-to-peer transaction.]

[Figure: PCI Express device layers.]

PLB Bus

- Connection infrastructure for high-bandwidth master and slave devices
- Fully synchronous to one clock
- Centralized bus arbitration: PLB arbiter
- 32- or 64-bit address (upper 32 bits are connected to GND)
- 32-, 64-, or 128-bit data bus
- Selectable shared bus or point-to-point interconnect topology
  - Point-to-point optimization available for 1-master, 1-slave configuration
  - Point-to-point topology supports 0-cycle latency via arbitration removal
- Selectable address pipelining support (2-level only)
- Dynamic master request priority based arbitration
- Vectored resets and address/qualifier registers

PLB Bus - Interconnect

- One to 16 PLB masters, each connecting all of its signals to the PLB arbiter
- The PLB arbiter multiplexes signals from masters onto a shared bus to which all the inputs of the slaves are connected
- One to n PLB slaves OR their outputs together to drive a shared bus back to the PLB arbiter
- The PLB arbiter handles bus arbitration and the movement of data and control signals between masters and slaves

PLB Bus - Bridge

- The PLB-to-PLB bridge is required when two PLB segments are connected
  - Different bus speed
  - Different bus width
- The bridge translates PLB transactions on one side into PLB transactions on the other side
- The bridge functions as a slave on one PLB side and a master on the other PLB side
- For a typical system with two PLB segments, one bridge is necessary for transactions originating from the processor
  - A second bridge is required if a peripheral on the other side is master capable and wants to address a peripheral on the processor side

OPB Bus

- The OPB bus decouples lower-bandwidth devices from the PLB
- It is a less complex protocol than PLB
  - No split transaction or address pipelining capability
- Centralized bus arbitration: OPB arbiter
- Connection infrastructure for the master and slave peripheral devices
- The OPB bus is designed to alleviate system performance bottlenecks by reducing capacitive loading on the PLB
  - Fully synchronous to one clock
  - Shared 32-bit address bus, shared 32-bit data bus
  - Supports single-cycle data transfers between the master and the slaves
  - Supports multiple masters, determined by arbitration implementation
  - The bridge function can be the master on the PLB or OPB
- Supports 16 masters and an unlimited number of slaves (limited by the expected performance)
- The OPB arbiter receives bus requests from the OPB masters and grants the bus to one of them
  - Fixed and dynamic (LRU) priorities
- Bus logic is implemented with AND-OR logic. Inactive devices drive zeros
- Read and write data buses can be separated to reduce loading on the OPB_DBus signal

AXI is Part of AMBA: Advanced Microcontroller Bus Architecture

[Figure: AMBA family. AMBA 3.0 (2003): APB, AHB, AXI, ATB. AMBA 4.0 (just announced): AXI-4 Memory Map, AXI-4 Stream, AXI-4 Lite. Annotations: "Same Spec", "Enhancements for FPGAs".]

Interface          | Features                                                          | Similar to
Memory Map / Full  | Traditional address/data burst (single address, multiple data)    | PLBv46, PCI
Streaming          | Data-only, burst                                                  | Local Link / DSP interfaces / FIFO / FSL
Lite               | Traditional address/data, no burst (single address, single data)  | PLBv46-single, OPB

ARM AXI

- AXI is an interface specification, not a bus specification
- AXI Interconnect IP
  - Implementation is not described in the spec
  - Several companies build and sell "AXI interconnect IP"
  - Xilinx is building its own

[Figure: PLB46 "shared access" bus (processor and peripherals attached over PLB through an arbiter) next to an AXI system (AXI masters and AXI slaves attached to an interconnect over AXI interfaces). Arrows indicate master/slave relationship, not direction of dataflow.]

Basic AXI Transactions

- Read address channel
- Read data channel
- Write address channel
- Write data channel
- Write response channel
  - Non-posted write model: there will always be a "write response"

ARM AXI – AXI4

- Also called full AXI, AXI Memory Mapped
- Single address, multiple data
- Burst up to 256 data beats
- Targeted Xilinx support

[Figures: AXI4 read; AXI4 write.]
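The "single address, multiple data" idea and the mandatory write response can be captured in a small behavioural model. This is not an AXI implementation: it has no handshaking or channels, it assumes an incrementing burst over 32-bit (4-byte) beats, and the class and method names are invented for the sketch. Only the 256-beat limit and the existence of a write response come from the text; "OKAY" is the name AXI uses for a normal successful response.

```python
MAX_AXI4_BURST_BEATS = 256


class AxiSlaveMemory:
    """Behavioural model of a memory-mapped slave accessed with bursts."""

    def __init__(self, data_bytes=4):
        self.data_bytes = data_bytes  # bytes per data beat
        self.mem = {}

    def _check_length(self, beats):
        if not 1 <= beats <= MAX_AXI4_BURST_BEATS:
            raise ValueError("burst length must be 1..256 beats")

    def write_burst(self, addr, beats):
        """One address, many data beats; always returns a write response."""
        self._check_length(len(beats))
        for i, beat in enumerate(beats):
            self.mem[addr + i * self.data_bytes] = beat
        return "OKAY"

    def read_burst(self, addr, length):
        """One address, `length` data beats returned in order."""
        self._check_length(length)
        return [self.mem.get(addr + i * self.data_bytes, 0)
                for i in range(length)]


if __name__ == "__main__":
    slave = AxiSlaveMemory()
    print(slave.write_burst(0x100, [1, 2, 3]))
    print(slave.read_burst(0x100, 3))
```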

ARM AXI – AXI4-Lite

- No burst
- Data width 32 or 64 only
  - Xilinx IP will only support 32 bits
- Simple "logic shim" to connect AXI4 master to AXI4-Lite slave
  - Reflect master's transaction ID

[Figures: AXI4-Lite read; AXI4-Lite write.]

ARM AXI – AXI4-Stream

- No address channel
- Not read and write, always just master to slave
- Unlimited burst length

[Figure: AXI4-Stream transfer.]






Address Management 93

¨  Embedded processor design requires you to manage the following:
    ¤  Address map for the peripherals
    ¤  Location of the application code in the memory space
        n  Block RAM
        n  External memory

¨  Memory requirements for your programs are based on the following:
    ¤  The amount of memory required for storing the instructions
    ¤  The amount of memory required for storing the data associated with the program

Address Management (MicroBlaze) 94

¨  Memory and peripherals
    ¤  The MicroBlaze processor uses 32-bit addresses

¨  Special addresses
    ¤  MicroBlaze processors must have writeable memory from 0x00000000 through 0x0000004F so that the boot sequence can update it
    ¤  Each vector consists of two instructions, an IMM followed by a BRAI, so that the branch can reach the full address range

Memory map (from the slide figure):

0x0000_0000                  Reset Address
0x0000_0008                  Exception Address
0x0000_0010                  Interrupt Address
0x0000_0018                  Break
0x0000_0020                  Hardware Exception
0x0000_0028 to 0x0000_004F   Reserved
up to 0xFFFF_FFFF            LMB Memory, PLB Memory, Peripherals
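To illustrate why each vector needs two instructions, here is a minimal sketch that builds the IMM + BRAI pair for a 32-bit handler address: IMM supplies the upper 16 bits and BRAI carries the lower 16 bits. The function name is hypothetical, and the opcode templates `0xB0000000` (IMM) and `0xB8080000` (BRAI) are stated as assumptions that should be checked against the MicroBlaze Processor Reference Guide before use.

```c
#include <stdint.h>

/* Vector addresses as listed on the slide. */
enum {
    MB_VEC_RESET        = 0x00,
    MB_VEC_EXCEPTION    = 0x08,
    MB_VEC_INTERRUPT    = 0x10,
    MB_VEC_BREAK        = 0x18,
    MB_VEC_HW_EXCEPTION = 0x20
};

/* Assumed instruction templates (verify against the reference guide). */
#define MB_OPC_IMM  0xB0000000u /* imm  <upper 16 bits>  */
#define MB_OPC_BRAI 0xB8080000u /* brai <lower 16 bits>  */

/* Fill out[0..1] with the two instruction words that branch to `handler`.
 * On a target, these words would be written to one of the vector
 * addresses above (8 bytes per vector). */
void mb_encode_vector(uint32_t handler, uint32_t out[2])
{
    out[0] = MB_OPC_IMM  | (handler >> 16);     /* upper half of address */
    out[1] = MB_OPC_BRAI | (handler & 0xFFFFu); /* lower half of address */
}
```

Because each vector is 8 bytes apart, exactly two 32-bit instructions fit in each slot, which matches the spacing of the addresses in the memory map.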

Address Management (ARM) 95

¨  Address Management
¨  All registers for both CPUs are grouped into two contiguous 4 KB pages
    ¤  Accessed through a dedicated internal bus
¨  Fixed at 0xF8F0_0000 with a register block size of 8 KB
    ¤  Each CPU uses an offset into this base address

Offset range     Registers
0x0000-0x00FC    SCU registers
0x0100-0x01FF    Interrupt controller interface
0x0200-0x02FF    Global timer
0x0600-0x06FF    Private timers and watchdog timers
0x1000-0x1FFF    Interrupt distributor

[Figure: address map; the only surviving label is 0xFFFC_0000]
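The base-plus-offset scheme described on this slide can be sketched in C as follows. The base address and block offsets are taken from the slide; the macro and function names are hypothetical and are not the names used by any Xilinx header.

```c
#include <stdint.h>

/* Base of the CPU private register block (from the slide). */
#define CPU_PRIV_BASE            0xF8F00000u

/* Block offsets inside the 8 KB register region (from the slide). */
#define SCU_OFFSET               0x0000u
#define GIC_CPU_IF_OFFSET        0x0100u
#define GLOBAL_TIMER_OFFSET      0x0200u
#define PRIVATE_TIMER_WDT_OFFSET 0x0600u
#define GIC_DIST_OFFSET          0x1000u

/* Absolute address of a register: base + block offset + register offset. */
uint32_t cpu_priv_addr(uint32_t block_offset, uint32_t reg_offset)
{
    return CPU_PRIV_BASE + block_offset + reg_offset;
}

/* On the target, the address is used through a volatile pointer so the
 * compiler performs every access. Dereferencing the result only makes
 * sense on the actual hardware. */
volatile uint32_t *cpu_priv_reg(uint32_t block_offset, uint32_t reg_offset)
{
    return (volatile uint32_t *)(uintptr_t)cpu_priv_addr(block_offset, reg_offset);
}
```

With these definitions the global timer block starts at 0xF8F0_0200 and the interrupt distributor at 0xF8F0_1000, which is the second 4 KB page of the block.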


¨  The compiler includes pre-compiled startup and end files when forming the executable

¨  Startup files set up the language and platform environment before your application code executes
    ¤  Set up vectors as required (reset, interrupt, exception, etc.)
    ¤  Set up registers (stack pointer, small data anchors, etc.)
    ¤  Clear the .bss memory region to zero
    ¤  Invoke language initialization functions, such as C++ constructors
    ¤  Initialize the hardware sub-system (e.g., initialize profiling timers)
    ¤  Set up arguments for the main procedure and invoke it

¨  End files include code that must execute after the program ends
    ¤  Invoke language cleanup functions, such as C++ destructors
    ¤  De-initialize the hardware sub-system (e.g., clean up the profiling sub-system)
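The order of operations performed by the startup and end files can be modelled on a host machine. The sketch below is an illustration only: `image_t`, `run_image`, and the `demo_*` names are hypothetical, and a real startup file is written in assembly and takes these values from linker-script symbols. Vector and register setup are omitted because they cannot be expressed portably in C.

```c
#include <stddef.h>
#include <string.h>

typedef void (*hook_fn)(void);

/* Hypothetical description of a program image. */
typedef struct {
    unsigned char *bss_start;   /* start of the region to clear     */
    size_t         bss_size;    /* size of that region in bytes     */
    const hook_fn *init_hooks;  /* e.g., C++ constructors           */
    size_t         n_init;
    const hook_fn *fini_hooks;  /* e.g., C++ destructors            */
    size_t         n_fini;
    int (*main_fn)(int argc, char **argv);
} image_t;

/* Model of the startup/end sequence around main(). */
int run_image(const image_t *img)
{
    size_t i;
    int rc;

    memset(img->bss_start, 0, img->bss_size);     /* clear .bss           */
    for (i = 0; i < img->n_init; i++)             /* language init        */
        img->init_hooks[i]();
    rc = img->main_fn(0, NULL);                   /* invoke main          */
    for (i = img->n_fini; i > 0; i--)             /* cleanup, in reverse  */
        img->fini_hooks[i - 1]();
    return rc;
}

/* Demo: a buffer with stale contents stands in for the .bss region. */
unsigned char demo_bss[4] = { 9, 9, 9, 9 };
int demo_trace[3];
int demo_trace_len;

static void demo_ctor(void) { demo_trace[demo_trace_len++] = 1; }
static void demo_dtor(void) { demo_trace[demo_trace_len++] = 3; }
static int demo_main(int argc, char **argv)
{
    (void)argc;
    (void)argv;
    demo_trace[demo_trace_len++] = 2;
    return demo_bss[0];  /* 0 only if the region was cleared first */
}

int demo_run(void)
{
    static const hook_fn init[] = { demo_ctor };
    static const hook_fn fini[] = { demo_dtor };
    image_t img = { demo_bss, sizeof demo_bss, init, 1, fini, 1, demo_main };

    demo_trace_len = 0;
    return run_image(&img);
}
```

Calling `demo_run()` records the order constructor, main, destructor in `demo_trace`, and `demo_main` sees zeroed memory because the clearing step runs before anything else.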


¨  The crt0.o initialization file is used when the executable runs in standalone mode (no debug)

¨  The C runtime file crt0.o is linked with the user program
    ¤  Starts at address location 0x0, immediately followed by the user program
    ¤  Populates reset, interrupt, exception and hardware exception vectors

[Figure: memory image starting at 0x00000000 with crt0.o first, followed by the main program]


¨  Object File Sections
¨  What is an object file?
    ¤  An object file is an assembled piece of code
        n  Machine language: li r31,0 = 0x3BE0 0000
    ¤  Constant data
    ¤  There may be references to external objects that are defined elsewhere
    ¤  This file may contain debugging information


Sectional Layout of an Object or an Executable file

.text      Text section
.rodata    Read-only data section
.sdata2    Small read-only data section (less than eight bytes)
.sbss2     Small read-only uninitialized data section
.data      Read-write data section
.sdata     Small read-write data section
.sbss      Small uninitialized data section
.bss       Uninitialized data section
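As a rough illustration of how C definitions map onto these sections, consider the sketch below. The variable and function names are hypothetical, and the exact placement (in particular the choice between the regular and the "small" variants of a section) depends on the compiler, its options, and the size threshold in use, so the comments describe typical behaviour rather than a guarantee.

```c
/* Initialized, read-only data: typically .rodata (or .sdata2 if small). */
const int lookup_table[4] = { 1, 2, 4, 8 };

/* Initialized, read-write data: typically .data (or .sdata if small). */
int calibration = 42;

/* Uninitialized data: typically .sbss or .bss; zeroed by the startup code. */
int sample_count;

/* Large uninitialized buffer: typically .bss. It occupies no space in the
 * executable file, only in memory at run time. */
int big_buffer[256];

/* Executable code: .text. */
int scale(int index)
{
    sample_count++;
    return lookup_table[index & 3] * calibration;
}
```

Inspecting the compiled object with a tool such as `objdump -h` or `size` shows how many bytes ended up in each section.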


¨  Linker scripts control the linking process ¤  Map the code and data to a specified memory space ¤  Set the entry point to the executable ¤  Reserve space for the stack

¨  Required if the design contains a discontinuous memory space

101






Embedded Development Kit (EDK) 102

¨  What is Embedded Development Kit (EDK)?
    ¤  The Embedded Development Kit is the Xilinx software suite for designing complete embedded programmable systems
    ¤  The kit includes all the tools, documentation, and IP that you require for designing systems with embedded hard processor cores, and/or Xilinx MicroBlaze™ soft processor cores
    ¤  It enables the integration of both hardware and software components of an embedded system


[Figure: the Embedded Development Kit joins two standard flows in three numbered steps.
Standard FPGA HW development flow: VHDL or Verilog, HDL entry, simulation/synthesis, implementation, download bitstream into FPGA, ChipScope.
Standard embedded SW development flow: C code, code entry, C/C++ cross compiler, linker, load software into FLASH, debugger.
Embedded Development Kit: instantiate the “System Netlist” and implement the FPGA (compiled BIT); include the BSP (RTOS, Board Support Package) and compile the software image (compiled ELF); Data2MEM; download combined image to FPGA.]


A.  Develop the embedded hardware in XPS
    ¤  Quickly create a system targeting a board using Base System Builder Wizard
    ¤  Extend the hardware system, if necessary, by adding peripherals from the IP Catalog
    ¤  Generate HDL netlists using PlatGen

B.  Develop the embedded software in SDK
    ¤  Generate libraries and drivers with LibGen
    ¤  Create and debug the software application using Software Development Kit (SDK)
    ¤  Optionally, debug the application using Xilinx Microprocessor Debug (XMD) and the GNU debugger (gdb)

C.  Operate in hardware
    ¤  Generate the bitstream and configure the FPGA using iMPACT

D.  Deploy
    ¤  Initialize external flash memory using the Flash Writer utility or boot from an external compact flash configuration file generated using the System ACE File generator (GenACE) script

Embedded Development Kit (EDK). Debugging Tools

105

¨  Debugging is an integral part of embedded systems development
¨  The debugging process is defined as testing, stabilizing, localizing, and correcting errors
¨  Two methods of debugging:
    ¤  Hardware debugging via a logic probe, logic analyzer, in-circuit emulator, or background debugger
    ¤  Software debugging via a debugging instrument
        n  A software debugging instrument is source code that is added to the program for the purpose of debugging

¨  Debugging types:
    ¤  Functional debugging
    ¤  Performance debugging
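A software debugging instrument, in the sense used above, can be as small as a trace macro that is compiled in or out. The following sketch is one possible form, with hypothetical names (`TRACE`, `DEBUG_TRACE`, `saturating_add_u8`); on a bare-metal target the output call would be replaced by whatever print routine the board support package provides.

```c
#include <stdio.h>

/* Set to 0 to remove the instrument from release builds. */
#ifndef DEBUG_TRACE
#define DEBUG_TRACE 1
#endif

#if DEBUG_TRACE
/* Prints the function name and line number, then the message. */
#define TRACE(...)                                          \
    do {                                                    \
        fprintf(stderr, "[%s:%d] ", __func__, __LINE__);    \
        fprintf(stderr, __VA_ARGS__);                       \
        fputc('\n', stderr);                                \
    } while (0)
#else
#define TRACE(...) do { } while (0)
#endif

/* Example function under test: adds two values and clamps to 255. */
int saturating_add_u8(int a, int b)
{
    int sum = a + b;

    TRACE("a=%d b=%d sum=%d", a, b, sum);
    if (sum > 255) {
        TRACE("result clamped to 255");
        sum = 255;
    }
    return sum;
}
```

Instruments like this change timing and code size, so they suit functional debugging better than performance debugging.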


106

¨  EDK supports hardware debugging via:
    ¤  ChipScope™ Pro tool cores, which are available to a Xilinx Platform Studio design
        n  PLB IBA (Integrated Bus Analyzer)
        n  ILA (Integrated Logic Analyzer)
        n  VIO (Virtual I/O)
    ¤  Co-debug of software with GNU gdb and of hardware with ChipScope Analyzer


107

¨  EDK supports software debugging via:
    ¤  GNU Debugger (GDB)
        n  Software debugger that runs on PC
    ¤  Microprocessor Debug Module (MDM)
        n  Debug interface in MicroBlaze system
    ¤  Xilinx Microprocessor Debugger (XMD)
        n  Facilitates an interface between the GNU tools and the MicroBlaze MDM

108






109

¨  Introduction
¨  MicroBlaze and ARM ZYNQ Features
¨  Hardware/Software co-design. Address Management
¨  Embedded Development Kit (EDK).
¨  OS vs RTOS (Real Time OS)
¨  Interrupts, Exceptions, Watch-Dog, …

OS vs RTOS (Real-Time OS) 110

¨  What's an Operating System?
    ¤  Provides environment for executing programs
    ¤  Process abstraction for multitasking/concurrency
        n  Scheduling
    ¤  Hardware abstraction layer (device drivers)
    ¤  File-systems
    ¤  Communication
    ¤  We will focus on concurrent, real-time issues


¨  Real Time System
    ¤  A system is said to be Real Time if it is required to complete its work and deliver its services on time.
    ¤  Example: Flight Control System
        n  All tasks in that system must execute on time.
    ¤  Non-example: PC system
    ¤  Hard Real Time System
        n  Failure to meet deadlines is fatal
        n  Example: Flight Control System
    ¤  Soft Real Time System
        n  Late completion of jobs is undesirable but not fatal.
        n  System performance degrades as more and more jobs miss deadlines
        n  Example: online databases


¨  Typical RTOS Task Model
    ¤  Each task is a triplet: (execution time, period, deadline)
    ¤  Usually, deadline = period
    ¤  Can be initiated any time during the period

[Figure: timeline of one task showing its initiation, execution time, period, and deadline along the time axis]
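The task triplet maps naturally onto a small C structure. The sketch below, with hypothetical names, stores each task as (execution time, period, deadline) and computes the total processor utilization, the sum of execution time divided by period. A utilization above 1 means that a single processor cannot possibly meet all deadlines; a value at or below 1 is only a necessary condition, not a guarantee for any particular scheduler.

```c
#include <stdbool.h>
#include <stddef.h>

/* One periodic task, all times in microseconds. */
typedef struct {
    unsigned exec_us;     /* worst-case execution time */
    unsigned period_us;   /* activation period         */
    unsigned deadline_us; /* relative deadline         */
} rt_task_t;

/* Sum of exec/period over all tasks. */
double total_utilization(const rt_task_t *tasks, size_t n)
{
    double u = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        u += (double)tasks[i].exec_us / (double)tasks[i].period_us;
    return u;
}

/* Necessary (not sufficient) sanity checks for a single processor:
 * every task fits inside its own deadline, and utilization <= 1. */
bool passes_basic_checks(const rt_task_t *tasks, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tasks[i].period_us == 0 || tasks[i].exec_us > tasks[i].deadline_us)
            return false;
    }
    return total_utilization(tasks, n) <= 1.0;
}
```

For three tasks (200, 1000, 1000), (500, 2000, 2000), and (1000, 4000, 4000) the utilization is 0.2 + 0.25 + 0.25 = 0.7, so the set passes the basic checks.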


¨  Hard real-time system with multirate behavior: Fly-by-wire Avionics

[Figure: data flow from sensors through signal conditioning and control laws to actuating and actuators]

Sensors and signal conditioning:
    INU (gyros, accel.)   1 kHz
    GPS                   20 Hz
    Air data (sensor)     1 kHz
    Joystick (stick)      500 Hz
Control laws:
    Pitch control         500 Hz
    Lateral control       250 Hz
    Throttle control      250 Hz
Actuating and actuators:
    Aileron 1             1 kHz
    Aileron 2             1 kHz
    Elevator              1 kHz
    Rudder                1 kHz


¨  Features of RTOSs
    ¤  Scheduling.
    ¤  Resource Allocation.
        n  The same issues as in scheduling apply here.
        n  Resources can be allocated by
            n  Weighted Round Robin
            n  Priority Based
    ¤  Interrupt Handling.
        n  Interrupt latency should be very small
            n  Kernel has to respond to real time events.
            n  Interrupts should be disabled for the minimum possible time
    ¤  Other issues like kernel size. For embedded applications the kernel should be small and should fit in ROM. No virtual memory. No protection


¨  Scheduling Algorithms in RTOS:
    ¤  Clock Driven Scheduling
        n  All parameters about jobs (release time / execution time / deadline) known in advance. Minimal runtime overhead.
    ¤  Weighted Round Robin Scheduling
        n  Jobs scheduled in FIFO manner. Time quantum given to a job is proportional to its weight
    ¤  Priority Scheduling (Greedy / List / Event Driven)
        n  Processor never left idle when there are ready tasks. Processor allocated to processes according to priorities, which are either static (assigned at design time) or dynamic (assigned at runtime).
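As an illustration of weighted round robin, the sketch below runs one round over a FIFO-ordered job array, giving each job a time slice equal to its weight multiplied by a base quantum. The names (`wrr_job_t`, `wrr_round`) and the tick-based time unit are assumptions made for this example, not part of any particular RTOS.

```c
#include <stddef.h>

/* One job in the round-robin queue. */
typedef struct {
    const char *name;
    unsigned    weight;    /* share of the processor            */
    unsigned    remaining; /* ticks of work still to be done    */
} wrr_job_t;

/* Run one round: visit the jobs in FIFO (array) order and give each
 * a slice of weight * base_quantum ticks, or less if the job finishes.
 * Returns the number of ticks consumed in this round. */
unsigned wrr_round(wrr_job_t *jobs, size_t n, unsigned base_quantum)
{
    unsigned used = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned slice = jobs[i].weight * base_quantum;

        if (slice > jobs[i].remaining)
            slice = jobs[i].remaining;
        jobs[i].remaining -= slice;
        used += slice;
    }
    return used;
}
```

With jobs A (weight 1) and B (weight 3), 10 ticks of work each, and a base quantum of 2, the first round gives A 2 ticks and B 6 ticks, so B progresses three times as fast as A.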


¨  Priority-based Preemptive Scheduling ¤  Always run the highest-priority runnable process

[Figure: timeline of three processes labeled 1, 2, and 3]

¨  Multiple processes at the same priority level?
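One common answer to the question above is to take turns among processes of equal priority. The following sketch selects the highest-priority runnable process and starts its search just after the process that ran last, so that ties are broken round-robin. The names and the convention that a larger number means a higher priority are assumptions of this example.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  priority; /* larger value = higher priority (assumption) */
    bool runnable;
} proc_t;

/* Returns the index of the process to run next, or -1 if none is runnable.
 * `last` is the index of the process that ran most recently (-1 if none).
 * The scan begins after `last`, and only a strictly higher priority
 * replaces the current candidate, so equal-priority processes alternate. */
int pick_next(const proc_t *procs, size_t n, int last)
{
    int best = -1;
    size_t start, k;

    if (n == 0)
        return -1;
    start = (last < 0) ? 0 : ((size_t)last + 1) % n;
    for (k = 0; k < n; k++) {
        size_t i = (start + k) % n;

        if (!procs[i].runnable)
            continue;
        if (best < 0 || procs[i].priority > procs[best].priority)
            best = (int)i;
    }
    return best;
}
```

A preemptive kernel would call a function like this whenever a process becomes runnable or blocks, and on each timer tick if time slicing among equal priorities is wanted.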


¨  Linux for Real Time Applications.
    ¤  Scheduling
        n  Priority Driven Approach
            n  Optimize average case response time.
        n  Interactive Processes Given Highest Priority
            n  Aim to reduce response times of processes.
        n  Real Time Processes
            n  Processes with high priority.
            n  No notion of deadlines.
    ¤  Resource Allocation
        n  No support for handling priority inversion.


¨  Interrupt Handling in Linux
    ¤  Interrupts are disabled in ISR/critical sections of the kernel
    ¤  No worst case bound on interrupt latency available
        n  e.g., disk drivers may disable interrupts for a few hundred milliseconds
    ¤  Not suitable for Real Time Applications
        n  Interrupts may be missed
    ¤  Processes are non pre-emptible in Kernel Mode
        n  System calls like fork take a lot of time
        n  High priority thread might wait for a low priority thread to complete its system call
    ¤  Processes are heavy weight
        n  Context switch takes several hundred microseconds


¨  RTLinux
    ¤  Real Time Kernel at the lowest level.
    ¤  Linux Kernel is a low priority thread.
        n  Executed only when no real time tasks are ready to run
    ¤  Interrupts trapped by the Real Time Kernel and passed on to the Linux Kernel
        n  Software emulation of hardware interrupts
            n  Interrupts are queued by RTLinux.
            n  Software emulation of disable_interrupt().
    ¤  Real Time Tasks
        n  Statically allocate memory. No address space protection
    ¤  Non Real Time Tasks are developed in Linux.
    ¤  Communication
        n  Queues, Shared memory.
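The idea of emulating interrupt disabling in software can be sketched as follows. This is a simplified model with hypothetical names, not RTLinux source code: when Linux "disables interrupts" only a flag is cleared, the real-time layer keeps receiving hardware interrupts and queues those meant for Linux, and the queued interrupts are delivered once Linux enables interrupts again.

```c
#include <stdbool.h>
#include <stdint.h>

typedef void (*irq_handler_t)(int irq);

typedef struct {
    bool          soft_enabled;  /* Linux's virtual interrupt-enable flag */
    uint32_t      pending;       /* one bit per queued IRQ (0..31)        */
    irq_handler_t linux_handler; /* where queued IRQs are delivered       */
} soft_irq_t;

/* Deliver queued interrupts to Linux while it accepts them. */
static void deliver_pending(soft_irq_t *s)
{
    int irq;

    for (irq = 0; irq < 32 && s->soft_enabled; irq++) {
        uint32_t bit = UINT32_C(1) << irq;

        if (s->pending & bit) {
            s->pending &= ~bit;
            s->linux_handler(irq);
        }
    }
}

/* Emulated disable: hardware interrupts stay enabled. */
void soft_disable(soft_irq_t *s) { s->soft_enabled = false; }

/* Emulated enable: flush whatever was queued in the meantime. */
void soft_enable(soft_irq_t *s)
{
    s->soft_enabled = true;
    deliver_pending(s);
}

/* Called by the real-time layer for an interrupt destined for Linux. */
void hw_irq(soft_irq_t *s, int irq)
{
    if (irq < 0 || irq > 31)
        return;
    s->pending |= UINT32_C(1) << irq;
    if (s->soft_enabled)
        deliver_pending(s);
}

/* Demo handler that records delivered IRQ numbers. */
int demo_log[8];
int demo_log_len;

void demo_linux_handler(int irq)
{
    if (demo_log_len < 8)
        demo_log[demo_log_len++] = irq;
}
```

Because the hardware interrupt line is never masked by Linux, the real-time kernel can still react to its own interrupts with bounded latency while Linux believes interrupts are off.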


¨  RTLinux Framework


¨  LynxOS
    ¤  Microkernel Architecture
        n  Kernel provides scheduling/interrupt handling
    ¤  Additional features through Kernel Plug-Ins (KPIs)
        n  TCP/IP stack, Filesystem. KPIs are multithreaded
    ¤  Memory Protection / Demand Paging Optional.
    ¤  Development and Deployment on the same host.
        n  OS support for compilers/debuggers

¨  VxWorks
    ¤  Monolithic Architecture. RT POSIX compliant. Cross development system

¨  pSOS - Object Oriented OS
