Virtualisation, Front Side Buses & SMP Systems
COMP311 2007
Jamie Curtis
Virtualisation
Allows a "Host" OS to execute a "Guest" OS as a task.
No need for the "Guest" OS's cooperation.
The host is often called a Hypervisor or VMM.
The VMM controls devices, and often presents virtual devices to the guest OSes.
Virtualisation cont.
The PC architecture makes this tricky.
Protected mode introduced privilege rings: four levels (0, 1, 2, 3) with decreasing privilege.
Operating systems assume full control (ring 0), with userspace at ring 3.
There is no control on reading privileged state.
Some instructions silently fail if not run at the correct privilege level instead of causing a fault.
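One classic example of such an instruction is SGDT, which reveals the (privileged) GDT location from userspace without faulting. A minimal sketch, assuming GCC or Clang on x86-64 Linux (note that modern CPUs with UMIP enabled will fault here; 2007-era hardware succeeds silently):

```c
/* Minimal sketch (assumes GCC/Clang on x86-64 Linux): SGDT reads
 * privileged CPU state, the GDT register, yet classically executes
 * in ring 3 without faulting; exactly the behaviour that breaks
 * naive trap-and-emulate virtualisation. */
#include <stdio.h>
#include <stdint.h>

struct __attribute__((packed)) gdtr {
    uint16_t limit;
    uint64_t base;
};

int main(void)
{
    struct gdtr g;
    __asm__ volatile ("sgdt %0" : "=m"(g));  /* no trap, no fault */
    printf("GDT base=0x%llx limit=0x%x (read from userspace)\n",
           (unsigned long long)g.base, g.limit);
    return 0;
}
```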
Virtualisation cont.
Four approaches to fix this: Emulation, Paravirtualisation, Binary Translation, Hardware Assisted.

Emulation
Fake the entire machine.
Very slow, but doesn't require the same host architecture as the guest.
Paravirtualisation
Alter the guest so that it doesn't use the "bad" instructions; the guest instead calls out to the VMM.
Very fast, with minimal overhead.
Requires support from all guest OSes, so it is effectively limited to open-source OSes.
Championed by Xen.
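To make "calls out to the VMM" concrete, here is a rough sketch of the mechanism. The hypercall number and helper names are hypothetical, not a real ABI, though classic 32-bit Xen really did enter the hypervisor via a software interrupt:

```c
/* Sketch only: guest-kernel code; actually running this outside a
 * paravirtualised guest would simply fault. */
#define HCALL_DISABLE_IRQS 1  /* hypothetical hypercall number */

static inline long hypercall0(long nr)
{
    long ret;
    __asm__ volatile ("int $0x82"  /* trap into the VMM */
                      : "=a" (ret)
                      : "a" (nr)
                      : "memory");
    return ret;
}

/* A paravirtualised guest replaces the privileged "cli" instruction
 * with an explicit call like this: */
static inline void guest_disable_irqs(void)
{
    hypercall0(HCALL_DISABLE_IRQS);
}
```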
Binary Translation
The VMM dynamically translates the instruction stream before it is executed, replacing "bad" instructions as it goes.
Lots of optimisations make this not nearly as bad as it sounds.
Allows running unmodified OSes.
The performance hit can be anything from 5-60% depending on workload.
Championed by VMware.
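A toy sketch of the rewrite step, using a made-up fixed-width instruction set (real x86 BT must decode variable-length instructions, handle branches, and cache translated blocks):

```c
/* Toy binary translator over a made-up 2-byte instruction set
 * (opcode, operand).  It copies guest code into a code cache,
 * rewriting the privileged opcode into an explicit VMM call-out. */
#include <stdio.h>
#include <stdint.h>

enum { OP_ADD = 0x01, OP_PRIV_IO = 0x7f, OP_CALL_VMM = 0x80 };

static void translate(const uint8_t *guest, size_t len, uint8_t *cache)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        cache[i]     = (guest[i] == OP_PRIV_IO) ? OP_CALL_VMM : guest[i];
        cache[i + 1] = guest[i + 1];  /* operands pass through */
    }
}

int main(void)
{
    const uint8_t guest[] = { OP_ADD, 5, OP_PRIV_IO, 3, OP_ADD, 1 };
    uint8_t cache[sizeof guest];
    translate(guest, sizeof guest, cache);
    for (size_t i = 0; i < sizeof guest; i += 2)
        printf("%02x %u\n", cache[i], cache[i + 1]);
    return 0;
}
```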
Hardware Virtualisation
Adds another privilege level for the VMM.
Hardware maintains state for each guest.
The VMM gets very flexible control over what causes faults.
Intel and AMD again have similar (but incompatible!) specs: AMD "Pacifica", Intel "Vanderpool".
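Detecting the two extensions is straightforward, and the CPUID bits below are the documented ones (VMX: leaf 1, ECX bit 5; SVM: leaf 0x80000001, ECX bit 2). Sketch assumes GCC/Clang's <cpuid.h>:

```c
/* Check for hardware virtualisation support on x86. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned a, b, c, d;

    if (__get_cpuid(1, &a, &b, &c, &d))
        printf("Intel VT-x (VMX): %s\n", (c & (1u << 5)) ? "yes" : "no");

    if (__get_cpuid(0x80000001, &a, &b, &c, &d))
        printf("AMD-V (SVM):      %s\n", (c & (1u << 2)) ? "yes" : "no");

    return 0;
}
```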
Hardware Virtualisation cont.
Initial solutions don't deal with enough in hardware, often making them slower than binary translation.
Transitions into and out of guests are very slow.
VMware and Xen have both added support.
Hardware support is getting more complete in newer versions.
Virtual I/O Devices
Typically VMMs provide virtual hardware drivers to the guests.
These may then map onto real hardware inside the VMM.
Allowing secure and separated direct access to hardware is difficult.
AMD 10h Family
AMD's latest architecture adds a number of important virtualisation enhancements, primarily focused on making fewer calls into the VMM.
Three main enhancements:
Nested Page Tables
Tagged TLB
Device Exclusion Vector (DEV)
Nested Page Tables
Current designs use shadow page tables: the MMU is set up for the current guest/process, and changes to the MMU are trapped by the VMM.
Nested page tables replicate the page table per VM. This looks similar to the virtual address space the OS provides to processes, just one more level up.
Makes lookups through the page table slower.
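To see why: every guest page-table reference is itself a guest-physical address that must be walked through the host's tables, so a 4-level guest over a 4-level host can cost up to 24 memory reads. A back-of-envelope check:

```c
/* Cost of a nested (2-D) page walk: for g guest levels and h host
 * levels the worst case is (g+1)*(h+1) - 1 memory reads. */
#include <stdio.h>

static int nested_walk_cost(int g, int h)
{
    return (g + 1) * (h + 1) - 1;
}

int main(void)
{
    printf("native 4-level walk:   %d reads\n", 4);
    printf("nested 4x4 worst case: %d reads\n", nested_walk_cost(4, 4));
    return 0;
}
```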
Tagged TLB
TLBs cache lookups through the page tables.
Typically they are flushed each time you switch process or VM.
Tagged TLBs add a tag recording which VM each entry relates to.
No longer have to flush the TLB when switching VMs.
Makes VM - VMM - VM transitions cheaper.
Reduces the performance hit of nested page tables.
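Conceptually, each TLB entry just carries an extra identifier (AMD calls it an ASID) that a lookup must match; a minimal model:

```c
/* Minimal model of a tagged TLB: on a VM switch the VMM changes the
 * current ASID instead of flushing; stale entries stop matching. */
#include <stdint.h>
#include <stddef.h>

struct tlb_entry {
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
    uint16_t asid;  /* which VM this translation belongs to */
    int      valid;
};

#define TLB_SIZE 64
static struct tlb_entry tlb[TLB_SIZE];

/* Hit only if both the page number and the ASID match. */
static struct tlb_entry *tlb_lookup(uint64_t vpn, uint16_t cur_asid)
{
    for (size_t i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == cur_asid)
            return &tlb[i];
    return NULL;  /* miss: walk the (nested) page tables */
}

int main(void)
{
    tlb[0] = (struct tlb_entry){ .vpn = 0x42, .pfn = 0x1234,
                                 .asid = 7, .valid = 1 };
    /* Same page, different VM: a miss, but no flush was needed. */
    return tlb_lookup(0x42, 9) == NULL ? 0 : 1;
}
```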
Device Exclusion Vector
Contains a mapping of which pages a device can access.
Allows the VMM to give DMA access to specific pages.
Allows a device to do DMA directly into a specific VM's memory space.
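The DEV is essentially a bit vector over physical pages, consulted on each DMA request; a sketch of the idea (the exact bit semantics here are an assumption, with a set bit meaning "blocked"):

```c
/* Sketch of the DEV idea: one bit per physical page saying whether
 * a device's protection domain may DMA to it.  The VMM clears bits
 * only for pages belonging to the VM that owns the device. */
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES (1u << 20)  /* e.g. 4GB of 4KB pages */

static uint8_t dev_bitmap[NPAGES / 8];  /* assumed: set = blocked */

static int dma_allowed(uint64_t phys_addr)
{
    uint64_t page = phys_addr >> PAGE_SHIFT;
    return !(dev_bitmap[page / 8] & (1u << (page % 8)));
}

int main(void)
{
    dev_bitmap[0] |= 1;  /* VMM blocks DMA to physical page 0 */
    return (dma_allowed(0x1000) && !dma_allowed(0x0)) ? 0 : 1;
}
```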
Virtualisation Future
Linux now has four full virtualisation packages: VMware, Xen, KVM, Lguest.
Intel and AMD are working on next-gen hardware support.
Dense, multi-core systems make virtualisation very attractive: power and space efficient.
Virtualisation keeps getting more and more important.
Traditional Intel Architecture
[Diagram: CPU connected via the Front Side Bus to the Northbridge, which connects to the Southbridge]
Clock vs. Data Rates
Clock rates no longer equal data rates; watch specifications, as both can look identical.
e.g. Intel's FSB is normally referred to as a 1333MHz bus. It is actually a 333MHz clock with a quad-pumped data rate, so it should really be 1333MT/s.
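The arithmetic, taking the 64-bit bus width from the next slide:

```c
/* Clock rate vs. transfer rate: a 333MHz clock moving data four
 * times per cycle is marketed as "1333MHz" but is really ~1333MT/s;
 * with an 8-byte (64-bit) data path that is ~10.6GB/s peak. */
#include <stdio.h>

int main(void)
{
    double clock_mhz   = 333.0;  /* really 333 1/3 in the marketing */
    int    pumps       = 4;      /* transfers per clock (quad-pumped) */
    int    width_bytes = 8;      /* 64-bit bus */

    double mts  = clock_mhz * pumps;           /* mega-transfers/s */
    double gbps = mts * width_bytes / 1000.0;  /* GB/s peak */

    printf("%.0f MT/s, %.2f GB/s peak\n", mts, gbps);
    return 0;
}
```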
Frontside Bus & Dual Core
64-bit, quad-pumped 333MHz: 8 * 4 * 333 = 10.6GB/s.
Half duplex.
The traditional P4 FSB has been point to point; Xeon SMP requires chipsets to support a proper multipoint bus for the FSB.
How did a dual-core Pentium D work?
Intel Dual Core
Slap two cores on the same die! How do they communicate?
Over the FSB, like a Xeon! Requires a new chipset to support it.
Caches
There are two sets of L1 and L2 cache, one for each core.
What happens if both cores are caching the same memory location?
We need a protocol to make sure this doesn't cause problems: a cache coherency protocol.
Cache Coherency
Intel uses the MESI protocol:
Modified: one cache contains a modified copy of the location.
Exclusive: one cache contains an unmodified copy of the location.
Shared: two or more caches contain unmodified copies of the location.
Invalid: another cache contains a modified copy of the location.
MESI
What happens if a core has an Invalid entry but tries to access it? The cache with the Modified entry needs to write the entry back to memory and become a Shared entry.
This means an Intel Pentium D has to involve the FSB in all of its cache coherency updates, even though both cores are on the same die.
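A condensed model of that read-miss case (states and the snoop response only; real hardware tracks this per cache line):

```c
/* Condensed MESI model of the read miss: the Modified peer writes
 * the line back and both caches end up Shared.  On a Pentium D this
 * whole exchange is snooped across the external FSB. */
#include <stdio.h>

enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };
static const char *name[] = { "M", "E", "S", "I" };

/* 'self' reads a line it holds Invalid; 'other' is the peer cache. */
static void read_line(enum mesi *self, enum mesi *other)
{
    if (*self != INVALID)
        return;  /* cache hit, no bus traffic */
    if (*other == MODIFIED)
        *other = SHARED;  /* peer writes the dirty line back via the FSB */
    else if (*other == EXCLUSIVE)
        *other = SHARED;
    *self = (*other == SHARED) ? SHARED : EXCLUSIVE;
}

int main(void)
{
    enum mesi core0 = MODIFIED, core1 = INVALID;
    read_line(&core1, &core0);
    printf("core0=%s core1=%s\n", name[core0], name[core1]);  /* S S */
    return 0;
}
```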
Intel Core
The Core architecture is designed for dual-core. The cache is now shared between cores, and cache coherency now runs between each core's L1 and the shared L2.
[Diagram: Core1 and Core2 each connect to a shared L2 cache through the L2 control logic, with bus control linking to the FSB]
Intel Core 2 Quad
The current quad-core design puts two dual-core dies together in one package.
Cache coherency between the dies again happens over the FSB.
AMD K8 Architecture
Integrated memory controller: very low latency, but the CPU now determines the memory technology.
Requires both CPU and motherboard to be changed for a new type of memory.
Athlon 64, Socket 754: single-channel, 64-bit DDR at 200MHz. 8 * 2 * 200 = 3.2GB/s.
AMD K8 Memory Interfaces
Athlon 64, Socket 939: dual-channel, 64-bit DDR at 200MHz. 2 * 8 * 2 * 200 = 6.4GB/s.
Athlon 64, Socket AM2: dual-channel, 64-bit DDR2 at 400MHz. 2 * 8 * 2 * 400 = 12.8GB/s.
Opteron, Socket 940: dual-channel, 64-bit DDR at 200MHz. 2 * 8 * 2 * 200 = 6.4GB/s.
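All of these figures follow the same pattern: channels * width-in-bytes * transfers-per-clock * clock. A quick check:

```c
/* Peak memory bandwidth = channels * width(bytes) * transfers per
 * clock * clock(MHz).  Reproduces the figures above. */
#include <stdio.h>

static double bw_gbs(int channels, int width_bytes, int pump, int mhz)
{
    return channels * width_bytes * pump * (double)mhz / 1000.0;
}

int main(void)
{
    printf("Socket 754 (DDR-400 x1):  %.1f GB/s\n", bw_gbs(1, 8, 2, 200));
    printf("Socket 939 (DDR-400 x2):  %.1f GB/s\n", bw_gbs(2, 8, 2, 200));
    printf("Socket AM2 (DDR2-800 x2): %.1f GB/s\n", bw_gbs(2, 8, 2, 400));
    return 0;
}
```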
AMD K8 FSB
AMD uses HyperTransport technology (don't confuse it with the very different Intel Hyper-Threading Technology).
A packet-based, point-to-point link that provides the lowest possible latency.
Full-duplex, bi-directional link.
Available 2, 4, 8, 16 or 32 bits wide.
50MHz - 1.4GHz clock rates; clock rates and bit widths can be asymmetric.
Double-pumped data rate: 1.4GHz * 2 * 4 bytes = 11.2GB/s per direction for a 32-bit link.
Athlon 64 FSB
The Athlon 64 has a single HyperTransport link to connect to the I/O subsystem: 16-bit, bi-directional, DDR at 800MHz (Socket 754) or 1GHz (Socket 939). 2 * 2 * 2 * 1000 = 8GB/s.
Can use either a single-chip solution or stay with the two-chip northbridge/southbridge combination.
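The same arithmetic for HyperTransport links, per direction and then doubled for the full-duplex aggregate the slide quotes:

```c
/* HyperTransport link bandwidth: width(bytes) * 2 (double-pumped)
 * * clock(MHz), per direction; the full-duplex aggregate doubles it. */
#include <stdio.h>

static double ht_gbs(int width_bytes, int mhz)
{
    return width_bytes * 2 * (double)mhz / 1000.0;  /* per direction */
}

int main(void)
{
    /* 32-bit link at 1.4GHz: the 11.2GB/s per direction quoted above */
    printf("32-bit @ 1.4GHz: %.1f GB/s per direction\n", ht_gbs(4, 1400));

    /* Athlon 64 (939): 16-bit at 1GHz, quoted as 8GB/s aggregate */
    printf("16-bit @ 1GHz:   %.1f GB/s aggregate\n", 2 * ht_gbs(2, 1000));
    return 0;
}
```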
Opteron
Uses HT for both the I/O interconnect and the CPU interconnect.
The CPU interconnect requires an additional cache coherency extension to HT.
Comes in three variants:
1xx - memory controller and 3 HT links
2xx - memory controller and 3 HT links
8xx - memory controller and 3 HT links
Opteron cont.
What makes them different? The number of HT buses that support the CC protocol:
1xx - zero
2xx - one
8xx - three
This allows scaling to different numbers of processors: 1xx - 1-way, 2xx - 2-way, 8xx - 4- or 8-way.
AMD MOESI
AMD's cache coherency is slightly different to Intel's, adding an Owner state:
Owner: this cache owns the memory location, and it (not memory) services all requests for it from other caches.
The request goes across the high-speed, dedicated HT bus.
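The practical difference from the MESI read-miss shown earlier: the dirty line never has to be written back to memory first. A sketch of just that transition, in the same toy style as the MESI model above:

```c
/* MOESI read-miss sketch: unlike MESI, the dirty line is shared
 * without a memory writeback.  The Modified peer moves to Owned and
 * forwards the data itself (over the dedicated HT link on Opteron). */
enum moesi { M_MODIFIED, M_OWNED, M_EXCLUSIVE, M_SHARED, M_INVALID };

static void read_line(enum moesi *self, enum moesi *peer)
{
    if (*self != M_INVALID)
        return;  /* hit */
    if (*peer == M_MODIFIED || *peer == M_OWNED) {
        *peer = M_OWNED;   /* peer keeps the dirty data ... */
        *self = M_SHARED;  /* ... and forwards a copy to us */
    } else if (*peer == M_EXCLUSIVE || *peer == M_SHARED) {
        *peer = M_SHARED;
        *self = M_SHARED;
    } else {
        *self = M_EXCLUSIVE;  /* fetched from memory */
    }
}

int main(void)
{
    enum moesi core0 = M_MODIFIED, core1 = M_INVALID;
    read_line(&core1, &core0);
    /* core0 is now Owned, core1 Shared, and memory was never touched */
    return (core0 == M_OWNED && core1 == M_SHARED) ? 0 : 1;
}
```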
AMD Dual Core
In the dual-core situation it gets even better: requests now go across the System Request Interface, which runs at the CPU core frequency.
The System Request Interface also controls HT-to-memory access as well as HT-to-HT communication in 2xx and 8xx Opterons.
AMD Quad Core
AMD again targeted native quad-core instead of dual-die dual-core.
Introduced a shared L3 cache.