Virtualisation, Front Side Buses & SMP Systems
COMP311 2007
Jamie Curtis
Virtualisation
Allows a "Host" OS to execute a "Guest" OS as a task.
No need for the "Guest" OS's cooperation.
The host is often called a Hypervisor or VMM.
The VMM controls devices, and often presents virtual devices to the guest OSes.
Virtualisation cont.
The PC architecture makes this tricky.
Protected mode introduced privilege rings: four levels (0, 1, 2, 3) with decreasing privilege.
Operating systems assume full control (ring 0), with userspace at ring 3.
There is no control on reading privileged state.
Some instructions silently fail if not run at the correct privilege level instead of causing a fault.
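One classic example of such an instruction is SGDT, which reveals the (privileged) GDT location from userspace without faulting. A minimal sketch, assuming GCC or Clang on x86-64 Linux (note that modern CPUs with UMIP enabled will fault here; 2007-era hardware succeeds silently):

```c
/* Minimal sketch (assumes GCC/Clang on x86-64 Linux): SGDT reads
 * privileged CPU state, the GDT register, yet classically executes
 * in ring 3 without faulting; exactly the behaviour that breaks
 * naive trap-and-emulate virtualisation. */
#include <stdio.h>
#include <stdint.h>

struct __attribute__((packed)) gdtr {
    uint16_t limit;
    uint64_t base;
};

int main(void)
{
    struct gdtr g;
    __asm__ volatile ("sgdt %0" : "=m"(g));  /* no trap, no fault */
    printf("GDT base=0x%llx limit=0x%x (read from userspace)\n",
           (unsigned long long)g.base, g.limit);
    return 0;
}
```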
Virtualisation cont.
Four approaches to fix this: Emulation, Paravirtualisation, Binary Translation, Hardware Assisted.

Emulation
Fake the entire machine.
Very slow, but doesn't require the same host architecture as the guest.
Paravirtualisation
Alter the guest so that it doesn't use the "bad" instructions; the guest instead calls out to the VMM.
Very fast, with minimal overhead.
Requires support from all guest OSes, so it is effectively limited to open-source OSes.
Championed by Xen.
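To make "calls out to the VMM" concrete, here is a rough sketch of the mechanism. The hypercall number and helper names are hypothetical, not a real ABI, though classic 32-bit Xen really did enter the hypervisor via a software interrupt:

```c
/* Sketch only: guest-kernel code; actually running this outside a
 * paravirtualised guest would simply fault. */
#define HCALL_DISABLE_IRQS 1  /* hypothetical hypercall number */

static inline long hypercall0(long nr)
{
    long ret;
    __asm__ volatile ("int $0x82"  /* trap into the VMM */
                      : "=a" (ret)
                      : "a" (nr)
                      : "memory");
    return ret;
}

/* A paravirtualised guest replaces the privileged "cli" instruction
 * with an explicit call like this: */
static inline void guest_disable_irqs(void)
{
    hypercall0(HCALL_DISABLE_IRQS);
}
```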
Binary Translation
The VMM dynamically translates the instruction stream before it is executed, replacing "bad" instructions as it goes.
Lots of optimisations make this not nearly as bad as it sounds.
Allows running unmodified OSes.
The performance hit can be anything from 5-60% depending on workload.
Championed by VMware.
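A toy sketch of the rewrite step, using a made-up fixed-width instruction set (real x86 BT must decode variable-length instructions, handle branches, and cache translated blocks):

```c
/* Toy binary translator over a made-up 2-byte instruction set
 * (opcode, operand).  It copies guest code into a code cache,
 * rewriting the privileged opcode into an explicit VMM call-out. */
#include <stdio.h>
#include <stdint.h>

enum { OP_ADD = 0x01, OP_PRIV_IO = 0x7f, OP_CALL_VMM = 0x80 };

static void translate(const uint8_t *guest, size_t len, uint8_t *cache)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        cache[i]     = (guest[i] == OP_PRIV_IO) ? OP_CALL_VMM : guest[i];
        cache[i + 1] = guest[i + 1];  /* operands pass through */
    }
}

int main(void)
{
    const uint8_t guest[] = { OP_ADD, 5, OP_PRIV_IO, 3, OP_ADD, 1 };
    uint8_t cache[sizeof guest];
    translate(guest, sizeof guest, cache);
    for (size_t i = 0; i < sizeof guest; i += 2)
        printf("%02x %u\n", cache[i], cache[i + 1]);
    return 0;
}
```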
Hardware Virtualisation
Adds another privilege level for the VMM.
Hardware maintains state for each guest.
The VMM gets very flexible control over what causes faults.
Intel and AMD again have similar (but incompatible!) specs: AMD "Pacifica", Intel "Vanderpool".
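Detecting the two extensions is straightforward, and the CPUID bits below are the documented ones (VMX: leaf 1, ECX bit 5; SVM: leaf 0x80000001, ECX bit 2). Sketch assumes GCC/Clang's <cpuid.h>:

```c
/* Check for hardware virtualisation support on x86. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned a, b, c, d;

    if (__get_cpuid(1, &a, &b, &c, &d))
        printf("Intel VT-x (VMX): %s\n", (c & (1u << 5)) ? "yes" : "no");

    if (__get_cpuid(0x80000001, &a, &b, &c, &d))
        printf("AMD-V (SVM):      %s\n", (c & (1u << 2)) ? "yes" : "no");

    return 0;
}
```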
Hardware Virtualisation cont.
Initial solutions don't deal with enough in hardware, often making them slower than binary translation.
Transitions into and out of guests are very slow.
VMware and Xen have both added support.
Hardware support is getting more complete in newer versions.
Virtual I/O Devices
Typically VMMs provide virtual hardware drivers to the guests.
These may then map onto real hardware inside the VMM.
Allowing secure and separated direct access to hardware is difficult.
AMD 10h Family
AMD's latest architecture adds a number of important virtualisation enhancements, primarily focused on making fewer calls into the VMM.
Three main enhancements:
Nested Page Tables
Tagged TLB
Device Exclusion Vector (DEV)
Nested Page Tables
Current designs use shadow page tables: the MMU is set up for the current guest/process, and changes to the MMU are trapped by the VMM.
Nested page tables replicate the page table per VM. This looks similar to the virtual address space the OS provides to processes, just one more level up.
Makes lookups through the page table slower.
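To see why: every guest page-table reference is itself a guest-physical address that must be walked through the host's tables, so a 4-level guest over a 4-level host can cost up to 24 memory reads. A back-of-envelope check:

```c
/* Cost of a nested (2-D) page walk: for g guest levels and h host
 * levels the worst case is (g+1)*(h+1) - 1 memory reads. */
#include <stdio.h>

static int nested_walk_cost(int g, int h)
{
    return (g + 1) * (h + 1) - 1;
}

int main(void)
{
    printf("native 4-level walk:   %d reads\n", 4);
    printf("nested 4x4 worst case: %d reads\n", nested_walk_cost(4, 4));
    return 0;
}
```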
Tagged TLB
TLBs cache lookups through the page tables.
Typically they are flushed each time you switch process or VM.
Tagged TLBs add a tag recording which VM each entry relates to.
No longer have to flush the TLB when switching VMs.
Makes VM - VMM - VM transitions cheaper.
Reduces the performance hit of nested page tables.
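Conceptually, each TLB entry just carries an extra identifier (AMD calls it an ASID) that a lookup must match; a minimal model:

```c
/* Minimal model of a tagged TLB: on a VM switch the VMM changes the
 * current ASID instead of flushing; stale entries stop matching. */
#include <stdint.h>
#include <stddef.h>

struct tlb_entry {
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
    uint16_t asid;  /* which VM this translation belongs to */
    int      valid;
};

#define TLB_SIZE 64
static struct tlb_entry tlb[TLB_SIZE];

/* Hit only if both the page number and the ASID match. */
static struct tlb_entry *tlb_lookup(uint64_t vpn, uint16_t cur_asid)
{
    for (size_t i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == cur_asid)
            return &tlb[i];
    return NULL;  /* miss: walk the (nested) page tables */
}

int main(void)
{
    tlb[0] = (struct tlb_entry){ .vpn = 0x42, .pfn = 0x1234,
                                 .asid = 7, .valid = 1 };
    /* Same page, different VM: a miss, but no flush was needed. */
    return tlb_lookup(0x42, 9) == NULL ? 0 : 1;
}
```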
Device Exclusion Vector
Contains a mapping of which pages a device can access.
Allows the VMM to give DMA access to specific pages.
Allows a device to do DMA directly into a specific VM's memory space.
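The DEV is essentially a bit vector over physical pages, consulted on each DMA request; a sketch of the idea (the exact bit semantics here are an assumption, with a set bit meaning "blocked"):

```c
/* Sketch of the DEV idea: one bit per physical page saying whether
 * a device's protection domain may DMA to it.  The VMM clears bits
 * only for pages belonging to the VM that owns the device. */
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES (1u << 20)  /* e.g. 4GB of 4KB pages */

static uint8_t dev_bitmap[NPAGES / 8];  /* assumed: set = blocked */

static int dma_allowed(uint64_t phys_addr)
{
    uint64_t page = phys_addr >> PAGE_SHIFT;
    return !(dev_bitmap[page / 8] & (1u << (page % 8)));
}

int main(void)
{
    dev_bitmap[0] |= 1;  /* VMM blocks DMA to physical page 0 */
    return (dma_allowed(0x1000) && !dma_allowed(0x0)) ? 0 : 1;
}
```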
Virtualisation Future
Linux now has four full virtualisation packages: VMware, Xen, KVM, Lguest.
Intel and AMD are working on next-gen hardware support.
Dense, multi-core systems make virtualisation very attractive: power and space efficient.
Virtualisation keeps getting more and more important.
Traditional Intel Architecture
[Diagram: CPU connected via the Front Side Bus to the Northbridge, which connects to the Southbridge]
Clock vs. Data Rates
Clock rates no longer equal data rates; watch specifications, as both can look identical.
e.g. Intel's FSB is normally referred to as a 1333MHz bus. It is actually a 333MHz clock with a quad-pumped data rate, so it should really be 1333MT/s.
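The arithmetic, taking the 64-bit bus width from the next slide:

```c
/* Clock rate vs. transfer rate: a 333MHz clock moving data four
 * times per cycle is marketed as "1333MHz" but is really ~1333MT/s;
 * with an 8-byte (64-bit) data path that is ~10.6GB/s peak. */
#include <stdio.h>

int main(void)
{
    double clock_mhz   = 333.0;  /* really 333 1/3 in the marketing */
    int    pumps       = 4;      /* transfers per clock (quad-pumped) */
    int    width_bytes = 8;      /* 64-bit bus */

    double mts  = clock_mhz * pumps;           /* mega-transfers/s */
    double gbps = mts * width_bytes / 1000.0;  /* GB/s peak */

    printf("%.0f MT/s, %.2f GB/s peak\n", mts, gbps);
    return 0;
}
```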
Frontside Bus & Dual Core
64-bit, quad-pumped 333MHz: 8 * 4 * 333 = 10.6GB/s.
Half duplex.
The traditional P4 FSB has been point to point; Xeon SMP requires chipsets to support a proper multipoint bus for the FSB.
How did a dual-core Pentium D work?
Intel Dual Core
Slap two cores on the same die! How do they communicate?
Over the FSB, like a Xeon! Requires a new chipset to support it.
Caches
There are two sets of L1 and L2 cache, one for each core.
What happens if both cores are caching the same memory location?
We need a protocol to make sure this doesn't cause problems: a cache coherency protocol.
Cache Coherency
Intel uses the MESI protocol:
Modified: one cache contains a modified copy of the location.
Exclusive: one cache contains an unmodified copy of the location.
Shared: two or more caches contain unmodified copies of the location.
Invalid: another cache contains a modified copy of the location.
MESI
What happens if a core has an Invalid entry but tries to access it? The cache with the Modified entry needs to write the entry back to memory and become a Shared entry.
This means an Intel Pentium D has to involve the FSB in all of its cache coherency updates, even though both cores are on the same die.
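A condensed model of that read-miss case (states and the snoop response only; real hardware tracks this per cache line):

```c
/* Condensed MESI model of the read miss: the Modified peer writes
 * the line back and both caches end up Shared.  On a Pentium D this
 * whole exchange is snooped across the external FSB. */
#include <stdio.h>

enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };
static const char *name[] = { "M", "E", "S", "I" };

/* 'self' reads a line it holds Invalid; 'other' is the peer cache. */
static void read_line(enum mesi *self, enum mesi *other)
{
    if (*self != INVALID)
        return;  /* cache hit, no bus traffic */
    if (*other == MODIFIED)
        *other = SHARED;  /* peer writes the dirty line back via the FSB */
    else if (*other == EXCLUSIVE)
        *other = SHARED;
    *self = (*other == SHARED) ? SHARED : EXCLUSIVE;
}

int main(void)
{
    enum mesi core0 = MODIFIED, core1 = INVALID;
    read_line(&core1, &core0);
    printf("core0=%s core1=%s\n", name[core0], name[core1]);  /* S S */
    return 0;
}
```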
Intel Core
The Core architecture is designed for dual-core. The cache is now shared between cores, and cache coherency now runs between each core's L1 and the shared L2.
[Diagram: Core1 and Core2 each connect to a shared L2 cache through the L2 control logic, with bus control linking to the FSB]
Intel Core 2 Quad
The current quad-core design puts two dual-core dies together in one package.
Cache coherency between the dies again happens over the FSB.
AMD K8 Architecture
Integrated memory controller: very low latency, but the CPU now determines the memory technology.
Requires both CPU and motherboard to be changed for a new type of memory.
Athlon 64, Socket 754: single-channel, 64-bit DDR at 200MHz. 8 * 2 * 200 = 3.2GB/s.
AMD K8 Memory Interfaces
Athlon 64, Socket 939: dual-channel, 64-bit DDR at 200MHz. 2 * 8 * 2 * 200 = 6.4GB/s.
Athlon 64, Socket AM2: dual-channel, 64-bit DDR2 at 400MHz. 2 * 8 * 2 * 400 = 12.8GB/s.
Opteron, Socket 940: dual-channel, 64-bit DDR at 200MHz. 2 * 8 * 2 * 200 = 6.4GB/s.
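All of these figures follow the same pattern: channels * width-in-bytes * transfers-per-clock * clock. A quick check:

```c
/* Peak memory bandwidth = channels * width(bytes) * transfers per
 * clock * clock(MHz).  Reproduces the figures above. */
#include <stdio.h>

static double bw_gbs(int channels, int width_bytes, int pump, int mhz)
{
    return channels * width_bytes * pump * (double)mhz / 1000.0;
}

int main(void)
{
    printf("Socket 754 (DDR-400 x1):  %.1f GB/s\n", bw_gbs(1, 8, 2, 200));
    printf("Socket 939 (DDR-400 x2):  %.1f GB/s\n", bw_gbs(2, 8, 2, 200));
    printf("Socket AM2 (DDR2-800 x2): %.1f GB/s\n", bw_gbs(2, 8, 2, 400));
    return 0;
}
```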
AMD K8 FSB
AMD uses HyperTransport technology (don't confuse it with the very different Intel Hyper-Threading Technology).
A packet-based, point-to-point link that provides the lowest possible latency.
Full-duplex, bi-directional link.
Available 2, 4, 8, 16 or 32 bits wide.
50MHz - 1.4GHz clock rates; clock rates and bit widths can be asymmetric.
Double-pumped data rate: 1.4GHz * 2 * 4 bytes = 11.2GB/s per direction for a 32-bit link.
Athlon 64 FSB
The Athlon 64 has a single HyperTransport link to connect to the I/O subsystem: 16-bit, bi-directional, DDR at 800MHz (Socket 754) or 1GHz (Socket 939). 2 * 2 * 2 * 1000 = 8GB/s.
Can use either a single-chip solution or stay with the two-chip northbridge/southbridge combination.
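The same arithmetic for HyperTransport links, per direction and then doubled for the full-duplex aggregate the slide quotes:

```c
/* HyperTransport link bandwidth: width(bytes) * 2 (double-pumped)
 * * clock(MHz), per direction; the full-duplex aggregate doubles it. */
#include <stdio.h>

static double ht_gbs(int width_bytes, int mhz)
{
    return width_bytes * 2 * (double)mhz / 1000.0;  /* per direction */
}

int main(void)
{
    /* 32-bit link at 1.4GHz: the 11.2GB/s per direction quoted above */
    printf("32-bit @ 1.4GHz: %.1f GB/s per direction\n", ht_gbs(4, 1400));

    /* Athlon 64 (939): 16-bit at 1GHz, quoted as 8GB/s aggregate */
    printf("16-bit @ 1GHz:   %.1f GB/s aggregate\n", 2 * ht_gbs(2, 1000));
    return 0;
}
```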
Opteron
Uses HT for both the I/O interconnect and the CPU interconnect.
The CPU interconnect requires an additional cache coherency extension to HT.
Comes in three variants:
1xx - memory controller and 3 HT links
2xx - memory controller and 3 HT links
8xx - memory controller and 3 HT links
Opteron cont.
What makes them different? The number of HT buses that support the CC protocol:
1xx - zero
2xx - one
8xx - three
This allows scaling to different numbers of processors: 1xx - 1-way, 2xx - 2-way, 8xx - 4- or 8-way.
AMD MOESI
AMD's cache coherency is slightly different to Intel's, adding an Owner state:
Owner: this cache owns the memory location, and it (not memory) services all requests for it from other caches.
The request goes across the high-speed, dedicated HT bus.
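The practical difference from the MESI read-miss shown earlier: the dirty line never has to be written back to memory first. A sketch of just that transition, in the same toy style as the MESI model above:

```c
/* MOESI read-miss sketch: unlike MESI, the dirty line is shared
 * without a memory writeback.  The Modified peer moves to Owned and
 * forwards the data itself (over the dedicated HT link on Opteron). */
enum moesi { M_MODIFIED, M_OWNED, M_EXCLUSIVE, M_SHARED, M_INVALID };

static void read_line(enum moesi *self, enum moesi *peer)
{
    if (*self != M_INVALID)
        return;  /* hit */
    if (*peer == M_MODIFIED || *peer == M_OWNED) {
        *peer = M_OWNED;   /* peer keeps the dirty data ... */
        *self = M_SHARED;  /* ... and forwards a copy to us */
    } else if (*peer == M_EXCLUSIVE || *peer == M_SHARED) {
        *peer = M_SHARED;
        *self = M_SHARED;
    } else {
        *self = M_EXCLUSIVE;  /* fetched from memory */
    }
}

int main(void)
{
    enum moesi core0 = M_MODIFIED, core1 = M_INVALID;
    read_line(&core1, &core0);
    /* core0 is now Owned, core1 Shared, and memory was never touched */
    return (core0 == M_OWNED && core1 == M_SHARED) ? 0 : 1;
}
```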
AMD Dual Core
In the dual-core situation it gets even better: requests now go across the System Request Interface, which runs at the CPU core frequency.
The System Request Interface also controls HT-to-memory access as well as HT-to-HT communication in 2xx and 8xx Opterons.
AMD Quad Core
AMD again targeted native quad-core instead of dual-die dual-core.
Introduced a shared L3 cache.