8/13/2019 IO Virtualization on ARM_Part3
12/3/13 IO Virtualization on ARM
www.futurechips.org/chip-design-for-all/arm-virtualization-part-3-iommu.html 1/11
Future Chips
ARM Virtualization: I/O Virtualization (Part 3)
In the second part of the series we introduced the memory management and interrupt
handling support provided by the virtualization hardware extensions. But effective
virtualization solutions need to reach beyond the core to communicate with peripheral
devices. In this post we discuss the various techniques used for virtualizing I/O, the
problems they face, and the hardware solutions that mitigate these problems.
The Difficulty Of Virtualizing I/O
Before we talk about the system-level hardware solutions for virtualization, we
need to motivate what is driving these features. To appreciate the
problems, we have to recognize that in some ways communicating with I/O in a
virtualized environment is a paradox. We want to run an operating system in a
sandboxed environment where it is oblivious to the system outside the virtual
environment. But I/O cannot be oblivious to the outside environment because it is
communicating with that environment. So, understandably, virtualizing I/O becomes a
difficult problem.
Moving away from the philosophical questions, what is the goal of virtualization and
how does I/O fit into that goal? In my view it is to provide a managed environment for
hosting a VM that improves the overall user experience. To achieve this goal, ideally
we'd like I/O in a VM to have the following properties:
1. The guest has access to the same I/O devices it would use in a native environment.
Posted by Ali Hussain on April 1, 2013 at 9:25 am
2. The guest OS cannot affect the I/O operations or memory of other guests.
3. The software changes to the guest OS must be minimal.
4. The guest OS needs to be able to recover from a failure of the hardware or
migration of the VM.
5. The I/O operations on the guest OS should have similar performance to running
natively.
Several items on this list compete with one another. For example, routing every
operation through the hypervisor helps isolation (item 2) but hurts performance
(item 5). So the final solution will require trade-offs based on the particular use-case.
With these goals in mind, let us look at the various techniques for implementing I/O
virtualization and the problems they face.
Emulated Or Paravirtualized Devices
When implementing full virtualization, one of the simplest options is for the hypervisor to
emulate a virtual device for the guest. The guest communicates with this virtual device
and the hypervisor detects the guest's communication. This can be done by trapping
device accesses or by restricting permissions on certain pages of memory. The hypervisor
interprets the guest OS's operations on the virtual device and performs the
corresponding operations on the physical device. This technique is called hosted or split
I/O.
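To make the trap-and-emulate flow concrete, here is a minimal sketch of a hypervisor-side device model. Everything in it is an assumption made for illustration: the three-register layout, the names `EmulatedDiskController` and `PhysicalDisk`, and the idea that each guest register write arrives as one call to `on_guest_write`. It is not how any particular hypervisor implements device emulation.

```python
class PhysicalDisk:
    """Stand-in for the real device owned by the hypervisor (hypothetical)."""

    def __init__(self):
        self.sectors = {}

    def write_sector(self, number, data):
        self.sectors[number] = data


class EmulatedDiskController:
    """Hypothetical virtual device with SECTOR, DATA and COMMAND registers."""

    REG_SECTOR, REG_DATA, REG_COMMAND = 0x0, 0x4, 0x8
    CMD_WRITE = 1

    def __init__(self, backend):
        self.backend = backend
        self.sector = 0
        self.data = 0
        self.traps = 0  # each trapped guest register access is counted here

    def on_guest_write(self, offset, value):
        """Called by the hypervisor when a guest store to the device traps."""
        self.traps += 1
        if offset == self.REG_SECTOR:
            self.sector = value
        elif offset == self.REG_DATA:
            self.data = value
        elif offset == self.REG_COMMAND and value == self.CMD_WRITE:
            # Replay the guest's request on the physical device.
            self.backend.write_sector(self.sector, self.data)
        else:
            raise ValueError(f"unhandled register write at offset {offset:#x}")


disk = PhysicalDisk()
vdev = EmulatedDiskController(disk)
# The guest driver believes it is programming real hardware.
vdev.on_guest_write(EmulatedDiskController.REG_SECTOR, 7)
vdev.on_guest_write(EmulatedDiskController.REG_DATA, 0xABCD)
vdev.on_guest_write(EmulatedDiskController.REG_COMMAND,
                    EmulatedDiskController.CMD_WRITE)
print(disk.sectors, "traps:", vdev.traps)
```

In this toy model a single sector write costs three traps, one per register access, which is where the overhead of emulation comes from.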
The advantage of this technique is that, since every call goes through the hypervisor,
the hypervisor can provide the desired functionality. For example, the hypervisor can
track every I/O operation the device is presently waiting on. Similarly, restricting a guest
from affecting other guests becomes simpler because all physical device accesses
are managed by the hypervisor. But this technique has a high CPU overhead: the data
needs to be copied multiple times, processed through multiple I/O stacks, and so on.
The performance can be improved by using paravirtualization. In this case the device
drivers in the guest OS implement an ABI with the hypervisor. The device drivers interface
with the hypervisor, and the hypervisor directly communicates with the physical device
as is shown in the figure below.
This technique provides better performance with similar control, but there is still a
significant performance overhead, for example in trapping to the hypervisor. The figure
below shows the difference observed by IBM between using an emulated IDE controller and
IBM's virtio-blk paravirtualized device drivers in KVM.
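As a rough illustration of why a paravirtualized interface traps less often than an emulated device, the sketch below models a shared request queue. The class `ParavirtQueue`, its methods, and the "one notification per batch" policy are hypothetical simplifications. They are not the virtio data structures or the KVM interface.

```python
from collections import deque


class ParavirtQueue:
    """Hypothetical request queue shared by a guest driver and the hypervisor."""

    def __init__(self):
        self.pending = deque()
        self.kicks = 0  # number of guest-to-hypervisor notifications

    def guest_submit(self, requests):
        # Guest side: place requests in shared memory without trapping.
        self.pending.extend(requests)
        # A single notification (for example a hypercall) covers the whole batch.
        self.kicks += 1

    def hypervisor_drain(self, disk):
        # Hypervisor side: service everything queued since the last notification.
        completed = 0
        while self.pending:
            sector, data = self.pending.popleft()
            disk[sector] = data
            completed += 1
        return completed


disk = {}
queue = ParavirtQueue()
queue.guest_submit([(1, b"a"), (2, b"b"), (3, b"c")])
done = queue.hypervisor_drain(disk)
print(done, "requests serviced after", queue.kicks, "notification")
```

The guest driver has to be written against this interface, which is the cost paid under goal 3 in exchange for fewer transitions into the hypervisor.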
When looking at this overhead, it is important to keep in mind that it is very use-case
dependent. A CPU-bound benchmark will not show much sensitivity to the virtualization
of I/O, whereas for an I/O-heavy benchmark the overhead can be significant. As an
example, the conjugate-gradient method for solving a system of linear equations spends
around 70% of CPU cycles in user mode and spends the remaining time in the
hypervisor kernel engaged in disk I/O.
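A back-of-the-envelope calculation shows how the sensitivity depends on the I/O share of the workload. The sketch below assumes a simple model in which only the I/O portion is slowed down by some factor. The slowdown factors are invented for illustration and are not measurements, and `relative_runtime` is a hypothetical helper.

```python
def relative_runtime(io_fraction, io_slowdown):
    """Runtime relative to native when only the I/O portion is slowed down."""
    return (1.0 - io_fraction) + io_fraction * io_slowdown


# 70% user-mode compute and 30% I/O, as in the conjugate-gradient example.
for slowdown in (1.0, 1.5, 2.0):
    print(f"I/O slowdown {slowdown:.1f}x -> total runtime "
          f"{relative_runtime(0.30, slowdown):.2f}x")

# A CPU-bound workload with 2% I/O barely notices the same slowdown.
print(f"{relative_runtime(0.02, 2.0):.2f}x")
```

Under this model, doubling the cost of I/O makes the 30% I/O workload about 1.3 times slower, while the CPU-bound workload slows down by about 2%.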
Passthrough I/O
Passthrough I/O greatly improves performance by setting up the guest's page tables so
that the guest can write directly to the physical device. This eliminates most of the overhead of trapping to
the hypervisor for every operation. This technique brings the bulk of I/O processing to
near-native speeds.
There are several issues that need to be addressed to effectively virtualize I/O using
this technique. Consider the case of a guest using DMA accesses to communicate with
a device. In this scenario we need to account for the following issues.
Isolation
The goal of virtualization is to sandbox the guest OS to keep it from accessing the
data of other guest OSes. For accesses made by the core, we do this by adding a second stage
translation. However, DMA devices operate on physical addresses and are not
aware of second stage translations. So if a guest is given unrestricted access to a DMA
device, it can read or write any physical address in memory and corrupt the memory
of other guests. There needs to be a protection mechanism that ensures
DMA requests issued on behalf of a particular guest can only reach memory associated
with that guest.
Furthermore, more than one guest may need to access the same device. The device
needs to be able to distinguish between the accesses coming from different guests
and redirect them correctly.
Physical Address
To complete the DMA transaction the guest OS needs to provide the device with the
proper physical address in memory to find the data. But the guest does not know the
physical address of the data, only the Intermediate Physical Address (IPA) which is in
essence a virtual address. For the DMA access to work the device must be able to
translate the IPA to the correct physical address.
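The translation the device needs can be sketched as a lookup in a per-guest page map. This is a simplified model under stated assumptions: a flat dictionary stands in for the real multi-level second stage tables, pages are 4 KB, and the names `translate`, `TranslationFault` and `guest_a` are made up for the example.

```python
PAGE_SIZE = 4096


class TranslationFault(Exception):
    """Raised when a guest has no mapping for the requested address."""


def translate(stage2_table, ipa):
    """Translate an intermediate physical address using a per-guest page map.

    stage2_table maps IPA page numbers to PA page numbers (hypothetical layout).
    """
    page, offset = divmod(ipa, PAGE_SIZE)
    if page not in stage2_table:
        # The guest does not own this page, so the DMA must not proceed.
        raise TranslationFault(f"no mapping for IPA {ipa:#x}")
    return stage2_table[page] * PAGE_SIZE + offset


guest_a = {0x10: 0x8A, 0x11: 0x3F}
print(hex(translate(guest_a, 0x10123)))
```

Because every lookup goes through the guest's own table, the same mechanism also gives the isolation described earlier: an address the guest was never given simply has no translation.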
Contiguous Memory Blocks
The problem cannot be solved by just providing the device with the correct PA. The
device expects the DMA target region to be located in a contiguous region of memory.
In a virtualized environment this is not guaranteed. The hypervisor may back guest memory
with physical pages that are not contiguous, in blocks as small as 4 KB. So the translation
must be performed for every page of the DMA region.
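To see why a single translated start address is not enough, the sketch below walks a buffer that is contiguous in IPA space and lists the physical chunks it actually occupies. The page map, the 4 KB page size and the helper name `physical_chunks` are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096


def physical_chunks(page_map, ipa_start, length):
    """Split a DMA buffer that is contiguous in IPA space into physical chunks.

    page_map maps IPA page numbers to PA page numbers (hypothetical layout).
    Returns a list of (physical_address, byte_count), one entry per page touched.
    """
    chunks = []
    ipa = ipa_start
    remaining = length
    while remaining > 0:
        page, offset = divmod(ipa, PAGE_SIZE)
        count = min(PAGE_SIZE - offset, remaining)
        chunks.append((page_map[page] * PAGE_SIZE + offset, count))
        ipa += count
        remaining -= count
    return chunks


page_map = {0x10: 0x8A, 0x11: 0x3F, 0x12: 0x90}
for pa, count in physical_chunks(page_map, 0x10F00, 0x1200):
    print(hex(pa), count)
```

A buffer of 0x1200 bytes that looks like one region to the guest lands in three unrelated physical pages here, so each page boundary needs its own translation.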
32 Bit Devices In Larger Address Spaces
This problem is similar to the problem of a 32-bit guest on a 64-bit host discussed in
the previous post. The system may have older devices that cannot address the complete,
larger address spaces of newer systems. An address translation is necessary to let
these devices perform DMA to memory outside their normal addressable range.
Hardware Support
The problems mentioned above are not easily solved in software and need a hardware
solution that correctly maps device addresses to the correct guest. Most platforms have
hardware solutions for this. This mechanism is called an IOMMU, for I/O Memory
Management Unit. Intel calls its implementation VT-d, AMD calls its implementation
AMD-Vi, and ARM calls its implementation the System MMU.
The basic idea of the IOMMU is simple. An address translation unit is placed
between memory and any devices that may be used by a guest OS. When the hypervisor sets
up second stage page tables for a guest OS to access the device, it sets up the IOMMU
too. As with table walks in the core, address translations are expensive, so TLBs are
implemented to reduce their overhead.
An example system showing where the System MMU can be located. Transactions with the device are translated through the System MMU.
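The role of the TLB in an IOMMU can be sketched with a toy model that counts how often the slow table walk is needed. The class `ToyIommu`, its two-entry cache and its oldest-first eviction are assumptions made for the example. Real IOMMUs use multi-level tables and their own replacement policies.

```python
PAGE_SIZE = 4096


class ToyIommu:
    """Hypothetical IOMMU model: a page table plus a small translation cache."""

    def __init__(self, page_table, tlb_entries=2):
        self.page_table = page_table  # device page number -> physical page number
        self.tlb = {}                 # cached subset of page_table
        self.tlb_entries = tlb_entries
        self.walks = 0                # number of slow table walks performed

    def translate(self, device_addr):
        page, offset = divmod(device_addr, PAGE_SIZE)
        if page not in self.tlb:
            self.walks += 1           # TLB miss: walk the table in memory
            if len(self.tlb) >= self.tlb_entries:
                # Evict the oldest entry (dicts preserve insertion order).
                self.tlb.pop(next(iter(self.tlb)))
            self.tlb[page] = self.page_table[page]
        return self.tlb[page] * PAGE_SIZE + offset


iommu = ToyIommu({0: 0x40, 1: 0x17})
addresses = [0x0000, 0x0040, 0x0080, 0x1000, 0x1040]
physical = [iommu.translate(a) for a in addresses]
print([hex(p) for p in physical], "walks:", iommu.walks)
```

Five device accesses need only two table walks in this run because consecutive accesses fall in the same pages, which is the locality a TLB relies on.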
System MMU
The ARM System MMU is programmed with different translation contexts. It maps each
transaction to the corresponding context by matching against expected streams. Based
on the context, the System MMU may bypass the translation, cause a fault, or
perform a translation. The System MMU in the ARM architecture provides full two-stage
translation support (as described in the previous post), and depending on the context
it may perform either a first stage translation or a second stage translation. To perform the
translation, the System MMU has, for each context, registers analogous to the TTBRs and
other control registers.
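The stream-to-context dispatch described above can be sketched as follows. The model is deliberately simplified and every name in it is hypothetical: `ToySmmu`, `Context`, the integer stream IDs, and the choice to fault on a stream that matches no context. The real matching rules are implementation defined.

```python
PAGE_SIZE = 4096


class Context:
    """Hypothetical translation context: 'bypass', 'fault', or 'translate'."""

    def __init__(self, mode, page_table=None):
        self.mode = mode
        self.page_table = page_table or {}


class ToySmmu:
    def __init__(self):
        self.stream_to_context = {}

    def attach(self, stream_id, context):
        self.stream_to_context[stream_id] = context

    def handle(self, stream_id, address):
        """Return ('ok', output_address) or ('fault', None) for a transaction."""
        context = self.stream_to_context.get(stream_id)
        # Assumption in this sketch: an unmatched stream is treated as a fault.
        if context is None or context.mode == "fault":
            return ("fault", None)
        if context.mode == "bypass":
            return ("ok", address)
        page, offset = divmod(address, PAGE_SIZE)
        if page not in context.page_table:
            return ("fault", None)
        return ("ok", context.page_table[page] * PAGE_SIZE + offset)


smmu = ToySmmu()
smmu.attach(1, Context("translate", {0x2: 0x55}))
smmu.attach(2, Context("bypass"))
smmu.attach(3, Context("fault"))
for stream in (1, 2, 3, 9):
    print(stream, smmu.handle(stream, 0x2010))
```

The same input address produces a different outcome for each stream, which is how one translation unit can serve devices assigned to different guests.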
The System MMU may also encounter faults during its translation process or if a context is
not set up. Depending on the type of fault and how the System MMU is configured, it may
take different actions. A translation fault can trigger an interrupt, which gives the
hypervisor an opportunity to service the fault and restart the translation so that it
can complete. The System MMU may also return a bus error to the
appropriate requestor. Syndrome registers are provided to ease the process of
diagnosing and fixing the problem.
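The fault-then-restart flow can be sketched with a callback standing in for the interrupt. This is a toy model under stated assumptions: `FaultingSmmu`, the `syndrome` attribute, the handler signature and the on-demand mapping policy are all invented for illustration and do not reflect the actual register interface.

```python
PAGE_SIZE = 4096


class FaultingSmmu:
    """Hypothetical model that records a syndrome and calls a fault handler."""

    def __init__(self, page_table, on_fault):
        self.page_table = page_table
        self.on_fault = on_fault  # stands in for the fault interrupt
        self.syndrome = None      # last faulting (stream_id, address)

    def translate(self, stream_id, address):
        page, offset = divmod(address, PAGE_SIZE)
        if page not in self.page_table:
            self.syndrome = (stream_id, address)
            # Give the hypervisor a chance to repair the mapping.
            if not self.on_fault(self, stream_id, address):
                raise RuntimeError("bus error returned to requestor")
        # Either the mapping existed or the handler installed it: complete.
        return self.page_table[page] * PAGE_SIZE + offset


def hypervisor_handler(smmu, stream_id, address):
    """Map the missing page on demand for stream 1 only (illustrative policy)."""
    if stream_id != 1:
        return False
    smmu.page_table[address // PAGE_SIZE] = 0x77
    return True


smmu = FaultingSmmu({}, hypervisor_handler)
print(hex(smmu.translate(1, 0x3008)), smmu.syndrome)
```

The recorded syndrome tells the handler which stream and address faulted, so it can decide whether to install a mapping or let the error propagate to the requestor.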
Some advantages of the System MMU don't even need virtualization. Since the System
MMU enables every device to perform VA to PA translations, I/O operations can be
performed by drivers in user space using VAs. The permission checking and translation
maps can ensure that one user application does not corrupt the memory of another
application. This would eliminate the traps to the kernel presently required, further reducing
I/O overhead. Another benefit concerns contiguous memory. Many operations
result in very large DMA accesses for which the OS cannot allocate a single contiguous
chunk of memory. Presently they need to either be split into multiple DMA requests or
performed with complex DMA scatter-gather operations. The System MMU enables the
device to perform DMA on a contiguous VA range instead of fragmented
PAs. This both reduces the CPU overhead and simplifies the software and the device.
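The difference between scatter-gather and a contiguous device-visible range can be sketched by counting the descriptors each approach hands to the device. The helpers `scatter_gather_list` and `map_contiguous`, the dictionary used as the translation table, and the page numbers are hypothetical.

```python
PAGE_SIZE = 4096


def scatter_gather_list(physical_pages):
    """Without an IOMMU: one descriptor per physically scattered page."""
    return [(page * PAGE_SIZE, PAGE_SIZE) for page in physical_pages]


def map_contiguous(iommu_table, first_device_page, physical_pages):
    """With an IOMMU: map scattered pages at consecutive device-visible pages.

    Returns a single (device_address, byte_count) descriptor.
    """
    for index, page in enumerate(physical_pages):
        iommu_table[first_device_page + index] = page
    return (first_device_page * PAGE_SIZE, len(physical_pages) * PAGE_SIZE)


pages = [0x8A, 0x3F, 0x90, 0x12]
iommu_table = {}
descriptors = scatter_gather_list(pages)
single = map_contiguous(iommu_table, 0x100, pages)
print(len(descriptors), "descriptors without translation")
print("one descriptor with translation:", hex(single[0]), single[1])
```

The work of tracking the fragments does not disappear. It moves from the driver and the device into the translation table, where it is set up once per mapping.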
It should be noted that the System MMU is a part of the platform rather than a part of
the core architecture. This means it only affects the drivers. Because of this, many
features are implementation defined. For example, the bits used to match a stream and
map it to a context are implementation defined. Since no user code is
aware of this part of the system, changes to the System MMU architecture wouldn't
raise as many legacy code issues.
So using these techniques the hypervisor can provide an appropriate implementation of
virtualized I/O according to the use-case. This concludes the third installment of this
series on virtualization. The series continues in the next post, which discusses the use-cases
for virtualization, especially the use-cases targeted by ARM in the mobile space.
References
For more information check out the following resources.
http://xpgc.vicp.net/course/svt/TechDoc/ch12-IOArchitecturesForVirtualization.pdf
http://nowlab.cse.ohio-state.edu/NOW/dissertations/huang.pdf
http://www.ibm.com/developerworks/linux/library/l-virtio/
http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaat/liaatbestpractices_pdf.pdf
http://www.mulix.org/lectures/xen-iommu/xen-io.pdf
http://developer.amd.com/wordpress/media/2012/10/IOMMU-ben-yehuda.pdf
http://www.arm.com/files/pdf/System-MMU-Whitepaper-v8.0.pdf
http://software.intel.com/en-us/articles/intel-virtualization-technology-for-directed-io-vt-d-enhancing-intel-platforms-for-efficient-virtualization-of-io-devices
http://support.amd.com/us/Processor_TechDocs/48882.pdf