www.futurechips.org/chip-design-for-all/arm-virtualization-part-3-iommu.html

Future Chips

ARM Virtualization: I/O Virtualization (Part 3)

Posted by Ali Hussain on April 1, 2013 at 9:25 am

In the second part of the series we introduced the memory management and interrupt handling support provided by the virtualization hardware extensions. But effective virtualization solutions need to reach beyond the core to communicate with peripheral devices. In this post we discuss the various techniques used for virtualizing I/O, the problems faced, and the hardware solutions that mitigate these problems.

The Difficulty Of Virtualizing I/O

Before we talk about the system-level hardware solutions for virtualization, we need to motivate what is driving these features. To appreciate the problems we have to recognize that in some ways communicating with I/O in a virtualized environment is a paradox. We want to run an operating system in a sandboxed environment where it is oblivious to the system outside the virtual environment. But I/O cannot be oblivious to the outside environment because it is communicating with that environment. So, understandably, virtualizing I/O becomes a difficult problem.

Moving away from the philosophical questions, what is the goal of virtualization and how does I/O fit into that goal? In my view it is to provide a managed environment for hosting a VM that improves the overall user experience. To achieve this goal, ideally we'd like I/O in a VM to have the following properties:

1. The guest has access to the same I/O devices it would use in a native environment.


2. The guest OS cannot affect the I/O operations or memory of other guests.

3. The software changes to the guest OS must be minimal.

4. The guest OS needs to be able to recover from a failure of the hardware or migration of the VM.

5. The I/O operations on the guest OS should have similar performance to running natively.

Several items on this list compete with one another, so the final solution will require trade-offs based on the particular use case. With these goals in mind, let us look at the various techniques for implementing I/O virtualization and the problems each one faces.

Emulated Or Paravirtualized Devices

When implementing full virtualization, one of the simplest options is for the hypervisor to emulate a virtual device for the guest OS. The guest communicates with this virtual device and the hypervisor detects the guest's communication, either by trapping device accesses or by restricting permissions on certain pages of memory. The hypervisor interprets the guest's operations on the virtual device and performs the corresponding operations on the physical device. This technique is called hosted or split I/O.
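
To make the trap-and-emulate flow concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the three-register disk layout, the class names, and the 4 KB device window are invented for illustration and do not correspond to any real device or hypervisor. It only shows the shape of the idea: a guest store to a device page traps, the hypervisor routes it to a virtual device model, and the model performs the real operation within the slice of the physical device assigned to that guest.

```python
class BackingDisk:
    """Stands in for the physical device owned by the hypervisor."""
    def __init__(self, num_sectors):
        self.sectors = [0] * num_sectors


class EmulatedDisk:
    """Hypothetical virtual device with three registers (invented layout)."""
    REG_SECTOR, REG_DATA, REG_CMD = 0x0, 0x4, 0x8
    CMD_WRITE = 1

    def __init__(self, backing, first_sector, num_sectors):
        self.backing = backing
        self.first = first_sector      # start of this guest's slice
        self.count = num_sectors       # size of this guest's slice
        self.regs = {self.REG_SECTOR: 0, self.REG_DATA: 0}

    def mmio_write(self, offset, value):
        if offset in self.regs:
            self.regs[offset] = value  # latch the register value
        elif offset == self.REG_CMD and value == self.CMD_WRITE:
            sector = self.regs[self.REG_SECTOR]
            if not 0 <= sector < self.count:
                raise PermissionError("guest wrote outside its disk slice")
            self.backing.sectors[self.first + sector] = self.regs[self.REG_DATA]


class Hypervisor:
    WINDOW = 0x1000                    # assumed size of a device window

    def __init__(self):
        self.devices = {}              # (guest_id, base address) -> device model
        self.traps = 0

    def attach(self, guest_id, base, device):
        self.devices[(guest_id, base)] = device

    def handle_trap(self, guest_id, address, value):
        """Called when a guest store faults on a protected device page."""
        self.traps += 1
        for (gid, base), dev in self.devices.items():
            if gid == guest_id and base <= address < base + self.WINDOW:
                dev.mmio_write(address - base, value)
                return
        raise LookupError("no virtual device at this address")
```

In this sketch a single sector write costs three traps (sector, data, command), which hints at why the approach is expensive, and two guests attached at the same guest address still land in disjoint parts of the backing disk.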

The advantage of this technique is that since every call goes through the hypervisor, the hypervisor can provide the desired functionality. For example, the hypervisor can track every I/O operation the device is presently waiting on. Similarly, restricting a guest from affecting other guests is simpler because all physical device accesses are managed by the hypervisor. But this technique has a high CPU overhead: the data needs to be copied multiple times and processed through multiple I/O stacks.

Performance can be improved by using paravirtualization. In this case the device drivers in the guest OS implement an ABI with the hypervisor. The device drivers interface with the hypervisor, and the hypervisor communicates directly with the physical device, as is shown in the figure below.

This technique provides better performance with similar control, but there is still a significant performance overhead, for example in trapping to the hypervisor. The figure below shows the difference observed by IBM between an emulated IDE controller and the virtio-blk paravirtualized device drivers in KVM.
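
One reason a paravirtualized driver traps less often is that it can queue many requests in memory shared with the hypervisor and then notify the hypervisor once. The sketch below illustrates only that batching idea. The class and method names are hypothetical, the `deque` stands in for a shared-memory ring, and this is not the virtio interface itself.

```python
from collections import deque


class ParavirtBlockBackend:
    """Hypervisor side: drains the shared request queue on each notification."""
    def __init__(self, num_sectors):
        self.sectors = [0] * num_sectors
        self.hypercalls = 0

    def notify(self, queue):
        self.hypercalls += 1               # one trap into the hypervisor
        while queue:
            sector, data = queue.popleft()
            self.sectors[sector] = data


class ParavirtBlockDriver:
    """Guest side: queues requests, then notifies the backend once."""
    def __init__(self, backend):
        self.backend = backend
        self.queue = deque()               # stands in for a shared-memory ring

    def submit(self, sector, data):
        self.queue.append((sector, data))  # plain memory write, no trap

    def kick(self):
        self.backend.notify(self.queue)
```

Submitting four writes and calling `kick()` once reaches the hypervisor a single time, whereas an emulated register-style interface would trap on every register access. The trap on `kick()` remains, which is the residual overhead described above.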

When looking at this overhead it is important to keep in mind that it is very use-case dependent. A CPU-bound benchmark will not show much sensitivity to the virtualization of I/O, whereas for an I/O-heavy benchmark the overhead can be significant. As an example, the conjugate-gradient method for solving a system of linear equations spends around 70% of its CPU cycles in user mode and the remaining time in the hypervisor kernel engaged in disk I/O.
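
A back-of-the-envelope model shows why the I/O share of a workload matters so much. The function below is a simple sketch, assuming the non-I/O portion runs at native speed and only the I/O portion is slowed by some factor. The slowdown factor is a made-up input for illustration, not a measurement from this post.

```python
def relative_runtime(io_fraction, io_slowdown):
    """Runtime relative to native when only the I/O share is slowed down.

    io_fraction: share of native runtime spent in I/O (0.0 to 1.0).
    io_slowdown: hypothetical factor by which virtualized I/O is slower.
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    return (1.0 - io_fraction) + io_fraction * io_slowdown
```

With the 70/30 split mentioned above and a hypothetical 2x I/O slowdown, `relative_runtime(0.3, 2.0)` comes out at roughly 1.3 times the native runtime, while a workload with only 1% I/O would barely notice the same slowdown.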

Passthrough I/O

Passthrough I/O greatly improves performance by mapping the physical device into the guest's address space through the guest page tables, so that the guest writes directly to the device. This eliminates most of the overhead of trapping to the hypervisor for every operation and brings the bulk of I/O processing to near-native speeds.


There are several issues that need to be addressed to effectively virtualize I/O using this technique. Consider the case of a guest using DMA accesses to communicate with a device. In this scenario we need to account for the following issues.

Isolation

The goal of virtualization is to sandbox the guest OS to keep it from accessing the data of other guest OSes. For accesses made by the core we do this by adding a second stage translation. However, DMA devices operate on physical addresses and are not aware of second stage translations. So if a guest is given unrestricted access to a DMA device, it can read or write any physical address in memory and corrupt the memory of other guests. A protection mechanism is therefore needed to make sure that DMA requests a device issues on behalf of a particular guest only reach memory associated with that guest.

Furthermore, more than one guest may need to access the same device. The device needs to be able to distinguish between accesses coming from different guests and direct them correctly.

Physical Address

To complete the DMA transaction the guest OS needs to provide the device with the proper physical address in memory to find the data. But the guest does not know the physical address of the data, only the Intermediate Physical Address (IPA), which is in essence a virtual address. For the DMA access to work the device must be able to translate the IPA to the correct physical address.

Contiguous Memory Blocks

The problem cannot be solved by just providing the device with the correct PA. The device expects the DMA target region to be located in a contiguous region of memory. In a virtualized environment this is not guaranteed: the hypervisor may back a guest's memory with non-contiguous physical pages in blocks as small as 4K. So the device must be able to have this translation performed across the entire DMA region, not just for its starting address.
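
The sketch below illustrates why a single address is not enough. It walks a guest DMA buffer page by page through a toy second stage table and returns the physical segments that actually back it. The table is a plain dictionary from guest page number to physical page number, and the function name and 4K page size are assumptions made for this illustration.

```python
PAGE_SIZE = 4096


def translate_dma_region(stage2_table, ipa, length):
    """Split the guest buffer [ipa, ipa + length) into physical segments.

    stage2_table maps guest (IPA) page number -> physical page number.
    Returns a list of (physical_address, byte_count) tuples.
    """
    segments = []
    end = ipa + length
    while ipa < end:
        page, offset = divmod(ipa, PAGE_SIZE)
        if page not in stage2_table:
            raise LookupError(f"guest page {page:#x} is not mapped")
        chunk = min(PAGE_SIZE - offset, end - ipa)
        pa = stage2_table[page] * PAGE_SIZE + offset
        # Merge with the previous segment when the pages happen to be adjacent.
        if segments and segments[-1][0] + segments[-1][1] == pa:
            segments[-1] = (segments[-1][0], segments[-1][1] + chunk)
        else:
            segments.append((pa, chunk))
        ipa += chunk
    return segments
```

A buffer that looks contiguous to the guest can come back as several segments. A device that was handed only the first physical address and the total length would run off the end of the first page into memory that may belong to someone else.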

32 Bit Devices In Larger Address Spaces

This problem is similar to the problem with a 32-bit guest on a 64-bit host discussed in the previous post. The system may have older devices that cannot access the complete, larger address space of newer systems. An address translation is necessary to use these devices for DMA outside their normal addressable range.

Hardware Support


The problems mentioned above are not easily solved in software and need a hardware solution that correctly maps device addresses to the correct guest. Most platforms have hardware solutions for this. This mechanism is called an IOMMU, for I/O Memory Management Unit. Intel calls their implementation VT-d, AMD calls their implementation AMD-Vi, and ARM calls their implementation the System MMU.

The basic idea for the IOMMU is simple. An address translation unit is placed between memory and any device that may be used by a guest OS. When the hypervisor sets up second stage page tables for a guest OS to access the device, it sets up the IOMMU too. As with table walks in the core, address translations are expensive, so TLBs are implemented to reduce the overhead of address translations.
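
As a rough software model of this idea, the sketch below keeps one translation table per device and a small cache in front of it that plays the role of the TLB. It is a toy: the class and method names are invented, the "table walk" is a dictionary lookup, and the policy of rejecting unmapped addresses with an exception is an assumption of the sketch rather than the behavior of any particular IOMMU.

```python
PAGE_SIZE = 4096


class SimpleIommu:
    def __init__(self):
        self.tables = {}   # device_id -> {I/O page number -> physical page number}
        self.tlb = {}      # (device_id, I/O page number) -> physical page number
        self.walks = 0     # counts slow-path table walks

    def map_page(self, device_id, io_page, phys_page):
        self.tables.setdefault(device_id, {})[io_page] = phys_page

    def unmap_page(self, device_id, io_page):
        self.tables.get(device_id, {}).pop(io_page, None)
        self.tlb.pop((device_id, io_page), None)   # invalidate the stale entry

    def translate(self, device_id, io_addr):
        page, offset = divmod(io_addr, PAGE_SIZE)
        key = (device_id, page)
        if key not in self.tlb:
            self.walks += 1                        # slow path: walk the table
            table = self.tables.get(device_id, {})
            if page not in table:
                raise PermissionError("DMA to unmapped address blocked")
            self.tlb[key] = table[page]
        return self.tlb[key] * PAGE_SIZE + offset
```

Two accesses to the same page cost one walk, and a device can only reach pages that were explicitly mapped for it, which is the isolation property discussed earlier. Note that unmapping also has to invalidate the cached entry, otherwise the device could keep using a stale translation.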

An example system showing where the System MMU can be located. Transactions with the device are translated through the System MMU.

System MMU

The ARM System MMU is programmed with different translation contexts. It maps each transaction to the corresponding context by matching against expected streams. Based on the context the System MMU may either bypass the translation, cause a fault, or perform a translation. The System MMU in the ARM architecture provides full two-stage translation support (as described in the previous post), and depending on the context it may perform either a first stage translation or a second stage translation. To perform the translation the System MMU has registers analogous to the TTBRs and other control registers for each context.
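
The stream-to-context step can be pictured with a small model. In the sketch below each match entry carries a stream ID, a mask, and an action. The field names, the convention that set mask bits are ignored in the comparison, and the choice to fault on unmatched streams are all assumptions made for illustration; as noted later in the post, the real matching bits are implementation defined.

```python
from dataclasses import dataclass
from enum import Enum

PAGE_SIZE = 4096


class Action(Enum):
    BYPASS = "bypass"
    FAULT = "fault"
    TRANSLATE = "translate"


@dataclass
class StreamMatch:
    stream_id: int     # value to compare against
    mask: int          # bits set here are ignored in the comparison (assumed)
    action: Action
    context: int = -1  # index of the translation context, if any


class SystemMmuModel:
    def __init__(self, matches, contexts):
        self.matches = matches     # list of StreamMatch entries
        self.contexts = contexts   # context index -> {page number -> page number}

    def handle(self, stream_id, address):
        """Return (action taken, output address or None)."""
        for m in self.matches:
            if (stream_id & ~m.mask) == (m.stream_id & ~m.mask):
                break
        else:
            return Action.FAULT, None          # unmatched stream (assumed policy)
        if m.action is Action.BYPASS:
            return Action.BYPASS, address      # address passes through unchanged
        if m.action is Action.FAULT:
            return Action.FAULT, None
        page, offset = divmod(address, PAGE_SIZE)
        table = self.contexts[m.context]
        if page not in table:
            return Action.FAULT, None          # no mapping in this context
        return Action.TRANSLATE, table[page] * PAGE_SIZE + offset
```

The per-context dictionary here stands in for the page tables that the context's TTBR-like registers would point to.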

The System MMU may also encounter faults during its translation process or if a context is not set up. Depending on the type of fault and how the System MMU is configured, it may take certain actions. A translation fault can trigger an interrupt. This gives the hypervisor an opportunity to service the interrupt and restart the translation so it can come to completion. The System MMU may also send a bus error to the appropriate requestor. Syndrome registers are present to ease the process of diagnosing and fixing the problem.
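
The fault path can be sketched in the same toy style. Below, a failed lookup raises an exception that records the context and address, loosely playing the role of the syndrome registers. A handler supplied by the caller stands in for the hypervisor's interrupt service routine: it either returns a physical page to install, after which the transaction is retried, or gives up, in which case the requestor sees an error. All names are hypothetical.

```python
PAGE_SIZE = 4096


class TranslationFault(Exception):
    def __init__(self, context, address):
        super().__init__(f"context {context}: no mapping for {address:#x}")
        self.context = context   # recorded for the handler, like a syndrome
        self.address = address


def translate(tables, context, address):
    page, offset = divmod(address, PAGE_SIZE)
    try:
        return tables[context][page] * PAGE_SIZE + offset
    except KeyError:
        raise TranslationFault(context, address) from None


def dma_with_fault_handling(tables, context, address, fix_mapping):
    """Retry a faulting transaction once after the fault handler runs.

    fix_mapping(context, page) returns a physical page number to install,
    or None to give up, in which case the requestor sees a bus error.
    """
    try:
        return translate(tables, context, address)
    except TranslationFault as fault:
        page = fault.address // PAGE_SIZE
        phys_page = fix_mapping(fault.context, page)
        if phys_page is None:
            raise OSError("bus error reported to requestor") from fault
        tables.setdefault(fault.context, {})[page] = phys_page
        return translate(tables, context, address)
```

A single retry keeps the sketch short; how a real System MMU stalls, retries, or terminates a transaction depends on its configuration.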

Some advantages of the System MMU don't even need virtualization. Since the System MMU enables every device to perform VA to PA translations, I/O operations can be performed by drivers in user space using VAs. The permission checking and translation maps can ensure one user application does not corrupt the memory of another application. This would eliminate the traps to the kernel presently required, further reducing I/O overhead. Another benefit concerns contiguous memory. Many operations result in very large DMA accesses for which the OS cannot allocate a single contiguous chunk of memory. Presently they need to either be split into multiple DMA requests or performed with complex DMA scatter-gather operations. The System MMU enables the device to perform DMA on a contiguous VA range instead of fragmented PAs. This both reduces the CPU overhead and simplifies the software and the device.
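
The contrast between the two approaches can be shown in a few lines. In this sketch, written under the assumption of 4K pages and with invented function names, the first function builds the descriptor list a driver would need without an IOMMU, and the second maps the same scattered pages at consecutive I/O page numbers so that the device needs only one descriptor.

```python
PAGE_SIZE = 4096


def scatter_gather_list(phys_pages):
    """Without an IOMMU: one descriptor per physically separate page."""
    return [(page * PAGE_SIZE, PAGE_SIZE) for page in phys_pages]


def map_contiguous(io_table, io_base_page, phys_pages):
    """With an IOMMU: map scattered pages at consecutive I/O page numbers.

    io_table maps I/O page number -> physical page number and is updated
    in place. Returns the single (io_address, length) descriptor the
    device needs.
    """
    for i, page in enumerate(phys_pages):
        io_table[io_base_page + i] = page
    return (io_base_page * PAGE_SIZE, len(phys_pages) * PAGE_SIZE)
```

For three scattered pages the first function returns three descriptors, while the second returns one descriptor covering the whole range. The cost moves into setting up the translation table, which is done once rather than on every transfer.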

It should be noted that the System MMU is a part of the platform rather than a part of the core architecture. This means it only affects the drivers. Because of this, many features are implementation defined. For example, the bits used to match a stream and map it to a context are implementation defined. Since no user code is aware of this part of the system, changes to the System MMU architecture wouldn't raise as many legacy code issues.

So using these techniques the hypervisor can provide an appropriate implementation of virtualized I/O according to the use case. This concludes the third installment of this series on virtualization. The series continues in the next post, which discusses the use cases for virtualization, especially those targeted in the mobile space by ARM.

References

For more information check out the following resources.

http://xpgc.vicp.net/course/svt/TechDoc/ch12-IOArchitecturesForVirtualization.pdf

http://nowlab.cse.ohio-state.edu/NOW/dissertations/huang.pdf

http://www.ibm.com/developerworks/linux/library/l-virtio/

http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaat/liaatbestpractices_pdf.pdf

http://www.mulix.org/lectures/xen-iommu/xen-io.pdf

http://developer.amd.com/wordpress/media/2012/10/IOMMU-ben-yehuda.pdf

http://www.arm.com/files/pdf/System-MMU-Whitepaper-v8.0.pdf

http://software.intel.com/en-us/articles/intel-virtualization-technology-for-directed-io-vt-d-enhancing-intel-platforms-for-efficient-virtualization-of-io-devices

http://support.amd.com/us/Processor_TechDocs/48882.pdf
