www.virtualopensystems.com
EC H2020 dRedBox: Seminar School at Polytech'Clermont-Ferrand
Kevin CHAPPUIS 2017-11-6
Virtual Open Systems Confidential & Proprietary 2
Virtual Open Systems
Virtual Open Systems Proprietary
Part 1: Virtual Open Systems Company Overview
Part 2: Data-Centers Disaggregation in dRedBox
H2020 dRedBox project description
➢ dRedBox (disaggregated recursive data-center in a box)
➢ Project duration: January 2016 - December 2018
➢ Total Cost: EUR 6 451 500
➢ Objective: to innovate data-centre architecture, shifting from monolithic clusters of machines to disaggregated pools of components
➢ The dReDBox proposition has the ambition to lead to significantly improved levels of utilization, scalability, reliability and power efficiency, both in conventional cloud and edge data-centres.
Towards data-centers disaggregation (1/2)
Current data-centers design
• Physical servers composed of CPUs, memory, accelerators, storage
• Impose a fixed resource assignment ratio
– Low resource utilisation
– Energy waste (unused HW still powered on)
– Higher price
Towards data-centers disaggregation (2/2)
Disaggregated data-centers design
• Memory and accelerators separate from CPU brick
• Flexible resource assignment
– High resource utilisation
– Energy optimization (Power off unused resources)
– Lower TCO (Total Cost of Ownership)
Virtualization: Memory disaggregation
Host with minimal local RAM (hypervisor, services)
Memory for a guest obtained from a disaggregated pool
Guest VM uses disaggregated resources exclusively
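QEMU virtualizes the guest; from the hypervisor's point of view, each QEMU/VM instance is just a process. QEMU obtains memory as host virtual addresses (HVA) and exposes it to the guest as guest physical addresses (GPA), through memory hotplug in the guest.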
Physical RAM
– Local only for the hypervisor
– QEMU process (VMs) uses remote memory only
– More remote memory attached on demand by orchestrator
How to balance it to limit physical reconfiguration? Memory Ballooning
Virtualization: Memory Ballooning
Guest is launched with specific RAM size
Balloon driver operates within the guest RAM capacity
– Inflate – reserving VM’s pages (making them unusable by the guest)
– Deflate – releasing pages
Reserved pages are reported to the hypervisor and may be reused
When the balloon is empty, it is possible to hotplug new memory to the guest and pass it to the balloon.
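The inflate, deflate and hotplug steps described on this slide can be modelled with simple page accounting. The sketch below is only an illustration in user-space C: the structure and function names (guest_mem, balloon_inflate, balloon_deflate, guest_hotplug) are hypothetical and do not correspond to the virtio-balloon driver or to any QEMU interface.

```c
#include <stddef.h>

/* Hypothetical model of a guest's memory accounting, counted in pages. */
struct guest_mem {
    size_t total_pages;    /* RAM size the guest currently sees */
    size_t balloon_pages;  /* pages held by the balloon (unusable by the guest) */
};

/* Pages the guest can actually use. */
size_t guest_usable(const struct guest_mem *g)
{
    return g->total_pages - g->balloon_pages;
}

/* Inflate: reserve up to n guest pages; returns the number of pages
 * actually reserved, which the hypervisor may then reuse. */
size_t balloon_inflate(struct guest_mem *g, size_t n)
{
    size_t avail = guest_usable(g);

    if (n > avail)
        n = avail;
    g->balloon_pages += n;
    return n;
}

/* Deflate: give up to n pages back to the guest; returns the number released. */
size_t balloon_deflate(struct guest_mem *g, size_t n)
{
    if (n > g->balloon_pages)
        n = g->balloon_pages;
    g->balloon_pages -= n;
    return n;
}

/* Hotplug: the guest sees n more pages, which are passed to the balloon
 * first, so they only become usable after a later deflate. */
void guest_hotplug(struct guest_mem *g, size_t n)
{
    g->total_pages += n;
    g->balloon_pages += n;
}
```

In this model a deflate request larger than the balloon simply empties it, which is the situation where hotplugging more memory becomes the way to grow the guest further.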
Secure Computing Bricks: Multi-OS consolidation on ARMv8
[Figure: ARMv8-A hardware (CPU1 to CPU4) running VOSYSmonitor; the Normal world hosts Linux KVM with non-secure virtual machines; the Secure world hosts Secure Virtual Machine 1 (secure RTOS: monitoring, secure gateway, etc.) and Secure Virtual Machine 2 (TEE: secure services)]
➢ Provide spatial and temporal isolation through TrustZone
➢ Support legacy RTOS for monitoring applications
➢ Virtualization features (KVM) remain intact for the GPOS
➢ Flexibility for static allocation or overcommitment of hardware resources
Secure Computing bricks: VM deployment
[Figure: Computing Node 1 and Computing Node 2, each with four CPUs running VOSYSmonitor and hosting Linux, Secure RTOS and TEE instances; a shared memory area in the disaggregated memory and IP stack communication between the nodes]
➢ CPU disaggregation – Secure computing brick: possibility to deploy secure execution environments to remote cores through a proper communication link between computation bricks.
Part 3: Introduction to Virtualization Concepts
http://www.virtualopensystems.com/en/solutions/guides/kvm-on-armv8/
Part 4: ARMv8 Architecture Introduction
ARM Architecture evolution
[Figure: ARM architecture timeline, from ARMv7-A cores (Cortex-A15, Cortex-A9 ...) to ARMv8-A cores (Cortex-A72, Cortex-A57, Cortex-A53)]
ARMv8-A overall description
Architecture profiles: A – application / R – real-time / M – microcontroller
ARMv8-A - AARCH64 Execution state:
• 31 General Purpose (GP) registers
– 64-bit GP registers X0-X30 (32-bit access through W0-W30)
– No banking of GP registers
• Stack pointer (SP) is a dedicated register (one per Exception Level)
• Program counter is not a GP register
• Support for Floating Point and Advanced SIMD (32 registers of 128 bits)
• PSTATE register (e.g., ALU flags, exception masks)
• System register access
– MRS x2, sp_el1
ARMv8-A instruction set
mov x16, #0x10 => Write a value in a register
ldr x4, [x21] => Read the memory space pointed by x21 and put the value in x4.
str x5, [x11] => Write in the memory space pointed by x11 the value contained in X5.
cmp x0, #0x20 => Compare the value contained in X0 with 0x20
beq _label => If it is equal, branch to _label
bl function => Branch to a function and save the return address in the link register (x30)
lsl x18, x4, #2 => Shift the value contained in x4 left by 2 bits and put the result in x18
and x6, x2, x4 => Do a logical “and” operation between x2 and x4 and put the result in x6
orr x0, x1, x2 => Do a logical “or” operation between x1 and x2 and put the result in x0
ARMv8-A exception level
• Exception level changes happen through specific instructions: SMC, SVC, HVC, ERET
• The Secure world is completely isolated (memory, devices, etc.) from the Normal world by the ARM TrustZone security extensions. Since TrustZone is implemented in hardware, it reduces the exposure to security vulnerabilities. The Secure world can be used to run a secure OS that provides secure services to the OS running in the Normal world.
• The Normal world can concurrently run another OS (e.g., Linux) without impacting the secure OS.
• ARM Virtualization extensions address the needs of devices for the partitioning and management of complex software environments into virtual machines.
• The monitor layer is the highest privilege level; it provides a bridge between the two worlds to allow some interactions.
ARMv8-A features: ARM TrustZone
[Figure: TrustZone layout. Normal world: Rich OS applications, Rich OS, normal HW resources and peripherals. Secure world: secure applications, Safety/Secure OS, secure HW resources and peripherals. A shared memory sits between the two worlds, above the secure monitor firmware and the hardware]
TrustZone splits the core into two compartments (i.e., Normal world / Secure world)
Secure monitor firmware (EL3) is needed to support context switching between worlds
Each compartment has access to its own MMU, allowing the isolation of Secure and Normal translation tables.
Cache has tag bits to discern content cached by either the secure or the normal world.
Security information is propagated on the AXI/AHB bus
Memory and peripherals can also be made secure
Provides secure interrupts
ARMv8-A features: virtualization extension
[Figure: virtual machines running on top of a hypervisor at EL2]
The ARMv8-A architecture includes hardware virtualization extensions and the Large Physical Address Extension (LPAE) to support the efficient implementation of virtual machine hypervisors:
• Dedicated exception level (EL2) for the hypervisor.
• Full virtualization capability to run an OS in a virtual machine without any modification.
• Combination of hardware features to minimize the need for hypervisor intervention.
Some hypervisors compliant with the ARM architecture:
• Linux-KVM
• XEN
ARMv8-A features: Memory Management
[Figure: ARM core, MMU (TLBs, page tables), caches and memory; the virtual address space holds user space (TTBR0), kernel space (TTBR1) and a not-mapped region (MMU fault)]
• The MMU handles translation of virtual addresses to physical addresses.
• The address translation is performed through the TLB* or a table walk.
• AARCH64 supports up to 48 bits of virtual address
• All ELs have independent MMU configuration
• The page table supports different translation granules
• Each page table entry carries its own attributes
– Access permissions (Read/Write - User/Privileged modes)
– Memory types (Caching/Buffering rules, Shareable, etc)
*Translation Look-aside Buffer
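To make the TTBR0/TTBR1 split concrete, the sketch below classifies a 64-bit virtual address the way the slide's diagram does. It assumes a 48-bit virtual address size for both halves and ignores optional features such as top-byte-ignore; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Which translation table base register (if any) covers an address. */
enum va_region {
    VA_TTBR0_USER,    /* low range, translated through TTBR0 */
    VA_TTBR1_KERNEL,  /* high range, translated through TTBR1 */
    VA_UNMAPPED       /* neither range: the MMU raises a fault */
};

/* Assumption: 48-bit virtual addresses in both halves, so bits [63:48]
 * must be all zeros (TTBR0) or all ones (TTBR1). */
enum va_region classify_va(uint64_t va)
{
    uint64_t top = va >> 48;  /* bits [63:48] */

    if (top == 0x0000)
        return VA_TTBR0_USER;
    if (top == 0xFFFF)
        return VA_TTBR1_KERNEL;
    return VA_UNMAPPED;
}
```

For example, 0x0000000000400000 falls in the TTBR0 range, 0xFFFF000000000000 in the TTBR1 range, and 0x0001000000000000 in neither.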
ARMv8-A features: Cache memory
[Figure: main memory words at addresses 0x00 to 0x18 mapped onto a two-way cache memory (way 0 and way 1, indexes 0 to 3 in each way)]
Cortex-A53
• L1 cache
– Instruction and data separated
– Instruction 2 ways / Data 4 ways
– Size 8KB to 64KB - Cache line length 64 bytes
– L1 cache access => ~1 cycle
• L2 cache
– 16-way set associative
– Size 128KB to 2MB
– Cache line length 64 bytes
– L2 cache access => ~10 cycles
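The relation between cache size, number of ways and line length determines how an address is split into a set index and a tag. The sketch below shows that arithmetic for a generic set-associative cache; the names are hypothetical, and the 32KB size used in the example is just one assumed value inside the 8KB to 64KB range quoted above. Whether a real cache is indexed with virtual or physical addresses is left out of this model.

```c
#include <stdint.h>

/* Geometry of a set-associative cache. */
struct cache_geom {
    uint32_t size_bytes;  /* total capacity */
    uint32_t ways;        /* associativity */
    uint32_t line_bytes;  /* cache line length */
};

/* Number of sets (the "indexes" of the figure): size / (ways * line). */
uint32_t cache_num_sets(const struct cache_geom *c)
{
    return c->size_bytes / (c->ways * c->line_bytes);
}

/* Set index selected by an address: line number modulo the number of sets. */
uint32_t cache_set_index(const struct cache_geom *c, uint64_t addr)
{
    return (uint32_t)((addr / c->line_bytes) % cache_num_sets(c));
}

/* Tag stored alongside the line: the address bits above the index. */
uint64_t cache_tag(const struct cache_geom *c, uint64_t addr)
{
    return addr / c->line_bytes / cache_num_sets(c);
}
```

With an assumed 32KB, 4-way data cache and 64-byte lines there are 128 sets, so two addresses that are 128 * 64 = 8192 bytes apart select the same set and can only coexist by occupying different ways.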
ARMv8-A features: Interrupt management
[Figure: external sources feed the interrupt controller's distributor, which is connected to one CPU interface per core (CPU 0, CPU 1); each CPU interface signals IRQ and FIQ to its core]
ARM provides a Generic Interrupt Controller (GIC) which supports routing of software-generated, private and shared peripheral interrupts between cores. It is composed of:
• Distributor: all interrupt sources are connected to it. It controls the type, priority and state of each interrupt, and the core targeted through the CPU interface.
• CPU interface: the path through which a core receives an interrupt. It provides the ability to mask, identify and control the state of interrupts.
ARM processors include two types of interrupts:
– Fast Interrupt Request (FIQ) is the highest priority. In AARCH32, some banked registers are allocated to the FIQ handler. FIQ could be used for secure applications.
– General Interrupt Request (IRQ)
ARMv8-A Vector Table (cntd)
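ARMv8 vector table
Offset   Exception type      Exception taken from
0x780    Serror / vSerror    Lower EL using AARCH32
0x700    FIQ / vFIQ          Lower EL using AARCH32
0x680    IRQ / vIRQ          Lower EL using AARCH32
0x600    Synchronous         Lower EL using AARCH32
0x580    Serror / vSerror    Lower EL using AARCH64
0x500    FIQ / vFIQ          Lower EL using AARCH64
0x480    IRQ / vIRQ          Lower EL using AARCH64
0x400    Synchronous         Lower EL using AARCH64
0x380    Serror / vSerror    Current EL with SPx
0x300    FIQ / vFIQ          Current EL with SPx
0x280    IRQ / vIRQ          Current EL with SPx
0x200    Synchronous         Current EL with SPx
0x180    Serror / vSerror    Current EL with SP0
0x100    FIQ / vFIQ          Current EL with SP0
0x080    IRQ / vIRQ          Current EL with SP0
0x000    Synchronous         Current EL with SP0
Lower EL using AARCH32: exception generated at a lower EL running in AARCH32 and routed to a higher EL
Lower EL using AARCH64: exception generated at a lower EL running in AARCH64 and routed to a higher EL
Current EL with SPx: exception directly caught in the current EL with SP_ELx
Current EL with SP0: exception directly caught in the current EL with SP_EL0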
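The offsets of the ARMv8 vector table follow a regular pattern: each of the four exception sources occupies a 0x200-byte group, and each of the four exception types a 0x80-byte slot inside it. The sketch below computes an entry's offset and address from that pattern; the enum and function names are hypothetical, and the base address stands for the value programmed in VBAR_ELn.

```c
#include <stdint.h>

/* Where the exception was taken from (one 0x200-byte group each). */
enum vec_source {
    CUR_EL_SP0   = 0,  /* current EL, using SP_EL0 */
    CUR_EL_SPX   = 1,  /* current EL, using SP_ELx */
    LOWER_EL_A64 = 2,  /* lower EL running in AArch64 */
    LOWER_EL_A32 = 3   /* lower EL running in AArch32 */
};

/* Exception type (one 0x80-byte slot each). */
enum vec_type {
    VEC_SYNC   = 0,
    VEC_IRQ    = 1,
    VEC_FIQ    = 2,
    VEC_SERROR = 3
};

/* Offset of a vector table entry from the table base. */
uint32_t vector_offset(enum vec_source src, enum vec_type type)
{
    return (uint32_t)src * 0x200u + (uint32_t)type * 0x80u;
}

/* Address of the entry, given the base held in VBAR_ELn. */
uint64_t vector_address(uint64_t vbar, enum vec_source src, enum vec_type type)
{
    return vbar + vector_offset(src, type);
}
```

For instance, an IRQ taken at the current EL with SP_ELx lands at offset 0x280, and an SError from a lower EL in AArch32 at 0x780, matching the first and last groups of the table.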
ARMv8-A Vector Table
Separate vector tables for each exception level. The location of each table is defined in the VBAR_ELn register.
Synchronous exception
• Aborts from MMU
• SP & PC alignment fault
• Undefined instruction
• Service calls: SVC, SMC, HVC
Serror => Asynchronous data abort (e.g., abort triggered by writeback of a dirty cache line)
Information registers for exceptions:
ESR_ELx => Includes information about the reason for the exception
FAR_ELx => Holds the faulting address
ELR_ELx => Holds the exception return address (for a data abort, the address of the instruction that caused it)
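As an illustration of how a handler can use ESR_ELx, the sketch below extracts the three architectural fields of the register: the exception class (EC, bits [31:26]), the instruction length bit (IL, bit 25) and the instruction specific syndrome (ISS, bits [24:0]). The helper names are hypothetical, and only a handful of exception classes related to the synchronous exceptions listed above are named.

```c
#include <stdint.h>

/* Exception Class: ESR_ELx bits [31:26]. */
uint32_t esr_ec(uint64_t esr)
{
    return (uint32_t)((esr >> 26) & 0x3F);
}

/* Instruction Length: ESR_ELx bit 25 (1 for a 32-bit instruction). */
uint32_t esr_il(uint64_t esr)
{
    return (uint32_t)((esr >> 25) & 0x1);
}

/* Instruction Specific Syndrome: ESR_ELx bits [24:0]. */
uint32_t esr_iss(uint64_t esr)
{
    return (uint32_t)(esr & 0x1FFFFFF);
}

/* Human-readable name for a few exception classes (not exhaustive). */
const char *esr_ec_name(uint32_t ec)
{
    switch (ec) {
    case 0x00: return "unknown reason";
    case 0x15: return "SVC from AArch64";
    case 0x16: return "HVC from AArch64";
    case 0x17: return "SMC from AArch64";
    case 0x20: return "instruction abort from a lower EL";
    case 0x21: return "instruction abort from the current EL";
    case 0x22: return "PC alignment fault";
    case 0x24: return "data abort from a lower EL";
    case 0x25: return "data abort from the current EL";
    case 0x26: return "SP alignment fault";
    default:   return "other";
    }
}
```

A synchronous handler would typically read ESR_ELx first, branch on the exception class, and only then consult FAR_ELx when the class is an abort.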
Part 5: Linux kernel Introduction
Interaction Linux – User
Hundreds of kernel modules (and device drivers) are included in the upstream version of Linux
Thousands are used in everyday life (embedded in all sorts of devices)
Linux uses the file system to allow interaction between a Linux user (process) and the module (kernel)
Kernel and users are separate; interaction between the two has to be kept to a minimum
– Obvious problems of security might arise if a user process can liberally access the kernel structures
– The user activity shouldn’t affect the kernel execution (crashing the kernel means halting all the processes running in the machine)
Interaction Linux – User
Two main types of interaction:
– Read/Write to the device node in /dev/ - This is commonly the case for modules that expose kernel information to be retrieved from the shell.
Example: cat /dev/kmsg
or for some file-system oriented uses:
Example: dd if=/dev/zero of=./zeroed_file bs=1MB count=4
– IOCTLs – A module can export functions that can be called from a user application implemented in C/C++, or in any programming language that supports system calls.
Example: ioctl(fd, KVM_CREATE_VCPU, (void *)vcpu_id);
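The first kind of interaction, reading from a device node, looks the same from C as from the shell: the node is opened and read like a regular file. The sketch below uses only the standard open(), read() and close() calls; the helper name read_device is hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to len bytes from a device node such as /dev/zero.
 * Returns the number of bytes read, or -1 if the node cannot be
 * opened or the read fails. */
ssize_t read_device(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = read(fd, buf, len);  /* served by the driver's read() operation */
    close(fd);
    return n;
}
```

Calling read_device("/dev/zero", buf, sizeof buf) fills buf with zero bytes, which is what the dd example above relies on.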
File operations
Both ways of interaction use the device node as entry point
These files are defined in the kernel module. When the kernel module is loaded, the file appears in /dev/
Every file is associated with a set of methods (operations) that are all defined by the structure struct file_operations. This structure defines the prototypes of methods like:
– int (*open) (struct inode *, struct file *);
– ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
– ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
– long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
File operations
open(): Called when a process attempts to open the file. This function is always called before any other operation is done on the file.
read(): Called when a process attempts to read from the file. The kernel passes the buffer into which the data has to be written as an argument.
write(): Called when a process attempts to write to the file. The kernel passes a pointer to the buffer containing the data to be written as an argument.
unlocked_ioctl(): Called whenever a process issues an ioctl() call on the device file descriptor. For instance:
int fd = open("/dev/slm", O_RDWR);
ioctl(fd, ID_OF_IOCTL, args ...);
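The dispatch idea behind struct file_operations, a table of function pointers that the kernel calls on behalf of the process, can be tried out in plain user-space C. The sketch below is a model only: model_fops, model_file, MODEL_IOCTL_CLEAR and the dev_* handlers are hypothetical names, the signatures are simplified, and none of this is kernel code.

```c
#include <stddef.h>
#include <string.h>

struct model_file;

/* User-space stand-in for the kernel's table of file operations. */
struct model_fops {
    int  (*open)(struct model_file *f);
    long (*read)(struct model_file *f, char *buf, size_t len);
    long (*write)(struct model_file *f, const char *buf, size_t len);
    long (*ioctl)(struct model_file *f, unsigned int cmd, unsigned long arg);
};

/* A "device" backed by a small in-memory buffer. */
struct model_file {
    const struct model_fops *ops;
    char   data[64];
    size_t used;
    int    opened;
};

#define MODEL_IOCTL_CLEAR 1u  /* hypothetical command: empty the buffer */

static int dev_open(struct model_file *f)
{
    f->opened = 1;
    return 0;
}

/* Copy data from the caller into the device buffer (truncating if needed). */
static long dev_write(struct model_file *f, const char *buf, size_t len)
{
    if (len > sizeof(f->data))
        len = sizeof(f->data);
    memcpy(f->data, buf, len);
    f->used = len;
    return (long)len;
}

/* Copy the stored data back into the caller's buffer. */
static long dev_read(struct model_file *f, char *buf, size_t len)
{
    if (len > f->used)
        len = f->used;
    memcpy(buf, f->data, len);
    return (long)len;
}

/* Handle device-specific commands; unknown commands are rejected. */
static long dev_ioctl(struct model_file *f, unsigned int cmd, unsigned long arg)
{
    (void)arg;
    if (cmd != MODEL_IOCTL_CLEAR)
        return -1;
    f->used = 0;
    return 0;
}

/* The operations table a driver would register for its device node. */
const struct model_fops model_dev_fops = {
    .open  = dev_open,
    .read  = dev_read,
    .write = dev_write,
    .ioctl = dev_ioctl,
};
```

A caller only ever goes through the table, for example f.ops->write(&f, "hello", 5), which mirrors how a system call on the device file descriptor ends up in the module's handler.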
Char device
There are several types of devices in Linux: block devices, network devices, char devices, etc.
Usually, all the modules that are not bound to a specific physical device are char devices
– These devices are the most convenient way to exchange information between the kernel and the user space
– They are called ‘character’ devices because the file in /dev/ belonging to the device is used to write and read characters
– In these devices, the operations of read(), write(), seek(), etc. are not handled by any file system (e.g.: EXT4) because they simply read and write a buffer
Anatomy of a Linux module
The entry point of a Linux module is set with module_init(init_fn). init_fn() takes care of:
• Initialization of buffers
• Definition of device ID (for char device, Major number and Minor number)
– It can be done automatically using alloc_chrdev_region()
• Creation and initialization of the device node in /dev/ with class_create() and device_create()
– Alternatively, this can be done in user space with the mknod command
• Creation of the actual device, since the kernel has to know which subsystem is associated with a given device
– The cdev_* functions (e.g., cdev_init(), cdev_add()) implement everything needed for this task
Anatomy of a Linux module
The exit point of a Linux module is set with module_exit(exit_fn). exit_fn() takes care of:
• Freeing of buffers
• Device and class removal
– It can be done using device_destroy() and class_destroy()
• Release of the char device Major and Minor numbers (in case of a char device)
– unregister_chrdev_region() serves this purpose
• Deletion of the device
– In case of a char device: cdev_del()
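A common way to organise the init and exit functions is to acquire resources in a fixed order and release them in the reverse order, including when a step of the initialization fails halfway. The sketch below models only that control flow in user-space C: the flags stand in for the real calls named in the comments, the fail_at parameter exists purely for demonstration, and the specific ordering shown is an assumption rather than something prescribed by the slides.

```c
#include <stdbool.h>

/* Each flag records one resource held by the (modelled) module. */
struct mod_state {
    bool buffers;  /* buffers allocated */
    bool region;   /* major/minor reserved, e.g. alloc_chrdev_region() */
    bool node;     /* class and node created: class_create(), device_create() */
    bool cdev;     /* char device registered: cdev_init(), cdev_add() */
};

/* Model of init_fn(). fail_at: 0 = every step succeeds,
 * 1..4 = that step fails (demonstration only). */
int model_init(struct mod_state *s, int fail_at)
{
    if (fail_at == 1)
        goto err;
    s->buffers = true;

    if (fail_at == 2)
        goto err_buffers;
    s->region = true;

    if (fail_at == 3)
        goto err_region;
    s->node = true;

    if (fail_at == 4)
        goto err_node;
    s->cdev = true;
    return 0;

    /* Undo the completed steps in reverse order. */
err_node:
    s->node = false;     /* device_destroy(), class_destroy() */
err_region:
    s->region = false;   /* unregister_chrdev_region() */
err_buffers:
    s->buffers = false;  /* free the buffers */
err:
    return -1;
}

/* Model of exit_fn(): release everything, last acquired first. */
void model_exit(struct mod_state *s)
{
    s->cdev = false;     /* cdev_del() */
    s->node = false;     /* device_destroy(), class_destroy() */
    s->region = false;   /* unregister_chrdev_region() */
    s->buffers = false;  /* free the buffers */
}
```

The goto labels keep a single unwind path, so a failure at any step leaves no resource behind, which is the property a real init function needs before returning an error.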