A Para-virtualized Interface for Socket Calls
Dimitri Stiliadis, Founder/CEO, Aporeto
@dstiliadis
Stefano Stabellini, Linux Kernel Lead, Aporeto
@stabelinnist
Overview
• Why are we working on this (and what it is not)?
• Use cases
• Proposed protocol
• Performance
• Demo
Security Threats in a Container Environment
• Namespace configuration (capabilities, seccomp, SELinux, AppArmor): secure defaults, but several ways to mess it up
• Container images and sources (validation, vulnerability analysis): addressed by several tools (image scanning, signatures)
• Access control to management daemon: delegated to management systems
• Networking & communications: several options available
• The kernel itself: ?
Security Recommendations (from NCC White Paper)
From “Understanding and Hardening Linux Containers” by NCC Group:
• Run unprivileged containers (user namespaces, root capability dropping)
• Apply a Mandatory Access Control system, such as SELinux
• Build a custom kernel binary with as few modules as possible
• Apply sysctl hardening
• Apply disk and storage limits
• Control device access and limit resource usage with cgroups
• Drop any capabilities which are not required for the application within the container
• Use custom mount options to increase defense in depth
• Apply GRSecurity and PAX patches to Linux
• Reduce Linux attack surface with Seccomp-bpf
• Isolate containers based on trust and exposure
• Logging, auditing and monitoring is important for container deployment
• Use hardware virtualization along application trust zones
It’s Complicated
Picture from Don Norman’s talk “Living with Complexity”
Security Risks: “The Kernel Itself”
Kernel Exploit Disables Security
[Figure: Apps/Docker on top of the Linux kernel (netfilter, Seccomp/BPF), with an attack targeting the kernel.]
And the Kernel is not Free of Bugs
http://www.cvedetails.com/product/47/Linux-Linux-Kernel.html?vendor_id=33
The Alternative: Containers in VMs
[Figure: two panels. “OS Containers”: containers in ring 3 over a single kernel in ring 0. “HW Virtualization”: containers over per-guest kernels with virtual devices, over a hypervisor in root mode.]
• OS containers: simplicity, limited hardware isolation; same kernel for all containers
• HW virtualization: isolation, significant I/O overheads; different OS between hypervisor and guests; device abstraction
The Virtualization Overhead: Example Network
[Figure: network path in two panels. “OS Containers”: container, TCP/IP stack, NS bridge and device in one kernel. “Hardware Virtualization”: container, guest kernel TCP/IP stack, NS bridge and virtual devices, then a bridge and IP stack on the hypervisor side.]
And of course, managing security in multiple kernels
What We Really Want
Container Performance
Virtual Machine Isolation
?
What If We Thought of Virtualization a Little Differently?
Hardware Virtualization Assumptions
• Host and guest OS are different
• Run any guest on any host
• VM moves etc.
OS Virtualization Assumptions (Docker Model)
• All guests share the same type of kernel
• All guests are of the same type
• We don’t care about moves
System Call Virtualization
• Introduce proxy kernel
  • Same as root kernel
  • Allows memory pages re-use
  • Single kernel to manage
• Subset of syscalls delivered to machine kernel
  • Socket, file, time
• Majority of system calls restricted within syscall proxy
[Figure: each container in ring 3 sits on its own syscall kernel proxy. Calls confined to the proxy are labeled “Unprotected”; calls that reach the root kernel through syscall virtualization are labeled “Proxied/Translated Hypercall”.]
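The split described on this slide can be sketched as a small routing table. This is only an illustration: the slide names the socket, file and time families, while the individual syscall names, `FORWARDED_FAMILIES` and `route_syscall` are hypothetical and not taken from the talk or from any Xen interface.

```python
from enum import Enum


class Route(Enum):
    PROXY_KERNEL = "handled inside the syscall proxy kernel"
    ROOT_KERNEL = "forwarded to the root kernel as a hypercall"


# Hypothetical grouping: only the family names (socket, file, time) come
# from the slide; the member syscalls are illustrative assumptions.
FORWARDED_FAMILIES = {
    "socket": {"socket", "connect", "bind", "listen", "accept"},
    "file": {"open", "read", "write", "close"},
    "time": {"clock_gettime", "gettimeofday"},
}


def route_syscall(name: str) -> Route:
    """Decide where a guest system call is serviced."""
    for members in FORWARDED_FAMILIES.values():
        if name in members:
            return Route.ROOT_KERNEL
    # Everything else stays confined to the proxy kernel.
    return Route.PROXY_KERNEL
```

With this table, `route_syscall("connect")` is routed to the root kernel, while a call such as `fork` never leaves the proxy.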
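Example Implementation in Xen
[Figure: an app (container) runs in a Linux DomU VM. PV Calls go from the syscall frontend in DomU across the Xen PV interface to the syscall backend in Dom0; all other syscalls are handled by the Linux DomU internals.]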
Why Xen?
• Efficient para-virtualization interface
• Allows deployments on bare metal and in the cloud
  • Xen on GCP
  • (More complex, though, to do Xen-on-Xen in AWS with para-virtualized I/O)
Example: Network Access
• Translate socket calls to hypercalls
• Container opens a “paravirtualized socket” inside the host OS
• Natively uses the IP stack of the host
• Security and forwarding policies applied at the host
[Figure: a container’s Connect call passes through its syscall kernel proxy and syscall virtualization to the root kernel, which issues the Connect on the host NIC. Container and host NIC both show address 10.1.1.5.]
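A minimal sketch of the translation step, assuming a made-up request layout: the guest side packs `connect()` arguments into a message, and the host side decodes it and opens the socket on the host IP stack. `REQUEST`, `CMD_CONNECT`, `frontend_connect` and `backend_handle` are hypothetical names, and the wire format is not the PV Calls ABI.

```python
import socket
import struct

# Hypothetical wire format (NOT the PV Calls ABI):
# request id (u32), command (u16), IPv4 address (4 bytes), port (u16).
REQUEST = struct.Struct("!IH4sH")
CMD_CONNECT = 1


def frontend_connect(req_id: int, host: str, port: int) -> bytes:
    """Guest side: turn connect(host, port) into a request message."""
    return REQUEST.pack(req_id, CMD_CONNECT, socket.inet_aton(host), port)


def backend_handle(message: bytes) -> tuple:
    """Host side: decode the request and connect using the host IP stack."""
    req_id, cmd, raw_addr, port = REQUEST.unpack(message)
    if cmd != CMD_CONNECT:
        raise ValueError(f"unsupported command {cmd}")
    address = (socket.inet_ntoa(raw_addr), port)
    # The socket lives on the host, so host policies apply to it.
    sock = socket.create_connection(address, timeout=2)
    return req_id, sock
```

Because the socket is created by the backend, the connection originates from the host's address, which matches the single 10.1.1.5 address shown for both sides.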
Example: Network Access with Namespaces
• Container namespace is created at the host as before
• Container process is launched inside a protected VM
• Through system call virtualization, system calls are applied to the namespace context
• Container gets the IP address of the network namespace
• Transparent to Docker and other container systems
[Figure: same path as the previous example, but the Connect is applied in a network namespace attached to a bridge at the host. Container and bridge both show address 192.168.2.1.]
PV Calls for networking
Ports opened in a VM are opened on the host
Enable cross-domain network namespaces and SELinux labels
Zero-conf networking in VMs
• no need for a bridge in Dom0
• works with wireless networks, VPNs, any other special configurations in Dom0
First Implementation
• Design document
  • http://marc.info/?l=xen-devel&m=147033114613017
• Code
  • First, simple implementation on Xen
  • 1 command ring
  • Per socket:
    • data ring
    • event ring
  • Variable ring data sizes, configurable per socket
  • Supported functions (socket, connect, release, bind, listen, accept, poll)
  • git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git pvcalls-5
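The per-socket ring layout can be pictured with a toy model. This is a sketch under stated assumptions, not the code in the repository above: `ByteRing` and `PVSocket` are hypothetical names, the ring uses free-running producer and consumer indices with a power-of-two size, and a `deque` stands in for the event ring.

```python
from collections import deque


class ByteRing:
    """Single-producer, single-consumer byte ring with free-running indices."""

    def __init__(self, size: int) -> None:
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.buf = bytearray(size)
        self.size = size
        self.prod = 0  # total bytes ever written
        self.cons = 0  # total bytes ever read

    def write(self, data: bytes) -> int:
        """Copy as much of data as fits; return the number of bytes queued."""
        free = self.size - (self.prod - self.cons)
        n = min(free, len(data))
        for i in range(n):
            self.buf[(self.prod + i) & (self.size - 1)] = data[i]
        self.prod += n
        return n

    def read(self, limit: int) -> bytes:
        """Remove and return up to limit bytes in FIFO order."""
        n = min(self.prod - self.cons, limit)
        out = bytes(self.buf[(self.cons + i) & (self.size - 1)] for i in range(n))
        self.cons += n
        return out


class PVSocket:
    """One socket: its own data ring (size chosen per socket) and event queue."""

    def __init__(self, data_ring_size: int = 4096) -> None:
        self.data = ByteRing(data_ring_size)
        self.events = deque()  # stand-in for the per-socket event ring

    def send(self, payload: bytes) -> int:
        queued = self.data.write(payload)
        if queued:
            self.events.append(("data", queued))  # notify the other end
        return queued
```

Passing a different `data_ring_size` per `PVSocket` mirrors the "configurable per socket" bullet; a single shared command ring for socket, connect, bind and the other supported calls is left out of this sketch.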
PV Calls Benchmarks
Xen 4.7.0-rc3, Linux v4.6-rc2
Dom0: 4 vcpus, pinned, 28 GB RAM
DomU: 4 vcpus, pinned, 4 GB RAM
[Figure: two test setups across the Xen PV interface, each with an app (container) using POSIX calls in a Linux DomU VM and PV Calls on both sides. In the first, the VM runs iperf -c 127.0.0.1 and Dom0 runs iperf -s; in the second, the VM runs iperf -s and Dom0 runs iperf -c 127.0.0.1.]
[Figure: PV Calls benchmark charts, annotated “?!”.]
How is that possible?
And, you use something like that today (Docker for Mac and VPNKit)
[Figure: Docker for Mac. Containers run over a kernel with virtual devices inside a VM on Mac OS X; a socket proxy connects the VM to the host.]
The “simplistic” version of the syscall proxy
“VPNKit operates by reconstructing Ethernet traffic from the VM and translating it into the relevant socket API calls on OSX or Windows. This allows the host application to generate traffic without requiring low-level Ethernet bridging support.”
Extensions
• Mechanism is generic and can be extended to other system calls
• Co-processing of system calls is also possible
  • Guest can process system call parameters and translate at hypercall
  • Resolve memory references (pointers)
  • Resolves TOCTOU risk of system call interposition
    • Time of Check/Time of Use
• Using N/N+1 kernel versions can reduce attack surface further
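The TOCTOU point can be illustrated with a deterministic toy model. It assumes a policy check on a port number; `ALLOWED_PORTS`, both function names and the `mutate` callback (which plays the role of a guest thread rewriting its memory between check and use) are hypothetical and not part of the talk.

```python
ALLOWED_PORTS = {80, 443}


def check_then_use_shared(args: dict, mutate) -> int:
    """Racy pattern: validate the shared argument, then read it again."""
    if args["port"] not in ALLOWED_PORTS:
        raise PermissionError("port not allowed")
    mutate(args)          # the guest rewrites its memory after the check
    return args["port"]   # the value actually used was never validated


def snapshot_then_check(args: dict, mutate) -> int:
    """Copy the arguments once, then check and use only the private copy."""
    private = dict(args)  # pointers resolved, parameters copied out
    if private["port"] not in ALLOWED_PORTS:
        raise PermissionError("port not allowed")
    mutate(args)          # later guest writes no longer matter
    return private["port"]
```

With a `mutate` that sets the port to 22, the first function ends up using port 22 although it checked port 80, while the second uses the value it validated. Resolving pointers when the call is translated into a hypercall gives the second behavior.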
Demo 1: Performance Comparison
[Figure: side by side, containers on a single shared kernel versus containers behind syscall proxies over the root kernel with syscall virtualization.]
No noticeable performance difference
Demo 2: Kernel Exploit
[Figure: Docker containers on a single shared kernel.]
Vulnerable container crashes machine and all other containers
[Figure: containers behind syscall proxies over the root kernel with syscall virtualization.]
Vulnerable container crashes itself only
Attack contained