Grupo de Arquitectura de Computadores, Comunicaciones y Sistemas
VIDAS: OBJECT-BASED VIRTUALIZED DATA SHARING FOR HIGH PERFORMANCE STORAGE I/O
ScienceCloud '13, New York City, June 2013
Pablo Llopis, Javier Garcia Blas, Florin Isaila, Jesus Carretero
Computer Architecture and Technology Group (ARCOS), University Carlos III of Madrid, Spain
Agenda
1. Motivation
2. Goals and common problems
3. Common storage I/O virtualization solutions
4. VIDAS Design and Implementation
5. Evaluation
6. Conclusion
Motivation3
‣ HPC in virtualized clouds is gaining popularity
‣ Increasing data volumes stress the importance of I/O performance
‣ I/O overhead in virtualized environments is still significant
‣ Broad agreement that POSIX is not suitable for high performance I/O
‣ Virtualized environments trade off protection for performance
‣ Lack of I/O coordination mechanisms across domains
Goals and contributions
Propose new node-level abstractions and mechanisms for virtualized environments that:
‣ Enable building efficient virtualized data sharing
‣ Coordinate I/O across domains
‣ Provide shared access spaces across node-local domains
‣ Relax POSIX consistency
‣ Allow flexible data write and data read policies
‣ Expose data locality
Agenda
1. Motivation
2. Goals and common problems
3. Common storage I/O virtualization solutions
3.1. Block-level solution
3.2. Filesystem-level solution
4. VIDAS Design and Implementation
5. Evaluation
6. Conclusion
Virtualized block device drivers
[Figure: three block-level virtualization architectures (A1, A2, A3), each showing an application in a guest domain reaching the host through the hypervisor; components include device drivers, a device driver front-end/back-end, a scheduler, and a buffer cache]
A1: virtualized block-level device on top of a physical device driver
A2: virtualized block-level device on top of a filesystem
A3: paravirtualized device drivers
Virtualized filesystems
B1: virtualized filesystem within a file (stacked filesystems)
B2: paravirtualized filesystem
B3: sophisticated paravirtualized filesystem, inter-domain coordination
[Figure: three filesystem-level virtualization architectures (B1, B2, B3), each showing an application in a guest domain reaching the host through the hypervisor; components include device drivers, schedulers, buffer caches, a paravirtualized filesystem front-end/back-end, and a device driver front-end/back-end]
Agenda
1. Motivation
2. Goals and common problems
3. Common storage I/O virtualization solutions
4. VIDAS Design and Implementation
4.1. Abstractions
4.2. Design
4.3. Implementation
5. Evaluation
VIDAS
[Figure: guest domains (Guest 0 … Guest N-1) sharing objects (Object 0 … Object M-1) backed by storage]
Abstractions
[Figure: Objects and Containers; guests write to and update object regions identified by (offset, size)]
Containers
Containers are access control domains that allow objects to be shared among a set of guests.
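As a concrete illustration, a minimal sketch of the container lifecycle using the API shown later in the talk; the container name, the domain IDs, and the termination convention of the ID list are illustrative assumptions, not taken from the slides.

/* Hedged sketch: a guest creates a container restricted to two domains,
 * another allowed guest attaches to it, and both eventually release it.
 * Names and domain IDs are illustrative; how the domain_ids list is
 * terminated is not specified in the slides. */
int domain_ids[] = { 1, 2 };              /* guests allowed to attach */

/* Creating guest */
container_create("shared_ctr", domain_ids);

/* Another allowed guest */
container_attach("shared_ctr");
/* ... create or join objects within the container ... */
container_leave("shared_ctr");

/* Creating guest, once all guests have left */
container_destroy("shared_ctr");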
Objects: POSIX differences
‣ Data writes and updates are guided by a configurable policy: write-through or write-back
‣ Objects provide locality awareness
‣ Strong consistency is not enforced, but optional
‣ Objects are uniquely associated with an external storage resource through its name
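To make the relaxed consistency model concrete, a minimal sketch of explicit flushing and updating under a write-back policy, using the obj_flush/obj_update calls from the API below; the "write_policy" attribute name is a hypothetical illustration of the configurable policy and is not documented in the slides.

/* Hedged sketch: with write-back, data stays in the shared object until
 * explicitly flushed to external storage; a reader explicitly pulls the
 * latest data instead of relying on POSIX-style strong consistency.
 * The "write_policy" attribute name is hypothetical. */
obj_handle_t o = object_create("/data/input.dat", 0, 1 << 20, "shared_ctr");
object_setattr(o, "write_policy", "write-back", sizeof("write-back"));

char buf[4096];
obj_write(o, buf, 0, sizeof(buf));   /* data lands in the shared object     */
obj_flush(o);                        /* explicit push to external storage   */

/* In another guest, holding a handle obtained with object_join: */
obj_update(o);                       /* explicit pull of the latest data    */
obj_read(o, buf, 0, sizeof(buf));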
API Overview
Container operations
int container_create(char *name, int domain_ids[])
int container_destroy(char *name)
int container_attach(char *name)
int container_leave(char *name)

Object metadata operations
obj_handle_t object_create(char *ext_storage_rsc, size_t offset, size_t size, char *cname)
obj_handle_t object_join(char *ext_storage_rsc, size_t offset, size_t size, char *cname)
int object_get_locality(char *ext_storage_rsc, obj_handle_t *objects[])
int object_leave(obj_handle_t o)
int object_destroy(obj_handle_t o)
int object_getattr(obj_handle_t o, char *name, void *value, size_t size)
int object_setattr(obj_handle_t o, char *name, void *value, size_t size)

Object data operations
int obj_write(obj_handle_t o, char *buf, size_t offset, size_t sz)
int obj_read(obj_handle_t o, char *buf, size_t offset, size_t sz)
int obj_flush(obj_handle_t o)
int obj_update(obj_handle_t o)

Object synchronization operations
int object_wait(obj_handle_t o, char **bufp)
int object_notify(obj_handle_t o, char *buf)
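As a usage illustration of the metadata calls, a minimal hedged sketch: one guest creates an object virtualizing a region of an external resource, another joins it, and locality is queried. The file path, sizes, and container name are illustrative, and the meaning of the return value of object_get_locality is an assumption.

/* Hedged sketch: guest A exposes the first 64 MB of an external file as an
 * object in container "shared_ctr"; guest B joins the same region and asks
 * which objects of that resource are available locally. */

/* Guest A */
obj_handle_t a = object_create("/storage/dataset.bin", 0, 64 << 20, "shared_ctr");

/* Guest B (after container_attach("shared_ctr")) */
obj_handle_t b = object_join("/storage/dataset.bin", 0, 64 << 20, "shared_ctr");

obj_handle_t *local[16];
int n = object_get_locality("/storage/dataset.bin", local);
/* assumption: n is the number of node-local objects returned in local[] */

object_leave(b);    /* guest B releases its handle                    */
object_destroy(a);  /* guest A removes the object when no longer used */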
Agenda
1. Motivation
2. Goals and common problems
3. Common storage I/O virtualization solutions
4. VIDAS Design
5. VIDAS Implementation
5.1. Inter-domain communication mechanisms
5.2. Implementation
6. Using VIDAS
7. Evaluation
Inter-domain communication
Inter-domain data communication techniques:
[Figure: ring buffers (Domain A and Domain B connected through a ring buffer) vs. shared memory (Domains 1 … N mapping a common shared data region)]
Inter-domain communication
• A high-efficient inter-domain data transferring system for virtual machines. In Proceedings of the 3rd International Conference on Ubiquitous Information Management and Communication, New York, NY, USA: ACM, 2009, pp. 385-390.
• A study of inter-domain communication mechanisms on Xen-based hosting platforms. 2009.
• Diakhaté, F., Pérache, M., Namyst, R., & Jourdren, H. Efficient shared memory message passing for inter-VM communications. Euro-Par 2008 Workshops, 2009.
• Huang, W., Koop, M. J., Gao, Q., & Panda, D. K. Virtual machine aware communication libraries for high performance computing. In Proceedings of SC '07, 2007.
• Inter-domain socket communications supporting high performance and full binary compatibility on Xen. In Proceedings of the 4th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, New York, NY, USA: ACM, pp. 11-20.
• Huang, W., Koop, M. J., & Panda, D. K. Efficient one-copy MPI shared memory communication in virtual machines. In Proceedings of the 2008 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 2008, pp. 107-115.
All of these optimize inter-VM data sharing and communication, but none of them work around the limitation that a page can be shared with only two domains.
Xen/Linux grant reference extension
Hypervisors usually provide a mechanism to share a page with only two domains at once.
In our work we extended Linux/Xen to provide user-space applications with the ability to share a page with any number of domains.
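A minimal sketch of the idea behind the extension, assuming the stock Linux Xen grant-table kernel API (one grant reference per target domain); the helper name share_page_with_domains and the surrounding module are hypothetical, and the user-space interface VIDAS layers on top of this is not shown in the slides.

/* Hedged sketch (kernel side): grant one page to several guest domains by
 * issuing one grant reference per target domain. The helper is hypothetical;
 * gnttab_grant_foreign_access() is the standard Linux grant-table API. */
#include <linux/mm.h>
#include <xen/grant_table.h>
#include <asm/xen/page.h>

static int share_page_with_domains(struct page *page, domid_t *domids,
                                   int ndomains, grant_ref_t *refs)
{
    unsigned long frame = pfn_to_mfn(page_to_pfn(page));
    int i;

    for (i = 0; i < ndomains; i++) {
        int ref = gnttab_grant_foreign_access(domids[i], frame, 0 /* rw */);
        if (ref < 0)
            return ref;      /* out of grant entries */
        refs[i] = ref;       /* one grant reference per domain */
    }
    return 0;
}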
Paravirtualized data sharing
[Figure: in the point-to-point case, the application in each VM (VM A, VM B) goes through its I/O stack and PV device driver to a per-VM ring buffer and the host buffer cache; in the collective case, the applications in VM A and VM B access a shared data object on top of the host I/O stack and device drivers]
While most solutions rely mainly on ring buffers for paravirtualization, VIDAS combines ring buffers (point-to-point) with multi-domain shared memory (collective).
Implementation
[Figure: applications in guest domains issue VIDAS API calls to a front-end; the back-end in the host holds the container index, object index, object data, and object attributes; front-ends and back-end communicate over a ring buffer transport through the hypervisor]
Implementation
[Figure: same architecture; the application issues container_create and object_create]
A container creator specifies which domains are allowed to access the objects within the container
Implementation
Another domain can then access the objects in a container by attaching to the container and joining its objects.
[Figure: same architecture; the second application issues container_attach and object_join]
[Figure: same architecture; an application issues object_wait followed by object_read, accessing the shared object's data and attributes]
[Figure: same architecture; the other application issues object_getattr, object_setattr, object_write, and object_notify]
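Putting the walkthrough together, a minimal hedged sketch of the notify/wait handshake between two guests sharing one object. Function names follow the API slide (obj_write/obj_read); the "seq" attribute and the notification message are illustrative assumptions, and the VIDAS declarations (obj_handle_t and friends) are assumed to be in scope.

/* Hedged sketch of the notify/wait handshake shown in the walkthrough. */
#include <stddef.h>

void producer(obj_handle_t o, const char *payload, size_t sz)
{
    int seq = 1;                                   /* hypothetical version counter   */
    obj_write(o, (char *)payload, 0, sz);          /* place data in the shared object */
    object_setattr(o, "seq", &seq, sizeof(seq));   /* tag the data                    */
    object_notify(o, "block 0 ready");             /* wake up waiting guests          */
}

void consumer(obj_handle_t o, char *buf, size_t sz)
{
    char *note;
    int seq;
    object_wait(o, &note);                         /* block until notified            */
    object_getattr(o, "seq", &seq, sizeof(seq));   /* check which version arrived     */
    obj_read(o, buf, 0, sz);                       /* read the freshly written region */
}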
Agenda
1. Motivation
2. Goals and common problems
3. Common storage I/O virtualization solutions
4. VIDAS Design and Implementation
5. Evaluation
6. Conclusion
Evaluation
‣ CPU: Intel Xeon, 12 cores
‣ RAM: 64 GB DDR3 synchronous RAM @ 1333 MHz
‣ HDD: 1 TB Toshiba MK1002TS (Linux LVM, ext4)
‣ Hypervisor: Xen 4.2
‣ Privileged domain: Linux 3.5.7
‣ Guest domains: Linux 3.5.7
‣ Synthetic benchmarks: broadcast, independent read, collective read/write
Bandwidth: In-memory communication
0"
100"
200"
300"
400"
500"
600"
2" 4" 8" 16"
Band
width)(M
B/s))
Number)of)Virtual)Machines)
MPI-Bcast"
VIDAS-Bcast"
ringbuffer-Bcast"
MPI-Bcast: share 128 MB of data using MPI_Bcast
VIDAS-Bcast: share 128 MB of data using VIDAS
ringbuffer-Bcast: transfer 128 MB of data to each VM using Xen ring buffer transports
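For reference, a minimal sketch of what the MPI-Bcast baseline looks like, assuming one MPI rank per VM and a 128 MB payload; the exact benchmark code is not part of the slides.

/* Hedged sketch of the MPI-Bcast baseline: rank 0 broadcasts a 128 MB
 * buffer to all ranks (one rank per VM is assumed) and reports bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define PAYLOAD (128 * 1024 * 1024)   /* 128 MB */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(PAYLOAD);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    MPI_Bcast(buf, PAYLOAD, MPI_BYTE, 0, MPI_COMM_WORLD);  /* root = rank 0 */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n", 128.0 / (t1 - t0));

    MPI_Finalize();
    free(buf);
    return 0;
}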
Independent file read (512MB)
0"
20"
40"
60"
80"
100"
120"
140"
160"
1" 2" 4" 8" 16"
Throughp
ut)(M
B/s))
Number)of)Virtual)Machines)MPI+IO"(NFS)" VIDAS"I/O+Bcast" MPI+IO"(PVFS2)"
MPI-IO (NFS): concurrent file read on a node-local NFS mount
MPI-IO (PVFS2): concurrent file read on a node-local PVFS2 mount
VIDAS I/O+Bcast: load data into an object, then broadcast to all VMs
Collective I/O (Two-phase I/O)
0"
20"
40"
60"
80"
100"
120"
140"
160"
1" 2" 4" 8" 16" 1" 2" 4" 8" 16"
Throughp
ut)(M
B/s))
Number)of)Virtual)Machines)
Read))))))))))))))))))))))))))))))))))))))))))))))Write)
MPI+IO"collec1ve"(NFS)" VIDAS"collec1ve" MPI+IO"collec1ve"(PVFS2)"
Access pattern: non-overlapping, interleaved strided vectors of 2 MB blocks (see the sketch below)
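A minimal sketch of this access pattern with MPI collective (two-phase) I/O; the file name, block count per rank, and the use of a strided file view are illustrative assumptions about the benchmark, not taken from the slides.

/* Hedged sketch: each rank writes non-overlapping, interleaved 2 MB blocks
 * through a strided file view using collective (two-phase) I/O. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK   (2 * 1024 * 1024)   /* 2 MB blocks */
#define NBLOCKS 64                  /* blocks per rank (assumption) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Datatype filetype;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    buf = malloc((size_t)NBLOCKS * BLOCK);

    /* Strided view: one 2 MB block per rank in every stride of nprocs blocks */
    MPI_Type_vector(NBLOCKS, BLOCK, nprocs * BLOCK, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, NBLOCKS * BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}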
Agenda
1. Motivation
2. Goals and common problems
3. Common storage I/O virtualization solutions
4. VIDAS Design and Implementation
5. Evaluation
6. Conclusion
Conclusion and Contributions
Problems addressed:
‣ I/O overhead in virtualized environments is still significant
‣ Agreement that POSIX is not suitable for high performance
‣ Virtualized environments trade off protection for performance
‣ Lack of I/O coordination mechanisms across domains
Contributions:
‣ Proposed new multi-domain data sharing mechanisms that coordinate storage I/O across domains
‣ Created shared access spaces, reducing memory copy operations and domain context switches
‣ Relaxed POSIX consistency, with control over data write/update policies and data locality