WHITE PAPER | AUGUST 2015

Replacing the Black-Box: Commercial Off the Shelf (COTS) Virtualization

High Capacity Security, Media and Analytics Solutions Using High Performance Virtual Machines (HPVMs) on Industry Standard Servers
Overview Black-box or "appliance" products are prolific in industry, and for good reason: they perform. Black-boxes provide tightly coupled functionality, processing and throughput needed for demanding applications in small, efficient 1U and 2U form-factors. The 1U "pizza box" form-factor is especially popular for low SWaP (size, weight, and power) applications and installation is simple and efficient.
At the same time, the disadvantages of the black box approach are also well known:
• Proprietary, closed architectures make it difficult or impossible to add custom features
• Expensive – oftentimes very expensive
• Upgrades may be slow in coming
• End-users are restricted and vendor dependent
• Warranties and RMA policies not on par with major server vendors
This white paper discusses using Gazoo's High Performance Virtual Machines (HPVMs) running on industry standard server architectures. With HPVMs, vendors can develop much more powerful applications without the need for fixed-function "black box" appliances.
High Performance Virtual Machines provide unprecedented scales of efficiency. By utilizing hundreds of compute-intensive multicores in a single server, Gazoo has changed the game. The result is an order of magnitude increase in the numbers of virtual machines per server, machine compute power and teraflops-per-watt realized across the enterprise.
Contents

Introduction pg. 2
Hardware Requirements pg. 2
Architecture & Data Flow
• Accelerator Only pg. 3
• 3rd Party & Encryption pg. 4
• Security Accelerator pg. 5
• Accelerator + HT pg. 6
Programming Interface pg. 7
CPU vs GPU Overview pg. 8
ISS Example – HP DL380 pg. 9
ISS Capacity Table pg. 10
ISS Example – XstreamCore pg. 11
Summary pg. 12
Replacing The Black-Box | Gazoo White Paper 2
Introduction

For a given application, what if the black box could be replaced by a server? Typically a replacement is possible if a server is custom engineered, and there are many "high performance server" vendors who are more than willing to provide customized hardware.
But what if the black box can be replaced entirely by an industry standard server (ISS) from HP, Dell, Supermicro or others? What if you could go online, order a 1U or 2U server, a few off-the-shelf cards, add state-of-the-art software and achieve "black box" performance? All without violating any ISS vendor reliability or MTBF warranties?
Thanks to the evolution of high performance computing (HPC) and the proliferation of very stable, off-the-shelf, multicore CPU hardware, Gazoo has made this possible. This white paper discusses how to leverage current hardware and server infrastructure to deliver HPC supporting virtually any compute intensive application.
Hardware/Software Requirements

The following are typical hardware specifications for HPC in an industry standard server:

• Minimum of four (4) x86 cores. A typical motherboard might be an 8-core Sandy Bridge
• One or more CIM® (Compute Intensive Multicore) accelerators featuring 32 or 64 C66x cores per x8 PCIe card
• Optionally (depending on throughput requirements), one or more high speed network adapters, for example the Mellanox ConnectX-3 card (x8 PCIe)
• Intel® DPDK (Data Plane Development Kit) software (open source)
• Gazoo® advanced software supporting API and OpenMP programming interfaces
Server Architecture and Data Flow: Accelerator Only

Below is a server software architecture block diagram showing a dual Sandy Bridge box (16 total physical cores) and from one to six (6) CIM® accelerator cards (32 to 384 total C66x cores per server).
Figure 1 - Server Architecture and Data Flow, Accelerator Only. The diagram shows an industry standard server with dual 8-core x86 CPUs. Control plane software (signaling, session monitoring and statistics, resource allocation) runs on Linux x86 cores; data plane software, dedicated to high performance and outside of virtualization scope, runs on DPDK x86 cores and may include encryption, routing, and load balancing. Up to six C66x accelerator cards (3 per PCIe riser, 50 W per card), each C66x CPU with 8 cores and 2 GB DDR3 memory running a voice/video framework, connect through PCIe switches/bridges to the root complex. 1 GbE Ethernet ports carry the control plane and data plane paths; CIM®, DirectCore® and SigMRF software span host and accelerator.
In the above diagram, note that network I/O on the accelerator is not used. This is appropriate for applications where data plane throughput and latency, while important, are not the crucial factors; for example media applications such as VoIP, video content delivery, and video analytics. In these applications, network I/O is handled by x86 data plane cores, which sit outside the Linux "Petri dish", i.e. outside the Linux (or other general purpose OS) environment and its virtual machines (VMs). Control plane flow stays within the Linux environment.
In applications requiring high network throughput and low latency, the network I/O on the accelerator can be used. In these applications signal processing code is offloaded to the accelerator and frees up x86 data plane cores, while control plane data stays within the customary Linux environment. With a higher percentage of application code running on CIM® cores, the issue of programming interface – API or OpenMP – must be carefully considered. This is discussed further below.
Server Architecture - Accelerator + High Throughput

Below is a server software architecture block diagram showing a dual Sandy Bridge box (16 total physical cores), from one to five (5) CIM® accelerator cards (32 to 320 total C66x cores per server), and a 40 GbE network adapter card.
Figure 2 - Server Architecture, Accelerator + High Throughput Network Adapter
Custom and Third Party Data/Control Plane Application Environment
Below is a software architecture block diagram showing a modular adaptation of HPC using custom and/or 3rd party applications and analytics. In this approach, integrated applications use the 2 GB of P-space and D-space provided natively by each x86 data plane core's environment, scheduled by simple configurations based on application requirements, scaling needs and run-time priority. One or multiple applications can run on a single core as required. One use case example is high-speed encryption: identity management can be enforced and secured within each core, along with inline encrypt/decrypt capabilities using traditional PKI encryption methodologies, on-board algorithms, or other plug-in software designed for advanced encryption. Resource loading, load balancing, and network analysis are additional third party application examples.
Figure 3 - Server Architecture, Accelerator + Custom and Third Party Data/Control Plane Application Environment
Security Accelerator

The security accelerator provides wire-speed processing of 1-Gbps Ethernet traffic for the IPsec, SRTP, and 3GPP air interface security protocols. It operates at the packet level, with each packet's associated security context being one of these three types. The security accelerator is coupled with the network coprocessor: it receives the packet descriptor containing the security context in the buffer descriptor, and the data to be encrypted/decrypted in the linked buffer descriptor.

Features provided by the security accelerator:

• Encryption and Decryption Engine
  – 3DES CBC cipher
  – AES CTR cipher
  – AES CBC cipher
  – AES F8 cipher
  – AES XCBC authentication
  – CCM cipher
  – DES CBC cipher
  – GCM cipher
• Authentication Engine, providing hardware modules to support keyed (HMAC) and non-keyed hash calculations
  – CMAC authentication
  – GMAC authentication
  – HMAC MD5 authentication
  – HMAC SHA1 authentication
  – HMAC SHA2 224 authentication
  – HMAC SHA2 256 authentication
• Air Cipher Engine
  – AES CTR cipher
  – AES CMAC authentication
  – Kasumi F8 cipher
  – Snow3G F8 cipher
  – Kasumi F9 authentication
• Programmable Header Parsing module
  – PDSP-based header processing engine for packet parsing, algorithm control and decode
  – Carries out protocol-related packet header and trailer processing
• Null cipher and null authentication support for debugging
• True random number generator
  – True (not pseudo) random number generation
  – FIPS 140-1 compliant (KeyStone II)
  – Non-deterministic noise source for generating keys, IVs, etc.
• Public key accelerator
  – High performance public key engine for large vector math operations
  – Supports modulus sizes up to 4096 bits
  – Extremely useful for public key computations
• Context cache module to automatically fetch security contexts

Protocol stack features provided:

• IPsec protocol stack
  – Transport mode for both AH and ESP processing
  – Tunnel mode for both AH and ESP processing
  – Full header parsing and padding checks
  – Constructs initialization vector from header
  – Anti-replay support
  – True 64K byte packet processing
• SRTP protocol stack
  – F8 mode of processing
  – Replay protection
  – True 64K byte packet processing
• 3GPP protocol stack, wireless air cipher standards
  – AES counter
  – ECSD A5/3 key generation
  – GEA3 (GPRS) key generation
  – GSM A5/3 key generation
  – Kasumi F8
  – Snow3G
Server Data Flow - Accelerator + High Throughput

Below is a server data flow diagram showing a dual Sandy Bridge box (16 total physical cores), from one to five (5) CIM® accelerator cards (32 to 320 total C66x cores per server), and a 40 GbE network adapter card.
Figure 4 - Server Data Flow, Accelerator + High Throughput Network Adapter
Note again the optional data plane paths, through x86 data plane cores or through accelerator I/O; in this case, these paths can be used to augment data plane throughput. Control flow also has some flexibility: for example, if CIM® cores are operating as Hadoop worker nodes, it may be desirable to route some control flow through the high speed network adapter, such as a network file system (NFS) using AoE (ATA over Ethernet).
Programming Interface

CIM® accelerators offer two types of programming interface:
• API
• OpenMP
The type of interface used depends on the nature of application code. For applications containing relatively short, identifiable sections of compute-intensive code, an OpenMP interface may be suitable.
For applications where (i) entire, complex processes must be offloaded to the accelerator, or (ii) where network I/O must be “right at the network edge”, and not subject to virtualization or other system performance constraints, an API interface, or some combination of API + OpenMP, may be more suitable.
OpenMP offers an easy-to-use programming interface, and is especially effective when used with multicore accelerators, where source code must be “partitioned” between many heterogeneous CPU cores within the same system. Below is a source code example showing MPEG2 to H.264 transcoding, inside OpenMP pragmas.
Figure 5 – Video Transcoding Source Code Example Using OpenMP
CPU vs GPU Overview
CPU and GPU devices are constructed from fundamentally different chip architectures. Both are very good at certain things, and both are not so good at some things -- and these strengths and weaknesses are mostly opposites, or complementary. In general:
• CPUs tend to be good at complex algorithms that require random memory access, non-uniform treatment of data sets, unpredictable decision paths, and interfacing with peripherals (network I/O, PCIe, USB, etc.)

• GPUs tend to be good at well-defined algorithms that operate uniformly on large data sets, accurate and very fast math, and graphics applications of all types
Neither CPUs nor GPUs provide a panacea for complex computing, as neither is fundamentally superior to the other (in direct contrast to prevailing marketing hype). People tend to forget that the top global semiconductor manufacturers are all on the same technology curve -- which of course makes sense, as they all use state-of-the-art semiconductor manufacturing technology. If you look closely at the two key factors that form the basis for Moore's Law, performance and power consumption (the chip metric is GFlops/Watt), there is very little difference between Intel, Nvidia, Texas Instruments, Xilinx, etc. What you will find are differences in corporate practice and marketing culture, ingrained over very long periods of time -- 30 years or more -- that make one manufacturer or another more adept at serving certain market segments, with advantages (or disadvantages!) in package size, memory bandwidth, on-chip integrated peripherals, programmability, etc.
In comparing the CPU and GPU accelerator diagrams, some obvious differences and similarities stand out:

• GPU cores are sometimes called "CUDA cores", in reference to Nvidia's programming model. It's not easy to compare CPU and GPU cores (apples and oranges). Maybe the easiest way to think about it is (i) for any given math calculation, a GPU core can almost always do it much faster, and (ii) GPU cores do not run arbitrary C/C++, Java, or Python code, so they're not programmable in the conventional sense

• A GPU accelerator can bring far more processing force to bear on massively parallel problems. Examples include graphics, bitcoin mining, climate simulations, DNA sequencing -- any problem where the data set can be subdivided into "regions", such that the results of one region do not depend on others

• A CPU accelerator typically has its own NIC (or more than one), which can provide advantages in reduced latency and "data localization" -- bringing the compute cores closer to the data. Onboard NICs are typically not found on GPU accelerators, as GPU cores are not designed to run device drivers, a TCP/IP stack, etc.

• Both types of accelerators take full advantage of the high performance PCIe interfaces found in modern servers, including multiple PCIe slots and risers, accessibility to DPDK cores, and excellent software support in Linux
c66x multicore CPU accelerator diagram (shown for a CIM-64 card, with 8 cores per CPU and 2 GB mem per CPU). All CPU cores have NIC access.

GPU accelerator diagram (shown for a Kepler K80 card, with 13 Streaming Multiprocessors (SMs) per GPU, 192 CUDA cores per SM, and 12 GB mem per GPU). A GPU can have literally 1000s of "CUDA cores".
ISS Example – HP DL380

HP ProLiant series DL380 servers are economical, workhorse machines. In a 2U configuration with dual Sandy Bridge CPUs (16 cores), these servers provide a cost-effective mix of performance, efficiency, and relatively small size. However, they present limitations for the use of PCIe cards:

• Riser board power consumption is limited to 150 W

• Riser cards are designed for single-slot PCIe cards (i.e. the height of the card is about 0.65", or the width of a standard PCIe slot)

These limitations tend to rule out GPU boards, which, because of their 2-slot width and 300 to 400 W power consumption, would void the MTBF warranty of the server. Gazoo CIM® accelerators, however, operate well within these constraints.
Using voice and video (media) algorithms as a metric, below are some example performance figures for an HP DL380p Gen8 server, configured as follows:

• Dual Sandy Bridge (16 cores, 2.2 GHz clock rate)

• Dual riser cards (3 slots each)

• Three (3) CIM® accelerator cards (x8 PCIe, 32 total CIM® cores @ 1.25 GHz per core, 2 GByte DDR3 mem per CPU)

• Two (2) CIM® accelerator cards (x8 PCIe, 64 total CIM® cores @ 1.25 GHz per core, 2 GByte DDR3 mem per CPU)

• 32 GByte motherboard memory

256 total physical compute cores
Figure 6 - HP DL380 Server with low SWaP CIM® Accelerator Cards
ISS Capacity Table

The figures below were measured on an HP DL380, configured as specified above. Average throughput across the PCIe bus was about 50 Mbps per CIM® core, which is typical for voice applications.

CIM™ 64C Accelerator Voice / Video Capacity

CPU / Accelerator Type   | x86 ¹ | CIM ²
Number of cores          | 16    | 64
Clock rate (GHz)         | 3.00  | 1.25
Framework overhead (%)   | –     | 30%

Video:

Codec            | Bitrate  | Fps | Encode, Decode, or Both | Encode cores ³ | Decode cores ³ | Capacity
H.264 720p BP    | 1 Mbps   | 15  | E | 2 | 1 | 36
VP8 720p         | 1 Mbps   | 15  | E | 4 | 1 | 18
MPEG2 720p BP    | 4 Mbps   | 15  | E | 2 | – | 36
H.264 1080p BP   | 1 Mbps   | 15  | E | 2 | 1 | 36
VP8 1080p        | 1 Mbps   | 15  | E | 4 | 1 | 18
MPEG2 1080p BP   | 4 Mbps   | 15  | E | 2 | – | 36
H.264 CIF MP     | 500 kbps | 15  | E | 2 | 1 | 36
H.264 QCIF MP    | 250 kbps | 15  | E | 1 | 1 | 72

Speech:

Codec                | Encode, Decode, or Both | Capacity
G.711                | B | 94916
AMR-NB               | B | 8576
AMR-WB               | B | 3620
EVRC                 | B | 5204
G.722                | B | 8777
G.722.1 (16 kHz Fs)  | B | 10566
G.723.1A             | B | 17554
G.729AB              | B | 11042
GSM FR               | B | 47458
GSM HR               | B | 7800
iLBC                 | B | 6460

¹ Intel Sandy Bridge   ² Texas Instruments C66x   ³ Dedicated cores required due to optimized algorithm

Figure 7 – HP DL380p Gen8 Server Performance Figures for High Capacity Media Application
Gazoo – XstreamCore

XstreamCore is different from competing solutions:

• Not constrained to a proprietary architecture
• Uses standard PCIe cards for microserver and media processing resources, which can run on any server
• Offers both PCI Express and Ethernet communications infrastructure
• Unmatched density per RU for processing and media applications – 15 slots in 3U

• Container for cold-swappable compute and I/O modules
• Reliability with redundant/hot-swappable power and cooling
• Designed for NEBS certification
• 15 PCIe slots to create application-specific appliances

Configuration options include:

• Intel Xeon-D MicroServer cards
• Gazoo accelerator cards
• 100GE I/O with load balancer
• All standard 3rd party PCIe cards

Innovate Express Fabric:

• Virtual functions (VFs) are assigned to virtual machines (VMs) on several CPUs
• Ability to share I/O between VMs
• 25-100 Gbps per card via PCIe Gen3

Direct I/O:

• Shared MAC with 2x 10GBaseT at chassis front
• Optional Ethernet switch with 1GE ports to server cards and two 10GBaseT at chassis front

736 total physical compute cores

Figure 8 - XstreamCore Server with low SWaP CIM® Accelerator Cards

XstreamCore PCIe chassis configurations are available in densities up to 15,000 physical cores in a 21U rack.
Summary

Gazoo provides hardware and software that adds hundreds of compute cores to industry standard servers. This functionality covers a broad range of leading business applications, including:

• Artificial Intelligence
• Image Analytics
• Data Analytics
• Virtual Desktop Infrastructure (VDI)
• Media Transcoding
Today, the HPC industry is transforming from traditional hardware-based deployment models to software-based and, ultimately, cloud-based models. By deploying virtualized solutions, operators have the opportunity to gain valuable experience with the technology and processes related to virtualization, while reaping the benefits of improved service agility, deployment flexibility and reduced CAPEX. Gazoo's virtualized portfolio of solutions leverages NFV and SDN technologies for deployments on cloud-based infrastructure.
Our patented High Performance Virtual Machines (HPVMs) allow more complex applications to be virtualized and moved to standard server architectures, and their compute intensive functions accelerated.
Inspiration for this white paper is derived from actual customer situations where this approach has been utilized to replace one or more costly black-boxes from a customer rack. Specific case study information is available under NDA.
At Gazoo we invent, perfect, patent, and license software solutions that accelerate High Performance Computing (HPC).
Gazoo, partnered with Signalogic®, has worked closely with TI for 25+ years and is an authorized TI Design Network Member. TI's global network provides turnkey products and services, system modules, embedded software and development tools that help customers accelerate development and reduce time-to-market.

Gazoo, partnered with Signalogic®, is a General Member of the Intel® Internet of Things Solutions Alliance. Intel and 250+ global member companies of the Alliance provide the hardware, software, firmware, tools and systems integration that developers need to take a leading role in the rise of the Internet of Things. Learn more at: iotsolutionsalliance.intel.com/member-roster/signalogic-inc

Gazoo, partnered with Signalogic®, is an HP AllianceOne Partner. AllianceOne is a worldwide program composed of hundreds of ISVs, IHVs, consultants, SIs, service providers and OEMs who develop market-leading solutions running on key HP technologies and platforms.
3891 S. Traditions Dr. College Station, TX 77845 USA
979-220-7753
www.gazoohpc.com | [email protected]
© 2015 Gazoo, Inc.
High Performance Computing